A ledger looks like a database table you insert rows into. That illusion survives exactly until the first retry storm. The interesting version of this problem is a ledger that must never lose a cent — not under a thundering herd of retries, not when one account goes viral and absorbs 200,000 writes per second, not when a process dies between claiming an idempotency key and committing the journal entry it was supposed to guard. We are designing the idempotent journal, the write-sharded hot account, the CQRS balance cache, and the WORM audit trail that enforces the double-entry invariant — debits and credits sum to zero, always — while one bestselling tweet detonates a write storm against a single seller account.
The problem, the scale, and the one invariant that cannot bend
Before any boxes, tell me the one property this system has that a normal CRUD app does not.
The double-entry invariant: for every transaction, the sum of all its journal entries is exactly zero. Every cent debited from one account is credited to another. If that sum is ever non-zero, money was created or destroyed — and unlike a dropped notification, you cannot paper over it after the fact. Stripe's ledger enforces exactly this: immutable journal entries that net to zero per transaction, and they verify 99.99% of dollar volume within 4 days. The invariant is the product.
The second property: idempotency is not optional. Coinbase says it plainly — idempotency is the single most critical principle, every transaction is processed exactly-once or not at all. A double-charge isn't a bug, it's a chargeback, a regulatory incident, and a lost customer.
Give me the numbers we're designing for.
A bestselling-tweet event drives a peak of $200{,}000$ transactions/second. Each transaction is double-entry, so it produces at least two journal entries:
$$ R_{entries} = 2 \times 2 \times 10^{5} = 4 \times 10^{5}\ \text{journal entries/second at peak.}$$
Average is far lower — call it $50{,}000$ TPS sustained. Storage: at $\sim$200 bytes/entry and $4\times10^{5}$ entries/s peak, a one-hour viral spike alone is
$$ 4 \times 10^{5} \times 200\ \text{B} \times 3600\ \text{s} \approx 288\ \text{GB/hour at peak.}$$
At the $50$k sustained average that is $\sim$$1.7\times10^{10}$ entries/day across the platform. Stripe processes $5 \times 10^{9}$ events daily ($\sim$57,870/s) — we are designing for a spikier, burstier version of the same shape. The journal grows without bound and is retained for 7 years. That growth profile drives half the design.
The naive ledger, and the three ways it loses money
Start with the thing a junior engineer would build. One table.
The naive design is a single accounts table with a balance column, and a charge does UPDATE accounts SET balance = balance - 100 WHERE id = 'A'. It's tempting because it's one row and the balance read is free. It loses money three different ways.
One: no audit trail. A mutable balance has no history — you cannot answer "why is this balance what it is?" which is the entire job of a ledger. Two: no idempotency — a retried request runs the UPDATE twice and double-debits. Three: it makes balance the source of truth, so a single corrupt write is unrecoverable; there's nothing to re-derive it from.
Immutable append-only sounds expensive to query. Hold that thought — first show me the idempotency, because that's where most ledgers actually bleed.
Idempotency — and the race that beats the naive check
Every command carries an idempotency key. The naive implementation: "look up the key; if it exists, return the prior result; if not, process and store it." That read-then-write check has a fatal race.
A client times out and retries; now two identical requests are in flight. Both execute the check at nearly the same instant:
$$ T_1: \text{read key} \rightarrow \text{absent} \qquad T_2: \text{read key} \rightarrow \text{absent}$$
Neither has written yet, so both see "absent", both proceed, and the account is debited twice. At $2\times10^{5}$ TPS with even a $0.5\%$ retry rate, that's $1{,}000$ duplicate-eligible requests/second — a steady drip of double-charges. The window is the gap between read and write, and at this volume that gap is always being hit by someone.
The fix is to make the check-and-claim a single atomic operation. The very first thing a command does is a conditional insert: PutItem with attribute_not_exists(idempotency_key) on DynamoDB. This is compare-and-set at the storage layer — exactly one of the two retries wins the partition, the other gets ConditionalCheckFailedException and returns the stored result. The race is closed because there is no gap; the read and the claim are one operation.
The key itself is tenant_id#sha256(client_id + request_id + amount + currency + timestamp_bucket). Including amount and currency means a client reusing a request ID for a different charge isn't silently suppressed; the timestamp bucket bounds the dedup window so the table doesn't have to remember forever. The tenant_id# prefix is deliberate: an opaque hash has no structure an IAM policy can clamp, but a composite key that starts with the tenant lets dynamodb:LeadingKeys fence each tenant's role to its own partition space — we'll lean on that hard in the security beat.
You said it's a cache-like store. It is not a cache. What happens when DynamoDB is unavailable?
We fail closed. The idempotency gate is correctness-critical, not best-effort. If we cannot claim the key, we cannot safely process the command, so we reject with a retryable error and let the client back off. A ledger that processes when it cannot dedup is a ledger that double-charges — "open on failure" is the wrong default for money. DynamoDB's availability SLA and multi-AZ design make this rare, but the behaviour on the rare path is non-negotiable: no claim, no charge.
The journal schema and the deferred SUM=0 constraint
Show me the table that holds the truth. And tell me where the floats are, because I know there's a float bug coming.
No floats — that's the bug, and we don't get to have it. Amounts are integer cents in a BIGINT. Floating point cannot represent $0.10$ exactly; sum a million of them and you accumulate error that violates the invariant. Integer cents sum exactly.
journal_entries (
journal_entry_id UUID PRIMARY KEY,
transaction_id UUID NOT NULL,
account_id TEXT NOT NULL,
amount_cents BIGINT NOT NULL, -- signed: debit negative, credit positive
direction ledger_dir NOT NULL, -- ENUM(debit, credit)
currency CHAR(3) NOT NULL,
posted_at TIMESTAMPTZ NOT NULL DEFAULT now()
)
Now the invariant. For each transaction_id, $\sum amount\_cents = 0$. The trap: a row-level CHECK can't express it, and the debit and credit legs are inserted as separate rows — after the first insert the transaction sums to a non-zero value, which is a valid intermediate state.
For a transaction $t$ with entries $e_1 \dots e_k$:
$$ \sum_{i=1}^{k} a_i = 0 \quad \text{where } a_i \text{ is the signed } amount\_cents \text{ of entry } e_i.$$
A simple charge has $k=2$: $a_1 = -100$ (debit the buyer), $a_2 = +100$ (credit the seller). A transfer with a fee has $k=3$ and still nets to zero. Integer cents make the equality exact — there is no $\epsilon$.
We enforce it with a DEFERRABLE INITIALLY DEFERRED constraint trigger in Aurora PostgreSQL. The check runs at COMMIT, not per row — so the transaction writes both legs, and only when it tries to commit does Postgres verify $\sum = 0$ for every affected transaction_id. A half-written transaction that never commits never violates anything. A transaction that commits with a non-zero sum is rejected by the database itself, not by trusting application code.
The hot window — keeping 7 years of journal off Aurora's expensive storage
You said the journal grows without bound and is kept 7 years. Aurora storage is not cheap. Multiply it out before I do.
It's brutal if you keep everything in Aurora. Storage alone, left unbounded, dwarfs the compute bill by year 7. The fix is that Aurora holds a hot window, not the archive — the archive already lives in S3 as the WORM Parquet export we build for audit anyway. So we partition the journal by month and drop cold partitions out of Aurora once they're safely in S3.
At $1.7\times10^{10}$ entries/day $\times$ 200 B, the journal accretes $\sim$3.4 TB/day. Keep all 7 years in Aurora:
$$ 3.4\ \text{TB/day} \times 365 \times 7 \approx 8.69\ \text{PB} \Rightarrow \sim\$890{,}000/\text{month in storage by year 7.}$$
Keep only a 90-day hot window and the steady state is flat:
$$ 3.4\ \text{TB/day} \times 90 \approx 306\ \text{TB} \Rightarrow \sim\$68{,}850/\text{month at the I/O-Optimized \$0.225/GB rate.}$$
And the I/O side: at $\sim$$10^{5}$ entries/s sustained, per-request I/O billing under Aurora Standard mode runs $\sim$$90{,}720/\text{month}$. I/O-Optimized folds I/O into a flat storage rate and erases that line — it pays for itself the moment write I/O exceeds ~25% of the bill, which at this rate it does from day one.
The hot partition — when one account eats the write storm
The bestselling tweet. One seller account is now receiving the entire $200{,}000$ TPS as credits. What happens to your beautiful Aurora row?
It melts. Every one of those credits touches the same seller account, and if there's any per-account contention — a balance row, a hot index page, a single partition key — it serialises. Even setting Aurora aside, the same shape kills DynamoDB: a single partition key caps around $1{,}000$ writes/second.
The viral seller absorbs $200{,}000$ credit writes/second to one account_id. Against a per-partition ceiling of $\sim$1,000 writes/s:
$$ N_{shards} = \frac{2 \times 10^{5}}{10^{3}} = 200\ \text{shards needed to stay under the ceiling.}$$
So we provision $N = 256$ virtual shards for a flagged hot account — $256 > 200$ gives $\sim$20% headroom on the napkin and is a clean power of two for the hash. There is no "overflow tier": $N=256$ already covers the worst case the math allows, so we don't hand-wave a second mechanism that would itself need a capacity argument. The write storm is divided across 256 partition keys instead of contending on one.
Write-sharding. A hot account is split into virtual sub-accounts: account_id#000 through account_id#255. Each incoming credit round-robins (or hashes the journal_entry_id) across the 256 shards, so 256 partition keys share the load. The account's true balance is the sum across all 256 shards — and that fan-in never hits the hot path because the CQRS projector materialises the summed balance, so the read side is unchanged regardless of $N$. Crucially we don't shard every account — 99.9% of accounts are cold and a single row is fine. A control table flags which accounts are hot and stores $N$ per account. We pre-shard known-hot accounts — major platform accounts, marketplace float accounts, sellers with a history of going viral — at creation or on early detection, not reactively mid-storm when re-sharding is a migration.
Reading a balance — CQRS, because summing history is O(entries)
A customer opens the app and wants their balance. You have an immutable, append-only, sharded journal. How fast is that read, naively?
Naively, unusably slow. Balance = SELECT SUM(amount_cents) FROM journal_entries WHERE account_id = ?. That's O(entries), and a high-velocity account has millions of rows.
A merchant with 6 months of activity at a modest 50 transactions/second has accumulated
$$ 50 \times 2\ \text{entries} \times 86{,}400 \times 180 \approx 1.5 \times 10^{9}\ \text{journal entries.}$$
Summing 1.5 billion rows on every balance check, for thousands of concurrent users, is a non-starter — even at 1 GB/s scan that's seconds per read. Balance reads are the most frequent operation in the system; they must be O(1).
This is textbook CQRS. The write side is the immutable journal. The read side is a materialised balance projection kept in ElastiCache (Valkey): one balance per account, already summed across all 32 shards. A balance read is a single cache lookup. The projection is maintained incrementally: a CQRS projector Lambda consumes a Kinesis stream of journal entries — fed from Aurora by DMS change-data-capture — and applies each entry as a delta to the cached balance. On a cache miss we fall back to a bounded Aurora aggregate: the latest balance snapshot plus only the entries since that snapshot, never the full history.
Why Kinesis to drive the projector and not just Lambda triggered straight off Aurora?
Because I want a durable, replayable, ordered log of journal entries between the write side and every read-side consumer — the projector is only the first of them; analytics snapshots and the audit export are others. Kinesis gives me retention (replay the last 24 hours, extendable to 365 days, to rebuild a corrupted projection), ordering per partition key, and a native Lambda event-source mapping with batching and checkpointing. Critically we run it in On-Demand mode: at $4\times10^{5}$ journal entries/s peak, a provisioned stream would need $\sim$400 shards (10,000 records/s each), and pre-provisioning 400 shards for a spike that's mostly idle is both wasteful and a throttle waiting to happen if the viral event overshoots the estimate. On-Demand auto-scales shard capacity with the load, so the stream is never the bottleneck for DMS CDC. The partition key is the logical account_id (not the account_id#NNN shard) — that way every delta for one logical account lands on one Kinesis partition and applies in LSN order, which the projector needs to fold shard writes into a single balance correctly. DMS captures the change stream off Aurora's logical replication and lands it on Kinesis. The alternative — firing Lambda directly per row — gives me no replay buffer and no ordering guarantee, which I need to rebuild the cache after an incident.
Cross-account transfers — the saga and the stranded cent
A transfer: debit account A, credit account B. The process dies after the debit. Where's the money?
Stranded — destroyed from A, never created in B — and the SUM=0 invariant is violated for that transaction. The naive two-step write has no atomicity across the legs. The good news: when both accounts live in the same Aurora cluster, the transfer is a single database transaction with both legs, and the deferred SUM=0 constraint gives me atomicity for free. Either both legs commit or neither does. Most transfers are this case.
And when they're not the same case — A and B in different shards, services, or regions?
Then it's a saga orchestrated by Step Functions. A saga-state record tracks progress. If the credit to B fails after the debit to A committed, the saga doesn't try a destructive rollback of an immutable journal — it can't, entries are append-only. Instead it posts a compensating transaction: a new credit back to A that reverses the debit. The audit trail then shows the full path — debit, failed credit, compensating reversal — which is exactly what an auditor wants: the ledger records attempts, not just successes. And the whole saga is keyed on the same idempotency key, so a retried orchestration never double-applies a leg.
Reconciliation — the sweeper, the dead-man's switch, and the drift sweep
You claimed the idempotency claim and the journal commit are two phases across two systems. Two phases means a window. A process can claim the key, then die before the journal commits. Now there's a claimed key with no entry — a charge the client thinks failed but can never retry, because the key is taken. How long is that window, and who closes it?
I'm not going to pretend DynamoDB and Aurora commit as one unit — they're separate transactional domains and there is no 2PC between them. So yes, there's a window, and the honest design is to bound and reconcile it, not to wish it away. Phase one: the conditional PutItem claims the key with status=PENDING. Phase two: the Aurora journal transaction commits. Phase three: we flip the DynamoDB item to status=COMPLETE keyed on transaction_id. If the process dies after phase one and before phase two, the claim is stranded PENDING.
A sweeper Lambda closes it. EventBridge schedules it every 5 minutes; it scans for PENDING items older than a deadline, looks up the matching transaction_id in Aurora, and resolves: if the journal entry exists, complete the claim; if it doesn't, the charge genuinely failed, so release the claim so the client can retry. The window a stranded claim can block a retry is therefore $\sim$5 minutes, not the 30-day TTL.
Detection is the nightly reconciliation sweep: Step Functions orchestrating Athena re-sums every account from the journal of record — the immutable Parquet export — and compares each total to the materialised balance in ElastiCache. Any account where journal-sum $\neq$ cache-balance is a discrepancy, written to S3, and drift exceeding $0.01 pages on-call. This is the independent check that catches a projector bug, a dropped Kinesis record, or a stranded claim before it compounds.
Suppose $5\times10^{8}$ active accounts and $1.7\times10^{10}$ entries/day. The reconcile sums from the Parquet audit export partitioned by date, plus the running snapshot — a daily incremental reconcile processes only the day's entries against yesterday's snapshot:
$$ 1.7 \times 10^{10}\ \text{entries} \times 200\ \text{B} \approx 3.4\ \text{TB/day to scan.}$$
That's a single Athena GROUP BY account_id over 3.4 TB of Parquet — $\sim$$17/run$ at $5/TB scanned, no cluster to manage, comfortably inside a nightly window. A full re-sum from genesis runs weekly or on-demand after an incident, reading the entire Object-Lock archive. Incremental nightly, full periodic: drift is bounded to one day.
The audit trail — WORM, because "trust me" is not a control
Your compliance team needs to prove to an auditor that the journal was never tampered with. A DBA with write access can alter Aurora. How do you make the record immutable?
You move the durable, queryable record out of any system a human can mutate. DMS streams journal entries to S3 as Parquet, partitioned by date, under S3 Object Lock in compliance mode — write-once-read-many, immutable for the 7-year retention term, and not deletable even by the root account until retention expires. That's the control: not "the DBA promises not to edit it" but "the storage layer physically forbids it." The compliance team queries it with Athena — serverless SQL straight over the Parquet in S3, no cluster to run.
Security, multi-tenancy, and the compliance map
This is multi-tenant. Tenant X must never read tenant Y's ledger, not through a bug, not through a malicious query. And map me to the frameworks an auditor will ask about.
Tenant isolation is enforced at multiple layers, defence in depth. The DynamoDB idempotency table's partition key begins with tenant_id#, so an IAM policy with dynamodb:LeadingKeys: ['${aws:PrincipalTag/tenant}#*'] restricts each request to its own partition space — a query for another tenant's keys is denied by IAM before it executes. That only works if the principal carries a trustworthy tenant tag: the Fargate service injects it by calling STS AssumeRole with a session tag per request, so the credential the DynamoDB call runs under is fenced to exactly one tenant — the application can't widen its own blast radius. In Aurora, row-level security policies scope every query to the session's tenant. ElastiCache balances are keyed with the tenant prefix so a cache lookup cannot cross tenants. The principle: isolation is a property of the access path, never of application discipline alone.
GDPR erasure versus 7-year financial retention is a real collision. You waved at it — nail it down.
The resolution is data separation by legal basis. The journal holds money movements between opaque account IDs and integer cents — financial records, retained 7 years under a legal obligation that overrides the erasure right (GDPR Art. 17(3)(b)). All actual personal data — name, email, address — lives in a separate customer-profile store that is erasable on request. When a data subject invokes erasure, we wipe the profile store and break the link, leaving the journal as pseudonymised financial records that are no longer personal data. The journal's WORM immutability and GDPR's erasure right never touch because no personal data ever enters the journal. That separation is designed in at beat 01, not bolted on.
Failure & recovery — the 3am scenarios
It's 3am. Walk me through three failures and exactly what happens.
Aurora primary fails (AZ-level). Multi-AZ promotes a standby in another AZ; RDS Proxy holds client connections open and re-points at the promoted writer. In-flight transactions that hadn't committed roll back cleanly — and because they hadn't committed, no journal entry exists, so the client's retry (same idempotency key, whose stranded PENDING claim the sweeper releases) is processed fresh, exactly once. RTO is the failover time, single-digit minutes; RPO is zero within the region thanks to six-way synchronous storage replication across three AZs.
The CQRS projector falls behind or crashes. Balances go stale, not wrong — the journal of record is untouched. The IteratorAge alarm fires; we scale the projector's concurrency or replay from Kinesis. If the cache is corrupted, we flush it and rebuild from the journal — the read model is disposable by design.
The retry storm itself. The viral tweet drives a thundering herd of retries. The idempotency gate absorbs them — every duplicate hits the conditional PutItem, loses, and returns the cached result without ever touching the journal. DynamoDB scales to the spike; ECS Fargate scales on request count; the write-sharded hot account spreads the genuine new writes across 256 shards. The storm hits a wall of idempotent no-ops, which is exactly the design intent.
Cost — what 200k peak / 50k average actually costs per month
Put a number on it. Monthly, at 200k TPS peak and 50k sustained.
The first-cut estimate was off by 6–8$\times$ — it under-counted Aurora I/O and storage, used a 10-shard Kinesis stream that serves 2.5% of peak, and bundled Step Functions at a price only Standard Workflows would charge. Corrected, with every fix from this review applied:
Aurora I/O-Optimized (306 TB hot window) — writer + reader r6g.2xlarge plus flat-rate I/O-Optimized storage on the 90-day window; the archival cap (beat 04b) is what keeps this finite $\rightarrow$ $\sim$$69{,}000/month.
DynamoDB idempotency (provisioned + autoscaling, 50k–200k WCU) — provisioned capacity for the predictable baseline with autoscaling into the viral tail; provisioned is materially cheaper than on-demand once the floor is known $\rightarrow$ $\sim$$28{,}000/month.
Kinesis On-Demand — auto-scaling shard capacity at the stated load, no 400-shard pre-provision $\rightarrow$ $\sim$$2{,}700/month.
Step Functions Express — the saga on Express, $\sim$2,000 sagas/s; Standard would have been $\sim$$648{,}000$ $\rightarrow$ $\sim$$8{,}000/month.
ElastiCache Valkey — 3 × r6g.large $\rightarrow$ $\sim$$440/month. Lambda projector $\rightarrow$ $\sim$$300/month. ECS Fargate + API Gateway + ALB $\rightarrow$ $\sim$$2{,}000/month.
Athena + Glue + S3 + CloudTrail + Macie (reconcile and audit) $\rightarrow$ $\sim$$3{,}000/month. DMS (2 instances) $\rightarrow$ $\sim$$1{,}600/month. RDS Proxy $\rightarrow$ $\sim$$500/month. VPC Interface Endpoints (Gateway endpoints for DynamoDB and S3 are free) $\rightarrow$ $\sim$$200/month.
$$ \text{Total year 1} \approx \mathbf{\$116{,}000/month.}$$
Three drivers dominate: Aurora storage and compute ($\sim$59%), DynamoDB provisioned capacity ($\sim$24%), and Step Functions Express plus Kinesis ($\sim$9%). The rest is rounding.
Did we ever leave AWS?
Anywhere in this design, did you reach for something that isn't AWS-managed? Be honest.
No. Every component is an AWS-managed service or AWS primitives composed: API Gateway, ALB, ECS Fargate, RDS Proxy, DynamoDB for idempotency, Aurora PostgreSQL I/O-Optimized for the journal, ElastiCache Valkey for the balance read model, Kinesis On-Demand + Lambda for the CQRS projector, Step Functions Express for sagas, Athena for the nightly reconciliation re-sum (Glue only for the genesis scan), DMS for change capture, S3 Object Lock + Athena for the audit trail, EventBridge for the sweeper and dead-man's switch, KMS, Secrets Manager, STS session tags, VPC endpoints, CloudTrail, and Macie for security and compliance.
The two places a team might leave AWS — a purpose-built ledger engine like TigerBeetle, or self-managed Kafka instead of Kinesis — didn't clear the bar. TigerBeetle is genuinely excellent at exactly this double-entry workload and would be the answer if the hard requirement were millions of transactions/second on a single primary with deterministic latency. At our 200k-peak envelope, Aurora's deferred-constraint enforcement, managed Multi-AZ durability, and zero-ops failover win, and the SUM=0 invariant is expressible natively in Postgres. Kafka/MSK would hand us back broker sizing and partition rebalancing for an append-only stream that Kinesis serves with less ops burden. We stayed on AWS not by default but because at every rung the managed service met the hard requirement — and the one requirement that could push us off, deterministic single-node throughput in the millions/second, isn't one we have.