Financial ledger and double-entry accounting at scale

A ledger looks like a database table you insert rows into. That illusion survives exactly until the first retry storm. The interesting version of this problem is a ledger that must never lose a cent — not under a thundering herd of retries, not when one account goes viral and absorbs 200,000 writes per second, not when a process dies between claiming an idempotency key and committing the journal entry it was supposed to guard. We are designing the idempotent journal, the write-sharded hot account, the CQRS balance cache, and the WORM audit trail that enforces the double-entry invariant — debits and credits sum to zero, always — while one bestselling tweet detonates a write storm against a single seller account.

The problem, the scale, and the one invariant that cannot bend

Principal

Before any boxes, tell me the one property this system has that a normal CRUD app does not.

Staff

The double-entry invariant: for every transaction, the sum of all its journal entries is exactly zero. Every cent debited from one account is credited to another. If that sum is ever non-zero, money was created or destroyed — and unlike a dropped notification, you cannot paper over it after the fact. Stripe's ledger enforces exactly this: immutable journal entries that net to zero per transaction, and they verify 99.99% of dollar volume within 4 days. The invariant is the product.

The second property: idempotency is not optional. Coinbase says it plainly — idempotency is the single most critical principle, every transaction is processed exactly-once or not at all. A double-charge isn't a bug, it's a chargeback, a regulatory incident, and a lost customer.

Principal

Give me the numbers we're designing for.

Napkin math — the scale envelope

A bestselling-tweet event drives a peak of $200{,}000$ transactions/second. Each transaction is double-entry, so it produces at least two journal entries:

$$ R_{entries} = 2 \times 2 \times 10^{5} = 4 \times 10^{5}\ \text{journal entries/second at peak.}$$

Average is far lower — call it $50{,}000$ TPS sustained. Storage: at $\sim$200 bytes/entry and $4\times10^{5}$ entries/s peak, a one-hour viral spike alone is

$$ 4 \times 10^{5} \times 200\ \text{B} \times 3600\ \text{s} \approx 288\ \text{GB/hour at peak.}$$

At the $50$k sustained average that is $\sim$$1.7\times10^{10}$ entries/day across the platform. Stripe processes $5 \times 10^{9}$ events daily ($\sim$57,870/s) — we are designing for a spikier, burstier version of the same shape. The journal grows without bound and is retained for 7 years. That growth profile drives half the design.

The naive ledger, and the three ways it loses money

Principal

Start with the thing a junior engineer would build. One table.

Staff

The naive design is a single accounts table with a balance column, and a charge does UPDATE accounts SET balance = balance - 100 WHERE id = 'A'. It's tempting because it's one row and the balance read is free. It loses money three different ways.

One: no audit trail. A mutable balance has no history — you cannot answer "why is this balance what it is?" which is the entire job of a ledger. Two: no idempotency — a retried request runs the UPDATE twice and double-debits. Three: it makes balance the source of truth, so a single corrupt write is unrecoverable; there's nothing to re-derive it from.

Principal

Immutable append-only sounds expensive to query. Hold that thought — first show me the idempotency, because that's where most ledgers actually bleed.

Idempotency — and the race that beats the naive check

Staff

Every command carries an idempotency key. The naive implementation: "look up the key; if it exists, return the prior result; if not, process and store it." That read-then-write check has a fatal race.

Napkin math — the idempotency race

A client times out and retries; now two identical requests are in flight. Both execute the check at nearly the same instant:

$$ T_1: \text{read key} \rightarrow \text{absent} \qquad T_2: \text{read key} \rightarrow \text{absent}$$

Neither has written yet, so both see "absent", both proceed, and the account is debited twice. At $2\times10^{5}$ TPS with even a $0.5\%$ retry rate, that's $1{,}000$ duplicate-eligible requests/second — a steady drip of double-charges. The window is the gap between read and write, and at this volume that gap is always being hit by someone.

Staff

The fix is to make the check-and-claim a single atomic operation. The very first thing a command does is a conditional insert: PutItem with attribute_not_exists(idempotency_key) on DynamoDB. This is compare-and-set at the storage layer — exactly one of the two retries wins the partition, the other gets ConditionalCheckFailedException and returns the stored result. The race is closed because there is no gap; the read and the claim are one operation.

The key itself is tenant_id#sha256(client_id + request_id + amount + currency + timestamp_bucket). Including amount and currency means a client reusing a request ID for a different charge isn't silently suppressed; the timestamp bucket bounds the dedup window so the table doesn't have to remember forever. The tenant_id# prefix is deliberate: an opaque hash has no structure an IAM policy can clamp, but a composite key that starts with the tenant lets dynamodb:LeadingKeys fence each tenant's role to its own partition space — we'll lean on that hard in the security beat.

Principal

You said it's a cache-like store. It is not a cache. What happens when DynamoDB is unavailable?

Staff

We fail closed. The idempotency gate is correctness-critical, not best-effort. If we cannot claim the key, we cannot safely process the command, so we reject with a retryable error and let the client back off. A ledger that processes when it cannot dedup is a ledger that double-charges — "open on failure" is the wrong default for money. DynamoDB's availability SLA and multi-AZ design make this rare, but the behaviour on the rare path is non-negotiable: no claim, no charge.

The journal schema and the deferred SUM=0 constraint

Principal

Show me the table that holds the truth. And tell me where the floats are, because I know there's a float bug coming.

Staff

No floats — that's the bug, and we don't get to have it. Amounts are integer cents in a BIGINT. Floating point cannot represent $0.10$ exactly; sum a million of them and you accumulate error that violates the invariant. Integer cents sum exactly.

journal_entries (
  journal_entry_id  UUID         PRIMARY KEY,
  transaction_id    UUID         NOT NULL,
  account_id        TEXT         NOT NULL,
  amount_cents      BIGINT       NOT NULL,   -- signed: debit negative, credit positive
  direction         ledger_dir   NOT NULL,   -- ENUM(debit, credit)
  currency          CHAR(3)      NOT NULL,
  posted_at         TIMESTAMPTZ  NOT NULL DEFAULT now()
)

Now the invariant. For each transaction_id, $\sum amount\_cents = 0$. The trap: a row-level CHECK can't express it, and the debit and credit legs are inserted as separate rows — after the first insert the transaction sums to a non-zero value, which is a valid intermediate state.

Napkin math — the invariant, formally

For a transaction $t$ with entries $e_1 \dots e_k$:

$$ \sum_{i=1}^{k} a_i = 0 \quad \text{where } a_i \text{ is the signed } amount\_cents \text{ of entry } e_i.$$

A simple charge has $k=2$: $a_1 = -100$ (debit the buyer), $a_2 = +100$ (credit the seller). A transfer with a fee has $k=3$ and still nets to zero. Integer cents make the equality exact — there is no $\epsilon$.

Staff

We enforce it with a DEFERRABLE INITIALLY DEFERRED constraint trigger in Aurora PostgreSQL. The check runs at COMMIT, not per row — so the transaction writes both legs, and only when it tries to commit does Postgres verify $\sum = 0$ for every affected transaction_id. A half-written transaction that never commits never violates anything. A transaction that commits with a non-zero sum is rejected by the database itself, not by trusting application code.

RDS Proxy is not optional at this write rate

At $2\times10^{5}$ TPS, hundreds of Fargate tasks each holding direct Aurora connections will blow past the writer's safe ceiling (~1,500 connections on an r6g.2xlarge) — connection exhaustion is the first failure mode you hit scaling this, before CPU or I/O. RDS Proxy multiplexes the task fleet onto a small pool of backend connections and survives a failover by holding client connections open while it re-points at the promoted writer. The catch: our DEFERRABLE INITIALLY DEFERRED constraint spans multiple statements in one transaction, so the proxy must session-pin (transaction-mode multiplexing would scatter the legs across backend connections and break the deferred check). We accept slightly lower pool efficiency for correctness. Cost is ~$500/month.

Why Aurora over self-managed Postgres on EC2, and why not DynamoDB for the journal

The ladder: there's a managed relational service, so we take it. Aurora gives us Multi-AZ failover, automated backups, point-in-time recovery, and a storage layer that replicates six ways across three AZs — durability we'd otherwise hand-build. Self-managed Postgres on EC2 means owning failover, replication lag, and patching for the most correctness-critical store in the system; no thank you. And why not DynamoDB for the journal itself? Because the SUM=0 invariant is a multi-row transactional constraint across a transaction's legs, and that is precisely what a relational engine with deferred constraints does natively. DynamoDB has transactions but no cross-item aggregate check; we'd be re-implementing the invariant in application code, which is the layer we just decided not to trust.

The hot window — keeping 7 years of journal off Aurora's expensive storage

Principal

You said the journal grows without bound and is kept 7 years. Aurora storage is not cheap. Multiply it out before I do.

Staff

It's brutal if you keep everything in Aurora. Storage alone, left unbounded, dwarfs the compute bill by year 7. The fix is that Aurora holds a hot window, not the archive — the archive already lives in S3 as the WORM Parquet export we build for audit anyway. So we partition the journal by month and drop cold partitions out of Aurora once they're safely in S3.

Napkin math — archive in Aurora vs. hot window

At $1.7\times10^{10}$ entries/day $\times$ 200 B, the journal accretes $\sim$3.4 TB/day. Keep all 7 years in Aurora:

$$ 3.4\ \text{TB/day} \times 365 \times 7 \approx 8.69\ \text{PB} \Rightarrow \sim\$890{,}000/\text{month in storage by year 7.}$$

Keep only a 90-day hot window and the steady state is flat:

$$ 3.4\ \text{TB/day} \times 90 \approx 306\ \text{TB} \Rightarrow \sim\$68{,}850/\text{month at the I/O-Optimized \$0.225/GB rate.}$$

And the I/O side: at $\sim$$10^{5}$ entries/s sustained, per-request I/O billing under Aurora Standard mode runs $\sim$$90{,}720/\text{month}$. I/O-Optimized folds I/O into a flat storage rate and erases that line — it pays for itself the moment write I/O exceeds ~25% of the bill, which at this rate it does from day one.

The hot partition — when one account eats the write storm

Principal

The bestselling tweet. One seller account is now receiving the entire $200{,}000$ TPS as credits. What happens to your beautiful Aurora row?

Staff

It melts. Every one of those credits touches the same seller account, and if there's any per-account contention — a balance row, a hot index page, a single partition key — it serialises. Even setting Aurora aside, the same shape kills DynamoDB: a single partition key caps around $1{,}000$ writes/second.

Napkin math — the hot account

The viral seller absorbs $200{,}000$ credit writes/second to one account_id. Against a per-partition ceiling of $\sim$1,000 writes/s:

$$ N_{shards} = \frac{2 \times 10^{5}}{10^{3}} = 200\ \text{shards needed to stay under the ceiling.}$$

So we provision $N = 256$ virtual shards for a flagged hot account — $256 > 200$ gives $\sim$20% headroom on the napkin and is a clean power of two for the hash. There is no "overflow tier": $N=256$ already covers the worst case the math allows, so we don't hand-wave a second mechanism that would itself need a capacity argument. The write storm is divided across 256 partition keys instead of contending on one.

Staff

Write-sharding. A hot account is split into virtual sub-accounts: account_id#000 through account_id#255. Each incoming credit round-robins (or hashes the journal_entry_id) across the 256 shards, so 256 partition keys share the load. The account's true balance is the sum across all 256 shards — and that fan-in never hits the hot path because the CQRS projector materialises the summed balance, so the read side is unchanged regardless of $N$. Crucially we don't shard every account — 99.9% of accounts are cold and a single row is fine. A control table flags which accounts are hot and stores $N$ per account. We pre-shard known-hot accounts — major platform accounts, marketplace float accounts, sellers with a history of going viral — at creation or on early detection, not reactively mid-storm when re-sharding is a migration.

Reading a balance — CQRS, because summing history is O(entries)

Principal

A customer opens the app and wants their balance. You have an immutable, append-only, sharded journal. How fast is that read, naively?

Staff

Naively, unusably slow. Balance = SELECT SUM(amount_cents) FROM journal_entries WHERE account_id = ?. That's O(entries), and a high-velocity account has millions of rows.

Napkin math — the naive balance read

A merchant with 6 months of activity at a modest 50 transactions/second has accumulated

$$ 50 \times 2\ \text{entries} \times 86{,}400 \times 180 \approx 1.5 \times 10^{9}\ \text{journal entries.}$$

Summing 1.5 billion rows on every balance check, for thousands of concurrent users, is a non-starter — even at 1 GB/s scan that's seconds per read. Balance reads are the most frequent operation in the system; they must be O(1).

Staff

This is textbook CQRS. The write side is the immutable journal. The read side is a materialised balance projection kept in ElastiCache (Valkey): one balance per account, already summed across all 32 shards. A balance read is a single cache lookup. The projection is maintained incrementally: a CQRS projector Lambda consumes a Kinesis stream of journal entries — fed from Aurora by DMS change-data-capture — and applies each entry as a delta to the cached balance. On a cache miss we fall back to a bounded Aurora aggregate: the latest balance snapshot plus only the entries since that snapshot, never the full history.

Why ElastiCache Valkey over MemoryDB — and why not DAX

The balance cache is a derived projection we can always rebuild from the journal, so we don't need MemoryDB's durable multi-AZ transaction log — if the cache is lost, the projector replays from the journal of record and rebuilds it. ElastiCache Valkey gives us the in-memory speed and cluster mode (sharded across r6g.large nodes) at lower cost than MemoryDB's durability premium. We reserve MemoryDB for stores that must survive failover without a rebuild source; this one has a rebuild source by construction. DAX was considered and rejected: DAX is a DynamoDB read-cache, not a write-path accelerator; the projector's hot path is an atomic in-memory INCRBY amount_cents on the balance key, and DAX provides no atomic increment — it would force an extra DynamoDB round-trip on every single delta. Valkey's atomic integer increment is the correct primitive.

Principal

Why Kinesis to drive the projector and not just Lambda triggered straight off Aurora?

Staff

Because I want a durable, replayable, ordered log of journal entries between the write side and every read-side consumer — the projector is only the first of them; analytics snapshots and the audit export are others. Kinesis gives me retention (replay the last 24 hours, extendable to 365 days, to rebuild a corrupted projection), ordering per partition key, and a native Lambda event-source mapping with batching and checkpointing. Critically we run it in On-Demand mode: at $4\times10^{5}$ journal entries/s peak, a provisioned stream would need $\sim$400 shards (10,000 records/s each), and pre-provisioning 400 shards for a spike that's mostly idle is both wasteful and a throttle waiting to happen if the viral event overshoots the estimate. On-Demand auto-scales shard capacity with the load, so the stream is never the bottleneck for DMS CDC. The partition key is the logical account_id (not the account_id#NNN shard) — that way every delta for one logical account lands on one Kinesis partition and applies in LSN order, which the projector needs to fold shard writes into a single balance correctly. DMS captures the change stream off Aurora's logical replication and lands it on Kinesis. The alternative — firing Lambda directly per row — gives me no replay buffer and no ordering guarantee, which I need to rebuild the cache after an incident.

Hardening the projector's event-source mapping

A poison record must not block an account's shard forever, and records must not age out unprocessed. The ESM runs with BisectBatchOnFunctionError=true, MaximumRetryAttempts bounded, an on-failure SQS dead-letter destination for record metadata, and a CloudWatch alarm on IteratorAgeMilliseconds > 30000 so projection lag trips an alert before it becomes a stale-balance incident. The projector applies each entry idempotently — keyed on journal_entry_id — because Kinesis delivery is at-least-once and we must not double-apply a delta to the cached balance.

Aurora Database Activity Streams (DAS) was evaluated as a zero-instance CDC alternative to DMS and rejected for the CDC role: DAS emits audit-log-format activity records, not structured row-change events — the projector's delta computation needs amount_cents and direction as first-class fields, which DMS's structured JSON target delivers cleanly and DAS would force us to parse fragilely out of audit text. (We do still use DAS, but in its proper place — capturing privileged DBA SQL for the control-plane audit, beat 10.)

Cross-account transfers — the saga and the stranded cent

Principal

A transfer: debit account A, credit account B. The process dies after the debit. Where's the money?

Staff

Stranded — destroyed from A, never created in B — and the SUM=0 invariant is violated for that transaction. The naive two-step write has no atomicity across the legs. The good news: when both accounts live in the same Aurora cluster, the transfer is a single database transaction with both legs, and the deferred SUM=0 constraint gives me atomicity for free. Either both legs commit or neither does. Most transfers are this case.

Principal

And when they're not the same case — A and B in different shards, services, or regions?

Staff

Then it's a saga orchestrated by Step Functions. A saga-state record tracks progress. If the credit to B fails after the debit to A committed, the saga doesn't try a destructive rollback of an immutable journal — it can't, entries are append-only. Instead it posts a compensating transaction: a new credit back to A that reverses the debit. The audit trail then shows the full path — debit, failed credit, compensating reversal — which is exactly what an auditor wants: the ledger records attempts, not just successes. And the whole saga is keyed on the same idempotency key, so a retried orchestration never double-applies a leg.

Decision — Express, not Standard, Workflows

We run the saga on Step Functions Express Workflows. The arithmetic forces it: at $\sim$2,000 cross-boundary sagas/s $\times$ 5 transitions, Standard Workflows bill per state transition and land at $\sim$$648{,}000/month$; Express bills per execution duration and lands at $\sim$$8{,}000/month$ — an 80$\times$ saving. Express gives at-least-once execution rather than Standard's exactly-once, but exactly-once belongs on the idempotency key at each saga step, not on the orchestrator: every step is keyed transaction_id#step_name, so a re-driven Express execution that re-runs a step is an idempotent no-op. We already built that correctness; Express just stops us paying the orchestrator to duplicate it. Express handles 100,000 starts/s, far past our peak.

Reconciliation — the sweeper, the dead-man's switch, and the drift sweep

Principal

You claimed the idempotency claim and the journal commit are two phases across two systems. Two phases means a window. A process can claim the key, then die before the journal commits. Now there's a claimed key with no entry — a charge the client thinks failed but can never retry, because the key is taken. How long is that window, and who closes it?

Staff

I'm not going to pretend DynamoDB and Aurora commit as one unit — they're separate transactional domains and there is no 2PC between them. So yes, there's a window, and the honest design is to bound and reconcile it, not to wish it away. Phase one: the conditional PutItem claims the key with status=PENDING. Phase two: the Aurora journal transaction commits. Phase three: we flip the DynamoDB item to status=COMPLETE keyed on transaction_id. If the process dies after phase one and before phase two, the claim is stranded PENDING.

A sweeper Lambda closes it. EventBridge schedules it every 5 minutes; it scans for PENDING items older than a deadline, looks up the matching transaction_id in Aurora, and resolves: if the journal entry exists, complete the claim; if it doesn't, the charge genuinely failed, so release the claim so the client can retry. The window a stranded claim can block a retry is therefore $\sim$5 minutes, not the 30-day TTL.

Staff

Detection is the nightly reconciliation sweep: Step Functions orchestrating Athena re-sums every account from the journal of record — the immutable Parquet export — and compares each total to the materialised balance in ElastiCache. Any account where journal-sum $\neq$ cache-balance is a discrepancy, written to S3, and drift exceeding $0.01 pages on-call. This is the independent check that catches a projector bug, a dropped Kinesis record, or a stranded claim before it compounds.

Napkin math — can we re-sum the whole ledger nightly?

Suppose $5\times10^{8}$ active accounts and $1.7\times10^{10}$ entries/day. The reconcile sums from the Parquet audit export partitioned by date, plus the running snapshot — a daily incremental reconcile processes only the day's entries against yesterday's snapshot:

$$ 1.7 \times 10^{10}\ \text{entries} \times 200\ \text{B} \approx 3.4\ \text{TB/day to scan.}$$

That's a single Athena GROUP BY account_id over 3.4 TB of Parquet — $\sim$$17/run$ at $5/TB scanned, no cluster to manage, comfortably inside a nightly window. A full re-sum from genesis runs weekly or on-demand after an incident, reading the entire Object-Lock archive. Incremental nightly, full periodic: drift is bounded to one day.

The audit trail — WORM, because "trust me" is not a control

Principal

Your compliance team needs to prove to an auditor that the journal was never tampered with. A DBA with write access can alter Aurora. How do you make the record immutable?

Staff

You move the durable, queryable record out of any system a human can mutate. DMS streams journal entries to S3 as Parquet, partitioned by date, under S3 Object Lock in compliance mode — write-once-read-many, immutable for the 7-year retention term, and not deletable even by the root account until retention expires. That's the control: not "the DBA promises not to edit it" but "the storage layer physically forbids it." The compliance team queries it with Athena — serverless SQL straight over the Parquet in S3, no cluster to run.

Security, multi-tenancy, and the compliance map

Principal

This is multi-tenant. Tenant X must never read tenant Y's ledger, not through a bug, not through a malicious query. And map me to the frameworks an auditor will ask about.

Staff

Tenant isolation is enforced at multiple layers, defence in depth. The DynamoDB idempotency table's partition key begins with tenant_id#, so an IAM policy with dynamodb:LeadingKeys: ['${aws:PrincipalTag/tenant}#*'] restricts each request to its own partition space — a query for another tenant's keys is denied by IAM before it executes. That only works if the principal carries a trustworthy tenant tag: the Fargate service injects it by calling STS AssumeRole with a session tag per request, so the credential the DynamoDB call runs under is fenced to exactly one tenant — the application can't widen its own blast radius. In Aurora, row-level security policies scope every query to the session's tenant. ElastiCache balances are keyed with the tenant prefix so a cache lookup cannot cross tenants. The principle: isolation is a property of the access path, never of application discipline alone.

Compliance map

SOC 2 Type II: immutable audit log in S3 Object Lock, CloudTrail on every API call, the nightly reconciliation as a continuous control with evidence in S3. ISO 27001: encryption at rest via KMS (customer-managed keys, per-tenant key policies), TLS 1.3 in transit, field-level encryption for any token/PII at the boundary. GDPR: account closure is a soft-delete — mark the account closed; journal entries are retained 7 years for financial-record obligations and contain no PII, so the retention obligation and the erasure right don't collide (the PII lives in a separate, erasable customer-profile store). NIST: least-privilege IAM roles per service, all data-plane components in private VPC subnets, credentials in Secrets Manager with rotation. PCI-DSS: no card data in the ledger — token only — and a Macie scan continuously verifies no PAN-shaped data leaked into the S3 archive.

Principal

GDPR erasure versus 7-year financial retention is a real collision. You waved at it — nail it down.

Staff

The resolution is data separation by legal basis. The journal holds money movements between opaque account IDs and integer cents — financial records, retained 7 years under a legal obligation that overrides the erasure right (GDPR Art. 17(3)(b)). All actual personal data — name, email, address — lives in a separate customer-profile store that is erasable on request. When a data subject invokes erasure, we wipe the profile store and break the link, leaving the journal as pseudonymised financial records that are no longer personal data. The journal's WORM immutability and GDPR's erasure right never touch because no personal data ever enters the journal. That separation is designed in at beat 01, not bolted on.

Failure & recovery — the 3am scenarios

Principal

It's 3am. Walk me through three failures and exactly what happens.

Staff

Aurora primary fails (AZ-level). Multi-AZ promotes a standby in another AZ; RDS Proxy holds client connections open and re-points at the promoted writer. In-flight transactions that hadn't committed roll back cleanly — and because they hadn't committed, no journal entry exists, so the client's retry (same idempotency key, whose stranded PENDING claim the sweeper releases) is processed fresh, exactly once. RTO is the failover time, single-digit minutes; RPO is zero within the region thanks to six-way synchronous storage replication across three AZs.

The CQRS projector falls behind or crashes. Balances go stale, not wrong — the journal of record is untouched. The IteratorAge alarm fires; we scale the projector's concurrency or replay from Kinesis. If the cache is corrupted, we flush it and rebuild from the journal — the read model is disposable by design.

The retry storm itself. The viral tweet drives a thundering herd of retries. The idempotency gate absorbs them — every duplicate hits the conditional PutItem, loses, and returns the cached result without ever touching the journal. DynamoDB scales to the spike; ECS Fargate scales on request count; the write-sharded hot account spreads the genuine new writes across 256 shards. The storm hits a wall of idempotent no-ops, which is exactly the design intent.

Trade-off accepted — Multi-AZ is not multi-region

Let's be honest about the blast radius: this is a single-region Multi-AZ design. A regional impairment takes the ledger offline — that is a declared outage, and recovery time is AWS region recovery, not an automatic failover we control. The mitigation is point-in-time recovery (restore to any second) plus a disaster-recovery runbook, not a phantom "active-active" we don't have. Aurora Global Database is the path to RPO $\sim$1s cross-region at roughly 2.5$\times$ the compute cost — the right upgrade if a regulatory SLA mandates sub-minute RTO across regions. We don't have that requirement, so we don't pay for it; we name the exact trigger that would flip the decision. There is no active cross-region failover in this design, and the transcript no longer claims one.

Cost — what 200k peak / 50k average actually costs per month

Principal

Put a number on it. Monthly, at 200k TPS peak and 50k sustained.

Napkin math — corrected monthly cost, year 1

The first-cut estimate was off by 6–8$\times$ — it under-counted Aurora I/O and storage, used a 10-shard Kinesis stream that serves 2.5% of peak, and bundled Step Functions at a price only Standard Workflows would charge. Corrected, with every fix from this review applied:

Aurora I/O-Optimized (306 TB hot window) — writer + reader r6g.2xlarge plus flat-rate I/O-Optimized storage on the 90-day window; the archival cap (beat 04b) is what keeps this finite $\rightarrow$ $\sim$$69{,}000/month.

DynamoDB idempotency (provisioned + autoscaling, 50k–200k WCU) — provisioned capacity for the predictable baseline with autoscaling into the viral tail; provisioned is materially cheaper than on-demand once the floor is known $\rightarrow$ $\sim$$28{,}000/month.

Kinesis On-Demand — auto-scaling shard capacity at the stated load, no 400-shard pre-provision $\rightarrow$ $\sim$$2{,}700/month.

Step Functions Express — the saga on Express, $\sim$2,000 sagas/s; Standard would have been $\sim$$648{,}000$ $\rightarrow$ $\sim$$8{,}000/month.

ElastiCache Valkey — 3 × r6g.large $\rightarrow$ $\sim$$440/month. Lambda projector $\rightarrow$ $\sim$$300/month. ECS Fargate + API Gateway + ALB $\rightarrow$ $\sim$$2{,}000/month.

Athena + Glue + S3 + CloudTrail + Macie (reconcile and audit) $\rightarrow$ $\sim$$3{,}000/month. DMS (2 instances) $\rightarrow$ $\sim$$1{,}600/month. RDS Proxy $\rightarrow$ $\sim$$500/month. VPC Interface Endpoints (Gateway endpoints for DynamoDB and S3 are free) $\rightarrow$ $\sim$$200/month.

$$ \text{Total year 1} \approx \mathbf{\$116{,}000/month.}$$

Three drivers dominate: Aurora storage and compute ($\sim$59%), DynamoDB provisioned capacity ($\sim$24%), and Step Functions Express plus Kinesis ($\sim$9%). The rest is rounding.

Did we ever leave AWS?

Principal

Anywhere in this design, did you reach for something that isn't AWS-managed? Be honest.

Staff

No. Every component is an AWS-managed service or AWS primitives composed: API Gateway, ALB, ECS Fargate, RDS Proxy, DynamoDB for idempotency, Aurora PostgreSQL I/O-Optimized for the journal, ElastiCache Valkey for the balance read model, Kinesis On-Demand + Lambda for the CQRS projector, Step Functions Express for sagas, Athena for the nightly reconciliation re-sum (Glue only for the genesis scan), DMS for change capture, S3 Object Lock + Athena for the audit trail, EventBridge for the sweeper and dead-man's switch, KMS, Secrets Manager, STS session tags, VPC endpoints, CloudTrail, and Macie for security and compliance.

The two places a team might leave AWS — a purpose-built ledger engine like TigerBeetle, or self-managed Kafka instead of Kinesis — didn't clear the bar. TigerBeetle is genuinely excellent at exactly this double-entry workload and would be the answer if the hard requirement were millions of transactions/second on a single primary with deterministic latency. At our 200k-peak envelope, Aurora's deferred-constraint enforcement, managed Multi-AZ durability, and zero-ops failover win, and the SUM=0 invariant is expressible natively in Postgres. Kafka/MSK would hand us back broker sizing and partition rebalancing for an append-only stream that Kinesis serves with less ops burden. We stayed on AWS not by default but because at every rung the managed service met the hard requirement — and the one requirement that could push us off, deterministic single-node throughput in the millions/second, isn't one we have.

Final decision

A fully AWS-managed, single-region Multi-AZ double-entry ledger: DynamoDB idempotency gate (atomic conditional claim, two-phase with a 5-minute reconciling sweeper), Aurora PostgreSQL I/O-Optimized immutable journal with a deferred SUM=0 invariant, a 90-day hot window, and 256-way write-sharding for hot accounts, Kinesis On-Demand + Lambda CQRS projector into an ElastiCache Valkey balance read model, Step Functions Express sagas for cross-boundary transfers, nightly Step Functions + Athena reconciliation with a dead-man's switch, and an S3 Object Lock + Athena WORM audit trail — $\sim$$116k/month. The invariant never bends; the claim-to-commit window is bounded and observable, not atomic across two systems we can't 2PC; the storm hits a wall of idempotent no-ops; and we never left AWS.

↓ podcast script (.txt)