Distributed payment ledger with idempotent settlement

One HTTP retry without an idempotency key once charged a customer's card twice and cost a marketplace $4M in refunds and chargebacks before anyone noticed the pattern. The naive payment flow is two UPDATEs: debit the source, credit the destination. The interesting version is effectively-once — defense-in-depth against double-charges through retries, failovers, lost responses, and a stuck saga at 3am — across 1,000 authorizations/second and a ledger you cannot edit for seven years.

The problem, and the numbers we design to

Principal

Before any boxes — what are we building, and at what scale?

Staff

A payment ledger for a large marketplace. Design point: 1,000 payment authorizations/second at peak, $10B/year gross payment volume, 10,000 merchants each with isolated accounts, p99 authorization latency under 200 ms, and a ledger retained 7 years for IRS and PCI obligations. Settlements are multi-step: hold the funds, verify the merchant, release to them, collect the platform fee.

The one requirement that dominates every decision: no double-charges. Not a soft SLA — a business invariant. Money moved twice is a refund, a chargeback, a regulator, and a headline.

Principal

Be precise. "Zero double-charges" — is that a hard guarantee, or marketing?

Staff

It is effectively-once with defense-in-depth, and I will use that phrase consistently. Exactly-once delivery does not exist across systems. What exists is two independent guards that both have to fail in the same way to double-charge: a soft guard (the DynamoDB idempotency gate, which stops re-execution) and a hard guard (the Aurora UNIQUE constraint, which rejects the duplicate at commit even if the gate is bypassed). I will not claim a single magic invariant; I will claim that the probability of both failing on the same key is the product of two small numbers, and that the hard guard is database-enforced and unconditional.

Napkin math — why retries are not rare

At $1{,}000$ authorizations/second with a network error rate of just $0.1\%$, the expected retry rate is

$$ R_{retry} = 1{,}000\ \text{req/s} \times 0.001 = 1\ \text{retry/second} $$

That is one retry every second, around the clock:

$$ 1\ \text{retry/s} \times 86{,}400\ \text{s/day} = 86{,}400\ \text{duplicate-charge attempts/day} $$

Without idempotency, that is 86,400 chances to move money twice per day. Retries are not an edge case here; they are a continuous load. Stripe processes roughly 5 billion ledger events/day and treats idempotency as a core primitive, not a feature.

The naive double-debit, and why it is insidious

Principal

Start simple. Debit one account, credit another. What is wrong with that?

Staff

The textbook version is two statements:

UPDATE accounts SET balance = balance - 100 WHERE id = :source;
UPDATE accounts SET balance = balance + 100 WHERE id = :destination;

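People reach for this because it looks atomic if you wrap it in a transaction, and within one database it is. But atomicity is not the problem. The problem is what the client sees. Even with a perfect single-database commit, there are three failure windows: the request is lost before it reaches the server, the server dies before the commit, or the commit succeeds and the response is lost. The caller sees the same timeout in all three and cannot tell whether the money moved, so a blind retry after the third is a second debit.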

Principal

So the fix has to live above the database. The server needs to recognize "I have seen this exact request before."

Staff

Exactly. And "this exact request" has to be defined by the client, before the first attempt, so every retry carries the same identity. That is the idempotency key.

Idempotency keys: the client UUID and the conditional write

Staff

The client mints a UUID before the first attempt and resends the same UUID on every retry. The server treats that key as a claim. Before moving any money, it does a conditional write to a DynamoDB idempotency table:

PutItem(
  Item   = { pk: "tenant#42#key#<uuid>", status: "PENDING", ... },
  Condition = attribute_not_exists(pk)     # I am the first to claim this key
)

The status walks a small state machine: PENDING → PROCESSING → COMPLETE. When the payment finishes, we store the full response body alongside COMPLETE. A retry that loses the conditional write reads the winner's row and returns their stored response, byte for byte. The caller cannot tell a retry from the original.
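A minimal sketch of that claim-and-replay flow, assuming an in-memory dictionary as a stand-in for the idempotency table. The names IdempotencyTable, put_if_absent, and handle_payment are hypothetical and are not DynamoDB API calls; put_if_absent only models the attribute_not_exists condition.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


class ConditionalCheckFailed(Exception):
    """The key already exists, i.e. attribute_not_exists(pk) was false."""


@dataclass
class IdempotencyTable:
    """In-memory stand-in for the idempotency table (not the DynamoDB API)."""
    rows: Dict[str, dict] = field(default_factory=dict)

    def put_if_absent(self, pk: str, item: dict) -> None:
        # Models PutItem with Condition = attribute_not_exists(pk).
        if pk in self.rows:
            raise ConditionalCheckFailed(pk)
        self.rows[pk] = dict(item)


def handle_payment(table: IdempotencyTable, tenant_id: str, key: str,
                   charge: Callable[[], dict]) -> Tuple[int, dict]:
    pk = f"tenant#{tenant_id}#key#{key}"
    try:
        table.put_if_absent(pk, {"status": "PENDING", "response": None})
    except ConditionalCheckFailed:
        row = table.rows[pk]
        if row["status"] == "COMPLETE":
            return 200, row["response"]      # replay the stored response
        return 409, {"retry_after_s": 1}     # winner still in flight: poll, do not re-run
    row = table.rows[pk]
    row["status"] = "PROCESSING"
    response = charge()                      # the only place money moves
    row["status"] = "COMPLETE"
    row["response"] = response
    return 200, response
```

Calling handle_payment twice with the same key runs charge once and returns the same response both times; a key found at PROCESSING yields a 409 without touching charge.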

Decision — a lease on the PROCESSING row, so a crashed worker does not orphan a key

The dangerous gap: a Fargate task claims the key at PENDING, moves it to PROCESSING, then crashes before the Aurora commit. Without a lease, that key is stuck at PROCESSING for the full 48-hour TTL and every retry gets a 409 — the payment is wedged. So the PROCESSING row carries a leaseExpiry (e.g. now() + 30 s). A retry that finds PROCESSING with an expired lease re-claims the key with a conditional UpdateItem (leaseExpiry < now()), bumps the lease, and re-runs. Because the forward step is itself idempotent against the Aurora UNIQUE constraint, a re-claim after a real-but-slow commit cannot double-charge — the duplicate insert is rejected. The lease turns an orphaned key from a 48-hour wedge into a 30-second self-heal.
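The lease check can be sketched as below, again against an in-memory dictionary rather than DynamoDB. The function name try_reclaim and the lease_expiry field are illustrative; in the real table the comparison and the update would be one atomic conditional UpdateItem, which this single-threaded sketch does not reproduce.

```python
from typing import Dict

LEASE_SECONDS = 30  # the example lease length used above


def try_reclaim(rows: Dict[str, dict], pk: str, now: float) -> bool:
    """Take over a PROCESSING key only if its lease has expired.

    Models a conditional update with the condition leaseExpiry < now.
    """
    row = rows.get(pk)
    if row is None or row["status"] != "PROCESSING":
        return False
    if row["lease_expiry"] >= now:
        return False                      # the original worker may still be alive
    row["lease_expiry"] = now + LEASE_SECONDS
    return True                           # caller may re-run the forward step
```

A retry that gets False keeps polling; a retry that gets True re-runs the payment and relies on the Aurora UNIQUE constraint to reject a duplicate insert.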

Principal

Two requests with the same key land in the same millisecond. Walk me through the race.

Staff

That is the time-of-check-to-time-of-use race, and the conditional write is exactly what closes it. Both requests attempt the PutItem with attribute_not_exists. DynamoDB serializes them on the partition key: exactly one succeeds and gets PENDING; the other gets ConditionalCheckFailed. The loser does not proceed — it reads the winner's row and either returns the cached COMPLETE response or, if the winner is still PROCESSING, returns a 409 with a Retry-After so the client polls rather than re-executes.

Napkin math — idempotency table size and cost

Each row is roughly $200$ bytes (key, status, stored response pointer, timestamps). With a 48-hour TTL at $1{,}000$ writes/second:

$$ N = 1{,}000\ \text{req/s} \times 48 \times 3600\ \text{s} = 1.728 \times 10^{8}\ \text{rows} $$

$$ S = 1.728 \times 10^{8} \times 200\ \text{B} \approx 34.6\ \text{GB} \text{ at peak retention} $$

(Steady state with TTL expiry hovers lower; budget ~35 GB.) On DynamoDB on-demand, $1{,}000$ writes/s is $86.4\text{M}$ writes/day, or about $2.592 \times 10^{9}$ writes/month; at $0.625 per million write-request units that is about $1,620/month in write capacity alone, plus consistent reads on the retry path and storage, so round the gate to ~$1,800/month. Still trivial next to what a single double-charge costs. A TTL of 24–48 h matches how long a client could plausibly retry.
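The sizing arithmetic can be checked with a few lines. This assumes a 30-day month and one billed write per request, ignoring the extra status transitions, which is a simplification on my part.

```python
WRITES_PER_SECOND = 1_000
ROW_BYTES = 200
TTL_HOURS = 48
PRICE_PER_MILLION_WRITES = 0.625   # USD, the on-demand rate quoted above
DAYS_PER_MONTH = 30                # assumption

rows_at_peak = WRITES_PER_SECOND * TTL_HOURS * 3600
storage_gb = rows_at_peak * ROW_BYTES / 1e9
writes_per_month = WRITES_PER_SECOND * 86_400 * DAYS_PER_MONTH
write_cost = writes_per_month / 1e6 * PRICE_PER_MILLION_WRITES

print(f"{rows_at_peak:,} rows, {storage_gb:.2f} GB at peak retention")
print(f"{writes_per_month:,} writes/month, ${write_cost:,.0f}/month")
```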

Principal

One tenant#42 prefix can be 10–20% of traffic. That is a hot partition — you can exhaust one partition's write budget on a spike while the table looks idle.

Staff

Right, and that is real at our skew. We shard the partition key: tenant#{id}#shard#{mod_N}, where the shard suffix is a hash of the idempotency key over $N$ buckets. Writes for a single hot tenant spread across $N$ partitions instead of one. The gate read on a retry knows the key, so it recomputes the same suffix, with no scatter-gather on the hot path. Reconciliation and any "all keys for a tenant" query do a bounded scatter-gather across the $N$ shards, which is fine off the critical path.
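One way to compute that suffix, as a sketch: the value of NUM_SHARDS and the choice of SHA-256 are assumptions, not something the design fixes. A stable hash matters because Python's built-in hash() is salted per process and would send a retry to a different shard.

```python
import hashlib
from typing import List

NUM_SHARDS = 16  # hypothetical N


def shard_for(idempotency_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable bucket for a key: the same key always maps to the same shard."""
    digest = hashlib.sha256(idempotency_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition_key(tenant_id: str, idempotency_key: str) -> str:
    """Partition key for the hot path: one shard, no scatter-gather."""
    return f"tenant#{tenant_id}#shard#{shard_for(idempotency_key)}"


def all_partitions(tenant_id: str, num_shards: int = NUM_SHARDS) -> List[str]:
    """Every shard for a tenant, for the off-path scatter-gather queries."""
    return [f"tenant#{tenant_id}#shard#{s}" for s in range(num_shards)]
```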

Principal

Why DynamoDB and not Redis? Redis is faster and you already need a cache.

Staff

Because the idempotency record is the consistency boundary. Stripe stores idempotency results with synchronous replication, not an eventual-consistency cache, for exactly this reason: a Redis primary failover can lose the last few writes, and a lost idempotency record is a permitted double-charge. DynamoDB gives me a conditional write with serialized, durable, strongly-consistent reads as a managed primitive — no Lua, no failover-loss window, single-digit-millisecond. That is the AWS-managed answer over a self-managed Redis I would have to make durable myself.

The double-entry ledger on Aurora PostgreSQL

Principal

The gate stops re-execution. Now where does the money actually live, and how do you know it is never corrupt?

Staff

Double-entry accounting. Every transaction creates exactly two entries — a debit and a credit — that sum to zero. You never edit a balance; you append entries, and a balance is the sum of an account's entries. The invariant across the whole ledger is

$$ \sum_{i} \text{entry}_i = 0 $$

If that sum is ever non-zero, money was created or destroyed, and the reconciliation job screams. The schema is append-only:

CREATE TYPE ledger_entry_type AS ENUM ('DEBIT', 'CREDIT');

CREATE TABLE ledger_entries (
  id              BIGSERIAL PRIMARY KEY,
  account_id      UUID NOT NULL,
  transaction_id  UUID NOT NULL,
  entry_type      ledger_entry_type NOT NULL,   -- DEBIT | CREDIT
  amount          BIGINT NOT NULL,              -- minor units, never float
  currency        CHAR(3) NOT NULL,
  idempotency_key UUID NOT NULL,
  tenant_id       UUID NOT NULL,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (idempotency_key, account_id, entry_type)  -- the hard guard against duplicates
);
-- No UPDATE. No DELETE. Corrections are reversing entries.

Amounts are integer minor units — cents, not floats — because floating-point money rounds, and rounded money does not sum to zero.
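A small sketch of the append-only, sum-to-zero idea. The sign convention (credits positive, debits negative) and the names Entry, post_transfer, and balance are assumptions made for illustration; the schema above stores an unsigned amount plus an entry type.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    account_id: str
    transaction_id: str
    entry_type: str   # "DEBIT" or "CREDIT"
    amount: int       # minor units (cents), never float


def signed(entry: Entry) -> int:
    """Assumed convention: a credit adds to an account, a debit subtracts."""
    return entry.amount if entry.entry_type == "CREDIT" else -entry.amount


def post_transfer(ledger: List[Entry], txn_id: str, source: str,
                  destination: str, amount: int) -> None:
    """Append the debit/credit pair; existing entries are never modified."""
    if amount <= 0:
        raise ValueError("amount must be a positive number of minor units")
    ledger.append(Entry(source, txn_id, "DEBIT", amount))
    ledger.append(Entry(destination, txn_id, "CREDIT", amount))


def balance(ledger: List[Entry], account_id: str) -> int:
    """A balance is derived from entries, not stored."""
    return sum(signed(e) for e in ledger if e.account_id == account_id)
```

After any number of post_transfer calls, the sum of signed entries over the whole ledger is 0. The float problem is easy to reproduce: in Python, 0.1 + 0.2 != 0.3 evaluates to True.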

Principal

Why not Cassandra? It scales writes horizontally and you have 7 years of growth coming.

Staff

Monzo tried exactly that. They ran their ledger on Cassandra, hit its eventual-consistency model, and found it fundamentally unsafe for money — they bolted on etcd for distributed locking as a workaround and then recognized the real lesson: ledger writes demand ACID or an explicit quorum, never BASE. A read that might not see the last committed debit is, again, a license to double-spend. Coinbase reached the same conclusion in their 2019 migration: strongly consistent, append-only, across all products.

So: Aurora PostgreSQL. ACID transactions for the debit/credit pair, on storage that replicates every write six ways across three AZs and acknowledges a commit only after a storage quorum, so an in-region failover to a replica in a second AZ has RPO = 0. Real PostgreSQL ledgers hit ~8.8 ms average and 58 ms p99 at ~1,875 TPS, comfortably inside our 200 ms budget at 1,000/s.

Principal

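Do the connection arithmetic. An r6g.2xlarge tops out near 900 max_connections. By Little's law, 1,000 TPS at 8.8 ms a txn is ~9 connections busy on average, and ~58 if every transaction ran at the 58 ms p99. Fine. At 10× that is ~88 on average and ~580 at p99, and a Fargate scale-out event, with each new task opening its own pool, blows past the limit. Where is the pooler?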

Staff

Between Fargate and Aurora goes RDS Proxy, the managed pooler, so I am not running PgBouncer myself. Fargate tasks open cheap connections to the proxy; the proxy multiplexes them onto a bounded set of Aurora connections and survives a failover by holding client connections while the writer is promoted. One caveat I bake in: the proxy must multiplex at transaction boundaries (transaction mode), and the application must never set tenant context at session level. My tenant isolation uses SET LOCAL app.tenant_id inside each transaction. SET LOCAL is scoped to the transaction and reverts at commit or rollback, so transaction-mode pooling is safe. A session-level SET would outlive the transaction and could follow the connection to the next tenant that borrows it: a cross-tenant data leak. So how tenant context is set under pooling is a security requirement, not just a performance one.

Principal

You have global merchants. Why not Aurora Global Database so writes are close to everyone?

Staff

Because cross-region replication adds 50–100 ms of lag, and that lag would land directly on the authorization critical path — half my latency budget spent on geography. The ledger write stays single-region and synchronous. I use Aurora Global Database the other way: as read replicas in other regions serving settlement reports, balance queries, and analytics — the read-heavy, latency-tolerant traffic — while the authoritative write path stays one region, RPO 0. Monzo's lesson again: no eventual consistency in the ledger write path.

Events without two-phase commit: the transactional outbox

Principal

The ledger committed. Now email, analytics, and the settlement trigger all need to know. How do they find out without you losing or fabricating an event?

Staff

The naive answer is: after the Aurora commit, publish to SQS. The failure is obvious once you say it: Aurora commits, the SQS publish times out or the process dies — event lost forever, ledger correct, downstream blind. So people flip it: publish to SQS first, then commit. Now the SQS publish succeeds, Aurora rolls back — phantom event for money that never moved. There is no ordering of "commit DB" and "publish to broker" that is safe, because they are two systems and you cannot atomically commit both without two-phase commit, which I am not running across a database and a queue.

Decision — the transactional outbox

Write the event into an outbox row in the same Aurora transaction as the ledger entries. One commit, one atomic unit — either both the ledger rows and the outbox row land, or neither does. The broker is no longer part of the commit.

CREATE TABLE outbox (
  id            BIGSERIAL PRIMARY KEY,
  payload       JSONB NOT NULL,
  idempotency_key UUID NOT NULL,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  published_at  TIMESTAMPTZ            -- NULL until a poller ships it
);

A separate Lambda polls WHERE published_at IS NULL ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED every 100 ms, publishes each row to SQS with exponential-backoff retries, and stamps published_at = now(). The FOR UPDATE SKIP LOCKED is load-bearing: if two Lambda instances ever run concurrently (a retry overlapping a slow run), each claims a disjoint batch instead of both publishing the same rows. Belt and suspenders, I also set the poller's reserved concurrency to 1. Delivery is at-least-once; consumers dedupe on the idempotency key in their own inbox.
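The atomic-commit property can be sketched with SQLite from the standard library standing in for Aurora. This is an assumption-heavy miniature: the tables are trimmed, settle and poll_once are hypothetical names, and SQLite has no FOR UPDATE SKIP LOCKED, so the sketch shows only the single-poller case.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE ledger_entries (
    id INTEGER PRIMARY KEY,
    account_id TEXT NOT NULL,
    entry_type TEXT NOT NULL,
    amount INTEGER NOT NULL,
    idempotency_key TEXT NOT NULL,
    UNIQUE (idempotency_key, account_id, entry_type)
);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY,
    payload TEXT NOT NULL,
    idempotency_key TEXT NOT NULL,
    published_at TEXT
);
"""

INSERT_ENTRY = ("INSERT INTO ledger_entries "
                "(account_id, entry_type, amount, idempotency_key) "
                "VALUES (?, ?, ?, ?)")


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def settle(db: sqlite3.Connection, key: str, source: str,
           destination: str, amount: int) -> None:
    """Ledger rows and the outbox row commit together or not at all."""
    payload = json.dumps({"key": key, "from": source,
                          "to": destination, "amount": amount})
    with db:  # commits on success, rolls back if any statement raises
        db.execute(INSERT_ENTRY, (source, "DEBIT", amount, key))
        db.execute(INSERT_ENTRY, (destination, "CREDIT", amount, key))
        db.execute("INSERT INTO outbox (payload, idempotency_key) VALUES (?, ?)",
                   (payload, key))


def poll_once(db: sqlite3.Connection, publish, limit: int = 100) -> int:
    """Ship unpublished rows in id order; stamp each one after it is sent."""
    rows = db.execute(
        "SELECT id, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY id LIMIT ?", (limit,)).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # may raise; the row then stays unpublished
        with db:
            db.execute("UPDATE outbox SET published_at = datetime('now') "
                       "WHERE id = ?", (row_id,))
    return len(rows)
```

If the process dies between publish and the stamp, the row is sent again on the next pass, which is the at-least-once behavior consumers dedupe against.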

Principal

Why a Lambda poller and not Debezium or DMS change-data-capture? That is the standard outbox shipping mechanism.

Staff

CDC is the right tool at higher volume, but it carries standing infrastructure. To be fair about the alternatives: DMS can write CDC directly to Kinesis Data Streams without MSK, so "CDC needs an MSK cluster" is wrong as a blanket claim, and Debezium specifically is what wants MSK Connect. The honest comparison is that DMS is a standing replication instance to provision, monitor, and pay for around the clock, where at 1,000 events/second a single Lambda polling a batch of 100 handles the load with one CloudWatch alarm. So I keep the Lambda poller on its own merits, not on a false claim about DMS.

Principal

Batch 100 every 100 ms is exactly $1{,}000$ rows/s — you are publishing at precisely the rate you ingest, with zero headroom. And ORDER BY id means one poison row blocks every later row behind it. That is head-of-line blocking on the money-event stream.

Staff

Both correct, and both have managed fixes. Headroom: the batch size is a tunable — raise LIMIT to 500 and the same poller clears 5,000 rows/s, so we run with a comfortable margin and alarm on outbox backlog (oldest unpublished row age). At true 10× sustained volume, the standing infrastructure finally pays for itself and we switch to DMS → Kinesis CDC — still no MSK required. Poison row: the outbox carries a publish_attempts counter. After $N$ attempts the poller stamps published_at anyway (so the head advances) and routes the payload to a DLQ row, then alerts. The money is already committed in the ledger; a stuck event must never wedge the stream behind it. We trade a delayed-and-investigated event for a stalled pipeline.
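The poison-row rule can be sketched over an in-memory list. MAX_ATTEMPTS stands in for $N$ and is a made-up value; drain and the row fields are illustrative names, and the list is assumed to be sorted by id already.

```python
from typing import Callable, List

MAX_ATTEMPTS = 5  # hypothetical value for N


def drain(outbox: List[dict], publish: Callable[[str], None],
          dead_letters: List[str], now: float) -> None:
    """One poller pass in id order, with a bounded retry budget per row."""
    for row in outbox:
        if row["published_at"] is not None:
            continue
        try:
            publish(row["payload"])
        except Exception:
            row["publish_attempts"] += 1
            if row["publish_attempts"] < MAX_ATTEMPTS:
                break  # preserve ordering: retry this row on the next pass
            dead_letters.append(row["payload"])  # park it for a human
        # Reached on success, or once the row has been dead-lettered.
        row["published_at"] = now
```

Rows behind a failing row wait for at most MAX_ATTEMPTS passes, then flow again once the failing row has been parked.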

Napkin math — poller and SQS cost, corrected

Polling every 100 ms is $864{,}000$ Lambda invocations/day, $\approx 2.6 \times 10^{7}$/month. With invocation plus duration billing the poller is on the order of $5–21/month — not the $0.40 I waved at earlier; the duration of each batch publish dominates. SQS carries $2.592 \times 10^{9}$ messages/month; at $0.40 per million that is ~$1,037/month for a single consumer, and it multiplies by fan-out. SQS is no longer a rounding error — it is a real line item, and fan-out is a cost decision, not a free one.

Saga orchestration for multi-step settlement

Principal

Settlement is not one ledger write. Hold, verify the merchant against KYC, release, collect the fee. Some of that calls an external API. How do you keep it correct when step three fails?

Staff

One ACID transaction cannot span an external KYC call, so this is a saga: a sequence of local transactions, each with a compensating transaction that reverses it. The marketplace flow:

1. Hold: DEBIT the buyer into an escrow account.
2. Verify: call the external KYC provider (no ledger write).
3. Release: CREDIT the merchant from escrow.
4. Collect fee: DEBIT the merchant, CREDIT the platform.

If step 4 fails after step 3 committed, I must reverse only the committed steps — release and hold — and never touch steps that never ran. That is a Step Functions Standard Workflow: each state is durably persisted, failures retry with exponential backoff, and a Catch block triggers the compensating state machine.
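The reverse-only-what-committed rule, as a sketch in plain Python rather than a Step Functions definition. run_saga and the Step tuple are hypothetical; a step with no ledger write (such as the KYC call) carries no compensation.

```python
from typing import Callable, List, Optional, Tuple

# (name, forward action, compensating action or None)
Step = Tuple[str, Callable[[], None], Optional[Callable[[], None]]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate committed steps in reverse."""
    log: List[str] = []
    committed: List[Step] = []
    for step in steps:
        name, forward, _ = step
        try:
            forward()
        except Exception:
            log.append(f"{name}:failed")
            # The failed step and any later steps never ran, so they are skipped.
            for done_name, _, compensate in reversed(committed):
                if compensate is not None:
                    compensate()
                    log.append(f"{done_name}:compensated")
            return log
        committed.append(step)
        log.append(f"{name}:ok")
    return log
```

With hold, verify, release, and a failing collect-fee step, the log ends with the release compensation followed by the hold compensation, and nothing else is touched.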

Principal

Step 2 is an external KYC call. A slow KYC provider — not down, just gray — will pin your saga workers indefinitely. What protects you?

Staff

The KYC state gets an explicit TimeoutSeconds so a hung call fails fast instead of pinning a worker, retries with JitterStrategy: FULL so a recovering provider does not get a thundering herd, and a circuit breaker keyed on the KYC endpoint in DynamoDB: once the failure rate crosses a threshold the breaker opens and the saga parks new verifications in a retry queue rather than hammering a sick dependency. Gray failure is the dangerous one — a timeout plus a breaker turns it into a bounded, observable backlog instead of an unbounded pile of stuck executions.
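A sketch of the two mechanisms in plain Python. The delay formula is the usual full-jitter scheme (a uniform draw between zero and the capped exponential ceiling); whether Step Functions computes JitterStrategy FULL exactly this way is an assumption. The breaker is a hypothetical in-process version with made-up thresholds; the design above keeps its state in DynamoDB.

```python
import random
from collections import deque
from typing import Optional


def full_jitter_delay(attempt: int, base_s: float = 1.0, backoff_rate: float = 2.0,
                      cap_s: float = 60.0, rng=random) -> float:
    """Uniform delay in [0, min(cap, base * rate**attempt)] to spread retries."""
    ceiling = min(cap_s, base_s * backoff_rate ** attempt)
    return rng.uniform(0.0, ceiling)


class CircuitBreaker:
    """Opens when the failure rate over a full window crosses a threshold."""

    def __init__(self, window: int = 20, max_failure_rate: float = 0.5,
                 cooldown_s: float = 30.0) -> None:
        self.outcomes = deque(maxlen=window)
        self.max_failure_rate = max_failure_rate
        self.cooldown_s = cooldown_s
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_s:
            # Cooldown over: close the breaker and start a fresh window.
            self.opened_at = None
            self.outcomes.clear()
            return True
        return False  # caller parks the verification in the retry queue

    def record(self, success: bool, now: float) -> None:
        self.outcomes.append(success)
        if len(self.outcomes) == self.outcomes.maxlen:
            failure_rate = self.outcomes.count(False) / len(self.outcomes)
            if failure_rate >= self.max_failure_rate:
                self.opened_at = now
```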

Principal

Do the transition arithmetic. Five states, and at 10× load you will pierce the default 2,000-transitions/second quota.

Staff

Two things. First, not every authorization runs the full saga: a simple card payment is the idempotency gate plus the ledger pair, no settlement state machine. Only the marketplace settlements do, maybe 10–20% of authorizations, so call it 100–200 sagas/s at 1×, or 500–1,000 transitions/s across five states. That sits under the default quota today and pierces it well before 10×. Second, we request a StateTransition quota increase ahead of that and put a CloudWatch alarm on ExecutionsStarted and on transition throttling, so we see the ceiling coming rather than hitting it. And on the saga path I will reconsider Express vs Standard on cost grounds in a moment; the choice is not free.

Principal

It is 3am. The compensation itself is failing — you are trying to reverse a merchant payout but the refund service is down. What happens to the money?

Staff

That is the worst case in saga literature: a stuck compensation. The hold committed, the release committed, the fee collection failed, and now the reversal will not go through, so funds are in limbo. The wrong move is to silently retry forever or drop it; both leave money in an undefined state. So: after bounded retries, the saga flips the transaction to a LIMBO state, freezes the merchant account so no further movement compounds the problem, and pages a human through EventBridge → SNS → AWS Systems Manager Incident Manager, the AWS-native on-call and escalation service. A person reconciles it. Money in limbo is never resolved automatically; it is escalated.

Principal

Why Step Functions and not your own orchestrator on Fargate? You already have compute.

Staff

Because the hard part of a saga is durability and restartability across hours-long external waits, and Step Functions Standard Workflows give me that as a managed primitive: every state transition is persisted, so a worker crash mid-saga resumes exactly where it left off — I do not have to build and operate a durable workflow store myself. A hand-rolled orchestrator means I own the state persistence, the retry timers, the visibility, and the failure semantics. That is the managed-over-self-hosted answer.

Failure and recovery

Principal

Aurora primary dies mid-commit. Walk me through it, with numbers.

Staff

Aurora primary failover. Aurora promotes a replica in another AZ, typically in 30–60 s, so RTO is under 60 s. Because a commit is acknowledged only after the storage layer has persisted it across AZs, RPO is 0: any ledger entry whose commit was acknowledged before the failure is visible on the new writer. The client times out and retries with the same key. The DynamoDB gate (a separate system, untouched by the Aurora failover) either sees the key at COMPLETE and returns the cached response, or sees PROCESSING, waits out the lease, and re-runs into the Aurora UNIQUE constraint, which rejects the duplicate. The customer is charged once. The failover is invisible to correctness.

DynamoDB unavailable. We fail closed. If the gate returns a 503, the payment service returns 503 with a Retry-After to the client — never a 200, never a money movement we cannot deduplicate. Failing open here would mean accepting a payment we cannot protect from a retry. The gate being up is a precondition for accepting money, and we accept that availability coupling deliberately.

Outbox poller crashes. Unpublished rows sit in the outbox until Lambda auto-restarts; worst-case event delay is cold start plus poll interval, around 5 s. This is not a correctness issue — the ledger is already correct, the events are merely delayed and will all ship once the poller is back.

Principal

Bigger blast radius: the writer's whole region is gone. What is your RTO and RPO, and is it tested?

Staff

Let me make this one explicit. The authoritative writer is single-region by design: the primary region is chosen per the merchant's regulatory data-residency requirement, not for latency. For region loss I run Aurora Global Database with a secondary region that already streams writes at typically sub-second replica lag; on a region loss we run an unplanned cross-region failover that promotes the secondary to writer, with RPO on the order of ~1 s. (The managed planned switchover needs a healthy primary, so it is the tool for rehearsals, not for a real outage.) Critically, this is a game-day-tested procedure, not a paragraph: we rehearse the promotion, the DNS cutover of the writer endpoint, and the DynamoDB-gate behavior on a schedule, because an untested DR plan is RPO infinity. Our region-loss RPO target is 1 hour, not zero, so a ~1 s failover clears it with margin.

Recovery — daily reconciliation, bounded and watched

A reconciliation Lambda scans the ledger nightly for two anomalies: (a) a DEBIT with no corresponding CREDIT (or a transaction whose entries do not sum to zero), and (b) duplicate (idempotency_key, account_id, entry_type) rows that somehow slipped both guards. It does not full-scan seven years of history — at 1,000 TPS that table is on the order of $10^{11}$ rows. The job is bounded to WHERE created_at >= now() - interval '1 day' on the created_at index; a periodic deep scan over older partitions runs separately and off-hours. Any discrepancy — even one cent — alerts via EventBridge and freezes the affected account pending review. And the job watches itself: an EventBridge Scheduler dead-man alarm pages if reconciliation does not emit its "ran successfully" metric inside the window, so a silently-failing reconciler cannot hide a discrepancy. The invariant $\sum \text{entry}_i = 0$ is checked, not assumed.
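The two checks can be sketched over a list of entry dictionaries. The real job would run as SQL over the bounded created_at window; the field names follow the schema above, the function name reconcile is hypothetical, and the sign convention (credits positive, debits negative) is an assumption.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, Tuple


def reconcile(entries: Iterable[dict]) -> Tuple[List[str], List[tuple]]:
    """Return (transactions that do not sum to zero, duplicated guard keys)."""
    totals = defaultdict(int)
    seen = Counter()
    for e in entries:
        sign = 1 if e["entry_type"] == "CREDIT" else -1
        totals[e["transaction_id"]] += sign * e["amount"]
        seen[(e["idempotency_key"], e["account_id"], e["entry_type"])] += 1
    unbalanced = sorted(t for t, total in totals.items() if total != 0)
    duplicates = sorted(k for k, n in seen.items() if n > 1)
    return unbalanced, duplicates
```

A DEBIT with no matching CREDIT shows up in the first list; a row that slipped past both guards shows up in the second even when its transaction happens to balance.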

Security and multi-tenancy

Principal

This is regulated money. PCI, SOC 2, GDPR. And 10,000 merchants who must never see each other's data. Convince me.

Staff

PCI DSS. Card numbers and CVVs never enter the ledger. Tokenization happens at the edge via AWS Payment Cryptography — a dedicated HSM, PCI DSS Level 1 certified. The token vault maps token → real PAN; only the vault can detokenize. The ledger holds tokenized references and anonymous account UUIDs. This shrinks PCI scope: the ledger is out of the cardholder data environment because it never touches a raw PAN.

SOC 2 and audit. ledger_entries is append-only by design — no UPDATE, no DELETE grants on the table. Any privileged row-level access requires a signed audit justification and is logged to CloudTrail, with quarterly access reviews. CloudTrail is mutable by default, which is useless for audit, so the trail's S3 bucket has Object Lock in compliance mode and log-file integrity validation enabled — the audit log itself cannot be altered or deleted, even by an administrator, within the retention window. The append-only structure is the audit trail, and the trail is tamper-evident.

ISO 27001. Everything encrypted at rest — Aurora, DynamoDB, and S3 archives — under KMS, TLS 1.3 in transit, key rotation every 365 days.

Principal

Stop — you wrote "per-tenant CMK" for a shared Aurora cluster and a shared DynamoDB table. You cannot bind a per-tenant CMK to per-tenant rows in pooled storage. And 10,000 CMKs at $1/key/month is $10,000/month — five times your whole compute budget. That claim is wrong.

Staff

You are right, and I am correcting it. Pooled storage gets a single service CMK, full stop — the storage engine cannot encrypt row-by-row under different keys. The per-tenant cryptographic isolation, where we want it, is per-tenant envelope encryption at the application layer: each tenant has a data key, the data key is encrypted under the service CMK and cached, and tenant PII is sealed under that data key. Crypto-shredding on erasure then means deleting that one tenant's data key from Secrets Manager — the encrypted PII becomes unrecoverable without touching the shared cluster. So the per-tenant isolation claim survives, but it belongs to the isolation pattern, not to a literal per-tenant KMS key on shared storage. The service CMK's key policy denies use to everyone except the specific service roles, and key administrators are separated from key users — no single principal both manages and uses the key.

Principal

SSRF on the Fargate tasks — an attacker hits the instance metadata endpoint and steals role credentials. IMDSv2?

Staff

That finding mostly does not apply here, and I want to be precise about why rather than reflexively bolt on IMDSv2. Fargate tasks do not use the EC2 IMDS at 169.254.169.254. They get task-role credentials from the ECS container credentials endpoint at 169.254.170.2, under a per-task path exposed in AWS_CONTAINER_CREDENTIALS_RELATIVE_URI, which is a different, per-task surface. The classic "SSRF to IMDS steals the node role" attack on EC2 does not have a clean analogue on Fargate. The part of the finding that does apply is egress: I pin the KYC call to an allowlist so even if a task is coerced into making outbound requests, it cannot reach an attacker-controlled host. So: reject the IMDSv2 framing, accept the egress hardening.

Principal

A merchant sends a GDPR erasure request. You also have a 7-year IRS retention obligation on the same data. Those conflict.

Staff

They only conflict if PII lives in the ledger, so it does not. PII — cardholder name, email — lives outside the ledger, in a separate customer service. The ledger holds only anonymous account UUIDs. On an erasure request we crypto-shred the customer record — destroying both the encrypted PII and the UUID↔PII mapping by deleting that tenant's data key — while the ledger entries, keyed by UUID, are retained for the 7-year financial-reporting window.

Principal

Careful — GDPR "singling out". A UUID plus exact amounts plus timestamps can re-identify a person even with the name gone. Is a retained UUID really anonymous?

Staff

Only if the link back to the person is gone, which is the whole point of destroying the UUID↔PII mapping. After crypto-shredding, the ledger holds an unlinkable UUID plus amounts and timestamps with no surviving path to a natural person — the mapping that would let you single someone out has been cryptographically destroyed, not merely soft-deleted. That is what moves the retained record from pseudonymous to effectively anonymized under the singling-out standard. If we had only deleted the customer row and left the mapping recoverable, your objection would stand — so the requirement is destroy-the-key, not delete-the-row.

Decision — tenant isolation in depth

Each merchant is a row-level-security tenant in Aurora: a policy account.tenant_id = NULLIF(current_setting('app.tenant_id', true), '')::uuid, and the application sets SET LOCAL app.tenant_id = :id inside every transaction. The missing_ok flag and the NULLIF turn an unset or reset setting into NULL, so a query without a tenant context matches no rows, fail-safe. (This is why RDS Proxy runs in transaction mode: SET LOCAL must not leak across a pooled connection.) DynamoDB keys are prefixed tenant#{id}# and backed by the dynamodb:LeadingKeys IAM condition so no scan can cross tenants even if the app forgets the prefix. Tenant PII is sealed under per-tenant envelope encryption (a data key per tenant under one service CMK), giving cryptographic per-tenant isolation and crypto-shredding without 10,000 KMS keys. A forgotten WHERE clause, a leaked credential, a pooled connection: each is contained by a different layer.

Cost

Principal

What does the core of this cost per month at 1,000 txns/second?

Principal

Before you answer — your draft numbers are fantasy. Step Functions at $13, DynamoDB at $200, no Aurora I/O line, no API Gateway line. Do it again, correctly.

Staff

Fair hit. The pipeline is still cheap relative to volume, but my earlier line items were off by one to three orders of magnitude. The corrected monthly picture at 1,000 TPS:

API Gateway, $2.592 \times 10^{9}$ req/month (tiered $3.50/$2.80/$2.38 per million over the first 333M, the next 667M, and the remainder): ~$6,800/month. The biggest line, and I had it at zero.
Aurora PostgreSQL db.r6g.2xlarge Multi-AZ, 1-yr reserved, instance only: ~$1,500/month, plus the I/O I had omitted: $\approx 3{,}000$ I/Os/s is $7.776 \times 10^{9}$/month at $0.20/million $\approx$ $1,555/month. At this I/O intensity I switch to Aurora I/O-Optimized: ~20% instance premium but $0 per-I/O, which is cheaper than paying $1,555 of standard I/O. Call Aurora ~$1,800/month on I/O-Optimized.
DynamoDB idempotency gate: $2.592 \times 10^{9}$ writes/month at $0.625/million $\approx$ $1,620/month for writes, plus consistent reads and storage; round to ~$1,800/month, not $200.
SQS, $2.592 \times 10^{9}$ msg/month at $0.40/million: ~$1,037/month per consumer, times fan-out.
Step Functions: Standard at the full authorization rate would be $1{,}000 \times 5 = 5{,}000$ transitions/s $= 1.296 \times 10^{10}$/month at $0.025/1{,}000 \approx$ $324,000/month, so the $13 I originally wrote was off by four orders of magnitude. Two fixes: only 10–20% of authorizations run the saga (~$32,400–64,800/month on Standard), and where saga duration fits inside 5 minutes I use Express Workflows: 100–200 executions/s is $259.2\text{M}$–$518.4\text{M}$ executions/month at $1/million $\approx$ $260–520/month plus duration. I take Express for the short settlements and keep Standard only for the long-running, human-in-the-loop limbo cases.
AWS Payment Cryptography: tokenization is per unique card presentation, not per transaction. If ~5% of transactions present a new card, that is ~$50$ calls/s $= 129.6\text{M}$/month, on the order of $200–2,000/month depending on operation mix. I had this at zero.
Lambda outbox poller, $\approx 2.6 \times 10^{7}$ invocations/month: ~$5–21/month (invocation plus duration), not $0.40.

Honest core total lands around $12,000–14,000/month, roughly five to six times my hand-wave, dominated by API Gateway, Aurora I/O, DynamoDB, and SQS, with Step Functions tamed from six figures on Standard to a few hundred dollars by Express plus saga-only routing. Against $10B/year in volume that is still well under a basis point, and one prevented $4M double-charge incident pays for the platform for roughly 25 years. The lesson stands; the arithmetic now survives a CFO.
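For anyone who wants to re-run the per-unit lines, a short sketch follows. It assumes a 30-day month and uses only the unit prices quoted in the list; the 333M and 667M tier boundaries for API Gateway are my assumption about where the quoted rates apply.

```python
SECONDS_PER_MONTH = 30 * 86_400  # assumption: 30-day month


def tiered_cost(units_millions: float, tiers) -> float:
    """tiers: list of (tier size in millions, or None for the rest; USD per million)."""
    remaining = units_millions
    total = 0.0
    for size, price in tiers:
        used = remaining if size is None else min(remaining, size)
        total += used * price
        remaining -= used
        if remaining <= 0:
            break
    return total


requests_m = 1_000 * SECONDS_PER_MONTH / 1e6          # millions of requests per month
api_gateway = tiered_cost(requests_m, [(333, 3.50), (667, 2.80), (None, 2.38)])
aurora_io = 3_000 * SECONDS_PER_MONTH / 1e6 * 0.20    # standard I/O, before I/O-Optimized
dynamodb_writes = requests_m * 0.625
sqs = requests_m * 0.40                               # per consumer, before fan-out

for name, usd in [("API Gateway", api_gateway), ("Aurora standard I/O", aurora_io),
                  ("DynamoDB writes", dynamodb_writes), ("SQS", sqs)]:
    print(f"{name}: ${usd:,.0f}/month")
```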

Did we ever leave AWS?

Principal

Anywhere in here did a hard requirement force you off AWS?

Staff

No. Every component is an AWS managed service: API Gateway, ECS Fargate, DynamoDB, Aurora PostgreSQL, RDS Proxy, Lambda, SQS, Step Functions, EventBridge, SNS, Payment Cryptography, KMS, Secrets Manager, CloudTrail. The one place I had drifted off-AWS was on-call: the draft paged PagerDuty, which contradicts the very claim I am making here. Replaced with AWS Systems Manager Incident Manager — the AWS-native on-call and escalation service that wires straight into SNS, CloudWatch, and EventBridge — so the escalation path never leaves the account either. The idempotency gate that a less AWS-native design might hand to self-managed Redis stays on DynamoDB precisely because I need durable, conditional, strongly-consistent writes without owning a failover story. The ledger that someone might put on Cassandra stays on Aurora because money demands ACID.

The only non-AWS components are the ones that cannot be AWS: the payment networks and external dependencies — the card networks, ACH, SWIFT, the KYC provider. Those are upstream of the infrastructure, not an infrastructure choice. The platform never left AWS.
