Idempotent payment gateway

Charging a card looks like one line: POST /charge, the PSP returns a charge ID, done. That one line has a network timeout that turns a successful charge into a phantom failure, a double-click that races two identical requests into the card network, and a saga that crashes after the money moved but before the order was written. The interesting version charges a customer exactly once — the literal money-once guarantee, not a marketing slogan — through every one of those failures, at 50,000 payments/second across two regions, and can prove to an auditor it never charged twice.

The naive charge, and the timeout that charges twice

Principal

Simplest thing that works. Client calls POST /charge with an amount and a card token, we call the PSP, return the result. What breaks?

Staff

The thing that breaks is the one you can't see from the client: a timeout after the side effect. We call the PSP, the PSP charges the card successfully, and then the response is lost — our process is slow, a load balancer idle-timeout fires, a TCP connection resets. The client never gets a 200. From the client's point of view the request failed, so it does the only sane thing: it retries. The second request charges the card again. The money moved twice; the customer sees two line items; we eat a chargeback and a trust hit.

Principal

How often does that actually happen? Timeouts are rare.

Napkin math — rare times huge is not rare

Peak throughput is $R = 5 \times 10^{4}$ payments/second. Suppose only $p = 0.1\%$ of charges hit a post-side-effect timeout and get retried:

$$ D = R \times p = 5 \times 10^{4} \times 10^{-3} = 50\ \text{duplicate charges/second.} $$

Over a day that is

$$ 50 \times 86{,}400 = 4.32 \times 10^{6}\ \text{double-charges/day.} $$

At an average ticket of \$40 that is ~\$173M/day of wrongful charges, every one a refund, a support ticket, and a regulatory exposure. "Rare" times "50k/s" is a catastrophe. Idempotency is not a nicety here; it is the product.

Principal

Fine — before we charge, check whether a charge already exists for this customer and amount. If it does, return it. Done.

Staff

That's a time-of-check-to-time-of-use race, and it's the trap. Two concurrent retries both run "does a charge exist?" at the same instant, both read "no," both proceed to charge. The check and the charge aren't atomic, so the read tells you nothing about what the other in-flight request is about to do. Worse, "same customer and amount" is a guess at identity — a customer legitimately buying the same \$40 item twice in a minute is indistinguishable from a retry. We need an explicit, caller-supplied identity for the operation, and we need the check-and-act to be atomic against concurrent duplicates. That's exactly what an idempotency key plus a lock gives us, and it's why Stripe and Airbnb both built their payment paths around one.

Idempotency key design

Principal

So what is the key, who makes it, and what does the server actually store?

Staff

The client generates it — a V4 UUID — once per logical operation, and sends it as an Idempotency-Key header. This is Stripe's contract exactly: every mutating POST accepts the header, and the key identifies the intent, not the request bytes. The client reuses the same key across all retries of that one charge. New "Pay" click means a new key; retry means the same key. The client owns intent because only the client knows whether this is a fresh purchase or a repeat of one it already tried.
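A minimal client-side sketch of that contract, assuming a hypothetical `send(headers, payload)` transport callable that raises `TimeoutError` when the outcome is unknown (neither name comes from a real SDK). It illustrates the one rule that matters: the key is minted once per logical operation and reused on every retry.

```python
import uuid
from typing import Callable, Dict, Optional


def charge_with_retries(
    send: Callable[[Dict[str, str], dict], dict],  # hypothetical transport
    payload: dict,
    max_attempts: int = 3,
) -> dict:
    """Send one logical charge, reusing a single idempotency key on every retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # One key per logical operation, generated before the first attempt.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return send(headers, payload)
        except TimeoutError as exc:
            # Outcome unknown: retry with the SAME key, never a new one.
            last_error = exc
    assert last_error is not None
    raise last_error
```

A fresh "Pay" click maps to a fresh call of `charge_with_retries`, and therefore to a fresh key.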

Server-side, on first sight of a key we INSERT a row into an idempotency table:

CREATE TABLE idempotency_keys (              -- table name is illustrative
    tenant_id         TEXT        NOT NULL,  -- from authenticated principal
    idempotency_uuid  TEXT        NOT NULL,  -- the client V4 UUID
    request_hash      TEXT        NOT NULL,  -- sha256 of canonicalised request body
    status            TEXT        NOT NULL,  -- PROCESSING | SUCCEEDED | FAILED
    response_code     INT,                   -- cached HTTP status
    response_body     JSONB,                 -- cached response payload
    created_at        TIMESTAMPTZ NOT NULL,
    expires_at        TIMESTAMPTZ NOT NULL,  -- created_at + jittered 23-25h
    lease_expires_at  TIMESTAMPTZ,           -- PROCESSING lease deadline
    PRIMARY KEY (tenant_id, idempotency_uuid)
);

The insert is a conditional upsert: INSERT ... ON CONFLICT (tenant_id, idempotency_uuid) DO NOTHING. If we inserted the row, we're the first; we set status = PROCESSING, stamp lease_expires_at, and run the charge. If the row already exists, we don't execute — we read its outcome and replay it.
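The claim step can be sketched as follows. This is an illustration under stated assumptions, not the production path: it uses Python's bundled SQLite (which accepts the same `ON CONFLICT ... DO NOTHING` clause) as a stand-in for Postgres, a trimmed-down table, and a hypothetical `try_claim` helper.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """In-memory stand-in for the idempotency table (columns trimmed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE idempotency_keys (
               tenant_id        TEXT NOT NULL,
               idempotency_uuid TEXT NOT NULL,
               status           TEXT NOT NULL,
               PRIMARY KEY (tenant_id, idempotency_uuid))"""
    )
    return conn


def try_claim(conn: sqlite3.Connection, tenant_id: str, key: str) -> bool:
    """Return True only for the caller whose INSERT actually created the row."""
    cur = conn.execute(
        "INSERT INTO idempotency_keys (tenant_id, idempotency_uuid, status) "
        "VALUES (?, ?, 'PROCESSING') "
        "ON CONFLICT (tenant_id, idempotency_uuid) DO NOTHING",
        (tenant_id, key),
    )
    conn.commit()
    # One row inserted means we are first; zero means the key already exists.
    return cur.rowcount == 1
```

The point of the pattern is that existence check and insert are a single statement, so there is no window between "read no" and "write" for a concurrent duplicate to slip through.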

The DynamoDB mirror uses the same composite shape: partition key tenant_id, sort key idempotency_uuid, never a concatenated tenant#uuid string. Two reasons. First, the tenant_id partition key is exactly what a dynamodb:LeadingKeys IAM condition scopes against, which is the cross-tenant isolation control in beat 06. Second, the composite key does not have to pin a large merchant to one partition: the sort key is a random UUID, so provided the table carries no local secondary index, DynamoDB's adaptive capacity can split a hot tenant's item collection across partitions along the sort key (without that split, a large merchant pushing >1,000 writes/s would hit the single-partition WCU ceiling).

Principal

Replay what? The first call might still be in flight.

Staff

Three cases, decided by the stored status. If SUCCEEDED or FAILED, we return the cached response_code + response_body verbatim — the caller gets the exact same answer the first call got, including the same charge ID. If PROCESSING, a concurrent duplicate is mid-flight; we return 409 Conflict with a "retry shortly" hint rather than executing. The status field is the lock state, and the row-level lock (next beat) is what makes the transition atomic.

One more guard, and it's the one people forget: request-hash mismatch. If the same key arrives with a different body — same key, different amount — that's a client bug or an attack, and we must not silently return the old result. We return 422 Unprocessable Entity, the same way Stripe rejects a reused key with changed parameters. The key binds to one specific request payload, not just to the key string. The hash has to be over a canonical form or it's a footgun: amounts in integer minor units (no floats), fields sorted by name, UTF-8 encoding, and a canon_version tag so we can evolve the rule without breaking in-flight keys.
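One way to build such a canonical hash, as a sketch only. The `CANON_VERSION` value, the wrapper field names, and the choice of compact JSON as the canonical form are assumptions for illustration; the text above fixes only the rules (integer minor units, sorted fields, UTF-8, a version tag).

```python
import hashlib
import json

CANON_VERSION = "v1"  # hypothetical version tag for the canonicalisation rule


def _reject_floats(value) -> None:
    """Walk the body and refuse any float, so amounts must be integer minor units."""
    if isinstance(value, float):
        raise ValueError("amounts must be integer minor units, not floats")
    if isinstance(value, dict):
        for item in value.values():
            _reject_floats(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            _reject_floats(item)


def canonical_request_hash(body: dict) -> str:
    """sha256 over a canonical form: sorted keys, compact separators, UTF-8."""
    _reject_floats(body)
    canonical = json.dumps(
        {"canon_version": CANON_VERSION, "body": body},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two bodies that differ only in field order hash identically; a changed amount produces a different hash, which is what triggers the 422.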

And the PROCESSING status carries a lease, not just a flag. A worker that crashes mid-saga would otherwise leave the row stuck in PROCESSING forever — the row-level lock only holds for the milliseconds of the DB transaction, not the whole multi-second saga. So we stamp lease_expires_at = now() + saga_max_duration + margin. A later retry that finds PROCESSING with an expired lease is allowed to reclaim the row and re-drive (the PSP idempotency key keeps that re-drive from double-charging). A reconciliation sweeper covers the rest — that's beat 05.
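Pulling the status cases, the hash guard, and the lease together, the server's decision on an incoming key could be sketched like this. The `Action` names, the `KeyRow` shape, and both function names are hypothetical; the branching follows the rules described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Action(Enum):
    EXECUTE = "execute"            # no live row: this is the first execution
    REPLAY = "replay"              # terminal row: return the cached response
    CONFLICT_409 = "conflict_409"  # in flight under a live lease
    MISMATCH_422 = "mismatch_422"  # same key, different request body
    RECLAIM = "reclaim"            # PROCESSING but the lease has expired


@dataclass
class KeyRow:
    request_hash: str
    status: str                    # PROCESSING | SUCCEEDED | FAILED
    expires_at: datetime
    lease_expires_at: Optional[datetime] = None


def lease_deadline(now: datetime, saga_max_duration: timedelta, margin: timedelta) -> datetime:
    """lease_expires_at = now() + saga_max_duration + margin."""
    return now + saga_max_duration + margin


def decide(row: Optional[KeyRow], request_hash: str, now: datetime) -> Action:
    # A row past expires_at is treated as absent, whether or not it was deleted.
    if row is None or row.expires_at <= now:
        return Action.EXECUTE
    if row.request_hash != request_hash:
        return Action.MISMATCH_422
    if row.status in ("SUCCEEDED", "FAILED"):
        return Action.REPLAY
    if row.lease_expires_at is not None and row.lease_expires_at <= now:
        return Action.RECLAIM
    return Action.CONFLICT_409
```

The hash check runs before the status check on purpose, so a reused key with a changed body is rejected even when a cached result exists.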

Principal

One Postgres table at 50k/s of inserts plus the charge write — that's a hot table. How do you keep it from being the bottleneck, and what about key cardinality?

Staff

We shard the idempotency table by a hash prefix of the UUID — this is Airbnb's Orpheus design. Because the key is a UUID, hashing its prefix into $N$ shards gives near-perfect uniform distribution with no hot shard, and the access is always a point lookup by exact key, so we never need a cross-shard query. We size $N = 6$ shards minimum: at 50k writes/s spread over six writer instances that's well inside an r6g.4xlarge's envelope with headroom for the lock traffic.
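A sketch of what the routing function might look like. The exact hash Orpheus uses is not stated here, so the SHA-256 prefix below is an assumption, as are the names `shard_for` and `N_SHARDS`.

```python
import hashlib
import uuid

N_SHARDS = 6  # matches the N = 6 sizing above


def shard_for(idempotency_uuid: str, n_shards: int = N_SHARDS) -> int:
    """Map a client UUID to a shard index in [0, n_shards)."""
    # Normalise case and formatting first; raises ValueError on a malformed UUID.
    canonical = str(uuid.UUID(idempotency_uuid))
    digest = hashlib.sha256(canonical.encode("ascii")).digest()
    # Use the first 8 bytes of the digest as the hash prefix.
    return int.from_bytes(digest[:8], "big") % n_shards
```

Because routing depends only on the key, every retry of the same operation lands on the same shard, which is what keeps the row lock meaningful. Changing `n_shards` remaps keys, so a resize needs a migration plan rather than a config change.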

The managed-first question the principal will ask: why route shards in the application instead of letting Aurora Limitless Database do it? Limitless is now GA and would eliminate our shard router. It supports distributed transactions, so our load-bearing SELECT ... FOR UPDATE lease semantics should survive, but that is the one thing we'd verify before adopting it: the lease is the correctness primitive, not a nicety. For a green-field build we'd start on Limitless; we keep the explicit $N=6$ router in the doc because it makes the lock and routing semantics legible, and a point-lookup workload like this one should migrate to Limitless with little change.

On expiry: we do not treat key reuse as a feature. Client UUIDs are never reused, so "safe to reuse after 24h" is a claim we don't need and won't make — it's a trap, because DynamoDB TTL deletion is best-effort and can lag up to 48 hours. The correctness rule is instead: filter on expires_at at read time in application code. A row past its expires_at is treated as absent regardless of whether the sweeper has physically deleted it. TTL deletion is a storage-reclamation optimisation, never a correctness boundary. And we jitter the expiry: expires_at = created_at + random(23h, 25h), so a flash sale's worth of keys written in one minute don't all expire in the same minute 24 hours later and stampede the sweeper.
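The two expiry rules, jitter at write time and filtering at read time, fit in a few lines. This is a sketch with hypothetical helper names and a plain dict standing in for a stored row.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Optional


def jittered_expiry(created_at: datetime, rng: Optional[random.Random] = None) -> datetime:
    """expires_at = created_at + random(23h, 25h)."""
    rng = rng or random.Random()
    return created_at + timedelta(seconds=rng.uniform(23 * 3600, 25 * 3600))


def live_row(row: Optional[dict], now: datetime) -> Optional[dict]:
    """Treat an expired row as absent, even if it has not been physically deleted."""
    if row is None or row["expires_at"] <= now:
        return None
    return row
```

Every read path goes through the `live_row` filter, so the sweeper's lag changes storage cost and nothing else.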

Saga orchestration for the multi-step charge

Principal

A charge isn't one write. You validate, reserve funds, call the PSP, write a ledger entry, emit events. Walk me through what happens when the middle of that crashes.

Staff

This is the \$86K-duplicate-payout failure class, and it's worth naming precisely. The saga is: (1) validate the request, (2) reserve funds / risk-check, (3) call the PSP (the external side effect — money moves here), (4) settle the ledger entry, (5) emit payment.completed. The naive implementation is a single Lambda that does all five in sequence. Now step 3 succeeds — the PSP charged the card — and the Lambda crashes before step 4 writes the ledger. The retry sees no ledger entry, concludes the payment failed, and charges again. The root cause is brutal and specific: saga state was not persisted to a durable store before the external call, so on restart we couldn't tell "charge already happened" from "charge never happened."

Principal

So persist a row before each step. The instinct is Step Functions Express — one execution per charge, checkpointed. Talk me out of it or into it.

Staff

I'll talk us out of it, on a hard number. Step Functions Express has a 5,000 start/s account quota. At 50k charges/s, even if only 10% are genuinely new-key first executions, that's 5,000 starts/s — we're at the ceiling with zero headroom, and any burst above 10% new keys breaches it. Express also bills per execution plus GB-second, not the cheap per-transition number people assume; at this volume the orchestrator alone is roughly \$197k/month. And Express is at-least-once with no per-state checkpointing — that's Standard's feature, not Express's — so the durability story we wanted from it isn't even there. We'd be paying a premium for a guarantee we don't get and a quota we breach.

So the saga runs in-process inside the gateway container, with DynamoDB conditional writes as the durable checkpoint. The container tier is ECS Fargate behind an ALB (20 tasks of 4 vCPU). It also replaces the Lambda + API Gateway entry tier the first draft assumed: 50k req/s × 200ms is 10,000 concurrent Lambda executions, which sits at the regional ceiling with zero headroom for jitter. We're not hand-rolling coordination from nothing; we're using DynamoDB's conditional-write primitive, which is the managed durability we actually need, instead of an orchestrator we'd outgrow at this quota. The saga:

1. Acquire the lease: SELECT ... FOR UPDATE in a short Aurora transaction, write PROCESSING + lease_expires_at. A DynamoDB conditional PutItem (attribute_not_exists or lease-expired) gates the launch, so two duplicates can't both start the saga.
2. Validate / risk-check.
3. Call the PSP with the idempotency key. This is the external side effect; money moves here.
4. On success: commit SETTLED to Aurora, write the DynamoDB cache entry, append the ledger row, publish payment.completed to EventBridge.
5. On PSP failure: compensating refund if the PSP was reached, no-op if it wasn't, write VOIDED.

Each step persists its state to DynamoDB before proceeding. That's explicit per-step checkpointing, strictly stronger than Express's at-least-once whole-workflow re-run, because a re-drive resumes from the last persisted step rather than replaying the side effect. The idempotency key threads through every step and into the PSP as the PSP's own idempotency key (Stripe, Braintree, Adyen all accept one), so even a re-driven step 3 deduplicates at the PSP before a second authorisation ever reaches the card network.
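The resume-from-checkpoint behaviour can be illustrated with a toy saga. Everything here is a stand-in: `FakePSP` is a hypothetical PSP that deduplicates on the key, a plain dict plays the DynamoDB checkpoint store, a list plays the ledger, and `crash_point` exists only to simulate a process dying at a chosen moment.

```python
from typing import Dict, List, Optional, Tuple


class FakePSP:
    """Hypothetical PSP stand-in that deduplicates on the idempotency key."""

    def __init__(self) -> None:
        self.charges: Dict[str, str] = {}
        self.calls = 0        # how many times we called the PSP
        self.money_moves = 0  # how many times money actually moved

    def charge(self, idempotency_key: str, amount: int) -> str:
        self.calls += 1
        if idempotency_key not in self.charges:
            self.money_moves += 1
            self.charges[idempotency_key] = f"ch_{self.money_moves}"
        return self.charges[idempotency_key]


def run_saga(
    key: str,
    amount: int,
    psp: FakePSP,
    checkpoints: Dict[str, dict],
    ledger: List[Tuple[str, str, int]],
    crash_point: Optional[str] = None,
) -> str:
    """Run or re-drive the saga for one key, resuming from the last checkpoint."""
    state = checkpoints.setdefault(key, {"step": "STARTED", "charge_id": None})

    def maybe_crash(point: str) -> None:
        if crash_point == point:
            raise RuntimeError(f"simulated crash at {point}")

    if state["step"] == "STARTED":
        if amount <= 0:
            raise ValueError("amount must be positive minor units")
        state["step"] = "VALIDATED"
    if state["step"] == "VALIDATED":
        charge_id = psp.charge(key, amount)
        maybe_crash("after_psp_call")            # money moved, checkpoint not yet written
        state.update(step="CHARGED", charge_id=charge_id)
        maybe_crash("after_charged_checkpoint")  # checkpoint written, ledger not yet
    if state["step"] == "CHARGED":
        ledger.append((key, state["charge_id"], amount))
        state["step"] = "SETTLED"
    return state["charge_id"]
```

A crash at `after_psp_call` forces the re-drive to call the PSP again, and the PSP's dedup is what prevents a second charge. A crash at `after_charged_checkpoint` never reaches the PSP on re-drive at all. The sketch does not model a crash between the ledger append and the SETTLED write; a real ledger write would need its own idempotency guard for that window.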

Principal

The PSP itself wobbles — elevated error rate, then a recovery. Your re-drives all hit it at once. And what if a re-drive happens a day later, past the PSP's dedup window?

Staff

Two guards. First, the retry path has full jitter backoff (not plain exponential — synchronised waves are what kill a recovering PSP), a per-attempt call timeout, a bounded MaxAttempts, and a circuit breaker: a DynamoDB/CloudWatch-driven flag that, when the PSP error rate crosses a threshold, sheds new charges to 503 instead of piling retries onto a sick dependency. Second — and this is load-bearing — the PSP idempotency window is finite. Stripe's is 24 hours. If a re-drive could land after that window, the PSP no longer recognises the key and would treat the retry as a fresh charge. So we hold our key-store TTL shorter than the PSP window (22h against Stripe's 24h), and before any re-drive that risks exceeding it we query the PSP for charge status by our reference ID rather than blindly replaying the key. Charge-status-then-act, not retry-and-hope.
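A sketch of the first guard. The parameter values and all names are illustrative assumptions, and the breaker here is a simple in-process error-rate window rather than the DynamoDB/CloudWatch-driven flag described above. The per-attempt timeout is assumed to live inside the `call` callable.

```python
import random
import time
from collections import deque
from typing import Callable, Optional


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                      rng: Optional[random.Random] = None) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


class ErrorRateBreaker:
    """Opens once a full window of recent calls meets the failure threshold."""

    def __init__(self, threshold: float = 0.5, window: int = 20) -> None:
        self.threshold = threshold
        self.window = window
        self.results = deque(maxlen=window)

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def is_open(self) -> bool:
        if len(self.results) < self.window:
            return False
        failures = sum(1 for ok in self.results if not ok)
        return failures / len(self.results) >= self.threshold


def call_with_retries(call: Callable[[], dict], breaker: ErrorRateBreaker,
                      max_attempts: int = 4,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> dict:
    """Bounded retries with full-jitter backoff, shedding when the breaker is open."""
    for attempt in range(max_attempts):
        if breaker.is_open():
            raise RuntimeError("circuit open: shed with 503")
        try:
            result = call()
            breaker.record(True)
            return result
        except TimeoutError:
            breaker.record(False)
            if attempt + 1 < max_attempts:
                sleep(full_jitter_delay(attempt, rng=rng))
    raise TimeoutError("PSP unavailable after bounded retries")
```

Drawing each delay uniformly from zero, rather than adding a small jitter to a fixed exponential step, is what spreads retries out instead of sending them in waves. A production breaker would also need a half-open probe to close again, which this sketch omits.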

Principal

And when step 4 genuinely fails after the PSP succeeded — the ledger DB is down for an hour. The money's gone. Now what?

Staff

That's where compensation earns its place. The saga retries the settle with jittered backoff first: a transient ledger outage usually heals inside the retry budget, and because step 3's outcome is checkpointed in DynamoDB we never re-charge while waiting. If settle exhausts retries, the compensating branch issues a refund/void via the PSP and writes the ledger entry as VOIDED once the ledger recovers. The void carries its own key derived from the charge's idempotency key (reusing the charge key verbatim on a different operation would trip the PSP's changed-parameters check), so the void is itself idempotent. The invariant we hold is: either the customer is charged and the ledger says SETTLED, or they are made whole and the ledger says VOIDED. There is no terminal state where money moved and the books don't know.

DynamoDB idempotency cache for the hot path

Principal

Every single request now does a synchronous read against a sharded Aurora table before it can do anything. At 50k/s, is Aurora the new bottleneck?

Staff

It can be, and the dangerous part is where the load lands. Aurora is the source of truth for the key store — it has to be, because the row-level lock that serialises concurrent duplicates is a relational primitive. But the overwhelmingly common case is a cache-decidable lookup: "have I seen this key before, and if so what was the answer?" For a key we've already completed, that's a pure read of an immutable result. Forcing 50k/s of those through a relational primary, fighting for buffer pool and connections, is wasteful and slow.

Napkin math — how much of the load is cache-decidable

Of $5\times10^{4}$ requests/second, only the first execution of a new key needs Aurora's lock. The replay path (retries and concurrent duplicates) needs only the stored outcome. Suppose steady-state traffic is 95% new keys and 5% replays: the replays, at $0.05 \times 5\times10^{4} = 2.5\times10^{3}$/s, are pure reads that never need to touch the lock. A DynamoDB point read is single-digit milliseconds, well inside our budget, versus a relational round-trip plus connection acquisition. We move the decision to a store built for $10^{6}$+ reads/second so Aurora only sees the writes that genuinely mutate key state.

Staff

So in front of Aurora I put a DynamoDB read-through, write-through idempotency cache — and the same DynamoDB table is the saga's checkpoint store from beat 03, so it earns double duty. The flow: on a request, read the key from DynamoDB (filtering on expires_at at read time). Cache hit on a terminal status → return the cached response immediately, no saga, no Aurora. Cache miss → fall through to Aurora, take the lock, run the in-process saga on first execution, writing each step back through DynamoDB so the next retry is a cache hit.

I'm not putting DAX in front of it, and that's a deliberate reversal. DAX doesn't participate in Global Tables, the first-execution path bypasses it entirely, and our Fargate tasks already hold warm in-memory connection pools to DynamoDB — so DAX would add a cache fleet and its cost without a proportional latency win on the path that matters. Single-digit-millisecond DynamoDB point reads are inside budget for the replay path. We co-locate the Fargate tasks and the DynamoDB VPC gateway endpoint, and pin the Aurora writer in the same AZ, to keep the hot path's network hops short and cross-AZ transfer off the bill.

Trade-off accepted

We run the key state in two stores — DynamoDB (fast path, decision cache, and saga checkpoint) and Aurora (source of truth, lock authority) — and accept the dual-write complexity and a narrow window where DynamoDB lags Aurora. We resolve correctness by never letting the cache create a terminal state: only a request that successfully took the Aurora lock and completed the saga may write a SUCCEEDED/FAILED entry. A cache miss is always safe (fall through to the authoritative lock); only a cache hit short-circuits, and a hit can only exist because Aurora already committed that outcome. We gave up single-store simplicity for a hot-path read budget Aurora alone can't meet — and gave up DAX's microsecond reads because they don't pay for themselves once Fargate holds the connection pool.

Failure and recovery — the four hard races

Principal

The zombie request: the client gave up and the user walked away, but our request is still running and the charge succeeds. The user retries from a new device. What stops the double charge?

Staff

The idempotency table is the memory the client lost. The zombie eventually writes its outcome to the key row (SUCCEEDED + the charge ID). The user's retry carries the same idempotency key — the client SDK persisted it — so it cache-hits or row-hits and replays the stored response: the same charge ID, no new charge. The client gave up; the server didn't forget. The only thing that makes this work is that the key is durable the instant we start, not when we finish — the PROCESSING row is written before the PSP call, which is the exact lesson of the \$86K incident.

Principal

Double-click. Two identical requests, same key, arrive 5 milliseconds apart and race into two different gateway instances. Both miss the cache. Now what?

Staff

Both fall through to Aurora, and Aurora is where the race is decided. The first transaction does INSERT ... ON CONFLICT DO NOTHING and wins the row, setting status = PROCESSING and stamping the lease; it commits, then runs the saga. The second transaction's insert conflicts, so it issues SELECT ... FOR UPDATE on that key row, a row-level lock. It blocks only until the first short transaction commits, then reads the row. If the saga has already finished, the status is terminal and it replays the cached response. If the row is still PROCESSING under a live lease, it returns 409 with the retry hint, and the client's next attempt replays the terminal result. This is Airbnb's Orpheus lease: the row lock grants exactly one request the right to proceed, and the loser never executes. Two clicks, one charge, no application-level coordination: the database's lock is the coordinator.

Principal

The PSP goes through a network split mid-charge. We don't know if it charged. The saga re-drives. Doesn't that double-charge?

Staff

No, and this is why we threaded the key into the PSP. The saga re-drives step 3 with full-jitter backoff, but every attempt carries the same idempotency key to the PSP. The PSP deduplicates on it: if the first attempt actually charged before the split, the retry returns the original charge, not a second one. If it never charged, the retry charges once. If the split outlasted the saga and a re-drive risks the PSP's 24h dedup window, we query the PSP for the charge by reference ID before acting rather than blindly replaying. The saga records whichever terminal outcome the PSP reports, writes it to the ledger and the key store, and we're consistent. The split is invisible to the customer's statement.

Principal

A Fargate task crashes mid-saga and never comes back. The key row sits in PROCESSING. Who unsticks it?

Staff

The lease does, two ways. The lease_expires_at we stamped means a later retry carrying the same key finds PROCESSING with an expired lease and is allowed to reclaim and re-drive — safely, because the PSP key still dedups the charge. For keys that get no retry, a reconciliation sweeper — an EventBridge Scheduler rule firing a Lambda — queries for orphaned PROCESSING rows past their lease, re-drives them through the same saga, and resolves them to SETTLED or VOIDED. The same EventBridge Scheduler drives the Aurora key-expiry reaper. No row stays stuck.

Principal

The PSP sends us an asynchronous webhook — charge succeeded, or a dispute. How do we know it's really the PSP and not someone forging events?

Staff

We verify the webhook's HMAC signature against the signing secret in Secrets Manager before acting on a single byte of it — an unsigned or mis-signed event is dropped, full stop. Inbound PSP events then flow through the same idempotency discipline (the event carries a PSP event ID we dedup on), and out to consumers via EventBridge rules that target SQS queues, one per consumer. EventBridge's PutEvents quota is 10,000/s by default, so at 50k payments/s we request the increase to 50k/s and route to SQS so each consumer polls at its own rate rather than us fanning Lambda concurrency 5-10x straight off EventBridge. Each queue has a DLQ with a redrive policy and an alarm on DLQ depth, so one poison payment.completed event lands in the DLQ instead of blocking its FIFO message group head-of-line. One caveat we write down explicitly: SQS FIFO's dedup window is 5 minutes — that is not our correctness boundary. A replay arriving beyond 5 minutes is caught by the consumer checking the idempotency table, not by FIFO dedup.
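A generic sketch of verify-then-dedup using only the standard library. Real PSPs define their own signature schemes (header names, timestamp prefixes, encodings), so the plain HMAC-SHA256-over-body format, the `id` field name, and the function names here are assumptions for illustration.

```python
import hashlib
import hmac
import json
from typing import Set


def verify_webhook(secret: bytes, payload: bytes, signature_hex: str) -> bool:
    """Constant-time check of an HMAC-SHA256 signature over the raw body."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode("ascii"), signature_hex.encode("utf-8"))


def handle_webhook(secret: bytes, payload: bytes, signature_hex: str,
                   seen_event_ids: Set[str]) -> str:
    """Verify first, parse second, then dedup on the PSP event ID."""
    if not verify_webhook(secret, payload, signature_hex):
        return "rejected"
    # Only parse the body once the signature has been verified.
    event_id = json.loads(payload)["id"]  # hypothetical field name
    if event_id in seen_event_ids:
        return "duplicate"
    seen_event_ids.add(event_id)
    return "accepted"
```

The signature is computed over the raw bytes as received, not over a re-serialised object, because any re-encoding can change the bytes and invalidate a genuine signature. In the design above the `seen_event_ids` set would be the idempotency table, not process memory.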

Principal

An entire region fails. Aurora primary, the lot. What's the RTO, the RPO, and do we double-charge during failover?

Staff

Aurora Global Database for the key store: a secondary region kept in sync with typical cross-region replication lag around a second, and failover promoting the secondary with an RTO typically under a minute and an RPO around 1 second, which is non-zero. I want to be honest that unplanned cross-region failover is not magic-automatic: it requires explicit automation we configure and, more importantly, game-day testing. An untested failover is a hope, not a control. DynamoDB Global Tables for the idempotency cache, active in both regions. The correctness guarantee through failover is again the PSP key: even if the last second of key-store writes hadn't replicated (the non-zero RPO) and a retry lands in the new region as an apparent cache miss, the retry re-runs the saga with the same idempotency key, and the PSP deduplicates the charge. The replication lag can cost us a re-execution of the saga; it cannot cost the customer a second charge, because the money-once invariant lives at the PSP, not in our replication.

The PSP is a single point of failure — and we name the policy

The whole design rests on one PSP, so a PSP outage halts 100% of charges — no region failover saves us, because the dependency is external to AWS entirely. We do not build multi-PSP routing here: it doubles reconciliation surface, splits the idempotency guarantee across two dedup windows, and the requirement is "charge exactly once," not "charge despite Stripe being down." The graceful-degradation policy we do commit to is explicit: when the circuit breaker is open, we hard-decline with a 503 and a retry-after rather than queue-and-settle-later — silently queuing a charge the customer thinks failed, then settling it minutes later, is its own trust incident in payments. If a future SLA demanded charge-availability through a PSP outage, multi-PSP with a per-PSP idempotency-key namespace is the design, and it's a real project, not a config flag.

Security and multi-tenancy

Principal

This handles card data and runs many merchants on one platform. Walk me through PCI scope, cross-tenant isolation, and what an auditor checks.

Staff

The first rule of PCI-DSS scope is don't store card data. We tokenise at ingress — the card PAN goes straight to the PSP's SDK / hosted fields and we receive back an opaque token. We store the token, never the PAN, CVV, or track data. That keeps the primary account number out of our systems entirely and collapses our PCI scope to the token-handling path. The token itself is encrypted at rest with KMS envelope encryption (a data key per record, the data key wrapped by a KMS CMK), so even the token store is defence-in-depth. PSP credentials — API keys, signing secrets — live in Secrets Manager with automatic rotation, never in env vars or code. That's PCI-DSS Req. 3 (protect stored data — by not storing it) and Req. 8/3.6 (key management).

Encryption in transit is the other half people skip. TLS 1.2+ on every hop: rds.force_ssl on Aurora, TLS to the DynamoDB and KMS endpoints, and TLS from the gateway to the PSP — and on that last hop we verify the PSP webhook's HMAC signature against the Secrets Manager signing secret before acting on any inbound PSP event. The PSP-calling tier is also the SSRF surface: IMDSv2 required with hop-limit 1 on any EC2/host in the path, and an egress allowlist (VPC endpoint policy / egress proxy) that restricts the gateway tier's outbound traffic to the PSP's domains only, so a compromised container can't be turned into an internal scanner.

On KMS hygiene: annual automatic key rotation is enabled, and the key policy is least-privilege — key users (the roles allowed kms:Decrypt / GenerateDataKey) are separated from key administrators (who manage the key but can't decrypt with it). We use a shared CMK with a per-record data key rather than a per-tenant CMK: combined with LeadingKeys IAM scoping, a tenant role still cannot read another tenant's ciphertext, so the shared-CMK blast radius is contained at the IAM layer. The case for per-tenant CMKs — independent key rotation and clean GDPR crypto-erasure at offboarding — is real, and we'd switch if a contractual key-isolation clause demanded it; absent that, shared-CMK-plus-data-key is the defensible default at this tenant count.

Principal

Tenant isolation. Can merchant A replay merchant B's idempotency key and read B's charge response?

Staff

No, because the idempotency key is namespaced by tenant_id — in DynamoDB the partition key is tenant_id and the sort key is the client UUID, and the tenant_id comes from the authenticated principal (the verified API credential / JWT claim), never from a client-supplied field. A replay attempt with B's raw UUID under A's credentials addresses a different partition and finds nothing. This is the cross-tenant replay defence: the partitioning makes A's and B's keyspaces disjoint by construction, so a stolen or guessed UUID is useless across the boundary. IAM roles carry a dynamodb:LeadingKeys condition bound to the caller's tenant, so a role physically cannot read another tenant's partitions. Per-tenant API Gateway / ALB usage-plan throttles plus DynamoDB adaptive capacity guard against a noisy neighbour saturating the shared Aurora and DynamoDB tiers. That's the SOC 2 CC6 / ISO 27001 A.9 logical-access story and NIST AC-4 information-flow enforcement.
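Both halves of that defence can be sketched briefly. The principal tag name `tenant_id`, the action list, the table ARN, and the helper names are assumptions; `dynamodb:LeadingKeys` and the `${aws:PrincipalTag/...}` policy variable are standard IAM features.

```python
import json


def storage_key(authenticated_tenant_id: str, client_uuid: str) -> dict:
    """Build the item key from the verified principal, never from the request body."""
    return {"tenant_id": authenticated_tenant_id, "idempotency_uuid": client_uuid}


def tenant_scoped_policy(table_arn: str) -> dict:
    """IAM policy that only allows access to items whose partition key is the caller's tenant."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "dynamodb:GetItem",
                    "dynamodb:PutItem",
                    "dynamodb:UpdateItem",
                    "dynamodb:Query",
                ],
                "Resource": table_arn,
                "Condition": {
                    "ForAllValues:StringEquals": {
                        # Assumes the role session carries a tenant_id principal tag.
                        "dynamodb:LeadingKeys": ["${aws:PrincipalTag/tenant_id}"]
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical table ARN, for display only.
    arn = "arn:aws:dynamodb:eu-west-1:123456789012:table/idempotency"
    print(json.dumps(tenant_scoped_policy(arn), indent=2))
```

With this shape, merchant A presenting merchant B's UUID produces the key (A, uuid), which addresses A's partition and finds nothing, and the IAM condition refuses any attempt to address B's partition directly.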

The audit trail is its own control, and it's stronger now that the saga is in-process: every state transition — key created, PROCESSING, PSP charged, SETTLED, VOIDED — is persisted to the DynamoDB ledger as an immutable row before the saga proceeds, so there's no orchestrator-history gap to explain to an assessor. (X-Ray is operational telemetry, not the system of record; the ledger is.) DynamoDB Streams feeds those transitions to S3 with Object Lock (WORM) in a dedicated logging account. The cross-account write role is scoped to PutObject on the locked prefix only — no Delete, no BypassGovernanceRetention — and the bucket policy denies BypassGovernanceRetention unconditionally, so the record of who was charged, when, and for how much cannot be altered or deleted within retention even by an admin. Org-level multi-region CloudTrail with log-file integrity validation and S3+KMS data events is delivered to that locked logging account. That's SOC 2 CC7.2 / NIST AU-9/AU-10 (audit integrity) and the evidence a PCI assessor wants for Req. 10. Every service runs under a least-privilege IAM role — no wildcard resource or action policies.

On GDPR, two pieces. Data residency is a hard deployment constraint, not a default: the region pair is chosen per tenant, EU merchants are confined to EU-region pairs, and Aurora Global Database / DynamoDB Global Tables never replicate EU data to a non-EU region. And right-to-erasure versus the WORM ledger is resolved by crypto-shredding: the ledger stores only a pseudonymous token reference — no direct PII ever enters the immutable store — and erasure destroys the per-subject data key, rendering the still-immutable record non-identifying. We hold the key TTL at 22 hours, document the lawful basis (contract performance — you can't fulfil a purchase without processing the payment), and keep the ledger for its legally-required retention as a pseudonymised, crypto-erasable record.

Rejected: within-tenant idempotency-key replay is correct by design

A critic flagged that a caller holding tenant A's authenticated credentials can replay tenant A's idempotency key and get tenant A's cached charge response. That is the defined behaviour, not a vulnerability. Idempotency replay returns the same answer to anyone authenticated as that tenant — that is the entire contract. The cached response body contains only data the authenticated caller is already entitled to, so there is no information disclosure across any trust boundary. Adding a second per-call secret to "protect" the replay would break the retry semantics the whole design exists to provide. The real control here is credential security (the tenant's API key), which is out of scope for the idempotency layer and covered by Secrets Manager rotation and least-privilege IAM. We declined this finding.

Decision

Tokenise at ingress (PAN never stored, collapsing PCI scope); KMS envelope-encrypt tokens at rest with a shared CMK + per-record data key, annual rotation, and key-user/key-admin separation; PSP credentials and webhook signing secret in Secrets Manager with rotation; TLS 1.2+ on every hop and HMAC-verified PSP webhooks; IMDSv2 + egress allowlist on the PSP-calling tier. Idempotency keys partitioned by authenticated tenant_id so cross-tenant replay is impossible by construction; DynamoDB LeadingKeys-scoped IAM plus per-tenant throttles enforce it. Append-only DynamoDB ledger streamed to S3 Object Lock (WORM) in a locked logging account, cross-account write role scoped to PutObject with BypassGovernanceRetention denied, org-trail CloudTrail — tamper-evident audit for PCI Req. 10 / SOC 2 CC7. Data residency is a hard region-pair constraint; GDPR erasure is crypto-shredding of a pseudonymised ledger.

Cost

Principal

What does running exactly-once at 50k/s cost, and where's the money?

Principal

Before you give me a number — the first draft of this costing was off by 20x. What did it miss?

Staff

Two big mistakes and a stale price, all fixed by the redesign. The first draft put the gateway on API Gateway + Lambda and never costed them: API Gateway REST at $4.32\times10^{9}$ req/day is ~\$453k/month and Lambda another ~\$37k/month — half a million dollars hiding in an omitted line. Moving to ECS Fargate behind an ALB (20 tasks of 4 vCPU) replaces that with ~\$23k/month. The second: Step Functions Express was priced on a per-transition number that's actually the Standard model — correctly costed it was ~\$197k/month, and the in-process saga drops orchestration to near zero. And DynamoDB was on a pre-November-2024 price; on-demand writes are \$0.25/M now, not \$1.25/M.

Napkin math — daily volume and the corrected line items

Peak is $5\times10^{4}$ payments/second. Sustained over a day that is

$$ 5 \times 10^{4} \times 86{,}400 = 4.32 \times 10^{9}\ \text{payments/day.} $$

(Real traffic peaks and troughs; this is the worst-case envelope, so the real bill is lower.) The corrected monthly line items at the peak envelope:

ECS Fargate + ALB (gateway tier + in-process saga): ~20 tasks $\times$ 4 vCPU plus ALB LCUs $\approx$ \$23k/month, replacing the ~\$490k/month of API Gateway + Lambda the first draft left out. Connection pooling via RDS Proxy in front of Aurora is bundled here.
DynamoDB (cache + checkpoint, two regions): writes at \$0.25/M, reads cheaper, plus a Global Tables replication line (replicated write units roughly double the write cost, ~\$2,160/day extra). Provisioned + auto-scaling against this predictable floor lands the two-region total near ~\$56k/month. No DAX line — we removed it.
Aurora (key store, $N=6$ shards): six r6g.4xlarge Multi-AZ at ~\$1,670/month each $\approx$ \$10k/month. Aurora Serverless v2 is on the table for the spiky payments profile if the floor proves over-provisioned.
EventBridge + SQS + KMS + Secrets Manager: EventBridge fan-out to per-consumer SQS queues, envelope-encryption data-key calls, and secret rotation $\approx$ \$5k/month combined.
X-Ray (1% success sampling + 100% errors): ~\$6.5k/month. CloudWatch Logs at 30-day retention ~\$3k/month. (At 100% X-Ray sampling this line alone could have been ~\$648k/month — sampling is not optional at this volume.)

VPC Gateway endpoints (free) carry DynamoDB traffic and Interface endpoints carry KMS, and Fargate is co-located with the Aurora writer AZ, so cross-AZ transfer and NAT Gateway charges (which would otherwise add ~\$5.8k/month) largely fall away. Peak-envelope total lands around ~\$103k/month — before the PSP's own per-transaction fees, which dwarf the infrastructure and are the real cost of being in payments.
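The arithmetic behind that total can be checked with a short script. The line-item values are the rounded figures quoted above. The 30-day month and the X-Ray rate of \$5 per million traces are assumptions, the latter chosen because it is what the ~\$648k figure implies at this volume; the dictionary keys are illustrative names.

```python
SECONDS_PER_DAY = 86_400
PEAK_RPS = 50_000
DAYS_PER_MONTH = 30  # assumption: 30-day month

payments_per_day = PEAK_RPS * SECONDS_PER_DAY          # 4.32e9
payments_per_month = payments_per_day * DAYS_PER_MONTH  # 1.296e11

# Rounded monthly line items in USD, as quoted in the text.
line_items_usd = {
    "ecs_fargate_alb": 23_000,
    "dynamodb_two_region": 56_000,
    "aurora_6_shards": 10_000,
    "eventbridge_sqs_kms_secrets": 5_000,
    "xray_1pct_sampling": 6_500,
    "cloudwatch_logs": 3_000,
}
total_usd = sum(line_items_usd.values())

# Cross-checks on two of the line items.
aurora_usd = 6 * 1_670                                   # six shards
XRAY_USD_PER_MILLION_TRACES = 5                          # assumed rate
xray_full_usd = payments_per_month * XRAY_USD_PER_MILLION_TRACES // 1_000_000
xray_sampled_usd = xray_full_usd // 100                  # 1% of successes

print(f"payments/day:        {payments_per_day:,}")
print(f"total/month:         ${total_usd:,}")
print(f"X-Ray at 100% / 1%:  ${xray_full_usd:,} / ${xray_sampled_usd:,}")
```

The sampled X-Ray figure ignores the 100%-of-errors traces, which is why it comes out slightly under the ~\$6.5k quoted.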

Staff

The honest reframe: the idempotency machinery is now a defensible ~\$103k/month, an order of magnitude under the first draft's hidden ~\$1M+. The biggest lever is the cache hit ratio — every replay served from DynamoDB is an Aurora lock we didn't take and a saga we didn't run — and the second biggest was simply not running an orchestrator we'd outgrow. The cost we don't list here, the PSP's interchange-plus fee, is 50–200× the infrastructure cost per transaction, which is exactly why double-charging is unacceptable: a wrongful charge isn't just a refund, it's a refunded fee on top.

Did we ever leave AWS?

Principal

You keep saying "the PSP." That's Stripe or Braintree — not an AWS service. Did we leave AWS?

Staff

We left AWS at exactly one boundary, and it's forced by a hard requirement no AWS service can satisfy: you cannot build a card network on AWS. Moving money over Visa/Mastercard rails requires a licensed acquirer/PSP — Stripe, Adyen, Braintree — and that is a regulated financial entity, not a managed service. So the PSP is external by definition, the way APNs and FCM are external to a notification system: every payment platform on Earth terminates at one. That's the named requirement that left no AWS option.

Everything else is AWS: ECS Fargate behind an ALB for the gateway tier and the in-process saga, RDS Proxy pooling connections to Aurora PostgreSQL for the sharded key store (with Aurora Limitless as the managed evolution of the shard router), DynamoDB for the cache and the saga checkpoint, EventBridge fanning out to per-consumer SQS queues (FIFO where ordering matters, with MessageDeduplicationId set to the idempotency key so SQS itself drops duplicates inside its 5-minute deduplication window), EventBridge Scheduler driving the reconciliation sweeper and key reaper, KMS and Secrets Manager for crypto and credentials, Aurora Global Database and DynamoDB Global Tables for the two-region story, CloudWatch and X-Ray for logs and tracing. We never had to self-host a database, a queue, or an orchestrator: the in-process saga uses DynamoDB conditional writes as its durable checkpoint, which is a managed primitive, not a self-hosted coordinator. The default narrative held: one external dependency, structurally unavoidable, and the AWS-native idempotency key threaded into it is what makes even that boundary safe.
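To make the FIFO mapping concrete, here is a sketch that builds the arguments for an SQS send_message call. The helper name, the queue URL, and the choice of message group are assumptions for illustration; the parameter names themselves (QueueUrl, MessageBody, MessageGroupId, MessageDeduplicationId) are the ones the SQS API uses for FIFO queues.

```python
import json


def fifo_message_params(queue_url: str, group_id: str,
                        idempotency_key: str, event: dict) -> dict:
    """Build send_message arguments for a FIFO queue (hypothetical helper)."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(event, sort_keys=True),
        # Ordering is preserved within a message group; which value to
        # group by is a design decision this sketch only assumes.
        "MessageGroupId": group_id,
        # SQS drops a second message with the same deduplication ID
        # if it arrives within the 5-minute deduplication interval.
        "MessageDeduplicationId": idempotency_key,
    }


params = fifo_message_params(
    queue_url="https://sqs.example.invalid/123456789012/ledger.fifo",  # placeholder
    group_id="tenant-a",
    idempotency_key="key-123",
    event={"type": "charge.succeeded", "charge_id": "ch_1"},
)
# With boto3 installed and credentials configured, this would be sent as:
#   boto3.client("sqs").send_message(**params)
```

Because the deduplication interval is only 5 minutes, this is a convenience for near-simultaneous duplicates; consumers still need their own idempotency check for anything that arrives later.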

Where we'd leave AWS — and where we didn't have to

The single departure is the PSP, forced by the hard requirement that card-network settlement is a regulated, licensed function with no AWS equivalent. Notably we did not need to leave AWS for the exactly-once guarantee itself: the idempotency key, the Aurora row lock, the in-process saga checkpointed on DynamoDB conditional writes, and the PSP-side idempotency key together deliver money-once without any custom coordination service. If a future requirement demanded a globally-exact, strongly-consistent key store at single-digit-millisecond latency in the hot path, the AWS answer is DynamoDB conditional writes or Aurora DSQL before it's anything off-platform — and idempotent payments don't impose that, because the PSP is the consistency authority for the only thing that must be exact: the money.
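The checkpoint idea reduces to a compare-and-set on the saga's current step. The sketch below is an in-memory stand-in for that behaviour, not DynamoDB itself: the class, the step names, and the key format are hypothetical. In DynamoDB the same check would be a condition expression evaluated atomically by the service, which a plain Python dict does not give you under concurrency.

```python
from typing import Optional


class CheckpointStore:
    """In-memory model of a conditional-write checkpoint (illustrative only)."""

    def __init__(self) -> None:
        self._step: dict[str, str] = {}

    def advance(self, saga_id: str, expected: Optional[str], new: str) -> bool:
        # expected=None models "the row must not exist yet";
        # otherwise the stored step must equal `expected`.
        if self._step.get(saga_id) != expected:
            return False  # someone else already moved this saga on
        self._step[saga_id] = new
        return True


store = CheckpointStore()
saga = "tenant-a#key-123"

claimed = store.advance(saga, None, "CHARGE_REQUESTED")
duplicate = store.advance(saga, None, "CHARGE_REQUESTED")          # second worker
moved_on = store.advance(saga, "CHARGE_REQUESTED", "ORDER_WRITTEN")
stale = store.advance(saga, "CHARGE_REQUESTED", "ORDER_WRITTEN")   # replayed step
```

Only claimed and moved_on succeed. A worker that crashes and re-runs a step sees the condition fail and knows the step has already been recorded, so it skips the side effect instead of repeating it.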

Critic response — what we changed and the one we declined

Accepted and folded in:

The gateway-tier redesign (Lambda + API Gateway + Step Functions Express + DAX out; ECS Fargate + ALB + in-process DynamoDB-checkpointed saga in), driven jointly by the Lambda 10k-concurrency ceiling, the Express 5k-start/s quota, and the ~20x cost understatement.
Explicit DynamoDB tenant_id/uuid composite key to kill the hot-partition risk.
expires_at read-time filtering and TTL jitter, so best-effort TTL deletion is never a correctness boundary.
The PROCESSING lease plus an EventBridge Scheduler reconciliation sweeper.
PSP circuit breaker, full-jitter backoff, and a charge-status query before any re-drive that risks the PSP's 24h dedup window.
Corrected DynamoDB pricing and Global Tables replication cost.
$N=6$ Aurora shards.
EventBridge-to-SQS fan-out with DLQs.
TLS-everywhere, HMAC-verified webhooks, IMDSv2 + egress allowlist, KMS rotation and key-policy separation, GDPR crypto-shredding of a pseudonymised ledger, data-residency region-pinning, org-trail CloudTrail, per-tenant throttles.
Non-zero (~1s) Aurora Global DB RPO with game-day testing called out.
Aurora Limitless and EventBridge Scheduler named as the managed evolutions.
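Three of the accepted items are small enough to sketch directly: full-jitter backoff, the expires_at read-time filter, and TTL jitter. The function names, default delays, and jitter range below are assumptions chosen for illustration, not values taken from the design.

```python
import random


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                      rng=random) -> float:
    """Sleep time for retry `attempt`: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def is_live(item: dict, now: int) -> bool:
    """Read-time filter: treat an item as gone once expires_at has passed,
    even if background TTL deletion has not removed it yet."""
    return item["expires_at"] > now


def jittered_expiry(now: int, ttl_seconds: int, jitter_seconds: int,
                    rng=random) -> int:
    """Spread expiries so keys written together do not all expire together."""
    return now + ttl_seconds + rng.randint(0, jitter_seconds)


rng = random.Random(0)  # seeded only to make the example repeatable
delays = [full_jitter_delay(attempt, rng=rng) for attempt in range(8)]
expiry = jittered_expiry(now=1_000, ttl_seconds=86_400, jitter_seconds=600, rng=rng)
```

The read-time filter is what keeps TTL out of the correctness path: a reader never trusts that an expired row has been deleted, it checks the timestamp itself.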

Declined: the within-tenant idempotency-key replay finding (a caller authenticated as tenant A replaying tenant A's key gets tenant A's response). That is the defined idempotency contract, not a breach: the response contains only data the authenticated caller already owns, and "protecting" it would break the retry semantics the system exists to provide. Credential security, not the idempotency layer, is the control there. We also held the line on single-region-per-tenant (no multi-region active-active) and single-PSP. Our targets are an RPO of ~1s and "charge exactly once," not zero RPO and charging through a PSP outage. Multi-region active-active and multi-PSP would each double a major cost and reconciliation surface for a requirement we don't have, so we named the degradation policy (503 + Retry-After) instead.
