SYSTEM DESIGN STUDIO — Podcast Script Topic: Real-time payment processing with distributed sagas Date: 2026-06-23 | Difficulty: staff Two voices: PRINCIPAL (skeptical interviewer) and STAFF (proposing engineer). For ElevenLabs Projects: assign each speaker label to a distinct voice ID. --- [INTRO] A checkout button looks like one row insert and one HTTPS call. That one-liner has three ways to lose money: it double-charges on a client retry, it splits brain when the gateway says yes but the ledger write never lands, and it strands funds in limbo when the process dies after step 4 of 6. The interesting version moves real money at 10,000 transactions/second, stays under an 800 ms checkout budget, and can still tell you — to the cent — what happened when the network betrayed it. === The problem, and the numbers we design to === PRINCIPAL: Set the scene. What are we building and at what scale? STAFF: A multi-tenant payments platform. 100,000 merchants, 10,000 transactions/second at Black Friday peak, p99 checkout latency budget of 800 ms for synchronous authorization. Settlement is async at T+1; the auth is the synchronous part the shopper waits on. We are PCI DSS Level 1, SOC 2 Type II, ISO 27001. The card networks give the issuing bank a hard 2 s SLA on authorization, so 800 ms is our self-imposed p99 with margin. PRINCIPAL: Draw me the naive design first. One engineer, one afternoon. STAFF: One Lambda: insert a payments row, call Stripe over HTTPS, update the row to captured, return 200. It works in the demo. It has three failure modes that each cost real money. One — double charge on retry. The shopper's phone drops the response. The client retries. Now two Lambdas each insert, each call Stripe, the card is charged twice. This is the literal Stripe-2013 incident before idempotency keys existed. Two — split brain. Stripe returns 200 authorized, but before our ledger write commits, the Lambda times out or the AZ partitions. 
Stripe thinks the customer paid; our ledger has no record. We will refund a charge we cannot see, or never fulfil an order that was paid. Three — dangling state. The process crashes between "debit pending" and "credit settled." Money is reserved and never released. That is the Robinhood 2020 ghost transfer — leg 1 debited, leg 2 failed, funds in limbo for 72 hours. === Idempotency at the gate === PRINCIPAL: Start with the double charge. The client sends an Idempotency-Key. Where do you store it, and why not Redis? STAFF: The very first thing the handler does — before the ID, before the saga — is a DynamoDB conditional put: pk = sha256(merchant_id + idempotency_key) PutItem( Item = { pk, status: "PROCESSING", ttl: now + 24h }, ConditionExpression = "attribute_not_exists(pk)" ) DynamoDB serializes this on the partition key. The first writer wins the slot and proceeds. Every concurrent retry gets ConditionalCheckFailedException — that is the signal "someone already owns this logical payment." We then read the item: if status = COMPLETED, we return the cached response body; if PROCESSING, we return 409 with a short retry. No second copy ever reaches the gateway. Why not Redis? Redis is volatile. An idempotency record is a money-safety invariant — if it evaporates on a failover, a retry that arrives in that window double-charges. DynamoDB gives durable, strongly-consistent conditional writes with a managed TTL. The key is scoped merchant_id + idempotency_key so two merchants reusing the same client-side key string never collide — that is also our first multi-tenant isolation boundary. PRINCIPAL: The slot says PROCESSING and the saga crashes. Now every retry sees PROCESSING forever. You just wedged the customer. STAFF: Correct, and that is the real edge case. The slot carries a lease_expires_at set to now + 90s — and that number is deliberate. 
It must be strictly greater than the worst-case saga: a 2 s gateway auth, capture retries, compensation overhead, and clock margin. A 30 s lease is shorter than a realistically slow saga, so it would let a retry reclaim a slot whose original is still alive and in flight — that is itself a bug, and we caught it. 90 s buys headroom over the worst case without wedging the customer for long. A retry that sees PROCESSING and an expired lease takes over the slot with a conditional update on the lease timestamp and re-drives the saga, which is safe because every saga step is idempotent. So a stuck slot self-heals after one lease interval instead of wedging for 24 hours. The 24 hour TTL is just garbage collection of completed slots, not the recovery mechanism. One thing 90 s does not fully close: a slow-but-alive original and a reclaim-retry can both be running, and both will eventually reach step 6. Money movement (steps 3–5) is fenced by the gateway idempotency key and the Aurora optimistic lock, so it cannot double-charge — but the merchant webhook in step 6 has no such guard yet. We close that in beat 07 with a second conditional put that elects exactly one saga to emit the notification. === Payment ID generation === PRINCIPAL: You own the slot. Now you need a payment ID. UUIDv4 — done? STAFF: UUIDv4 is random, and that is exactly wrong for a ledger. Random IDs scatter B-tree inserts across the whole index — every insert dirties a different leaf page, killing write locality and bloating the index. Worse, a time-range query ("all payments in this hour") becomes a full scan because the ID carries no time order. I want a k-sortable ID: roughly time-ordered so inserts append near the right edge of the index and range scans are cheap. 
A Snowflake variant, 63 bits, but the bit split matters more than it looks:

    | 41 bits ms timestamp   | 14 bits worker ID  | 8 bits sequence   |
    | since 2024-01-01 epoch | 16,384 workers max | 256 ids/ms/worker |

The original cut was 10 worker bits / 12 sequence bits, and that 10-bit worker field is a scalability bug. From the beat-01 math we have $C = R \times d = 10^{4} \times 0.6 = 6{,}000$ concurrent Lambda instances at peak, and 10 bits caps at 1,024 slots. We would exhaust worker IDs long before peak. So we rebalanced: 14 worker bits (16,384 slots) for ~2.7x headroom over the 6,000 concurrent instances, and trimmed the sequence to 8 bits = 256 IDs/ms/worker = 256k IDs/s/worker, still far past any single worker's real rate. Total stays 63 bits. We add a clock-rollback guard and an alarm when slot utilisation crosses 80%. 41 bits of milliseconds still buys ~69 years from the epoch.

We use ULIDs for audit and event IDs (monotonic within a millisecond, Crockford base32, URL-safe) because those want lexicographic sortability in logs and S3 key prefixes, not 63-bit compactness.

PRINCIPAL: Two Lambda cold-starts in the same millisecond. They both pick worker ID 7. Now they collide. How do you assign worker IDs in a serverless world with no stable identity?

STAFF: That is the worker-ID collision, and the fix is to claim the worker slot, not derive it. At cold-start, each Lambda instance walks a DynamoDB table of 16,384 worker slots and does a conditional put to claim a free one:

    PutItem(
      Item = { worker_id: n, instance_arn, heartbeat: now },
      ConditionExpression = "attribute_not_exists(worker_id) OR heartbeat < :stale"
    )

The conditional put is atomic, so two cold-starts cannot both win slot 7: one gets the slot, the other moves to slot 8. A slot whose heartbeat is stale (the instance was reaped) can be reclaimed. The claim happens once per cold-start, off the hot path, so it never touches the 800 ms budget.
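[SHOW NOTES] The 41/14/8 bit layout and the clock-rollback guard described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's actual generator: the names `pack`, `unpack`, and `IdGenerator` are hypothetical, the worker ID is assumed to have already been claimed from the slot table, and the injectable `clock` parameter exists only to make the behaviour testable.

```python
import time

EPOCH_MS = 1_704_067_200_000          # 2024-01-01T00:00:00Z in Unix milliseconds
WORKER_BITS = 14
SEQ_BITS = 8
MAX_WORKER = (1 << WORKER_BITS) - 1   # 16,383, i.e. 16,384 slots
MAX_SEQ = (1 << SEQ_BITS) - 1         # 255, i.e. 256 IDs per ms per worker


def pack(ts_ms: int, worker_id: int, seq: int) -> int:
    """Pack timestamp, worker ID and sequence into one 63-bit integer."""
    delta = ts_ms - EPOCH_MS
    if not 0 <= delta < (1 << 41):
        raise ValueError("timestamp outside the 41-bit window")
    if not 0 <= worker_id <= MAX_WORKER:
        raise ValueError("worker_id out of range")
    if not 0 <= seq <= MAX_SEQ:
        raise ValueError("seq out of range")
    return (delta << (WORKER_BITS + SEQ_BITS)) | (worker_id << SEQ_BITS) | seq


def unpack(snowflake: int) -> tuple[int, int, int]:
    """Inverse of pack: returns (ts_ms, worker_id, seq)."""
    seq = snowflake & MAX_SEQ
    worker_id = (snowflake >> SEQ_BITS) & MAX_WORKER
    ts_ms = (snowflake >> (WORKER_BITS + SEQ_BITS)) + EPOCH_MS
    return ts_ms, worker_id, seq


class IdGenerator:
    """Mints k-sortable IDs for one already-claimed worker slot."""

    def __init__(self, worker_id, clock=lambda: time.time_ns() // 1_000_000):
        self._worker_id = worker_id
        self._clock = clock
        self._last_ms = -1
        self._seq = 0

    def next_id(self) -> int:
        now = self._clock()
        if now < self._last_ms:
            # Clock-rollback guard: never mint an ID that could sort
            # before, or collide with, one already issued.
            raise RuntimeError("clock moved backwards; refusing to mint IDs")
        if now == self._last_ms:
            if self._seq == MAX_SEQ:
                # Sequence exhausted for this millisecond: wait for the next.
                while now <= self._last_ms:
                    now = self._clock()
                self._seq = 0
            else:
                self._seq += 1
        else:
            self._seq = 0
        self._last_ms = now
        return pack(now, self._worker_id, self._seq)
```

Because the timestamp occupies the high bits, comparing two IDs as integers compares their mint times first, which is what keeps inserts near the right edge of the B-tree.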
PRINCIPAL: There is a coordination-free alternative that deletes that worker-slot table entirely: UUID v7. Why carry the slot table at all? STAFF: UUID v7 is the honest counter-argument. It is a time-ordered UUID — a 48-bit millisecond prefix plus random tail — so it is k-sortable like Snowflake, and it needs no worker slot, no claim, no DynamoDB table, no clock-rollback guard. Strictly less coordination. If our only axis were operational simplicity, UUID v7 wins. We keep Snowflake on one load-bearing property: 64-bit integer key compactness. UUID v7 is 128 bits; our Snowflake fits in a bigint at 63. At billions of ledger rows, every B-tree index on payment_id — primary key, every foreign key in ledger_entries, every covering index — is half the width. Half the index pages, half the buffer-pool pressure, more rows per page on range scans. That is a real, measurable cost at our row count, and it is why we pay the worker-slot complexity. The trade is explicit: we accept the slot table to keep the index narrow. PRINCIPAL: One problem you just created. A Snowflake ID is predictable — 41-bit timestamp, small worker field, sequence. If any read endpoint authorizes on payment_id alone, I can enumerate my neighbour's payments. That is textbook BOLA — OWASP API1. STAFF: Right, and the fix is to never expose the numeric ID. We keep the Snowflake as the internal primary key — it earns its place on B-tree locality — and mint a separate opaque external reference for clients: ext_ref = base62(HMAC-SHA256(payment_id, per_tenant_secret)) We store ext_ref on the payments row and surface that to merchants, never the raw Snowflake. It carries no sequential information and it is forgery-resistant: a client cannot construct a valid reference without the per-tenant HMAC secret, so they cannot guess an adjacent payment's handle. 
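[SHOW NOTES] The opaque external reference can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions: the function names, the 8-byte big-endian encoding of the payment ID, and the base62 alphabet order are all hypothetical choices, and in practice the per-tenant secret would come from a secrets store and not a literal.

```python
import hashlib
import hmac

# Assumed alphabet order; any fixed 62-character alphabet works.
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62(data: bytes) -> str:
    """Encode bytes as a base62 string (leading zero bytes are not preserved)."""
    n = int.from_bytes(data, "big")
    if n == 0:
        return BASE62[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits))


def ext_ref(payment_id: int, tenant_secret: bytes) -> str:
    """Derive the client-facing handle: the secret is the HMAC key,
    the internal Snowflake ID is the message."""
    message = payment_id.to_bytes(8, "big")
    digest = hmac.new(tenant_secret, message, hashlib.sha256).digest()
    return base62(digest)


def is_valid_ref(candidate: str, payment_id: int, tenant_secret: bytes) -> bool:
    """Constant-time check that a handle belongs to this payment and tenant."""
    return hmac.compare_digest(candidate, ext_ref(payment_id, tenant_secret))
```

Adjacent payment IDs produce unrelated handles, and the same payment ID under a different tenant secret produces a different handle, so the reference leaks neither ordering nor neighbours.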
And every read endpoint re-validates merchant_id from the authenticated JWT against the row — the query filters on merchant_id AND ext_ref, never on the ID alone. Enumeration of adjacent payments is dead two ways: the handle is unguessable, and the row is tenant-scoped. === Saga vs 2PC — and orchestration vs choreography === PRINCIPAL: Now the hard part. Three internal services and an external gateway have to agree. The textbook answer is two-phase commit. Why won't you use it? STAFF: Because 2PC has a single point of failure that is fatal for money: the coordinator. In the window between phase-1 prepare and phase-2 commit, every participant holds locks and waits. If the coordinator dies in that window, participants are blocked — they cannot unilaterally commit or abort. That is precisely Robinhood 2020: the coordinator died after phase 1, leg 1 had debited, leg 2 never ran, money sat in limbo 72 hours. And the external gateway is not an XA resource manager at all — Stripe will not enlist in your distributed transaction. XA across a network boundary you do not control is a fantasy. The answer is a saga: a sequence of local transactions, each with an explicit compensating transaction that semantically undoes it. No global lock. If step 4 fails, you run the compensations for steps 3, 2, 1 in reverse. PRINCIPAL: Choreography or orchestration? Events flying between services, or a central brain? STAFF: Orchestration, for this flow. Choreography — each service emits events the next subscribes to — is elegant for loosely-coupled fan-out, but a payment is a linear, ordered flow where I need to know exactly where I am to compensate correctly. With choreography, the saga state is smeared across N event logs and there is no single place that says "we are at step 4, here is the compensation plan." Orchestration puts that brain in one auditable place. 
For PCI and SOC 2, "show me the exact state and history of payment X" must be a single query, not an archaeology dig across topics.

The orchestrator is Step Functions Express, started synchronously:

    1  ValidatePayment   -> merchant config + Fraud Detector score
    2  ReserveBalance    -> debit "pending" on Aurora (optimistic lock)
    3  AuthorizeGateway  -> call Stripe/Adyen (Redlock-fenced)
    4  CaptureOrVoid     -> capture if auth OK, else void
    5  CommitLedger      -> settled credit/debit on Aurora
    6  NotifyMerchant    -> SQS -> SNS fan-out + EventBridge

Each step is idempotent. Each has a compensation wired through a Catch:

    Step 2 fails -> ReleaseBalance
    Step 3 fails -> ReleaseBalance
    Step 4 fails -> VoidAuth + ReleaseBalance
    Step 5 fails -> VoidAuth + ReleaseBalance + DLQ for human review

PRINCIPAL: Why Express and not Standard? Standard gives you a year-long execution and exactly-once.

STAFF: Because the shopper is waiting. Express supports StartSyncExecution: the workflow runs and returns the result inline, which is what a synchronous auth needs. Standard is async-first and its per-transition pricing and persistence model are tuned for long-running, low-volume workflows; at 10k TPS its cost is far higher. Express bills per request plus duration and is built for high-volume, short workflows. The Express ceiling is 5 minutes; our saga is sub-second, so the ceiling is a non-issue. The trade is that Express is at-least-once, not exactly-once, which is fine only if every step and every compensation is genuinely idempotent. That word is doing a lot of work, and it is worth being precise about what makes it true.

PRINCIPAL: "Every step is idempotent." Prove it. Express is at-least-once, so ReleaseBalance can run twice. If it does balance += amount, you just double-credited. Where is the guard?

STAFF: Fair challenge. A relative delta is not idempotent, and a naive compensation that does balance += amount would double-credit on replay.
The discipline is that every compensation is a guarded state transition, never a relative delta:

    UPDATE payments
       SET status = 'released', version = version + 1
     WHERE payment_id = :id AND status = 'reserved' AND version = :v;

A replay matches zero rows: the status already moved to released, so the second run is a safe no-op, not a second credit. The status guard in the WHERE clause is what makes at-least-once safe; idempotency here is a property of the SQL, not a hope. And VoidAuth carries the payment_id as the gateway's idempotency key, identical to the forward AuthorizeGateway call, so a re-run of the void de-dupes on the gateway's side exactly as the forward call does. Same key, both directions.

=== The gateway call and the Redlock fence ===

PRINCIPAL: Step 3 calls Stripe. Step Functions can retry a task. Two saga executions can target the same payment. That is an at-least-once call to a card network, which means a double charge. How do you fence it?

STAFF: Belt and suspenders. The belt is a Redis NX lock on ElastiCache Serverless, taken right before the gateway call:

    SET lock:{payment_id} {execution_id} NX EX 5

Only the holder calls the gateway. The holder releases on success; on crash the 5 s TTL expires, so there is no orphan lock blocking recovery.

But the lock acquire can fail two very different ways, and the original design collapsed them into one 503, which was a reliability bug. We split them:

Lock held by another saga for this payment: back off with 409. Another execution is actively processing this payment; the client retries shortly. That is the correct response.

Lock infrastructure error: Redis is unreachable, mid-failover, or timing out. Here we do not fail closed. We bypass Redis entirely, call the gateway, and rely solely on the gateway idempotency key for de-dup. We emit a redis_bypass metric so the bypass is visible.

The reason is brutal arithmetic: if we fail closed on infrastructure error, a single ElastiCache failover becomes a 100% payment outage for its duration.
Failing open on "can't reach the lock" converts a platform-wide outage into a metered pass-through that the gateway fence already contains: the gateway key still guarantees no double charge. We chose ElastiCache Serverless precisely to shrink that window: automatic multi-AZ failover handled by the service, no shard topology to manage, no cluster-failover gap to design around. That is the managed-first answer; the cluster topology discussion is gone.

The suspenders is the gateway's own idempotency key. We pass our internal payment ID as Stripe's Idempotency-Key header. So even if Redlock is wrong, and under a partition it can be, Stripe de-duplicates on its side: a second call with the same key returns the original charge, not a new one. Redis is load-shedding for the common case; Stripe's server-side key is the authoritative de-dup.

PRINCIPAL: Martin Kleppmann's critique: Redlock is not a correct distributed lock under GC pauses and partitions. You just admitted it can be wrong. So why have it at all?

STAFF: Exactly because I am not relying on it for correctness. Kleppmann is right that Redlock cannot be the sole guarantee of mutual exclusion for a money operation: a process can pause past its lease and act while another holds the lock. So I do not make Redis the source of truth. The source of truth is the gateway's idempotency key, which is stable per payment and enforced by the gateway itself. Redis prevents the common, cheap case, two near-simultaneous retries both hammering Stripe, without a network round-trip to the gateway to discover the duplicate. It is an optimization with a correctness backstop, not the correctness mechanism. A real fence has to be checked at the resource, the way a monotonic fencing token is, and server-side de-dup on the gateway idempotency key is that resource-side check for this call.

=== The ledger: Aurora, not DynamoDB ===

PRINCIPAL: You reached for DynamoDB twice already. Why is the ledger Aurora and not Dynamo?

STAFF: Because the ledger needs multi-row ACID in one transaction.
Committing a payment means: update the payments row to captured and insert one or more ledger_entries (double-entry: a debit and a credit) atomically. If the status update commits but the entries do not, the books do not balance. DynamoDB's TransactWriteItems exists but caps at 100 items and a 4 MB transaction and does not give me the relational guarantees, foreign keys, and ad-hoc JOIN-heavy reconciliation queries that an auditor will demand. A ledger is the canonical relational workload. Schema, with optimistic locking: payments(payment_id PK, merchant_id, amount, currency, status, version) ledger_entries(entry_id ULID, payment_id, merchant_id, amount, direction, created_at) -- conditional update guards against lost updates UPDATE payments SET status = 'captured', version = version + 1 WHERE payment_id = :id AND version = :expected_version AND status = :expected_status; The version column is an optimistic lock: if a concurrent saga moved the row, my WHERE matches zero rows and I know I lost the race and must re-read. No pessimistic row locks held across the gateway call, which would be a latency and deadlock disaster. It is Aurora PostgreSQL Global Database: WAL-based replication to read replicas for balance queries (read-heavy merchant dashboards never touch the writer), and a cross-region secondary for DR. Two recovery clocks, not one, and the distinction is load-bearing: in-region AZ failover is seconds, automatic, handled by RDS Proxy holding client connections and re-pointing them at the promoted writer. Cross-region promotion is minutes — a Route 53 ARC runbook, not an automatic flip — and that is fine, because the secondary region is DR, not HA. Our HA story is the in-region AZ failover; the cross-region promotion is the disaster story with a sub-minute RPO and a minutes-scale RTO that a payment DR scenario tolerates. We did not pretend cross-region is seconds. PRINCIPAL: Back to your own napkin math — 6,000 in-flight Lambdas, each wanting a connection. 
Aurora falls over. Now what? STAFF: RDS Proxy. Lambdas connect to the proxy, which multiplexes thousands of short Lambda connections onto a small pool of long-lived Aurora connections. It also handles failover transparently — when the writer fails over, the proxy holds client connections and re-points them, instead of 6,000 Lambdas all getting connection errors and stampeding the new writer. Without it, the connection storm at 10k TPS is the thing that actually takes us down — not the query load. The writer is Aurora Serverless v2, but with a deliberate floor: minimum 64 ACUs, not the ~0.5 default. Scale-down-to-near-zero is wrong for a payment ledger — a cold buffer pool means cold read latency on the first queries after a quiet spell, and fast scale-up needs a warm starting point. We pay to keep the buffer pool hot. We still scale up into the load, so February is cheaper than November; we just refuse to scale below a working floor. And the honest ceiling: at 3 Aurora writes per payment $\times 10^{4}$ TPS that is 30,000 writes/s on a single writer, which approaches the 256-ACU ceiling. Beyond ~15,000 TPS a single Aurora writer saturates, and the path forward is Aurora Limitless Database — write sharding native to Aurora, no application-level routing — or merchant_id-partitioned independent writers. We size for 10k TPS and name the upgrade path rather than build it now: Limitless is the correct horizontal-write answer above our design point, and adopting it today would be solving a problem we do not yet have. One cost footnote that belongs here: RDS Proxy enforces a minimum charge equivalent to ~8 ACUs even when the database would otherwise idle lower — it is not free pooling. With a 64-ACU floor that is noise, but at the low end it sets a price floor (see beat 10). === Failure and recovery at 3am === PRINCIPAL: It is 3am on Black Friday. The saga authorized the gateway in step 3, captured in step 4, and then the CommitLedger in step 5 throws. 
Walk me through what happens to the customer's money. STAFF: Step 5's Catch fires the compensation chain: VoidAuth (reverse the gateway capture, same idempotency key as the forward auth), ReleaseBalance (the guarded WHERE status = 'reserved' transition from beat 04), and because a captured-then-voided money movement is serious, it also lands a record in a DLQ for human review. The customer is made whole and an operator sees exactly which payment, which step, and which error. A correction on a claim I made earlier about "durable Step Functions state": Express does not persist execution history to the service — only Standard does. Express writes execution history to CloudWatch Logs at ALL level when enabled, and that is operational, not authoritative. The authoritative payment state is the DynamoDB idempotency slot (its status plus execution_arn), and the tamper-evident source of truth for regulators is the ULID-keyed EventBridge audit trail, not the Step Functions console. So recovery reads the slot, not the workflow history — the compensation is deterministic because the durable state lives in DynamoDB and Aurora, not because Express remembers. The dangerous variant is if VoidAuth itself fails — the gateway is unreachable during compensation. We do not loop forever in the hot path. The void is enqueued to a retry queue with exponential backoff, and the DLQ record stays open until the void confirms. Meanwhile reconciliation (next beat) will independently catch "captured at gateway, voided internally, but void not confirmed" within the hour. PRINCIPAL: The gateway is down entirely — every step 3 is timing out. You are now retrying into a black hole and burning your latency budget. STAFF: A circuit breaker on the gateway call. We track gateway error rate in a sliding window; past a threshold the breaker opens and step 3 fails fast with 503 instead of waiting out the timeout, so we shed load instead of queueing 6,000 hung executions. 
The breaker state cannot live per-Lambda — a per-instance counter is useless when 6,000 instances each see a few requests. It lives in DynamoDB with a 30 s TTL, shared across every instance, so the whole fleet opens and half-opens together. Half-open recovery sends a single probe with jitter. If we run multi-acquirer (Stripe primary, Adyen secondary), the open breaker is also the trigger to route new auths to the secondary — but failing over an in-flight payment is unsafe, so only new sagas reroute; in-flight ones compensate and let the client retry. PRINCIPAL: Back to the step 6 hole from beat 02. A slow original saga and a lease-reclaim retry both reach NotifyMerchant. The money is fenced, but the merchant gets two webhooks for one payment. How do you elect one emitter? STAFF: A second conditional put, the same primitive as the gate. Before step 6 emits anything, it does a DynamoDB conditional put on a notification-claim key: PutItem( Item = { pk: "notification_claimed#" + payment_id, ttl }, ConditionExpression = "attribute_not_exists(pk)" ) The saga that wins that slot emits the SQS/SNS fan-out. The loser's ConditionalCheckFailedException makes its step 6 a no-op — it returns success without emitting. Exactly one webhook per payment, even when two sagas race to step 6. The gateway key and the optimistic lock protect the money (steps 3–5); this claim protects the side-effect (step 6) with the identical pattern. === Reconciliation — the mandatory detective control === PRINCIPAL: You have idempotency, sagas, locks. Why do you still need reconciliation? Isn't it admitting your design leaks? STAFF: It is admitting that any design that spans a network boundary you do not control can diverge, and that in payments you must prove you can detect it. 
A dropped async event, a partition between "gateway captured" and "ledger committed," a Monzo-style card reserve the merchant never captures and the bank silently releases at T+7d: these are not bugs in our code, they are properties of distributed money. Reconciliation is the detective control that finds divergence before settlement actually moves cash at T+1.

The naive shape, an hourly Lambda that pulls the gateway's reconciliation API and paginates the last hour, does not survive the arithmetic. At the 10k TPS peak that is $10^{4} \times 3{,}600 = 3.6\times10^{7}$ records/hour. Stripe's reconciliation API paginates ~100 records/request at ~100 req/s, so a full hour is 360,000 API calls, which at 100 req/s is 3,600 seconds of paging: far past a 15-minute Lambda timeout and straight into rate-limit hell.

So we flip the model from pull to push. Both Stripe and Adyen emit real-time settlement webhook events. We subscribe to that feed and land it in S3 via a second Firehose stream, exactly alongside the ledger_entries snapshots that the first Firehose lands. The reconciliation Lambda becomes an event consumer, not a poller: no timeout risk, no rate-limit exposure, no pagination. The Athena join over S3 Parquet is unchanged; only the gateway-side ingest changed from API polling to webhook events. Any row where gateway and ledger disagree raises an SNS alert to the ops dashboard.

The hourly EventBridge Scheduler rule still fires, but only for the T+7/14/30d heartbeat checks on pending reservations. That volume is tiny (stale pending-only, not every transaction), so it never approaches the pagination wall.

PRINCIPAL: Hourly batch. Why not streaming reconciliation, sub-second?

STAFF: Because the deadline that matters is T+1 settlement, not real time. Streaming reconciliation would cost far more (a Flink job running 24/7 instead of an hourly Lambda) to buy latency we do not need: a mismatch found within the hour has ~24 hours of slack before money settles.
We spend reconciliation budget on coverage, not latency. The one thing that is time-sensitive is the card-reserve release: a dropped release event can drift the ledger past the bank's auto-release. So a heartbeat schedule re-checks pending reservations at T+7d, T+14d, T+30d against the gateway, catching exactly the Monzo failure mode.

=== Security and multi-tenancy ===

PRINCIPAL: PCI DSS Level 1. Where do raw card numbers live in your Lambdas?

STAFF: Nowhere. That is the whole point. The card PAN never touches our compute. We use Stripe Elements (or Adyen's equivalent) client-side: the browser sends the PAN directly to the gateway, which returns a token; our Lambda only ever sees the token. This collapses our PCI scope dramatically, because we cannot leak what we never hold, and it satisfies GDPR Art. 5 data minimisation by design. The only card-adjacent data we store is the last four and a network token, both non-sensitive.

The rest of the controls:

Encryption at rest: KMS CMKs for Aurora, DynamoDB, S3; per-tenant key context where the data model allows it. Maps to SOC 2 CC6.1.

Tenant isolation in Aurora: PostgreSQL Row-Level Security, not an IAM claim. A shared Lambda has no per-merchant IAM role, so "IAM session policies isolate tenants" was wrong. The real mechanism is RLS:

    ALTER TABLE payments ENABLE ROW LEVEL SECURITY;
    CREATE POLICY tenant_iso ON payments
      USING (merchant_id = current_setting('app.tenant'));

The handler sets app.tenant from the verified JWT claim before any query runs. The point of RLS is that it survives an application SQL bug: even a query that forgets its WHERE merchant_id still returns only the authenticated tenant's rows, because the database enforces it below the application. SOC 2 CC6.6.

Tenant isolation in DynamoDB: the hashed partition key is the boundary. The idempotency partition key is sha256(merchant_id + key) where merchant_id is derived from the authenticated JWT principal, never a client-supplied field.
A cross-tenant collision would require a SHA-256 collision, which is cryptographically infeasible. That hashing is the isolation mechanism here, not IAM.

Secrets: gateway API keys in Secrets Manager with automatic rotation, never in env vars or code.

Network and edge: Lambdas in private subnets, no public IPs; gateway egress via NAT; WAF on API Gateway. OWASP Managed Rules cover injection and XSS, but they are not enough for a payment auth surface: we add WAF Bot Control and ATP (Account Takeover Prevention) managed rule groups against credential stuffing, plus rate-based rules scoped per API key. That last rule targets idempotency-key squatting, where an attacker pre-seeds sha256(victim_merchant + guessed_key) slots to wedge a victim's payments into perpetual PROCESSING. The attack only works if the merchant_id in the pk comes from a client field; deriving it from the JWT principal closes the hole, and the rate limit catches the probing. SOC 2 CC7.2.

Audit: every state transition is a ULID-keyed audit event on EventBridge, immutably archived to S3 Object Lock for the retention window. ISO 27001 A.12.4.

GDPR erasure vs WORM: the immutable Object Lock trail collides head-on with GDPR Art. 17 right to erasure, because you cannot delete from a locked object. Resolution is crypto-shredding. Any customer PII that must appear in an event payload (name, email, billing address) is encrypted with a per-data-subject KMS data key; deleting that key renders the ciphertext in the locked object permanently unrecoverable. Erasure is satisfied and WORM integrity stays intact. The financial-record fields (amounts, merchant IDs, payment references) fall under the Art. 17(3)(b) legal-basis retention exception and stay.

If we ever ran in true acquiring mode (PIN translation, talking to the card network ourselves), AWS Payment Cryptography handles the HSM-backed PIN block translation as a managed FIPS 140-2 Level 3 service, so even then we do not self-host an HSM.
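[SHOW NOTES] The tenant-scoped idempotency partition key from the DynamoDB isolation control above can be sketched as follows. One assumption goes beyond what the script states: the two fields are length-prefixed before hashing. Plain string concatenation would let ("ab", "c") and ("a", "bc") hash to the same key, so the sketch frames each field explicitly. The function name `idempotency_pk` is hypothetical.

```python
import hashlib


def idempotency_pk(merchant_id: str, idempotency_key: str) -> str:
    """Derive the partition key for the idempotency slot.

    merchant_id is assumed to come from the authenticated JWT principal,
    never from a client-supplied field.
    """
    m = merchant_id.encode("utf-8")
    k = idempotency_key.encode("utf-8")
    # A 4-byte big-endian length prefix per field makes the boundary between
    # the two fields unambiguous, so no two distinct pairs share an input.
    framed = len(m).to_bytes(4, "big") + m + len(k).to_bytes(4, "big") + k
    return hashlib.sha256(framed).hexdigest()


def naive_pk(merchant_id: str, idempotency_key: str) -> str:
    """Unframed concatenation, shown only to illustrate the ambiguity."""
    return hashlib.sha256((merchant_id + idempotency_key).encode("utf-8")).hexdigest()
```

With framing, two merchants reusing the same client-side key string still land in different slots, and no choice of key string lets one merchant's pair alias another's.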
=== Cost ===

PRINCIPAL: Put a number on it. What does the saga layer cost at peak, and is there a cheaper shape?

STAFF: So the cost stack, corrected and ordered by size:

Lambda: \$55k–\$80k/month, the dominant driver (math above). This is the line that decides the compute architecture.

DynamoDB idempotency: ~\$3,400/month provisioned with auto-scaling at the 1k TPS baseline, versus ~\$17,500 on-demand. We run provisioned; predictable traffic makes on-demand a 5x tax (beat 02).

Aurora Serverless v2 scaling into the load above a 64-ACU floor, plus the RDS Proxy ~8-ACU minimum charge that sets a price floor even at idle.

API Gateway: HTTP API, not REST. HTTP API is \$1.00/million requests against REST's \$3.50/million; at 1k TPS that switch alone saves ~\$6,500/month. We do not use REST-only features here, so there is no reason to pay REST pricing.

NAT Gateway data processing: ~\$1,000–\$1,500/month if left unmanaged. We add gateway VPC endpoints for DynamoDB and S3 (free, and they take those high-volume paths off NAT entirely) and interface endpoints for Step Functions, Secrets Manager, and CloudWatch Logs (~\$0.01/hr/AZ x 3 services x 2 AZs ≈ \$44/month at ~730 hours), which pay for themselves in NAT savings within days.

ElastiCache Serverless for the lock, Step Functions Express (low hundreds for the peak window).

The big lever, named honestly: above ~3,000 TPS sustained all month, Lambda's per-invocation model is the wrong economics. Three ECS Fargate tasks (4 vCPU each) running the idempotency gate plus saga orchestration in-process cost ~\$350/month against \$80k+ of Lambda. The DynamoDB, Aurora, and Redis layers stay identical; only the compute tier swaps. We do not redesign to Fargate now, because our load is a Black-Friday spike on a monthly profile, not a perpetual high baseline, and Lambda's scale-to-zero off-peak is the right match for bursty traffic. But we name the crossover so the lever is on the table the day the baseline stops dropping.
For the fan-out tail, the managed plumbing is now EventBridge API Destinations (beat 08) rather than a Lambda hop: fewer invocations, less code to own.

=== Did we ever leave AWS? ===

PRINCIPAL: You said Stripe. That is not AWS. Justify the one departure, and tell me why the rest stayed.

STAFF: We left AWS for exactly one component: the payment gateway and its connection to the card networks. AWS does not offer an acquiring bank, a Visa/Mastercard network connection, or a card-present rail. That is a hard requirement with no AWS managed equivalent: rung 4 of the ladder, the rare exception that earns its place. Stripe/Adyen is the acquirer; we integrate over HTTPS and treat their idempotency key as our authoritative de-dup.

Everything else stayed in AWS, and I'd defend each non-departure against the usual "why not self-managed" challenge:

Why not self-managed Kafka for the event backbone? SQS + SNS + EventBridge cover the messaging shape (queue, fan-out, routing) with zero brokers to operate; we never hit a Kafka-only requirement like log compaction or replay-from-offset that would justify MSK here.

Why not Spanner / CockroachDB for the ledger? Aurora PostgreSQL Global Database gives us the relational ACID, cross-region replication, and read scaling we need; we did not need global synchronous multi-writer consistency, which is the one thing that would push us off Aurora.

Why not Aurora DSQL? DSQL reached GA on 2025-05-27 and is the newer AWS-native distributed-SQL option, exactly the managed-first candidate to check. But it lacks foreign key support and fixes isolation at repeatable read (no serializable). Both are load-bearing for a double-entry ledger: we rely on referential integrity between payments and ledger_entries, and on serializable guarantees for multi-row balance invariants. Aurora PostgreSQL stays correct until DSQL closes those two gaps.

Why not self-managed Temporal/Cadence for the saga?
Step Functions Express is the managed orchestrator with sync execution; Temporal would be more flexible but is a stateful cluster we'd have to run and secure for PCI, which is not worth it for a linear sub-second saga.

Why not self-host the HSM? If we go acquiring, AWS Payment Cryptography is the managed FIPS 140-2 Level 3 HSM service, so we never rack our own.

One departure, forced by the absence of an AWS card-network rail. The rest is AWS-managed, end to end.
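[SHOW NOTES] The guarded compensation from the saga discussion, a state transition instead of a relative delta, can be sketched with an in-memory SQLite database standing in for Aurora PostgreSQL. The function name `release_balance` and the three-column table are assumptions made for this illustration; the WHERE clause mirrors the guard quoted in the script.

```python
import sqlite3


def release_balance(conn: sqlite3.Connection, payment_id: int, expected_version: int) -> bool:
    """Run the ReleaseBalance compensation as a guarded state transition.

    Returns True if this call performed the transition, False if it was a
    no-op (already released, or another writer bumped the version).
    """
    cur = conn.execute(
        "UPDATE payments SET status = 'released', version = version + 1 "
        "WHERE payment_id = ? AND status = 'reserved' AND version = ?",
        (payment_id, expected_version),
    )
    conn.commit()
    # rowcount is the number of rows the UPDATE actually changed.
    return cur.rowcount == 1


def demo() -> tuple[bool, bool, tuple[str, int]]:
    """Apply the compensation twice, as an at-least-once executor might."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments ("
        "payment_id INTEGER PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO payments VALUES (1, 'reserved', 3)")
    first = release_balance(conn, 1, 3)    # performs the transition
    replay = release_balance(conn, 1, 3)   # matches zero rows
    row = conn.execute(
        "SELECT status, version FROM payments WHERE payment_id = 1"
    ).fetchone()
    conn.close()
    return first, replay, row
```

The replay changes nothing because the row no longer satisfies the guard, which is the property that makes at-least-once execution safe for this step.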