payments-exactly-once
Idempotency key with cached response replay
- When
- A mutating operation has an external side effect (a charge) that must run exactly once even when the response is lost to a timeout and the client retries, and concurrent retries can race in. You need an explicit identity for the operation - not a guess from request fields - and a way to return the original answer on every retry.
- AWS
- The client generates a UUID v4 once per logical operation and sends it as an Idempotency-Key header, reusing it across retries. The server persists ((tenant_id, idempotency_uuid), request_hash over a canonical form, status, response, expires_at, lease_expires_at) in Aurora PostgreSQL via INSERT ... ON CONFLICT DO NOTHING. The first caller executes and stores the outcome; later callers replay the cached response once the status is terminal, receive 409 while it is still PROCESSING, and receive 422 when the same key arrives with a different request hash. Reads filter on expires_at in application code so best-effort TTL deletion is never a correctness boundary; expiry is jittered to avoid a stampede.
- Trade-off
- Every mutating request now pays a key-store round trip before doing work, and clients must persist and correctly reuse the key across retries (a new key means a new charge). Binding the key to a canonicalised request hash means a legitimate retry with any payload drift is rejected with 422 rather than silently replayed. The PROCESSING lease must be tuned: too short and a slow saga gets reclaimed mid-flight, too long and a crashed saga stays stuck until the sweeper finds it.
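The key-store decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the production schema: an in-memory SQLite database stands in for Aurora PostgreSQL (both accept INSERT ... ON CONFLICT DO NOTHING), the PROCESSING lease (lease_expires_at) is left out, and the names begin, complete, open_store, canonical_hash, the TTL constants, and the decision labels are all hypothetical.

```python
import hashlib
import json
import random
import sqlite3

TTL_SECONDS = 24 * 3600      # assumed retention window
TTL_JITTER_SECONDS = 600     # assumed jitter so keys do not all expire together


def canonical_hash(payload: dict) -> str:
    """Hash a canonical JSON form so key order and whitespace do not matter."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def open_store() -> sqlite3.Connection:
    """Create the key table in an in-memory SQLite database (Aurora stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE idempotency_keys (
            tenant_id        TEXT NOT NULL,
            idempotency_uuid TEXT NOT NULL,
            request_hash     TEXT NOT NULL,
            status           TEXT NOT NULL,
            response         TEXT,
            expires_at       REAL NOT NULL,
            PRIMARY KEY (tenant_id, idempotency_uuid)
        )
        """
    )
    return conn


def begin(conn, tenant_id, key, payload, now):
    """Claim the key, or report what the caller should return instead."""
    ident = (tenant_id, key)
    request_hash = canonical_hash(payload)
    # Expiry is decided here by comparing expires_at, not by background deletion.
    conn.execute(
        "DELETE FROM idempotency_keys "
        "WHERE tenant_id = ? AND idempotency_uuid = ? AND expires_at <= ?",
        ident + (now,),
    )
    expires_at = now + TTL_SECONDS + random.uniform(0, TTL_JITTER_SECONDS)
    cur = conn.execute(
        "INSERT INTO idempotency_keys "
        "(tenant_id, idempotency_uuid, request_hash, status, response, expires_at) "
        "VALUES (?, ?, ?, 'PROCESSING', NULL, ?) "
        "ON CONFLICT (tenant_id, idempotency_uuid) DO NOTHING",
        ident + (request_hash, expires_at),
    )
    if cur.rowcount == 1:
        return ("EXECUTE", None)        # first caller: run the operation
    stored_hash, status, response = conn.execute(
        "SELECT request_hash, status, response FROM idempotency_keys "
        "WHERE tenant_id = ? AND idempotency_uuid = ?",
        ident,
    ).fetchone()
    if stored_hash != request_hash:
        return ("REJECT_422", None)     # same key, different request
    if status == "PROCESSING":
        return ("REJECT_409", None)     # first execution still in flight
    return ("REPLAY", response)         # terminal: return the original answer


def complete(conn, tenant_id, key, response):
    """Store the terminal outcome so later retries can replay it."""
    conn.execute(
        "UPDATE idempotency_keys SET status = 'SUCCEEDED', response = ? "
        "WHERE tenant_id = ? AND idempotency_uuid = ?",
        (response, tenant_id, key),
    )
```

In this sketch a retry whose JSON keys arrive in a different order still hashes to the same value and replays, while any change to a field value produces the 422 branch.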
messaging
In-process saga with DynamoDB-checkpointed compensation
- When
- A business operation spans multiple steps where a middle step has an irreversible external side effect (money moves), and a crash after that side effect but before the downstream write would otherwise be misread as a failure and retried - charging twice: the classic 86K-dollar duplicate-payout bug. Throughput is high enough (tens of thousands/s) that a managed orchestrator's per-start quota and per-execution price turn against you.
- AWS
- The saga runs in-process inside an ECS Fargate task and persists each step's state to DynamoDB via a conditional write before proceeding - explicit per-step checkpointing, so a re-drive resumes from the last persisted step rather than replaying the side effect. A conditional PutItem gates saga launch so concurrent duplicates cannot both start. The idempotency key is threaded into the external call (the PSP's own idempotency key) so even an in-step retry deduplicates at the side-effect boundary. Full-jitter backoff and a circuit breaker protect a recovering dependency, and a charge-status query precedes any re-drive that might fall outside the PSP's dedup window. A failed post-side-effect step routes to a compensating transaction (issue a void, mark the ledger VOIDED).
- Trade-off
- You own the saga loop instead of leaning on a managed state machine, so step orchestration, retries, and the reconciliation sweeper for orphaned leases are your code. The win is no per-start quota ceiling, near-zero orchestration cost, and checkpointing stronger than an at-least-once managed Express workflow. At low throughput, or for long-running workflows that need a managed execution history (Standard executions can run for up to a year), a managed Step Functions Standard workflow is the better trade.
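The checkpoint-then-proceed loop above can be sketched in a few lines. This is an illustration only: an in-memory dict stands in for the DynamoDB table, ConditionFailed mimics a failed condition expression, FakePsp is a hypothetical PSP that deduplicates on the idempotency key, and the step names STARTED and CHARGED are assumptions. Backoff, the circuit breaker, leases, and the charge-status query are not shown.

```python
class ConditionFailed(Exception):
    """Stand-in for a failed DynamoDB condition expression."""


class CheckpointStore:
    """In-memory stand-in for the checkpoint table (saga_id -> last step)."""

    def __init__(self):
        self.steps = {}

    def put_if_absent(self, saga_id, step):
        # Mirrors a PutItem guarded by attribute_not_exists on the key.
        if saga_id in self.steps:
            raise ConditionFailed(f"{saga_id} already launched")
        self.steps[saga_id] = step

    def advance(self, saga_id, expected, new):
        # Mirrors a conditional write that only succeeds from the expected step.
        if self.steps.get(saga_id) != expected:
            raise ConditionFailed(f"{saga_id} is not at {expected}")
        self.steps[saga_id] = new


class FakePsp:
    """Hypothetical PSP that deduplicates charges on the idempotency key."""

    def __init__(self):
        self.charges = {}
        self.charge_calls = 0

    def charge(self, idempotency_key, amount):
        self.charge_calls += 1
        self.charges.setdefault(idempotency_key, "CAPTURED")

    def void(self, idempotency_key):
        self.charges[idempotency_key] = "VOIDED"


def launch(store, saga_id):
    """Gate saga launch; a concurrent duplicate raises ConditionFailed."""
    store.put_if_absent(saga_id, "STARTED")


def drive(store, psp, saga_id, amount, write_ledger):
    """Run, or re-drive, the saga from its last persisted step."""
    if store.steps[saga_id] == "STARTED":
        psp.charge(saga_id, amount)      # saga_id doubles as the PSP key here
        store.advance(saga_id, "STARTED", "CHARGED")
    if store.steps[saga_id] == "CHARGED":
        try:
            write_ledger(saga_id, amount)
        except Exception:
            # Money already moved, so compensate instead of retrying the charge.
            psp.void(saga_id)
            store.advance(saga_id, "CHARGED", "VOIDED")
        else:
            store.advance(saga_id, "CHARGED", "SETTLED")
    return store.steps[saga_id]
```

A re-drive that finds the checkpoint at CHARGED skips the charge step entirely, which is the property that prevents the duplicate payout; a crash between the charge and its checkpoint is covered only by the PSP-side deduplication on the key.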
coordination
Row-level lease for concurrent duplicate serialisation
- When
- Two identical requests carrying the same operation identity arrive milliseconds apart on different stateless instances, both miss any cache, and a plain check-then-act would let both proceed (a TOCTOU race) and execute the side effect twice - the double-click on Pay.
- AWS
- The first transaction wins the key row via a conditional insert and sets status PROCESSING. The second transaction's insert conflicts, so it takes a SELECT ... FOR UPDATE row-level lock on that key in Aurora PostgreSQL, blocks until the first transaction commits, then reads the now-terminal status and replays the cached response. The database row lock is the coordinator - one request gets the lease to proceed, the other waits and replays - with no application-level locking service.
- Trade-off
- The lock makes the key row a serialisation point, so a pathologically hot key serialises its duplicates and a long-running first execution makes its duplicates wait. You depend on the relational primitive (row lock plus conditional insert in one transaction), which is why this lives in Aurora and not in a pure key-value cache.
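The wait-and-replay behaviour can be simulated in-process to make the ordering concrete. This sketch is an analogy, not database code: threads stand in for stateless instances, a per-row threading.Lock stands in for the PostgreSQL row lock, and KeyTable, insert_if_absent, select_for_update, commit, and handle are hypothetical names. Failure of the first execution is not modelled.

```python
import threading


class KeyTable:
    """In-memory stand-in for the key table; each row carries its own lock."""

    def __init__(self):
        self._guard = threading.Lock()   # makes the conditional insert atomic
        self._rows = {}

    def insert_if_absent(self, key):
        """Like a conditional insert: the winner gets the row and holds its lock."""
        with self._guard:
            if key in self._rows:
                return None              # conflict: someone else owns the key
            row = {"status": "PROCESSING", "response": None,
                   "lock": threading.Lock()}
            row["lock"].acquire()        # held until the winner commits
            self._rows[key] = row
            return row

    def select_for_update(self, key):
        """Like SELECT ... FOR UPDATE: block until the current holder commits."""
        row = self._rows[key]
        row["lock"].acquire()
        return row


def commit(row):
    """Release the row lock, as a transaction commit would."""
    row["lock"].release()


def handle(table, key, side_effect):
    """Execute the side effect once per key; duplicates wait and replay."""
    row = table.insert_if_absent(key)
    if row is not None:
        try:
            row["response"] = side_effect()
            row["status"] = "SUCCEEDED"
        finally:
            commit(row)
        return row["response"]
    row = table.select_for_update(key)   # blocks while the winner is working
    try:
        return row["response"]           # status is terminal by now
    finally:
        commit(row)
```

Calling handle from several threads with the same key runs side_effect once and returns the same response to every caller, which is the double-click case from the description.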
caching
Write-through dedup cache fronting a transactional store
- When
- A correctness-critical lookup (have I seen this idempotency key, and what was the answer) must run on every request at a rate the transactional source of truth cannot serve without becoming the bottleneck, but the lock that decides genuine first-execution must stay in the transactional store.
- AWS
- DynamoDB (no DAX) sits in front of sharded Aurora as a read-through, write-through cache and doubles as the in-process saga's checkpoint store. A cache hit on a terminal status replays in single-digit milliseconds without touching Aurora or running the saga; a miss falls through to the Aurora lock, runs the saga on first execution, then writes the result back through DynamoDB. Reads filter on expires_at in application code rather than trusting best-effort TTL deletion, and Global Tables replicate the store cross-region. DAX is omitted because it does not join Global Tables, does not help the first-execution path, and is redundant once Fargate holds a warm connection pool.
- Trade-off
- Key state now lives in two stores with a narrow lag window, so you must enforce that only a request which took the Aurora lock and completed may write a terminal cache entry - a cache miss is always safe (fall through to the authority), only a hit short-circuits. The write-back is non-transactional, so a rising cache-miss rate must be alarmed: correctness survives a failed write-back, but the cost and latency benefit of the cache silently degrades. You give up single-store simplicity for a hot-path read budget Aurora alone cannot meet.
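The hit and miss rules above can be sketched as a small read path. The assumptions are loud ones: a plain dict stands in for the DynamoDB table, run_via_authority is a hypothetical callable representing the Aurora lock plus saga and returning (status, response), and the terminal status names, the metrics dict, and the ttl default are illustrative.

```python
TERMINAL = {"SETTLED", "VOIDED"}   # assumed terminal statuses


def cache_hit(cache, key, now):
    """Return the cached item only if it is unexpired and terminal."""
    item = cache.get(key)
    if item is None:
        return None
    if item["expires_at"] <= now:
        return None        # expired but not yet TTL-deleted: treat as a miss
    if item["status"] not in TERMINAL:
        return None        # a non-terminal entry never short-circuits
    return item


def handle(cache, key, now, run_via_authority, metrics, ttl=3600):
    """Replay on a terminal hit; otherwise fall through to the authority."""
    item = cache_hit(cache, key, now)
    if item is not None:
        return item["response"]
    metrics["miss"] += 1                        # feed the miss-rate alarm
    status, response = run_via_authority(key)   # lock and saga live here
    if status in TERMINAL:
        try:
            cache[key] = {"status": status, "response": response,
                          "expires_at": now + ttl}
        except Exception:
            # Non-transactional write-back: the answer is still correct,
            # but the next retry pays for the authority again.
            metrics["writeback_failed"] += 1
    return response
```

The write-back sits inside the branch that has already gone through the authority, so in this sketch nothing else can create a terminal cache entry.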
payments-exactly-once
Append-only ledger with streamed tamper-evident audit
- When
- A financial system must record every state transition (created, charged, settled, voided) as an immutable system of record, prove to an auditor that no record was altered, and fan the terminal outcome out to downstream consumers without coupling them to the charge path.
- AWS
- Each saga transition is appended to a DynamoDB ledger that is the system of record for SETTLED and VOIDED states. DynamoDB Streams feeds those transitions to S3 with Object Lock (WORM) in a dedicated logging account, so records cannot be altered or deleted within the retention window even by an admin - supporting PCI DSS Requirement 10, SOC 2 CC7, and NIST SP 800-53 AU-9. Terminal outcomes also emit payment.completed / payment.failed to EventBridge for decoupled downstream fan-out.
- Trade-off
- Append-only means the ledger only grows; you pay storage and need a retention/archival strategy, and a correction is a new compensating entry rather than an update, so reads must fold the event history to get current state. The WORM immutability that satisfies auditors also means a genuinely wrong record cannot be deleted within retention - only annotated.
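Folding the event history into a current state, as the trade-off describes, might look like the sketch below. It is an in-memory illustration: the Ledger and Entry classes, the seq field, and the ALLOWED transition table are assumptions layered on the four states named above, and neither DynamoDB nor the S3 Object Lock export is modelled.

```python
from dataclasses import dataclass

# Assumed transition table over the states named in the description.
ALLOWED = {
    None: {"CREATED"},
    "CREATED": {"CHARGED"},
    "CHARGED": {"SETTLED", "VOIDED"},
    "SETTLED": set(),
    "VOIDED": set(),
}


@dataclass(frozen=True)
class Entry:
    """One immutable ledger record; frozen so it cannot be edited in place."""
    payment_id: str
    seq: int
    state: str
    note: str = ""


class Ledger:
    """Append-only list of entries; there is no update or delete method."""

    def __init__(self):
        self._entries = []

    def history(self, payment_id):
        """All entries for one payment, in append order."""
        return [e for e in self._entries if e.payment_id == payment_id]

    def current_state(self, payment_id):
        """Fold the history: the last transition determines the state."""
        state = None
        for entry in self.history(payment_id):
            state = entry.state
        return state

    def append(self, payment_id, state, note=""):
        """Record a transition as a new entry, rejecting invalid ones."""
        past = self.history(payment_id)
        current = past[-1].state if past else None
        if state not in ALLOWED[current]:
            raise ValueError(f"{current} -> {state} is not allowed")
        entry = Entry(payment_id, len(past) + 1, state, note)
        self._entries.append(entry)
        return entry
```

A void is recorded as a third entry after CREATED and CHARGED rather than as an edit to either, so the full history stays available for audit while current_state returns VOIDED.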