payments-exactly-once
Orchestrate the saga; compensate, never two-phase commit
- When
- A money movement spans three internal services plus an external gateway, and any step can fail independently, yet a classic 2PC coordinator that dies between prepare (phase 1) and commit (phase 2) (Robinhood's 72-hour ghost transfer) leaves participants blocked and funds in limbo with no recovery path.
- AWS
- Model the flow as a Step Functions Express Workflow: each forward step is an idempotent Lambda task, and each has an explicit compensating task (ReleaseBalance, VoidAuth) wired through a Catch into a reverse path. Every compensation is a guarded state transition (UPDATE ... WHERE status = :expected_status AND version = :v), not a relative delta, so an at-least-once replay matches zero rows and is a safe no-op; VoidAuth reuses the forward call's gateway idempotency-key. Sync execution returns the result in under the 800 ms checkout budget; the 5-minute Express ceiling is ample for synchronous auth.
- Trade-off
- You give up the illusion of a single atomic transaction and accept windows where the system is in a known intermediate state (reserved-but-not-captured) that compensation must unwind. Every step AND every compensation must be idempotent (guarded transitions, never relative deltas), and you must reason about compensations that themselves fail, which a DLQ for human review backstops. Express Workflows do not record execution history in Step Functions itself (it is available only through CloudWatch Logs), so the authoritative state is the DynamoDB idempotency slot, not the workflow console.
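The guarded state transition can be sketched as follows. This is a minimal illustration, not the production compensation: it uses an in-memory SQLite table in place of the real ledger store, and the table name, column names, and the release_balance helper are assumptions made for the example. It shows why an at-least-once replay is harmless: the second call matches zero rows.

```python
import sqlite3


def release_balance(conn, payment_id, expected_status, expected_version):
    """Compensate a reservation with a guarded transition.

    Returns True only if this call performed the transition. A replay sees a
    different status/version, matches zero rows, and is a no-op.
    """
    cur = conn.execute(
        "UPDATE payments "
        "SET status = 'RELEASED', version = version + 1 "
        "WHERE id = ? AND status = ? AND version = ?",
        (payment_id, expected_status, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical schema and row, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT, version INTEGER)"
)
conn.execute("INSERT INTO payments VALUES ('pay_1', 'RESERVED', 1)")
conn.commit()

first = release_balance(conn, "pay_1", "RESERVED", 1)   # performs the transition
replay = release_balance(conn, "pay_1", "RESERVED", 1)  # duplicate delivery
final_row = conn.execute(
    "SELECT status, version FROM payments WHERE id = 'pay_1'"
).fetchone()
```

A relative delta (balance = balance + :amount) would have applied twice here; the guard on status and version is what makes the replay safe.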
payments-exactly-once
Gate every payment with a DynamoDB conditional put
- When
- A client retry races the original request to completion and both reach the gateway — the Stripe-2013 double-charge — so the very first thing on the path must collapse all copies of one logical request onto one outcome.
- AWS
- On entry, PutItem to a DynamoDB idempotency table with ConditionExpression attribute_not_exists(pk), where pk = sha256(merchant_id + idempotency_key). The first writer wins the slot (status PROCESSING, 24 h TTL). A ConditionalCheckFailedException means the request is already in flight or done, so return the stored status or cached response and never re-execute.
- Trade-off
- You add ~5 ms and one strongly-consistent write to every payment, and you inherit a stuck-PROCESSING edge case (caller crashed mid-saga) that needs a lease timeout to reclaim — in exchange for a hard exactly-once boundary that volatile caches like Redis cannot guarantee.
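A sketch of the slot claim, under stated assumptions. The claim_slot and slot_key helpers, the table name, and the expires_at attribute are hypothetical. The put_item keyword arguments and client.exceptions.ConditionalCheckFailedException follow the boto3 DynamoDB client, but the example runs against FakeDynamo, an in-memory stand-in that honours only attribute_not_exists(pk), so it needs no AWS account. The separator inside the hash is an addition to the plain concatenation described above; it keeps two different (merchant, key) pairs from concatenating to the same string.

```python
import hashlib
import time
from typing import Optional


def slot_key(merchant_id: str, idempotency_key: str) -> str:
    # The unit separator prevents ("ab", "c") and ("a", "bc") from colliding.
    raw = f"{merchant_id}\x1f{idempotency_key}".encode()
    return hashlib.sha256(raw).hexdigest()


def claim_slot(client, table: str, merchant_id: str, idempotency_key: str,
               now: Optional[int] = None) -> bool:
    """Return True if this caller won the slot, False if a copy already holds it."""
    now = int(time.time()) if now is None else now
    try:
        client.put_item(
            TableName=table,
            Item={
                "pk": {"S": slot_key(merchant_id, idempotency_key)},
                "status": {"S": "PROCESSING"},
                # Epoch seconds 24 h ahead, for a table TTL attribute.
                "expires_at": {"N": str(now + 24 * 3600)},
            },
            ConditionExpression="attribute_not_exists(pk)",
        )
        return True
    except client.exceptions.ConditionalCheckFailedException:
        # Already in flight or done: the caller returns the stored outcome.
        return False


class FakeDynamo:
    """In-memory stand-in for demonstration; not a DynamoDB emulator."""

    class exceptions:
        class ConditionalCheckFailedException(Exception):
            pass

    def __init__(self):
        self.items = {}

    def put_item(self, TableName, Item, ConditionExpression):
        key = (TableName, Item["pk"]["S"])
        if key in self.items:
            raise self.exceptions.ConditionalCheckFailedException()
        self.items[key] = Item


db = FakeDynamo()
original = claim_slot(db, "idempotency", "m_42", "key-abc", now=1_700_000_000)
retry = claim_slot(db, "idempotency", "m_42", "key-abc", now=1_700_000_001)
```

With a real client, boto3.client("dynamodb") can be passed in place of FakeDynamo; reclaiming a stuck PROCESSING slot after a lease timeout is out of scope for this sketch.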
id-generation
k-sortable IDs with a worker slot claimed at cold-start
- When
- You need a high-throughput, collision-resistant payment ID that also range-scans well on the ledger. UUIDv4 is random, so it fragments B-tree inserts and turns time-range queries into full scans. Lambda, however, has no stable worker identity, so two cold-starts generating IDs in the same millisecond would collide under a naive Snowflake scheme.
- AWS
- Generate a 63-bit Snowflake variant: 41-bit ms timestamp (custom epoch) + 14-bit worker ID + 8-bit sequence (16,384 workers, 256 IDs/ms/worker). Each Lambda instance claims a unique worker ID at cold-start via a DynamoDB conditional put on the slot, so no two live workers share a number. Use ULIDs for audit/event IDs (monotonic-in-ms, base32, URL-safe). Expose only an HMAC-derived ext_ref to clients, never the predictable numeric ID, to prevent BOLA enumeration.
- Trade-off
- You cap concurrent workers at 16,384 (14 bits, ample headroom over the ~6,000 concurrent Lambdas at peak, with an alarm at 80% slot utilisation) and depend on roughly synced clocks (NTP) plus a clock-rollback guard, in exchange for time-ordered IDs that need no coordinator after boot. UUIDv7 is the coordination-free alternative that deletes the worker-slot table entirely, but at 128 bits it doubles every B-tree index key versus the 63-bit integer. At billions of ledger rows that halved index size matters, so we keep Snowflake.
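The 41 + 14 + 8 bit layout and the clock-rollback guard can be sketched as below. This is an illustrative generator, not the deployed one: the SnowflakeGenerator class, the CUSTOM_EPOCH_MS value, and the injectable clock_ms parameter are assumptions, and the worker ID is taken as given (claiming it from the slot table at cold-start is not shown). The generator is not thread-safe, which is assumed to be acceptable because a Lambda instance handles one invocation at a time.

```python
import time

TIMESTAMP_BITS, WORKER_BITS, SEQUENCE_BITS = 41, 14, 8
MAX_WORKER = (1 << WORKER_BITS) - 1      # 16,383 -> 16,384 worker slots
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1  # 255 -> 256 IDs per ms per worker
CUSTOM_EPOCH_MS = 1_700_000_000_000      # hypothetical custom epoch


class SnowflakeGenerator:
    def __init__(self, worker_id, clock_ms=lambda: time.time_ns() // 1_000_000):
        if not 0 <= worker_id <= MAX_WORKER:
            raise ValueError("worker_id does not fit in 14 bits")
        self.worker_id = worker_id
        self.clock_ms = clock_ms
        self.last_ms = -1
        self.sequence = 0

    def next_id(self):
        now = self.clock_ms()
        if now < self.last_ms:
            # Clock-rollback guard: refuse to mint IDs that could repeat.
            raise RuntimeError("clock moved backwards")
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & MAX_SEQUENCE
            if self.sequence == 0:
                # 256 IDs already issued this millisecond: wait for the next.
                while now <= self.last_ms:
                    now = self.clock_ms()
        else:
            self.sequence = 0
        self.last_ms = now
        elapsed = now - CUSTOM_EPOCH_MS
        if not 0 <= elapsed < (1 << TIMESTAMP_BITS):
            raise RuntimeError("timestamp does not fit in 41 bits")
        return (
            (elapsed << (WORKER_BITS + SEQUENCE_BITS))
            | (self.worker_id << SEQUENCE_BITS)
            | self.sequence
        )


# Deterministic demo: two IDs in the same millisecond, one in the next.
ticks = iter([CUSTOM_EPOCH_MS + 5, CUSTOM_EPOCH_MS + 5, CUSTOM_EPOCH_MS + 6])
gen = SnowflakeGenerator(worker_id=3, clock_ms=lambda: next(ticks))
a, b, c = gen.next_id(), gen.next_id(), gen.next_id()
```

IDs from one worker are strictly increasing, and IDs across workers sort by millisecond first, which is what keeps ledger range scans cheap.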
messaging
Fence the external gateway call with a Redis NX lock
- When
- Step Functions can retry a task, or two saga executions can target the same payment, and an at-least-once external call to a card network is a real double-charge — the gateway leg is the one step you cannot blindly retry.
- AWS
- Before calling the gateway, set a lock key for the payment_id with NX and a 5 s TTL (SET key token NX PX 5000) on ElastiCache Serverless (sub-ms latency, no shard topology to manage); only the lock holder calls Stripe/Adyen, passing the internal payment ID as the gateway's own idempotency-key. Release on success, and only if the stored token is still yours; let the TTL expire on crash. Split the failure modes: lock HELD by another saga returns 409 (back off); a lock INFRASTRUCTURE error bypasses Redis and calls the gateway anyway, emitting a redis_bypass metric.
- Trade-off
- A Redis lock is not a perfect distributed mutex under partition (even multi-node Redlock is not), so it is a first fence, not the last line of defence: you lean on the gateway's server-side idempotency key as the authoritative de-dup. Critically, fail OPEN on Redis infrastructure failure, not closed. Failing closed turns a single ElastiCache failover into a 100% payment outage, whereas bypassing the lock converts a platform-wide outage into a metered pass-through that the gateway fence contains.
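The three outcomes (lock acquired, lock held, lock infrastructure down) can be sketched as below. Everything named here is hypothetical: fenced_gateway_call, LockHeld, the lock key prefix, and the emit_metric callback. The set(..., nx=True, px=5000) and eval(script, numkeys, key, token) calls mirror the redis-py client, but the demo uses FakeRedis, an in-memory stand-in that ignores TTLs and understands only this one release script. The infra_errors default uses Python built-ins; with redis-py you would likely pass its own exception types, which is an assumption to verify against your client version.

```python
import uuid

# Compare-and-delete: only the holder's token may release the lock.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""


class LockHeld(Exception):
    """Another saga holds the lock; the caller maps this to a 409."""


def fenced_gateway_call(redis_client, payment_id, call_gateway, emit_metric,
                        infra_errors=(ConnectionError, TimeoutError)):
    key = f"lock:payment:{payment_id}"
    token = uuid.uuid4().hex
    try:
        acquired = redis_client.set(key, token, nx=True, px=5000)
    except infra_errors:
        # Fail OPEN: the gateway's idempotency key is the authoritative fence.
        emit_metric("redis_bypass")
        return call_gateway(idempotency_key=payment_id)
    if not acquired:
        raise LockHeld(payment_id)
    # If this raises, the lock is left to expire via its TTL.
    result = call_gateway(idempotency_key=payment_id)
    try:
        redis_client.eval(RELEASE_SCRIPT, 1, key, token)
    except infra_errors:
        pass  # Release is best-effort; the TTL cleans up.
    return result


class FakeRedis:
    """In-memory stand-in for the demo; ignores TTLs."""

    def __init__(self, down=False):
        self.store, self.down = {}, down

    def set(self, name, value, nx=False, px=None):
        if self.down:
            raise ConnectionError("redis unavailable")
        if nx and name in self.store:
            return None
        self.store[name] = value
        return True

    def eval(self, script, numkeys, *keys_and_args):
        key, token = keys_and_args
        if self.store.get(key) == token:
            del self.store[key]
            return 1
        return 0


calls, metrics = [], []


def gateway(idempotency_key):
    calls.append(idempotency_key)
    return "authorised"


healthy = FakeRedis()
ok = fenced_gateway_call(healthy, "pay_1", gateway, metrics.append)

healthy.store["lock:payment:pay_2"] = "someone-else"  # simulate a rival saga
try:
    fenced_gateway_call(healthy, "pay_2", gateway, metrics.append)
    contended = False
except LockHeld:
    contended = True

bypassed = fenced_gateway_call(FakeRedis(down=True), "pay_3", gateway, metrics.append)
```

Note that the contended payment never reaches the gateway, while the bypassed one does and is counted, which is the metered pass-through described above.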
payments-exactly-once
Reconcile against the gateway before settlement, on a heartbeat
- When
- Even with idempotency, sagas, and locks, a dropped async event or a partition can leave the internal ledger and the gateway's view divergent: a captured charge with no ledger row, or a Monzo-style card reserve the merchant never captures and the bank silently releases at T+7d.
- AWS
- Push, not pull: subscribe to the gateway's real-time settlement webhook feed and land it via a second Kinesis Firehose stream into S3 Parquet, alongside ledger_entries snapshots — pulling 36M records/hour through a paginated API would not fit a Lambda timeout. An Athena join flags divergence to SNS. An EventBridge Scheduler heartbeat re-checks only pending reservations at T+7d, T+14d, T+30d (small volume) so a dropped release event cannot drift the ledger past settlement.
- Trade-off
- Reconciliation is eventually consistent (hourly, not per-transaction) and is a detective control, not a preventive one: it catches and surfaces divergence rather than stopping it. That is acceptable because settlement happens at T+1, which leaves a window to resolve divergence before money actually moves.
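The divergence check and the heartbeat schedule can be illustrated in miniature. This is not the Athena query itself: find_divergence is a hypothetical in-memory equivalent of a full outer join between the gateway feed and ledger_entries, the charge IDs and amounts are made-up sample data, and heartbeat_times simply computes the T+7d, T+14d, T+30d re-check instants for one reservation.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_OFFSETS_DAYS = (7, 14, 30)


def find_divergence(gateway, ledger):
    """Compare two {charge_id: amount_minor_units} views.

    Returns (charge_id, reason) pairs, sorted by charge_id, for every record
    that is missing on one side or whose amounts disagree.
    """
    issues = []
    for charge_id in sorted(gateway.keys() | ledger.keys()):
        g, l = gateway.get(charge_id), ledger.get(charge_id)
        if l is None:
            issues.append((charge_id, "missing_in_ledger"))
        elif g is None:
            issues.append((charge_id, "missing_at_gateway"))
        elif g != l:
            issues.append((charge_id, "amount_mismatch"))
    return issues


def heartbeat_times(reserved_at):
    """Instants at which a still-pending reservation is re-checked."""
    return [reserved_at + timedelta(days=d) for d in HEARTBEAT_OFFSETS_DAYS]


# Made-up sample data for illustration.
gateway_view = {"ch_1": 1000, "ch_2": 2500, "ch_3": 700}
ledger_view = {"ch_1": 1000, "ch_2": 2400, "ch_4": 900}

issues = find_divergence(gateway_view, ledger_view)
checks = heartbeat_times(datetime(2024, 1, 1, tzinfo=timezone.utc))
```

Here ch_3 is the "captured charge with no ledger row" case; in the real pipeline each flagged pair would be published to SNS rather than returned in a list.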