Patterns from this design

Distributed rate limiting at API gateway scale

rate-limiting

Make the limiter decision atomic with one Lua script

When
A rate-limit decision reads a counter, branches on it, increments, and conditionally sets a TTL. Issued as separate commands, concurrent requests or a mid-sequence crash can leave a key with no expiry: the counter never resets, so the key leaks and a tenant on a fixed key name stays throttled indefinitely.
AWS
Run the whole read-branch-increment-expire as a single EVAL on ElastiCache for Redis, using redis.call('TIME') as the one authoritative clock.
Trade-off
All decision logic lives in a Lua script you must test and version, in exchange for true atomicity that MULTI/EXEC and WATCH can't give you without retry contention.
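A minimal sketch of what such a script could look like, held as a string in Python. The script body, the key name, and the `make_limiter` helper are illustrative assumptions, not the design's actual script. It assumes a redis-py style client (`register_script` returns a callable taking `keys` and `args`) and Redis 5 or later, where calling `TIME` before a write is permitted by default.

```python
# Lua body: Redis runs it atomically, so no other command can interleave
# between the INCR and the EXPIRE.
FIXED_WINDOW_LUA = """
local window = tonumber(ARGV[1])
local limit  = tonumber(ARGV[2])
local now    = tonumber(redis.call('TIME')[1])   -- server clock, seconds

local count = redis.call('INCR', KEYS[1])
local ttl   = redis.call('TTL', KEYS[1])
if count == 1 or ttl < 0 then
  -- first hit of the window, or a key that somehow lost its expiry
  redis.call('EXPIRE', KEYS[1], window)
  ttl = window
end

local allowed = 0
if count <= limit then allowed = 1 end
return {allowed, count, now + ttl}
"""


def make_limiter(client):
    """Wrap the script for a redis-py style client (assumed interface)."""
    script = client.register_script(FIXED_WINDOW_LUA)

    def check(key, window_s, limit):
        # Returns (allowed, current_count, reset_at_epoch_seconds).
        allowed, count, reset_at = script(keys=[key], args=[window_s, limit])
        return bool(allowed), int(count), int(reset_at)

    return check
```

The `ttl < 0` branch is a self-heal for keys written by older, non-atomic code paths; whether that is wanted is a judgment call.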
rate-limiting

Approximate the window with two counters, not a log

When
You need a smooth rolling limit without the 2x boundary burst of a fixed window, but a per-request timestamp log would OOM the store under attack (tens of GB at the limit).
AWS
Store two integer keys per tenant (current + previous window) on ElastiCache and compute a weighted estimate; hash-tag the keys so the pair lands on one cluster slot; guard EXPIRE to fire only on the first write of the window (cur == 1), not every request.
Trade-off
A bounded approximation error (~half the previous window's rate at the transition) in exchange for ~11,700x less memory than a sliding-window log. Not suitable for billing-grade exactness. Treat slot-migration TRYAGAIN/ASK/MOVED errors as fail-open rather than retrying.
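The weighted estimate can be sketched in a few lines. This is a pure-Python model of the arithmetic only, with no Redis calls; the key naming scheme and function names are hypothetical. The previous window's count is weighted by how much of it still overlaps the rolling window.

```python
def window_keys(tenant, now_s, window_s):
    """Current and previous counter keys, sharing one hash tag.

    The {rl:<tenant>} hash tag (an assumed naming scheme) makes Redis
    Cluster hash both keys to the same slot.
    """
    idx = now_s // window_s
    return f"{{rl:{tenant}}}:{idx}", f"{{rl:{tenant}}}:{idx - 1}"


def estimate(cur, prev, now_s, window_s):
    """Approximate request count over the last window_s seconds."""
    elapsed = now_s % window_s               # seconds into the current window
    overlap = window_s - elapsed             # seconds of the previous window still in view
    return cur + prev * overlap / window_s


def allow(cur, prev, now_s, window_s, limit):
    return estimate(cur, prev, now_s, window_s) < limit
```

For example, 15 seconds into a 60-second window with 100 requests in the previous window and 10 so far in this one, the estimate is 10 + 100 * 45/60 = 85.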
rate-limiting

Reward well-behaved bursts with a token bucket

When
Tenants have legitimately bursty traffic (batch jobs, expensive queries) that a flat per-second window would unfairly punish.
AWS
A Redis hash per tenant (tokens, last_refill) refilled lazily in Lua at rate r up to cap b; bill expensive operations (e.g. GraphQL query cost) as more tokens.
Trade-off
You allow bursts up to b above the sustained rate, and must use integer/server-clock arithmetic to avoid float drift and clock-skew bugs, versus the perfectly smooth output of a leaky bucket.
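A sketch of the lazy refill using integer arithmetic only, as a pure-Python model of what the Lua script would compute. The milli-token scale, the state tuple, and the function name are assumptions made for illustration; `now_ms` stands in for the server clock.

```python
SCALE = 1000  # store milli-tokens so fractional refills stay in integers


def take(state, now_ms, rate_per_s, cap, cost=1):
    """Try to spend `cost` tokens. Returns (allowed, new_state).

    state is (tokens_milli, last_refill_ms). A rate of r tokens/second
    equals r milli-tokens per millisecond, so the refill is a plain
    integer multiply with no float drift.
    """
    tokens_milli, last_ms = state
    elapsed_ms = max(0, now_ms - last_ms)    # guard against a clock stepping back
    tokens_milli = min(cap * SCALE, tokens_milli + elapsed_ms * rate_per_s)
    need = cost * SCALE
    if tokens_milli >= need:
        return True, (tokens_milli - need, now_ms)
    return False, (tokens_milli, now_ms)
```

Charging an expensive operation more is then just a larger `cost`, for example `take(state, now_ms, rate_per_s=2, cap=5, cost=5)` for a query priced at five tokens.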
coordination

Limit per cell, accept bounded over-counting

When
A globally exact counter would need a cross-region synchronization point whose round-trip exceeds your entire latency budget.
AWS
Run an independent ElastiCache cluster per cell; a single control-plane authority computes per_cell_limit = global_limit / active_cell_count from a DynamoDB cell-registry and republishes on cell join/leave; pin clients to scaleReads:'master' and assert on startup that ROLE reports master, i.e. the primary (never read from replicas: a stale count is a silent bypass).
Trade-off
A tenant balanced across C cells can consume up to C times its global limit, but only within the recompute window (alarm at 1.5x, page at 3x), in exchange for local sub-millisecond hops and a cell-scoped blast radius.
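The per-cell split and the two overage thresholds can be sketched as follows. The function names are hypothetical, and rounding down (so the cell limits never sum above the global limit) is an assumption; the source only gives the division.

```python
def per_cell_limit(global_limit, active_cell_count):
    """Share of the global limit each cell enforces locally."""
    if active_cell_count < 1:
        raise ValueError("at least one active cell is required")
    # Floor division keeps the sum of cell limits at or below the global
    # limit; the max(1, ...) floor can exceed it when cells outnumber the limit.
    return max(1, global_limit // active_cell_count)


def overage_level(observed_total, global_limit):
    """Classify aggregate consumption against the 1.5x / 3x thresholds."""
    if observed_total >= 3 * global_limit:
        return "page"
    if 2 * observed_total >= 3 * global_limit:   # observed >= 1.5 * limit, in integers
        return "alarm"
    return "ok"
```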
rate-limiting

Counter-shard the whale tenants within a cell

When
Hash-tagging pins a tenant's keys to one slot for atomicity, but a single very-high-rate tenant then drives all its EVALs to one shard and saturates it while others sit idle.
AWS
For tenants above an ops threshold, split the key into M sub-keys ({rl:t:s0}..{rl:t:sM-1}) on ElastiCache, route each request to a random sub-shard, and sum the approximate sub-counts.
Trade-off
An extra over-count on whale tenants (the same approximation accepted at the cell level, one level down) in exchange for spreading their load off a single hot slot; the common tenant stays exact.
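A small sketch of the sub-key scheme, following the {rl:t:s0}..{rl:t:sM-1} naming above. The helper names are hypothetical. Because the braces wrap the whole name, each sub-key carries a different hash tag and can hash to a different slot, which is the point of the split.

```python
import random


def sub_keys(tenant, m):
    """The M counter keys for one sharded tenant."""
    return [f"{{rl:{tenant}:s{i}}}" for i in range(m)]


def pick_sub_key(tenant, m, rng=random):
    """Route one request to a uniformly random sub-shard."""
    return sub_keys(tenant, m)[rng.randrange(m)]


def estimate_total(sub_counts):
    """Approximate tenant-wide count: the sub-counts are read at slightly
    different instants, so the sum is an estimate, not a snapshot."""
    return sum(sub_counts)
```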
coordination

Fail open with a local backstop and a circuit breaker

When
The shared rate-limit store can fail or slow down, and a limiter bug must never reject legitimate paying traffic — but going fully open invites abuse during the outage.
AWS
On ElastiCache error/timeout, allow and emit a metric; keep WAF/usage-plan throttles as the volumetric floor that survives a Redis outage; size the in-process backstop bucket as tenant_limit / current_healthy_instances (from EDS) with a hard ceiling; trip a circuit breaker on sustained errors and recover half-open with U(0, 0.2*cooldown) jitter and a single probe.
Trade-off
Brief over-allowance during failover and coarser enforcement in degraded mode, bounded by the edge floor so that a deliberately induced Redis failure can't become unbounded pass-through, in exchange for never blocking real customers when the limiter is sick.
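The backstop sizing and the breaker's state machine can be sketched like this. It is a simplified single-threaded model under stated assumptions: "sustained errors" is taken to mean a count of consecutive failures, the clock is passed in by the caller, and all names are hypothetical.

```python
import random


def backstop_limit(tenant_limit, healthy_instances, ceiling):
    """Per-instance share of a tenant's limit, capped by a hard ceiling."""
    share = tenant_limit // max(1, healthy_instances)
    return max(1, min(ceiling, share))


class CircuitBreaker:
    def __init__(self, threshold, cooldown_s, rng=random.random):
        self.threshold = threshold      # consecutive failures before opening
        self.cooldown_s = cooldown_s
        self.rng = rng                  # returns a float in [0, 1)
        self.failures = 0
        self.state = "closed"
        self.retry_at = 0.0

    def allow_call(self, now):
        """True if the caller may try the shared store right now."""
        if self.state == "closed":
            return True
        if self.state == "open" and now >= self.retry_at:
            self.state = "half_open"    # let exactly one probe through
            return True
        return False                    # cooling down, or a probe is in flight

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self, now):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.threshold:
            self.state = "open"
            # U(0, 0.2 * cooldown) jitter so instances do not probe in lockstep
            jitter = self.rng() * 0.2 * self.cooldown_s
            self.retry_at = now + self.cooldown_s + jitter
```

While `allow_call` returns False, the decider would fall back to the in-process bucket sized by `backstop_limit` rather than calling the store.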
coordination

Distribute quota config without a single-coordinator freeze

When
Live config (per-tenant limits) is distributed from a store to many stateless deciders; a single Streams->Lambda->push path can stall (iterator lag, poison record, fan-out storm) and freeze all tenants on stale limits.
AWS
DynamoDB Streams triggers a reserved-concurrency Lambda (DLQ + BisectBatchOnFunctionError, IteratorAge alarm) that publishes one event to SNS; each cell consumes via its own SQS queue; deciders snapshot config from DynamoDB on startup and serve last-known-good on staleness; AppConfig gates staged rollout/rollback.
Trade-off
More moving parts than a direct push, in exchange for bounded fan-out (O(M) publishes + O(N) batched consumers), no single point whose stall blocks forward progress, and a deterministic cold-start config.
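On the decider side, "snapshot on startup, serve last-known-good on staleness" might look like the sketch below. It is an in-memory model with no AWS calls. The per-record version number, the class name, and the staleness rule (time since the last message of any kind) are assumptions; the version check is there because queue delivery can duplicate or reorder messages.

```python
class QuotaConfigCache:
    def __init__(self, snapshot, max_age_s, now):
        # snapshot: {tenant: (version, limit)} read from the config store at startup
        self._limits = dict(snapshot)
        self._max_age_s = max_age_s
        self._last_heard = now

    def apply(self, tenant, version, limit, now):
        """Apply one update message. Returns True if the config changed."""
        self._last_heard = now          # any delivery proves the pipeline is alive
        current = self._limits.get(tenant)
        if current is not None and version <= current[0]:
            return False                # duplicate or out-of-order delivery
        self._limits[tenant] = (version, limit)
        return True

    def lookup(self, tenant, now, default_limit):
        """Return (limit, stale). Stale configs are still served, only flagged."""
        stale = (now - self._last_heard) > self._max_age_s
        entry = self._limits.get(tenant)
        limit = entry[1] if entry is not None else default_limit
        return limit, stale
```

The `stale` flag is meant to drive a metric or alarm, not to change the decision.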
coordination

Jitter the reset to defuse the retry stampede

When
Throttled clients all read the same reset timestamp and retry on the same millisecond, turning a transient limit into a self-inflicted synchronized DDoS.
AWS
Return 429 with a jittered Retry-After and prefer no-hard-reset algorithms (sliding window / token bucket); have clients use full-jitter exponential backoff: sleep = min(cap, base*2^n) * U(0,1).
Trade-off
Slightly longer worst-case client wait in exchange for spreading retries across a window instead of concentrating them at a clock boundary.
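Both halves of the pattern fit in a short sketch: the client-side full-jitter formula quoted above, and a server-side helper that spreads Retry-After values. The helper names and the `spread_s` parameter are hypothetical; rounding up to whole seconds reflects that Retry-After carries an integer delay.

```python
import math
import random


def full_jitter_sleep(attempt, base, cap, rng=random.random):
    """Client side: sleep = min(cap, base * 2^n) * U(0, 1)."""
    return min(cap, base * 2 ** attempt) * rng()


def jittered_retry_after(seconds_until_reset, spread_s, rng=random.random):
    """Server side: push each client's retry a random amount past the reset.

    Returns whole seconds, rounded up so no client is told to retry
    before the limit has actually reset.
    """
    return math.ceil(seconds_until_reset + rng() * spread_s)
```

Passing a fixed `rng` (for example `lambda: 0.5`) makes both functions deterministic for tests.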