Architecture review

Distributed rate limiting at API gateway scale

One INCR is easy; one million requests/second across 10k tenants without a race, a thundering herd, or a cross-region round-trip is the interesting version.

staff12 min readrate-limitingcoordination

Rate limiting looks like a one-liner: INCR a counter, reject when it crosses a threshold. That one-liner has a race condition that silently disables itself, a boundary that lets through 2× the limit, and a window reset that turns 10k well-behaved clients into a synchronized stampede. The interesting version runs at a million requests/second across thousands of tenants and still says no at the right moment.

The problem, and the numbers we design to

Principal

Before any boxes — what are we protecting, and at what scale?

Staff

A multi-tenant API gateway. We protect upstream services from abuse and noisy neighbours, and we enforce per-tenant plan quotas. Design point: 1,000,000 requests/second at the edge, 10,000 tenants, plans from 100 to 10,000 req/s per tenant, p99 latency budget of 10 ms for the limiter decision itself. It must fail open — a bug in the limiter must never reject a paying customer's legitimate traffic.

Napkin math

The limiter sits in the hot path of every request, so its op budget is the whole traffic volume. If a decision needs $k$ datastore operations, the backing store must sustain

$$ R_{ops} = R_{req} \times k = 10^{6} \times 4 = 4 \times 10^{6}\ \text{ops/second} $$

for $k=4$ (two reads, one increment, one expire). A single Redis node does ~100k–200k scripted ops/s, so $4\times10^{6} / 1.5\times10^{5} \approx 27$ shards. That number — not the algorithm — is what forces the topology. And every one of those ops adds to the 10 ms budget, which is why we will fight to keep the store one network hop away.

The naive counter, and the race that disables it

Principal

Start simple. One Redis, INCR a per-tenant key, reject above the limit. What's wrong with that?

Staff

Two things, and the second is the dangerous one. First, a fixed window resets on a hard clock boundary, so a tenant can fire the full limit at :59 and again at :012× the limit in two seconds, both windows technically legal. Second, the textbook implementation is two commands:

count = INCR key            # command 1
if count == 1:
    EXPIRE key 60           # command 2 — only on first hit

Those aren't atomic. If the process crashes between them — or two requests interleave such that neither is the "first" that sets the TTL — the key persists with TTL = -1, no expiry. That counter now only ever grows: once it's past the limit it rejects the tenant forever, or if you fail open, it never limits them again. Kong shipped exactly this bug (PR #6150); it surfaced only under concurrent load.

Atomic sliding window in Lua

Principal

So Lua fixes atomicity. But you still owe me the 2×-burst fix. Sliding window log?

Staff

A sliding window log — a sorted set of timestamps per tenant — is exact, but it's a memory trap under attack. Let me show the number before I reject it.

Napkin math — sliding window log memory

Limit 1,000 req/s, 60 s window ⇒ up to 60,000 entries per tenant at peak. A sorted-set entry (score + member + ziplist overhead) is ~39 bytes:

$$ M_{tenant} = 60{,}000 \times 39\ \text{B} \approx 2.34\ \text{MB} $$

$$ M_{total} = 2.34\ \text{MB} \times 10{,}000 = 23.4\ \text{GB} $$

Under a coordinated attack every tenant sits at its limit at once: 23.4 GB of pure rate-limit state, plus ZREMRANGEBYSCORE pruning cost that scales with how many entries expired. The limiter can OOM itself exactly when you need it most.

Staff

So I use the sliding window counter instead: two integer counters per tenant — current window and previous window — and a weighted estimate. Memory drops to a rounding error:

Two keys × ~102 bytes × 10,000 tenants ≈ 2 MB — about 11,700× smaller than the log. The estimated rate over the rolling window is

rate = prev * (window - elapsed)/window + current

which Cloudflare validated against 400M requests at 0.003% incorrect decisions. The Lua script does it atomically against Redis's own clock:

-- KEYS = {curr, prev}; ARGV = {limit, window_s, elapsed_s}
local cur  = tonumber(redis.call('GET', KEYS[1]) or '0')
local prev = tonumber(redis.call('GET', KEYS[2]) or '0')
local w    = tonumber(ARGV[2])
local est  = prev * ((w - tonumber(ARGV[3])) / w) + cur
if est >= tonumber(ARGV[1]) then return {0, est} end
local cur2 = redis.call('INCR', KEYS[1])
if cur2 == 1 then redis.call('EXPIRE', KEYS[1], w * 2) end  -- arm once, not every call
return {1, est + 1}

Note the guarded EXPIRE: it fires only when the counter is the first write of the window (cur2 == 1), not on every request. Re-arming unconditionally is correct but it's a timer-heap mutation per call — at 1M req/s that's 1M heap deletes-and-reinserts/s, a measurable CPU hotspot on a whale's shard. The guard cuts it to one EXPIRE per window per tenant for the same correctness.

Token bucket, and why not leaky bucket

Principal

Some tenants have legitimately bursty traffic — a batch job at the top of the hour. A sliding window punishes that. What do you give them?

Staff

A token bucket. A Redis hash per tenant — tokens and last_refill — refilled lazily on read at $r$ tokens/second up to a cap $b$. A request costs one token; bursts up to $b$ are allowed, then the tenant is shaped to the sustained rate $r$. Memory is ~190 bytes/tenant, ~1.9 MB for 10k tenants — negligible. This is the Stripe / API Gateway default for good reason.

A leaky bucket models the opposite: it drains at a fixed rate and queues or drops the overflow, so output is perfectly smooth but it never lets a genuine burst through. Token bucket allows the burst up to $b$; leaky bucket forbids it. We want to reward well-behaved bursts, so token bucket wins. Shopify's GraphQL limiter is a token bucket where a query's cost (computed from the AST) is the number of tokens — same machine, smarter price.

Distributing the counter: cluster vs. local

Principal

You said 27 shards. So one big Redis Cluster, every gateway node talking to it. At 10× — 10M req/s — does that hold?

Staff

Redis Cluster scales horizontally — 16,384 hash slots over N shards, keys routed by CRC16. The sliding-window pair lives on the same tenant, and a multi-key Lua script must hit one slot, so I hash-tag the keys: {rl:tenant-123}:cur and {rl:tenant-123}:prev. The {...} forces same-slot placement, otherwise the script dies with CROSSSLOT. A shard failure then has a blast radius of $1/N$ of tenants, not all of them.

At 10× I scale shards roughly linearly — 40M ops/s ⇒ ~200–400 shards — but the real problem at that scale isn't shard count, it's the round-trip. If a gateway node in us-east-1 has to consult a counter in another region, that's 50 ms+, which alone blows the 10 ms budget. So the global-vs-local question is the actual fork in the road.

Principal

So pick one. Global strong consistency or local approximate?

Staff

Local, per-cell, and I accept bounded over-counting. The math forbids strong global consistency in the hot path: a single authoritative counter needs a synchronization point, and a cross-region round-trip is larger than my entire latency budget. So each cell (region/AZ) runs its own ElastiCache cluster and enforces a per-cell limit. A tenant spread over $C$ cells can consume up to $C\times$ their global limit in the worst case — Lyft's Envoy ratelimit makes exactly this trade. Cloudflare sidesteps it entirely because anycast pins a client to one PoP, so per-PoP is the global limit under normal routing.

Principal

"Divide by expected cell count" is hand-waving. Who computes the per-cell share, and what happens when a cell joins or leaves?

Staff

Fair — undefined division is unbounded over-admission, not a bounded trade-off. The control-plane Lambda is the single authority for the per-cell math. Active cells live in a DynamoDB cell-registry table; a cell joining or draining writes that table, a Streams event fires, and the Lambda recomputes per_cell_limit = global_limit / active_cell_count and republishes to every ratelimit service. Until that recompute lands, the old (larger) per-cell share is live on the new cell too, so a tenant balanced across cells can momentarily exceed — that window is the bound, not a steady state. We publish active_cell_count and per_cell_limit_effective as CloudWatch gauges and alarm when realized over-admission crosses 1.5×; sustained pages. We do not chase strong global exactness — the requirement is bounded drift, not zero.

Global enforcement: Envoy sidecar + control plane

Principal

Where does the limiter actually run? In the gateway process? A sidecar? And how do tenants get their limits without a redeploy?

Staff

Data plane: an Envoy proxy fronts each cell, and its rate-limit filter calls a stateless ratelimit gRPC service (the Lyft/Envoy open-source one) co-located in the cell, backed by the cell's ElastiCache. Envoy sends a domain and ordered descriptors(tenant, /payments), (tenant), (source_ip) — and the service resolves each to a key and applies the most specific matching limit. Sub-millisecond hop because everything is in-cell. We pick the Lyft service deliberately: AWS has no managed service that resolves multi-dimensional ordered descriptors to per-tenant/per-operation limits over a gRPC API — usage plans throttle one flat API-key dimension and can't express the (tenant, op) hierarchy — so self-hosting on AWS compute is rung 3 of the ladder forced by a real capability gap, not habit.

One correction to the op budget: three descriptors per request, resolved as three separate keyed decisions, is not $k=4$ — it's closer to $k=12$, which triples the required ops to $\sim 12\times10^{6}$/s and the shard count with it (see beat 09). Where the descriptors for one tenant can be co-located (same hash-tag) we collapse them into a single multi-key Lua script and one round trip; the per-IP descriptor lands on a different slot and stays a separate call. The honest starting point is therefore ~80 shards, not 27.

Control plane: limits live in DynamoDB (a tenant→plan→limits table). A Streams event fires on every plan write into a control-plane Lambda, which resolves the limit and publishes a single config-change event to an SNS topic; each cell's ratelimit service subscribes through its own SQS queue and drains updates at its own pace. That turns a bulk migration (onboard 500 tenants → 500 writes) from $O(N\times M)$ simultaneous gRPC pushes into $O(M)$ publishes and $O(N)$ independent batched consumers — no fan-out storm, and each consumer has a queue to absorb bursts. So changing a tenant's plan is a write, not a deploy.

Two hard rules make that control plane safe to depend on. First, the Lambda gets reserved concurrency (so an account-wide Lambda spike can't starve config pushes), a DLQ + BisectBatchOnFunctionError on the event-source mapping (so one poison record can't wedge the stream or silently drop config), and a CloudWatch alarm on Streams IteratorAge > 30 s. Second, the ratelimit service is never blocked on a push it might never receive: on startup it pulls current config directly from DynamoDB (a point-in-time snapshot, ~10k items, one scan) and only then switches to the pushed-update stream, and it serves last-known-good if the stream goes stale, emitting a config_stale metric. We measure write-to-receipt propagation lag as a CloudWatch metric and hold it as an operational SLO, because the latency-sensitive case is a support engineer raising a throttled customer's limit live.

CloudWatch carries the observability, but not as one PutMetricData per decision — at 1M req/s that path is both rate-limited by the API and absurdly priced. The ratelimit service aggregates allow/deny per descriptor in 1-second in-process buckets and flushes them as Embedded Metric Format (EMF) log blobs, from which CloudWatch extracts metrics server-side. That powers dark-launch — a new limit runs in count mode and is watched before it flips to enforce, the Stripe playbook — at thousands of data points/s instead of millions.

Failure & recovery — what pages at 3am

Principal

The cell's Redis primary dies at 03:00. What happens to a request in flight, and what's the page?

Staff

The ratelimit service's call to ElastiCache errors or times out, and we fail open: the request is allowed and a metric increments. We never block a paying customer because our limiter is sick. ElastiCache Multi-AZ promotes a replica — realistically tens of seconds to detect-and-promote, not single-digit; we alarm on measured failover duration rather than claim a number we haven't proven. During that window counters reset to zero (RPO ≈ one window of counts), so tenants briefly get extra allowance — acceptable for protection, and the reason fraud counts don't ride this path. We never read from replicas for decisions: a stale replica count is an under-count, a silent bypass. That prohibition is enforced, not just stated — the ElastiCache client is pinned to scaleReads: 'master' and the service asserts on startup via ROLE that it's connected to a primary, because the default in several Redis client libraries is to scatter reads to replicas.

Fail-open is a deliberate choice, but unbounded fail-open is a denial-of-protection vector: an attacker who can induce Redis timeouts converts the limiter into pass-through for the whole cell. So we bound it. The WAF rate-based rules and API Gateway usage-plan throttles are the enforced volumetric floor — they're independent of ElastiCache, survive a Redis outage, and are tuned to contain abuse, not merely catch floods. Below that floor, a degraded local mode: each Envoy keeps a small in-process token bucket per tenant. Critically, that bucket is sized tenant_limit / current_healthy_instances using the instance count Envoy already learns from EDS — a fixed per-instance budget would hand every tenant $N\times$ their limit (the per-cell fault multiplied by fleet size). It's deliberately coarse (documented as ~2–3× sustained, with a hard ceiling well below upstream capacity), and we emit a requests_served_by_backstop metric so the blast radius is observable.

A circuit breaker on the ElastiCache dependency closes the loop: after a burst of errors we stop calling it for a cooldown and serve from the floor + backstop, which keeps a sick Redis from adding latency to every request. Recovery is half-open with jitter — each node adds $U(0,\,0.2\cdot\text{cooldown})$ to its reset timer and admits a single probe EVAL before fully closing. Without that jitter, hundreds of nodes sharing one cooldown all resume at the same millisecond and re-floor a just-promoted replica — a thundering herd on the recovering primary. The same jitter discipline we apply to client retries, applied to our own reconnect. The page fires on limiter error rate and p99 of the limiter hop (and on error rate simultaneously with upstream saturation, the signature of an induced-fail-open attack), not on deny counts — denies are expected control flow.

Security & multi-tenancy

Principal

Tenant A must not be able to spend tenant B's quota, and an attacker must not be able to dodge an IP limit by spoofing a header. Walk me through isolation.

Staff

The rate-limit identity comes from an authenticated claim — the verified JWT tenant_id from the gateway authorizer — never from a client-supplied header, so A literally cannot address B's key. For IP-based limits we don't trust "leftmost" anything: X-Forwarded-For is appended right-to-left, so the client-controlled portion is the leftmost. We have CloudFront/ALB overwrite (not append) XFF at the edge and set Envoy's xff_num_trusted_hops to the exact count of proxies we own, so the trusted entry is read by hop count and a forged header can't shift it. Per-tenant keys are namespaced ({rl:<tenant>}:...), and quota config in DynamoDB is row-scoped per tenant — no shared mutable bucket. This is the SOC 2 CC6 logical-access story and NIST 800-53 AC-4 information-flow enforcement.

One gap the identity layer doesn't close: the Redis network hop itself. The quota keys are only as private as the wire. We set TransitEncryptionEnabled=true and a Redis AuthToken / ACLs (least-privilege per ratelimit-service principal, token pulled from Secrets Manager, never env or code) on every ElastiCache cluster, and lock the cluster security group to the ratelimit-service SG only — otherwise anyone on the cell network can INCR or reset any tenant's counter directly, behind the JWT we rely on for isolation. That's SOC 2 CC6.1/CC6.7 and NIST SC-8/SC-13.

The audit trail is the part CloudWatch counts don't cover: aggregate allow/deny gauges are operational telemetry, not an attributable record of who weakened a control. A raised quota is a privileged, billing-relevant action, so a second DynamoDB Streams consumer writes an append-only audit log — actor, before/after, timestamp — to an S3 bucket with Object Lock (WORM), with CloudTrail data-events covering the limits table and the control-plane Lambda and least-privilege IAM (no wildcard writes) on the table. That's SOC 2 CC7.2/CC8.1 change management and NIST AU-2/AU-9/AU-10. At-rest encryption on ElastiCache and DynamoDB rounds out the data-protection controls.

Finally, the IP descriptor is personal data under GDPR. IP-keyed Redis state already carries the window's short TTL (make it explicit and bounded — minimization and storage limitation), we don't emit raw IPs into long-lived log streams (the EMF metrics aggregate by descriptor group, not raw IP), the IP-keyed metrics stay in-region, and the lawful basis is documented legitimate interest (abuse prevention). Tenant keys were already well-scoped; the IP dimension is where the privacy obligation actually lands.

Cost

Principal

What does this cost, and where's the money — the algorithm choice or the topology?

Principal

And before you answer — is the algorithm even where the money is?

Staff

No, and that's the inverted-narrative trap. The ElastiCache fleet is the third-largest line item. Let me put the whole bill on the table.

Napkin math — store cost at 1M req/s, per cell

Memory is trivial: sliding-window counter is ~2 MB for 10k tenants; the fleet is CPU-bound, not memory-bound — r7g.large's 13 GB is wildly over-provisioned on RAM, so reaching for a bigger node "for headroom" just doubles the bill. The cost is the op rate. Using a realistic p95 EVAL throughput of ~$10^{5}$ ops/s (not the top-end $1.5\times10^{5}$) and the corrected $k=12$ from three descriptors:

$$ N_{shards} = \frac{12\times10^{6}}{1.0\times10^{5}} \approx 120 \Rightarrow \text{co-located descriptors bring effective } k\!\approx\!8 \Rightarrow \sim 80\ \text{shards} \times (\text{primary}+\text{replica}) \approx 160\ \text{nodes} $$

On r7g.large On-Demand that's ~$\$20\text{k}$/month per cell; a 1-year Savings Plan (this load is predictable) cuts it ~30% to ~$\$14\text{k}$. Cross-AZ replication is a real, separately-billed line: ~65 bytes/write × 1M req/s ≈ 65 MB/s ≈ 168 TB/month × \$0.01/GB ≈ ~$\$1.7\text{k}$/month per cell — small, but not zero, and we now state it.

Where the money actually is — the edge layer

At 1M req/s = 86.4B requests/month, the edge dwarfs the store:

  • API Gateway REST at ~\$2.80–3.50/M calls ≈ ~$\$240\text{k}$/month — by itself ~60% of the bill, the exact opposite of "near-zero."
  • CloudFront + WAF (request inspection at \$0.60/M) ≈ ~$\$130\text{k}$/month.
  • CloudWatch via EMF (aggregated, per beat 06) ~$\$2\text{k}$–$\$12\text{k}$; per-decision PutMetricData would have been ~$\$777\text{M}$/month and is physically impossible past the API's 150 TPS — the reason we aggregate.
  • DynamoDB quota config under the push model: a startup scan plus stream pushes, ~$\$50$–$\$500$/month with a bounded TTL-fallback read cap; near-zero in steady state.
  • Envoy + ratelimit compute on EKS: ~$\$1\text{k}$–$\$3\text{k}$/month per cell.

So the per-cell total is ~$\$385\text{k}$–$\$420\text{k}$/month, of which the rate-limiting machinery we spent nine beats on is <5%.

Staff

That reframes the cost levers entirely. The first lever is the edge, not Redis: if Envoy already terminates and we keep WAF for volumetric defense, dropping API Gateway REST in favor of Envoy-native per-key throttling takes the edge from ~$\$370\text{k}$ to ~$\$130\text{k}$ — and if we keep usage plans, HTTP API + a custom authorizer beats REST by ~$\$190\text{k}$. We keep API Gateway REST here only because usage plans give us the managed coarse floor that survives a Redis outage (beat 07); that floor is worth real money, so it's a named trade, not a default.

The second lever is the one we already pull: keep cheap traffic off Redis — WAF and usage plans absorb coarse volumetric and per-key limits, and only business-logic limits (per-tenant, per-operation, GraphQL cost) touch ElastiCache, cutting the op rate and the fleet. ElastiCache Serverless is the alternative if the 1M req/s is a peak rather than a sustained floor; we gate it on one verified constraint first — that Serverless supports cluster-mode EVAL with hash-tagged keys at our ops rate — so a cost optimization can't silently break the slot-co-location the design depends on. For genuine 24/7 sustained load, provisioned + Savings Plan still wins.

Did we ever leave AWS?

Principal

You reached for the Envoy ratelimit service — that's open source, not an AWS product. Did we leave AWS?

Staff

No. Every stateful and managed piece is AWS: ElastiCache for Redis holds the counters, DynamoDB holds quota config, CloudWatch holds metrics, WAF and API Gateway do edge limiting, and the control plane is Lambda + DynamoDB Streams + SNS/SQS + AppConfig for the rollout gate. Envoy and the ratelimit gRPC service are software we run on AWS compute (EKS/ECS) — that's rung 3 of the ladder, self-hosting on AWS, not a departure, forced by the named gap: no AWS service resolves multi-dimensional descriptors to per-tenant/per-operation limits over gRPC.

The one requirement that would force a real departure is a globally exact, strongly-consistent counter in the hot path at low latency — and even that has an AWS shape (DynamoDB conditional writes, or Aurora DSQL) before it leaves the platform. We don't have that requirement: rate limiting tolerates approximation, so we never need it.

Principal

One more: a reviewer suggested App Mesh to manage Envoy's lifecycle, and folding the circuit breaker into App Mesh outlier detection. Why not?

Staff

App Mesh is the right reflex and we'd happily take its managed Envoy lifecycle, xDS, and TLS — but it doesn't subsume the two things people want it to here. App Mesh outlier detection ejects an unhealthy upstream host from a load-balanced pool; our circuit breaker trips on the ElastiCache dependency and switches the decision to a fail-open + local-backstop path. Those aren't the same control: there's no "other host" to fail over to, the action is "stop calling and serve degraded," and the backstop bucket is bespoke precisely because its whole job is to enforce something when the managed datastore is gone. That backstop is a calculated residual-risk component, sized off EDS instance count and metered — we keep it as a designed piece, not a thing App Mesh hands us. App Mesh manages the proxy; it does not manage our degradation policy.