Distributed rate limiting at API gateway scale

01

Shed the cheap traffic at the edge

flowchart LR
    Internet([Internet]) --> CF[CloudFront + WAF]
    CF --> AG[API Gateway]

Before anything stateful, CloudFront + WAF rate-based rules absorb volumetric IP floods and API Gateway REST usage plans apply a coarse per-API-key throttle. This edge layer is also the enforced volumetric floor that survives a Redis outage.
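API Gateway documents its throttling as a token bucket with a steady rate and a burst capacity. The sketch below illustrates that general algorithm only; it is not AWS's implementation, and the `TokenBucket` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Generic token bucket: `rate` tokens refill per second, up to `burst`."""
    rate: float       # steady-state requests per second
    burst: float      # bucket capacity (maximum short-term burst)
    tokens: float     # tokens currently available
    updated_s: float  # timestamp of the last refill, in seconds

    def try_acquire(self, now_s: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now_s - self.updated_s)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated_s = now_s
        # Spend one token per request, or reject if the bucket is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A throttle like this needs no shared store, which is one reason a coarse edge limit can keep working while the in-cell Redis is unavailable.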

02

Envoy fronts the cell

flowchart LR
    Internet([Internet]) --> CF[CloudFront + WAF]
    CF --> AG[API Gateway]
    AG --> EV[Envoy Proxy per cell]
    EV -->|JWT to descriptors| RL[Rate Limit Svc gRPC]

An Envoy proxy in each cell carries the rate-limit filter. It turns each authenticated request into ordered descriptors, (tenant, op), (tenant), and (ip), with the tenant identity taken from a verified JWT claim, never from a client-supplied header.
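In Envoy this mapping is expressed as filter configuration, not application code. Purely to illustrate the shape of the output, here is a minimal sketch of the descriptor ordering. The function name and the `tenant_id` claim name are assumptions, and `claims` is assumed to be the payload of a JWT whose signature has already been verified.

```python
from typing import Dict, List, Tuple

Descriptor = List[Tuple[str, str]]


def build_descriptors(claims: Dict[str, str], operation: str, peer_ip: str) -> List[Descriptor]:
    """Return descriptors ordered from most to least specific.

    Nothing here reads request headers: the tenant comes only from the
    verified claims, and the IP from the connection's peer address.
    """
    tenant = claims["tenant_id"]  # hypothetical claim name; raises KeyError if absent
    return [
        [("tenant", tenant), ("op", operation)],
        [("tenant", tenant)],
        [("ip", peer_ip)],
    ]
```

Raising on a missing claim makes the sketch fail closed: a request without a verified tenant never gets a tenant-scoped descriptor.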

03

Atomic decision in-cell

flowchart LR
    Internet([Internet]) --> CF[CloudFront + WAF]
    CF --> AG[API Gateway]
    AG --> EV[Envoy Proxy]
    EV --> RL[Rate Limit Svc]
    RL -->|Lua EVAL atomic| RD[(ElastiCache Redis)]

Envoy calls a stateless ratelimit gRPC service, which runs one atomic Lua EVAL against the cell's ElastiCache Redis Cluster: a sliding-window counter over two keys per tenant, hash-tagged so both land in the same cluster slot, reached in a sub-millisecond hop.
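The decision logic can be sketched in plain Python, with a dict standing in for Redis. This assumes the two keys are the current and previous fixed-window counters, and that the sliding window is approximated by weighting the previous window by how much of it still overlaps. The key format and function names are hypothetical. In the real system this logic would be the body of the Lua script, so that the read and the increment happen atomically.

```python
from typing import Dict, Tuple


def window_keys(tenant: str, now_s: float, window_s: int) -> Tuple[str, str]:
    """Keys for the current and previous windows.

    The {tenant} hash tag makes Redis Cluster hash only the tenant id,
    so both keys map to the same slot and one script can touch both.
    """
    idx = int(now_s // window_s)
    return (f"rl:{{{tenant}}}:{idx}", f"rl:{{{tenant}}}:{idx - 1}")


def allow(store: Dict[str, int], tenant: str, limit: int, window_s: int, now_s: float) -> bool:
    """Approximate sliding-window check; increments only when allowed."""
    cur_key, prev_key = window_keys(tenant, now_s, window_s)
    # Fraction of the current window that has already elapsed.
    elapsed = (now_s % window_s) / window_s
    # Weight the previous window by the part still inside the sliding window.
    estimate = store.get(prev_key, 0) * (1.0 - elapsed) + store.get(cur_key, 0)
    if estimate >= limit:
        return False
    store[cur_key] = store.get(cur_key, 0) + 1
    return True
```

A Redis-backed version would also set an expiry on each key (roughly two windows) so that old counters clean themselves up.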

04

Allow, or 429 with jitter

flowchart LR
    Internet([Internet]) --> CF[CloudFront + WAF]
    CF --> AG[API Gateway]
    AG --> EV[Envoy Proxy]
    EV --> RL[Rate Limit Svc]
    RL -->|Lua EVAL| RD[(ElastiCache Redis)]
    RL -- allow --> SVC[Upstream Service]
    RL -- 429 + jitter --> Internet

Under the limit, the request flows to the upstream service. Over it, the client gets a 429 with a jittered Retry-After so 10k throttled clients don't retry on the same millisecond.
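A minimal sketch of the jitter, assuming the service knows how many seconds remain until the window frees up. The function and parameter names are hypothetical. Because Retry-After in delay-seconds form carries whole seconds, the jitter range must span several seconds to spread clients out.

```python
import math
import random


def retry_after_seconds(window_remaining_s: float, max_jitter_s: float, rng: random.Random) -> int:
    """Whole-second Retry-After value: base wait plus uniform random jitter."""
    jitter = rng.uniform(0.0, max_jitter_s)
    # Round up so a client never retries before the window has moved on,
    # and never advertise less than one second.
    return max(1, math.ceil(window_remaining_s + jitter))


# Example: throttled clients with 2 s left in the window get spread over 2..7 s.
rng = random.Random()
header_value = str(retry_after_seconds(2.0, 5.0, rng))
```

Clients that honor the header then retry across a range of seconds instead of all returning at the same instant.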

05

Quota config as data, not deploys

flowchart LR
    subgraph DP[Data Plane]
        Internet([Internet]) --> CF[CloudFront + WAF] --> AG[API Gateway] --> EV[Envoy] --> RL[Rate Limit Svc]
        RL -->|Lua EVAL| RD[(ElastiCache Redis)]
        RL -- allow --> SVC[Upstream]
    end
    subgraph CP[Control Plane]
        DDB[(DynamoDB quota)] -->|Streams| LA[Lambda] --> SN[SNS] --> SQ[SQS] --> RL
    end

Tenant plans live in DynamoDB; a Streams-driven control-plane Lambda publishes resolved limits to SNS, and each cell drains its own SQS queue, so there is no fan-out storm. If updates go stale, each service keeps serving its last-known-good limits, so a stalled push never freezes enforcement.
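One way to picture last-known-good behaviour is a small in-memory cache inside each ratelimit service. The message shape (`tenant`, `limit`, `version`) and the class name are assumptions made for illustration. The version check is there because standard SQS queues can redeliver messages or deliver them out of order.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class LimitCache:
    """Holds the latest applied limit per tenant; never drops entries on staleness."""
    entries: Dict[str, Tuple[int, int]] = field(default_factory=dict)  # tenant -> (version, limit)
    last_update_s: float = 0.0

    def apply(self, message: dict, now_s: float) -> bool:
        """Apply one pushed update; ignore duplicates and older versions."""
        tenant = message["tenant"]
        current = self.entries.get(tenant)
        if current is not None and message["version"] <= current[0]:
            return False
        self.entries[tenant] = (message["version"], message["limit"])
        self.last_update_s = now_s
        return True

    def limit_for(self, tenant: str, default: int) -> int:
        """Serve the last-known-good limit, or a default for unknown tenants."""
        entry = self.entries.get(tenant)
        return entry[1] if entry is not None else default

    def is_stale(self, now_s: float, max_age_s: float) -> bool:
        """Signal for alerting only; lookups keep working when this is True."""
        return now_s - self.last_update_s > max_age_s
```

Staleness here raises an alarm but does not change the answer, which matches the idea that a stalled push should degrade freshness and not enforcement.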

06

Observe, then enforce

flowchart LR
    subgraph DP[Data Plane]
        Internet([Internet]) --> CF[CloudFront + WAF] --> AG[API Gateway] --> EV[Envoy] --> RL[Rate Limit Svc]
        RL -->|Lua EVAL| RD[(ElastiCache Redis)]
        RL -- allow --> SVC[Upstream]
    end
    subgraph CP[Control Plane]
        DDB[(DynamoDB quota)] -->|Streams| LA[Lambda] --> SN[SNS] --> SQ[SQS] --> RL
    end
    RL -->|EMF 1-s buckets| CW[CloudWatch dark-launch]

The ratelimit service aggregates allow/deny counts per descriptor in 1-second buckets and flushes them as EMF (CloudWatch Embedded Metric Format) log lines, rather than calling PutMetricData per decision, which is not feasible at 1M req/s. CloudWatch then powers dark launch: a new limit runs in count-only mode, recording what it would have denied, before it is flipped to enforce.
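The aggregation step might look roughly like the sketch below: decisions are counted in memory per second, and each closed bucket becomes one EMF-shaped JSON line per descriptor. The class name, namespace, and `Descriptor` dimension are assumptions. A real deployment would need to bound dimension cardinality, since one dimension value per descriptor can grow large.

```python
import json
from collections import Counter
from typing import Dict, List


class DecisionAggregator:
    """Counts allow/deny decisions per descriptor in 1-second buckets."""

    def __init__(self, namespace: str) -> None:
        self.namespace = namespace
        self.buckets: Dict[int, Counter] = {}

    def record(self, descriptor: str, allowed: bool, now_s: float) -> None:
        bucket = self.buckets.setdefault(int(now_s), Counter())
        bucket[(descriptor, "Allowed" if allowed else "Denied")] += 1

    def flush(self, now_s: float) -> List[str]:
        """Return one EMF JSON line per (second, descriptor) for closed buckets."""
        lines = []
        # Only seconds strictly before the current one are complete.
        for second in sorted(s for s in self.buckets if s < int(now_s)):
            counts = self.buckets.pop(second)
            for descriptor in sorted({d for d, _ in counts}):
                lines.append(json.dumps({
                    "_aws": {
                        "Timestamp": second * 1000,  # EMF expects epoch milliseconds
                        "CloudWatchMetrics": [{
                            "Namespace": self.namespace,
                            "Dimensions": [["Descriptor"]],
                            "Metrics": [
                                {"Name": "Allowed", "Unit": "Count"},
                                {"Name": "Denied", "Unit": "Count"},
                            ],
                        }],
                    },
                    "Descriptor": descriptor,
                    "Allowed": counts[(descriptor, "Allowed")],
                    "Denied": counts[(descriptor, "Denied")],
                }))
        return lines
```

Each returned line would be written to the service's log stream, so the cost per decision is an in-memory counter increment and not an API call.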