Architecture review

Real-time bidding engine at scale

Design a DSP bidder for a 300k req/s slice of the RTB bidstream that decides in under 10 ms (fail-closed with a fast HTTP 204), paces a $10,000 daily budget across 24 uneven hours with bounded overspend and no Meta-style runaway, caps user frequency at 500k checks/second on ElastiCache for Valkey 8, and mints collision-free sortable ledger IDs without coordination - all on AWS RTB Fabric.

staff12 min readrealtimerate-limitingid-generationpayments-exactly-once

A single page-load fires an auction. Multiply by the open web and the bidstream is 30 million bid requests/second. Each one gives our DSP a hard 100 ms wall clock to decide, of which the network eats most - so the bidder itself has 10 ms to look up a user, score them with a model, check a budget, check a frequency cap, and either return a price or stay silent. Get the budget math wrong and you overspend a client's $10,000 by half a million dollars overnight (Meta did). Get the silence wrong - bid on 100% instead of 2-5% of requests - and you melt the fleet. This is a system whose dominant path is saying no in under ten milliseconds, 19 times more often than it says yes.

The problem, and the numbers we design to

Principal

Frame the system before any boxes. What are we, and what's the SLA that actually gets us thrown off an exchange?

Staff

We're a Demand-Side Platform - a bidder. SSPs and exchanges run the auction; we connect to them and they fan out an OpenRTB BidRequest to every DSP in parallel over HTTP POST. We answer with a BidResponse carrying a price, or we answer HTTP 204 No Content, which is the no-bid and the most bandwidth-efficient thing we can do. The SLA that ends careers: Google Authorized Buyers requires 85% of our responses to arrive within tmax (the 100 ms field on the request) or Google throttles us - cuts the QPS it sends. Get throttled and your win rate and your revenue both fall off a cliff.

Napkin math - the no-bid path dominates

DSP win rates run 2-5%. AppNexus disclosed 123 billion auctions producing 10.7 billion impressions - an 8.7% win rate. Take 3% as our design point. For every impression we win, we evaluated

$$ \frac{1}{0.03} \approx 33\ \text{requests}. $$

So 97% of all work is the no-bid path. If we receive even 1% of the 30M/s bidstream, that's 300,000 requests/second, of which only ~9,000/s become bids. The architecture must make the no fast and cheap; the yes can afford a few more milliseconds because there are 33x fewer of them. Any design that does equal work on both paths burns 33x the fleet it needs.

Principal

10 ms internal budget. Where does it go?

Staff

Out of the 100 ms tmax: ~40 ms network each way when we're colocated near the exchange, leaving ~20 ms of slack and a 10 ms hard internal target. Inside that: user-profile lookup 0.5-1.1 ms P99, ML bid-price inference under 20 ms - which already blows the budget unless we use a compressed GBDT model that runs in single-digit ms, budget check ~1 ms, frequency check ~1 ms. Every millisecond here is a network hop we refuse to take. That single constraint - sub-10 ms in a globally fanned-out auction - is what drives every service choice below.

The edge: surviving bidstream inflation

Principal

Naive instinct first. Stand up an ALB in front of a bidder fleet, autoscale on CPU. Why is that wrong here?

Staff

It's tempting because it's the reflex for any HTTP service. It dies on two things specific to RTB. First, latency tax: a standard ALB plus public-internet ingress adds tens of milliseconds and unpredictable jitter - we'd blow tmax and get throttled before our code even runs. Second, bidstream inflation. Header bidding sends the same impression to ~15 SSPs, each fanning to ~100 DSPs - 1,500 bid requests for one ad slot. During the COVID-19 news surge The Trade Desk saw a 10x QPS spike and had to suppress duplicate bidding. If we autoscale on raw QPS we'd 10x our fleet to chase requests that represent the same handful of real impressions.

Napkin math - the inflation tax

One impression, header-bidding fan-out of $15\ \text{SSPs} \times 100\ \text{DSPs} = 1{,}500$ requests. If we sit behind 8 of those 15 SSPs, we see the same impression 8 times within the 100 ms window via different supply paths. At a 10x traffic surge our ingress must absorb $300{,}000 \times 10 = 3{,}000{,}000\ \text{req/s}$ - most of it duplicate or junk (geos we don't target, ad sizes we don't buy). Filtering that in our app tier means paying full TLS + parse + dispatch cost on traffic we will 204 anyway.

To be precise about what reaches the app: the Fabric filter drops wrong-geo/wrong-format junk, but the traffic that passes the filter is still 97% no-bid in-process (eligibility, sellers.json, budget exhausted, model-says-no). So the fleet is sized for filtered request rate - not raw ingress and not the ~3% that wins - because every filtered request still costs a profile read and cheap eligibility checks even when it ends in a 204.

Staff

So we push the filter below our app servers. AWS RTB Fabric (GA Oct 2025) is a purpose-built managed network for exactly this: single-digit-ms private connectivity between exchanges and bidders, ~80% lower networking cost than standard AWS egress, and three documented inline modules - an inline Rate Limiter that enforces per-partner QPS caps at the network layer before a request reaches our app, an inline OpenRTB Filter that drops non-qualifying requests (wrong geo, wrong format) so they never cost us a CPU cycle, and Error Masking. Publisher-domain blocklists are applied via the OpenRTB Filter as JSON-path rules on $.site.domain/$.app.bundle where the inline module supports it - or, if not, as a near-zero-cost early-exit check in the bidder before model invocation. We deploy bidders in its six regions - us-east-1, us-west-2, eu-central-1, eu-west-1, ap-southeast-1, ap-northeast-1 - colocated with the major exchanges.

Principal

Supply-chain verification - sellers.json, ads.txt - where does that run? The network layer?

Staff

No - that's an application-layer check, not an RTB Fabric module. RTB Fabric's documented inline modules are Rate Limiter, OpenRTB Filter, and Error Masking; supply-chain verification is on us. Before ML scoring, the bidder verifies the OpenRTB schain object against each SSP's sellers.json (cached in DynamoDB, refreshed hourly) and the declared site.domain/app.bundle against ads.txt. Requests failing verification return 204. It runs in the cheap eligibility band before we ever invoke the model, so junk and unverified supply cost us almost nothing.

The bidder fleet and the user-profile lookup

Principal

The bidder itself - language, runtime, and where does the user profile live? Criteo famously fought this.

Staff

The bidder is stateless C++ on ECS Fargate, ARM (Graviton) - best perf/dollar for a tight, allocation-sensitive hot loop, and stateless means we scale out horizontally and any node can answer any request. The hard part is the profile. Criteo's war story is the template: they ran Couchbase + Memcached on 3,200 servers, 6 incidents/year, P99 user lookup ~4 ms. They moved to Aerospike's hybrid DRAM+SSD design and landed at 800 servers (78% fewer), P99 under 1 ms, 1 incident in 4 years, 80% less carbon. On AWS the equivalent for a pure key-value access pattern is DynamoDB fronted by DAX. Be honest about the latency profile: DAX cache hits are sub-1 ms; cache misses fall through to DynamoDB at 3-5 ms P99, which still fits inside our 10 ms internal budget with margin. At 300k req/s and a 95% cache-hit rate, ~15k misses/s reach DynamoDB - well within the 150k RCU provisioned. Aerospike on EC2 (rung 3) would extend sub-1 ms coverage to the cache-miss reads too, but the 5 ms miss tail is acceptable here, so we stay on the managed rung.

Napkin math - profile read budget

At 300,000 req/s, every request needs one profile read. DynamoDB strongly-consistent reads would be both slow and expensive; eventually-consistent reads at 4 KB are 0.5 RCU each:

$$ 300{,}000\ \text{req/s} \times 0.5\ \text{RCU} = 150{,}000\ \text{RCU}. $$

That's exactly DynamoDB's quoted high-capacity tier - and it's the cost we avoid. DAX absorbs the read fan-out: at a 95% cache hit rate only $300{,}000 \times 0.05 = 15{,}000\ \text{req/s}$ reach DynamoDB, a 10x table-cost reduction. The 95% served from DAX RAM come back in under 1 ms; the 5% miss tail comes back at 3-5 ms - both inside the 10 ms wall. The 1B-record profile table sits behind it; campaign config and audience segments share the same store.

Principal

ML inference - you said under 20 ms blows a 10 ms budget. How do you square that?

Staff

We don't call a remote model server in the hot path - that's a network hop we can't afford. We train on SageMaker (bid-price prediction plus bid shading), export a compressed GBDT, and run inference in-process inside the C++ bidder. A compressed gradient-boosted-tree scores in single-digit milliseconds with no RPC. SageMaker owns training, versioning, and A/B model rollout; the bidder just loads the artifact. The win path can afford the full scoring; the no-bid path short-circuits on the RTB Fabric filter and on cheap eligibility checks before we ever invoke the model.

Budget pacing - the half-million-dollar bug

Principal

$10,000/day, smooth across 24 uneven hours, 50+ stateless nodes, zero runaway. Naive design?

Staff

Naive: one central budget counter; every node does read remaining; if > 0 bid; on win decrement. It's tempting because it's obviously correct in a single-threaded mental model. At scale it has two fatal flaws. First, the race to zero: 50+ nodes all read "remaining = $40" simultaneously, all bid, all win - we blow past zero because the read and the decrement aren't coordinated. Second, putting that counter in the 10 ms hot path means a network round-trip per bid - we can't.

Principal

And the famous failure mode?

Staff

Meta. Their budget-enforcement guard lost contact with the spend counter during a database failover and defaulted fail-open - kept bidding with no enforcement. Brands lost $100K-$500K overnight. The lesson is absolute: fail-closed is mandatory. But "fail-closed" has to be expressed correctly: it means the bidder emits HTTP 204 within an 8 ms hard internal deadline, never a timeout and never silence. Google Authorized Buyers throttles DSPs whose response rate within tmax drops - a missed or timed-out response is worse than a clean no-bid. So every bidder holds a hard internal deadline: 8 ms beyond which it returns 204 regardless of cause - DAX stall, budget unreadable, anything. A 204 costs us one missed impression; a fail-open costs a client half a million dollars and the account, and a timeout costs us the QPS allocation.

Napkin math - pacing curve and chunk size

Daytime supply runs 3-5x nighttime. A flat $10{,}000/24 = \$417/\text{hour}$ would starve at night and exhaust by noon. The PID controller targets a supply-weighted curve. Chunk size trades freshness against hot-path cost: with 50 nodes and a $10{,}000 budget, a per-node chunk of

$$ \frac{\$10{,}000}{50} \times \frac{1}{288} \approx \$0.69 $$

per 5-second reconciliation tick. Single-late-tick bound: if every node simultaneously exhausts its chunk and the next tick is late, max excess is $50 \times \$0.69 \approx \$35$, i.e. 0.35% of the daily budget. But the dangerous case is a stalled reconciliation - the Lambda or its Kinesis feed stops, nodes keep their last chunk and never get a fresh one. Without a bound, every node spends its full chunk per tick indefinitely. The lease-with-expiry fixes this: a node stops spending a chunk after 2 missed intervals (10 s), so worst-case excess during a stall is at most two chunks per node, $50 \times 2 \times \$0.69 \approx \$69$ (0.69%), after which spending halts entirely until a fresh allocation arrives. Smaller chunks = tighter bound but more reconciliation traffic.

Frequency capping - atomic at 500k/second

Principal

Cap a user at N impressions per campaign per day, at 500k checks/second. Where does the counter live?

Staff

Key schema freq:{user_id}:{campaign_id}:{day_bucket} → an integer counter in ElastiCache for Valkey 8, cluster mode, 5 shards (each a primary + 1 replica). Shard count is derived: 500k ops/s ÷ ~100k ops/s per shard ≈ 5. On a bid: GET the counter, compare to cap, bid or skip. On a win (actually on burl, same as spend): INCR the counter and EXPIRE 86400. The subtlety: do we read from the primary or a replica?

Napkin math - replica reads buy linear scale

At 500,000 freq checks/second, a single Valkey primary tops out around 100k-200k ops/s - hence cluster mode with 5 shards to spread the keyspace, and we additionally read from replicas within each shard, replication lag 1-5 ms. The cost is over-cap: during the lag window a user can be served slightly more than the cap. Industry accepts 5-10% over-cap for linear read scalability. For a soft cap (frequency, not money) that's the right trade - we are not protecting dollars here, we're protecting user experience and brand-fatigue metrics.

ID generation - collision-free without coordination

Principal

You're generating IDs for bid responses and ledger entries. Even at our 300k req/s (and Criteo cites 290 million KV queries/second across their whole fleet as the ceiling case), why isn't UUIDv4 just fine?

Staff

UUIDv4 is the convention for the SSP-generated BidRequest.id, and we don't control that. But we generate two things ourselves: the Bid.id (unique within each response) and every spend-ledger entry ID. The naive move is libuuid for both. At high per-node ID rates, OS-entropy-backed UUID generation becomes a contention hotspot - threads serialize on the entropy pool. For the ledger we also want ordering, which random UUIDs throw away.

Napkin math - why Snowflake for the ledger

A 64-bit Snowflake ID packs a 41-bit millisecond timestamp, a datacenter/worker field, and a 12-bit per-ms sequence:

$$ 2^{12} = 4{,}096\ \text{IDs per worker per millisecond} = 4.096\ \text{million IDs/second/worker}. $$

No coordination, no shared entropy pool, and the IDs are monotonic and time-sortable - which means the spend ledger is naturally ordered for reconciliation and range scans. For Bid.id, where we only need within-response uniqueness and not global ordering, a PRNG-seeded generator (seeded once per thread) avoids the entropy-pool hotspot entirely.

Failure and recovery

Principal

3am. Walk me through the failures and what each one does.

Staff

Six named modes, each with a defined behavior. In every one, fail-closed means emit HTTP 204 within the 8 ms internal deadline - never a timeout, never silence, because Google throttles on response-rate-within-tmax.

1. Budget counter unreadable (the Meta mode): fail-closed - bidders return 204 for the affected campaign within 8 ms. Lost revenue, zero overspend. The reconciliation Lambda alarms; once spend state is readable, bidding resumes.

2. Budget runaway (spend rate > 3x target for 60 s): the circuit-breaker Lambda pauses the campaign, driven from both the Kinesis stream and the CloudWatch EMF spend-velocity metric. CloudWatch alarm pages on-call.

3. Bidstream inflation / thundering herd (10x surge, COVID-news style): RTB Fabric's inline Rate Limiter caps per-partner QPS at the network layer; ECS Fargate scales the fleet on the filtered request rate, not raw ingress, so we don't 10x the fleet for duplicate traffic.

4. Auction duplication: not solvable in real time post-Prebid-2025; mitigated structurally by SPO.

5. DAX stall or failover: DAX failover takes seconds, not zero - it is not transparent. The profile-read path must fail-soft within the 8 ms deadline: if DAX does not answer in time the bidder either reads through to DynamoDB (3-5 ms) or emits 204, but it never blocks on a DAX stall and withholds a response.

6. RTB Fabric control-plane incident: the hot data path can survive a Fabric control-plane wobble, but if ingress itself degrades we need a degraded path, not zero bidding. The break-glass fallback is a dormant NLB path, gated by a feature flag in SSM Parameter Store, that accepts traffic from exchange public endpoints at higher latency. It is wired up and tested but kept out of the hot path; flipping the flag gives us degraded bidding while the Fabric recovers.

For the rest of the data plane: a Kinesis IteratorAge > 10 s alarm triggers emergency chunk revocation (fail-closed 204s); ElastiCache replica loss falls back to another replica in the shard or, worst case, the primary at reduced read capacity; an entire region loss sheds that region's exchange traffic - RTB is inherently per-region colocated, so blast radius is one region's auctions.

Security, multi-tenancy, and compliance

Principal

Multiple advertisers, real user data, the open internet. What's the isolation and compliance story?

Staff

Tenant isolation (advertisers): campaign config, budgets, and spend ledgers are partitioned by advertiser ID at the DynamoDB partition-key level, accessed only through a mediated, partition-keyed data-access layer; IAM policies scope the control-plane Lambdas to per-tenant key prefixes. The bidder is a shared, read-dominant, stateless hot loop - so the cross-tenant guarantee is enforced at the data layer plus an invariant check: the reconciliation Lambda verifies per-tenant spend-sum invariants on every tick and a CloudWatch alarm fires on any mismatch. We deliberately do not run a dedicated ECS service per tenant - that would multiply cost 10-100× for a risk the data-layer mediation and invariant already cover.

Bidder task role (blast radius): the bidder ECS task role is least-privilege - dax:GetItem/BatchGetItem, dynamodb:GetItem/BatchGetItem/Query (no Scan, no writes to campaign config), elasticache actions scoped to the specific cluster ARN, and egress restricted to VPC endpoints only. A compromised bidder cannot enumerate tables, mutate config, or exfiltrate to the public internet.

Bid-stuffing / profile harvesting: bad-actor publishers inflate requests to harvest user profiles from our bid behavior. Defense is RTB Fabric's inline per-partner QPS limit plus application-layer supply-chain verification: before ML scoring, the bidder verifies the OpenRTB schain against each SSP's sellers.json (cached in DynamoDB, refreshed hourly) and the declared domain/bundle against ads.txt; requests failing verification return 204. (sellers.json is not an RTB Fabric module - its inline modules are Rate Limiter, OpenRTB Filter, and Error Masking.)

User data / GDPR: the user-profile store holds pseudonymous IDs and segments, not PII in the clear, and is a regional table - EU user data never leaves EU regions. We honor consent strings (TCF) at the edge - no-bid on absent consent. For right-to-erasure we use crypto-shredding: per-user data is encrypted under a per-user key, and erasure destroys the key, rendering the data unrecoverable without a table-wide delete sweep; frequency counters carry a per-user EXPIRE TTL and auto-delete. DynamoDB encryption at rest (KMS) plus TLS in transit cover NIST 800-53 SC-controls and GDPR Article 32.

Encryption in transit on internal hops: DAX cluster encryption-in-transit enabled at creation; ElastiCache in-transit encryption with an AUTH token required; VPC endpoints (TLS) for DynamoDB and Kinesis. No internal hop is in the clear.

Secrets: all partner credentials, burl signing secrets, and ElastiCache AUTH tokens live in Secrets Manager with automatic rotation; the bidder task role retrieves them at startup - never in env vars or the container image.

Audit: every spend event is a time-sortable Snowflake-keyed ledger entry in Kinesis -> S3, written under S3 Object Lock in compliance mode (7-year retention), versioning on, with a bucket policy that denies DeleteObject and any weakening of retention - a genuinely tamper-evident financial trail. CloudTrail (management + data events for the profile/config DynamoDB tables and the ledger S3 bucket) writes to a separate Object-Lock'd logging account with log-file validation on. Together these satisfy SOC 2, ISO 27001, and advertiser billing-dispute requirements.

Cost

Principal

Order-of-magnitude monthly cost, and where the dollars actually go.

Napkin math - monthly cost shape (300k req/s design point)

Networking is the dominant line in standard RTB and the reason RTB Fabric exists. RTB Fabric bills per message hop, and an auction is two hops - BidRequest in and BidResponse/204 out. So at 300k req/s the transaction count is bidirectional:

$$ 300{,}000\ \text{req/s} \times 2\ \text{hops} \times 86{,}400 \times 30 \approx 1.56\ \text{trillion transactions/month}. $$

This is double our earlier single-direction estimate. The ~80% saving applies only when exchange partners are RTB Fabric participants (internal rate ~$4.50/billion vs external ~$34/billion). Assuming 70%+ of volume is internal, the blended bill is on the order of $1.56\text{T} \times (0.7 \times \$4.5 + 0.3 \times \$34)/10^9 \approx \$21{,}000/\text{month}$ - still the single biggest line, but verify the internal/external split per exchange partner before committing.

Compute: the no-bid-dominant path means the fleet is sized for filtered traffic, not raw ingress. Fargate ARM (Graviton) at ~30% better perf/dollar than x86 keeps the bidder fleet a fraction of a naive sizing.

DAX: ~3 × dax.r5.2xlarge per region × 6 regions ≈ $6,600/month; the DynamoDB throughput saving (~15k effective RCU vs 150k) is ~$4,400-$26,000/month depending on read mode - so DAX is net cost-positive even before counting the latency win.

ElastiCache for Valkey 8: 5-shard cluster × 6 regions at cache.r7g.xlarge ≈ $2,555/region ≈ $15,300/month of cheap RAM-bound counters.

Kinesis bills per shard-hour on the spend/notice stream. Cross-AZ transfer: DAX and ElastiCache clients are configured for AZ-affinity routing, which eliminates the $0.01/GB cross-AZ data-transfer charge that would otherwise accrue on every read.

The cost lesson: in RTB, networking dwarfs compute, which is the entire economic justification for adopting RTB Fabric over a hand-rolled ALB+EC2 stack.

Did we ever leave AWS?

Principal

Criteo runs Aerospike. Did you leave AWS for the profile store, or anywhere?

Staff

No. A year ago this design would have been forced to leave - there was no AWS managed answer to the sub-10 ms colocated-with-exchanges networking problem, and teams hand-rolled it or went off-platform. As of October 2025, AWS RTB Fabric closes exactly that gap: managed single-digit-ms ingress, inline rate limiting, inline OpenRTB filtering, 80%-lower networking cost. That was the one component that historically had no AWS option; now it does.

For the profile store, Criteo's Aerospike result (sub-1 ms P99, 78% fewer servers) is met by DynamoDB + DAX within our real constraint - a sub-10 ms wall - even though DAX's 5% cache-miss tail is 3-5 ms rather than sub-1 ms. Since no hard requirement forces sub-1 ms across the full distribution, there is no reason to self-manage Aerospike on EC2. ML stays on SageMaker with in-process GBDT inference. Pacing and circuit-breaking are Lambda. IDs are a library, not a service. Every component lands on rung 1 or 2 of the ladder. The honest summary: the existence of RTB Fabric is the reason this design never leaves AWS.

↓ podcast script (.txt)