SYSTEM DESIGN STUDIO — Podcast Script
Topic: Real-time bidding engine at scale
Date: 2026-06-19  |  Difficulty: staff

Two voices: PRINCIPAL (skeptical interviewer) and STAFF (proposing engineer).
For ElevenLabs Projects: assign each speaker label to a distinct voice ID.

---

[INTRO] A single page-load fires an auction. Multiply by the open web and the bidstream is 30 million bid requests/second. Each one gives our DSP a hard 100 ms wall clock to decide, of which the network eats most - so the bidder itself has 10 ms to look up a user, score them with a model, check a budget, check a frequency cap, and either return a price or stay silent. Get the budget math wrong and you overspend a client's $10,000 by half a million dollars overnight (Meta did). Get the silence wrong - bid on 100% instead of 2-5% of requests - and you melt the fleet. This is a system whose dominant path is saying no in under ten milliseconds, 19 times more often than it says yes.


=== The problem, and the numbers we design to ===

PRINCIPAL: Frame the system before any boxes. What are we, and what's the SLA that actually gets us thrown off an exchange?

STAFF: We're a Demand-Side Platform - a bidder. SSPs and exchanges run the auction; we connect to them and they fan out an OpenRTB BidRequest to every DSP in parallel over HTTP POST. We answer with a BidResponse carrying a price, or we answer HTTP 204 No Content, which is the no-bid and the most bandwidth-efficient thing we can do. The SLA that ends careers: Google Authorized Buyers requires 85% of our responses to arrive within tmax (the 100 ms field on the request) or Google throttles us - cuts the QPS it sends. Get throttled and your win rate and your revenue both fall off a cliff.

PRINCIPAL: 10 ms internal budget. Where does it go?

STAFF: Out of the 100 ms tmax: ~40 ms network each way when we're colocated near the exchange, leaving ~20 ms of slack and a 10 ms hard internal target. Inside that: user-profile lookup 0.5-1.1 ms P99, ML bid-price inference under 20 ms - which already blows the budget unless we use a compressed GBDT model that runs in single-digit ms, budget check ~1 ms, frequency check ~1 ms. Every millisecond here is a network hop we refuse to take. That single constraint - sub-10 ms in a globally fanned-out auction - is what drives every service choice below.


=== The edge: surviving bidstream inflation ===

PRINCIPAL: Naive instinct first. Stand up an ALB in front of a bidder fleet, autoscale on CPU. Why is that wrong here?

STAFF: It's tempting because it's the reflex for any HTTP service. It dies on two things specific to RTB. First, latency tax: a standard ALB plus public-internet ingress adds tens of milliseconds and unpredictable jitter - we'd blow tmax and get throttled before our code even runs. Second, bidstream inflation. Header bidding sends the same impression to ~15 SSPs, each fanning to ~100 DSPs - 1,500 bid requests for one ad slot. During the COVID-19 news surge The Trade Desk saw a 10x QPS spike and had to suppress duplicate bidding. If we autoscale on raw QPS we'd 10x our fleet to chase requests that represent the same handful of real impressions.

STAFF: So we push the filter below our app servers. AWS RTB Fabric (GA Oct 2025) is a purpose-built managed network for exactly this: single-digit-ms private connectivity between exchanges and bidders, ~80% lower networking cost than standard AWS egress, and three documented inline modules - an inline Rate Limiter that enforces per-partner QPS caps at the network layer before a request reaches our app, an inline OpenRTB Filter that drops non-qualifying requests (wrong geo, wrong format) so they never cost us a CPU cycle, and Error Masking. Publisher-domain blocklists are applied via the OpenRTB Filter as JSON-path rules on $.site.domain/$.app.bundle where the inline module supports it - or, if not, as a near-zero-cost early-exit check in the bidder before model invocation. We deploy bidders in its six regions - us-east-1, us-west-2, eu-central-1, eu-west-1, ap-southeast-1, ap-northeast-1 - colocated with the major exchanges.

PRINCIPAL: Supply-chain verification - sellers.json, ads.txt - where does that run? The network layer?

STAFF: No - that's an application-layer check, not an RTB Fabric module. RTB Fabric's documented inline modules are Rate Limiter, OpenRTB Filter, and Error Masking; supply-chain verification is on us. Before ML scoring, the bidder verifies the OpenRTB schain object against each SSP's sellers.json (cached in DynamoDB, refreshed hourly) and the declared site.domain/app.bundle against ads.txt. Requests failing verification return 204. It runs in the cheap eligibility band before we ever invoke the model, so junk and unverified supply cost us almost nothing.


=== The bidder fleet and the user-profile lookup ===

PRINCIPAL: The bidder itself - language, runtime, and where does the user profile live? Criteo famously fought this.

STAFF: The bidder is stateless C++ on ECS Fargate, ARM (Graviton) - best perf/dollar for a tight, allocation-sensitive hot loop, and stateless means we scale out horizontally and any node can answer any request. The hard part is the profile. Criteo's war story is the template: they ran Couchbase + Memcached on 3,200 servers, 6 incidents/year, P99 user lookup ~4 ms. They moved to Aerospike's hybrid DRAM+SSD design and landed at 800 servers (78% fewer), P99 under 1 ms, 1 incident in 4 years, 80% less carbon. On AWS the equivalent for a pure key-value access pattern is DynamoDB fronted by DAX. Be honest about the latency profile: DAX cache hits are sub-1 ms; cache misses fall through to DynamoDB at 3-5 ms P99, which still fits inside our 10 ms internal budget with margin. At 300k req/s and a 95% cache-hit rate, ~15k misses/s reach DynamoDB - well within the 150k RCU provisioned. Aerospike on EC2 (rung 3) would extend sub-1 ms coverage to the cache-miss reads too, but the 5 ms miss tail is acceptable here, so we stay on the managed rung.

PRINCIPAL: ML inference - you said under 20 ms blows a 10 ms budget. How do you square that?

STAFF: We don't call a remote model server in the hot path - that's a network hop we can't afford. We train on SageMaker (bid-price prediction plus bid shading), export a compressed GBDT, and run inference in-process inside the C++ bidder. A compressed gradient-boosted-tree scores in single-digit milliseconds with no RPC. SageMaker owns training, versioning, and A/B model rollout; the bidder just loads the artifact. The win path can afford the full scoring; the no-bid path short-circuits on the RTB Fabric filter and on cheap eligibility checks before we ever invoke the model.


=== Budget pacing - the half-million-dollar bug ===

PRINCIPAL: $10,000/day, smooth across 24 uneven hours, 50+ stateless nodes, zero runaway. Naive design?

STAFF: Naive: one central budget counter; every node does read remaining; if > 0 bid; on win decrement. It's tempting because it's obviously correct in a single-threaded mental model. At scale it has two fatal flaws. First, the race to zero: 50+ nodes all read "remaining = $40" simultaneously, all bid, all win - we blow past zero because the read and the decrement aren't coordinated. Second, putting that counter in the 10 ms hot path means a network round-trip per bid - we can't.

PRINCIPAL: And the famous failure mode?

STAFF: Meta. Their budget-enforcement guard lost contact with the spend counter during a database failover and defaulted fail-open - kept bidding with no enforcement. Brands lost $100K-$500K overnight. The lesson is absolute: fail-closed is mandatory. But "fail-closed" has to be expressed correctly: it means the bidder emits HTTP 204 within an 8 ms hard internal deadline, never a timeout and never silence. Google Authorized Buyers throttles DSPs whose response rate within tmax drops - a missed or timed-out response is worse than a clean no-bid. So every bidder holds a hard internal deadline: 8 ms beyond which it returns 204 regardless of cause - DAX stall, budget unreadable, anything. A 204 costs us one missed impression; a fail-open costs a client half a million dollars and the account, and a timeout costs us the QPS allocation.


=== Frequency capping - atomic at 500k/second ===

PRINCIPAL: Cap a user at N impressions per campaign per day, at 500k checks/second. Where does the counter live?

STAFF: Key schema freq:{user_id}:{campaign_id}:{day_bucket} &rarr; an integer counter in ElastiCache for Valkey 8, cluster mode, 5 shards (each a primary + 1 replica). Shard count is derived: 500k ops/s &divide; ~100k ops/s per shard &asymp; 5. On a bid: GET the counter, compare to cap, bid or skip. On a win (actually on burl, same as spend): INCR the counter and EXPIRE 86400. The subtlety: do we read from the primary or a replica?


=== ID generation - collision-free without coordination ===

PRINCIPAL: You're generating IDs for bid responses and ledger entries. Even at our 300k req/s (and Criteo cites 290 million KV queries/second across their whole fleet as the ceiling case), why isn't UUIDv4 just fine?

STAFF: UUIDv4 is the convention for the SSP-generated BidRequest.id, and we don't control that. But we generate two things ourselves: the Bid.id (unique within each response) and every spend-ledger entry ID. The naive move is libuuid for both. At high per-node ID rates, OS-entropy-backed UUID generation becomes a contention hotspot - threads serialize on the entropy pool. For the ledger we also want ordering, which random UUIDs throw away.


=== Failure and recovery ===

PRINCIPAL: 3am. Walk me through the failures and what each one does.

STAFF: Six named modes, each with a defined behavior. In every one, fail-closed means emit HTTP 204 within the 8 ms internal deadline - never a timeout, never silence, because Google throttles on response-rate-within-tmax. 1. Budget counter unreadable (the Meta mode): fail-closed - bidders return 204 for the affected campaign within 8 ms. Lost revenue, zero overspend. The reconciliation Lambda alarms; once spend state is readable, bidding resumes. 2. Budget runaway (spend rate > 3x target for 60 s): the circuit-breaker Lambda pauses the campaign, driven from both the Kinesis stream and the CloudWatch EMF spend-velocity metric. CloudWatch alarm pages on-call. 3. Bidstream inflation / thundering herd (10x surge, COVID-news style): RTB Fabric's inline Rate Limiter caps per-partner QPS at the network layer; ECS Fargate scales the fleet on the filtered request rate, not raw ingress, so we don't 10x the fleet for duplicate traffic. 4. Auction duplication: not solvable in real time post-Prebid-2025; mitigated structurally by SPO. 5. DAX stall or failover: DAX failover takes seconds, not zero - it is not transparent. The profile-read path must fail-soft within the 8 ms deadline: if DAX does not answer in time the bidder either reads through to DynamoDB (3-5 ms) or emits 204, but it never blocks on a DAX stall and withholds a response. 6. RTB Fabric control-plane incident: the hot data path can survive a Fabric control-plane wobble, but if ingress itself degrades we need a degraded path, not zero bidding. The break-glass fallback is a dormant NLB path, gated by a feature flag in SSM Parameter Store, that accepts traffic from exchange public endpoints at higher latency. It is wired up and tested but kept out of the hot path; flipping the flag gives us degraded bidding while the Fabric recovers. For the rest of the data plane: a Kinesis IteratorAge > 10 s alarm triggers emergency chunk revocation (fail-closed 204s); ElastiCache replica loss falls back to another replica in the shard or, worst case, the primary at reduced read capacity; an entire region loss sheds that region's exchange traffic - RTB is inherently per-region colocated, so blast radius is one region's auctions.


=== Security, multi-tenancy, and compliance ===

PRINCIPAL: Multiple advertisers, real user data, the open internet. What's the isolation and compliance story?

STAFF: Tenant isolation (advertisers): campaign config, budgets, and spend ledgers are partitioned by advertiser ID at the DynamoDB partition-key level, accessed only through a mediated, partition-keyed data-access layer; IAM policies scope the control-plane Lambdas to per-tenant key prefixes. The bidder is a shared, read-dominant, stateless hot loop - so the cross-tenant guarantee is enforced at the data layer plus an invariant check: the reconciliation Lambda verifies per-tenant spend-sum invariants on every tick and a CloudWatch alarm fires on any mismatch. We deliberately do not run a dedicated ECS service per tenant - that would multiply cost 10-100x for a risk the data-layer mediation and invariant already cover. Bidder task role (blast radius): the bidder ECS task role is least-privilege - dax:GetItem/BatchGetItem, dynamodb:GetItem/BatchGetItem/Query (no Scan, no writes to campaign config), elasticache actions scoped to the specific cluster ARN, and egress restricted to VPC endpoints only. A compromised bidder cannot enumerate tables, mutate config, or exfiltrate to the public internet. Bid-stuffing / profile harvesting: bad-actor publishers inflate requests to harvest user profiles from our bid behavior. Defense is RTB Fabric's inline per-partner QPS limit plus application-layer supply-chain verification: before ML scoring, the bidder verifies the OpenRTB schain against each SSP's sellers.json (cached in DynamoDB, refreshed hourly) and the declared domain/bundle against ads.txt; requests failing verification return 204. (sellers.json is not an RTB Fabric module - its inline modules are Rate Limiter, OpenRTB Filter, and Error Masking.) User data / GDPR: the user-profile store holds pseudonymous IDs and segments, not PII in the clear, and is a regional table - EU user data never leaves EU regions. We honor consent strings (TCF) at the edge - no-bid on absent consent. For right-to-erasure we use crypto-shredding: per-user data is encrypted under a per-user key, and erasure destroys the key, rendering the data unrecoverable without a table-wide delete sweep; frequency counters carry a per-user EXPIRE TTL and auto-delete. DynamoDB encryption at rest (KMS) plus TLS in transit cover NIST 800-53 SC-controls and GDPR Article 32. Encryption in transit on internal hops: DAX cluster encryption-in-transit enabled at creation; ElastiCache in-transit encryption with an AUTH token required; VPC endpoints (TLS) for DynamoDB and Kinesis. No internal hop is in the clear. Secrets: all partner credentials, burl signing secrets, and ElastiCache AUTH tokens live in Secrets Manager with automatic rotation; the bidder task role retrieves them at startup - never in env vars or the container image. Audit: every spend event is a time-sortable Snowflake-keyed ledger entry in Kinesis -> S3, written under S3 Object Lock in compliance mode (7-year retention), versioning on, with a bucket policy that denies DeleteObject and any weakening of retention - a genuinely tamper-evident financial trail. CloudTrail (management + data events for the profile/config DynamoDB tables and the ledger S3 bucket) writes to a separate Object-Lock'd logging account with log-file validation on. Together these satisfy SOC 2, ISO 27001, and advertiser billing-dispute requirements.


=== Cost ===

PRINCIPAL: Order-of-magnitude monthly cost, and where the dollars actually go.


=== Did we ever leave AWS? ===

PRINCIPAL: Criteo runs Aerospike. Did you leave AWS for the profile store, or anywhere?

STAFF: No. A year ago this design would have been forced to leave - there was no AWS managed answer to the sub-10 ms colocated-with-exchanges networking problem, and teams hand-rolled it or went off-platform. As of October 2025, AWS RTB Fabric closes exactly that gap: managed single-digit-ms ingress, inline rate limiting, inline OpenRTB filtering, 80%-lower networking cost. That was the one component that historically had no AWS option; now it does. For the profile store, Criteo's Aerospike result (sub-1 ms P99, 78% fewer servers) is met by DynamoDB + DAX within our real constraint - a sub-10 ms wall - even though DAX's 5% cache-miss tail is 3-5 ms rather than sub-1 ms. Since no hard requirement forces sub-1 ms across the full distribution, there is no reason to self-manage Aerospike on EC2. ML stays on SageMaker with in-process GBDT inference. Pacing and circuit-breaking are Lambda. IDs are a library, not a service. Every component lands on rung 1 or 2 of the ladder. The honest summary: the existence of RTB Fabric is the reason this design never leaves AWS.