SYSTEM DESIGN STUDIO — Podcast Script
Topic: Push notification fan-out at scale
Date: 2026-06-17  |  Difficulty: staff

Two voices: PRINCIPAL (skeptical interviewer) and STAFF (proposing engineer).
For ElevenLabs Projects: assign each speaker label to a distinct voice ID.

---

[INTRO] Fan-out looks like a loop: an event happens, you walk the follower list, you write one notification per follower. That loop is fine until a celebrity with 50 million followers posts and the loop becomes a write storm that takes down the inbox table, melts the push provider's connection pool, and -- if you are unlucky -- sends the same "X liked your photo" twice to 50 million phones at 3am local time. The interesting version of this problem is delivering every notification effectively once -- exactly-once is a marketing claim no at-least-once pipeline can honour through every failover -- in the user's timezone, while one account fans out to a hundred million people.


=== The fan-out problem and the celebrity threshold ===

PRINCIPAL: Walk me through the simplest thing that works. An event lands -- a like, a comment, a new post from someone you follow. What happens?

STAFF: Naive instinct, and it's the right place to start: write-time fan-out. When user A does something, I look up A's followers and write one notification row per follower into their inbox. Reads become trivial -- each user just queries their own inbox, already materialised. This is the "fan-out on write" pattern and for the median user it is exactly correct.

PRINCIPAL: Median user. So you already know where this breaks. Show me.

STAFF: So I split users by follower count. Below a threshold -- I'll use 10,000 followers, which is where Twitter, Meta, and LinkedIn all landed publicly -- I keep write-time fan-out. Above it, the account is a "celebrity" and I do not fan out on write. I store the event once as a celebrity-event pointer, and I merge it into each follower's stream at read time. This is the hybrid two-tier model. The math inverts cleanly: a celebrity post is now $O(1)$ writes instead of $O(F)$. The cost moves to read time, where it's bounded by how many followers actually open the app, not by the total follower count.

PRINCIPAL: What carries the event from the originator into the fan-out workers? Why not just call the workers synchronously?

STAFF: Synchronous fan-out couples the user's write latency to the size of their follower list -- unacceptable. I put a durable, ordered log between them: Amazon Kinesis Data Streams, partition key = source user ID. That partition key matters: it preserves per-originator ordering, so two posts from the same account fan out in order, and it spreads load across shards by author. Fan-out workers are AWS Lambda with a Kinesis trigger.


=== Notification ID generation at scale ===

PRINCIPAL: Every notification needs an ID. The obvious answer is a UUID. Convince me that's wrong.

STAFF: UUIDv4 is the naive instinct because it's a one-liner with no coordinator and no collisions. The problem is that it's random, and random keys are poison for a time-series write pattern on a B-tree or LSM index. Inserting random keys scatters writes across the entire keyspace, causing constant index page splits and cache misses. On DynamoDB and Cassandra this shows up as write amplification and degraded throughput exactly under the load we just sized.

STAFF: So I use a ULID as the notification ID. 128 bits: a 48-bit millisecond timestamp followed by 80 bits of randomness, Crockford base32-encoded to 26 URL-safe characters. Lexical sort order equals time order. No central coordinator -- the timestamp comes from the worker's clock and 80 bits of entropy makes collisions within a millisecond astronomically unlikely. The collision math: within a single millisecond you'd need a birthday collision across $2^{80}$ values. Even at $10^{6}$ IDs in one millisecond, $$ P_{collision} \approx \frac{n^{2}}{2 \cdot 2^{80}} = \frac{(10^{6})^{2}}{2 \cdot 2^{80}} \approx 4 \times 10^{-13}, $$ negligible, and we never generate $10^6$ in one millisecond per worker anyway.

PRINCIPAL: Discord uses Snowflake. Why not Snowflake?

STAFF: Snowflake (41-bit timestamp + 10-bit machine ID + 12-bit sequence) is excellent and Discord's choice is sound -- but it requires a coordinated machine-ID allocation. Someone has to guarantee each generator gets a unique worker ID, which means a registry or ZooKeeper-style assignment. In a Lambda fleet that scales to thousands of concurrent executions and recycles constantly, allocating and reclaiming stable machine IDs is a coordination headache. ULID's 80 random bits buy us out of that coordination entirely. We give up Snowflake's guaranteed strict monotonicity within a node, but we don't need strict monotonicity -- millisecond ordering plus a tiebreaker is enough for a notification stream.

PRINCIPAL: Where do these rows live?

STAFF: The inbox is a DynamoDB table keyed (tenant_id#user_id PK, notification_id SK) where the SK is the ULID. The partition key is tenant-scoped from the start -- tenant_id is the literal leading segment, which is what the dynamodb:LeadingKeys IAM condition clamps on (more on that in the multi-tenancy section). Newest-first reads are a descending range query on the partition -- the cursor pagination falls out for free. A 90-day TTL on each item keeps the table from growing without bound; notifications older than 90 days self-expire. A GSI on (tenant_id#user_id, read_status) serves the unread-count badge without scanning the partition -- and note that GSI doubles the write cost (base WCU + GSI WCU), which we account for in the cost sketch. And Global Tables replicate the inbox cross-region so a user in another region reads from a local replica.


=== The delivery layer -- APNs, FCM, deduplication ===

PRINCIPAL: A row in DynamoDB isn't a notification on a phone. How does it get to APNs and FCM, and what goes wrong at burst?

STAFF: Naive instinct: the fan-out Lambda opens an HTTP/2 connection to APNs and FCM and pushes directly. That fails three ways at scale. First, APNs throttles aggressively during bursts -- best practice is a pool of 5-10 HTTP/2 connections each sustaining $\sim$2,000 req/s, and a Lambda that spins up and dies can't maintain a warm pool. Second, FCM caps a project at 600,000 messages/minute. Third, dead tokens: a token unused for 270 days expires on FCM, and invalid-token responses from both providers must immediately stop sends to that token -- otherwise you burn quota retrying corpses.

STAFF: So the fan-out worker does not call providers. It writes the inbox row and enqueues a delivery job onto SQS Standard queues, one per channel: mobile-push, email, in-app, SMS. Every message carries a mandatory tenant_id message attribute so a consumer can never act on a job without knowing whose data it is. Each queue has a dead-letter queue with a redrive after 3 failed receives. Channel-specific consumer Lambdas drain each queue at a controlled rate. For mobile push I deliberately do not hand-roll APNs/FCM clients -- I use Amazon Pinpoint on top of SNS mobile push. SNS owns the platform-endpoint abstraction: token registration, the APNs/FCM feedback loop, and auto-deregistration of stale tokens. Pinpoint adds the campaign/quiet-hours/timezone layer on top.

PRINCIPAL: SQS Standard is at-least-once -- you just gave up FIFO's window entirely. A redrive after a visibility-timeout expiry can re-deliver. How do you not double-notify 50 million people?

STAFF: Right -- at-least-once plus retries equals duplicate risk, and the classic bug is a dedup window shorter than the provider's own retry timeout, which leaks duplicates through the seam. OneSignal documents a 30-day dedup window for exactly this reason. So the push consumer Lambda runs an idempotency check before every send against Amazon MemoryDB for Redis -- same Redis API, but a durable multi-AZ transaction log so the dedup state survives a failover instead of resetting empty: key = "dedup:" + sha256(tenant_id + ":" + notification_ulid + ":" + recipient_id) if SET key 1 NX EX 2592000: # 30-day TTL, only set if absent deliver() # we won the race; send else: skip() # someone already delivered this SET ... NX is atomic, so even two consumer Lambdas processing a redriven message race on the same key and exactly one wins. The 30-day TTL comfortably exceeds any provider retry horizon, so duplicates can't slip through the seam. Note the key is tenant-prefixed -- without it, two tenants minting colliding ULIDs (or sharing a recipient device across apps) could cross-suppress each other's notifications. The prefix makes the dedup keyspace tenant-isolated.

PRINCIPAL: And when MemoryDB is unreachable mid-burst? Does the consumer block, or send and pray?

STAFF: Neither silently. The draft named the tension and resolved nothing -- that's the bug. We resolve it: fail-closed. On a MemoryDB timeout or error the consumer Lambda throws, does not ack the SQS message, and lets it become visible again for retry with backoff. The job stays in the queue. We never fail-open into a blind send. For the catastrophic case -- MemoryDB fully down for a sustained window -- consent-bearing channels fall back to a DynamoDB conditional write as a durable second dedup store, so legally-gated sends still get an idempotency guarantee even with the cache gone.


=== Quiet hours, preferences, and timezone bugs ===

PRINCIPAL: A user sets quiet hours 22:00-07:00. The classic failure is computing that in UTC. Talk me through it.

STAFF: This is the bug that silently excludes whole user segments. If you store quiet hours as 22:00-07:00 and compare against now() in UTC, then every user in, say, UTC+9 has their "10pm-7am quiet window" applied to UTC 10pm-7am -- which is 7am-4pm local. You either spam them at midnight or suppress them all day. The fix is non-negotiable: store the user's IANA timezone (e.g. Asia/Tokyo) per user, and evaluate quiet hours in user-local time.

PRINCIPAL: And preferences themselves -- you can't read DynamoDB on every one of 10 million deliveries.

STAFF: Correct -- preferences go behind a write-through ElastiCache in front of the DynamoDB preferences table. The trap here is cache TTL. A long TTL (say 1 hour) means a user who turns off notifications keeps getting them for up to an hour -- that's not just annoying, under GDPR it's a consent violation for up to an hour. But a very short TTL hammers DynamoDB. So I use a moderate TTL of 5 minutes as a backstop, plus event-driven invalidation. And here's the managed-first correction the draft missed: I do not stand up a dedicated Kinesis stream for preference changes. The preferences table already knows when it changes -- DynamoDB Streams emits a change record on every write for free, and an EventBridge Pipe wires that stream directly to the cache-invalidation Lambda. No separate ingest path to operate, and the invalidation trigger is source-of-truth-driven by construction. The TTL only matters if a stream record is somehow missed; the common path is invalidated in real time.

PRINCIPAL: You keep saying "marketing channels." Under GDPR and ePrivacy, doesn't all push need consent?

STAFF: You're right, and the draft got it wrong by gating only marketing. Push to a device is push to a device -- GDPR Art. 7 and ePrivacy treat it as requiring a lawful basis regardless of whether we'd call the content "transactional" or "social." So the gate moves to every push send: the push consumer reads a push_consent attribute -- stored in the preferences table as {granted_at, version} -- and default-denies when it is absent. No consent record, no push. Quiet-hours is treated the same way: if the user's timezone or preference is unavailable, we defer rather than send -- quiet-hours fails closed, not open.

STAFF: One more storm to prevent: even with dedup, a single user can get blasted -- imagine 500 people liking your post in a minute. I run a sliding-window rate limit per (user, channel) in ElastiCache: above a threshold, collapse into a digest ("500 people liked your photo") rather than 500 separate buzzes. That protects both the user's attention and the provider quota.


=== Failure modes, recovery, and multi-tenancy ===

PRINCIPAL: It's 3am. APNs starts returning 429s and 500s in bulk. What happens to my system?

STAFF: The buffer absorbs it. Failed sends from the push consumer don't get acknowledged, so SQS makes them visible again after the visibility timeout; SNS retries with exponential backoff. After 3 failed receives a message lands in the DLQ instead of looping forever. Crucially I cap the push consumer's Lambda reserved concurrency so we never stampede APNs harder during its degradation -- concurrency limits are the back-pressure valve. When APNs recovers I redrive the DLQ. Because every send is dedup-guarded, redriving the DLQ can't double-deliver the messages that actually succeeded before the error.

PRINCIPAL: This is multi-tenant -- many apps/brands on one platform. How do you keep tenant A's notifications and tokens away from tenant B, and what compliance does that map to?

STAFF: Tenant isolation runs through the keys and the IAM boundary, mapped to the controls auditors actually ask for: SOC 2 CC6 (logical access control): every inbox partition key is tenant-scoped (tenant_id#user_id, with tenant_id as the literal leading segment), and consumer Lambdas assume per-tenant IAM roles with the dynamodb:LeadingKeys condition key bound to that tenant_id so a role can only read its own tenant's partitions. The dedup key, the SQS message attributes, and the cache keys are all tenant-prefixed too -- the tenant boundary is in the data, not just the IAM policy. No query can cross the tenant boundary.. Least privilege, scoped concretely: the fan-out Lambda role is not "DynamoDB access." It is dynamodb:PutItem/Query/BatchGetItem on the specific inbox and preferences table ARNs (with LeadingKeys), sqs:SendMessage on the four channel-queue ARNs, and nothing else. The push consumer adds sns:Publish on its platform-application ARNs only -- not sns:CreatePlatformEndpoint (token registration is a separate, narrower role).. ISO 27001 A.10 / encryption: notification bodies can contain PII ("Alice messaged you"). At rest: DynamoDB SSE-KMS, MemoryDB and ElastiCache at-rest encryption. In transit: transit_encryption_enabled=true on MemoryDB and ElastiCache with Redis ACL users (no anonymous access), and the data plane runs over VPC interface endpoints for Kinesis/SQS/SNS/Secrets Manager and a gateway endpoint for DynamoDB -- traffic never traverses the public internet. Provider credentials (APNs auth key, FCM service-account keys) live in AWS Secrets Manager with rotation, never in env vars or code. The push payload to APNs/FCM carries only a reference plus a generic body when the device is locked -- sensitive content is fetched after authentication, not shipped in the push.. One CMK per table, not per tenant: to be precise -- DynamoDB SSE applies one KMS CMK per table, not per item, so the pooled inbox table uses a shared CMK and tenant isolation is enforced by LeadingKeys, not by per-tenant keys. A genuine per-tenant CMK requires a table-per-tenant silo; we name that as an enterprise-tier upgrade path, not the default. The earlier "per-tenant KMS key" phrasing was wrong and is corrected here.. GDPR (consent and residency): consent is a first-class preference checked on every push send (Art. 7 lawful basis), default-deny when absent; a consent withdrawal invalidates the cache via the DynamoDB-Streams Pipe immediately. Right-to-erasure purges all Global-Tables replicas, the dedup keys, the preference cache, and the audit log -- not just the home region. EU-tenant residency: tenants flagged EU-only are pinned to EU regions and excluded from non-EU Global-Tables replication. TTLs on the inbox (90 days) and dedup keys (30 days) bound retention.. NIST SP 800-53 AU-2/AU-9 (audit, integrity): every delivery attempt -- enqueued, sent, suppressed by quiet hours, denied by consent, deduped, or DLQ'd -- emits a structured audit event via Kinesis Firehose to an S3 bucket with Object Lock in compliance mode (WORM) in a dedicated logging account -- the records cannot be altered or deleted within the retention window, even by an admin. CloudTrail captures all control-plane actions. We can prove, per notification, who we tried to notify, when, on what channel, and why it was or wasn't delivered -- and prove the proof wasn't tampered with..

PRINCIPAL: What does this cost me at steady state?


=== Did we ever leave AWS? ===

PRINCIPAL: Final question. Anywhere in this design did you have to leave AWS?

STAFF: No. Every tier has a managed AWS service that fits: Kinesis for the event log, Lambda for fan-out and delivery workers, DynamoDB for the ULID-keyed inbox and preferences (with DynamoDB Streams + an EventBridge Pipe driving preference-cache invalidation -- no extra stream to run), SQS Standard for the per-channel delivery buffer, SNS + Pinpoint for APNs/FCM transport and quiet-hours scheduling, SES for email, MemoryDB for the durable dedup keyspace and ElastiCache for the evictable follower-list and rate-limit caches, KMS for at-rest encryption, Secrets Manager for provider credentials, and Firehose/S3 (Object Lock) plus CloudTrail for the audit trail. The only non-AWS endpoints are APNs and FCM themselves -- and those aren't a choice, they're Apple's and Google's networks, which every notification system on Earth must terminate at. SNS abstracts even those. We never left AWS.