Push notification fan-out at scale

Fan-out looks like a loop: an event happens, you walk the follower list, you write one notification per follower. That loop is fine until a celebrity with 50 million followers posts and the loop becomes a write storm that takes down the inbox table, melts the push provider's connection pool, and — if you are unlucky — sends the same "X liked your photo" twice to 50 million phones at 3am local time. The interesting version of this problem is delivering every notification effectively once — exactly-once is a marketing claim no at-least-once pipeline can honour through every failover — in the user's timezone, while one account fans out to a hundred million people.

The fan-out problem and the celebrity threshold

Principal

Walk me through the simplest thing that works. An event lands — a like, a comment, a new post from someone you follow. What happens?

Staff

Naive instinct, and it's the right place to start: write-time fan-out. When user A does something, I look up A's followers and write one notification row per follower into their inbox. Reads become trivial — each user just queries their own inbox, already materialised. This is the "fan-out on write" pattern and for the median user it is exactly correct.

Principal

Median user. So you already know where this breaks. Show me.

Napkin math — the celebrity write storm

A celebrity with $F = 50 \times 10^{6}$ followers posts once. Write-time fan-out demands one inbox write per follower:

$$ W = F = 5 \times 10^{7}\ \text{writes for a single event.} $$

If we want that materialised within a $t = 60\ \text{s}$ SLA, the sustained write rate is

$$ R_{w} = \frac{5 \times 10^{7}}{60} \approx 8.3 \times 10^{5}\ \text{writes/second} \;\text{from one post.}$$

A DynamoDB table charges 1 WCU per 1 KB write. At $\sim$300 byte notification records that is $\sim$833k WCU sustained — and that is one celebrity. Ten celebrities posting in the same minute is $8.3 \times 10^{6}$ writes/s. The partition for a hot account also concentrates the follower-list read on a single key. The loop doesn't scale; it detonates.

Staff

So I split users by follower count. Below a threshold — I'll use 10,000 followers, which is where Twitter, Meta, and LinkedIn all landed publicly — I keep write-time fan-out. Above it, the account is a "celebrity" and I do not fan out on write. I store the event once as a celebrity-event pointer, and I merge it into each follower's stream at read time.

This is the hybrid two-tier model. The math inverts cleanly: a celebrity post is now $O(1)$ writes instead of $O(F)$. The cost moves to read time, where it's bounded by how many followers actually open the app, not by the total follower count.

Principal

What carries the event from the originator into the fan-out workers? Why not just call the workers synchronously?

Staff

Synchronous fan-out couples the user's write latency to the size of their follower list — unacceptable. I put a durable, ordered log between them: Amazon Kinesis Data Streams, partition key = source user ID. That partition key matters: it preserves per-originator ordering, so two posts from the same account fan out in order, and it spreads load across shards by author. Fan-out workers are AWS Lambda with a Kinesis trigger.

Notification ID generation at scale

Principal

Every notification needs an ID. The obvious answer is a UUID. Convince me that's wrong.

Staff

UUIDv4 is the naive instinct because it's a one-liner with no coordinator and no collisions. The problem is that it's random, and random keys are poison for a time-series write pattern on a B-tree or LSM index. Inserting random keys scatters writes across the entire keyspace, causing constant index page splits and cache misses. On DynamoDB and Cassandra this shows up as write amplification and degraded throughput exactly under the load we just sized.

Napkin math — why time-ordered keys matter

Notifications are written in roughly time order and read newest-first. With a random sort key, a "give me my 20 newest notifications" query can't use a contiguous range scan — the newest items are scattered. With a time-ordered key, the newest 20 are the last 20 in the index: one contiguous read. The difference at a $10^{6}$ writes/s ingest is the difference between sequential appends (cache-friendly, few splits) and random inserts (one page split per few writes). We also get free cursor pagination: the sort key is the cursor.

Staff

So I use a ULID as the notification ID. 128 bits: a 48-bit millisecond timestamp followed by 80 bits of randomness, Crockford base32-encoded to 26 URL-safe characters. Lexical sort order equals time order. No central coordinator — the timestamp comes from the worker's clock and 80 bits of entropy makes collisions within a millisecond astronomically unlikely.

The collision math: within a single millisecond you'd need a birthday collision across $2^{80}$ values. Even at $10^{6}$ IDs in one millisecond,

$$ P_{collision} \approx \frac{n^{2}}{2 \cdot 2^{80}} = \frac{(10^{6})^{2}}{2 \cdot 2^{80}} \approx 4 \times 10^{-13}, $$

negligible, and we never generate $10^6$ in one millisecond per worker anyway.

Principal

Discord uses Snowflake. Why not Snowflake?

Staff

Snowflake (41-bit timestamp + 10-bit machine ID + 12-bit sequence) is excellent and Discord's choice is sound — but it requires a coordinated machine-ID allocation. Someone has to guarantee each generator gets a unique worker ID, which means a registry or ZooKeeper-style assignment. In a Lambda fleet that scales to thousands of concurrent executions and recycles constantly, allocating and reclaiming stable machine IDs is a coordination headache. ULID's 80 random bits buy us out of that coordination entirely. We give up Snowflake's guaranteed strict monotonicity within a node, but we don't need strict monotonicity — millisecond ordering plus a tiebreaker is enough for a notification stream.

Principal

Where do these rows live?

Staff

The inbox is a DynamoDB table keyed (tenant_id#user_id PK, notification_id SK) where the SK is the ULID. The partition key is tenant-scoped from the start — tenant_id is the literal leading segment, which is what the dynamodb:LeadingKeys IAM condition clamps on (more on that in the multi-tenancy section). Newest-first reads are a descending range query on the partition — the cursor pagination falls out for free. A 90-day TTL on each item keeps the table from growing without bound; notifications older than 90 days self-expire. A GSI on (tenant_id#user_id, read_status) serves the unread-count badge without scanning the partition — and note that GSI doubles the write cost (base WCU + GSI WCU), which we account for in the cost sketch. And Global Tables replicate the inbox cross-region so a user in another region reads from a local replica.

The delivery layer — APNs, FCM, deduplication

Principal

A row in DynamoDB isn't a notification on a phone. How does it get to APNs and FCM, and what goes wrong at burst?

Staff

Naive instinct: the fan-out Lambda opens an HTTP/2 connection to APNs and FCM and pushes directly. That fails three ways at scale. First, APNs throttles aggressively during bursts — best practice is a pool of 5–10 HTTP/2 connections each sustaining $\sim$2,000 req/s, and a Lambda that spins up and dies can't maintain a warm pool. Second, FCM caps a project at 600,000 messages/minute. Third, dead tokens: a token unused for 270 days expires on FCM, and invalid-token responses from both providers must immediately stop sends to that token — otherwise you burn quota retrying corpses.

Napkin math — provider throughput vs. burst

FCM's ceiling is $6 \times 10^{5}$ messages/minute $= 10^{4}$ messages/second per project. Our celebrity post needs to deliver to, say, the 20% of 50M followers who have push enabled and are active:

$$ D = 0.20 \times 5 \times 10^{7} = 10^{7}\ \text{push messages.} $$

At $10^{4}$ msg/s that's $10^{7} / 10^{4} = 1000\ \text{s} \approx 17\ \text{minutes}$ to drain on a single FCM project — so we shard across $K$ sender projects (concretely: $K$ SNS platform applications, one per FCM sender project; the push consumer routes by recipient_user_id mod K, which keeps a given device sticky to one project) and, more importantly, we buffer. The delivery layer must absorb a 10M-message spike and bleed it out at the providers' aggregate sustainable rate, not drop it. That's a queue, not a synchronous call.

Staff

So the fan-out worker does not call providers. It writes the inbox row and enqueues a delivery job onto SQS Standard queues, one per channel: mobile-push, email, in-app, SMS. Every message carries a mandatory tenant_id message attribute so a consumer can never act on a job without knowing whose data it is. Each queue has a dead-letter queue with a redrive after 3 failed receives. Channel-specific consumer Lambdas drain each queue at a controlled rate. For mobile push I deliberately do not hand-roll APNs/FCM clients — I use Amazon Pinpoint on top of SNS mobile push. SNS owns the platform-endpoint abstraction: token registration, the APNs/FCM feedback loop, and auto-deregistration of stale tokens. Pinpoint adds the campaign/quiet-hours/timezone layer on top.

Decision — Standard, not FIFO

The first draft reached for SQS FIFO. We pulled it. FIFO buys two things: strict ordering and a 5-minute dedup window. We need neither at the delivery queue. Display order is already carried by the ULID, not by queue arrival order — the phone shows newest-first regardless of which job drained first. And dedup is handled by a dedicated 30-day store (below), not a 5-minute queue window. Meanwhile FIFO costs us the thing we cannot give up: throughput. A single message group tops out around 3,000 TPS with batching; one celebrity event is 10M push jobs, which at 3,000 TPS is a 55-minute drain — a 3× shortfall against the 9,800 msg/s we need to clear in ~17 minutes. SQS Standard has effectively unlimited throughput. We take Standard; the dedup layer makes ordering-and-duplicate guarantees a queue-independent concern.

Principal

SQS Standard is at-least-once — you just gave up FIFO's window entirely. A redrive after a visibility-timeout expiry can re-deliver. How do you not double-notify 50 million people?

Staff

Right — at-least-once plus retries equals duplicate risk, and the classic bug is a dedup window shorter than the provider's own retry timeout, which leaks duplicates through the seam. OneSignal documents a 30-day dedup window for exactly this reason. So the push consumer Lambda runs an idempotency check before every send against Amazon MemoryDB for Redis — same Redis API, but a durable multi-AZ transaction log so the dedup state survives a failover instead of resetting empty:

key = "dedup:" + sha256(tenant_id + ":" + notification_ulid + ":" + recipient_id)
if SET key 1 NX EX 2592000:   # 30-day TTL, only set if absent
    deliver()                  # we won the race; send
else:
    skip()                     # someone already delivered this

SET ... NX is atomic, so even two consumer Lambdas processing a redriven message race on the same key and exactly one wins. The 30-day TTL comfortably exceeds any provider retry horizon, so duplicates can't slip through the seam. Note the key is tenant-prefixed — without it, two tenants minting colliding ULIDs (or sharing a recipient device across apps) could cross-suppress each other's notifications. The prefix makes the dedup keyspace tenant-isolated.

Principal

And when MemoryDB is unreachable mid-burst? Does the consumer block, or send and pray?

Staff

Neither silently. The draft named the tension and resolved nothing — that's the bug. We resolve it: fail-closed. On a MemoryDB timeout or error the consumer Lambda throws, does not ack the SQS message, and lets it become visible again for retry with backoff. The job stays in the queue. We never fail-open into a blind send. For the catastrophic case — MemoryDB fully down for a sustained window — consent-bearing channels fall back to a DynamoDB conditional write as a durable second dedup store, so legally-gated sends still get an idempotency guarantee even with the cache gone.

Napkin math — dedup cache footprint

Each dedup key is a 64-char SHA-256 hex string plus a small value and Redis overhead, call it $\sim$120 bytes amortised. If we send $D = 2 \times 10^{9}$ push messages over a rolling 30-day window:

$$ M = 2 \times 10^{9} \times 120\ \text{B} = 2.4 \times 10^{11}\ \text{B} \approx 240\ \text{GB.} $$

That's a sharded MemoryDB cluster, not a single node — but it's bounded and predictable, and the 30-day TTL caps it. We accept this as the price of effectively-once delivery at the seam — and the durability of MemoryDB's transaction log is what keeps a failover from silently resetting all that idempotency state to empty.

Quiet hours, preferences, and timezone bugs

Principal

A user sets quiet hours 22:00–07:00. The classic failure is computing that in UTC. Talk me through it.

Staff

This is the bug that silently excludes whole user segments. If you store quiet hours as 22:00–07:00 and compare against now() in UTC, then every user in, say, UTC+9 has their "10pm–7am quiet window" applied to UTC 10pm–7am — which is 7am–4pm local. You either spam them at midnight or suppress them all day. The fix is non-negotiable: store the user's IANA timezone (e.g. Asia/Tokyo) per user, and evaluate quiet hours in user-local time.

Principal

And preferences themselves — you can't read DynamoDB on every one of 10 million deliveries.

Staff

Correct — preferences go behind a write-through ElastiCache in front of the DynamoDB preferences table. The trap here is cache TTL. A long TTL (say 1 hour) means a user who turns off notifications keeps getting them for up to an hour — that's not just annoying, under GDPR it's a consent violation for up to an hour. But a very short TTL hammers DynamoDB.

So I use a moderate TTL of 5 minutes as a backstop, plus event-driven invalidation. And here's the managed-first correction the draft missed: I do not stand up a dedicated Kinesis stream for preference changes. The preferences table already knows when it changes — DynamoDB Streams emits a change record on every write for free, and an EventBridge Pipe wires that stream directly to the cache-invalidation Lambda. No separate ingest path to operate, and the invalidation trigger is source-of-truth-driven by construction. The TTL only matters if a stream record is somehow missed; the common path is invalidated in real time.

Principal

You keep saying "marketing channels." Under GDPR and ePrivacy, doesn't all push need consent?

Staff

You're right, and the draft got it wrong by gating only marketing. Push to a device is push to a device — GDPR Art. 7 and ePrivacy treat it as requiring a lawful basis regardless of whether we'd call the content "transactional" or "social." So the gate moves to every push send: the push consumer reads a push_consent attribute — stored in the preferences table as {granted_at, version} — and default-denies when it is absent. No consent record, no push. Quiet-hours is treated the same way: if the user's timezone or preference is unavailable, we defer rather than send — quiet-hours fails closed, not open.

Staff

One more storm to prevent: even with dedup, a single user can get blasted — imagine 500 people liking your post in a minute. I run a sliding-window rate limit per (user, channel) in ElastiCache: above a threshold, collapse into a digest ("500 people liked your photo") rather than 500 separate buzzes. That protects both the user's attention and the provider quota.

Failure modes, recovery, and multi-tenancy

Principal

It's 3am. APNs starts returning 429s and 500s in bulk. What happens to my system?

Staff

The buffer absorbs it. Failed sends from the push consumer don't get acknowledged, so SQS makes them visible again after the visibility timeout; SNS retries with exponential backoff. After 3 failed receives a message lands in the DLQ instead of looping forever. Crucially I cap the push consumer's Lambda reserved concurrency so we never stampede APNs harder during its degradation — concurrency limits are the back-pressure valve. When APNs recovers I redrive the DLQ. Because every send is dedup-guarded, redriving the DLQ can't double-deliver the messages that actually succeeded before the error.

The Kinesis hot-shard case — why we did NOT add a compound partition key

A critic pushed for a compound partition key (source_user_id + seq_mod_N) to defend against a celebrity pinning one shard. We rejected it. A single Kinesis shard absorbs 1,000 records/s — 3.6M author-events/hour. A celebrity going live posts tens of events per minute, not millions; the fan-out computation happens inside Lambda, not in Kinesis, so one prolific author is negligible shard load. A compound key would shatter per-author ordering (the whole reason we partition by author) for a saturation case that doesn't exist at this scale. The real defence is the two-tier model itself — celebrity events are $O(1)$ writes — plus a CloudWatch alarm on IteratorAgeMilliseconds and Kinesis on-demand auto-scaling to split a genuinely hot shard. Lightweight mitigation, no ordering sacrifice.

Principal

This is multi-tenant — many apps/brands on one platform. How do you keep tenant A's notifications and tokens away from tenant B, and what compliance does that map to?

Staff

Tenant isolation runs through the keys and the IAM boundary, mapped to the controls auditors actually ask for:

SOC 2 CC6 (logical access control): every inbox partition key is tenant-scoped (tenant_id#user_id, with tenant_id as the literal leading segment), and consumer Lambdas assume per-tenant IAM roles with the dynamodb:LeadingKeys condition key bound to that tenant_id so a role can only read its own tenant's partitions. The dedup key, the SQS message attributes, and the cache keys are all tenant-prefixed too — the tenant boundary is in the data, not just the IAM policy. No query can cross the tenant boundary.
Least privilege, scoped concretely: the fan-out Lambda role is not "DynamoDB access." It is dynamodb:PutItem/Query/BatchGetItem on the specific inbox and preferences table ARNs (with LeadingKeys), sqs:SendMessage on the four channel-queue ARNs, and nothing else. The push consumer adds sns:Publish on its platform-application ARNs only — not sns:CreatePlatformEndpoint (token registration is a separate, narrower role).
ISO 27001 A.10 / encryption: notification bodies can contain PII ("Alice messaged you"). At rest: DynamoDB SSE-KMS, MemoryDB and ElastiCache at-rest encryption. In transit: transit_encryption_enabled=true on MemoryDB and ElastiCache with Redis ACL users (no anonymous access), and the data plane runs over VPC interface endpoints for Kinesis/SQS/SNS/Secrets Manager and a gateway endpoint for DynamoDB — traffic never traverses the public internet. Provider credentials (APNs auth key, FCM service-account keys) live in AWS Secrets Manager with rotation, never in env vars or code. The push payload to APNs/FCM carries only a reference plus a generic body when the device is locked — sensitive content is fetched after authentication, not shipped in the push.
One CMK per table, not per tenant: to be precise — DynamoDB SSE applies one KMS CMK per table, not per item, so the pooled inbox table uses a shared CMK and tenant isolation is enforced by LeadingKeys, not by per-tenant keys. A genuine per-tenant CMK requires a table-per-tenant silo; we name that as an enterprise-tier upgrade path, not the default. The earlier "per-tenant KMS key" phrasing was wrong and is corrected here.
GDPR (consent and residency): consent is a first-class preference checked on every push send (Art. 7 lawful basis), default-deny when absent; a consent withdrawal invalidates the cache via the DynamoDB-Streams Pipe immediately. Right-to-erasure purges all Global-Tables replicas, the dedup keys, the preference cache, and the audit log — not just the home region. EU-tenant residency: tenants flagged EU-only are pinned to EU regions and excluded from non-EU Global-Tables replication. TTLs on the inbox (90 days) and dedup keys (30 days) bound retention.
NIST SP 800-53 AU-2/AU-9 (audit, integrity): every delivery attempt — enqueued, sent, suppressed by quiet hours, denied by consent, deduped, or DLQ'd — emits a structured audit event via Kinesis Firehose to an S3 bucket with Object Lock in compliance mode (WORM) in a dedicated logging account — the records cannot be altered or deleted within the retention window, even by an admin. CloudTrail captures all control-plane actions. We can prove, per notification, who we tried to notify, when, on what channel, and why it was or wasn't delivered — and prove the proof wasn't tampered with.

Principal

What does this cost me at steady state?

Napkin math — monthly cost sketch (corrected)

Assume $2 \times 10^{9}$ push notifications/month, on-demand, two regions. The first draft understated this badly — here is the corrected composition:

DynamoDB: writes are not 1× — the unread-count GSI doubles WCU (base + GSI), and Global Tables charges replicated WCU at $\sim$\$1.875/M per replica region. $2 \times 10^{9}$ writes $\times$ 2 (GSI) $\times$ \$1.25/M base, plus the replica region, lands at $\sim$\$8{,}000$–$\$10{,}000$/month. The biggest line by far.
SNS mobile push: the line the draft omitted entirely — \$1.00/M $\times 2 \times 10^{9}$ = \$2,000/month. A top-three driver hiding in plain sight.
MemoryDB dedup cluster: the $\sim$240 GB durable dedup keyspace — $\sim$\$2,500/month (the durability premium over plain ElastiCache is the price of not losing dedup state on failover).
SQS Standard: \$0.40/M, but at $\sim$2.5 API requests per delivered message (send + receive + retries): $2 \times 10^{9} \times 2.5 \times$ \$0.40/M $\approx$ \$2,000/month.
ElastiCache follower-list + pointer + rate-limit cache: top hot accounts' follower lists — e.g. cache.r6g.2xlarge $\times$ 2 nodes $\approx$ \$650/month (eviction acceptable here, which is why this stays ElastiCache, not MemoryDB).
Lambda: $\sim$$2 \times 10^{9}$ delivery invocations $+$ fan-out $+$ compute $\approx$ low thousands of $/month.
Kinesis: author-action events ($\sim$$10^{8}$/month, far fewer than deliveries) $\times$ a few shards: low hundreds of $/month. Long-term replay goes to S3 via Firehose rather than paying for extended Kinesis retention.

Corrected total lands around \$17,000–\$22,000/month for 2 billion notifications — still low five figures, so the directional claim survives, but the composition is honest now: DynamoDB (GSI + Global Tables), MemoryDB, SNS, and SQS are the four real drivers. The two-tier model (turning $O(F)$ celebrity writes into $O(1)$) is the biggest cost lever because it attacks the DynamoDB write line directly; reserved capacity and right-sizing the dedup TTL are next.

Did we ever leave AWS?

Principal

Final question. Anywhere in this design did you have to leave AWS?

Staff

No. Every tier has a managed AWS service that fits: Kinesis for the event log, Lambda for fan-out and delivery workers, DynamoDB for the ULID-keyed inbox and preferences (with DynamoDB Streams + an EventBridge Pipe driving preference-cache invalidation — no extra stream to run), SQS Standard for the per-channel delivery buffer, SNS + Pinpoint for APNs/FCM transport and quiet-hours scheduling, SES for email, MemoryDB for the durable dedup keyspace and ElastiCache for the evictable follower-list and rate-limit caches, KMS for at-rest encryption, Secrets Manager for provider credentials, and Firehose/S3 (Object Lock) plus CloudTrail for the audit trail. The only non-AWS endpoints are APNs and FCM themselves — and those aren't a choice, they're Apple's and Google's networks, which every notification system on Earth must terminate at. SNS abstracts even those. We never left AWS.

↓ podcast script (.txt)