Real-time notification fan-out

A user does something — likes a post, a price moves, a build finishes — and we need every interested client to see it within a second or two. Easy at a thousand connections. The interesting version is ten million.

The problem, and the numbers we design to

Principal

Before any boxes — what are we actually building, and at what scale?

Staff

A fan-out delivery system. Clients hold a live connection; a backend event needs to reach all clients subscribed to that event's topic, end-to-end in under ~2 seconds at p99. Design point: 10 million concurrent connections, a baseline of 50k events/second, average fan-out of 200 clients per event, bursting to 20× during big moments.

Napkin math

Fan-out work is the thing that explodes, not ingestion. Outbound message rate is

$$ R_{out} = R_{events} \times \overline{f} = 50{,}000 \times 200 = 10^{7}\ \text{messages/second} $$

At a 20× burst that's $2 \times 10^{8}$ pushes/second. So the design question isn't "can we accept 50k events" — trivially yes — it's "can we absorb a 10–200M/s fan-out without the publisher feeling it." That pushes us toward decoupling and horizontal shards from the start.

The connection layer

Principal

Ten million sockets. Are you running your own fleet of socket servers?

Staff

The naive instinct is exactly that — an EC2/EKS fleet of WebSocket servers behind an NLB, holding sockets in memory. It works, but now I own connection draining on deploy, sticky routing, scaling the fleet against a spiky connection count, and a connection registry I build myself. That's a lot of undifferentiated heavy lifting.

So I walk the ladder: is there a managed AWS service for "hold the socket for me"? Yes — API Gateway WebSocket APIs. It terminates the socket, gives me $connect/$disconnect/$default routes, and lets me push to any client later with a @connections POST {connectionId} call. I never hold a socket in my own process.

Where do connections live?

Principal

When an event for topic T arrives, how do you find the 200 connections that care?

Staff

A connection registry keyed for that lookup. DynamoDB, with the topic as the partition key and the connectionId as the sort key, so "all connections for topic T" is a single Query. On $connect we write the subscriptions; on $disconnect we delete them; TTL sweeps anything stale.

PK = TOPIC#<topicId>
SK = CONN#<connectionId>
attrs: { region, connectedAt, ttl }

A hot topic — say a celebrity with millions of subscribers — would make that one partition a hot key. So for high-cardinality topics we shard the partition: TOPIC#<id>#<shard> over N shards, and fan a delivery worker out per shard. Same write-sharding pattern as any hot-partition problem.

The fan-out itself

Principal

So the publisher fires an event. Walk me from there to the client. And don't tell me the publisher calls @connections POST 200 times synchronously.

Staff

It doesn't — that would couple the publisher's latency to the slowest client and to API Gateway throttling. The publisher does one thing: publish the event to a topic and return. Decoupling buffer is SNS → SQS: the event lands on an SNS topic that fans out to a set of SQS queues (shards). Lambda delivery workers consume the queues, Query DynamoDB for the connections in their shard, and issue @connections POST in batches.

The reflexive answer here is "use Kafka." But MSK or self-managed Kafka is a cluster I'd run, partition, and patch — and SNS→SQS gives me the fan-out, per-consumer buffering, retries, and a dead-letter queue with zero servers. Kafka earns its place only if I needed log replay, ordered partitions consumed by many independent groups, or sustained throughput where per-message SQS cost hurts. I don't, yet — so I stay on managed primitives.

Principal

A slow region, a burst, a poison event — what breaks, and at 3am what pages?

Staff

SQS is the shock absorber: a 20× burst grows queue depth, not publisher latency, and Lambda scales out against it. Each push is idempotent (clients dedupe on a message id), so SQS at-least-once redelivery is safe. A message that fails repeatedly goes to a DLQ after maxReceiveCount rather than blocking the queue. The page is on queue age (oldest-message age crossing the latency budget) and on DLQ depth — not on CPU. Stale connections just fail the POST with 410 Gone; the worker deletes them from the registry on the spot, which keeps the table self-healing.