Live ABR streaming looks like a pipe: broadcaster pushes RTMP, you transcode into a few bitrates, slice into HLS segments, put a CDN in front, done. That pipe has a heartbeat — the manifest — that, if it goes stale for ten seconds, freezes every viewer at once; a transcoder restart that quietly triples your bitrate and stampedes buffering across the whole audience; and a live edge so fresh that the CDN is always one cache miss away from pulling the origin apart. The interesting version keeps 10 million people three seconds behind live, through a transcoder crash at 3am, without a single managed component leaving AWS.
The problem, and the numbers we design to
Live streaming. Before any boxes — what are we building, and at what scale?
One ingest, massive fan-out. A broadcaster sends one contribution stream — RTMP or SRT, one source, maybe 6 Mbps of 1080p60. We transcode it into an ABR ladder of 6 renditions and deliver to up to 10 million concurrent viewers, each of whom is independently picking the rendition their bandwidth can hold. Two SLOs. Glass-to-glass latency: viewers should be within 3–10 s of the live edge — 3 s if we run LL-HLS, 6–10 s for classic HLS. And rebuffer ratio: time spent buffering over time spent playing, target p99 ≤ 0.5%. A viewer will forgive being 5 s behind. They will not forgive the spinner.
Two numbers, two different failure surfaces. Which one is harder?
Rebuffer ratio, by a mile, because it's a correlated failure. Latency degrades gracefully — if the edge gets slow, viewers drift a few seconds further back and keep watching. Rebuffering is the cliff: the events that cause it (a stale manifest, a transcoder restart, a cold edge) hit all viewers of a channel simultaneously. One broadcaster's transcoder hiccup is not a hiccup for one viewer — it's a synchronized spinner across the entire audience. So the whole design is really about removing the single points where one event becomes a million spinners.
Fan-out, not ingest, is the dominant cost. Take a mid-size event: 1 million concurrent viewers, average rendition 3 Mbps (most viewers land in the 720p tier, some above, many below). Aggregate egress:
$$ B = 10^{6} \times 3\ \text{Mbps} = 3 \times 10^{6}\ \text{Mbps} = 3\ \text{Tbps} $$
Per hour that is
$$ 3 \times 10^{12}\ \text{bit/s} \times 3600\ \text{s} \div 8 = 1.35 \times 10^{15}\ \text{B/h} \approx 1.35\ \text{PB/hour} $$
At CloudFront's $0.085/GB list rate (first 10 TB/mo, US/EU), that is roughly $115,000 per hour of delivery; at the ~$0.02/GB enterprise/committed-use rate the deepest tiers reach, it is ~$27,000 per hour — and ingest is exactly one 6 Mbps stream. The entire economic and reliability problem lives on the read side. Design every component asking "what does this do at a million simultaneous identical requests?"
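The egress arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming decimal units (1 GB = 10^9 bytes) and a flat per-GB rate; the function names are illustrative and not part of any AWS SDK.

```python
def egress_bytes_per_hour(viewers: int, avg_mbps: float) -> float:
    """Aggregate delivery volume, in bytes, for one hour of streaming."""
    bits_per_second = viewers * avg_mbps * 1e6
    return bits_per_second * 3600 / 8


def delivery_cost_per_hour(viewers: int, avg_mbps: float, usd_per_gb: float) -> float:
    """Hourly CDN cost at a flat per-GB rate (assumption: no tiered pricing)."""
    gigabytes = egress_bytes_per_hour(viewers, avg_mbps) / 1e9
    return gigabytes * usd_per_gb


if __name__ == "__main__":
    # 1M viewers at 3 Mbps: about 1.35 PB/hour
    print(egress_bytes_per_hour(1_000_000, 3.0) / 1e15, "PB/hour")
    print(delivery_cost_per_hour(1_000_000, 3.0, 0.085), "USD/hour at list")
    print(delivery_cost_per_hour(1_000_000, 3.0, 0.02), "USD/hour committed")
```

Swapping in a different average rendition bitrate shows how directly the bill tracks where the modal viewer sits on the ladder.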
The naive pipe, and why it freezes everyone at once
Start naive. One RTMP server, FFmpeg transcodes into HLS, files on disk, one CDN in front. What breaks at a million viewers?
Three things break, in increasing order of subtlety. The obvious one: a single transcoder box is a single point of failure for the whole channel. It dies, the channel is black, full stop. The second: origin crush. Every HLS player polls the manifest every target-duration seconds and fetches each new segment as it appears. With, say, 6 s segments, a million players hitting one origin box is roughly 330,000 requests per second (one manifest poll plus one segment fetch per player every 6 s) against local disk; it falls over instantly.
The third is the one that's killed real platforms: the manifest stampede. The HLS/DASH manifest is the playlist that tells the player which segments exist and what's at the live edge. Players re-fetch it about once per target duration (every second or two with short segments or low-latency modes); it is the heartbeat of the stream. If your origin can't update that manifest for 10 s (a transcoder race, a GC pause, an overloaded origin), players reach the end of the list they have and find no new segment named. They don't error gracefully. They all freeze, at the same instant, and then all retry, at the same instant. The manifest is both the most-requested object and the most fragile one.
So you need transcoder redundancy, an origin that scales reads, and a way to keep the manifest both fresh and survivable. That's most of a media stack. Are you building it, or buying it?
Buying it, and I'll defend that hard. Twitch built TwitchTranscoder over FFmpeg and Netflix built bespoke live encoding — but they did it because at their scale they needed proprietary codec metadata and per-frame quality control that no managed product exposed, and they had the standing media-infra org to own it. For a staff design targeting "10M viewers, AWS-first, ship this quarter," AWS Elemental gives me the three hard pieces as managed services: MediaLive for redundant transcoding, MediaPackage for manifest-and-segment origin packaging, and CloudFront for delivery. I'll walk each, and I'll name exactly where I'd consider leaving the managed path.
Ingest: contribution-grade transport with a redundant path
Start at the source. The broadcaster's stream is on the public internet. How does it get to you reliably?
Raw RTMP over the open internet is the naive instinct, and it's tempting because every encoder speaks it. The problem is RTMP rides TCP: a single packet loss stalls the whole stream behind retransmission, and the broadcaster's last-mile is exactly where loss happens. A 2% loss event on a TCP contribution feed shows up downstream as a frozen frame for every viewer. So contribution-grade transport wants forward error correction and packet recovery, not bare TCP.
The decision: ingest over SRT into AWS Elemental MediaConnect for any premium event, with two source inputs: the broadcaster sends from two encoders, or over two paths, into two MediaConnect flows. MediaConnect does ARQ-based packet recovery and gives me source-redundant failover at the transport layer. For lower-stakes streams the encoder can also just point RTMP straight at a MediaLive input; MediaLive inputs accept dual RTMP push for redundancy too.
Ingest is also an authentication boundary, not just a transport one. Lock the MediaLive/MediaConnect input security groups to the broadcaster's known source CIDRs, and prefer SRT with a per-tenant AES passphrase (stored in Secrets Manager, rotated on a schedule) over bare RTMP — that gives encrypted contribution transport and stops a stream-hijack where someone who learns the input URL pushes their own feed into the channel.
Two inputs only helps if the failover is seamless. What does a viewer see when the primary input drops at 3am?
Nothing, if it's configured right — and that's the whole point of doing it at the input layer rather than the transcoder layer. MediaLive does automatic input failover: it ingests the primary, holds the secondary hot, and on loss-of-signal or a black-frame/silence detector it cuts to the secondary within the buffer window, keeping the same output timeline. Because the segment timestamps stay continuous, the manifest never gaps and players never notice. The trade-off is cost — you're paying to ingest and stand by a second pipeline that's idle 99.9% of the time — but for a live event the idle-pipeline cost is rounding error against the egress bill, and a black channel during the main event is an unrecoverable business failure. You pay for the standby.
Transcode: the ABR ladder and the keyframe contract
Now the transcode. Walk me through the ladder, and tell me what's actually subtle about it.
The ladder is the set of renditions a player can switch between as its bandwidth changes. Ours:
160p @ 0.3 Mbps (last-resort mobile)
360p @ 0.6 Mbps
480p @ 1.2 Mbps
720p60 @ 3.0 Mbps (the modal tier)
1080p60 @ 6.0 Mbps
source passthrough (no re-encode)
To be precise about that last row: "source passthrough" passes the contribution encode rate — typically 6–20 Mbps from a streaming encoder — not a broadcast-uncompressed 50–100 Mbps truck feed. A critic flagged this as a bitrate inconsistency; it isn't. The contribution chain is MediaConnect → MediaLive at the broadcaster's encoder output, so the passthrough rendition is bounded by that, and it stays a sane top-of-ladder option rather than a bandwidth bomb.
MediaLive runs this as one channel producing all six outputs. The naive view is "six independent encodes." The subtle, load-bearing requirement is keyframe alignment: every rendition must place its IDR keyframes at the exact same presentation timestamps, so that segment N of the 360p stream covers precisely the same wall-clock interval as segment N of the 1080p stream.
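The alignment contract is easy to state as a check. The following is a sketch, assuming you have already extracted each rendition's IDR presentation timestamps (for example with a probe tool) into plain lists in a shared timebase; the function and rendition names are hypothetical.

```python
from typing import Dict, List


def misaligned_renditions(idr_pts: Dict[str, List[int]], reference: str) -> List[str]:
    """Return renditions whose IDR timestamps differ from the reference.

    idr_pts maps a rendition name to the presentation timestamps (in one
    shared timebase, e.g. 90 kHz ticks) at which it emitted IDR keyframes.
    """
    expected = idr_pts[reference]
    return sorted(
        name for name, pts in idr_pts.items()
        if name != reference and pts != expected
    )


if __name__ == "__main__":
    ladder = {
        "720p60": [0, 180_000, 360_000],   # 2 s GOP at 90 kHz
        "360p": [0, 180_000, 360_000],
        "1080p60": [0, 180_090, 360_000],  # one keyframe drifted by 1 ms
    }
    print(misaligned_renditions(ladder, reference="720p60"))
```

Even a one-millisecond drift counts as a failure here, because a player switching renditions at a segment boundary assumes the boundaries are identical.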
You raised the transcoder-restart spike earlier. Walk me through it concretely. A pipeline gets killed and restarted at 3am — what exactly goes wrong, and how does MediaLive save you?
Here's the failure in detail. When an encoder cold-starts mid-stream it has no reference frames, so the first GOP it emits is keyframe-heavy — closer to all-intra than the steady-state I/P/B mix. An intra frame is several times larger than a predicted frame, so for the first 2–3 segments the actual bitrate of the "3 Mbps" rendition can spike to 8–10 Mbps.
Now chain that with ABR. A player sizes its next request from how the last download went. It pulls a fat post-restart segment that carries up to three times the bytes the rendition advertises, the download takes longer than the segment plays, the buffer drains, and the player concludes the rendition is unsustainable and steps down the ladder. Across a million players, that's a synchronized downshift, a brief rebuffer wave, and then a slow climb back up. One restart, a million spinners. That's the cascade.
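A rough way to see why one fat segment is enough: compare how long the segment takes to download with how long it takes to play. This sketch assumes a constant viewer throughput and a player that downloads one segment at a time; the 5 Mbps viewer is an illustrative figure, not one from the design.

```python
def buffer_delta_s(segment_s: float, actual_mbps: float, throughput_mbps: float) -> float:
    """Net change in player buffer from downloading one segment.

    The player gains segment_s seconds of media, but playback consumes one
    second of buffer for every second the download takes.
    """
    download_s = segment_s * actual_mbps / throughput_mbps
    return segment_s - download_s


if __name__ == "__main__":
    # Steady state: a 3 Mbps segment on a 5 Mbps link grows the buffer.
    print(buffer_delta_s(6.0, 3.0, 5.0))
    # Post-restart: the same rendition spiking to 9 Mbps drains it.
    print(buffer_delta_s(6.0, 9.0, 5.0))
```

A negative result means the buffer shrinks while the segment downloads, which is the condition that pushes an ABR controller down the ladder.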
MediaLive's answer is not to cold-restart in the viewer's path at all. A MediaLive Standard channel runs two independent pipelines (pipeline 0 and pipeline 1) across two AZs, each producing the full output, both feeding the packager. If pipeline 0 crashes and is being rebuilt, pipeline 1 is still producing steady-state segments, and the packager simply takes segments from the healthy pipeline. The viewer never sees the restart spike because the restart never reaches the manifest. The trade-off is that you're paying for a second full transcode continuously — Standard is ~2× the encode cost of a Single-pipeline channel — but a Single-pipeline channel is the cold-restart cascade waiting to happen.
Packaging: the manifest as heartbeat, and LL-HLS
MediaLive produces segments. Who turns them into HLS and DASH manifests, and where does the heartbeat problem from beat 2 actually get solved?
AWS Elemental MediaPackage is the origin packager. MediaLive pushes it a contribution format; MediaPackage generates HLS, DASH, and CMAF manifests on the fly per request and writes segments to S3-backed storage. Doing packaging at the origin rather than baking fixed playlists at the encoder is what lets me serve HLS to Apple devices and DASH to others from one source, and offer time-shift / DVR windows without re-encoding. Set the MediaPackage segment retention window to match only the DVR requirement, and add an S3 Intelligent-Tiering lifecycle rule on objects older than that window so accumulating event archives don't quietly grow the storage bill.
The heartbeat discipline lives in the TTLs. Two object classes, two completely different cache policies:
Segment (seg_NNN.ts / .m4s): immutable
    Cache-Control: max-age = segment_duration + 10 s buffer
Manifest (.m3u8 / .mpd): mutable, points at live edge
    Cache-Control: max-age = 1 s (effectively per-poll revalidate)
Segments are immutable, so they cache aggressively and a million viewers fetching seg_050.ts all hit warm cache. The manifest is the heartbeat — it must be near-real-time fresh, so its TTL is ~1 s. That tiny TTL is what bounds how stale the live edge can get, and it's also the request that's hardest on the system.
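The two-class policy can be written as a single function keyed on the file extension. This is a sketch of the rule as stated above, not MediaPackage or CloudFront configuration; the function name and the error behavior for unknown extensions are assumptions.

```python
MANIFEST_SUFFIXES = (".m3u8", ".mpd")
SEGMENT_SUFFIXES = (".ts", ".m4s")


def cache_control(path: str, segment_duration_s: int) -> str:
    """Pick a Cache-Control value for a manifest or a media segment."""
    if path.endswith(MANIFEST_SUFFIXES):
        # Mutable heartbeat: keep it at most about one second stale.
        return "max-age=1"
    if path.endswith(SEGMENT_SUFFIXES):
        # Immutable segment: segment duration plus a 10 s buffer.
        return f"max-age={segment_duration_s + 10}"
    raise ValueError(f"unrecognized object type: {path}")


if __name__ == "__main__":
    print(cache_control("live/ch42/index.m3u8", 6))
    print(cache_control("live/ch42/seg_050.ts", 6))
```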
Classic HLS players re-poll the manifest roughly every target-duration. With 6 s segments and 10 million viewers:
$$ Q_{manifest} = \frac{10^{7}\ \text{viewers}}{6\ \text{s}} \approx 1.7 \times 10^{6}\ \text{requests/s} $$
1.7M req/s for an object with a 1 s TTL. If those misses reach MediaPackage you've recreated origin crush. The fix is that CloudFront collapses identical concurrent requests and the 1 s TTL means the origin sees at most ~1 manifest fetch per second per edge POP, not per viewer — request collapsing turns 1.7M/s into hundreds/s at origin. Lose request collapsing (e.g. manifests marked truly non-cacheable, no-store) and that 1.7M/s lands on the packager. The 1 s TTL is doing enormous work; "non-cacheable manifest" is the trap.
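The bound is simple enough to compute directly. This sketch assumes a single manifest object, perfect request collapsing at each POP, and roughly 600 POPs; the function names are illustrative.

```python
def edge_manifest_rps(viewers: int, poll_interval_s: float) -> float:
    """Manifest requests per second arriving at the CDN edge."""
    return viewers / poll_interval_s


def origin_manifest_rps(edge_rps: float, pops: int, ttl_s: float) -> float:
    """Upper bound on manifest requests reaching origin.

    With collapsing plus a TTL, each POP refreshes at most once per TTL,
    so the origin rate is capped by pops / ttl_s, never by viewer count.
    """
    return min(edge_rps, pops / ttl_s)


if __name__ == "__main__":
    edge = edge_manifest_rps(10_000_000, 6.0)
    print(round(edge), "req/s at the edge")
    print(origin_manifest_rps(edge, pops=600, ttl_s=1.0), "req/s at origin")
```

Setting `ttl_s` toward zero removes the cap, which is the non-cacheable manifest trap in numeric form.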
You keep saying 3 s latency. Classic HLS with 6 s segments is structurally 18–30 s behind. How do you get to 3?
Low-Latency HLS. The latency floor of classic HLS is "you can't list a segment until it's complete," so a 6 s segment means the edge is at least 6 s old before it's even advertised, times a few segments of player buffer. LL-HLS breaks the segment into partial segments (parts) of ~200 ms that get advertised the moment they're encoded, plus blocking playlist reload (the manifest request hangs server-side until the next part exists, instead of the client polling-and-missing) and preload hints. MediaPackage supports LL-HLS output. That gets glass-to-glass into the 2–5 s range.
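Blocking playlist reload is visible in the request itself: the client names the media sequence number and part it is waiting for using the `_HLS_msn` and `_HLS_part` query parameters from the LL-HLS specification. The sketch below builds that URL; the fixed `parts_per_segment` is a simplifying assumption (a real player reads part boundaries from the playlist), and the example host is a placeholder.

```python
from urllib.parse import urlencode


def blocking_reload_url(playlist_url: str, last_msn: int, last_part: int,
                        parts_per_segment: int) -> str:
    """URL that asks the server to hold the response until the next part exists.

    last_msn / last_part identify the newest part the client already has.
    Part indexes are zero-based within a segment.
    """
    next_msn, next_part = last_msn, last_part + 1
    if next_part >= parts_per_segment:
        # The segment is complete; wait for part 0 of the following segment.
        next_msn, next_part = last_msn + 1, 0
    query = urlencode({"_HLS_msn": next_msn, "_HLS_part": next_part})
    return f"{playlist_url}?{query}"


if __name__ == "__main__":
    url = "https://cdn.example.com/live/ch42/720p.m3u8"
    print(blocking_reload_url(url, last_msn=51, last_part=2, parts_per_segment=5))
    print(blocking_reload_url(url, last_msn=51, last_part=4, parts_per_segment=5))
```

Because every player waiting on the same part sends the same URL, these requests collapse at the CDN just like ordinary manifest fetches.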
Delivery: CloudFront, Origin Shield, and the live-edge stampede
The live edge is by definition uncached — the newest segment has never been requested before. A million players want it the instant it appears. That's a thundering herd straight at your origin. How do you survive it?
This is the single most important reliability decision in the whole design, so let me be precise. Front everything with a CloudFront distribution. CloudFront already does request collapsing at each edge POP: if 50,000 viewers behind one POP request the brand-new seg_051.ts simultaneously, the POP makes one origin fetch and fans the response out to all 50,000. That alone reduces the herd from "viewers" to "POPs."
But CloudFront has ~600 POPs. 600 simultaneous fetches for every new segment, every few seconds, still hammers MediaPackage. So the second layer: CloudFront Origin Shield — a designated regional caching tier that all POPs route through before hitting origin. Now the request funnel is viewers → ~600 POPs → 1 Origin Shield → MediaPackage. Origin Shield collapses the 600 POP fetches into one origin fetch per segment. The origin sees a flat, predictable ~1 request per segment regardless of audience size.
New 6 s segment, 10M viewers. Origin requests per segment without any collapsing: up to $10^{7}$. With edge request collapsing only: ~600 (one per POP). With Origin Shield added:
$$ \approx 1\ \text{origin fetch per segment} \;\Rightarrow\; \frac{1}{6\ \text{s}} \approx 0.17\ \text{req/s at origin} $$
That is a seven-order-of-magnitude reduction in origin load, and — crucially — it's independent of audience size. The origin's load for a 10M-viewer event is the same as for a 10k-viewer event. That property is what makes "scale to 10M" a CDN-config problem rather than an origin-capacity problem.
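The funnel can be modeled in a few lines to confirm that origin load stops depending on audience size once both layers are on. This is an idealized sketch: it assumes perfect collapsing at every tier and one rendition, and the function name is hypothetical.

```python
def origin_fetches_per_segment(viewers: int, pops: int,
                               edge_collapsing: bool, origin_shield: bool) -> int:
    """Worst-case origin fetches triggered by one newly published segment."""
    if not edge_collapsing:
        return viewers                # every viewer request reaches origin
    per_pop = min(viewers, pops)      # one fetch per POP that has viewers
    return 1 if origin_shield else per_pop


if __name__ == "__main__":
    for audience in (10_000, 10_000_000):
        print(
            audience,
            origin_fetches_per_segment(audience, 600, False, False),
            origin_fetches_per_segment(audience, 600, True, False),
            origin_fetches_per_segment(audience, 600, True, True),
        )
```

The last column is the same for both audiences, which is the property the design leans on.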
Failure and recovery: detecting the frozen heartbeat
It's 3am, the main event is live, and the manifest stops advancing. Both MediaLive pipelines hiccup, or MediaPackage stalls. Walk me through detection and recovery — and what the player does in the gap.
Four layers of defense, inside-out. Layer 1, the player: a well-behaved HLS/DASH player already retries with exponential backoff on a missing segment and will hold its buffer. If we run a healthy buffer (a few segments), a 2–3 s origin stall is fully absorbed — the viewer's buffer drains down but never hits zero. The whole latency-vs-resilience tension is exactly this buffer size: smaller buffer = lower latency = less tolerance for a stall.
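The Layer 1 behavior can be sketched as a capped exponential retry schedule plus a check on whether the buffer outlasts the stall. The base delay, growth factor, and cap below are illustrative assumptions, not values from any particular player.

```python
from typing import List


def backoff_delays(base_s: float, factor: float, max_delay_s: float,
                   attempts: int) -> List[float]:
    """Delay before each retry: base * factor**n, capped at max_delay_s."""
    return [min(base_s * factor ** n, max_delay_s) for n in range(attempts)]


def stall_absorbed(buffer_s: float, stall_s: float) -> bool:
    """True if the buffered media outlasts the origin stall."""
    return buffer_s > stall_s


if __name__ == "__main__":
    print(backoff_delays(base_s=0.5, factor=2.0, max_delay_s=4.0, attempts=5))
    print(stall_absorbed(buffer_s=12.0, stall_s=3.0))   # classic HLS buffer
    print(stall_absorbed(buffer_s=2.0, stall_s=3.0))    # aggressive low-latency buffer
```

The second function is the latency-versus-resilience tension in one comparison: shrinking `buffer_s` lowers latency and flips the result for the same stall.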
Layer 2, dual pipeline: covered in beat 4 — a single pipeline crash never reaches the manifest because the packager pulls from the healthy pipeline.
Layer 3, manifest staleness detection: this is the active monitor. The metric isn't "is the origin up" — it's manifest age: now - newest_segment_PTS, read from #EXT-X-PROGRAM-DATE-TIME. When that crosses, say, 6 s, the heartbeat is flatlining and we alarm and trigger failover before viewers freeze.
A scalability critic worried this canary can't survive a 1,000-channel fleet — that O(channels) external Synthetics polling would either overload us or get masked by CDN caching. That's a critique of an implementation we're not using. The freshness check runs inline at the edge: a CloudFront Functions response handler reads the live-edge timestamp out of the manifest it's already serving and pushes manifest-age as a CloudWatch embedded metric, per channel. There's no external poller, so there's no per-channel polling load and no cache-masking problem — the signal is computed from the exact bytes the viewer receives.
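Wherever the check runs, the computation is the same: find the newest `#EXT-X-PROGRAM-DATE-TIME` tag in the media playlist and subtract it from the current time. The sketch below shows that computation only; it assumes every timestamp carries a UTC offset or a trailing `Z`, and that the last tag in the playlist belongs to the newest segment. The function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

PDT_TAG = "#EXT-X-PROGRAM-DATE-TIME:"


def newest_program_date_time(manifest_text: str) -> Optional[datetime]:
    """Return the last PROGRAM-DATE-TIME in the playlist, or None if absent."""
    newest = None
    for line in manifest_text.splitlines():
        line = line.strip()
        if line.startswith(PDT_TAG):
            stamp = line[len(PDT_TAG):].replace("Z", "+00:00")
            newest = datetime.fromisoformat(stamp)
    return newest


def manifest_age_s(manifest_text: str, now: datetime) -> Optional[float]:
    """Seconds between `now` and the start of the newest listed segment."""
    newest = newest_program_date_time(manifest_text)
    if newest is None:
        return None
    return (now - newest).total_seconds()


if __name__ == "__main__":
    playlist = "\n".join([
        "#EXTM3U",
        "#EXT-X-TARGETDURATION:6",
        "#EXT-X-PROGRAM-DATE-TIME:2024-01-01T03:00:06.000Z",
        "#EXTINF:6.0,",
        "seg_050.ts",
        "#EXT-X-PROGRAM-DATE-TIME:2024-01-01T03:00:12.000Z",
        "#EXTINF:6.0,",
        "seg_051.ts",
    ])
    now = datetime(2024, 1, 1, 3, 0, 20, tzinfo=timezone.utc)
    print(manifest_age_s(playlist, now))
```

Because the tag marks the start of the newest segment, a healthy stream already shows an age of at least one segment duration, so the alarm threshold needs to sit above that baseline.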
Layer 4, channel-level failover and recovery. For an event we can run a standby MediaLive channel (a second full channel, in a second region) and a redundant MediaPackage origin, with CloudFront origin-group failover between the two MediaPackage endpoints — if the primary origin returns errors or stalls, CloudFront fails the origin group over to the secondary. But detection without an actor is just a dashboard, so wire it explicitly: the manifest-age CloudWatch alarm fires an EventBridge rule that triggers a Lambda which (a) promotes the standby MediaLive channel in the second region via API and (b) updates the CloudFront origin group to point at the secondary MediaPackage endpoint. Be honest about the cost of this path: a channel-level failover (cold standby promoted in another region) takes 10–30 s and introduces a timeline discontinuity that forces the player to re-sync — viewers will rebuffer once. It's the recovery of last resort, below player backoff and dual-pipeline, not the first reflex.
Viewer session state (last position, entitlement, chosen language) lives in DynamoDB on-demand with Global Tables active-active across the two regions. We dropped DAX deliberately: viewer reads are high-cardinality (each viewer reads only their own record, which is not the hot-key pattern DAX accelerates), DAX is single-region (which fights the second-region standby), and it carries an always-on cluster cost. Global Tables gives cross-region session continuity for the failover case, and in on-demand mode it carries no idle capacity charge: you pay for storage and replicated writes, not for a standing cluster. The honest caveat is that Global Tables replication is typically sub-second but not zero, so an in-flight session write at the instant of a region failure can be lost. Players therefore degrade gracefully to a live-edge restart rather than blocking on a session lookup that might be a few hundred milliseconds stale.
And we put a circuit breaker in the edge logic via CloudFront Functions — not Lambda@Edge: if the origin is failing, serve a "we'll be right back" slate manifest rather than letting players hammer a dying origin into a deeper hole. CloudFront Functions is the right tier here because the slate fallback is a pure header/redirect rewrite (swap the manifest URL on an origin error) with no outbound network call — so we get sub-millisecond, no-cold-start, per-POP-unlimited execution at ~6× lower cost. Lambda@Edge would only be warranted if the breaker had to fetch live state from a remote store, which it doesn't.
On S3 itself: a reliability critic asked for cross-region replication of live segments. We're not doing it. S3 Standard is already multi-AZ within a region, and a true regional S3 outage is the extreme tail. For live content the CDN cache absorbs the blast radius for everything already delivered; only brand-new segments are at risk, and a live segment has a 2–6 s lifetime in the manifest — by the time CRR replicated a 2 s segment it would already be stale. Instead, the second-region standby channel writes to its own second-region S3 bucket, which gives geographic redundancy for free along the failover path. DVR/time-shift RPO is then bounded by time-since-last-successful-segment-write on the primary, which we document rather than paper over with CRR that buys nothing for the live edge.
Security and multi-tenancy: signed URLs, DRM, and tenant isolation
This is a multi-tenant platform — many broadcasters, paywalled content, geo-restrictions. How do you stop someone from hot-linking a premium stream, and how do you keep tenant A's content and viewer data away from tenant B?
Three concerns: access control, content protection, and isolation. Access control is CloudFront signed cookies / signed URLs. A viewer authenticates to the app, the app issues a short-lived signed cookie scoped to that channel's path prefix, and CloudFront rejects any segment/manifest request without a valid signature. Signed cookies over signed URLs because a stream is hundreds of segment requests: you sign once and the cookie covers the whole path, rather than re-signing every segment. The signed policy also carries an expiry and a source-IP range, and we layer CloudFront geo-restriction on top for licensing-bound geos. For the signing key itself, use a CloudFront key group with two active public keys so rotation is zero-downtime, store the private key in Secrets Manager with a 90-day rotation Lambda, and keep cookie TTL short (≤5 min). For mid-event compromise, put an entitlement-epoch field in the cookie payload that the edge CloudFront Function checks against a monotonic counter; bumping the epoch invalidates every outstanding cookie within one TTL window without waiting for natural expiry.
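The policy half of a signed cookie is just a small JSON document. This sketch builds a CloudFront custom policy scoped to one channel prefix, with an expiry and a source-IP range, and encodes it with CloudFront's URL-safe base64 substitutions. The RSA signing step (which produces the `CloudFront-Signature` cookie from the private key) is deliberately left out; the distribution hostname, path, and function names are placeholders.

```python
import base64
import json


def custom_policy(resource: str, expires_epoch: int, source_cidr: str) -> str:
    """Build a CloudFront custom policy statement as compact JSON."""
    statement = {
        "Resource": resource,
        "Condition": {
            "DateLessThan": {"AWS:EpochTime": expires_epoch},
            "IpAddress": {"AWS:SourceIp": source_cidr},
        },
    }
    return json.dumps({"Statement": [statement]}, separators=(",", ":"))


def cloudfront_b64(data: bytes) -> str:
    """Base64-encode, then swap the characters CloudFront treats as unsafe."""
    encoded = base64.b64encode(data).decode("ascii")
    return encoded.replace("+", "-").replace("=", "_").replace("/", "~")


if __name__ == "__main__":
    policy = custom_policy(
        resource="https://d111111abcdef8.cloudfront.net/live/ch42/*",
        expires_epoch=1_700_000_300,          # issue time plus a 5-minute TTL
        source_cidr="203.0.113.0/24",
    )
    print(policy)
    print(cloudfront_b64(policy.encode("utf-8")))  # value for the CloudFront-Policy cookie
```

The wildcard in `Resource` is what lets one cookie cover every segment and manifest under the channel prefix.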
In front of all of it, deploy AWS WAF on the CloudFront distribution with rate-based rules (a per-IP manifest-fetch rate limit) and the AWS Managed Rules Common Rule Set — that covers the SOC 2 CC6.6 boundary-protection control and HTTP-layer DDoS without custom code.
Content protection for premium content is DRM: MediaPackage integrates with a SPEKE-compliant key provider to encrypt segments, the provider serves Widevine / PlayReady / FairPlay licenses, and content keys are managed under AWS KMS. Crucially, the license endpoint must validate a DRM token (a JWT carrying tenant and entitlement, issued by the same entitlement service that issues the CloudFront signed cookie) before it hands back a decryption key. An unauthenticated license endpoint defeats DRM entirely no matter how well the segments are encrypted. Signed cookies stop hot-linking; DRM with a gated license endpoint stops the authenticated viewer from ripping the stream. They're different threat models and you need both for licensed content.
Cost, and did we ever leave AWS?
Put numbers on it. What does a channel cost streaming, what's the fixed floor, and where does the money actually go?
The cost shape is: a fixed per-channel encode cost that's basically constant regardless of audience, plus a delivery cost that scales linearly with viewers. They're orders of magnitude apart.
Encode (audience-independent). MediaLive has no idle state: you pay per running hour, not for a warm-but-idle channel. On-demand, a 6-output HD Standard channel (two pipelines) is ~$3–4 per channel-hour. Reserved pricing saves ~75% but bills 24/7 whether the channel runs or not (at ~730 h/month, roughly $550–730/month per channel), so reserve only channels with scheduled, predictable uptime, and otherwise start the channel via API on broadcast start and stop it immediately after. MediaPackage origin packaging is ~$0.035/GB of output it processes.
Contribution (fixed). Dual SRT flows through MediaConnect are ~$1.50–2.50/flow/h, so ~$3–5/h total; for an always-on channel (~720 h/month) that's ~$2,200–3,600/month. Negligible against CDN at event scale, but a material fixed cost for a multi-tenant platform running many channels.
Origin Shield (fixed, tiny). Charged per request forwarded POP→Shield. At ~600 POP requests per segment and 600 segments/h that's 360k req/h at $0.009/10k ≈ ~$0.32/h — for a 2-hour event, ~$0.65. Rounding error, but it belongs in the math so the picture is complete.
Deliver (audience-linear). From beat 1, 1M viewers at 3 Mbps is ~1.35 PB/hour. At CloudFront's $0.085/GB list rate:
$$ 1.35 \times 10^{6}\ \text{GB} \times \$0.085 \approx \$115{,}000\ \text{per hour} $$
At the ~$0.02/GB enterprise/committed rate the deepest tiers reach, the same hour is ~$27,000. Smaller cut for intuition: 10k concurrent at 3 Mbps = 30 Gbps ⇒ ~13.5 TB/hour ⇒ ~$1,150/hour CDN at list (~$270/h enterprise); a 2-hour event is ~$2,300 list or ~$540 committed. The encode for that same stream is single-digit dollars/hour. CDN egress is the overwhelming majority of the bill at scale. Every dollar of cost engineering should target egress: tune the ABR ladder so the modal viewer sits at 3 Mbps not 6, use per-title/per-scene encoding to shave bitrate at equal quality, and negotiate committed-use CloudFront pricing — a 10% bitrate reduction is worth more than eliminating the entire encode tier.
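To make the encode-versus-egress gap concrete, the sketch below prices the 10k-viewer, 2-hour example and compares a 10% bitrate saving against the whole encode bill. It assumes a flat per-GB rate and a $4/h encode cost taken from the upper end of the range above; the function name is illustrative.

```python
from typing import Dict


def event_cost_usd(viewers: int, avg_mbps: float, hours: float,
                   usd_per_gb: float, encode_usd_per_hour: float) -> Dict[str, float]:
    """Split an event's cost into audience-linear delivery and fixed encode."""
    gigabytes = viewers * avg_mbps * 1e6 * 3600 * hours / 8 / 1e9
    return {
        "delivery": gigabytes * usd_per_gb,
        "encode": encode_usd_per_hour * hours,
    }


if __name__ == "__main__":
    cost = event_cost_usd(10_000, 3.0, hours=2, usd_per_gb=0.085,
                          encode_usd_per_hour=4.0)
    print(cost)
    # A 10% bitrate cut scales delivery by 0.9; compare the saving to encode.
    saving = cost["delivery"] * 0.10
    print(saving, "saved vs", cost["encode"], "total encode")
```

Even at this small audience the 10% saving is far larger than the entire encode line, and the gap widens linearly with viewers.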
Last question. Did you ever leave AWS, and what single requirement would have made you?
No — the entire pipeline is managed AWS end to end: MediaConnect → MediaLive → MediaPackage → CloudFront, with S3 for segments, DynamoDB on-demand + Global Tables for sessions, KMS for keys, WAF on the distribution, CloudWatch + EventBridge + Lambda for the manifest-age canary and failover automation, and CloudFront Functions for the circuit breaker. The only external dependency is the broadcaster's own encoder pushing RTMP/SRT in, which by definition lives outside our platform.
The one requirement that would push me off the managed path is the same one that pushed Twitch and Netflix off it: proprietary per-frame codec metadata or a custom codec / quality-control loop that MediaLive doesn't expose. If the product demanded, say, frame-accurate proprietary watermarking or a bespoke perceptual-quality encoder, I'd run a custom transcoder on ECS Fargate (staying on AWS) and keep MediaPackage + CloudFront downstream unchanged. But I'd make the business prove that requirement first, because it trades a managed dual-pipeline SLA for a media-infra team I'd have to build and run at 3am.