Patterns from this design

Adaptive bitrate live streaming pipeline

media-cdn

Keyframe-align every rendition of the ABR ladder

When: You transcode one source into multiple bitrate renditions and want players to switch between them mid-stream as bandwidth changes. Seamless switching is only possible if segment N of every rendition covers the identical wall-clock interval and starts on an aligned keyframe - otherwise the player glitches at the splice or stalls waiting for the next GOP boundary.
AWS: Run the ladder as a single AWS Elemental MediaLive channel with a fixed GOP length that divides the segment duration exactly (each segment is a whole number of GOPs) and keyframes forced at segment boundaries across all outputs, so every rendition's segments share identical presentation timestamps. MediaLive keeps the renditions aligned when GOP and segment length are set consistently; a hand-rolled FFmpeg ladder with a different -g per rendition silently breaks switching.
Trade-off: Forcing keyframes at fixed boundaries spends bitrate - you can't let the encoder place IDR frames purely where the content wants them, so you pay slightly more bits for the same quality in exchange for switchability. The GOP-to-segment coupling also constrains how short you can make segments before keyframe overhead dominates.
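
The alignment rule can be checked mechanically before a ladder goes live. The following is a minimal sketch, assuming a constant frame rate; the ladder description and its 'name' and 'gop_frames' keys are hypothetical and are not a MediaLive or FFmpeg schema.

```python
from fractions import Fraction


def alignment_errors(ladder, fps, segment_seconds):
    """Return a list of problems that would break seamless switching.

    ladder: list of dicts with hypothetical keys 'name' and 'gop_frames'.
    fps: frame rate as a Fraction, e.g. Fraction(30000, 1001).
    """
    errors = []
    # Every rendition must use the same GOP length.
    gops = {r["gop_frames"] for r in ladder}
    if len(gops) > 1:
        errors.append(f"renditions disagree on GOP length: {sorted(gops)}")
    # A segment must contain a whole number of frames...
    segment_frames = Fraction(segment_seconds) * fps
    if segment_frames.denominator != 1:
        errors.append("segment duration is not a whole number of frames")
    # ...and a whole number of GOPs, so it always starts on a keyframe.
    for r in ladder:
        if segment_frames % r["gop_frames"] != 0:
            errors.append(f"{r['name']}: segment is not a whole number of GOPs")
    return errors
```

With 30 fps and 6-second segments (180 frames), a ladder where every rendition uses a 60-frame GOP passes, while a rendition with a 48-frame GOP is flagged twice: once for disagreeing with the others and once for not dividing the segment.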

realtime

Treat the manifest as a heartbeat with a seconds-scale TTL, not as uncacheable

When: Live HLS/DASH players re-poll the manifest every 1-2s and freeze - all at once - if it stops advancing for ~10s. The manifest is simultaneously the highest-QPS object and the one that must be freshest, so getting its cache and freshness model wrong desyncs the whole audience together.
AWS: Serve the manifest from MediaPackage through CloudFront with a short ~1s Cache-Control TTL (not no-store) so request collapsing lets the origin see roughly one fetch per second per POP instead of one per viewer, while the TTL still bounds live-edge staleness. Monitor manifest age (now minus newest-segment PTS) as a CloudWatch metric and alarm on it - a stale 200 is more dangerous than a 500. When sub-5s latency is required, publish ~200ms LL-HLS partial segments (parts) with blocking playlist reload.
Trade-off: A 1s TTL means viewers can be up to a second behind the absolute edge, and marking the manifest truly non-cacheable to chase that second defeats request collapsing and dumps the full poll storm (millions of req/s) onto the origin. LL-HLS parts cut latency but multiply request volume ~30x and hold connections open via blocking reload - at large scale that can hit a single MediaPackage endpoint's hard quota (~1,000 manifest req/s, ~500 segment req/s), forcing multiple endpoints behind path-based CloudFront routing or a fan-out tier that absorbs the blocking long-polls before they reach the origin.
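
The difference between a 1s TTL and a non-cacheable manifest is easiest to see as arithmetic. This is a rough estimate only, assuming perfect request collapsing at each POP and no intermediate cache tier; the function name and the viewer count used below are illustrative.

```python
def manifest_origin_rps(viewers, poll_interval_s, pops, ttl_s):
    """Estimate manifest requests per second reaching the origin.

    ttl_s == 0 models a non-cacheable manifest: every poll goes to origin.
    Otherwise each POP refetches at most once per TTL (request collapsing),
    and never more often than the viewers actually poll.
    """
    viewer_rps = viewers / poll_interval_s
    if ttl_s <= 0:
        return viewer_rps
    return min(viewer_rps, pops / ttl_s)
```

For 2,000,000 viewers polling every 2s across 600 POPs, a 1s TTL yields about 600 req/s at the origin, while a non-cacheable manifest yields 1,000,000 req/s.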

ingestion

Redundant active-active ingest with seamless input failover

When: A single live contribution feed over bare TCP RTMP stalls the entire downstream audience on a single packet-loss event, and a single ingest endpoint is a hard single point of failure for the whole channel - the part of the path you control least (the broadcaster's last mile) is exactly where loss happens.
AWS: Carry premium contribution over AWS Elemental MediaConnect (SRT/Zixi-grade transport with ARQ packet recovery) with two source flows, feeding two MediaLive input endpoints. MediaLive automatic input failover holds the secondary hot and cuts over on loss-of-signal or black-frame/silence detection while preserving a continuous output timeline, so the manifest never gaps and viewers never notice. Commodity streams can push dual RTMP directly to MediaLive inputs.
Trade-off: You pay to ingest and stand by a second pipeline that is idle nearly all the time, and the premium path gives up RTMP's universality - the encoder must speak SRT or push twice. For a live event the standby cost is rounding error against egress, but for low-stakes streams it may not be worth it.
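
The failover decision itself can be modelled in a few lines. This is a toy model of the behaviour described above, not MediaLive's implementation: the tick granularity, the loss_threshold parameter, and the non-revertive policy (stay on the secondary until it fails) are all assumptions.

```python
def select_active(samples, loss_threshold=3):
    """Replay per-tick health samples and return the active input per tick.

    samples: list of (primary_ok, secondary_ok) booleans, one per tick.
    The active input is abandoned after `loss_threshold` consecutive bad
    ticks, but only if the other input is healthy at that moment.
    """
    active, bad_ticks, timeline = "primary", 0, []
    for primary_ok, secondary_ok in samples:
        ok = {"primary": primary_ok, "secondary": secondary_ok}
        other = "secondary" if active == "primary" else "primary"
        # Count consecutive bad ticks on whichever input is active.
        bad_ticks = 0 if ok[active] else bad_ticks + 1
        if bad_ticks >= loss_threshold and ok[other]:
            active, bad_ticks = other, 0
        timeline.append(active)
    return timeline
```

Because the secondary is already decoding, the switch happens within a single tick once the threshold is crossed; the output timeline has one entry per tick regardless of which input supplied it.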

media-cdn

Origin Shield to make live-edge origin load independent of audience size

When: The newest segment at the live edge has never been requested, so it is uncached by definition, and millions of players want it the instant it appears - a thundering herd straight at the origin. Edge request collapsing alone still leaves one fetch per POP (hundreds) per new segment, so origin load tracks POP count instead of staying at one fetch per segment.
AWS: Enable CloudFront Origin Shield in the region nearest MediaPackage so all ~600 edge POPs route through one regional caching tier before reaching origin. The funnel becomes viewers to POPs to one Origin Shield to origin, collapsing the herd into roughly one origin fetch per segment - making origin load flat and audience-independent (a 10M-viewer event costs the origin the same as a 10k-viewer one).
Trade-off: Adds one cache hop of latency on a true edge miss and a per-request Origin Shield charge, and concentrates origin-facing traffic through a single regional tier - so that region's health becomes a dependency you must monitor. For tiny audiences the extra hop is pure overhead.
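
The funnel can be illustrated with a small sequential simulation. It is a sketch only: real request collapsing handles concurrent misses, which this model approximates by caching after the first request, and the Tier class and segment name are hypothetical.

```python
class Tier:
    """A cache tier that asks `upstream` only on a miss."""

    def __init__(self, upstream=None):
        self.upstream = upstream
        self.cache = set()
        self.requests = 0  # requests that reached this tier

    def get(self, key):
        self.requests += 1
        if key not in self.cache and self.upstream is not None:
            self.upstream.get(key)
        self.cache.add(key)


def origin_fetches(viewers, pops, use_shield):
    """Count origin requests for one new segment."""
    origin = Tier()
    shield = Tier(upstream=origin) if use_shield else origin
    edges = [Tier(upstream=shield) for _ in range(pops)]
    for v in range(viewers):
        edges[v % pops].get("segment-0001.m4s")
    return origin.requests
```

With 600 POPs, the origin sees 600 requests per segment without the shield tier and 1 with it, whether the audience is 10,000 or 100,000 viewers.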

caching

Split cache policy by mutability - pin immutable segments, revalidate the live manifest

When: A live stream mixes two object classes with opposite cache requirements: segments that are immutable once written (a stale copy is still correct) and a manifest that must always reflect the live edge (a stale copy desyncs everyone). One blanket TTL is wrong for at least one of them.
AWS: Set the segment Cache-Control max-age to the segment duration plus a ~10s buffer so a million viewers fetching the same segment hit warm CloudFront cache, and set the manifest to a ~1s TTL so it effectively revalidates per poll while still benefiting from request collapsing. Origin-group failover on CloudFront covers a stalled primary MediaPackage endpoint. Bound staleness with a manifest-age canary rather than trusting HTTP 200.
Trade-off: Two cache policies mean two places to misconfigure, and the segment TTL must be sized against the DVR/time-shift window - too short and time-shifted viewers miss the edge cache and fall through to the origin, too long and storage and stale-edge risk grow. The 1s manifest TTL caps how fresh the live edge can be.
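
One way to keep the two policies from drifting apart is to derive both from a single function keyed on the object's mutability class. This is a minimal sketch: the suffix list, the function name, and the defaults are assumptions, and where the header is actually set (origin, packager, or CDN policy) is left open.

```python
import math

MANIFEST_SUFFIXES = (".m3u8", ".mpd")


def cache_control(path, segment_seconds, buffer_seconds=10, manifest_ttl=1):
    """Pick a Cache-Control value from the object's mutability class."""
    if path.lower().endswith(MANIFEST_SUFFIXES):
        # Mutable: short TTL so the edge refetches roughly once per poll.
        return f"public, max-age={manifest_ttl}"
    # Immutable once written: segment duration plus a buffer.
    return f"public, max-age={math.ceil(segment_seconds) + buffer_seconds}"
```

For 6-second segments this yields max-age=1 for /live/ch1/index.m3u8 and max-age=16 for /live/ch1/seg_42.m4s.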

realtime

Close the loop - wire the staleness signal to an actuator, not a dashboard

When: A manifest-age canary that only alarms is a dashboard: it detects the frozen heartbeat but nothing acts on it. For channel-level recovery (a stalled MediaPackage origin or a wedged MediaLive channel) you need detection to mechanically trigger promotion of a standby channel and a CloudFront origin-group repoint, fast and without a human at 3am.
AWS: Compute manifest age at the edge via a CloudFront Functions viewer-response handler and push it as a per-channel CloudWatch embedded metric (no external Synthetics poller, so no O(channels) polling load and no CDN cache masking). CloudFront Functions can read headers but not the response body, so this assumes the origin also exposes the newest EXT-X-PROGRAM-DATE-TIME value in a response header. A CloudWatch alarm fires an EventBridge rule that invokes a Lambda which promotes the standby MediaLive channel in a second region via API and updates the CloudFront origin group to the secondary MediaPackage endpoint. Viewer session state on DynamoDB on-demand with Global Tables gives cross-region continuity; the slate-fallback circuit breaker runs in CloudFront Functions, not Lambda@Edge, because it is a pure header/redirect rewrite with no outbound call.
Trade-off: A cold standby promoted in another region takes 10-30s and forces a player timeline re-sync, so viewers rebuffer once - it is the recovery of last resort, tried only after player backoff and dual-pipeline failover, not the first reflex. Global Tables replication is sub-second but not zero, so an in-flight session write can be lost at the instant of region failure; players must degrade to a live-edge restart rather than block on a session lookup.
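
The staleness signal itself is a small computation over the playlist. The sketch below derives manifest age from the last EXT-X-PROGRAM-DATE-TIME tag plus the EXTINF durations that follow it; it assumes a plain HLS media playlist with UTC timestamps and says nothing about where the code runs. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def manifest_age_seconds(playlist_text, now):
    """Seconds between `now` and the end of the newest segment.

    Returns None if the playlist carries no EXT-X-PROGRAM-DATE-TIME tag.
    """
    anchor, elapsed = None, 0.0
    for line in playlist_text.splitlines():
        line = line.strip()
        if line.startswith("#EXT-X-PROGRAM-DATE-TIME:"):
            # The tag dates the next segment; restart the running total.
            stamp = line.split(":", 1)[1].replace("Z", "+00:00")
            anchor, elapsed = datetime.fromisoformat(stamp), 0.0
        elif line.startswith("#EXTINF:") and anchor is not None:
            elapsed += float(line.split(":", 1)[1].split(",")[0])
    if anchor is None:
        return None
    newest_end = anchor + timedelta(seconds=elapsed)
    return (now - newest_end).total_seconds()
```

A playlist dated 00:00:00 with two 6-second segments, checked at 00:00:30, reports an age of 18 seconds. An alarm threshold somewhere below the ~10s player freeze window would give the actuator time to act, though the exact value is a judgment call.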