SYSTEM DESIGN STUDIO - Podcast Script
Topic: Global image delivery with CloudFront and edge-side transformation
Date: 2026-06-28 | Difficulty: staff

Two voices: PRINCIPAL (skeptical interviewer) and STAFF (proposing engineer).
For ElevenLabs Projects: assign each speaker label to a distinct voice ID.

---

[INTRO]

An image CDN looks like the easy part of the stack: put CloudFront in front of S3, append ?w=300&fmt=webp when the frontend needs a thumbnail, transform on the fly. That sketch survives a demo and dies in production for one reason nobody draws on the whiteboard: every distinct query string is a distinct cache object. A CDN whose job is to absorb reads becomes a pass-through that forwards almost everything to a Lambda, and your "$72k/month of egress" design quietly grows a "$272k/year of compute" tumour. The interesting version delivers a billion transformed images a day, survives a product launch and a DMCA takedown at 3am, isolates tenants who each bring their own signing keys, and never leaves AWS.

=== The problem, and the numbers we design to ===

PRINCIPAL: Image delivery. Before any boxes: what are we actually building, and at what scale?

STAFF: A multi-tenant image origin-and-delivery platform. Tenants upload originals (product photos, user avatars, marketing assets) and our clients request derivatives: a specific width, height, format, quality, crop. The frontend says "give me this image at 300 px wide, WebP, quality 80" and we return it from the nearest edge in tens of milliseconds. Scale target: 1 billion image requests/day, average delivered object 40 KB, served globally. SLO: p95 edge latency <= 50 ms on a cache hit, and a cache-hit ratio we can actually defend. I'm going to argue the whole design is a fight to keep that ratio above 90%.

PRINCIPAL: Why is hit ratio the headline number? Egress is the same bytes either way.

STAFF: Because a miss isn't just "fetch from S3" here. It's "fetch the original, run an image transform, re-encode."
A hit costs us a cached byte. A miss costs us a Lambda invocation plus CPU plus an S3 GET. The egress bill is roughly fixed by traffic; the miss bill is the variable we control, and it's set almost entirely by how many distinct cache objects we let exist. Get the cache key wrong and hit ratio collapses toward zero, at which point CloudFront is an expensive reverse proxy and the real system is close to a billion Lambda runs a day. So yes, hit ratio is the headline.

=== Cache-key explosion: the failure nobody draws ===

PRINCIPAL: Naive version. CloudFront in front of S3, transform on the query string. Walk me to where it breaks.

STAFF: The naive instinct is to forward the full query string to the origin and let CloudFront cache per-URL. It's tempting because it Just Works in a demo: every distinct request gets its own cache entry, correctness is trivial. The failure is that the cache key is now an unbounded namespace. CloudFront's default, if you forward all query strings, is to treat ?w=300&h=200 and ?h=200&w=300 as two different objects. Same pixels, two cache entries. Add a tracking param, say ?utm_source=email, and now every marketing campaign mints a fresh, separately cached copy of every image. Add a malicious client appending ?cachebust= with a random value and they can drive your hit ratio to zero on demand.

PRINCIPAL: So you allowlist params. Where does that logic run? Not in Lambda, I hope. You can't afford a function call on a hit.

STAFF: Correct, and that's the crux. Two mechanisms, layered. First, a CloudFront cache policy that allowlists exactly the four params that affect the bytes (w, h, fmt, q) and ignores everything else for cache-key purposes. That alone kills utm_* and session junk for free; it's pure configuration, no compute.
Second, a CloudFront Function on the viewer-request event that normalizes what's left: sort the params into canonical order, clamp w and h to an allowed set of breakpoints (snap 287 px up to 320), lowercase the format, and reject anything outside policy. CloudFront Functions run in ~1 ms at the edge with no cold start, at $0.10 per million invocations, so I can afford to run this on the viewer side of every request, hit or miss.

=== The thundering herd on a cold object ===

PRINCIPAL: Product launch. A new hero image goes live and 10,000 edge requests hit it in the same second, all cold. What does your origin see?

STAFF: Without mitigation, something close to 10,000 simultaneous misses, each independently deciding "not in cache, go to origin." That's 10,000 concurrent S3 GETs and 10,000 concurrent transform invocations for the same output. The transform is the dangerous part. It's CPU, and 10,000 cold Lambda@Edge invocations for one image is a synchronized spike that does no useful work, because 9,999 of them produce the identical bytes.

PRINCIPAL: So you want request coalescing. CloudFront doesn't collapse misses across edge locations on its own. What's the AWS answer?

STAFF: Origin Shield. It's a designated regional cache layer that sits between all the edge POPs and your origin. Every edge miss for a given object funnels through one Shield region, and Shield collapses concurrent misses for the same key into a single origin fetch; the rest wait on that in-flight request. Ten thousand edge misses become one transform. That's the textbook fix.

PRINCIPAL: Then turn it on everywhere and move on?

STAFF: No, and this is where people cargo-cult it. Origin Shield is not free, and for a plain S3 origin it can cost more than it saves. Let me show the arithmetic, because it flips the obvious answer.

=== CloudFront Functions vs Lambda@Edge: the 1ms/50ms split ===

PRINCIPAL: You've used both already. Make the boundary explicit. Why not one compute primitive for everything?
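[SHOW NOTES] The viewer-request normalization from the cache-key section is exactly the kind of no-I/O string work this question is about. A minimal sketch of that logic follows, written in Python for readability; a real CloudFront Function would be JavaScript. The breakpoint list, the format set, the 1 to 100 quality range, and the names `normalize_query` and `RejectRequest` are assumptions for illustration, not part of the design as stated.

```python
import re
from urllib.parse import parse_qsl, urlencode

# The four params named in the script; everything else is dropped from the key.
ALLOWED_PARAMS = ("w", "h", "fmt", "q")
# Assumed policy values: the script does not list breakpoints or formats.
BREAKPOINTS = (160, 320, 640, 1280, 1920)
ALLOWED_FORMATS = frozenset({"webp", "avif", "jpeg", "png"})
_PLAIN_INT = re.compile(r"[0-9]{1,5}")


class RejectRequest(ValueError):
    """Request is outside policy (would become a 4xx response at the edge)."""


def _parse_int(name: str, value: str) -> int:
    # Accept only plain ASCII digits, so "+3", " 3" and "3.0" are rejected.
    if not _PLAIN_INT.fullmatch(value):
        raise RejectRequest(f"{name} must be a plain integer")
    return int(value)


def snap_up(value: int) -> int:
    """Snap a requested dimension up to the nearest allowed breakpoint."""
    if value < 1:
        raise RejectRequest("dimension must be positive")
    for breakpoint_px in BREAKPOINTS:
        if value <= breakpoint_px:
            return breakpoint_px
    raise RejectRequest("dimension exceeds the largest breakpoint")


def normalize_query(raw_query: str) -> str:
    """Return the canonical query string used as the cache key."""
    # Keep only allowlisted params; if a param repeats, the last one wins.
    kept = {
        key: value
        for key, value in parse_qsl(raw_query, keep_blank_values=True)
        if key in ALLOWED_PARAMS
    }
    canonical = {}
    for dim in ("w", "h"):
        if dim in kept:
            canonical[dim] = str(snap_up(_parse_int(dim, kept[dim])))
    if "fmt" in kept:
        fmt = kept["fmt"].lower()
        if fmt not in ALLOWED_FORMATS:
            raise RejectRequest("unsupported format")
        canonical["fmt"] = fmt
    if "q" in kept:
        quality = _parse_int("q", kept["q"])
        if not 1 <= quality <= 100:  # assumed valid range
            raise RejectRequest("quality out of range")
        canonical["q"] = str(quality)
    # Sorting gives one canonical order regardless of how the client wrote it.
    return urlencode(sorted(canonical.items()))
```

With these assumed breakpoints, ?w=300&h=200 and ?h=200&w=300 both normalize to h=320&w=320, so the two spellings share one cache object, and a request carrying only ?cachebust=123 normalizes to an empty query string.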
STAFF: Because they're built for different physics. CloudFront Functions are a sandboxed JS runtime that runs on the POP itself: ~1 ms CPU budget, no cold start, no network, no filesystem, ~2 MB memory. Lambda@Edge is real Lambda running in regional edge caches: up to 5 s (viewer) or 30 s (origin) timeout, up to 3008 MB, full network and a real runtime, but with cold starts of 50-500 ms. The rule I apply: per-request, no-I/O string work goes to Functions; anything that needs the image bytes, a library, or a network call goes to Lambda@Edge, and only on the origin-request event so it runs on misses only.

STAFF: So the topology is: viewer-request → CloudFront Function (normalize key, validate signed URL claims, hotlink check) on every request. Origin-request → Lambda@Edge (fetch original from S3, transform with sharp/libvips, re-encode) on misses only. The image-decoding library and the CPU-heavy resize physically cannot run in a 1 ms Function; that alone forces the split.

=== Geo routing and multi-region origin failover ===

PRINCIPAL: CloudFront is global, but your origin and your transform Lambdas live in regions. A user in Tokyo misses cache. Where does the transform happen, and what if that region is down?

STAFF: Lambda@Edge origin-request executes in the regional edge cache nearest the POP, so the Tokyo miss runs the transform in or near ap-northeast-1, close to the user. The originals live in S3, and I replicate the original bucket across three regions (us-east-1, eu-west-1, ap-northeast-1) with S3 Cross-Region Replication, so the transform's source fetch is regional too. For failover I use CloudFront Origin Groups: each cache behavior has a primary origin and a secondary, and CloudFront fails over on configured status codes (500/502/503/504) automatically. Primary is the in-region S3 bucket; secondary is the next-nearest replica.

PRINCIPAL: Why not Route 53 latency-based routing for that instead of Origin Groups?

STAFF: I use both, for different jobs.
Route 53 latency-based routing is great for steering to a regional endpoint when the origin is a dynamic service behind a load balancer. But S3 origins and the failover decision are best handled inside CloudFront with Origin Groups, because CloudFront makes the retry decision per-request on the actual response code, with no DNS-TTL lag. A Route 53 health-check flip can take tens of seconds and is cached by resolvers. Origin Groups fail over on the very request that saw the 5xx. So: Origin Groups for primary/secondary S3 failover; Route 53 latency routing reserved for any dynamic control-plane endpoints (uploads, admin API).

=== Cache invalidation at scale: launches and DMCA ===

PRINCIPAL: 3am. Legal sends a DMCA takedown: a copyrighted image must be gone from every edge in minutes. Also, a tenant just re-uploaded 50,000 product photos for a launch. Walk me through invalidation.

STAFF: These are two different problems and conflating them is the classic mistake. For the launch re-upload, I do not invalidate. I use versioned URLs. The path is /img/v2/product-123.jpg; bumping the version to v3 mints a fresh key that's simply never been cached, so it's a guaranteed miss with no purge needed and no race between "new bytes uploaded" and "old bytes still cached." CloudFront's invalidation limits are brutal: 3,000 file paths and 15 wildcard paths in progress at any one time. Pushing 50,000 object paths through that is slow and runs straight into the quota. Versioning sidesteps it entirely.

PRINCIPAL: Versioning works when you control the change. DMCA is the opposite: the URL must stop serving the same bytes. You can't version your way out of that.

STAFF: Right. For takedowns I do three things at once: (1) delete the original from S3 so no future transform can produce it; (2) issue a wildcard CloudFront invalidation for that image's derivatives (/img/*/product-123*); and (3) rely on a short-TTL cache tier for legally sensitive content.
Anything flagged takedown-eligible is served under a path with a 60 s max-TTL behavior, so even if an invalidation is slow, the blast radius is one minute. To fan out a batch of takedowns without tripping the wildcard invalidation limit, I drive invalidations through Step Functions Express: fan out up to a few hundred CreateInvalidation calls with controlled concurrency and retry, rather than a naive loop that hits the throttle and silently drops paths.

=== Security: signed URLs, hotlinking, SSRF, tenant isolation ===

PRINCIPAL: Multi-tenant. Tenant A must never serve from tenant B's keys, and nobody should be able to make your transformer fetch an arbitrary URL. Convince me.

STAFF: Four controls. First, signed URLs with Trusted Key Groups. Each tenant gets its own key group (up to 5 public keys, so we can rotate with an overlap window: push the new key, sign with it, retire the old after ~30 min). The CloudFront Function on viewer-request validates the signature and the policy (expiry, path scope) before anything else runs. Tenant A's URLs are signed by tenant A's private key; a forged or cross-tenant URL fails validation at the edge in 1 ms. Second, the one that keeps me up: SSRF. The transformer must never fetch a user-supplied URL. The original's S3 key is derived server-side from the validated request path and the tenant id; the Lambda@Edge transform only ever does GetObject against a known bucket with a key it constructs, never an arbitrary fetch. There is no code path where a request body or query param becomes a URL the transformer dials. That's a hard architectural invariant, not a filter. Third, hotlink protection: an AWS WAF rule on the distribution matching the Referer header against the tenant's allowed domains, plus a WAF rate-based rule per source IP to blunt scraping.
Fourth, tenant data isolation: per-tenant key prefixes in S3, IAM and bucket policies scoping the transform role to its prefix, and per-tenant CloudFront behaviors so signing keys and cache policies can't bleed across tenants.

PRINCIPAL: GDPR erasure. A user invokes right-to-be-forgotten on an avatar. What's the deletion path and how do you prove it's gone?

STAFF: Delete the original and all replicas from S3 (CRR delete propagation), invalidate the derivatives at CloudFront, and because erasure-eligible content lives under a short-TTL behavior, the cache purge is bounded to ~60 s. We log the S3 delete event (CloudTrail) and the invalidation id as the audit artifact; that's the proof for the data-subject request. This maps cleanly to GDPR Art. 17 (erasure), SOC 2 confidentiality, and ISO 27001 access-control objectives; the per-tenant key-group and IAM-prefix scoping is the NIST AC-3/AC-6 least-privilege story.

=== Cost, and the pre-compute escape hatch ===

PRINCIPAL: Put the whole bill on the table. Then tell me the one change that could halve it.

STAFF: The transform bill exists because we re-derive variants on demand. If the variant set is small and predictable, and for a product catalogue it usually is, pre-compute instead. On upload, an S3 event triggers a Lambda (or AWS Step Functions for the fan-out) that renders all 8 standard variants and writes them to S3 as static objects. Now every request is a plain cache-fillable S3 GET; the on-the-fly transform Lambda essentially disappears.

=== Failure and recovery ===

PRINCIPAL: 3am, your transform Lambda's region (us-east-1) is degraded and error rates spike. What does a user in New York experience?

STAFF: For the 80%+ of requests that are cache hits: nothing. They're served from the POP, the origin is irrelevant. For misses where the transform fails: CloudFront Origin Groups fail the source fetch over to the secondary region's S3 replica, and the transform retries there.
If transforms are broadly failing, the pre-computed static variants save us: standard sizes are already materialized in S3 and served as plain GETs with no Lambda in the path at all. The only requests that actually fail are cold misses for non-standard derivatives during the incident window, which is a small slice. We also keep stale-while-revalidate behavior so an expiring-but-present cached object is served while the refresh happens in the background. A slow origin degrades freshness, not availability.

=== Did we ever leave AWS? ===

PRINCIPAL: Final question. Anything in here that isn't AWS-native, or did you stay on the platform the whole way?

STAFF: We never left. Delivery is CloudFront; storage and originals are S3 with Cross-Region Replication; edge compute is CloudFront Functions (viewer) and Lambda@Edge (origin); coalescing is Origin Shield; failover is Origin Groups; geo and dynamic-endpoint routing is Route 53; auth is CloudFront Trusted Key Groups; the WAF is AWS WAF; pre-compute orchestration is Step Functions with S3 events; audit is CloudTrail. The one component people assume forces you off-platform, the image transform itself, runs inside Lambda@Edge using an open-source library (sharp/libvips) bundled into the function. That's still "on AWS": it's a library in a managed runtime, not a service we operate. The only requirement that would push us off would be a transform AWS's runtimes can't host, say a GPU-bound ML upscaler exceeding Lambda's resources. At that point I'd reach for an ECS service on GPU-backed EC2 instances as the origin, since Fargate does not offer GPUs, and that is still on AWS. There is no hard requirement in this design that leaves the platform.
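
[SHOW NOTES] The sizing figures from the opening section can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the script (1 billion requests/day, 40 KB average object). The sample hit ratios and the helper names are illustrative assumptions, and 40 KB is taken as 40,000 bytes.

```python
REQUESTS_PER_DAY = 1_000_000_000  # scale target stated in the script
AVG_OBJECT_BYTES = 40_000         # 40 KB average object, decimal KB assumed
SECONDS_PER_DAY = 86_400


def misses_per_day(hit_ratio_pct: int) -> int:
    """Origin-side transforms per day for a given cache-hit ratio (percent)."""
    if not 0 <= hit_ratio_pct <= 100:
        raise ValueError("hit ratio must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding on the percentages.
    return REQUESTS_PER_DAY * (100 - hit_ratio_pct) // 100


def misses_per_second(hit_ratio_pct: int) -> float:
    """Average miss rate, ignoring daily peaks and launch spikes."""
    return misses_per_day(hit_ratio_pct) / SECONDS_PER_DAY


def egress_tb_per_day() -> float:
    """Bytes delivered to viewers per day, independent of hit ratio."""
    return REQUESTS_PER_DAY * AVG_OBJECT_BYTES / 1e12


if __name__ == "__main__":
    print(f"egress: {egress_tb_per_day():.0f} TB/day")
    for pct in (95, 90, 80, 0):
        print(f"hit ratio {pct}%: {misses_per_day(pct):,} transforms/day, "
              f"about {misses_per_second(pct):,.0f}/s on average")
```

Under these figures, every ten points of hit ratio lost adds 100 million transforms a day, while egress stays at roughly 40 TB/day regardless. That is the sense in which the miss bill, not the egress bill, is the variable the design controls.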