Architecture review

Global image delivery with CloudFront and edge-side transformation

Serving 1B transformed images/day where the real adversary is cache-key explosion turning your CDN into a pass-through and your origin into a Lambda bill.


An image CDN looks like the easy part of the stack: put CloudFront in front of S3, append ?w=300&fmt=webp when the frontend needs a thumbnail, transform on the fly. That sketch survives a demo and dies in production for one reason nobody draws on the whiteboard: every distinct query string is a distinct cache object. A CDN whose job is to absorb reads becomes a pass-through that forwards almost everything to a Lambda, and your "$72k/month of egress" design quietly grows a "$272k/year of compute" tumour. The interesting version delivers a billion transformed images a day, survives a product launch and a DMCA takedown at 3am, isolates tenants who each bring their own signing keys, and never leaves AWS.

The problem, and the numbers we design to

Principal

Image delivery. Before any boxes — what are we actually building, and at what scale?

Staff

A multi-tenant image origin-and-delivery platform. Tenants upload originals (product photos, user avatars, marketing assets) and our clients request derivatives: a specific width, height, format, quality, crop. The frontend says "give me this image at 300 px wide, WebP, quality 80" and we return it from the nearest edge in tens of milliseconds. Scale target: 1 billion image requests/day, average delivered object 40 KB, served globally. SLO: p95 edge latency ≤ 50 ms on a cache hit, and a cache-hit ratio we can actually defend. I'm going to argue the whole design is a fight to keep that ratio above 80%, the figure every cost estimate below assumes.

Principal

Why is hit ratio the headline number? Egress is the same bytes either way.

Staff

Because a miss isn't just "fetch from S3" here; it's "fetch the original, run an image transform, re-encode." A hit costs us a cached byte. A miss costs us a Lambda invocation plus CPU plus an S3 GET. The egress bill is roughly fixed by traffic; the miss bill is the variable we control, and it's set almost entirely by how many distinct cache objects we let exist. Get the cache key wrong and hit ratio collapses toward zero, at which point CloudFront is an expensive reverse proxy and the real system is a billion Lambda runs a day instead of the 200 million we budget for at an 80% hit ratio. So yes, hit ratio is the headline.

Napkin math — what 1B/day actually weighs

Traffic: $1\text{B/day} \div 86400\text{ s} \approx 11{,}600\text{ req/s}$ average, call it ~35k req/s at peak (3× diurnal). Egress at 40 KB/object:

$$ 10^{9} \times 40\ \text{KB} = 4 \times 10^{13}\ \text{B/day} = 40\ \text{TB/day} \approx 1{,}200\ \text{TB/month} $$

At CloudFront's volume tier (~$0.06/GB after committed-use discounts), egress alone is

$$ 1{,}200{,}000\ \text{GB} \times \$0.06 = \$72{,}000/\text{month}. $$

That number is fixed. Everything else in this transcript is about not adding a second number of the same magnitude on top of it.
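The same arithmetic as a small script, handy for re-running with different inputs. This is a sketch: the function name and the 30-day month are assumptions, and the inputs are the figures quoted above (decimal units, so 1 GB = 10^6 KB).

```python
SECONDS_PER_DAY = 86_400

def egress_estimate(requests_per_day, avg_object_kb, usd_per_gb, peak_factor=3.0):
    """Back-of-envelope traffic and egress cost (decimal units, 30-day month)."""
    avg_rps = requests_per_day / SECONDS_PER_DAY
    gb_per_day = requests_per_day * avg_object_kb / 1e6
    return {
        "avg_rps": avg_rps,
        "peak_rps": avg_rps * peak_factor,   # 3x diurnal peak, as assumed in the text
        "tb_per_day": gb_per_day / 1e3,
        "usd_per_month": gb_per_day * 30 * usd_per_gb,
    }

estimate = egress_estimate(1e9, 40, 0.06)
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```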

Cache-key explosion — the failure nobody draws

Principal

Naive version. CloudFront in front of S3, transform on the query string. Walk me to where it breaks.

Staff

The naive instinct is to forward the full query string to the origin and let CloudFront cache per-URL. It's tempting because it Just Works in a demo — every distinct request gets its own cache entry, correctness is trivial. The failure is that the cache key is now an unbounded namespace. CloudFront's default, if you forward all query strings, is to treat ?w=300&h=200 and ?h=200&w=300 as two different objects. Same pixels, two cache entries. Add a tracking param — ?utm_source=email — and now every marketing campaign mints a fresh, uncacheable copy of every image. Add a malicious client appending ?cachebust=<random> and they can drive your hit ratio to zero on demand.

Napkin math — the combinatorial blow-up

Suppose a single original is legitimately requested at 5 widths × 3 formats × 4 quality levels = 60 valid derivatives. Fine. Now let one junk param leak into the key — say a session id with $10^{6}$ distinct values. The cache namespace for that one image becomes

$$ 60 \times 10^{6} = 6 \times 10^{7}\ \text{distinct keys}, $$

of which 60 are real. Effective hit ratio for that object $\approx 60 / (6\times10^{7}) \approx 10^{-6}$. Across the catalogue you've converted a CDN into a Lambda trigger. At 1B req/day and a 0% hit ratio you'd be paying for $10^{9}$ origin transforms/day instead of the $2\times10^{8}$ you'd pay for at an 80% hit ratio: a 5× blow-up of the transform line, the one cost you actually control, caused by one stray query parameter.

Principal

So you allowlist params. Where does that logic run? Not in Lambda, I hope — you can't afford a function call on a hit.

Staff

Correct, and that's the crux. Two mechanisms, layered. First, a CloudFront cache policy that allowlists exactly the four params that affect the bytes — w, h, fmt, q — and ignores everything else for cache-key purposes. That alone kills utm_* and session junk for free; it's pure configuration, no compute. Second, a CloudFront Function on the viewer-request event that normalizes what's left: sort the params into canonical order, clamp w and h to an allowed set of breakpoints (snap 287 px up to 320), lowercase the format, and reject anything outside policy. CloudFront Functions run in ~1 ms at the edge with no cold start, at $0.10 per million — so I can afford to run this on the viewer side of every request, hit or miss.
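A minimal sketch of the normalization logic described above. It is written in Python for illustration only; a real CloudFront Function would be JavaScript. The breakpoint list, the format set, and the quality set are assumptions, not values from this design, and rejection is represented by returning `None`.

```python
from urllib.parse import parse_qsl, urlencode

ALLOWED_PARAMS = ("w", "h", "fmt", "q")           # the four params that affect the bytes
BREAKPOINTS = (160, 320, 640, 960, 1280, 1920)    # assumed breakpoint set
ALLOWED_FORMATS = {"webp", "avif", "jpeg"}        # assumed format policy
ALLOWED_QUALITIES = {50, 65, 80, 90}              # assumed quality levels

def parse_positive_int(text):
    """Return text as a positive int, or None if it is not plain ASCII digits."""
    if text.isascii() and text.isdigit():
        number = int(text)
        return number if number > 0 else None
    return None

def snap_up(value):
    """Return the smallest breakpoint >= value, or None if value is too large."""
    for breakpoint_px in BREAKPOINTS:
        if value <= breakpoint_px:
            return breakpoint_px
    return None

def normalize_query(query):
    """Return the canonical query string, or None if the request violates policy."""
    # Keep only allowlisted params; utm_*, session ids and cache-busters are dropped.
    params = {k: v for k, v in parse_qsl(query, keep_blank_values=True)
              if k in ALLOWED_PARAMS}
    out = {}
    for dim in ("w", "h"):
        if dim in params:
            pixels = parse_positive_int(params[dim])
            snapped = snap_up(pixels) if pixels is not None else None
            if snapped is None:
                return None
            out[dim] = str(snapped)
    if "fmt" in params:
        fmt = params["fmt"].lower()
        if fmt not in ALLOWED_FORMATS:
            return None
        out["fmt"] = fmt
    if "q" in params:
        quality = parse_positive_int(params["q"])
        if quality not in ALLOWED_QUALITIES:
            return None
        out["q"] = str(quality)
    # Sorting gives one canonical ordering, so w=..&h=.. and h=..&w=.. share a key.
    return urlencode(sorted(out.items()))
```

With these assumed breakpoints, `?w=300&h=200` and `?h=200&w=300&utm_source=email` normalize to the same string, which is the property the cache key needs.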

The thundering herd on a cold object

Principal

Product launch. A new hero image goes live and 10,000 edge requests hit it in the same second, all cold. What does your origin see?

Staff

Without mitigation, something close to 10,000 simultaneous misses, each independently deciding "not in cache, go to origin." That's 10,000 concurrent S3 GETs and 10,000 concurrent transform invocations for the same output. The transform is the dangerous part — it's CPU, and 10,000 cold Lambda@Edge invocations for one image is a synchronized spike that does no useful work, because 9,999 of them produce the identical bytes.

Principal

So you want request coalescing. CloudFront doesn't collapse misses across edge locations on its own. What's the AWS answer?

Staff

Origin Shield. It's a designated regional cache layer that sits between all the edge POPs and your origin. Every edge miss for a given object funnels through one Shield region, and Shield collapses concurrent misses for the same key into a single origin fetch — the rest wait on that in-flight request. Ten thousand edge misses become one transform. That's the textbook fix.
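To make "collapse concurrent misses into a single origin fetch" concrete, here is an in-process sketch of the same idea, often called single-flight. It is not how Origin Shield is implemented; the class and function names are hypothetical, and the sleep stands in for a slow transform.

```python
import threading
import time

class _Call:
    """State shared by the leader and the followers waiting on one key."""
    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None

class SingleFlight:
    """Collapse concurrent calls for the same key into one execution."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> _Call currently being computed

    def do(self, key, fn):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.result = fn()            # the only real origin fetch
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()               # wake every follower
        else:
            call.done.wait()                  # follower: wait on the in-flight fetch
        if call.error is not None:
            raise call.error
        return call.result

def demo(concurrency=50):
    """Fire `concurrency` simultaneous requests for one key; count real fetches."""
    flight, fetches, results = SingleFlight(), [], []
    def slow_transform():
        fetches.append(1)
        time.sleep(0.5)                       # stand-in for fetch + resize + encode
        return b"transformed-bytes"
    def worker():
        results.append(flight.do("hero.jpg?fmt=webp&w=320", slow_transform))
    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(fetches), results
```

Requests that arrive while the first fetch is still running share its result; a request arriving after it completes starts a new fetch, which in the real system is where the cache takes over.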

Principal

Then turn it on everywhere and move on?

Staff

No — and this is where people cargo-cult it. Origin Shield is not free, and for a plain S3 origin it can cost more than it saves. Let me show the arithmetic, because it flips the obvious answer.

Napkin math — when Origin Shield is a net loss

At 1B req/day and 80% hit ratio, misses $= 2\times10^{8}/\text{day}$. Shield charges per request routed through it (~$0.0075 per 10k in the US):

$$ 2\times10^{8} \div 10^{4} \times \$0.0075 = \$150/\text{day} \approx \$4{,}500/\text{month}. $$

What does it save? If the origin were just S3, those 200M GETs cost $2\times10^{8} \div 10^{3} \times \$0.0004 = \$80/\text{day}$. So Shield would cost $150/day to save $80/day — a net loss of $70/day. Shield only pays for itself when the per-miss origin work is expensive. Here it is: each miss is a Lambda@Edge transform at roughly $745/\text{day} \div 2\times10^{8} \approx \$3.7\times10^{-6}$ each, plus the herd-collapse benefit. Coalescing 10,000-deep launch spikes into one transform is worth far more than the $150/day, because it removes the correlated CPU spike that would otherwise blow your Lambda concurrency limits.
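The comparison above as a script, so the break-even can be re-checked when prices or hit ratio change. Like the text, it treats the plain-S3 saving as an upper bound (every origin GET absorbed); the variable names are illustrative.

```python
def usd_per_day(requests_per_day, usd_per_unit, unit_size):
    """Daily cost for a fee quoted per `unit_size` requests."""
    return requests_per_day / unit_size * usd_per_unit

REQUESTS_PER_DAY = 1_000_000_000
HIT_RATIO_PCT = 80
misses = REQUESTS_PER_DAY * (100 - HIT_RATIO_PCT) // 100    # 200,000,000

shield_usd = usd_per_day(misses, 0.0075, 10_000)   # Origin Shield request fee
s3_get_usd = usd_per_day(misses, 0.0004, 1_000)    # what a plain S3 origin would cost
net_plain_s3 = s3_get_usd - shield_usd             # negative means Shield loses money

transform_usd_per_miss = 745 / misses              # Lambda@Edge figure from the text
shield_usd_per_request = 0.0075 / 10_000

print(f"Shield ${shield_usd:.0f}/day vs S3 GETs ${s3_get_usd:.0f}/day "
      f"(net ${net_plain_s3:.0f}/day)")
print(f"per-miss transform ${transform_usd_per_miss:.2e} "
      f"vs Shield fee ${shield_usd_per_request:.2e}")
```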

CloudFront Functions vs Lambda@Edge — the 1ms/50ms split

Principal

You've used both already. Make the boundary explicit. Why not one compute primitive for everything?

Staff

Because they're built for different physics. CloudFront Functions are a sandboxed JS runtime that runs on the POP itself: ~1 ms CPU budget, no cold start, no network, no filesystem, ~2 MB memory. Lambda@Edge is real Lambda running in regional edge caches: up to 5 s (viewer) or 30 s (origin) timeout, up to 3008 MB, full network and a real runtime — but with cold starts of 50–500 ms. The rule I apply: per-request, no-I/O string work goes to Functions; anything that needs the image bytes, a library, or a network call goes to Lambda@Edge, and only on the origin-request event so it runs on misses only.

Staff

So the topology is: viewer-request → CloudFront Function (normalize key, validate signed URL claims, hotlink check) on every request. Origin-request → Lambda@Edge (fetch original from S3, transform with sharp/libvips, re-encode) on misses only. The image-decoding library and the CPU-heavy resize physically cannot run in a 1 ms Function — that alone forces the split.

Napkin math — the compute bill, split by layer

Viewer layer, all requests, on CloudFront Functions:

$$ 10^{9}/\text{day} \times \$0.10/10^{6} = \$100/\text{day} = \$3{,}000/\text{month}. $$

Origin transform, misses only, on Lambda@Edge (128 MB, ~500 ms avg):

$$ \text{compute} = 2\times10^{8} \times 0.5\,\text{s} \times \tfrac{128}{1024}\,\text{GB} \times \$0.00005001 = \$625/\text{day}, $$

$$ \text{invocations} = 2\times10^{8}/10^{6} \times \$0.60 = \$120/\text{day}. $$

Total transform $745/day ≈ $22.4k/month. Now the punchline: if I'd done the viewer-layer normalization in Lambda@Edge instead of Functions, I'd add roughly another $22k/month for work a 1 ms Function does for $3k. The layer split is worth ~$19k/month.
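A short script reproducing the two layers of the compute bill. The function names are hypothetical; the prices, memory size, and average duration are the ones quoted above, and a 30-day month is assumed.

```python
def functions_usd_per_day(requests, usd_per_million=0.10):
    """CloudFront Functions: a flat per-invocation fee."""
    return requests / 1e6 * usd_per_million

def lambda_edge_usd_per_day(invocations, avg_seconds, memory_mb,
                            usd_per_gb_second=0.00005001, usd_per_million=0.60):
    """Lambda@Edge: GB-seconds of compute plus a per-request fee."""
    gb_seconds = invocations * avg_seconds * memory_mb / 1024
    return gb_seconds * usd_per_gb_second + invocations / 1e6 * usd_per_million

viewer_usd = functions_usd_per_day(1e9)                    # every request
transform_usd = lambda_edge_usd_per_day(2e8, 0.5, 128)     # misses only

print(f"viewer layer: ${viewer_usd:,.0f}/day, ${viewer_usd * 30:,.0f}/month")
print(f"transform:    ${transform_usd:,.0f}/day, ${transform_usd * 30:,.0f}/month")
```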

Geo routing and multi-region origin failover

Principal

CloudFront is global, but your origin and your transform Lambdas live in regions. A user in Tokyo misses cache — where does the transform happen, and what if that region is down?

Staff

Lambda@Edge origin-request executes in the regional edge cache nearest the POP, so the Tokyo miss runs the transform in or near ap-northeast-1 — close to the user. The originals live in S3, and I replicate the original bucket across three regions — us-east-1, eu-west-1, ap-northeast-1 — with S3 Cross-Region Replication, so the transform's source fetch is regional too. For failover I use CloudFront Origin Groups: each cache behavior has a primary origin and a secondary, and CloudFront fails over on configured status codes (500/502/503/504) automatically. Primary is the in-region S3 bucket; secondary is the next-nearest replica.

Principal

Why not Route 53 latency-based routing for that instead of Origin Groups?

Staff

I use both, for different jobs. Route 53 latency-based routing is great for steering to a regional endpoint when the origin is a dynamic service behind a load balancer. But S3 origins and the failover decision are best handled inside CloudFront with Origin Groups, because CloudFront makes the retry decision per-request on the actual response code, with no DNS-TTL lag — a Route 53 health-check flip can take tens of seconds and is cached by resolvers. Origin Groups fail over on the very request that saw the 5xx. So: Origin Groups for primary/secondary S3 failover; Route 53 latency routing reserved for any dynamic control-plane endpoints (uploads, admin API).
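The per-request failover rule can be sketched as below. This models the behavior described for Origin Groups rather than any CloudFront API: the origin names and the `fetch` callables are hypothetical stand-ins that return a `(status, body)` pair.

```python
FAILOVER_STATUS = frozenset({500, 502, 503, 504})   # codes configured for failover

def fetch_with_failover(key, origins, failover_status=FAILOVER_STATUS):
    """Try origins in order; move to the next only on a configured status code.

    `origins` is a list of (name, fetch) pairs, primary first.
    Returns (origin_name, status, body) from the origin that answered.
    """
    answer = None
    for name, fetch in origins:
        status, body = fetch(key)
        answer = (name, status, body)
        if status not in failover_status:
            return answer        # success, or an error that should not fail over
    return answer                # every origin failed: surface the last response

def primary_down(key):
    return 503, b""

def secondary_ok(key):
    return 200, b"bytes:" + key.encode()

print(fetch_with_failover("v2/product-123.jpg",
                          [("primary", primary_down), ("secondary", secondary_ok)]))
```

Note that a 404 from the primary is returned as-is in this sketch: only the configured codes trigger the second attempt, which is the decision made on the same request that saw the error.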

Cache invalidation at scale — launches and DMCA

Principal

3am. Legal sends a DMCA takedown — a copyrighted image must be gone from every edge in minutes. Also, a tenant just re-uploaded 50,000 product photos for a launch. Walk me through invalidation.

Staff

These are two different problems and conflating them is the classic mistake. For the launch re-upload, I do not invalidate; I use versioned URLs. The path is /img/v2/product-123.jpg; bumping the version to v3 mints a fresh key that's simply never been cached, so it's a guaranteed miss with no purge needed and no race between "new bytes uploaded" and "old bytes still cached." CloudFront's invalidation quotas are brutal: per distribution, 3,000 individual paths in progress at once, and only 15 wildcard paths in progress at once. Trying to invalidate 50,000 objects is slow at best and a quota violation at worst. Versioning sidesteps it entirely.

Principal

Versioning works when you control the change. DMCA is the opposite — the URL must stop serving the same bytes. You can't version your way out of that.

Staff

Right. For takedowns I do three things at once: (1) delete the original from S3 so no future transform can produce it; (2) issue a wildcard CloudFront invalidation for that image's derivatives (/img/*/product-123*); and (3) rely on a short-TTL bucket for legally sensitive content. Anything flagged takedown-eligible is served under a path with a 60 s max-TTL behavior, so even if an invalidation is slow, the exposure window is one minute. To fan out a batch of takedowns without tripping the wildcard invalidation quota, I drive invalidations through Step Functions Express: fan out up to a few hundred CreateInvalidation calls with controlled concurrency and retry, rather than a naive loop that hits the throttle and silently drops paths.
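The "controlled concurrency and retry, never silently drop paths" behavior can be sketched locally as follows. This is a thread-pool stand-in, not Step Functions, and `submit` is a hypothetical callable that would wrap the real CreateInvalidation request; the batch size and concurrency are placeholders to tune against the distribution's quotas.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def run_invalidations(paths, submit, batch_size=15, max_concurrency=1, max_attempts=3):
    """Submit invalidation batches with bounded concurrency and retry.

    Returns the paths whose batch never succeeded, so the caller can alert
    or re-queue them instead of losing them.
    """
    def attempt(batch):
        for _ in range(max_attempts):
            try:
                submit(batch)
                return []                 # batch accepted
            except Exception:
                continue                  # a real job would back off before retrying
        return batch                      # exhausted retries: report, don't drop
    failed = []
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        for leftover in pool.map(attempt, chunked(paths, batch_size)):
            failed.extend(leftover)
    return failed
```

Returning the failed paths is the design point: a takedown that was throttled must show up somewhere a human or a retry queue will see it.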

Security: signed URLs, hotlinking, SSRF, tenant isolation

Principal

Multi-tenant. Tenant A must never serve from tenant B's keys, and nobody should be able to make your transformer fetch an arbitrary URL. Convince me.

Staff

Four controls. First, signed URLs with Trusted Key Groups. Each tenant gets its own key group (up to 5 public keys, so we can rotate with an overlap window — push the new key, sign with it, retire the old after ~30 min). The CloudFront Function on viewer-request validates the signature and the policy (expiry, path scope) before anything else runs. Tenant A's URLs are signed by tenant A's private key; a forged or cross-tenant URL fails validation at the edge in 1 ms.

Second — the one that keeps me up — SSRF. The transformer must never fetch a user-supplied URL. The original's S3 key is derived server-side from the validated request path and the tenant id; the Lambda@Edge transform only ever does GetObject against a known bucket with a key it constructs, never an arbitrary fetch. There is no code path where a request body or query param becomes a URL the transformer dials. That's a hard architectural invariant, not a filter.
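One way to express that invariant in code: the S3 key is built only from the tenant id and a strictly validated path, and anything else is rejected. The key layout (`tenants/<id>/originals/...`), the `/img/` prefix handling, and the character rules are assumptions for illustration.

```python
import re

TENANT_ID = re.compile(r"[a-z0-9-]{1,64}")                  # assumed tenant id shape
SAFE_SEGMENT = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")    # no slashes, colons, etc.

def derive_s3_key(tenant_id, request_path):
    """Map a request path to an S3 key under the tenant's prefix, or None.

    The result is always a key inside one known bucket prefix; nothing in the
    request can turn into a hostname, scheme, or another tenant's prefix.
    """
    if not TENANT_ID.fullmatch(tenant_id):
        return None
    if not request_path.startswith("/img/"):
        return None
    segments = request_path[len("/img/"):].split("/")
    for segment in segments:
        # Rejects empty segments, leading dots, "..", and URL-like input.
        if not SAFE_SEGMENT.fullmatch(segment) or ".." in segment:
            return None
    return f"tenants/{tenant_id}/originals/" + "/".join(segments)

print(derive_s3_key("acme", "/img/v2/product-123.jpg"))
```

The transformer would then call GetObject with this key against the fixed bucket, so a rejected path never reaches the network.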

Third, hotlink protection: an AWS WAF rule on the distribution matching the Referer header against the tenant's allowed domains, plus a WAF rate-based rule per source IP to blunt scraping. Fourth, tenant data isolation: per-tenant key prefixes in S3, IAM and bucket policies scoping the transform role to its prefix, and per-tenant CloudFront behaviors so signing keys and cache policies can't bleed across tenants.

Principal

GDPR erasure — a user invokes right-to-be-forgotten on an avatar. What's the deletion path and how do you prove it's gone?

Staff

Delete the original and all replicas from S3, invalidate the derivatives at CloudFront, and because erasure-eligible content lives under a short-TTL behavior, the cache purge is bounded to ~60 s. One caveat on the replicas: Cross-Region Replication requires versioning and does not replicate version-specific deletes, so the erasure job deletes every object version in each region explicitly rather than relying on propagation. We log the S3 delete events (CloudTrail data events) and the invalidation id as the audit artifact; that's the proof for the data-subject request. This maps cleanly to GDPR Art. 17 (erasure), SOC 2 confidentiality, and ISO 27001 access-control objectives; the per-tenant key-group and IAM-prefix scoping is the NIST AC-3/AC-6 least-privilege story.

Cost, and the pre-compute escape hatch

Principal

Put the whole bill on the table. Then tell me the one change that could halve it.

Napkin math — the monthly bill at 1B/day, 80% hit

Egress: $72,000 (40 TB/day, $0.06/GB). S3 GETs on misses: $2\times10^{8}/\text{day} \times \$0.0004/10^{3} \times 30 = \$2{,}400$. Lambda@Edge transform: $22,400 ($745/day). CloudFront Functions viewer layer: $3,000. Origin Shield: $4,500. Rough total:

$$ 72{,}000 + 22{,}400 + 4{,}500 + 3{,}000 + 2{,}400 \approx \$104{,}000/\text{month}. $$

Egress is 70% of it and effectively floor-priced. The variable, attackable chunk is the $22.4k transform line.
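The bill as data, which makes the shares easy to recompute when a line item moves. The dictionary keys are illustrative labels; the amounts are the monthly figures listed above.

```python
MONTHLY_USD = {
    "egress": 72_000,
    "lambda_edge_transform": 22_400,
    "origin_shield": 4_500,
    "cloudfront_functions": 3_000,
    "s3_gets": 2_400,
}

def bill_summary(items):
    """Return the total and each line item's share of it."""
    total = sum(items.values())
    shares = {name: cost / total for name, cost in items.items()}
    return total, shares

total, shares = bill_summary(MONTHLY_USD)
print(f"total: ${total:,}/month")
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:>24}: {share:6.1%}")
```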

Staff

The transform bill exists because we re-derive variants on demand. If the variant set is small and predictable — and for a product catalogue it usually is — pre-compute instead. On upload, an S3 event triggers a Lambda (or AWS Step Functions for the fan-out) that renders all 8 standard variants and writes them to S3 as static objects. Now every request is a plain cache-fillable S3 GET; the on-the-fly transform Lambda essentially disappears.

Napkin math — pre-compute economics

100k products × 8 variants × 50 KB $= 40\ \text{GB}$ stored. S3 Standard at $0.023/GB:

$$ 40\ \text{GB} \times \$0.023 \approx \$0.92/\text{month}\ \text{storage}. $$

Render cost is a one-time $100\text{k} \times 8 = 8\times10^{5}$ Lambda runs per full catalogue refresh — negligible amortized. Net: trade $22.4k/month of on-the-fly transform for under $1/month of storage plus a one-time render. The on-the-fly Lambda@Edge path stays only as the fallback for the long-tail of non-standard requests.
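A sketch of the pre-compute side: enumerate the variant keys an upload handler would render, and price the storage. The text does not define the 8 standard variants, so the 4 widths × 2 formats and the `variants/<id>/w<width>.<fmt>` key layout here are assumptions.

```python
STANDARD_WIDTHS = (160, 320, 640, 1280)   # assumed
STANDARD_FORMATS = ("webp", "jpeg")       # assumed; 4 widths x 2 formats = 8 variants

def variant_keys(product_id):
    """S3 keys an upload-triggered job would render for one original."""
    return [f"variants/{product_id}/w{width}.{fmt}"
            for width in STANDARD_WIDTHS
            for fmt in STANDARD_FORMATS]

def storage_estimate(products, variants_per_product, avg_kb, usd_per_gb_month=0.023):
    """Stored size in GB (decimal units) and the monthly S3 Standard cost."""
    gigabytes = products * variants_per_product * avg_kb / 1e6
    return gigabytes, gigabytes * usd_per_gb_month

keys = variant_keys("product-123")
gigabytes, monthly_usd = storage_estimate(100_000, len(keys), 50)
print(f"{len(keys)} variants per product, {gigabytes:.0f} GB, ${monthly_usd:.2f}/month")
```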

Failure and recovery

Principal

3am, your transform Lambda's region (us-east-1) is degraded and error rates spike. What does a user in New York experience?

Staff

For the 80%+ of requests that are cache hits: nothing — they're served from the POP, the origin is irrelevant. For misses where the transform fails: CloudFront Origin Groups fail the source fetch over to the secondary region's S3 replica, and the transform retries there. If transforms are broadly failing, the pre-computed static variants save us — standard sizes are already materialized in S3 and served as plain GETs with no Lambda in the path at all. The only requests that actually fail are cold misses for non-standard derivatives during the incident window, which is a small slice. We also keep stale-while-revalidate behavior so an expiring-but-present cached object is served while the refresh happens in the background — a slow origin degrades freshness, not availability.

Did we ever leave AWS?

Principal

Final question. Anything in here that isn't AWS-native — or did you stay on the platform the whole way?

Staff

We never left. Delivery is CloudFront; storage and originals are S3 with Cross-Region Replication; edge compute is CloudFront Functions (viewer) and Lambda@Edge (origin); coalescing is Origin Shield; failover is Origin Groups; geo and dynamic-endpoint routing is Route 53; auth is CloudFront Trusted Key Groups; the WAF is AWS WAF; pre-compute orchestration is Step Functions with S3 events; audit is CloudTrail. The one component people assume forces you off-platform, the image transform itself, runs inside Lambda@Edge using an open-source library (sharp/libvips) bundled into the function. That's still "on AWS": it's a library in a managed runtime, not a service we operate. The only requirement that would push us off Lambda would be a transform its runtime can't host, say a GPU-bound ML upscaler exceeding Lambda's resources. At that point I'd reach for an ECS task on GPU-backed EC2 instances (Fargate does not offer GPUs) on the origin-request path, still on AWS. There is no hard requirement in this design that leaves the platform.
