Architecture review

Geospatial proximity search at scale

Geohash boundary bugs, H3 hexagons, and Redis GEORADIUS are the easy part — the hard part is 2 million driver-location writes per second without hot shards, stale ghosts, or cross-region lag that mismatch surge pricing.

staff12 min readgeosearchstorage

Proximity search looks like a one-liner: GEOADD a point, GEOSEARCH a radius, done. That one-liner has a boundary bug that silently drops half your drivers in exactly the dense downtowns where every meter matters, a sharding choice that turns rush-hour NYC into a single saturated node, and a missing TTL that leaves dead phones haunting the index for hours. The interesting version writes 2 million driver locations per second and still answers "who is within 5 km" in under 50 ms.

The problem, and the numbers we design to

Principal

Ride-hailing dispatch. Before any boxes — what are we building, and at what scale?

Staff

Two flows against one live index of driver locations. Write path: 10 million active drivers, each phone sending a location fix every 5 seconds ⇒ 2,000,000 writes/second, sustained, all day. Read path: a rider opens the app and asks "all available drivers within 5 km," and we owe them an answer in < 50 ms p99. Across 50 cities, and — this is the part that bites — demand is geographically clustered. Downtown SF and downtown Manhattan at 6 pm are a tiny fraction of the map carrying a huge fraction of both writes and reads.

Principal

"Under 50 ms" is one SLO. State the other one — how fresh is the data you return that fast?

Staff

Right — latency and freshness are different promises and people conflate them constantly. We commit to two SLOs. Query latency p99 ≤ 50 ms — how fast we answer. And indexed-position freshness p99 ≤ 7 s — how old the position we hand back can be. The 7 s is the 5 s ping interval plus a couple hundred ms of buffering plus headroom; under failover it can stretch, and we want that visible rather than silent. We emit now − fix_timestamp as a CloudWatch metric at index-write time, and we carry the fix timestamp in the GEOSEARCH member value so the client can render "this driver was here ~4 s ago" and down-rank stale ones. A fast answer about a stale world is still a wrong answer; both numbers are the contract.

Napkin math — the write firehose

The write rate is fixed by fleet size and ping interval:

$$ R_{w} = \frac{N_{drivers}}{T_{ping}} = \frac{10^{7}}{5\ \text{s}} = 2 \times 10^{6}\ \text{writes/second} $$

Each location payload — driver id, lon, lat, status, timestamp — is about 200 B on the wire, so the raw ingest bandwidth is

$$ B_{w} = 2 \times 10^{6} \times 200\ \text{B} = 4 \times 10^{8}\ \text{B/s} = 400\ \text{MB/s} $$

That single number is what forbids the naive design. No stateful store is going to take 2M direct connections from churning mobile devices; the topology has to put a shock absorber in front of it.

The naive index, and the boundary bug that drops drivers

Principal

Start simple. One Redis, GEOADD every driver, GEOSEARCH on every rider query. What breaks?

Staff

Two things, and the subtle one has shipped to production at basically every geo company at least once. First, the throughput wall: one Redis node does ~100k–200k ops/s. We need 2M writes/s before we even count reads, so a single node is off by 10×–20×. That one's obvious — you shard.

The dangerous one is the geohash boundary bug. A geohash encodes lon/lat into a Base32 string where shared prefixes mean spatial proximity — usually. Two drivers 10 m apart but on opposite sides of a cell boundary get completely different prefixes: 9q8yyk vs 9q8yym, or worse, they diverge in the first character. If you implement "nearby" as a prefix-range scan over one cell, you query the rider's cell and miss the driver standing across the street. It passes every test in a sparse test city and then quietly tanks ETA accuracy in dense downtowns, where the missed driver is the one who should have gotten the trip.

Principal

So I have to remember to query nine cells every time and pray I sized the cell against the radius. That's a footgun. Is there a way the store does it for me?

Staff

Yes — and that's exactly why I reach for Redis's native GEO commands rather than rolling my own prefix index. GEOSEARCH ... BYRADIUS 5 km (the modern replacement for GEORADIUS) uses geohash internally but computes the covering neighbor cells itself and does the post-filter for you. The boundary bug literally cannot reach my code, because I never write the cell-expansion logic. I give it a center and a radius; it gives me back members sorted by distance.

Absorbing 2M writes/s: Kinesis + Lambda fan-out

Principal

You said no store takes 2M connections. So how do 10 million phones get their fixes into Redis?

Staff

A three-tier write path. Drivers hold a persistent TLS connection to a connection-server fleet on ECS Fargate behind a Network Load Balancer. Those servers don't process locations — they aggregate. Each server buffers 50–100 fixes and ships them to Kinesis Data Streams in a single PutRecords call. The stream is the shock absorber — it decouples a spiky, churning mobile fleet from a stateful store, gives me per-shard ordering, and gives me a replayable buffer if a consumer falls over. A fleet of Lambda consumers drains the stream in micro-batches and does the actual GEOADD into Redis. Redis only ever talks to a few hundred warm Lambda connections, never 10 million phones.

Napkin math — shard count

A Kinesis shard ingests 1 MB/s, or 1,000 records/s, whichever binds first. At 200 B/payload the byte limit binds: one shard absorbs

$$ \frac{1\ \text{MB/s}}{200\ \text{B}} = 5{,}000\ \text{payloads/second/shard} $$

So to carry the firehose:

$$ N_{shards} = \frac{2 \times 10^{6}\ \text{writes/s}}{5 \times 10^{3}\ \text{writes/s}} = 400\ \text{shards} $$

400 shards = 400 MB/s, matching the bandwidth number from beat 01. We provision ~480 (a 20% headroom buffer) so a city's evening peak doesn't trip ProvisionedThroughputExceeded.

Sharding the index: by spatial cell, not by city

Principal

You need dozens of Redis shards to carry that op rate. The obvious key is the city — one shard per metro. Why not?

Staff

Because that's the exact mistake Lyft shipped and then had to undo. Shard by city/region and every NYC write and every NYC rider query lands on the NYC shard. At 6 pm that one node has to absorb a metro's entire write firehose plus its entire query load, and it tops out around ~100k ops/s while the Boise shard sits idle. The hot shard becomes the dispatch bottleneck for the city that matters most.

Napkin math — why one city overruns one shard

Say NYC carries 800k active drivers at peak. Writes alone:

$$ R_{w,\text{NYC}} = \frac{8 \times 10^{5}}{5\ \text{s}} = 160{,}000\ \text{writes/second} $$

That already exceeds a single node's ~100k–150k op ceiling — before a single rider query. Geographic sharding puts the busiest place on the planet onto one node by construction.

Staff

The fix is to shard by a fixed-area spatial cell, not a named region. I use a Google S2 level-5 cell (~1 km²) as the shard key and hash-tag the Redis key by that cell: {s2:89c25a}:drivers. The {...} forces all keys for one cell onto one CRC16 slot — so a GEOSEARCH within a cell stays single-slot and atomic — but different cells spread across the cluster. Now downtown Manhattan is dozens of ~1 km² cells scattered over dozens of shards. The load that used to crush one node is fanned across the cluster.

Why S2 here specifically: it produces efficient hierarchical coverings (fewer cells to cover a query region than a rectangular grid), which is what Lyft moved to for exactly this. For the query geometry I still lean on Redis's internal geohash neighbor expansion; S2 is the sharding key, geohash is the index inside each shard. Two different jobs.

The query path and GEOSEARCH mechanics

Principal

Now the rider. Walk me from "open app" to "five cars on the map" inside 50 ms.

Staff

Rider request hits an ALB, which routes to an ECS (Fargate) query service — deliberately a separate compute tier from the Lambda write consumers, so a query storm and the write firehose never fight for the same CPU. Now the part the first draft got wrong: a 5 km circle does not live in one shard. At S2 level 5 (~1 km²) a 5 km radius covers on the order of ~20–80 cells, and each cell hash-tags to its own slot, potentially its own shard. GEOSEARCH is a single-key command — it cannot cross slots. So the query service does a scatter-gather: compute the S2 cell covering of the query circle, fire one GEOSEARCH per cell in parallel (pipelined per shard node), then merge and re-sort the candidate sets by true distance in-process.

// per covering cell, in parallel:
GEOSEARCH {s2:<cell>}:drivers
  FROMLONLAT <rider_lon> <rider_lat>
  BYRADIUS 5 km ASC COUNT 50 WITHDIST WITHCOORD

After the merge, the service applies the availability filter against the driver-status hash and returns the top-N. The latency budget is parallel, not sequential: ~20 ms for the fanned-out GEOSEARCH (bounded by the slowest shard, not the sum) + ~5 ms merge/re-sort + ~10 ms availability filter + ~15 ms overhead ≈ 50 ms p99. That only holds because it's a fan-out; the "two sub-millisecond hops" framing was only ever true for a single-cell query.

Principal

You merge across shards, then do an availability hop. Is that availability hop another scatter?

Staff

It would be if status keys landed on random slots — and in the first draft they did. We fix that by co-locating status with the geo set: status for a cell lives in a Redis hash keyed {s2:<cell>}:status, driver IDs as fields. The hash tag is the same cell, so the status hash sits on the same shard as that cell's geo set. For each covering cell, the per-shard pipeline carries both the GEOSEARCH and the HMGET — one round trip per shard, not two. The dispatch service writes status with the driver's current S2 cell (it's in the location payload), so it always knows which hash to write.

Principal

Why keep status in a separate key at all? Stuff it in the geo member and save the HMGET.

Staff

Because status changes on a different cadence than position. A driver's location updates every 5 s; their availability flips the instant they accept a trip. If status were baked into the geo member, every status flip would be a GEOADD rewrite (re-encoding the point) and every location update would have to carry the latest status or clobber it. Splitting them — but co-locating them on the same shard — lets each be written by whoever owns that fact (location pipeline writes geometry, dispatch service writes status) with no coordination, while still costing only one pipelined round trip per shard.

Stale ghosts and TTL eviction

Principal

A driver's phone dies mid-shift. Tunnel, dead battery, app crash. What happens to them in your index?

Staff

Without protection, they become a ghost — frozen at their last position, still showing up in "nearby drivers," still getting dispatched to riders who then watch a car that never moves. Pre-TTL, teams have left ghosts in the index for hours, and every ghost is a botched ETA and a failed dispatch. The fix every real implementation lands on: every location record carries a 30-second TTL, refreshed on each heartbeat update. Miss a few pings and the driver simply ages out of the index. No reaper job, no explicit "I'm going offline" message we might never receive.

Napkin math — the TTL freshness/load knob

TTL is a direct trade. With $T_{ping}=5$ s and $TTL=30$ s, a driver survives up to $\lfloor 30/5 \rfloor - 1 = 5$ consecutive dropped pings before expiring — tolerant of a brief tunnel, intolerant of a dead phone. Shorten the TTL toward $T_{ping}$ and ghosts clear faster but a single dropped packet evicts a healthy driver; lengthen it and ghosts linger. 30 s is the standard sweet spot: ~6 pings of grace.

Persistence, cross-region, and DynamoDB's role

Principal

Redis is in-memory and TTL'd — your location history evaporates in 30 s. Where does the durable record live, and what happens across regions?

Staff

And here's where the first draft would have bankrupted us. The naive move is to have the index Lambda also write every raw fix to DynamoDB — geohash partition key, timestamp range key. Run the numbers: 2M writes/s of 200 B items is 2M WCU/s, and on-demand WRU pricing puts that at ~$3.2M/month single-region, before Global Tables roughly doubles it. That is not "tens of thousands." A firehose of immutable point writes that we mostly never read back individually does not belong in a key-value store priced per write.

So we split durability by access pattern. Raw fixes go to S3 via Kinesis Data Firehose — a third consumer on the same stream, buffering into Parquet with Snappy compression. At 400 MB/s that's a few hundred dollars a month, and we query the cold trail with Athena ("where was this driver at 6:42 pm"). DynamoDB keeps only what's actually keyed and read by id: (a) driver state — one row per driver, partition key driver_id, holding status, current S2 cell, current trip; and (b) trip summaries — one record per completed trip for billing disputes. Both are tiny tables. DynamoDB is still the log of was, but only for the durable, by-id facts; the raw firehose lands in S3 where storage is cheap.

Principal

A driver crosses from one region's coverage into another, or a rider in a border city queries. How does cross-region work — and what happens if a whole region falls over?

Staff

Redis stays regional — each region runs its own cluster, and a driver is hashed to their current region's cluster. We don't try to replicate a 2M-writes/s in-memory geo index across regions; the bandwidth and the consistency cost aren't worth it. For the durable layer, DynamoDB Global Tables replicate driver state and trip summaries cross-region. A genuine cross-region lookup — rare, near a coverage seam — falls back to DynamoDB and accepts up to ~5 s of staleness from replication lag, which for a border-case query is acceptable.

On total regional loss: each city's live index is regional-only by design. We do not pay to run a hot cross-region replica of an in-memory index whose entire contents refresh every 5 s. Instead we keep a pre-provisioned warm standby cluster in a second region with Route 53 failover DNS. On failover, drivers re-ping within one 5 s cycle and repopulate the standby from scratch — so RTO ≈ 30–60 s (DNS flip plus one ping cycle) and RPO ≈ 0 for live position (the next ping is the recovery). That's an explicitly accepted tradeoff: the index is cheap to rebuild because it's ephemeral, so we let the fleet rebuild it rather than pay to mirror it.

The surge-pricing tap

Principal

Surge pricing needs to know driver and demand density per area. Tempting to just count drivers in each cell with a GEOSEARCH on the live index. Do you?

Staff

No — that's the dispatch index, and surge would be adding heavy aggregate scans on top of the latency-critical rider queries. Instead surge is a side-channel off the same Kinesis stream. Kinesis supports multiple independent consumers, so an Analytics Lambda reads the firehose as a second consumer (standard shared polling — it tolerates lag), aggregates driver supply and request demand into per-cell density, and writes DynamoDB aggregate counters keyed by S2 cell. The pricing service reads those counters. The live index never feels it.

Principal

Stop. You just told me TTL evicts healthy drivers during a partition. The surge counter draws from the same stream. So when an ISP drops heartbeats, the index goes dark and the supply counter craters at the same instant — and your "one source of truth" makes both wrong together. What stops surge from spiking 5× on a network blip?

Staff

That's the real failure mode, and it needs a circuit breaker, not just clean separation. The fix: the Analytics Lambda computes a per-cell driver-count delta rate. A normal cell loses and gains drivers smoothly; a partition makes a whole cell's count fall off a cliff. If the count drops more than a threshold (say >40% in 10 s), the Lambda sets supply_suspect=true on that cell's state in DynamoDB. The pricing service reads that flag and freezes the multiplier at last-known-good for suspect cells until the count stabilizes. The principle: mass disappearance is unknown supply, not zero supply. A demand cliff is gradual; a supply cliff that sharp is almost always infrastructure, not the market — so we refuse to price on it.

Failure & recovery at 3 am

Principal

3 am. A Redis shard primary dies. What does a rider see, and what do you do?

Staff

ElastiCache Cluster — with Multi-AZ auto-failover enabled — fails the primary over to its replica in another AZ in tens of seconds. During that window, queries for cells on that one shard degrade; cells on the other ~dozens of shards are unaffected, which is the whole point of cell-sharding the blast radius. Two cushions on reads: first, the index is self-healing — every live driver re-pings within 5 s, so the promoted replica refills with current positions almost immediately even if it was slightly behind. Second, the query service can fall back to a stale-but-present read against the driver-state table in DynamoDB for the affected cells rather than return an empty map. An empty "no drivers nearby" is the worst possible answer; a few-seconds-stale one is fine.

Principal

That covers reads during failover. What about the write path? The GEOADD into a shard mid-failover — does it stall the whole Kinesis iterator while it retries?

Staff

It must not, and the first draft would have. The index Lambda writes two independent sinks per batch — Redis and (for driver state) DynamoDB — so we report failures independently with ReportBatchItemFailures: a record that succeeded to one sink isn't re-driven just because the other failed. For Redis specifically, a GEOADD that fails during the 30–60 s failover goes to an SQS retry queue rather than blocking the shard iterator. The queue's TTL is short on purpose — a few tens of seconds — because the driver's next ping (5 s away) supersedes it anyway; the retry just narrows the gap until then. The iterator keeps draining the firehose; one shard's failover never backs up the whole stream.

Principal

A Lambda consumer crashes mid-batch and the Kinesis shard iterator resets. Do you double-write or lose fixes?

Staff

Kinesis is at-least-once, so on retry we re-process records — and that's fine because the index sink is idempotent by construction. GEOADD for a given driver id is a last-write-wins upsert: re-applying the same fix is a no-op, and a slightly-out-of-order re-apply self-corrects on the next ping 5 s later. Driver-state writes key on driver_id and are last-write-wins on the latest fix, so a replay overwrites the identical row. (The surge counters are the one non-idempotent sink — handled by the per-shard sequence dedup in beat 08.) We tune the event-source mapping with BisectBatchOnFunctionError and a DLQ so one poison record can't wedge a shard, and alarm on IteratorAge > 30 s — rising iterator age is the early warning that the consumer fleet is falling behind the firehose. DLQ'd location records are safe to drop (the next 5 s ping supersedes them); only the history/firehose path would warrant a redrive, and that path is idempotent.

Security & multi-tenancy — location is PII

Principal

Real-time location of identifiable people is about as sensitive as data gets. How do you not end up in a headline?

Staff

Treat driver location as PII end to end. In transit: TLS from the app to the edge (NLB on the write path, ALB on the read path), and the internal hops (Kinesis, ElastiCache, DynamoDB, S3) all run encrypted — ElastiCache in-transit encryption on, the rest encrypted by default. At rest: KMS-backed encryption on Kinesis, DynamoDB, ElastiCache, and S3, with customer-managed keys so we control rotation and revocation. That covers SOC 2 CC6 (logical access) and NIST SP 800-53 SC-13/SC-28 (crypto in transit and at rest).

Principal

That's the table stakes. Here's what worries me more: your GEOSEARCH takes a center and a radius. Nothing ties that center to who's asking. One valid rider account can sweep the whole map and track every driver in the city. How is that not a tracking API with extra steps?

Staff

That's the most important security finding in the design, and it's a GDPR Article 25 data-protection-by-design issue, not just an authz checkbox. Three layers. Bind the query to the asker: the ALB runs a Cognito (JWT) authorizer, and the ECS query service enforces that the query center is within tolerance (±500 m) of the authenticated rider's own device-reported GPS. You can ask "who's near me," not "who's near an arbitrary point." Coarsen the discovery phase: before a trip is matched we snap returned positions to a ~100 m grid and suppress WITHDIST, so repeated queries from shifting centers can't trilaterate a driver's exact path. Precise position and distance are revealed only after trip acceptance, and only for the matched driver. Rate-limit per principal: a WAF rate rule caps a rider to ~1 query / 5 s — enough for a live map, far too slow to enumerate a city. Combined, a single account can no longer sweep or trilaterate.

Principal

And after the fact — when someone does abuse it, can you even tell? And can you actually honor a "delete my data" request?

Staff

Both were gaps. Read audit trail: the write path was logged via Kinesis but the read path emitted nothing. Now the ECS query service writes a structured per-query access record — principal_id, query-center H3 cell (not raw coords), radius, result count, timestamp; never the result set itself — to CloudWatch Logs → Firehose → S3 with Object Lock (compliance mode), plus CloudTrail data events on the DynamoDB tables. That makes "did account X sweep the map last Tuesday" an answerable query. Erasure: with raw fixes in S3 (lifecycle-expired) and durable PII confined to the driver_id-keyed tables, erasure is a bounded delete by key, not a table scan — and a GSI on driver_id on trip summaries keeps that bounded too. The pattern we actually prefer for GDPR Article 17 is crypto-shredding: a per-driver data key in KMS; deleting the key renders that driver's history instantly unreadable across every Global Table replica without chasing rows region by region.

Retention, per store, with a basis: S3 raw history = 90 days (lifecycle rule; operational/dispute window), DynamoDB trip summaries = 7 years (billing/tax), DynamoDB driver state = 30 days post-account-deletion, access logs = 1 year (security investigation), Kinesis = 7 days (replay buffer), CloudWatch/Lambda logs = 30 days. GDPR Article 5(1)(e) storage limitation is why none of these is "forever"; Article 5(1)(c) minimization is why the payload is 200 B, not a kitchen sink.

Cost

Principal

Napkin the monthly bill. Where does the money actually go?

Napkin math — monthly run cost (order-of-magnitude)

NLB + connection fleet: NLB bills per LCU; the Fargate connection servers are stateless aggregators. Together ≈ $2,000/month.

Kinesis: ~480 shards at ~$0.015/shard-hour is $5{,}256/month, plus enhanced fan-out for the index consumer (~$5{,}256/month), plus PUT payload units. Batched at the connection tier (50–100 fixes per PutRecords) the PUT units fall ~100× to ~$7{,}258/month — versus ~$72.6M/month unbatched. Total Kinesis ≈ $12,500/month.

Lambda: consumed by batch, not by record. At BatchSize=5000 the firehose is $2\times10^{6}/5{,}000 = 400$ invocations/s (the default of 100 would be 20,000/s and ~$10k/month in requests alone). Two functions, request + GB-seconds ≈ $430/month.

ElastiCache: sized for ops, not memory (the index is a few GB). Write 2M GEOADD/s + read ~2M GEOSEARCH/s ≈ 4M ops/s; at ~100k geo ops/s per r6g.xlarge that's ~40 primary shards ⇒ 80 nodes with replicas: $0.42/\text{hr} \times 80 \times 730 \approx$ $24,528/month — the dominant line.

DynamoDB: only driver state (one row/driver) + trip summaries now, not the raw firehose. ≈ $1,500/month.

S3 + Firehose: raw history in Parquet/Snappy at 400 MB/s ≈ $800/month.

ALB (rider read path):$500/month.

Total ≈ $40,000–$50,000/month at steady state across 10M drivers and 50 cities.

Staff

The headline is what the naive design would have cost: API Gateway at 2M req/s (~$4.7M/month) + unbatched Kinesis PUT units (~$72M/month) + DynamoDB for every raw ping (~$3.2M/month, doubled by Global Tables) — north of $80M/month. Three architectural moves collapse that by three orders of magnitude: the NLB + connection fleet batching tier kills the API Gateway and PUT-unit bills, and raw fixes to S3 instead of DynamoDB kills the write-bound store bill. After that, ElastiCache (ops-bound) is the dominant line and everything else is rounding error. The remaining levers are the same two we'd always pull: edge batching and right-sizing TTL/retention so we're not paying to store ghosts or ancient trails.

Did we ever leave AWS?

Principal

Final question. Anywhere in this design did you have to step outside AWS — self-managed Kafka, a Spanner-style global store, a bespoke spatial engine?

Staff

No. The whole thing is AWS-native. Ingestion is NLB + a Fargate connection fleet → Kinesis Data Streams; fan-out is Lambda; the live index is ElastiCache for Redis with its built-in GEO commands; raw history is Firehose → S3 (Parquet) + Athena; durable by-id state is DynamoDB + Global Tables; the read path is ALB + ECS/Fargate; surge is a second Kinesis consumer into DynamoDB counters. If we ever needed managed geofencing or map-tile rendering for the rider UI, Amazon Location Service (behind CloudFront) covers it — still AWS.

The two places someone might reach off-platform are Kafka and a global strongly-consistent store. We didn't need Kafka — a single firehose with stateless fan-out is exactly Kinesis's shape. And we deliberately didn't want a globally consistent live index: regional Redis with DynamoDB Global Tables for the durable layer is the right answer when the hot read is local and only the history must travel. The one genuinely non-AWS dependency is the S2 cell library — but that's an open-source geometry library we link into our own code to compute shard keys, not a service we run. It doesn't take us off AWS any more than using a JSON parser does.

↓ podcast script (.txt)