Proximity search looks like a one-liner: GEOADD a point, GEOSEARCH a radius, done. That one-liner has a boundary bug that silently drops half your drivers in exactly the dense downtowns where every meter matters, a sharding choice that turns rush-hour NYC into a single saturated node, and a missing TTL that leaves dead phones haunting the index for hours. The interesting version writes 2 million driver locations per second and still answers "who is within 5 km" in under 50 ms.
The problem, and the numbers we design to
Ride-hailing dispatch. Before any boxes — what are we building, and at what scale?
Two flows against one live index of driver locations. Write path: 10 million active drivers, each phone sending a location fix every 5 seconds ⇒ 2,000,000 writes/second, sustained, all day. Read path: a rider opens the app and asks "all available drivers within 5 km," and we owe them an answer in < 50 ms p99. Across 50 cities, and — this is the part that bites — demand is geographically clustered. Downtown SF and downtown Manhattan at 6 pm are a tiny fraction of the map carrying a huge fraction of both writes and reads.
"Under 50 ms" is one SLO. State the other one — how fresh is the data you return that fast?
Right — latency and freshness are different promises and people conflate them constantly. We commit to two SLOs. Query latency p99 ≤ 50 ms — how fast we answer. And indexed-position freshness p99 ≤ 7 s — how old the position we hand back can be. The 7 s is the 5 s ping interval plus a couple hundred ms of buffering plus headroom; under failover it can stretch, and we want that visible rather than silent. We emit now − fix_timestamp as a CloudWatch metric at index-write time, and we carry the fix timestamp in the GEOSEARCH member value so the client can render "this driver was here ~4 s ago" and down-rank stale ones. A fast answer about a stale world is still a wrong answer; both numbers are the contract.
The write rate is fixed by fleet size and ping interval:
$$ R_{w} = \frac{N_{drivers}}{T_{ping}} = \frac{10^{7}}{5\ \text{s}} = 2 \times 10^{6}\ \text{writes/second} $$
Each location payload — driver id, lon, lat, status, timestamp — is about 200 B on the wire, so the raw ingest bandwidth is
$$ B_{w} = 2 \times 10^{6} \times 200\ \text{B} = 4 \times 10^{8}\ \text{B/s} = 400\ \text{MB/s} $$
That single number is what forbids the naive design. No stateful store is going to take 2M direct connections from churning mobile devices; the topology has to put a shock absorber in front of it.
The naive index, and the boundary bug that drops drivers
Start simple. One Redis, GEOADD every driver, GEOSEARCH on every rider query. What breaks?
Two things, and the subtle one has shipped to production at basically every geo company at least once. First, the throughput wall: one Redis node does ~100k–200k ops/s. We need 2M writes/s before we even count reads, so a single node is off by 10×–20×. That one's obvious — you shard.
The dangerous one is the geohash boundary bug. A geohash encodes lon/lat into a Base32 string where shared prefixes mean spatial proximity — usually. Two drivers 10 m apart but on opposite sides of a cell boundary get completely different prefixes: 9q8yyk vs 9q8yym, or worse, they diverge in the first character. If you implement "nearby" as a prefix-range scan over one cell, you query the rider's cell and miss the driver standing across the street. It passes every test in a sparse test city and then quietly tanks ETA accuracy in dense downtowns, where the missed driver is the one who should have gotten the trip.
So I have to remember to query nine cells every time and pray I sized the cell against the radius. That's a footgun. Is there a way the store does it for me?
Yes — and that's exactly why I reach for Redis's native GEO commands rather than rolling my own prefix index. GEOSEARCH ... BYRADIUS 5 km (the modern replacement for GEORADIUS) uses geohash internally but computes the covering neighbor cells itself and does the post-filter for you. The boundary bug literally cannot reach my code, because I never write the cell-expansion logic. I give it a center and a radius; it gives me back members sorted by distance.
Absorbing 2M writes/s: Kinesis + Lambda fan-out
You said no store takes 2M connections. So how do 10 million phones get their fixes into Redis?
A three-tier write path. Drivers hold a persistent TLS connection to a connection-server fleet on ECS Fargate behind a Network Load Balancer. Those servers don't process locations — they aggregate. Each server buffers 50–100 fixes and ships them to Kinesis Data Streams in a single PutRecords call. The stream is the shock absorber — it decouples a spiky, churning mobile fleet from a stateful store, gives me per-shard ordering, and gives me a replayable buffer if a consumer falls over. A fleet of Lambda consumers drains the stream in micro-batches and does the actual GEOADD into Redis. Redis only ever talks to a few hundred warm Lambda connections, never 10 million phones.
A Kinesis shard ingests 1 MB/s, or 1,000 records/s, whichever binds first. At 200 B/payload the byte limit binds: one shard absorbs
$$ \frac{1\ \text{MB/s}}{200\ \text{B}} = 5{,}000\ \text{payloads/second/shard} $$
So to carry the firehose:
$$ N_{shards} = \frac{2 \times 10^{6}\ \text{writes/s}}{5 \times 10^{3}\ \text{writes/s}} = 400\ \text{shards} $$
400 shards = 400 MB/s, matching the bandwidth number from beat 01. We provision ~480 (a 20% headroom buffer) so a city's evening peak doesn't trip ProvisionedThroughputExceeded.
Sharding the index: by spatial cell, not by city
You need dozens of Redis shards to carry that op rate. The obvious key is the city — one shard per metro. Why not?
Because that's the exact mistake Lyft shipped and then had to undo. Shard by city/region and every NYC write and every NYC rider query lands on the NYC shard. At 6 pm that one node has to absorb a metro's entire write firehose plus its entire query load, and it tops out around ~100k ops/s while the Boise shard sits idle. The hot shard becomes the dispatch bottleneck for the city that matters most.
Say NYC carries 800k active drivers at peak. Writes alone:
$$ R_{w,\text{NYC}} = \frac{8 \times 10^{5}}{5\ \text{s}} = 160{,}000\ \text{writes/second} $$
That already exceeds a single node's ~100k–150k op ceiling — before a single rider query. Geographic sharding puts the busiest place on the planet onto one node by construction.
The fix is to shard by a fixed-area spatial cell, not a named region. I use a Google S2 level-5 cell (~1 km²) as the shard key and hash-tag the Redis key by that cell: {s2:89c25a}:drivers. The {...} forces all keys for one cell onto one CRC16 slot — so a GEOSEARCH within a cell stays single-slot and atomic — but different cells spread across the cluster. Now downtown Manhattan is dozens of ~1 km² cells scattered over dozens of shards. The load that used to crush one node is fanned across the cluster.
Why S2 here specifically: it produces efficient hierarchical coverings (fewer cells to cover a query region than a rectangular grid), which is what Lyft moved to for exactly this. For the query geometry I still lean on Redis's internal geohash neighbor expansion; S2 is the sharding key, geohash is the index inside each shard. Two different jobs.
The query path and GEOSEARCH mechanics
Now the rider. Walk me from "open app" to "five cars on the map" inside 50 ms.
Rider request hits an ALB, which routes to an ECS (Fargate) query service — deliberately a separate compute tier from the Lambda write consumers, so a query storm and the write firehose never fight for the same CPU. Now the part the first draft got wrong: a 5 km circle does not live in one shard. At S2 level 5 (~1 km²) a 5 km radius covers on the order of ~20–80 cells, and each cell hash-tags to its own slot, potentially its own shard. GEOSEARCH is a single-key command — it cannot cross slots. So the query service does a scatter-gather: compute the S2 cell covering of the query circle, fire one GEOSEARCH per cell in parallel (pipelined per shard node), then merge and re-sort the candidate sets by true distance in-process.
// per covering cell, in parallel:
GEOSEARCH {s2:<cell>}:drivers
FROMLONLAT <rider_lon> <rider_lat>
BYRADIUS 5 km ASC COUNT 50 WITHDIST WITHCOORD
After the merge, the service applies the availability filter against the driver-status hash and returns the top-N. The latency budget is parallel, not sequential: ~20 ms for the fanned-out GEOSEARCH (bounded by the slowest shard, not the sum) + ~5 ms merge/re-sort + ~10 ms availability filter + ~15 ms overhead ≈ 50 ms p99. That only holds because it's a fan-out; the "two sub-millisecond hops" framing was only ever true for a single-cell query.
You merge across shards, then do an availability hop. Is that availability hop another scatter?
It would be if status keys landed on random slots — and in the first draft they did. We fix that by co-locating status with the geo set: status for a cell lives in a Redis hash keyed {s2:<cell>}:status, driver IDs as fields. The hash tag is the same cell, so the status hash sits on the same shard as that cell's geo set. For each covering cell, the per-shard pipeline carries both the GEOSEARCH and the HMGET — one round trip per shard, not two. The dispatch service writes status with the driver's current S2 cell (it's in the location payload), so it always knows which hash to write.
Why keep status in a separate key at all? Stuff it in the geo member and save the HMGET.
Because status changes on a different cadence than position. A driver's location updates every 5 s; their availability flips the instant they accept a trip. If status were baked into the geo member, every status flip would be a GEOADD rewrite (re-encoding the point) and every location update would have to carry the latest status or clobber it. Splitting them — but co-locating them on the same shard — lets each be written by whoever owns that fact (location pipeline writes geometry, dispatch service writes status) with no coordination, while still costing only one pipelined round trip per shard.
Stale ghosts and TTL eviction
A driver's phone dies mid-shift. Tunnel, dead battery, app crash. What happens to them in your index?
Without protection, they become a ghost — frozen at their last position, still showing up in "nearby drivers," still getting dispatched to riders who then watch a car that never moves. Pre-TTL, teams have left ghosts in the index for hours, and every ghost is a botched ETA and a failed dispatch. The fix every real implementation lands on: every location record carries a 30-second TTL, refreshed on each heartbeat update. Miss a few pings and the driver simply ages out of the index. No reaper job, no explicit "I'm going offline" message we might never receive.
TTL is a direct trade. With $T_{ping}=5$ s and $TTL=30$ s, a driver survives up to $\lfloor 30/5 \rfloor - 1 = 5$ consecutive dropped pings before expiring — tolerant of a brief tunnel, intolerant of a dead phone. Shorten the TTL toward $T_{ping}$ and ghosts clear faster but a single dropped packet evicts a healthy driver; lengthen it and ghosts linger. 30 s is the standard sweet spot: ~6 pings of grace.