ML feature store and low-latency inference serving

A feature store looks like a Redis cache with a fancy name: write features in, read them out by entity ID. That framing hides the two bugs that actually kill production ML — training-serving skew, where the value your model trained on isn't the value it serves on, and the point-in-time leak, where training quietly peeks at the future and tells you the model is brilliant right up until it ships. The interesting version retrieves 6,000 features per ranking request, answers in 40 ms at p99, serves 100,000 predictions/second, and can prove which feature value any prediction saw six months ago.

The naive instinct, and why a KV store isn't a feature store

Principal

You need features at inference time. Why not just a Redis cluster keyed by user ID? Your team already runs Redis.

Staff

That's where everyone starts, and it serves the online read perfectly well. The problem is the other half of the lifecycle. A model is trained offline on historical data and served online on live data, and the feature value has to be computed by the same logic in both worlds. If the training pipeline computes avg_order_value in a Spark job and the serving path computes it in hand-written Java, those two implementations will drift. That drift is training-serving skew, and it is the number-one silent killer in production ML.

Concretely: a team at scale shipped a model where the serving-side normalization used a slightly different mean than training. No error, no alert — the model just got steadily worse, roughly an 8% accuracy drop over two weeks, because a stale normalization constant pulled every feature off-distribution. A KV store has no opinion about how the value was produced, so it can't protect you from this. A feature store's real job is to be the single definition of a feature, written once and read by both training and serving.
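To make "written once and read by both" concrete, here is a minimal sketch. The function names and the two builder helpers are hypothetical, not part of any feature store API; the only point is that the training and serving paths call the same definition instead of re-implementing it.

```python
from statistics import mean

def avg_order_value(order_totals):
    """The single definition of the feature. Both paths import this."""
    return mean(order_totals) if order_totals else 0.0

def build_training_row(historical_orders):
    # Offline path: runs over historical data inside the training pipeline.
    return {"avg_order_value": avg_order_value(historical_orders)}

def build_serving_vector(recent_orders):
    # Online path: runs at inference time on live data.
    return {"avg_order_value": avg_order_value(recent_orders)}

orders = [10.0, 20.0, 30.0]
training_row = build_training_row(orders)
serving_vector = build_serving_vector(orders)
```

Given the same inputs, the two paths cannot disagree, because there is only one implementation to drift.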

Failure modes at scale, with numbers

Principal

Give me the failure that a junior engineer wouldn't see coming.

Staff

The point-in-time leak. Say you're predicting churn. Your label is "did the user cancel in the next 30 days," and one of your features is support_tickets_total. If your training join grabs the current ticket count instead of the count as of the prediction timestamp, you've leaked the future: users who churned filed angry tickets right before leaving, so the feature now encodes the answer. Offline AUC looks gorgeous — 0.95 — and in production it collapses to 0.6 because that information doesn't exist yet at inference time.

Napkin math — why point-in-time is a real join, not a lookup

A correct training row needs, for each label event at time $t_{\text{label}}$, the feature value from the most recent update at or before it:

$$ f_i = \text{value of feature } i \text{ at } \max\{\, t_{\text{update}} : t_{\text{update}} \le t_{\text{label}} \,\} $$

With $N = 100{,}000{,}000$ label events and $k = 2{,}500$ features, a naive nested lookup is $N \times k = 2.5\times10^{11}$ as-of seeks. Done as a row-by-row lookup against the online store it never finishes; done as a sorted merge-join (ASOF join) on partitioned Parquet in S3 it's an $O(N \log N)$ scan. This is exactly why the offline store is columnar files in S3, not the online KV — the offline store's whole reason to exist is the as-of join.
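A single as-of lookup, the building block of that join, can be sketched with a binary search over one entity's sorted update history. This is an illustration of the semantics only (the sample timestamps and values are made up), not how a Parquet merge-join is executed.

```python
from bisect import bisect_right

def as_of_value(update_times, values, t_label):
    """Value of the latest update with time <= t_label, or None if none exists.

    update_times must be sorted ascending and aligned with values.
    """
    i = bisect_right(update_times, t_label)
    return values[i - 1] if i else None

# Hypothetical history of support_tickets_total for one user.
times = [1, 5, 9]
totals = [0, 2, 7]

at_label = as_of_value(times, totals, 6)      # sees the update at t=5
leaky = totals[-1]                            # "current" value, leaks t=9
before_any = as_of_value(times, totals, 0)    # no update yet
```

The leaky version reads 7 tickets for a label at t=6, when only 2 existed at that moment.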

Principal

And the live failure? It's 3am, your on-call gets paged.

Staff

Thundering herd on feature retrieval. Suppose we cache feature vectors with a fixed TTL and we batch-refresh a popular feature group at the top of the hour. Millions of online keys expire on the same clock boundary, every inference request misses simultaneously, and they all stampede the backing store to recompute. We've measured p99 jumping 100×, from 40 ms to 4 s, for the duration of the stampede. At 100,000 predictions/second that's a multi-second brownout across every model on the platform. The fix is probabilistic early expiration: each reader refreshes a key slightly before its TTL, with a probability that rises as expiry approaches, so recomputes spread out instead of landing on one clock boundary.
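One common form of probabilistic early expiration is the XFetch-style check sketched below. The parameter names and the beta default are illustrative assumptions: recompute_cost is how long a refresh takes, and a larger beta refreshes earlier.

```python
import math
import random

def should_refresh_early(now, expiry, recompute_cost, beta=1.0, rng=random.random):
    """Return True if this reader should refresh the key before it expires.

    -log(u) is an exponential draw, so the chance of refreshing rises
    sharply as `now` approaches `expiry`.
    """
    u = 1.0 - rng()  # shift [0, 1) to (0, 1] so log(u) is always defined
    return now - recompute_cost * beta * math.log(u) >= expiry

random.seed(7)
# 10,000 readers hitting a key one recompute-time (50 ms) before expiry.
early = sum(should_refresh_early(now=99.95, expiry=100.0, recompute_cost=0.05)
            for _ in range(10_000))
```

Only a fraction of readers refresh at any instant before expiry, which is what turns a cliff into a slope.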

Online and offline store: the two-store design

Principal

So you want two stores with one definition. Build it on AWS. Why not just stand up Cassandra like Uber's Michelangelo did?

Staff

Michelangelo predates the managed option: they were running 10,000 features and 10 trillion feature computations/day on self-hosted Cassandra because in 2017 there was nothing to buy. We have SageMaker Feature Store, which gives us the dual-store contract natively: you define a feature group once, and every PutRecord writes to both an online store (low-latency KV, backed by DynamoDB or an InMemory tier) and an offline store (Parquet in S3, partitioned by event time with built-in ingestion timestamps). The point-in-time join is a first-class operation on the offline store, so the point-in-time leak from the failure-modes discussion is solved by the platform, not by us.

Two pipelines feed it. Batch features (30-day aggregates, embeddings) come from AWS Glue Spark jobs that read raw events from S3, compute, and PutRecord in bulk. Streaming features such as "orders in the last 5 minutes" come off Kinesis Data Streams (on-demand mode, so we don't hand-tune shard counts against a bursty event rate) into Managed Service for Apache Flink, which does the stateful windowed aggregation and writes to the online store within seconds. That's a Lambda architecture in the data-engineering sense (a batch layer plus a speed layer, nothing to do with AWS Lambda): the same feature definition has a batch path for correctness and a streaming path for freshness.

One commitment worth stating: the streaming aggregation engine is Flink, not Lambda. Windowed aggregation ("orders in the last 5 minutes" over a sliding window) is inherently stateful, and Lambda has no durable cross-invocation state — it would have to externalize every window to a store and re-read it, and a slow synchronous PutRecord would block the Kinesis shard. Flink keeps window state in RocksDB with checkpointing, and micro-batches its writes so a popular feature group doesn't slam Feature Store's default 1,000 PutRecord/s per group soft limit. We'd keep Lambda only as a thin pre-filter or fan-out ahead of Flink, never as the aggregator.
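To see why the window is inherently stateful, consider a toy sliding-window counter. This is plain Python, not Flink's API, and it assumes events arrive in timestamp order; the deque is exactly the state that a stateless function would have to externalize and re-read on every invocation.

```python
from collections import deque

class SlidingWindowCount:
    """Counts events in the last `window_seconds`, assuming in-order arrival."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.events = deque()  # state that must survive between events

    def add(self, ts):
        self.events.append(ts)

    def count(self, now):
        cutoff = now - self.window
        # Evict events that have aged out of the window.
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events)

orders = SlidingWindowCount(window_seconds=300)
for ts in (0, 100, 290):
    orders.add(ts)
count_at_300 = orders.count(300)
count_at_400 = orders.count(400)
```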

Decision: ElastiCache Redis in front of the online store

SageMaker Feature Store's online GetRecord is ~10–15 ms — fine for one record, fatal when one ranking request fetches 6,000 features across many groups. We put ElastiCache for Redis in front as a read cache fronting the durable store. The write ordering matters: a stream/batch write goes to Feature Store first (the durable system of record), then asynchronously populates Redis. We deliberately do not double-write synchronously — a Feature Store success with a Redis failure would silently diverge, and there's no cheap atomic two-store write. A Redis miss is then just the normal path, not an exception: it falls back to GetRecord and back-fills the key.
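The read path described here is a read-through cache. A minimal sketch, with a dict standing in for Redis and a plain callable standing in for GetRecord (both stand-ins are assumptions for illustration, not client code):

```python
class ReadThroughCache:
    """Cache in front of a durable store; a miss falls back and back-fills."""

    def __init__(self, cache, store_get):
        self.cache = cache          # dict stand-in for Redis
        self.store_get = store_get  # callable stand-in for GetRecord
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.misses += 1
        value = self.store_get(key)   # durable system of record
        if value is not None:
            self.cache[key] = value   # back-fill so the next read hits
        return value

durable = {"user:42:avg_order_value": 31.5}
reader = ReadThroughCache(cache={}, store_get=durable.get)
first = reader.get("user:42:avg_order_value")   # miss, then back-fill
second = reader.get("user:42:avg_order_value")  # hit
```

Writes never go through this class: in the design above they land in the durable store first and reach the cache asynchronously.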

One layout trap to avoid: do not hash-tag keys like {user_id}:feature to colocate all of an entity's features on one shard. That pins each entity's entire 6,000-feature read to a single shard, so at 100k pred/s a handful of hot entities can saturate one shard long before the cluster runs out of aggregate throughput. Use plain per-feature keys so an MGET is a scatter-gather across shards (2–3 round trips), and size for ~30–40 shards. If we'd rather not own shard math at all, ElastiCache Serverless auto-scales shards and is the managed-first default; we keep node-based clusters here only to pin cost and to get per-tenant ACL users on dedicated clusters. And the batch path writes deltas, not a full 30 TB replacement each cycle: a full rewrite would have bulk SETs competing with live MGETs and create a synchronized TTL cliff, the very stampede we design against.
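The effect of a hash tag on placement can be checked with a short sketch of Redis Cluster's key-to-slot rule (CRC16 of the key, or of the text inside the first non-empty {...}, modulo 16384). Python's binascii.crc_hqx with an initial value of 0 is, to my understanding, the same CRC16 variant; the key names are hypothetical.

```python
import binascii

def redis_slot(key: str) -> int:
    """Hash slot for a key, honoring a {hash tag} if one is present."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:  # ignore an empty "{}"
            key = key[start + 1:end]
    return binascii.crc_hqx(key.encode(), 0) % 16384

# 100 features for one entity, with and without a hash tag.
tagged_slots = {redis_slot(f"{{user42}}:f{i}") for i in range(100)}
plain_slots = {redis_slot(f"user42:f{i}") for i in range(100)}
```

The tagged keys collapse to a single slot, and therefore a single shard, while the plain keys spread across many slots.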

Napkin math — online store sizing

Target Lyft's reported scale: 2,500 features, 30 TB across ~100 billion values, 100,000,000 GETs/minute at peak.

$$ \frac{100\times10^{6}\ \text{GET/min}}{60} \approx 1.67\times10^{6}\ \text{GET/second} $$

A Redis node sustains ~100k–200k ops/s, but each ranking request fans an MGET across many shards (no hash-tag colocation), so per-shard read pressure is higher than the naive $1.67\times10^{6} / 1.5\times10^{5} \approx 11$ suggests. We size for ~30–40 shards to keep per-shard headroom under scatter-gather load, then add replicas for read throughput and failover; capacity for the working set comes from shard count and node size, not from replicas. (ElastiCache Serverless removes this calculation entirely by auto-scaling shards; we keep it explicit here to reason about cost and tenant isolation.) SageMaker's InMemory online tier caps at 50 GB per group, far below 30 TB, which is the concrete reason the hot path lives in a sharded ElastiCache cluster, not the InMemory tier.
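The sizing arithmetic can be reproduced in a few lines. The target_utilization values are my assumption, chosen only to show how running each shard at roughly a third of its rated throughput lands in the 30–40 range; the source does not state a utilization target.

```python
import math

def shard_estimate(gets_per_minute, ops_per_node, target_utilization=1.0):
    """Return (GETs per second, shards needed at the given utilization)."""
    gets_per_second = gets_per_minute / 60
    shards = gets_per_second / (ops_per_node * target_utilization)
    return gets_per_second, math.ceil(shards)

gps, naive = shard_estimate(100_000_000, 150_000)
_, with_headroom_35 = shard_estimate(100_000_000, 150_000, target_utilization=0.35)
_, with_headroom_30 = shard_estimate(100_000_000, 150_000, target_utilization=0.30)
```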

Inference serving and the latency budget

Principal

You promised 40 ms p99 end to end. Show me where the milliseconds go.

Staff

The request carries entity IDs (user, restaurant, session). A SageMaker real-time Endpoint hosts the model behind an auto-scaling fleet, deployed inside our VPC with VPC interface endpoints to Feature Store and Kinesis — no public-internet path, and no NAT Gateway data-processing charge on the feature traffic. Inside the request the budget is brutal, so feature retrieval and model compute have to share it.

One scaling subtlety: spinning up a new endpoint instance takes 3–10 minutes, so reactive scale-out can't protect a 40 ms p99 against a traffic burst. We run a minimum instance floor with real headroom, target-tracking on ConcurrentRequestsPerModel, and asymmetric cooldowns — scale out aggressively (~30 s), scale in slowly (~300 s) — so the fleet is already warm before the burst lands rather than chasing it.

Napkin math — the 40 ms budget

Lyft's published online SLA: p50 8 ms, p95 20 ms, p99 40 ms. Decompose it:

$$ T_{\text{total}} = T_{\text{net}} + T_{\text{features}} + T_{\text{model}} \le 40\ \text{ms} $$

Network in/out ~5 ms, model forward pass ~15 ms, which leaves ~20 ms for features. Fetching 6,000 features one at a time at 1 ms each is 6 s, which is 150× the entire 40 ms budget and 300× the feature share. Batched into a handful of Redis MGET pipelines, 6,000 features land in 2–3 round trips at ~1 ms each. That batching is the design; the rest is plumbing.
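The budget decomposition is simple enough to check in code, using only the figures quoted above (the 1 ms per round trip is the same rough estimate, not a measurement):

```python
BUDGET_MS = 40
NETWORK_MS = 5
MODEL_MS = 15
ROUND_TRIP_MS = 1
FEATURES = 6_000

feature_budget_ms = BUDGET_MS - NETWORK_MS - MODEL_MS

# One round trip per feature versus a few batched MGET round trips.
sequential_ms = FEATURES * ROUND_TRIP_MS
batched_ms = 3 * ROUND_TRIP_MS

over_total_budget = sequential_ms / BUDGET_MS
fits = batched_ms <= feature_budget_ms
```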

Staff

DoorDash hits 100,000+ predictions/second by precomputing every embedding offline and serving them from a Redis feature store behind a gRPC service, which is the same shape. We keep SageMaker Endpoints because they give managed auto-scaling, multi-model endpoints (many models on one fleet to amortize GPU/CPU cost), and built-in Model Monitor for drift. Model Monitor compares the live feature distribution against a training baseline and alarms on skew: the automated tripwire for the 8% silent decay in the stale-normalization story earlier.
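As a rough illustration of what "compare live against a baseline" means, here is a toy mean-shift check. It is not Model Monitor's algorithm, and the sample values and any alarm threshold are assumptions; it only shows the shape of the comparison.

```python
from statistics import mean, pstdev

def mean_shift_in_sigmas(baseline, live):
    """How far the live mean sits from the baseline mean, in baseline std devs."""
    sigma = pstdev(baseline)
    if sigma == 0:
        return 0.0 if mean(live) == mean(baseline) else float("inf")
    return abs(mean(live) - mean(baseline)) / sigma

baseline = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical training-time values
live_ok = [1.0, 2.0, 3.0, 4.0, 5.0]
live_shifted = [2.0, 3.0, 4.0, 5.0, 6.0]

no_drift = mean_shift_in_sigmas(baseline, live_ok)
drift = mean_shift_in_sigmas(baseline, live_shifted)
```

A stale normalization constant shows up as exactly this kind of shift: every value moves, and no individual request looks wrong.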

Failure, recovery, security, cost — and did we leave AWS?

Principal

Redis falls over at peak. What does inference do?

Staff

Fail soft, never fail blank, but bound the fail. Here's the trap in the naive version: ElastiCache failover takes 15 s to ~1 min, and "fall back to GetRecord" for 6,000 features is hundreds of ms to seconds per request. If we wait on that during a Redis AZ event, every ranking request browns out for the whole failover. So the rule is a hard 5 ms per-call timeout on the Redis read: if it doesn't answer, we trip immediately to the default-feature path, a baked-in default snapshot shipped in the serving container (no network hop), and tag the response degraded=true so downstream ranking discounts it. We don't synchronously wait on GetRecord inside a live AZ event; GetRecord fallback is for the steady-state single-key miss, not for a mass failover. An alarm fires on degraded-response rate crossing a threshold. A model serving zeros silently is worse than a model that knows it's flying blind.
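The bounded-fail rule can be sketched as a deadline wrapper around the cache read. Everything named here is hypothetical (the default snapshot, the stand-in read functions, the pool size), and treating a cache error the same as a timeout is my assumption rather than something stated above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical default snapshot; in the design above it ships in the container.
DEFAULT_FEATURES = {"avg_order_value": 0.0}
_pool = ThreadPoolExecutor(max_workers=8)

def fetch_with_deadline(read_cache, entity_id, timeout_s=0.005):
    """Read features, or return defaults tagged degraded if the read is late."""
    future = _pool.submit(read_cache, entity_id)
    try:
        return {"features": future.result(timeout=timeout_s), "degraded": False}
    except FutureTimeout:
        # The abandoned read keeps running on its worker; we only stop waiting.
        return {"features": dict(DEFAULT_FEATURES), "degraded": True}
    except Exception:
        # Assumption: a cache error takes the same degraded path as a timeout.
        return {"features": dict(DEFAULT_FEATURES), "degraded": True}

def slow_read(entity_id):   # stand-in for Redis during a failover
    time.sleep(0.3)
    return {"avg_order_value": 42.0}

def fast_read(entity_id):   # stand-in for a healthy cache
    return {"avg_order_value": 42.0}

during_failover = fetch_with_deadline(slow_read, "user42", timeout_s=0.05)
healthy = fetch_with_deadline(fast_read, "user42", timeout_s=1.0)
```

The demo uses looser timeouts than 5 ms so that thread scheduling noise does not decide the outcome; the degraded flag is what the alarm and the downstream ranker key off.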

For recovery: ElastiCache runs Multi-AZ with automatic failover; we set Kinesis retention to 7 days so the streaming replay window comfortably covers the time to rebuild a 30 TB online store, and the offline S3 store can fully rebuild the online store from scratch — that's the DR backstop. A bulk rebuild paces its writes (SQS-buffered, using the SageMaker ingest() path with retries) so it doesn't saturate the PutRecord limit. And because a silently-stalled Glue job is how the offline store goes stale without anyone noticing, we alarm on feature freshness — a CloudWatch metric on max partition event-time lag — and run Glue with job bookmarks so a retry reprocesses idempotently.

Principal

Multi-tenant. Several product teams share this platform. Compliance?

Staff

Isolation is per feature group. Each group has its own IAM resource policy and KMS key, so Team A's GetRecord can't touch Team B's group — the least-privilege control auditors want for SOC 2 and ISO 27001. On the Redis hot path a key prefix is just a convention, not a boundary, so we enforce it: ElastiCache Redis ACLs give each tenant a Redis user scoped to its key pattern, and high-sensitivity tenants get a dedicated cluster outright. Everything is encrypted at rest (KMS) and in transit (TLS).

For GDPR erasure the record identifier is the entity ID, so deletion maps to a targeted DeleteRecord in the online store plus tombstone-and-compact in the S3 offline store — but two copies are easy to miss: we also DEL the entity's Redis key, and we cap the Kinesis replay at the deletion timestamp so a cache rewarm can't resurrect erased data from the retention window. Lineage in the registry proves which models consumed that entity.

On audit: I'll be precise rather than wave at "CloudTrail logs everything." CloudTrail covers all management-plane operations, and — once data events are explicitly enabled — Feature Store PutRecord/GetRecord. It does not see the Redis data plane; that's covered by CloudWatch metrics plus application-level audit logs. We turn on CloudTrail log-file validation and put S3 Object Lock (WORM) on the trail bucket so the trail itself is tamper-evident for NIST. PII isn't protected by a hand-wave tag either: feature groups carry a pii=true resource tag, and an org-level SCP denies s3:PutObject and kinesis:PutRecord to any destination that lacks the matching tag condition, so tagged PII physically can't be written outside the sanctioned boundary.

Napkin math — monthly cost sketch

Online tier (ElastiCache). ~30–40 shards with replicas; take ~32× cache.r7g.xlarge at an on-demand rate of $0.437/h $\Rightarrow 32 \times 0.437 \times 730 \approx \$10{,}200/\text{month}$, or ~$6,100 at 1-year reserved (~$0.263/h). Add cross-AZ Multi-AZ replication and read traffic at 1.67M reads/s: on the order of $2,000–5,000/month. The VPC interface endpoints above keep NAT data-processing charges off this line.

Feature Store API (the dominant variable). At a 1% Redis miss rate, $1.67\times10^{6}\,\text{GET/s}$ implies ~16,700 GetRecord calls/s, which is ~1.44 billion per day or roughly 43 billion per month $\Rightarrow \approx \$21{,}500/\text{month}$, and streaming PutRecord compounds on top. Miss rate is the knob: at 0.1% it's ~$2,150. This single line can exceed every other line combined, which is why we drive miss rate down hard and treat it as the headline cost risk, not a footnote.
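The call-volume side of this line is easy to check. The sketch converts a peak read rate and a miss rate into fallback calls per second, per day, and per 30-day month; it deliberately leaves pricing out, since the per-call rate is not stated here. It also assumes the peak rate holds all month, which overstates real volume.

```python
SECONDS_PER_DAY = 86_400

def fallback_calls(gets_per_second, miss_rate, days=30):
    """Return (calls/second, calls/day, calls/month) that miss the cache."""
    per_second = gets_per_second * miss_rate
    per_day = per_second * SECONDS_PER_DAY
    return per_second, per_day, per_day * days

per_s, per_day, per_month = fallback_calls(1.67e6, 0.01)
_, _, per_month_low = fallback_calls(1.67e6, 0.001)
```

At a 1% miss rate the per-day figure is about 1.44 billion calls, and volume scales linearly with miss rate, which is why a tenfold drop in misses cuts the bill tenfold.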

Kinesis. Anchored to the event rate: at 50k records/s on-demand ~$550/month; at 1M records/s ~$11,000/month. We assume ~50k–100k records/s here.

Offline store + lifecycle. 30 TB/month of Parquet at $0.023/GB is ~$700 for one month, but it accumulates (360 TB after a year is ~$8,280/month), so an S3 lifecycle rule transitions partitions older than 90 days to Glacier Instant Retrieval and expires data beyond the regulatory retention window. Model Monitor adds ~$200–500/month for the monitoring job instance plus data-capture S3.
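The accumulation effect is worth seeing as a function of time. This sketch uses the figures above, counts 1 TB as 1,000 GB as the text does, and models no lifecycle rule at all, which is the case the lifecycle policy exists to avoid.

```python
def s3_monthly_bill(months_elapsed, tb_per_month=30, usd_per_gb=0.023):
    """Monthly S3 Standard storage bill after N months with nothing expired."""
    stored_gb = months_elapsed * tb_per_month * 1000
    return stored_gb * usd_per_gb

month_1 = round(s3_monthly_bill(1))
month_12 = round(s3_monthly_bill(12))
```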

Serving. SageMaker Endpoints: ~40× ml.c7g.2xlarge at ~$0.34/h $\Rightarrow \approx \$10{,}000/\text{month}$; multi-model endpoints packing many models onto one fleet is the biggest lever and easily halves this line. Total: the conservative baseline lands ~$30k–45k/month, and Feature Store API call volume is the single largest variable — it sets whether the bill is closer to the floor or well above it.

Principal

Did you ever leave AWS?

Staff

No. Feature Store, ElastiCache, S3, Kinesis, Managed Service for Apache Flink, Glue, SageMaker Endpoints and Model Monitor cover every box. The registry is SageMaker Feature Store's built-in feature metadata and lineage, with the Glue Data Catalog tracking offline-store schema. Flink, the one component people reach for self-hosted, is consumed here as a fully managed AWS service. Michelangelo and Lyft's Dryft built bespoke because the managed services didn't exist when those systems were built. They do now, so the honest answer is: this entire platform is AWS-managed.
