Feature stores

Definition

A feature store is a data system specifically designed to manage the lifecycle of ML features — from raw data transformation, through storage, to low-latency serving — in a way that is consistent between model training and production inference. Without a feature store, teams commonly encounter training-serving skew: the feature computation logic executed offline during training differs subtly from the logic used at serving time, causing production models to underperform relative to offline evaluation.
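A minimal illustration of how such skew can arise. The functions and the null-handling difference below are hypothetical; they stand in for any subtle divergence in business logic between the offline pipeline and the online service:

```python
# Hypothetical training-serving skew: two teams implement "average spend"
# with subtly different handling of missing values.
transactions = [120.0, None, 80.0, None, 100.0]  # raw event stream

def avg_spend_offline(txns):
    # Offline training pipeline: drops missing values before averaging.
    vals = [t for t in txns if t is not None]
    return sum(vals) / len(vals)

def avg_spend_online(txns):
    # Online serving path: imputes missing values as 0.0 — a silent divergence.
    vals = [t if t is not None else 0.0 for t in txns]
    return sum(vals) / len(vals)

offline = avg_spend_offline(transactions)  # 100.0
online = avg_spend_online(transactions)    # 60.0
print(f"offline={offline}, online={online}, skew={offline - online}")
```

The model was trained on values like `offline` but scores on values like `online`; its offline evaluation will not reflect production behavior.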

Feature stores address this by storing feature definitions as code and running the same transformation logic in both contexts. They maintain two complementary storage layers: an offline store (a data warehouse or data lake, e.g. BigQuery, Redshift, Parquet files on S3) that holds large historical datasets used for training and batch scoring, and an online store (a low-latency key-value database, e.g. Redis, DynamoDB, Cassandra) that serves precomputed feature values to models at inference time with sub-millisecond latency.

The problem of training-serving skew and the need for feature reuse become acute at scale. A large organization may have dozens of teams each computing similar features (customer spend in the last 7 days, session length, device type) independently, with subtle differences in business logic. A feature store provides a governed catalog where features are defined once, validated, and reused across teams and models, dramatically reducing duplicated engineering effort and the risk of inconsistent feature logic.

How it works

Feature definition and transformation pipelines

Features are defined as code — Python classes or YAML manifests — that specify the data source, transformation logic, and the entity key (the identifier used to look up features, e.g., user_id, product_id). Batch transformation pipelines run on a schedule to materialize features into the offline store. Stream transformation pipelines (e.g., using Flink or Spark Structured Streaming) keep the online store fresh for time-sensitive features such as real-time fraud signals.
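A framework-agnostic sketch of "features as code": the `FeatureDefinition` class, its field names, and the example transformation are illustrative, not any particular library's API.

```python
# Hypothetical "feature as code" definition: data source, entity key, and
# transformation logic live together in one versioned artifact.
from dataclasses import dataclass
from typing import Callable

import pandas as pd

@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    entity_key: str                                   # e.g. "user_id"
    source: str                                       # table/topic the pipeline reads
    transform: Callable[[pd.DataFrame], pd.DataFrame]

def spend_last_7d(events: pd.DataFrame) -> pd.DataFrame:
    # Batch transformation: total spend per user over the trailing 7 days.
    cutoff = events["event_timestamp"].max() - pd.Timedelta(days=7)
    recent = events[events["event_timestamp"] > cutoff]
    return recent.groupby("user_id", as_index=False)["amount"].sum()

feature = FeatureDefinition(
    name="spend_last_7d",
    entity_key="user_id",
    source="events.transactions",
    transform=spend_last_7d,
)
```

Because the same `transform` callable runs in both the batch and the serving context, there is a single place where the feature's logic can drift.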

Offline store: training data retrieval

When training a model, you generate a dataset by providing a list of entity keys and a set of timestamps (a "point-in-time join"). The feature store retrieves the feature values that were correct as of each timestamp, avoiding future data leakage. This point-in-time correctness is one of the hardest things to implement correctly without a feature store and one of the most valuable guarantees it provides.
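The mechanics of a point-in-time join can be sketched with `pandas.merge_asof`: for each label row, it picks the most recent feature value at or before that row's timestamp, never one from the future. Column names here are illustrative.

```python
# Sketch of a point-in-time join: each training row gets the latest feature
# value as of its own timestamp ("backward" direction = no future leakage).
import pandas as pd

features = pd.DataFrame({
    "user_id": [1, 1, 1],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-10"]),
    "spend_7d": [10.0, 50.0, 90.0],
})

labels = pd.DataFrame({
    "user_id": [1, 1],
    "event_timestamp": pd.to_datetime(["2024-01-06", "2024-01-11"]),
    "label": [0, 1],
})

# Both frames must be sorted on the time column for merge_asof.
training = pd.merge_asof(
    labels.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="user_id",
    direction="backward",  # only look at values at or before each timestamp
)
print(training[["event_timestamp", "spend_7d", "label"]])
```

The row labeled at 2024-01-06 receives the value 50.0 (valid as of 2024-01-05), not the later 90.0 that a naive latest-value join would leak in.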

Online store: low-latency serving

Before a model serves a prediction, it needs feature values for the entity being scored (e.g., the user making a request). The feature store client queries the online store by entity key and returns a feature vector in milliseconds. Because the same feature definitions underpin both the offline and online stores, the values are guaranteed to be computed identically.
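At its core the online path is a key-value lookup. In this hypothetical sketch a plain dict stands in for Redis or DynamoDB, and the function names are illustrative rather than any real client API:

```python
# Sketch of online serving: precomputed feature rows keyed by (entity, id).
ONLINE_STORE = {
    ("driver", 1001): {"conv_rate": 0.8, "avg_daily_trips": 150},
    ("driver", 1002): {"conv_rate": 0.6, "avg_daily_trips": 200},
}

def get_feature_vector(entity: str, entity_id: int, feature_names: list) -> list:
    row = ONLINE_STORE.get((entity, entity_id), {})
    # Missing features come back as None so the caller can apply a default.
    return [row.get(name) for name in feature_names]

vector = get_feature_vector("driver", 1001, ["conv_rate", "avg_daily_trips"])
print(vector)  # [0.8, 150]
```

The model never recomputes features at request time; it only pays for a single indexed read.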

Feature registry and governance

A feature catalog documents each feature: its definition, owner, data type, freshness guarantee, and which models consume it. This governance layer enables discoverability — a new team can browse existing features before writing their own — and impact analysis — understanding which models are affected if a feature's upstream data source changes.
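A toy catalog supporting those two queries (discoverability and impact analysis) might look like the following; every name here is hypothetical.

```python
# Hypothetical feature catalog: register features with metadata, record which
# models consume them, and answer "which models break if this feature changes?"
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    owner: str
    dtype: str
    freshness: str                       # e.g. "hourly", "daily"
    consumers: set = field(default_factory=set)

catalog = {}

def register(entry: CatalogEntry) -> None:
    catalog[entry.name] = entry

def record_consumer(feature: str, model: str) -> None:
    catalog[feature].consumers.add(model)

def impact_of(feature: str) -> list:
    # Impact analysis: every model that reads this feature.
    return sorted(catalog[feature].consumers)

register(CatalogEntry("spend_7d", owner="payments-team", dtype="float", freshness="hourly"))
record_consumer("spend_7d", "fraud_model_v3")
record_consumer("spend_7d", "churn_model_v1")
print(impact_of("spend_7d"))  # ['churn_model_v1', 'fraud_model_v3']
```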

Materialization jobs

Materialization is the process of running transformation pipelines and writing results to the stores. Offline materialization runs as a scheduled batch job. Online materialization copies a subset of offline data into the online store for fast retrieval, or is driven by streaming pipelines when real-time freshness is required. Feast, Tecton, and Hopsworks all provide CLI commands or orchestration integrations to trigger and monitor materialization.
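Incremental online materialization can be sketched as copying only rows newer than a watermark from the offline store into a key-value online store. A dict plays the online store here, and all names are illustrative:

```python
# Sketch of incremental materialization: latest value per entity key wins.
import pandas as pd

offline = pd.DataFrame({
    "driver_id": [1001, 1002, 1001],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "conv_rate": [0.7, 0.6, 0.8],
})

online_store = {}

def materialize_incremental(df: pd.DataFrame, watermark: pd.Timestamp) -> pd.Timestamp:
    # Only copy rows produced since the last run.
    new_rows = df[df["event_timestamp"] > watermark].sort_values("event_timestamp")
    for row in new_rows.itertuples():
        online_store[row.driver_id] = {"conv_rate": row.conv_rate}
    # Advance the watermark only if something was written.
    return df["event_timestamp"].max() if len(new_rows) else watermark

wm = materialize_incremental(offline, pd.Timestamp("2023-12-31"))
print(online_store)  # {1001: {'conv_rate': 0.8}, 1002: {'conv_rate': 0.6}}
```

Re-running with the returned watermark is a no-op, which is what lets schedulers trigger the job repeatedly without rewriting the whole store.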

When to use / When NOT to use

| Use when | Avoid when |
| --- | --- |
| Multiple teams or models share the same feature logic and consistency is critical | You have a single model with a small, stable feature set that never changes |
| Training-serving skew has caused production incidents or accuracy gaps | Your inference latency requirements are relaxed and batch scoring is sufficient |
| You need point-in-time correct training datasets to avoid data leakage | The engineering overhead of operating a feature store exceeds the project's scale |
| Features must be served at sub-millisecond latency for real-time predictions | You are in early exploration and your features are not yet stable enough to formalize |
| Regulatory requirements demand a governed, auditable feature catalog | Your data science team is small and lacks ML engineering support to manage infrastructure |

Comparisons

| Criterion | Feast | Tecton | Hopsworks |
| --- | --- | --- | --- |
| Open-source | Yes (Apache 2.0) | No (SaaS / managed) | Yes (core; enterprise paid) |
| Managed offering | No (self-hosted only) | Yes (fully managed) | Yes (cloud or on-prem) |
| Streaming features | Limited (via Kafka source) | Native, production-grade | Native, with Flink integration |
| Feature monitoring | Basic | Advanced (built-in drift detection) | Advanced |
| Best for | Teams wanting OSS control | Enterprises needing managed real-time features | Teams wanting full-stack open-source |

Pros and cons

| Pros | Cons |
| --- | --- |
| Eliminates training-serving skew by sharing transformation logic | Significant engineering investment to set up and operate |
| Enables feature reuse across teams, reducing duplicated effort | Adds an operational dependency to the serving path (online store availability) |
| Point-in-time joins prevent data leakage in training data | Feature definitions can become a bottleneck if governance is too rigid |
| Centralizes feature governance and documentation | Learning curve for data scientists unfamiliar with the abstraction |
| Supports both batch and real-time feature serving | Overkill for teams with a small number of models and stable features |

Code examples

# feast_feature_store_example.py
# Demonstrates defining, materializing, and retrieving features with Feast.
# Prerequisites:
#   pip install feast pandas scikit-learn
#   feast init my_feature_repo && cd my_feature_repo
# (Adjust the data source path below to match your environment.)

# ── feature_repo/features.py ──────────────────────────────────────────────────
# This file defines the feature views and entities in your Feast registry.

from datetime import timedelta

import pandas as pd
from feast import (
    Entity,
    FeatureStore,
    FeatureView,
    Field,
    FileSource,
)
from feast.types import Float32, Int64

# 1. Define the entity — the primary key used to look up features
driver = Entity(
    name="driver",
    join_keys=["driver_id"],  # column that identifies the entity in source data
    description="A taxi driver identified by driver_id",
)

# 2. Define the data source (parquet file for local demo; swap for BigQuery etc.)
driver_stats_source = FileSource(
    path="data/driver_stats.parquet",  # generated below
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# 3. Define a FeatureView — the transformation and storage spec
driver_stats_fv = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=7),  # how long features stay valid
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,  # materialize to online store
    source=driver_stats_source,
)


# ── generate_sample_data.py ───────────────────────────────────────────────────
# Run this once to create sample data before materializing.
def generate_driver_stats(path: str = "data/driver_stats.parquet") -> None:
    import os

    os.makedirs("data", exist_ok=True)

    rng = pd.date_range(end=pd.Timestamp.now(tz="UTC"), periods=48, freq="h")
    df = pd.DataFrame({
        "driver_id": [1001, 1002, 1003] * 16,
        "event_timestamp": list(rng),
        "created": pd.Timestamp.now(tz="UTC"),
        "conv_rate": [0.8, 0.6, 0.9] * 16,
        "acc_rate": [0.95, 0.88, 0.92] * 16,
        "avg_daily_trips": [150, 200, 175] * 16,
    })
    df.to_parquet(path, index=False)
    print(f"Sample data written to {path}")


# ── training_data_retrieval.py ────────────────────────────────────────────────
# Retrieve a point-in-time correct training dataset.
def get_training_data(repo_path: str = ".") -> pd.DataFrame:
    store = FeatureStore(repo_path=repo_path)

    # Entity DataFrame: the entities and timestamps we want features for.
    # Note: timestamps must fall within your generated data range (and the
    # FeatureView's TTL), or the joined feature values will be null.
    entity_df = pd.DataFrame({
        "driver_id": [1001, 1002, 1003],
        "event_timestamp": [
            pd.Timestamp("2024-01-15 10:00:00", tz="UTC"),
            pd.Timestamp("2024-01-15 11:00:00", tz="UTC"),
            pd.Timestamp("2024-01-15 12:00:00", tz="UTC"),
        ],
        "label": [1, 0, 1],  # target variable for supervised training
    })

    # Point-in-time join: retrieves feature values as-of each row's timestamp
    training_df = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "driver_hourly_stats:conv_rate",
            "driver_hourly_stats:acc_rate",
            "driver_hourly_stats:avg_daily_trips",
        ],
    ).to_df()

    print("Training dataset:")
    print(training_df.to_string())
    return training_df


# ── online_serving.py ─────────────────────────────────────────────────────────
# Retrieve features for real-time inference after materialization.
def get_online_features(driver_ids: list, repo_path: str = ".") -> dict:
    store = FeatureStore(repo_path=repo_path)

    # Materialize features to the online store first:
    # store.materialize_incremental(end_date=pd.Timestamp.now(tz="UTC"))

    feature_vector = store.get_online_features(
        features=[
            "driver_hourly_stats:conv_rate",
            "driver_hourly_stats:acc_rate",
            "driver_hourly_stats:avg_daily_trips",
        ],
        entity_rows=[{"driver_id": did} for did in driver_ids],
    ).to_dict()

    print("Online feature vector:")
    for key, values in feature_vector.items():
        print(f"  {key}: {values}")
    return feature_vector


# ── main ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    generate_driver_stats()
    # After running `feast apply` to register the feature views:
    # training_df = get_training_data()
    # online_fv = get_online_features([1001, 1002])
    print("Feature definitions ready. Run `feast apply` to register them.")
