Feature stores

Definition

A feature store is a data system specifically designed to manage the lifecycle of ML features — from raw data transformation, through storage, to low-latency serving — in a way that is consistent between model training and production inference. Without a feature store, teams commonly encounter training-serving skew: the feature computation logic executed offline during training differs subtly from the logic used at serving time, causing production models to underperform relative to offline evaluation.
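A minimal illustration of how such skew can arise. The functions and the null-handling difference below are hypothetical; they stand in for any subtle divergence in business logic between the offline pipeline and the online service:

```python
# Hypothetical training-serving skew: two teams implement "average spend"
# with subtly different handling of missing values.
transactions = [120.0, None, 80.0, None, 100.0]  # raw event stream

def avg_spend_offline(txns):
    # Offline training pipeline: drops missing values before averaging.
    vals = [t for t in txns if t is not None]
    return sum(vals) / len(vals)

def avg_spend_online(txns):
    # Online serving path: imputes missing values as 0.0 — a silent divergence.
    vals = [t if t is not None else 0.0 for t in txns]
    return sum(vals) / len(vals)

offline = avg_spend_offline(transactions)  # 100.0
online = avg_spend_online(transactions)    # 60.0
print(f"offline={offline}, online={online}, skew={offline - online}")
```

The model was trained on values like `offline` but scores on values like `online`; its offline evaluation will not reflect production behavior.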

Feature stores address this by storing feature definitions as code and running the same transformation logic in both contexts. They maintain two complementary storage layers: an offline store (a data warehouse or data lake, e.g. BigQuery, Redshift, Parquet files on S3) that holds large historical datasets used for training and batch scoring, and an online store (a low-latency key-value database, e.g. Redis, DynamoDB, Cassandra) that serves precomputed feature values to models at inference time with sub-millisecond latency.

The problem of training-serving skew and the need for feature reuse become acute at scale. A large organization may have dozens of teams each computing similar features (customer spend in the last 7 days, session length, device type) independently, with subtle differences in business logic. A feature store provides a governed catalog where features are defined once, validated, and reused across teams and models, dramatically reducing duplicated engineering effort and the risk of inconsistent feature logic.

How it works

Feature definition and transformation pipelines

Features are defined as code — Python classes or YAML manifests — that specify the data source, transformation logic, and the entity key (the identifier used to look up features, e.g., user_id, product_id). Batch transformation pipelines run on a schedule to materialize features into the offline store. Stream transformation pipelines (e.g., using Flink or Spark Structured Streaming) keep the online store fresh for time-sensitive features such as real-time fraud signals.
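A framework-agnostic sketch of "features as code": the `FeatureDefinition` class, its field names, and the example transformation are illustrative, not any particular library's API.

```python
# Hypothetical "feature as code" definition: data source, entity key, and
# transformation logic live together in one versioned artifact.
from dataclasses import dataclass
from typing import Callable

import pandas as pd

@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    entity_key: str                                   # e.g. "user_id"
    source: str                                       # table/topic the pipeline reads
    transform: Callable[[pd.DataFrame], pd.DataFrame]

def spend_last_7d(events: pd.DataFrame) -> pd.DataFrame:
    # Batch transformation: total spend per user over the trailing 7 days.
    cutoff = events["event_timestamp"].max() - pd.Timedelta(days=7)
    recent = events[events["event_timestamp"] > cutoff]
    return recent.groupby("user_id", as_index=False)["amount"].sum()

feature = FeatureDefinition(
    name="spend_last_7d",
    entity_key="user_id",
    source="events.transactions",
    transform=spend_last_7d,
)
```

Because the same `transform` callable runs in both the batch and the serving context, there is a single place where the feature's logic can drift.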

Offline store: training data retrieval

When training a model, you generate a dataset by providing a list of entity keys and a set of timestamps (a "point-in-time join"). The feature store retrieves the feature values that were correct as of each timestamp, avoiding future data leakage. This point-in-time correctness is one of the hardest things to implement correctly without a feature store and one of the most valuable guarantees it provides.
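The mechanics of a point-in-time join can be sketched with `pandas.merge_asof`: for each label row, it picks the most recent feature value at or before that row's timestamp, never one from the future. Column names here are illustrative.

```python
# Sketch of a point-in-time join: each training row gets the latest feature
# value as of its own timestamp ("backward" direction = no future leakage).
import pandas as pd

features = pd.DataFrame({
    "user_id": [1, 1, 1],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-10"]),
    "spend_7d": [10.0, 50.0, 90.0],
})

labels = pd.DataFrame({
    "user_id": [1, 1],
    "event_timestamp": pd.to_datetime(["2024-01-06", "2024-01-11"]),
    "label": [0, 1],
})

# Both frames must be sorted on the time column for merge_asof.
training = pd.merge_asof(
    labels.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="user_id",
    direction="backward",  # only look at values at or before each timestamp
)
print(training[["event_timestamp", "spend_7d", "label"]])
```

The row labeled at 2024-01-06 receives the value 50.0 (valid as of 2024-01-05), not the later 90.0 that a naive latest-value join would leak in.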

Online store: low-latency serving

Before a model serves a prediction, it needs feature values for the entity being scored (e.g., the user making a request). The feature store client queries the online store by entity key and returns a feature vector in milliseconds. Because the same feature definitions underpin both the offline and online stores, the values are guaranteed to be computed identically.
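At its core the online path is a key-value lookup. In this hypothetical sketch a plain dict stands in for Redis or DynamoDB, and the function names are illustrative rather than any real client API:

```python
# Sketch of online serving: precomputed feature rows keyed by (entity, id).
ONLINE_STORE = {
    ("driver", 1001): {"conv_rate": 0.8, "avg_daily_trips": 150},
    ("driver", 1002): {"conv_rate": 0.6, "avg_daily_trips": 200},
}

def get_feature_vector(entity: str, entity_id: int, feature_names: list) -> list:
    row = ONLINE_STORE.get((entity, entity_id), {})
    # Missing features come back as None so the caller can apply a default.
    return [row.get(name) for name in feature_names]

vector = get_feature_vector("driver", 1001, ["conv_rate", "avg_daily_trips"])
print(vector)  # [0.8, 150]
```

The model never recomputes features at request time; it only pays for a single indexed read.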

Feature registry and governance

A feature catalog documents each feature: its definition, owner, data type, freshness guarantee, and which models consume it. This governance layer enables discoverability — a new team can browse existing features before writing their own — and impact analysis — understanding which models are affected if a feature's upstream data source changes.
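A toy catalog supporting those two queries (discoverability and impact analysis) might look like the following; every name here is hypothetical.

```python
# Hypothetical feature catalog: register features with metadata, record which
# models consume them, and answer "which models break if this feature changes?"
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    owner: str
    dtype: str
    freshness: str                       # e.g. "hourly", "daily"
    consumers: set = field(default_factory=set)

catalog = {}

def register(entry: CatalogEntry) -> None:
    catalog[entry.name] = entry

def record_consumer(feature: str, model: str) -> None:
    catalog[feature].consumers.add(model)

def impact_of(feature: str) -> list:
    # Impact analysis: every model that reads this feature.
    return sorted(catalog[feature].consumers)

register(CatalogEntry("spend_7d", owner="payments-team", dtype="float", freshness="hourly"))
record_consumer("spend_7d", "fraud_model_v3")
record_consumer("spend_7d", "churn_model_v1")
print(impact_of("spend_7d"))  # ['churn_model_v1', 'fraud_model_v3']
```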

Materialization jobs

Materialization is the process of running transformation pipelines and writing results to the stores. Offline materialization runs as a scheduled batch job. Online materialization copies a subset of offline data into the online store for fast retrieval, or is driven by streaming pipelines when real-time freshness is required. Feast, Tecton, and Hopsworks all provide CLI commands or orchestration integrations to trigger and monitor materialization.
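Incremental online materialization can be sketched as copying only rows newer than a watermark from the offline store into a key-value online store. A dict plays the online store here, and all names are illustrative:

```python
# Sketch of incremental materialization: latest value per entity key wins.
import pandas as pd

offline = pd.DataFrame({
    "driver_id": [1001, 1002, 1001],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "conv_rate": [0.7, 0.6, 0.8],
})

online_store = {}

def materialize_incremental(df: pd.DataFrame, watermark: pd.Timestamp) -> pd.Timestamp:
    # Only copy rows produced since the last run.
    new_rows = df[df["event_timestamp"] > watermark].sort_values("event_timestamp")
    for row in new_rows.itertuples():
        online_store[row.driver_id] = {"conv_rate": row.conv_rate}
    # Advance the watermark only if something was written.
    return df["event_timestamp"].max() if len(new_rows) else watermark

wm = materialize_incremental(offline, pd.Timestamp("2023-12-31"))
print(online_store)  # {1001: {'conv_rate': 0.8}, 1002: {'conv_rate': 0.6}}
```

Re-running with the returned watermark is a no-op, which is what lets schedulers trigger the job repeatedly without rewriting the whole store.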

When to use / When NOT to use

| Use when | Avoid when |
| --- | --- |
| Multiple teams or models share the same feature logic and consistency is critical | You have a single model with a small, stable feature set that never changes |
| Training-serving skew has caused production incidents or accuracy gaps | Your inference latency requirements are relaxed and batch scoring is sufficient |
| You need point-in-time correct training datasets to avoid data leakage | The engineering overhead of operating a feature store exceeds the project's scale |
| Features must be served at sub-millisecond latency for real-time predictions | You are in early exploration and your features are not yet stable enough to formalize |
| Regulatory requirements demand a governed, auditable feature catalog | Your data science team is small and lacks ML engineering support to manage infrastructure |

Comparisons

| Criterion | Feast | Tecton | Hopsworks |
| --- | --- | --- | --- |
| Open-source | Yes (Apache 2.0) | No (SaaS / managed) | Yes (core; enterprise paid) |
| Managed offering | No (self-hosted only) | Yes (fully managed) | Yes (cloud or on-prem) |
| Streaming features | Limited (via Kafka source) | Native, production-grade | Native, with Flink integration |
| Feature monitoring | Basic | Advanced (built-in drift detection) | Advanced |
| Best for | Teams wanting OSS control | Enterprises needing managed real-time features | Teams wanting full-stack open-source |

Pros and cons

| Pros | Cons |
| --- | --- |
| Eliminates training-serving skew by sharing transformation logic | Significant engineering investment to set up and operate |
| Enables feature reuse across teams, reducing duplicated effort | Adds an operational dependency to the serving path (online store availability) |
| Point-in-time joins prevent data leakage in training data | Feature definitions can become a bottleneck if governance is too rigid |
| Centralizes feature governance and documentation | Learning curve for data scientists unfamiliar with the abstraction |
| Supports both batch and real-time feature serving | Overkill for teams with a small number of models and stable features |

Code examples

# feast_feature_store_example.py
# Demonstrates defining, materializing, and retrieving features with Feast.
# Prerequisites:
#   pip install feast pandas scikit-learn
#   feast init my_feature_repo && cd my_feature_repo
# (Adjust the data source path below to match your environment.)

# ── feature_repo/features.py ──────────────────────────────────────────────────
# This file defines the feature views and entities in your Feast registry.

from datetime import timedelta

import pandas as pd
from feast import (
    Entity,
    FeatureStore,
    FeatureView,
    Field,
    FileSource,
)
from feast.types import Float32, Int64

# 1. Define the entity — the primary key used to look up features
driver = Entity(
    name="driver",
    join_keys=["driver_id"],  # column that identifies the entity in source data
    description="A taxi driver identified by driver_id",
)

# 2. Define the data source (parquet file for local demo; swap for BigQuery etc.)
driver_stats_source = FileSource(
    path="data/driver_stats.parquet",  # generated below
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# 3. Define a FeatureView — the transformation and storage spec
driver_stats_fv = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=7),  # how long features stay valid
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,  # materialize to online store
    source=driver_stats_source,
)


# ── generate_sample_data.py ───────────────────────────────────────────────────
# Run this once to create sample data before materializing.
def generate_driver_stats(path: str = "data/driver_stats.parquet") -> None:
    import os

    os.makedirs("data", exist_ok=True)

    rng = pd.date_range(end=pd.Timestamp.now(tz="UTC"), periods=48, freq="h")
    df = pd.DataFrame({
        "driver_id": [1001, 1002, 1003] * 16,
        "event_timestamp": list(rng),
        "created": pd.Timestamp.now(tz="UTC"),
        "conv_rate": [0.8, 0.6, 0.9] * 16,
        "acc_rate": [0.95, 0.88, 0.92] * 16,
        "avg_daily_trips": [150, 200, 175] * 16,
    })
    df.to_parquet(path, index=False)
    print(f"Sample data written to {path}")


# ── training_data_retrieval.py ────────────────────────────────────────────────
# Retrieve a point-in-time correct training dataset.
def get_training_data(repo_path: str = ".") -> pd.DataFrame:
    store = FeatureStore(repo_path=repo_path)

    # Entity DataFrame: the entities and timestamps we want features for.
    # Note: timestamps must fall within your generated data range (and the
    # FeatureView's TTL), or the joined feature values will be null.
    entity_df = pd.DataFrame({
        "driver_id": [1001, 1002, 1003],
        "event_timestamp": [
            pd.Timestamp("2024-01-15 10:00:00", tz="UTC"),
            pd.Timestamp("2024-01-15 11:00:00", tz="UTC"),
            pd.Timestamp("2024-01-15 12:00:00", tz="UTC"),
        ],
        "label": [1, 0, 1],  # target variable for supervised training
    })

    # Point-in-time join: retrieves feature values as-of each row's timestamp
    training_df = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "driver_hourly_stats:conv_rate",
            "driver_hourly_stats:acc_rate",
            "driver_hourly_stats:avg_daily_trips",
        ],
    ).to_df()

    print("Training dataset:")
    print(training_df.to_string())
    return training_df


# ── online_serving.py ─────────────────────────────────────────────────────────
# Retrieve features for real-time inference after materialization.
def get_online_features(driver_ids: list, repo_path: str = ".") -> dict:
    store = FeatureStore(repo_path=repo_path)

    # Materialize features to the online store first:
    # store.materialize_incremental(end_date=pd.Timestamp.now(tz="UTC"))

    feature_vector = store.get_online_features(
        features=[
            "driver_hourly_stats:conv_rate",
            "driver_hourly_stats:acc_rate",
            "driver_hourly_stats:avg_daily_trips",
        ],
        entity_rows=[{"driver_id": did} for did in driver_ids],
    ).to_dict()

    print("Online feature vector:")
    for key, values in feature_vector.items():
        print(f"  {key}: {values}")
    return feature_vector


# ── main ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    generate_driver_stats()
    # After running `feast apply` to register the feature views:
    # training_df = get_training_data()
    # online_fv = get_online_features([1001, 1002])
    print("Feature definitions ready. Run `feast apply` to register them.")
