Data pipelines

Definition

A data pipeline is an automated sequence of steps that moves raw data from one or more sources to a destination where it can be consumed — by analysts, dashboards, or machine learning models. In the ML context, pipelines are not just about moving data: they ensure that data arrives in the right shape, at the right time, and with verifiable quality so that models train and serve predictably. Without reliable pipelines, every downstream artifact — features, trained models, predictions — is suspect.

Data pipelines sit at the foundation of every MLOps system. They encompass ingestion from heterogeneous sources (databases, APIs, event streams, files), transformation to produce clean and structured datasets or feature vectors, storage in data warehouses or feature stores, and serving to training jobs or online inference endpoints. The design choices made at the pipeline layer — batch vs. streaming, push vs. pull, schema-on-read vs. schema-on-write — propagate all the way to model latency, freshness, and reliability.

Data quality is the hidden contract between data engineers and model teams. Schema drift, null explosions, distribution shift, and duplicate records are among the most common causes of silent model degradation. Modern pipelines embed validation checkpoints (using tools like Great Expectations or dbt tests) to catch these issues before bad data reaches training or serving.
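A validation checkpoint does not require a framework to be useful. The sketch below (hypothetical helper names, pandas assumed) illustrates the fail-fast style such checkpoints take: each check raises immediately so that schema drift or a null explosion stops the pipeline before bad data reaches training.

```python
import pandas as pd


def check_no_schema_drift(df: pd.DataFrame, expected_cols: set) -> None:
    """Fail fast if columns were added or dropped upstream."""
    actual = set(df.columns)
    if actual != expected_cols:
        raise ValueError(
            f"schema drift: missing={expected_cols - actual}, "
            f"unexpected={actual - expected_cols}"
        )


def check_null_explosion(df: pd.DataFrame, column: str,
                         max_null_frac: float = 0.05) -> None:
    """Reject a batch whose null rate jumps past a tolerated threshold."""
    frac = df[column].isna().mean()
    if frac > max_null_frac:
        raise ValueError(
            f"{column}: null fraction {frac:.1%} exceeds {max_null_frac:.1%}"
        )


batch = pd.DataFrame({"user_id": [1, 2, 2], "amount": [9.99, None, 4.50]})
check_no_schema_drift(batch, {"user_id", "amount"})       # passes
check_null_explosion(batch, "amount", max_null_frac=0.5)  # passes: 1/3 null
```

Tools like Great Expectations or dbt tests provide the same idea with richer reporting, but the contract is identical: declare expectations, run them between stages, and fail loudly.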

How it works

Batch vs. streaming

Batch pipelines process data in bounded chunks on a schedule — hourly, daily, or triggered by file arrival. They are simpler to build and reason about and are the right default when the downstream consumer (a nightly training job, a BI dashboard) does not require sub-minute freshness. Streaming pipelines process records as they arrive, enabling near-real-time features for online models. The trade-off is operational complexity: you must handle late arrivals, out-of-order events, and exactly-once semantics. Most mature ML platforms run both: batch for large-scale retraining and offline evaluation, streaming for online feature computation.

ETL vs. ELT

Extract-Transform-Load (ETL) applies transformations before data lands in the destination store. This was the dominant pattern when storage was expensive and warehouses lacked compute. Extract-Load-Transform (ELT) loads raw data first, then transforms it inside a powerful warehouse or lakehouse (e.g. BigQuery, Snowflake, Databricks). ELT preserves raw history and enables ad-hoc exploration without re-ingestion — a major advantage in ML workloads where feature engineering evolves constantly. The choice is mostly driven by tooling, governance requirements, and whether the destination system can handle the transformation compute efficiently.
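The ELT pattern can be sketched in miniature with SQLite standing in for the warehouse (hypothetical table and column names): raw records are landed untouched, and the transformation runs as SQL inside the destination, leaving the raw history available for re-derivation when feature logic changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load step: land raw data as-is, preserving full history
conn.execute("CREATE TABLE raw_events (user_id INT, amount REAL, category TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [(1, 9.99, "books"), (2, 4.50, None), (1, 9.99, "books")],  # note the duplicate
)

# Transform step: runs inside the destination store; raw table stays untouched
conn.execute("""
    CREATE TABLE clean_events AS
    SELECT DISTINCT user_id,
           amount,
           COALESCE(category, 'unknown') AS category
    FROM raw_events
    WHERE amount >= 0
""")

rows = conn.execute("SELECT * FROM clean_events ORDER BY user_id").fetchall()
```

In classic ETL, the deduplication and null handling would instead happen in the pipeline process before the insert, and the raw duplicates would never reach the store.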

Data quality and schema validation

Data quality checks should be embedded at every stage of the pipeline, not bolted on at the end. At ingestion, checks verify that source data conforms to the expected schema (column names, types, nullable constraints). At transformation, row-level checks assert business rules (non-negative prices, valid date ranges, referential integrity). At the serving layer, statistical checks detect distribution drift — the silent killer of deployed models. Schema validation can be done with tools like Pandera, Great Expectations, or dbt tests; distribution monitoring is typically handled by dedicated observability layers.
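As one concrete form of the statistical check mentioned above, the Population Stability Index (PSI) compares the binned distribution of a live sample against a reference sample from training. The sketch below is a minimal pure-Python version (binning scheme and smoothing constant are illustrative choices); a common rule of thumb treats PSI above 0.25 as a drift alert.

```python
import math


def psi(expected, actual, bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)

    def binned_fracs(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / (hi - lo) * bins), bins - 1) if hi > lo else 0
            counts[max(i, 0)] += 1
        # small additive smoothing so empty bins don't blow up the log
        return [(c + 1e-4) / (len(sample) + bins * 1e-4) for c in counts]

    e, a = binned_fracs(expected), binned_fracs(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [x / 100 for x in range(1000)]       # training-time distribution
shifted = [x / 100 + 5.0 for x in range(1000)]   # serving-time sample, shifted

drift_score = psi(reference, shifted)  # large shift, well above 0.25
```

In production this computation typically lives in a monitoring layer that evaluates each feature on a schedule and pages the owning team when thresholds are crossed.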

When to use / When NOT to use

| Use when | Avoid when |
| --- | --- |
| Multiple data sources need to be consolidated for ML training | Data already lives in a single, clean table ready for direct use |
| Data must be refreshed on a schedule or in real time | Your analysis is a one-off exploration that will not be repeated |
| Quality guarantees (schema, completeness, freshness) are required by downstream models | The overhead of a full pipeline exceeds the value for a quick prototype |
| Transformations need to be versioned, tested, and reproducible | Data volume is trivial and a simple script in a notebook suffices |
| Multiple consumers (training, dashboards, APIs) share the same processed data | The source system already provides a clean, contracted API |

Comparisons

| Criterion | Batch pipeline | Streaming pipeline |
| --- | --- | --- |
| Data freshness | Minutes to hours (schedule-driven) | Sub-second to seconds |
| Complexity | Low — bounded datasets, simple retries | High — late data, windowing, state |
| Cost | Predictable, bursty compute | Continuous compute, often higher baseline |
| Fault tolerance | Re-run the failed batch | Exactly-once or at-least-once semantics required |
| Typical ML use case | Offline training, nightly feature refresh | Online feature store, real-time scoring |

Pros and cons

| Pros | Cons |
| --- | --- |
| Centralizes and standardizes data access across teams | Non-trivial initial investment to build and maintain |
| Enables reproducible, tested data transformations | Pipeline failures propagate to all downstream consumers |
| Embeds quality checks before bad data reaches models | Debugging distributed pipelines is complex |
| Supports versioning and lineage tracking | Streaming adds significant operational overhead |
| Decouples producers from consumers | Requires data governance and ownership discipline |

Code examples

"""
Simple batch data pipeline with pandas.
Reads raw CSV data, validates schema, applies transformations,
and writes a clean Parquet file ready for model training.
"""

import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Check
from pathlib import Path


# --- Schema definition (contract between pipeline and consumers) ---
raw_schema = DataFrameSchema(
{
"user_id": Column(int, nullable=False),
"event_ts": Column(str, nullable=False),
"amount": Column(float, Check(lambda x: x >= 0), nullable=False),
"category": Column(str, nullable=True),
}
)

output_schema = DataFrameSchema(
{
"user_id": Column(int),
"event_date": Column("datetime64[ns]"),
"amount": Column(float),
"category": Column(str),
"log_amount": Column(float),
}
)


def extract(source_path: str) -> pd.DataFrame:
"""Load raw data from CSV."""
df = pd.read_csv(source_path)
print(f"[extract] loaded {len(df):,} rows from {source_path}")
return df


def validate(df: pd.DataFrame, schema: DataFrameSchema) -> pd.DataFrame:
"""Fail fast if data does not match the declared schema."""
validated = schema.validate(df)
print(f"[validate] schema check passed for {len(validated):,} rows")
return validated


def transform(df: pd.DataFrame) -> pd.DataFrame:
"""Apply cleaning and feature engineering."""
df = df.copy()

# Parse timestamp column
df["event_date"] = pd.to_datetime(df["event_ts"])
df.drop(columns=["event_ts"], inplace=True)

# Fill missing categories with a sentinel value
df["category"] = df["category"].fillna("unknown")

# Feature engineering: log-transform amount (handles skew)
import numpy as np
df["log_amount"] = np.log1p(df["amount"])

# Drop duplicates based on user_id + date
df.drop_duplicates(subset=["user_id", "event_date"], inplace=True)

print(f"[transform] produced {len(df):,} clean rows")
return df


def load(df: pd.DataFrame, dest_path: str) -> None:
"""Write clean data to Parquet for efficient downstream reads."""
Path(dest_path).parent.mkdir(parents=True, exist_ok=True)
df.to_parquet(dest_path, index=False)
print(f"[load] wrote {len(df):,} rows to {dest_path}")


def run_pipeline(source: str, destination: str) -> None:
"""Orchestrate the full ETL pipeline."""
raw = extract(source)
validated_raw = validate(raw, raw_schema)
clean = transform(validated_raw)
validated_clean = validate(clean, output_schema)
load(validated_clean, destination)
print("[pipeline] completed successfully")


if __name__ == "__main__":
run_pipeline(
source="data/raw/events.csv",
destination="data/processed/events.parquet",
)
