CI/CD for ML

Definition

Continuous Integration and Continuous Delivery (CI/CD) is a software engineering practice that automates building, testing, and deploying code on every change. When applied to machine learning, the scope expands beyond code: data quality, model performance, and artifact versioning all become first-class citizens of the pipeline. A broken ML CI/CD pipeline can ship a model that silently degrades in production without a single line of application code changing.

Traditional CI/CD validates logic and API contracts. ML CI/CD must additionally validate statistical properties of data (schema, distributions, missing-value rates), model quality thresholds (accuracy, latency, fairness), and reproducibility — the ability to re-train the exact same model from the exact same inputs. Tools like DVC for data versioning and CML (Continuous Machine Learning) for reporting metrics inside pull requests make this practical.

The end goal is a fully automated path from a code or data change to a safely deployed model, with human gates only where they genuinely add value — such as reviewing a model card before a production promotion.

How it works

Data Validation

Before training begins, the pipeline checks that incoming data matches the expected schema and statistical profile. Great Expectations or TensorFlow Data Validation (TFDV) can assert that column types are correct, value ranges are sensible, and there are no unexpected spikes in missing values. Failing this gate early prevents wasted compute on corrupt batches. Any schema drift is surfaced as a failed check in the pull request, which blocks the merge until the issue is understood and either fixed or explicitly accepted. This step is the ML equivalent of type-checking code before running tests.
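As an illustration of what such a statistical-profile gate checks (not a real Great Expectations or TFDV API), the sketch below compares a new batch's column means against a reference profile; the column name and z-score threshold are assumptions for the example:

```python
# Hypothetical drift gate: compare a new batch against a reference profile.
# Column names and the threshold are illustrative, not from any real dataset.
import pandas as pd

Z_THRESHOLD = 3.0  # flag a column if its mean shifts more than 3 reference std devs


def drift_check(reference: pd.DataFrame, batch: pd.DataFrame, columns: list) -> list:
    """Return the names of columns whose batch mean drifted beyond the threshold."""
    drifted = []
    for col in columns:
        ref_mean, ref_std = reference[col].mean(), reference[col].std()
        if ref_std == 0:
            # Constant reference column: any change at all counts as drift
            if batch[col].mean() != ref_mean:
                drifted.append(col)
            continue
        z = abs(batch[col].mean() - ref_mean) / ref_std
        if z > Z_THRESHOLD:
            drifted.append(col)
    return drifted
```

In a pipeline, a non-empty return value would fail the CI job, surfacing the drifted columns in the pull request just like a failing unit test.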

Model Training

Training is executed as a reproducible, parameterized job — ideally containerized so the exact environment (CUDA version, library pinning) is captured. A good CI/CD system passes hyperparameters through configuration files tracked in version control, not hard-coded into scripts. Tools like DVC track which dataset version and which config produced which model artifact, so any trained model can be traced back to its inputs. Training runs are recorded in an experiment tracker (MLflow, W&B) so the comparison to the previous champion model is automatic.
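A minimal sketch of this traceability idea, using a hypothetical config layout and a stand-in for real training (the seeded sampling is a placeholder, not an actual model):

```python
# Illustrative sketch: version-controlled config + seeded run, with a fingerprint
# that ties a trained artifact back to its exact inputs. Keys are assumptions.
import hashlib
import json
import random


def run_fingerprint(config: dict, data_bytes: bytes) -> str:
    """Hash of config + data, so any artifact can be traced to its inputs."""
    payload = json.dumps(config, sort_keys=True).encode() + data_bytes
    return hashlib.sha256(payload).hexdigest()[:12]


def train(config: dict) -> list:
    """Stand-in for a real training job: deterministic given the config's seed."""
    random.seed(config["seed"])  # same config -> same "model"
    return [random.random() for _ in range(config["n_weights"])]
```

The key property is that `train(config)` called twice with the same config (and data) yields the same artifact, and the fingerprint changes whenever either input changes; tools like DVC implement the same idea at the file level.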

Model Evaluation

After training, automated evaluation scripts compute the target metrics on a held-out test set and compare them against a defined threshold or against the current production model. CML (from Iterative.ai) can post a Markdown report with metrics tables and plots directly on the GitHub or GitLab pull request, so reviewers see performance regressions without leaving their code review workflow. Evaluation should also cover slice-based fairness metrics for regulated domains. The quality gate passes only if the new model meets or exceeds the thresholds.
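A toy version of such a quality gate might look like the following; the metric names and thresholds are illustrative, and the Markdown table mirrors the kind of report CML posts on a pull request:

```python
# Illustrative quality gate: the candidate must meet absolute thresholds AND
# not regress against the current champion. Metrics/thresholds are assumptions.
THRESHOLDS = {"accuracy": 0.90, "latency_ms": 50.0}


def passes_gate(candidate: dict, champion: dict = None) -> bool:
    if candidate["accuracy"] < THRESHOLDS["accuracy"]:
        return False
    if candidate["latency_ms"] > THRESHOLDS["latency_ms"]:
        return False
    # Champion comparison: never promote a model worse than the current one
    if champion is not None and candidate["accuracy"] < champion["accuracy"]:
        return False
    return True


def report_markdown(candidate: dict) -> str:
    """Markdown metrics table of the kind CML posts to a pull request."""
    lines = ["| metric | value |", "| --- | --- |"]
    lines += [f"| {k} | {v} |" for k, v in candidate.items()]
    return "\n".join(lines)
```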

Deployment and Monitoring

On passing the quality gate, the model artifact is registered in a model registry and deployed to a staging environment where smoke tests run against live (or representative) traffic. Promotion to production can be manual (a click in the registry UI) or fully automated. Once in production, a monitoring layer tracks data drift, prediction drift, and business KPIs, and can trigger a re-training run, completing the feedback loop back to the start of the pipeline.
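One common statistic for the prediction-drift check is the Population Stability Index (PSI). The sketch below is a self-contained, illustrative implementation; the bin count is an assumption, and 0.2 is a commonly cited rule-of-thumb cut-off for significant drift rather than a universal standard:

```python
# Illustrative monitor: PSI between the training-time prediction distribution
# and live predictions. Bin count and the 0.2 threshold are assumptions.
import math


def psi(expected: list, actual: list, bins: int = 10) -> float:
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # Smooth empty bins so log() never sees zero
        return [(c or 0.5) / len(values) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(expected: list, actual: list, threshold: float = 0.2) -> bool:
    """True when drift is large enough to trigger the re-training feedback loop."""
    return psi(expected, actual) > threshold
```

In practice this check runs on a schedule over a sliding window of production predictions, and a `True` result kicks off the retraining trigger described in the Comparisons table below.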

When to use / When NOT to use

| Use when | Avoid when |
| --- | --- |
| Multiple data scientists commit to shared model code | Working solo on a one-off notebook experiment |
| Models are retrained regularly on fresh data | The model is static: trained once, never updated |
| Production failures are costly (fraud, health, safety) | Prototype stage where speed of iteration outweighs correctness |
| Team needs reproducibility and audit trails | Infrastructure / DevOps maturity is very low |
| Regulatory compliance requires documented model versioning | Dataset is tiny and fits in a single notebook end-to-end |

Comparisons

| Criterion | Traditional CI/CD | ML CI/CD |
| --- | --- | --- |
| Primary artifact | Binary / Docker image | Model artifact + data version |
| Test types | Unit, integration, E2E | Unit + data quality + model quality + fairness |
| Trigger | Code push | Code push OR new data OR scheduled retraining |
| Rollback | Redeploy previous image | Redeploy previous model version from registry |
| Observability | Application logs, traces | Data drift, prediction drift, business metrics |

Pros and cons

| Pros | Cons |
| --- | --- |
| Catches regressions before they reach production | Higher setup cost than traditional CI/CD |
| Full audit trail of data + code + model versions | Data validation requires domain expertise to define correctly |
| Enables safe, frequent model updates | Training jobs can be slow, making CI feedback loops longer |
| Reduces manual handoffs between data science and ops | Requires alignment between data, ML, and platform teams |
| Metrics in PRs improve code review quality | Misconfigured thresholds can block valid improvements |

Code examples

```yaml
# .github/workflows/ml-pipeline.yml
# GitHub Actions workflow for a full ML CI/CD pipeline with CML reporting

name: ML Pipeline

on:
  push:
    branches: [main, "feat/**"]
  pull_request:
    branches: [main]

jobs:
  ml-pipeline:
    runs-on: ubuntu-latest

    steps:
      # 1. Check out the repository with full git history (needed for DVC)
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      # 2. Set up Python and install dependencies (requirements.txt must include dvc)
      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install -r requirements.txt

      # 3. Pull data and model artifacts from the DVC remote
      - name: Pull DVC artifacts
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull

      # 4. Validate data quality before training
      - name: Validate data
        run: python src/validate_data.py --data data/train.csv

      # 5. Train the model and save metrics to metrics.json
      - name: Train model
        run: python src/train.py --config configs/train.yaml

      # 6. Evaluate the model and write a report for CML
      - name: Evaluate model
        run: python src/evaluate.py --output reports/metrics.md

      # 7. Install CML, then post the report as a comment on the pull request
      - name: Set up CML
        uses: iterative/setup-cml@v2
        with:
          version: latest

      - name: Publish CML report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Append a section heading, then post the report to the PR
          echo "## Model evaluation report" >> reports/metrics.md
          cml comment create reports/metrics.md

      # 8. Push updated DVC artifacts (only on main)
      - name: Push DVC artifacts
        if: github.ref == 'refs/heads/main'
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc push
```

```python
# src/validate_data.py
# Simple data validation gate using pandas — replace with Great Expectations for production

import argparse
import sys

import pandas as pd

EXPECTED_COLUMNS = {"feature_a", "feature_b", "label"}
MAX_MISSING_RATE = 0.05  # 5% threshold


def validate(path: str) -> None:
    df = pd.read_csv(path)

    # Check that all required columns are present
    missing_cols = EXPECTED_COLUMNS - set(df.columns)
    if missing_cols:
        print(f"FAIL: Missing columns: {missing_cols}")
        sys.exit(1)

    # Check missing-value rates
    for col in EXPECTED_COLUMNS:
        rate = df[col].isna().mean()
        if rate > MAX_MISSING_RATE:
            print(f"FAIL: Column '{col}' has {rate:.1%} missing values "
                  f"(threshold: {MAX_MISSING_RATE:.0%})")
            sys.exit(1)

    print("Data validation passed.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", required=True)
    args = parser.parse_args()
    validate(args.data)
```
