Experiment tracking

Definition

Experiment tracking is the practice of systematically recording every detail of an ML training run so that results can be reproduced, compared, and audited. Without it, teams lose track of which hyperparameters produced which results, waste compute rediscovering configurations, and cannot demonstrate compliance when models influence high-stakes decisions.

A complete experiment record captures four categories of information. Parameters are the inputs to training: learning rate, batch size, model architecture choices, feature sets. Metrics are the outputs: loss curves, accuracy, F1, AUC, latency. Artifacts are the produced files: trained model weights, preprocessed datasets, evaluation plots, confusion matrices. Metadata is the context: code version (git commit), environment (library versions, hardware), dataset version, wall-clock time, and the name of the person who ran it.
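The four categories can be pictured as a single structured record per run. This is an illustrative sketch in plain Python, not any specific tool's schema; all field names and values are made up:

```python
# One run's complete experiment record, sketched as a plain dict.
run_record = {
    # Inputs to training
    "parameters": {"learning_rate": 3e-4, "batch_size": 64, "architecture": "resnet18"},
    # Outputs of training and evaluation
    "metrics": {"train_loss": 0.21, "val_accuracy": 0.93, "latency_ms": 12.4},
    # Files the run produced
    "artifacts": ["model.pt", "confusion_matrix.png", "eval_report.json"],
    # Context needed to reproduce and audit the run
    "metadata": {
        "git_commit": "abc1234",
        "dataset_version": "v1.3",
        "hardware": "1x A100",
        "run_by": "alice",
    },
}

assert set(run_record) == {"parameters", "metrics", "artifacts", "metadata"}
```

Any tracking backend is, at its core, a queryable store of records shaped roughly like this one.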

Model versioning is the natural extension: once you track experiments, you can promote the best run's artifact to a model registry, tag it with a semantic version, and tie every serving deployment back to a specific experiment. This closes the loop between experimentation and production, making rollbacks straightforward and audits possible.
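The registry idea can be sketched as a toy in-memory class (real registries such as MLflow's persist this state server-side; the class, method names, and URIs below are invented for illustration):

```python
class ToyModelRegistry:
    """Minimal in-memory sketch of a model registry. Each version points
    back to the experiment run that produced it, closing the audit loop."""

    def __init__(self):
        self.versions = []  # version number == list index + 1

    def register(self, run_id, artifact_uri):
        """Promote a run's artifact to a new registered version in Staging."""
        self.versions.append({
            "version": len(self.versions) + 1,
            "run_id": run_id,
            "artifact_uri": artifact_uri,
            "stage": "Staging",
        })
        return self.versions[-1]["version"]

    def promote(self, version, stage):
        self.versions[version - 1]["stage"] = stage

    def production_model(self):
        prod = [v for v in self.versions if v["stage"] == "Production"]
        return prod[-1] if prod else None


reg = ToyModelRegistry()
v1 = reg.register("run-001", "s3://models/run-001/model")
reg.promote(v1, "Production")
# Every serving deployment of version 1 can now be traced back to run-001,
# and rolling back means promoting an earlier version instead.
```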

How it works

Instrumentation

The training script is instrumented with a few lines of SDK code that open a "run" context and log data to a central server during training. Most frameworks (PyTorch Lightning, Hugging Face Trainer, Keras) have native integrations that auto-log common metrics with zero additional code.
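The "run context" pattern those SDKs implement can be shown with a toy stand-in (this is not mlflow.start_run or wandb.init, just a sketch of the shape they share):

```python
import contextlib
import time

@contextlib.contextmanager
def start_run(name):
    """Toy run context: collect params and metrics during training,
    stamp wall-clock time on exit. Real SDKs also ship the data to a server."""
    run = {"name": name, "params": {}, "metrics": [], "start": time.time()}
    try:
        yield run
    finally:
        run["wall_clock_s"] = time.time() - run["start"]

with start_run("demo") as run:
    run["params"]["lr"] = 1e-3            # logged once, up front
    for step in range(3):                  # logged repeatedly during training
        run["metrics"].append({"step": step, "loss": 1.0 / (step + 1)})
```

Framework integrations do exactly this on your behalf, hooking the metric-append calls into the training loop's callbacks.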

Centralized storage

Logged data is persisted to a backend store — a local file system, a managed cloud database, or a SaaS platform. Parameters and metrics are stored as structured records; artifacts are pushed to object storage (S3, GCS, Azure Blob). The backend is queried by the UI and the SDK.
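The simplest backend, a local file store, can be sketched in a few lines (function name and layout are invented; real tools use the same split of structured records on one side and bulky artifacts in object storage on the other):

```python
import json
import pathlib

def persist_run(record: dict, store_dir: str = "./toy-runs") -> pathlib.Path:
    """Toy file-system backend: write one JSON file per run.
    Structured params/metrics live here; large artifacts would be
    uploaded separately to object storage and referenced by URI."""
    out = pathlib.Path(store_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{record['run_id']}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

persist_run({"run_id": "run-001", "params": {"C": 0.1}, "metrics": {"acc": 0.91}})
```

Pointing the SDK at a managed database or SaaS backend swaps out this persistence layer without changing the training script.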

Comparison and analysis

The tracking UI lets you filter, sort, and compare runs across all four dimensions. You can plot metric curves for many runs on the same chart, group by parameter values, and export results to a dataframe for custom analysis. This makes it easy to identify the Pareto-optimal runs (best accuracy for given latency budget, for example).
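The Pareto idea is concrete enough to sketch: given run records with an accuracy to maximize and a latency to minimize, keep only the runs no other run beats on both axes (data and field names here are made up):

```python
def pareto_front(runs):
    """Keep runs not dominated by any other run: a run is dominated if some
    other run is at least as accurate AND at least as fast, and strictly
    better on one of the two."""
    front = []
    for r in runs:
        dominated = any(
            o["accuracy"] >= r["accuracy"]
            and o["latency_ms"] <= r["latency_ms"]
            and (o["accuracy"] > r["accuracy"] or o["latency_ms"] < r["latency_ms"])
            for o in runs
        )
        if not dominated:
            front.append(r)
    return front

runs = [
    {"run": "a", "accuracy": 0.90, "latency_ms": 10},
    {"run": "b", "accuracy": 0.93, "latency_ms": 25},
    {"run": "c", "accuracy": 0.89, "latency_ms": 30},  # dominated by both a and b
]
print([r["run"] for r in pareto_front(runs)])  # ['a', 'b']
```

Tracking UIs and their dataframe exports make exactly this kind of filtering a one-liner instead of a spreadsheet exercise.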

Model promotion

The best run's artifact is registered in a model registry with a version number and transition state (Staging → Production → Archived). Downstream CI/CD systems query the registry to know which model version to deploy, creating a clean handoff between experimentation and serving.
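The CI/CD side of that handoff reduces to one lookup: given registry state, which artifact is currently in Production? A hedged sketch (the function, data shape, and model name are hypothetical, not a real registry API):

```python
def resolve_deploy_target(registry_state, model_name):
    """Toy version of what a CI/CD job does: query the registry for the
    latest Production version of a model and return its artifact URI."""
    prod = [v for v in registry_state[model_name] if v["stage"] == "Production"]
    if not prod:
        raise LookupError(f"no Production version of {model_name}")
    latest = max(prod, key=lambda v: v["version"])
    return latest["artifact_uri"]

state = {"churn-model": [
    {"version": 1, "stage": "Archived",   "artifact_uri": "s3://m/v1"},
    {"version": 2, "stage": "Production", "artifact_uri": "s3://m/v2"},
]}
print(resolve_deploy_target(state, "churn-model"))  # s3://m/v2
```

Because the deploy pipeline asks the registry rather than a hard-coded path, promoting or rolling back a version never requires touching the pipeline itself.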

When to use / When NOT to use

| Use when | Avoid when |
| --- | --- |
| You run more than a handful of experiments and need to compare results | You are running a single, one-shot training job and will never revisit it |
| Reproducibility is required (regulated industry, research publication) | The experiment is trivial (e.g., a two-parameter grid search with obvious outcomes) |
| Multiple team members share experiment results | You work alone and a personal spreadsheet of notes is sufficient |
| You want to promote model versions to production systematically | The model is never deployed and results do not need to be audited |

Comparisons

| Criterion | MLflow | Weights & Biases (W&B) |
| --- | --- | --- |
| Ease of setup | Self-hostable with `mlflow ui`; pip install only | SaaS account required; CLI install; free tier available |
| UI quality | Functional but spartan; good for tabular comparison | Polished, interactive; excellent for media and curve overlays |
| Collaboration | Shared server required; no built-in access control in OSS | Team workspaces, role-based access, and sharing built in |
| Pricing | Free and open-source; managed offering via Databricks | Free tier for individuals; paid plans for large teams |
| Integrations | Deep integration with Databricks, Spark, sklearn, PyTorch | Broad integrations; strong in research and academia |

Code examples

```python
# generic_tracking.py
# Framework-agnostic experiment tracking pattern.
# Works with any ML library; swap out the model training code as needed.
# pip install mlflow scikit-learn

import json
import os
import tempfile

import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# --- Configuration ---
EXPERIMENT_NAME = "binary-classification-demo"
PARAMS = {
    "C": 0.1,          # Regularization strength
    "max_iter": 1000,
    "solver": "lbfgs",
    "random_state": 42,
}

# --- Data preparation ---
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# --- Tracking boilerplate (MLflow shown; the same pattern maps to wandb.init() for W&B) ---
mlflow.set_experiment(EXPERIMENT_NAME)

with mlflow.start_run(run_name=f"logreg-C{PARAMS['C']}") as run:
    # 1. Log all hyperparameters at the start
    mlflow.log_params(PARAMS)

    # 2. Train the model
    model = LogisticRegression(**PARAMS)
    model.fit(X_train, y_train)

    # 3. Evaluate and log metrics
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),
        "n_train": len(X_train),
        "n_test": len(X_test),
    }
    mlflow.log_metrics(metrics)

    # 4. Log the model artifact
    mlflow.sklearn.log_model(model, artifact_path="model")

    # 5. Log any extra files (e.g., feature importance, plots)
    with tempfile.TemporaryDirectory() as tmp:
        meta_path = os.path.join(tmp, "run_metadata.json")
        with open(meta_path, "w") as f:
            json.dump({"git_commit": "abc1234", "dataset_version": "v1.3"}, f)
        mlflow.log_artifact(meta_path)

print(f"Run ID  : {run.info.run_id}")
print(f"Accuracy: {metrics['accuracy']:.4f} | ROC-AUC: {metrics['roc_auc']:.4f}")
```
