Weights & Biases (W&B)
Definition
Weights & Biases (commonly abbreviated W&B or wandb) is a cloud-native MLOps platform that provides experiment tracking, dataset and model versioning, hyperparameter optimization, and interactive reporting in a single integrated product. Founded in 2017 and adopted widely in both academic research and industry, W&B is particularly popular among teams training deep learning models that produce rich media outputs — images, audio, video, point clouds — that benefit from visual inspection during training.
W&B's core value proposition is that it requires almost no infrastructure to get started: you sign up for a free account, install the wandb Python package, add wandb.init() to your script, and everything is logged automatically to W&B's cloud. The platform is organized into projects (collections of related runs), runs (individual training executions), artifacts (versioned datasets and model files), sweeps (automated hyperparameter search), and reports (shareable narrative documents embedding live charts).
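In practice, the whole loop can be that small. The sketch below is illustrative only: the project name, config keys, and the fake loss value are placeholders, not part of any real workflow.

```python
import wandb

# Start a run; config values become searchable, groupable dimensions in the UI
run = wandb.init(project="my-first-project", config={"lr": 1e-3, "epochs": 3})

for epoch in range(run.config.epochs):
    loss = 1.0 / (epoch + 1)  # stand-in for a real training loss
    wandb.log({"epoch": epoch, "loss": loss})

run.finish()  # flush buffered data and mark the run complete
```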
Unlike self-hosted solutions such as MLflow, W&B manages all backend infrastructure. This eliminates operational burden but means data leaves your premises — a relevant consideration for regulated industries. W&B offers private cloud and on-premises deployment options for enterprise customers who need data residency guarantees, though these require a paid plan.
How it works
Initialization and auto-logging
Calling wandb.init(project="...", config={...}) starts a run, sends the config to W&B, and returns a run object. Many popular frameworks (PyTorch Lightning, Hugging Face Trainer, Keras, XGBoost, scikit-learn) offer W&B callbacks or integrations that auto-log gradients, learning rate schedules, and evaluation metrics without additional code. Under the hood, a background thread batches and compresses log data before sending it over HTTPS, minimizing training overhead.
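As one concrete example, the Keras callback wires auto-logging into `model.fit`. This is a hedged sketch: the synthetic data and tiny model are placeholders, and the exact callback import path (`wandb.keras` here; newer SDK versions also ship loggers under `wandb.integration.keras`) can vary by version.

```python
import numpy as np
import wandb
from wandb.keras import WandbCallback  # classic import path; may differ in newer SDKs
from tensorflow import keras

run = wandb.init(project="keras-autolog-demo")  # illustrative project name

# Tiny synthetic dataset so the sketch is self-contained
X = np.random.rand(256, 8).astype("float32")
y = (X.sum(axis=1) > 4).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# The callback streams per-epoch loss and metric curves to the dashboard
model.fit(X, y, epochs=5, validation_split=0.2,
          callbacks=[WandbCallback(save_model=False)])
run.finish()
```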
Real-time dashboards
The W&B UI renders metric curves, system utilization (GPU/CPU/memory), and media as the run progresses. Multiple runs can be overlaid on the same chart with automatic color coding. Runs can be filtered and grouped by any config dimension (e.g., group by learning rate to see its effect across all experiments at once), enabling rapid visual diagnosis.
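The same filtering is also available programmatically through the public API. A hedged sketch, in which the entity/project path and the config key are placeholders:

```python
import wandb

api = wandb.Api()
# Filters use MongoDB-style queries matched against config and summary fields
runs = api.runs("my-team/digits-classification",
                filters={"config.n_estimators": 150})
for r in runs:
    print(r.name, r.summary.get("accuracy"))
```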
Sweeps
A sweep is defined by a YAML file or Python dict specifying the search space, search strategy (grid, random, or Bayesian), and stopping criteria (e.g., early termination of underperforming runs). The W&B sweep controller coordinates multiple agents running in parallel, each pulling hyperparameter combinations from the controller and logging results back. Bayesian search adapts its proposals to previously observed results, so it typically reaches strong configurations in fewer trials than exhaustive grid search.
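A minimal sweep might look like the sketch below. The metric name, parameter ranges, and the placeholder training function are assumptions made for illustration.

```python
import wandb

sweep_config = {
    "method": "bayes",  # or "grid" / "random"
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values",
                          "min": 1e-5, "max": 1e-1},
        "batch_size": {"values": [32, 64, 128]},
    },
    # Hyperband-style early termination of underperforming runs
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}

def train():
    run = wandb.init()              # the agent injects the chosen hyperparameters
    cfg = run.config
    # ... real training with cfg.learning_rate and cfg.batch_size goes here ...
    wandb.log({"val_loss": 0.42})   # placeholder metric value

sweep_id = wandb.sweep(sweep_config, project="digits-classification")
wandb.agent(sweep_id, function=train, count=20)  # run 20 trials in this process
```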
Artifacts
W&B Artifacts version datasets, model checkpoints, and evaluation outputs as content-addressed objects. An artifact is linked to the run that produced it and to the runs that consumed it, creating a data lineage graph. You can download a specific artifact version with two lines of Python, making dataset and model reproducibility as simple as specifying a version string.
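Those "two lines" look roughly like the sketch below, assuming the digits-rf-model artifact logged by the code example later in this section; the :latest alias can be replaced with a pinned version string such as :v3.

```python
import wandb

run = wandb.init(project="digits-classification")
artifact = run.use_artifact("digits-rf-model:latest", type="model")
model_dir = artifact.download()  # local directory containing model.joblib
```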
Reports
Reports are interactive documents that embed live W&B charts, run comparisons, and markdown narrative. They are the primary collaboration surface: a researcher can link a report in a Slack message or GitHub PR to share reproducible experimental evidence without exporting static images.
When to use / When NOT to use
| Use when | Avoid when |
|---|---|
| You train deep learning models and need rich media logging (images, audio, embeddings) | Data cannot leave your premises and you cannot afford the enterprise on-premises plan |
| Team collaboration, sharing results, and narrative reports are important | You need a fully open-source, self-hosted solution with no SaaS dependency |
| You want built-in hyperparameter optimization without additional tooling | Your experiments are simple and the overhead of a SaaS account is unwarranted |
| Your team works in research or academia and benefits from free-tier access | Your team has outgrown the free tier and the paid plans do not fit your budget |
Comparisons
| Criterion | W&B | MLflow |
|---|---|---|
| Ease of setup | Free SaaS account; no infra; wandb login + two lines of code | Self-hostable locally; no account needed; mlflow ui to start |
| UI quality | Polished, interactive; built for visual and media-heavy workloads | Clean and functional; better for tabular metric comparison |
| Collaboration | Native team workspaces, reports, sharing links, Slack integration | Requires shared server; no built-in collaboration features in OSS |
| Pricing | Free for individuals; paid for larger teams; enterprise for on-prem | Free and open-source; Databricks Managed MLflow costs extra |
| Hyperparameter optimization | Sweeps built in with Bayesian/grid/random + early stopping | Requires external tools (Optuna, Ray Tune) |
Code examples
```python
# wandb_tracking_example.py
# W&B experiment tracking: logs config, metrics, images, and registers a model artifact.
# pip install wandb scikit-learn matplotlib Pillow
import wandb
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
)
import os
import tempfile
import joblib
# ── 1. Initialize the W&B run ─────────────────────────────────────────────────
run = wandb.init(
project="digits-classification",
name="random-forest-v1",
config={ # All hyperparameters go here
"n_estimators": 150,
"max_depth": 12,
"min_samples_split": 4,
"random_state": 7,
"dataset": "sklearn-digits",
},
)
cfg = wandb.config # Access config values through this proxy
# ── 2. Data ───────────────────────────────────────────────────────────────────
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=cfg.random_state
)
# ── 3. Train ──────────────────────────────────────────────────────────────────
clf = RandomForestClassifier(
n_estimators=cfg.n_estimators,
max_depth=cfg.max_depth,
min_samples_split=cfg.min_samples_split,
random_state=cfg.random_state,
)
clf.fit(X_train, y_train)
# ── 4. Evaluate and log metrics ───────────────────────────────────────────────
y_pred = clf.predict(X_test)
metrics = {
"accuracy": accuracy_score(y_test, y_pred),
"f1_macro": f1_score(y_test, y_pred, average="macro"),
"n_train": len(X_train),
"n_test": len(X_test),
}
wandb.log(metrics)
# ── 5. Log a confusion matrix image ──────────────────────────────────────────
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(8, 8))
ConfusionMatrixDisplay(cm).plot(ax=ax)
ax.set_title("Confusion Matrix – digits RF")
wandb.log({"confusion_matrix": wandb.Image(fig)})
plt.close(fig)
# ── 6. Save model as a versioned W&B Artifact ─────────────────────────────────
with tempfile.TemporaryDirectory() as tmp:
model_path = os.path.join(tmp, "model.joblib")
joblib.dump(clf, model_path)
artifact = wandb.Artifact(
name="digits-rf-model",
type="model",
description="Random Forest trained on sklearn digits dataset",
metadata=dict(metrics),
)
artifact.add_file(model_path)
run.log_artifact(artifact)
# ── 7. Summarize and finish the run ───────────────────────────────────────────
print(f"Accuracy: {metrics['accuracy']:.4f} | F1 macro: {metrics['f1_macro']:.4f}")
print(f"View run at: {run.url}")  # read the URL before the run is closed
run.finish()
```
Practical resources
- W&B Official Documentation — Full reference covering the Python SDK, integrations, sweeps, artifacts, and reports.
- W&B Quickstart — Log your first W&B run in under five minutes with a minimal example.
- W&B Sweeps Documentation — Comprehensive guide to configuring and running distributed hyperparameter searches.
- W&B Fully Connected Blog — Practitioner blog with in-depth tutorials, benchmark reports, and ML engineering articles.
- Hugging Face + W&B Integration — Guide to auto-logging all Hugging Face Trainer metrics with a single report_to="wandb" argument.