MLflow
Definition
MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. Originally released by Databricks in 2018, it has become one of the most widely adopted MLOps tools thanks to its simplicity, framework agnosticism, and the fact that it can run entirely on-premises without any cloud dependency. A single pip install mlflow and a two-line code change are enough to start tracking experiments.
MLflow organizes functionality into four tightly integrated components. Tracking records parameters, metrics, and artifacts for every training run. Projects package ML code into reproducible, runnable units defined by an MLproject file. Models provide a standard format for packaging models that can be served by any supported deployment target. Model Registry provides a centralized model store with lifecycle management (Staging, Production, Archived states) and version history. Together these components cover the journey from raw experiment to production deployment.
MLflow can be run locally (SQLite backend, local filesystem artifacts), on a self-managed server (PostgreSQL + S3), or as a fully managed service via Databricks Managed MLflow. The open-source core is Apache 2.0 licensed, making it suitable for regulated industries where data cannot leave on-premises infrastructure.
How it works
Tracking server
When you call mlflow.start_run(), the client opens a run on the tracking server and begins buffering logs. Parameters (log_param, log_params) and metrics (log_metric, log_metrics) are written to the backend store (SQLite or PostgreSQL). Artifacts are uploaded to the artifact store (local filesystem, S3, GCS, Azure Blob, HDFS). The server exposes a REST API consumed by the client SDK and the web UI.
MLflow Projects
A project is a directory (or git repo) with an MLproject YAML file that declares the entry points, parameters, and conda/pip environment. Running mlflow run . -P lr=0.01 resolves the environment, sets parameters, and launches the entry point — producing a tracked run automatically. This makes experiments reproducible by anyone with access to the repo.
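A hypothetical MLproject file matching the mlflow run . -P lr=0.01 command above might look like this (train.py and python_env.yaml are assumed to exist in the repository):

```yaml
name: breast-cancer-gbt
python_env: python_env.yaml        # or conda_env: conda.yaml
entry_points:
  main:
    parameters:
      lr: {type: float, default: 0.01}
    command: "python train.py --lr {lr}"
```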
MLflow Models
A model saved with mlflow.<flavor>.log_model() is stored in the MLmodel format: a directory containing the serialized model, an MLmodel YAML descriptor, and a conda.yaml / requirements.txt environment specification. The pyfunc flavor provides a uniform model.predict(data) interface regardless of the underlying framework, enabling the same model to be loaded by different serving backends.
Model Registry
The registry stores named model versions with transition states. Automated CI/CD systems query the registry for the latest Production version to deploy. Human approvers or automated validation jobs transition versions between states. Every version links back to its source run, preserving full provenance.
When to use / When NOT to use
| Use when | Avoid when |
|---|---|
| You need a fully self-hosted, open-source MLOps platform | Your team needs rich collaborative features (shared reports, Slack notifications) out of the box |
| Data cannot leave your infrastructure (regulated industries) | You prefer a SaaS product with zero infrastructure to manage |
| You already use Databricks and want native integration | Your workflow is notebook-only with no production deployment planned |
| Framework agnosticism is important (sklearn, XGBoost, PyTorch, TF, etc.) | You need advanced sweep/hyperparameter optimization built in |
| Cost control is critical; open-source licensing is required | Your team lacks the engineering bandwidth to manage a server and artifact store |
Comparisons
| Criterion | MLflow | Weights & Biases (W&B) |
|---|---|---|
| Ease of setup | Self-hostable with one command; no account needed | SaaS; free account required; no infrastructure to manage |
| UI quality | Clean but basic; focused on tabular metrics and run comparison | Highly polished; excellent media logging, custom charts, reports |
| Collaboration | Shared server required; no built-in RBAC in OSS | Built-in team workspaces, sharing links, and role-based access |
| Pricing | Free and open-source; Databricks Managed MLflow costs extra | Free for individuals; paid plans for teams |
| Hyperparameter optimization | Integrates with Optuna, Ray Tune externally | Sweeps built in with Bayesian/grid/random search |
Code examples
# mlflow_full_example.py
# Full MLflow tracking example: logs params, metrics, a custom artifact,
# and registers the model in the Model Registry.
# pip install mlflow scikit-learn matplotlib
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    accuracy_score, roc_auc_score, classification_report
)
import os, tempfile, json

# ── 1. Data ──────────────────────────────────────────────────────────────────
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# ── 2. Hyperparameters ────────────────────────────────────────────────────────
params = {
    "n_estimators": 200,
    "learning_rate": 0.05,
    "max_depth": 4,
    "subsample": 0.8,
    "random_state": 0,
}

# ── 3. MLflow run ─────────────────────────────────────────────────────────────
mlflow.set_experiment("breast-cancer-gbt")
with mlflow.start_run(run_name="gbt-tuned") as run:
    # Log hyperparameters
    mlflow.log_params(params)

    # Train
    clf = GradientBoostingClassifier(**params)
    clf.fit(X_train, y_train)

    # Evaluate
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]
    cv_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="roc_auc")
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),
        "cv_roc_auc_mean": cv_scores.mean(),
        "cv_roc_auc_std": cv_scores.std(),
    }
    mlflow.log_metrics(metrics)

    # Log a feature importance plot as an artifact
    with tempfile.TemporaryDirectory() as tmp:
        fig, ax = plt.subplots(figsize=(8, 5))
        feat_imp = clf.feature_importances_
        top_idx = np.argsort(feat_imp)[-10:]
        ax.barh(range(10), feat_imp[top_idx])
        ax.set_title("Top 10 feature importances")
        fig.tight_layout()
        plot_path = os.path.join(tmp, "feature_importance.png")
        fig.savefig(plot_path)
        plt.close(fig)
        mlflow.log_artifact(plot_path, artifact_path="plots")

        # Log classification report as JSON
        report = classification_report(y_test, y_pred, output_dict=True)
        report_path = os.path.join(tmp, "classification_report.json")
        with open(report_path, "w") as f:
            json.dump(report, f, indent=2)
        mlflow.log_artifact(report_path, artifact_path="evaluation")

    # Log and register the model
    mlflow.sklearn.log_model(
        clf,
        artifact_path="model",
        registered_model_name="breast-cancer-gbt",  # creates registry entry
    )

    print(f"Run ID : {run.info.run_id}")
    for k, v in metrics.items():
        print(f"  {k}: {v:.4f}")

# ── 4. Load a registered model (simulates downstream serving) ─────────────────
# model_uri = "models:/breast-cancer-gbt/1"
# loaded = mlflow.sklearn.load_model(model_uri)
# print(loaded.predict(X_test[:3]))
Practical resources
- MLflow Official Documentation — Complete reference covering all four components, REST API, and deployment targets.
- MLflow GitHub Repository — Source code, issue tracker, and examples; useful for understanding internals and contributing.
- Databricks – MLflow Tutorials — Production-grade MLflow usage on Databricks with Unity Catalog integration.
- Towards Data Science – MLflow in Production — Community walkthrough of deploying a self-hosted MLflow server with Docker Compose, PostgreSQL, and MinIO.