Prometheus

Definition

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud and now a graduated CNCF project. It stores all data as time-series: streams of timestamped floating-point values identified by a metric name and a set of key-value labels. This model is a natural fit for operational data — CPU usage, request counts, error rates — and for ML-specific signals such as prediction latency, throughput, and feature value distributions over time.

The defining architectural choice in Prometheus is its pull-based scraping model. Rather than requiring instrumented applications to push metrics to a central collector, Prometheus periodically scrapes HTTP endpoints (by default /metrics) exposed by targets. This inversion of control makes service discovery, access control, and debugging significantly simpler: you can curl any target's metrics endpoint directly to see what Prometheus will collect. Targets are discovered via static configuration or dynamic service discovery (Kubernetes, Consul, EC2, etc.).
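A minimal scrape configuration illustrates both discovery styles; the job names and target address below are placeholders, not part of any standard setup:

```yaml
# prometheus.yml -- illustrative scrape configuration
global:
  scrape_interval: 15s          # how often to scrape each target

scrape_configs:
  - job_name: "ml-serving"      # static target list
    static_configs:
      - targets: ["localhost:8000"]

  - job_name: "kubernetes-pods" # dynamic discovery of pods
    kubernetes_sd_configs:
      - role: pod
```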

Prometheus is not a long-term storage solution by design. Its local time-series database (TSDB) is optimized for fast ingest and query of recent data, with a default retention of 15 days. For longer retention, Prometheus can remote-write samples to systems like Thanos, Cortex, or VictoriaMetrics. In ML contexts, Prometheus is the collection and alerting layer; Grafana provides the visualization and dashboarding layer on top.

How it works

Target instrumentation

Applications expose metrics via an HTTP /metrics endpoint in the Prometheus exposition format — a plain-text format of metric_name{label="value"} numeric_value lines, with an optional trailing timestamp. In Python, the prometheus_client library provides Counter, Gauge, Histogram, and Summary types that handle the exposition format automatically. An ML serving process typically exposes counters for total prediction requests, histograms for request latency, and gauges for currently loaded model versions and resource utilization.
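For illustration, scraping such an endpoint returns plain text along these lines (the metric names match the serving example later in this article; the sample values are made up):

```
# HELP ml_predictions_total Total number of prediction requests
# TYPE ml_predictions_total counter
ml_predictions_total{model_name="iris-classifier",model_version="1.0.0",status="success"} 1027.0
# HELP ml_active_model_version Currently active model version (encoded as numeric)
# TYPE ml_active_model_version gauge
ml_active_model_version{model_name="iris-classifier"} 1.0
```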

Scrape and storage

Prometheus evaluates its configuration file to determine which targets to scrape and at what interval (default: 15 seconds). On each scrape it fetches the /metrics endpoint, parses the exposition format, and writes the samples to its local TSDB in compressed chunks. The TSDB uses a write-ahead log (WAL) for durability and compacts data into blocks over time. Label cardinality is the main performance lever: each unique combination of label values creates a separate time series, so unbounded labels (e.g., user IDs) must be avoided.
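The cardinality arithmetic is worth spelling out: a metric creates one series per unique combination of its label values, so the series count is the product of the distinct values of each label. A quick sketch (the label values here are hypothetical):

```python
# Each unique combination of label values becomes its own time series.
from itertools import product

model_names = ["iris-classifier"]      # 1 value
model_versions = ["1.0.0", "1.1.0"]    # 2 values
statuses = ["success", "error"]        # 2 values

series = list(product(model_names, model_versions, statuses))
print(len(series))  # 1 * 2 * 2 = 4 time series

# An unbounded label such as user_id multiplies this count by the
# number of distinct users -- which is why such labels must be avoided.
```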

PromQL querying and alerting

PromQL (Prometheus Query Language) is a functional query language for selecting and aggregating time-series data. Instant vectors select the current value of a set of series; range vectors select a window of samples; functions compute rates, averages, quantiles, and predictions over those vectors. Alerting rules are PromQL expressions evaluated at a configurable interval; when an expression returns a non-empty result, the alert fires and is sent to Alertmanager.
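As a sketch of how these pieces combine, the rules file below uses the metric names from the serving example later in this article; the 500 ms threshold, group name, and team label are arbitrary choices for illustration:

```yaml
# ml_rules.yml -- illustrative recording and alerting rules
groups:
  - name: ml-serving
    rules:
      # Recording rule: precompute p95 prediction latency from histogram buckets
      - record: job:ml_prediction_latency_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(ml_prediction_latency_seconds_bucket[5m])) by (le))
      # Alerting rule: fire if p95 stays above 500 ms for 10 minutes
      - alert: HighPredictionLatency
        expr: job:ml_prediction_latency_seconds:p95 > 0.5
        for: 10m
        labels:
          severity: warning
          team: ml
        annotations:
          summary: "p95 prediction latency above 500 ms"
```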

Alertmanager

Alertmanager receives alerts from Prometheus (and other sources), deduplicates them, applies grouping and routing rules, and dispatches notifications to receivers (PagerDuty, Slack, email, webhooks). Silences and inhibition rules prevent alert storms during known maintenance windows or cascading failures. In ML systems, Alertmanager routes model degradation alerts to the ML team's Slack channel while infrastructure alerts (high CPU, OOM kills) go to the platform team.
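A routing sketch along those lines (the receiver names, team label, and channel are assumptions, and the PagerDuty key is a placeholder):

```yaml
# alertmanager.yml -- illustrative routing
route:
  receiver: platform-team            # default: infrastructure alerts
  group_by: ["alertname", "model_name"]
  routes:
    - matchers:
        - team = "ml"                # model degradation alerts
      receiver: ml-team-slack

receivers:
  - name: ml-team-slack
    slack_configs:
      - channel: "#ml-alerts"
  - name: platform-team
    pagerduty_configs:
      - routing_key: "REPLACE_ME"    # hypothetical placeholder
```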

Remote storage and federation

For multi-cluster or long-retention scenarios, Prometheus remote-writes samples to a durable backend. Federation allows a global Prometheus to scrape aggregated metrics from regional Prometheus instances. Both patterns are common in large ML platforms where training clusters and serving clusters each run their own Prometheus, and a central instance aggregates service-level metrics.
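In configuration terms, both patterns are small additions to prometheus.yml on the central instance; the endpoint URL, job name, and regional target below are illustrative, and the exact remote-write path depends on the backend:

```yaml
# Central Prometheus: remote-write to a durable backend, and
# federate aggregated series from a regional instance.
remote_write:
  - url: "http://thanos-receive.monitoring:19291/api/v1/receive"

scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{job="ml-serving"}'       # only pull the aggregated series
    static_configs:
      - targets: ["prometheus-us-east:9090"]
```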

When to use / When NOT to use

| Use when | Avoid when |
| --- | --- |
| You need operational metrics for ML serving infrastructure (latency, throughput, error rate) | You need to store raw prediction logs or high-cardinality event data |
| You want a pull-based, self-hosted monitoring stack with no vendor lock-in | Your team lacks infrastructure experience to operate and tune a Prometheus stack |
| You are running on Kubernetes and want native service discovery | You need long-term retention (>15 days) without additional remote storage setup |
| You need powerful alerting with deduplication and routing via Alertmanager | You need sub-second scrape intervals; Prometheus is designed for 10–60 second intervals |
| You want a standard backend for Grafana dashboards | Your application generates unbounded label cardinality, which will degrade TSDB performance |

Comparisons

Prometheus and Grafana are complementary, not competing tools. The table below describes when to use them together versus alternatives.

| Criterion | Prometheus | Grafana |
| --- | --- | --- |
| Role | Collect, store, and alert on metrics | Visualize and explore metrics from any data source |
| Query language | PromQL (metrics-optimized functional language) | Per-datasource (PromQL for Prometheus, SQL for others) |
| Alerting | Built-in alerting rules + Alertmanager | Grafana Alerting (unified, multi-datasource) |
| Data sources | Self (TSDB) | Prometheus, InfluxDB, Loki, Elasticsearch, databases, etc. |
| Storage | Local TSDB, remote-write for long-term | No storage — purely a query and visualization layer |
| When to use together | Always — Prometheus collects, Grafana shows | Always — use Grafana as the UI for Prometheus data |

Pros and cons

| Aspect | Pros | Cons |
| --- | --- | --- |
| Pull-based architecture | Simple debugging, access control at target level | Requires targets to expose HTTP endpoints |
| PromQL | Expressive, composable, purpose-built for metrics | Steep learning curve compared to SQL |
| Local TSDB | Fast ingest and query for recent data | Limited retention; needs remote storage for long-term |
| Label model | Flexible multi-dimensional filtering and aggregation | High-cardinality labels cause memory and query performance issues |
| Alertmanager | Rich routing, grouping, and silencing | Separate component to operate; configuration can become complex |
| Ecosystem | Huge library of exporters and client libraries | Operational overhead for self-hosted deployments |

Code examples

# ml_metrics_server.py
# Exposes ML model metrics via prometheus_client for Prometheus scraping.
# Run: pip install prometheus_client scikit-learn numpy
# Then configure Prometheus to scrape localhost:8000

import time
import threading
import random
import numpy as np
from prometheus_client import (
    Counter,
    Histogram,
    Gauge,
    start_http_server,
)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# --- Define metrics ---

PREDICTION_COUNTER = Counter(
    "ml_predictions_total",
    "Total number of prediction requests",
    ["model_name", "model_version", "status"],  # labels
)

PREDICTION_LATENCY = Histogram(
    "ml_prediction_latency_seconds",
    "Prediction request latency in seconds",
    ["model_name", "model_version"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5],
)

MODEL_CONFIDENCE = Histogram(
    "ml_prediction_confidence",
    "Distribution of model prediction confidence scores",
    ["model_name", "model_version"],
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
)

ACTIVE_MODEL_VERSION = Gauge(
    "ml_active_model_version",
    "Currently active model version (encoded as numeric)",
    ["model_name"],
)

DATA_DRIFT_SCORE = Gauge(
    "ml_data_drift_score",
    "Current data drift score (PSI) for the primary feature set",
    ["model_name", "feature_set"],
)

# --- Load and train a simple model ---
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X, y)

MODEL_NAME = "iris-classifier"
MODEL_VERSION = "1.0.0"
ACTIVE_MODEL_VERSION.labels(model_name=MODEL_NAME).set(1)


def simulate_prediction(features: np.ndarray) -> dict:
    """Run a prediction and record Prometheus metrics."""
    start = time.time()
    try:
        proba = clf.predict_proba(features.reshape(1, -1))[0]
        predicted_class = int(np.argmax(proba))
        confidence = float(np.max(proba))

        # Record latency and confidence
        duration = time.time() - start
        PREDICTION_LATENCY.labels(
            model_name=MODEL_NAME, model_version=MODEL_VERSION
        ).observe(duration)
        MODEL_CONFIDENCE.labels(
            model_name=MODEL_NAME, model_version=MODEL_VERSION
        ).observe(confidence)
        PREDICTION_COUNTER.labels(
            model_name=MODEL_NAME,
            model_version=MODEL_VERSION,
            status="success",
        ).inc()

        return {"class": predicted_class, "confidence": confidence}
    except Exception:
        PREDICTION_COUNTER.labels(
            model_name=MODEL_NAME,
            model_version=MODEL_VERSION,
            status="error",
        ).inc()
        raise


def simulate_drift_monitoring():
    """Periodically update a synthetic drift score gauge."""
    while True:
        # In production this would run a real PSI/KS test
        drift_score = random.uniform(0.01, 0.35)
        DATA_DRIFT_SCORE.labels(
            model_name=MODEL_NAME, feature_set="sepal"
        ).set(drift_score)
        time.sleep(30)


def simulate_traffic():
    """Generate synthetic prediction traffic for demonstration."""
    samples = X[np.random.choice(len(X), size=10)]
    for sample in samples:
        simulate_prediction(sample)
        time.sleep(random.uniform(0.05, 0.3))


if __name__ == "__main__":
    # Start Prometheus metrics HTTP server on port 8000
    start_http_server(8000)
    print("Prometheus metrics server running on http://localhost:8000/metrics")
    print("Configure Prometheus to scrape this endpoint.")

    # Start background drift monitor
    drift_thread = threading.Thread(target=simulate_drift_monitoring, daemon=True)
    drift_thread.start()

    # Simulate continuous prediction traffic
    print("Simulating prediction traffic...")
    while True:
        simulate_traffic()
        time.sleep(1)
