Prometheus

Definition

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud and now a graduated CNCF project. It stores all data as time-series: streams of timestamped floating-point values identified by a metric name and a set of key-value labels. This model is a natural fit for operational data — CPU usage, request counts, error rates — and for ML-specific signals such as prediction latency, throughput, and feature value distributions over time.

The defining architectural choice in Prometheus is its pull-based scraping model. Rather than requiring instrumented applications to push metrics to a central collector, Prometheus periodically scrapes HTTP endpoints (by default /metrics) exposed by targets. This inversion of control makes service discovery, access control, and debugging significantly simpler: you can curl any target's metrics endpoint directly to see what Prometheus will collect. Targets are discovered via static configuration or dynamic service discovery (Kubernetes, Consul, EC2, etc.).
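A minimal scrape configuration illustrates both discovery styles; the job names and target address below are placeholders, not part of any standard setup:

```yaml
# prometheus.yml -- illustrative scrape configuration
global:
  scrape_interval: 15s          # how often to scrape each target

scrape_configs:
  - job_name: "ml-serving"      # static target list
    static_configs:
      - targets: ["localhost:8000"]

  - job_name: "kubernetes-pods" # dynamic discovery of pods
    kubernetes_sd_configs:
      - role: pod
```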

Prometheus is not a long-term storage solution by design. Its local time-series database (TSDB) is optimized for fast ingest and query of recent data, with a default retention of 15 days. For longer retention, Prometheus can remote-write samples to systems like Thanos, Cortex, or VictoriaMetrics. In ML contexts, Prometheus is the collection and alerting layer; Grafana provides the visualization and dashboarding layer on top.

How it works

Target instrumentation

Applications expose metrics via an HTTP /metrics endpoint in the Prometheus exposition format — a plain-text format of metric_name{label="value"} numeric_value lines, with an optional trailing timestamp. In Python, the prometheus_client library provides Counter, Gauge, Histogram, and Summary types that handle the exposition format automatically. An ML serving process typically exposes counters for total prediction requests, histograms for request latency, and gauges for currently loaded model versions and resource utilization.
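For illustration, scraping such an endpoint returns plain text along these lines (the metric names match the serving example later in this article; the sample values are made up):

```
# HELP ml_predictions_total Total number of prediction requests
# TYPE ml_predictions_total counter
ml_predictions_total{model_name="iris-classifier",model_version="1.0.0",status="success"} 1027.0
# HELP ml_active_model_version Currently active model version (encoded as numeric)
# TYPE ml_active_model_version gauge
ml_active_model_version{model_name="iris-classifier"} 1.0
```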

Scrape and storage

Prometheus evaluates its configuration file to determine which targets to scrape and at what interval (default: 15 seconds). On each scrape it fetches the /metrics endpoint, parses the exposition format, and writes the samples to its local TSDB in compressed chunks. The TSDB uses a write-ahead log (WAL) for durability and compacts data into blocks over time. Label cardinality is the main performance lever: each unique combination of label values creates a separate time series, so unbounded labels (e.g., user IDs) must be avoided.
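The cardinality arithmetic is worth spelling out: a metric creates one series per unique combination of its label values, so the series count is the product of the distinct values of each label. A quick sketch (the label values here are hypothetical):

```python
# Each unique combination of label values becomes its own time series.
from itertools import product

model_names = ["iris-classifier"]      # 1 value
model_versions = ["1.0.0", "1.1.0"]    # 2 values
statuses = ["success", "error"]        # 2 values

series = list(product(model_names, model_versions, statuses))
print(len(series))  # 1 * 2 * 2 = 4 time series

# An unbounded label such as user_id multiplies this count by the
# number of distinct users -- which is why such labels must be avoided.
```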

PromQL querying and alerting

PromQL (Prometheus Query Language) is a functional query language for selecting and aggregating time-series data. Instant vectors select the current value of a set of series; range vectors select a window of samples; functions compute rates, averages, quantiles, and predictions over those vectors. Alerting rules are PromQL expressions evaluated at a configurable interval; when an expression returns a non-empty result, the alert fires and is sent to Alertmanager.
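As a sketch of how these pieces combine, the rules file below uses the metric names from the serving example later in this article; the 500 ms threshold, group name, and team label are arbitrary choices for illustration:

```yaml
# ml_rules.yml -- illustrative recording and alerting rules
groups:
  - name: ml-serving
    rules:
      # Recording rule: precompute p95 prediction latency from histogram buckets
      - record: job:ml_prediction_latency_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(ml_prediction_latency_seconds_bucket[5m])) by (le))
      # Alerting rule: fire if p95 stays above 500 ms for 10 minutes
      - alert: HighPredictionLatency
        expr: job:ml_prediction_latency_seconds:p95 > 0.5
        for: 10m
        labels:
          severity: warning
          team: ml
        annotations:
          summary: "p95 prediction latency above 500 ms"
```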

Alertmanager

Alertmanager receives alerts from Prometheus (and other sources), deduplicates them, applies grouping and routing rules, and dispatches notifications to receivers (PagerDuty, Slack, email, webhooks). Silences and inhibition rules prevent alert storms during known maintenance windows or cascading failures. In ML systems, Alertmanager routes model degradation alerts to the ML team's Slack channel while infrastructure alerts (high CPU, OOM kills) go to the platform team.
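A routing sketch along those lines (the receiver names, team label, and channel are assumptions, and the PagerDuty key is a placeholder):

```yaml
# alertmanager.yml -- illustrative routing
route:
  receiver: platform-team            # default: infrastructure alerts
  group_by: ["alertname", "model_name"]
  routes:
    - matchers:
        - team = "ml"                # model degradation alerts
      receiver: ml-team-slack

receivers:
  - name: ml-team-slack
    slack_configs:
      - channel: "#ml-alerts"
  - name: platform-team
    pagerduty_configs:
      - routing_key: "REPLACE_ME"    # hypothetical placeholder
```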

Remote storage and federation

For multi-cluster or long-retention scenarios, Prometheus remote-writes samples to a durable backend. Federation allows a global Prometheus to scrape aggregated metrics from regional Prometheus instances. Both patterns are common in large ML platforms where training clusters and serving clusters each run their own Prometheus, and a central instance aggregates service-level metrics.
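In configuration terms, both patterns are small additions to prometheus.yml on the central instance; the endpoint URL, job name, and regional target below are illustrative, and the exact remote-write path depends on the backend:

```yaml
# Central Prometheus: remote-write to a durable backend, and
# federate aggregated series from a regional instance.
remote_write:
  - url: "http://thanos-receive.monitoring:19291/api/v1/receive"

scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{job="ml-serving"}'       # only pull the aggregated series
    static_configs:
      - targets: ["prometheus-us-east:9090"]
```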

When to use / When NOT to use

| Use when | Avoid when |
| --- | --- |
| You need operational metrics for ML serving infrastructure (latency, throughput, error rate) | You need to store raw prediction logs or high-cardinality event data |
| You want a pull-based, self-hosted monitoring stack with no vendor lock-in | Your team lacks infrastructure experience to operate and tune a Prometheus stack |
| You are running on Kubernetes and want native service discovery | You need long-term retention (>15 days) without additional remote storage setup |
| You need powerful alerting with deduplication and routing via Alertmanager | You need sub-second scrape intervals; Prometheus is designed for 10–60 second intervals |
| You want a standard backend for Grafana dashboards | Your application generates unbounded label cardinality, which will degrade TSDB performance |

Comparisons

Prometheus and Grafana are complementary, not competing tools. The table below describes when to use them together versus alternatives.

| Criterion | Prometheus | Grafana |
| --- | --- | --- |
| Role | Collect, store, and alert on metrics | Visualize and explore metrics from any data source |
| Query language | PromQL (metrics-optimized functional language) | Per-datasource (PromQL for Prometheus, SQL for others) |
| Alerting | Built-in alerting rules + Alertmanager | Grafana Alerting (unified, multi-datasource) |
| Data sources | Self (TSDB) | Prometheus, InfluxDB, Loki, Elasticsearch, databases, etc. |
| Storage | Local TSDB, remote-write for long-term | No storage — purely a query and visualization layer |
| When to use together | Always — Prometheus collects, Grafana shows | Always — use Grafana as the UI for Prometheus data |

Pros and cons

| Aspect | Pros | Cons |
| --- | --- | --- |
| Pull-based architecture | Simple debugging, access control at target level | Requires targets to expose HTTP endpoints |
| PromQL | Expressive, composable, purpose-built for metrics | Steep learning curve compared to SQL |
| Local TSDB | Fast ingest and query for recent data | Limited retention; needs remote storage for long-term |
| Label model | Flexible multi-dimensional filtering and aggregation | High-cardinality labels cause memory and query performance issues |
| Alertmanager | Rich routing, grouping, and silencing | Separate component to operate; configuration can become complex |
| Ecosystem | Huge library of exporters and client libraries | Operational overhead for self-hosted deployments |

Code examples

# ml_metrics_server.py
# Exposes ML model metrics via prometheus_client for Prometheus scraping.
# Run: pip install prometheus_client scikit-learn numpy
# Then configure Prometheus to scrape localhost:8000

import time
import threading
import random
import numpy as np
from prometheus_client import (
    Counter,
    Histogram,
    Gauge,
    start_http_server,
)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# --- Define metrics ---

PREDICTION_COUNTER = Counter(
    "ml_predictions_total",
    "Total number of prediction requests",
    ["model_name", "model_version", "status"],  # labels
)

PREDICTION_LATENCY = Histogram(
    "ml_prediction_latency_seconds",
    "Prediction request latency in seconds",
    ["model_name", "model_version"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5],
)

MODEL_CONFIDENCE = Histogram(
    "ml_prediction_confidence",
    "Distribution of model prediction confidence scores",
    ["model_name", "model_version"],
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
)

ACTIVE_MODEL_VERSION = Gauge(
    "ml_active_model_version",
    "Currently active model version (encoded as numeric)",
    ["model_name"],
)

DATA_DRIFT_SCORE = Gauge(
    "ml_data_drift_score",
    "Current data drift score (PSI) for the primary feature set",
    ["model_name", "feature_set"],
)

# --- Load and train a simple model ---
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X, y)

MODEL_NAME = "iris-classifier"
MODEL_VERSION = "1.0.0"
ACTIVE_MODEL_VERSION.labels(model_name=MODEL_NAME).set(1)


def simulate_prediction(features: np.ndarray) -> dict:
    """Run a prediction and record Prometheus metrics."""
    start = time.time()
    try:
        proba = clf.predict_proba(features.reshape(1, -1))[0]
        predicted_class = int(np.argmax(proba))
        confidence = float(np.max(proba))

        # Record latency and confidence
        duration = time.time() - start
        PREDICTION_LATENCY.labels(
            model_name=MODEL_NAME, model_version=MODEL_VERSION
        ).observe(duration)
        MODEL_CONFIDENCE.labels(
            model_name=MODEL_NAME, model_version=MODEL_VERSION
        ).observe(confidence)
        PREDICTION_COUNTER.labels(
            model_name=MODEL_NAME,
            model_version=MODEL_VERSION,
            status="success",
        ).inc()

        return {"class": predicted_class, "confidence": confidence}
    except Exception:
        PREDICTION_COUNTER.labels(
            model_name=MODEL_NAME,
            model_version=MODEL_VERSION,
            status="error",
        ).inc()
        raise


def simulate_drift_monitoring():
    """Periodically update a synthetic drift score gauge."""
    while True:
        # In production this would run a real PSI/KS test
        drift_score = random.uniform(0.01, 0.35)
        DATA_DRIFT_SCORE.labels(
            model_name=MODEL_NAME, feature_set="sepal"
        ).set(drift_score)
        time.sleep(30)


def simulate_traffic():
    """Generate synthetic prediction traffic for demonstration."""
    samples = X[np.random.choice(len(X), size=10)]
    for sample in samples:
        simulate_prediction(sample)
        time.sleep(random.uniform(0.05, 0.3))


if __name__ == "__main__":
    # Start Prometheus metrics HTTP server on port 8000
    start_http_server(8000)
    print("Prometheus metrics server running on http://localhost:8000/metrics")
    print("Configure Prometheus to scrape this endpoint.")

    # Start background drift monitor
    drift_thread = threading.Thread(target=simulate_drift_monitoring, daemon=True)
    drift_thread.start()

    # Simulate continuous prediction traffic
    print("Simulating prediction traffic...")
    while True:
        simulate_traffic()
        time.sleep(1)
