Tracking ML Services with Prometheus and Grafana

03 September 2025
Summary: In this guide, you'll set up Prometheus for metric collection and Grafana for visualization of production ML services. You'll build dashboards for request rate, latency, and 5xx error ratio, and define basic alerting rules.

Introduction

Observability is essential for production ML services. The steps below walk you through collecting metrics with Prometheus, visualizing them with Grafana, and defining simple alert rules.

Prerequisites

  • Linux server or Docker environment
  • Access to ports 9090 (Prometheus) and 3000 (Grafana)
  • A running HTTP API (e.g., FastAPI) — examples use :8000

    Note: Commands are provided in bash blocks; a non-root user with sudo is assumed.


Step 1 — Start Prometheus and define a target

Prometheus is a pull-based monitoring server: it scrapes metrics from HTTP endpoints at a regular interval and stores them as time series. First, write a configuration that registers your service as a scrape target.

Configuration (prometheus.yml):

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "ml-service"
    metrics_path: /metrics
    static_configs:
      - targets: ["host.docker.internal:8000"]  # On Linux you may prefer localhost:8000

Start Prometheus:

docker run -d --name prom \
  -p 9090:9090 \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus

Verify:

curl -s localhost:9090/-/ready  # expect "Prometheus Server is Ready."

Step 2 — Add a Prometheus exporter to your app (FastAPI example)

Your service must expose Prometheus-formatted metrics at /metrics. The example below collects request count and latency metrics.

Install:

python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn prometheus-client

App (app.py):

from fastapi import FastAPI, Response, Request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time

app = FastAPI()

REQ_COUNT   = Counter("http_requests_total", "Total HTTP requests", ["path","method","status"])
REQ_LATENCY = Histogram("http_request_latency_seconds", "Request latency (s)", ["path","method"])

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    # Time every request and record it under path/method/status labels.
    # Caution: labeling by raw URL path can explode cardinality if the API
    # has dynamic segments (e.g. /items/42); prefer route templates there.
    start = time.perf_counter()
    response = await call_next(request)
    latency = time.perf_counter() - start
    REQ_LATENCY.labels(request.url.path, request.method).observe(latency)
    REQ_COUNT.labels(request.url.path, request.method, str(response.status_code)).inc()
    return response

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.get("/predict")
def predict():
    # ... model call ...
    return {"ok": True}

Run:

uvicorn app:app --host 0.0.0.0 --port 8000

Verify:

curl -s localhost:8000/metrics | head
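
To give the dashboards something to show later, generate a little traffic; a minimal sketch using only the Python standard library (assumes the Step 2 app is listening on localhost:8000):

import time
import urllib.request

# Hit /predict repeatedly so the request counter and latency histogram
# accumulate samples for Prometheus to scrape.
for _ in range(200):
    with urllib.request.urlopen("http://localhost:8000/predict") as resp:
        resp.read()
    time.sleep(0.05)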

Step 3 — Install Grafana and build dashboards

Grafana visualizes data from Prometheus. When adding the Prometheus data source, set the URL to http://localhost:9090 if Grafana runs directly on the host. If Grafana runs in a container (as below), localhost refers to the Grafana container itself; use http://host.docker.internal:9090 on Docker Desktop, or put both containers on a shared Docker network and use http://prom:9090.

Start Grafana:

docker run -d --name grafana \
  -p 3000:3000 \
  grafana/grafana-oss
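
You can also skip the UI and provision the data source from a file using Grafana's provisioning mechanism; a minimal sketch (the file name and the host.docker.internal URL are assumptions for a Docker Desktop setup):

Data source file (grafana-datasource.yaml):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://host.docker.internal:9090
    isDefault: true

Mount it into the container by adding
-v "$PWD/grafana-datasource.yaml:/etc/grafana/provisioning/datasources/prometheus.yaml"
to the docker run command above.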

Example panels (PromQL)

Request rate:

sum(rate(http_requests_total[5m])) by (status)

Latency P90:

histogram_quantile(0.90, sum(rate(http_request_latency_seconds_bucket[5m])) by (le))
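
To break the quantile down per endpoint, keep the path label in the aggregation (same metric, one extra grouping label):

histogram_quantile(0.90, sum(rate(http_request_latency_seconds_bucket[5m])) by (le, path))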

5xx error ratio:

sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ sum(rate(http_requests_total[5m]))
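
Any of these expressions can be sanity-checked outside Grafana via Prometheus's HTTP query API; a quick example (curl URL-encodes the expression and sends it as a POST, which the API accepts):

curl -s localhost:9090/api/v1/query --data-urlencode 'query=sum(rate(http_requests_total[5m]))'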

Step 4 — Basics of alerting and SLOs (recording & alerting rules)

Use recording rules to pre-compute frequently used expressions and alerting rules to trigger notifications on high error rates.

Recording rule (example):

groups:
- name: mlops-recording
  rules:
  - record: job:http_request_latency_seconds:p90
    expr: histogram_quantile(0.90, sum(rate(http_request_latency_seconds_bucket[5m])) by (le))

Alert rule (example):

groups:
- name: mlops-alerts
  rules:
  - alert: HighErrorRate
    expr: |
      (sum(rate(http_requests_total{status=~"5.."}[5m])) /
       sum(rate(http_requests_total[5m]))) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 5xx error ratio exceeded 5%
      description: The 5xx/s rate over the last 5 minutes crossed the threshold.
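
Prometheus only evaluates these groups if prometheus.yml references the rule files. A sketch (the /etc/prometheus/rules path is an assumption; mount your rules directory there when starting the container):

rule_files:
  - /etc/prometheus/rules/*.yml

After changing the configuration, restart the container (docker restart prom), or POST to the reload endpoint if Prometheus was started with --web.enable-lifecycle:

curl -X POST localhost:9090/-/reload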

Troubleshooting

  • Target DOWN: Check the Status → Targets page and container networking. On Linux, host.docker.internal only resolves if the container is started with --add-host=host.docker.internal:host-gateway; alternatively, run Prometheus with --network host and target localhost:8000.
  • Empty /metrics: Ensure counters/histograms are registered and the middleware runs.
  • Empty Grafana panels: Try the same PromQL in the Prometheus UI; if no series returns, review the scrape settings.
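
To test connectivity from inside the Prometheus container itself, you can reuse its busybox tooling (an assumption about the prom/prometheus base image; adjust the target URL to match your scrape config):

docker exec prom wget -qO- http://host.docker.internal:8000/metrics | head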

Conclusion

With this setup, you can monitor request rate, latency, and error ratio for your ML service, build dashboards, and catch issues with basic alert rules.

Next Steps

  • Add model quality metrics (accuracy/recall) via batch jobs
  • Configure Alertmanager to send Slack/Teams notifications
  • On Kubernetes, use Prometheus Operator + ServiceMonitor

(Admin metadata suggestion)

  • Slug: model-monitoring-izleme-loglama
  • Category: MLOps & AI Ops / Model Monitoring
  • Keywords: mlops, model monitoring, prometheus, grafana, exporter, observability, alerting
  • Summary (short): “Collect metrics and build dashboards for latency, request rate, and 5xx error ratio using Prometheus and Grafana for ML services.”