Introduction
Observability is essential for production ML services. The steps below walk you through collecting metrics with Prometheus, visualizing them with Grafana, and defining simple alert rules.
Prerequisites
- Linux server or Docker environment
- Access to ports 9090 (Prometheus) and 3000 (Grafana)
- A running HTTP API (e.g., FastAPI); the examples assume it listens on `:8000`
Note: Commands are provided in `bash` blocks; a non-root user with `sudo` is assumed.
Step 1 — Start Prometheus and define a target
Prometheus is a server that scrapes time-series metrics at regular intervals. First, write a configuration that adds your service as a scrape target.
Configuration (prometheus.yml):
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "ml-service"
    metrics_path: /metrics
    static_configs:
      - targets: ["host.docker.internal:8000"]
        # On Linux, pass --add-host=host.docker.internal:host-gateway to docker run (as below),
        # or run Prometheus with --network host and target localhost:8000 instead.
```
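Optionally, you can validate the configuration before starting the container. A quick sketch using `promtool`, which ships inside the official `prom/prometheus` image:

```bash
# Validate prometheus.yml before starting the server
docker run --rm \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" \
  --entrypoint /bin/promtool \
  prom/prometheus check config /etc/prometheus/prometheus.yml
```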
Start Prometheus:
```bash
docker run -d --name prom \
  --add-host=host.docker.internal:host-gateway \
  -p 9090:9090 \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus
```
Verify:
```bash
# Expect 200 once Prometheus is ready
curl -s -o /dev/null -w "%{http_code}\n" localhost:9090/-/ready
```
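You can also confirm that the scrape target was registered and is healthy through the Prometheus HTTP API. A small sketch; the `jq` filter is optional and assumes `jq` is installed:

```bash
# List active scrape targets with their health and last scrape error
curl -s localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health, lastError}'
```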
Step 2 — Add a Prometheus exporter to your app (FastAPI example)
Your service must expose Prometheus-formatted metrics at `/metrics`. The example below collects request-count and latency metrics.
Install:
```bash
python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn prometheus-client
```
App (app.py):
```python
from fastapi import FastAPI, Response, Request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time

app = FastAPI()

REQ_COUNT = Counter("http_requests_total", "Total HTTP requests", ["path", "method", "status"])
REQ_LATENCY = Histogram("http_request_latency_seconds", "Request latency (s)", ["path", "method"])

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    latency = time.time() - start
    REQ_LATENCY.labels(request.url.path, request.method).observe(latency)
    REQ_COUNT.labels(request.url.path, request.method, str(response.status_code)).inc()
    return response

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.get("/predict")
def predict():
    # ... model call ...
    return {"ok": True}
```
Run:
```bash
uvicorn app:app --host 0.0.0.0 --port 8000
```
Verify:
```bash
curl -s localhost:8000/metrics | head
```
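At this point, series only exist for endpoints that have actually been hit. A quick sketch to generate some traffic against the example `/predict` route and then inspect the resulting histogram series:

```bash
# Send 50 requests so counters and histogram buckets get populated
for i in $(seq 1 50); do curl -s localhost:8000/predict > /dev/null; done

# The histogram should now expose *_bucket, *_sum and *_count series
curl -s localhost:8000/metrics | grep '^http_request_latency_seconds' | head
```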
Step 3 — Install Grafana and build dashboards
Grafana visualizes data from Prometheus. When adding the Prometheus data source, use a URL the Grafana process can actually reach: `http://localhost:9090` if Grafana runs directly on the host, or `http://host.docker.internal:9090` if it runs in Docker as shown below.
Start Grafana:
```bash
docker run -d --name grafana \
  --add-host=host.docker.internal:host-gateway \
  -p 3000:3000 \
  grafana/grafana-oss
```
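If you prefer not to configure the data source by hand in the UI, Grafana can also provision it from a file. A minimal sketch, assuming you are willing to recreate the container with a provisioning file mounted under `/etc/grafana/provisioning/datasources/`:

```bash
# Write a data-source provisioning file for Grafana
cat > grafana-datasource.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://host.docker.internal:9090
    isDefault: true
EOF

# Recreate Grafana with the provisioning file mounted
docker rm -f grafana
docker run -d --name grafana \
  --add-host=host.docker.internal:host-gateway \
  -p 3000:3000 \
  -v "$PWD/grafana-datasource.yml:/etc/grafana/provisioning/datasources/datasource.yml" \
  grafana/grafana-oss
```

On startup, Grafana reads everything under `/etc/grafana/provisioning/datasources/` and creates the data source automatically.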
Example panels (PromQL)
Request rate:
```promql
sum(rate(http_requests_total[5m])) by (status)
```
Latency P90:
```promql
histogram_quantile(0.90, sum(rate(http_request_latency_seconds_bucket[5m])) by (le))
```
5xx error ratio:
```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```
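Before building panels, it can help to run one of these queries directly against the Prometheus HTTP API and confirm that series come back. A quick sketch (the `jq` pipe is optional):

```bash
# Run the request-rate query against Prometheus and print the matching series
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total[5m])) by (status)' \
  | jq '.data.result'
```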
Step 4 — Basics of alerting and SLOs (recording & alerting rules)
Use recording rules to pre-compute frequently used expressions and alerting rules to trigger notifications on high error rates.
Recording rule (example):
```yaml
groups:
  - name: mlops-recording
    rules:
      - record: job:http_request_latency_seconds:p90
        expr: histogram_quantile(0.90, sum(rate(http_request_latency_seconds_bucket[5m])) by (le))
```
Alert rule (example):
```yaml
groups:
  - name: mlops-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[5m])) /
           sum(rate(http_requests_total[5m]))) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error ratio exceeded 5%"
          description: "The ratio of 5xx responses over the last 5 minutes crossed the 5% threshold."
```
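Prometheus only evaluates these groups if the file containing them is listed under `rule_files` in `prometheus.yml`. A minimal sketch of one way to wire this in, assuming the recording and alerting groups above are merged into a single `rules.yml` next to `prometheus.yml`:

```bash
# Optional: validate the rule file first (promtool ships in the prom/prometheus image)
docker run --rm -v "$PWD/rules.yml:/etc/prometheus/rules.yml" \
  --entrypoint /bin/promtool prom/prometheus check rules /etc/prometheus/rules.yml

# Reference the rule file in prometheus.yml (rule_files is a top-level key)
cat >> prometheus.yml <<'EOF'
rule_files:
  - /etc/prometheus/rules.yml
EOF

# Recreate Prometheus with both files mounted so the rules are loaded
docker rm -f prom
docker run -d --name prom \
  --add-host=host.docker.internal:host-gateway \
  -p 9090:9090 \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" \
  -v "$PWD/rules.yml:/etc/prometheus/rules.yml" \
  prom/prometheus
```

Once loaded, the rules appear under Status → Rules and Alerts in the Prometheus UI.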
Troubleshooting
- Target DOWN: Check container networking and the target list at `localhost:9090/targets`. On Linux, pass `--add-host=host.docker.internal:host-gateway` to `docker run` (as above), or run Prometheus with `--network host` and target `localhost:8000`.
- Empty `/metrics`: Ensure the counters/histograms are registered and that the middleware actually runs.
- Empty Grafana panels: Try the same PromQL in the Prometheus UI; if no series is returned, review the scrape settings.
Conclusion
With this setup, you can monitor request rate, latency, and error ratio for your ML service, build dashboards, and catch issues with basic alert rules.
Next Steps
- Add model quality metrics (accuracy/recall) via batch jobs
- Configure Alertmanager to send Slack/Teams notifications
- On Kubernetes, use Prometheus Operator + ServiceMonitor
(Admin metadata suggestion)
- Slug: model-monitoring-izleme-loglama
- Category: MLOps & AI Ops / Model Monitoring
- Keywords: mlops, model monitoring, prometheus, grafana, exporter, observability, alerting
- Summary (short): “Collect metrics and build dashboards for latency, request rate, and 5xx error ratio using Prometheus and Grafana for ML services.”