Tracking ML Services with Prometheus and Grafana

03 September 2025
Summary: In this guide, you'll set up Prometheus for metric collection and Grafana for visualization of production ML services. You'll build dashboards for request rate, latency, and 5xx error ratio, and define basic alerting rules.

Introduction

Observability is essential for production ML services. The steps below walk you through collecting metrics with Prometheus, visualizing them with Grafana, and defining simple alert rules.

Prerequisites

  • Linux server or Docker environment
  • Access to ports 9090 (Prometheus) and 3000 (Grafana)
  • A running HTTP API (e.g., FastAPI) — examples use :8000

    Note: Commands are provided in bash blocks; a non-root user with sudo is assumed.


Step 1 — Start Prometheus and define a target

Prometheus is a pull-based monitoring server: it scrapes metrics from HTTP endpoints at a regular interval and stores them as time series. First, write a configuration that registers your service as a scrape target.

Configuration (prometheus.yml):

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "ml-service"
    metrics_path: /metrics
    static_configs:
      - targets: ["host.docker.internal:8000"]  # On Linux you may prefer localhost:8000

Start Prometheus:

docker run -d --name prom \
  -p 9090:9090 \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus

Verify:

curl -s localhost:9090/-/ready  # expect "Prometheus Server is Ready."

Step 2 — Add a Prometheus exporter to your app (FastAPI example)

Your service must expose Prometheus-formatted metrics at /metrics. The example below collects request count and latency metrics.

Install:

python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn prometheus-client

App (app.py):

from fastapi import FastAPI, Response, Request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time

app = FastAPI()

REQ_COUNT   = Counter("http_requests_total", "Total HTTP requests", ["path","method","status"])
REQ_LATENCY = Histogram("http_request_latency_seconds", "Request latency (s)", ["path","method"])

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    # Time every request and record it under path/method/status labels.
    # Caution: labeling by raw URL path can explode cardinality if the API
    # has dynamic segments (e.g. /items/42); prefer route templates there.
    start = time.perf_counter()
    response = await call_next(request)
    latency = time.perf_counter() - start
    REQ_LATENCY.labels(request.url.path, request.method).observe(latency)
    REQ_COUNT.labels(request.url.path, request.method, str(response.status_code)).inc()
    return response

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.get("/predict")
def predict():
    # ... model call ...
    return {"ok": True}

Run:

uvicorn app:app --host 0.0.0.0 --port 8000

Verify:

curl -s localhost:8000/metrics | head
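
To give the dashboards something to show later, generate a little traffic; a minimal sketch using only the Python standard library (assumes the Step 2 app is listening on localhost:8000):

import time
import urllib.request

# Hit /predict repeatedly so the request counter and latency histogram
# accumulate samples for Prometheus to scrape.
for _ in range(200):
    with urllib.request.urlopen("http://localhost:8000/predict") as resp:
        resp.read()
    time.sleep(0.05)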

Step 3 — Install Grafana and build dashboards

Grafana visualizes data from Prometheus. When adding the Prometheus data source, set the URL to http://localhost:9090 if Grafana runs directly on the host. If Grafana runs in a container (as below), localhost refers to the Grafana container itself; use http://host.docker.internal:9090 on Docker Desktop, or put both containers on a shared Docker network and use http://prom:9090.

Start Grafana:

docker run -d --name grafana \
  -p 3000:3000 \
  grafana/grafana-oss
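
You can also skip the UI and provision the data source from a file using Grafana's provisioning mechanism; a minimal sketch (the file name and the host.docker.internal URL are assumptions for a Docker Desktop setup):

Data source file (grafana-datasource.yaml):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://host.docker.internal:9090
    isDefault: true

Mount it into the container by adding
-v "$PWD/grafana-datasource.yaml:/etc/grafana/provisioning/datasources/prometheus.yaml"
to the docker run command above.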

Example panels (PromQL)

Request rate:

sum(rate(http_requests_total[5m])) by (status)

Latency P90:

histogram_quantile(0.90, sum(rate(http_request_latency_seconds_bucket[5m])) by (le))
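
To break the quantile down per endpoint, keep the path label in the aggregation (same metric, one extra grouping label):

histogram_quantile(0.90, sum(rate(http_request_latency_seconds_bucket[5m])) by (le, path))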

5xx error ratio:

sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ sum(rate(http_requests_total[5m]))
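
Any of these expressions can be sanity-checked outside Grafana via Prometheus's HTTP query API; a quick example (curl URL-encodes the expression and sends it as a POST, which the API accepts):

curl -s localhost:9090/api/v1/query --data-urlencode 'query=sum(rate(http_requests_total[5m]))'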

Step 4 — Basics of alerting and SLOs (recording & alerting rules)

Use recording rules to pre-compute frequently used expressions and alerting rules to trigger notifications on high error rates.

Recording rule (example):

groups:
- name: mlops-recording
  rules:
  - record: job:http_request_latency_seconds:p90
    expr: histogram_quantile(0.90, sum(rate(http_request_latency_seconds_bucket[5m])) by (le))

Alert rule (example):

groups:
- name: mlops-alerts
  rules:
  - alert: HighErrorRate
    expr: |
      (sum(rate(http_requests_total{status=~"5.."}[5m])) /
       sum(rate(http_requests_total[5m]))) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: 5xx error ratio exceeded 5%
      description: The 5xx/s rate over the last 5 minutes crossed the threshold.
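
Prometheus only evaluates these groups if prometheus.yml references the rule files. A sketch (the /etc/prometheus/rules path is an assumption; mount your rules directory there when starting the container):

rule_files:
  - /etc/prometheus/rules/*.yml

After changing the configuration, restart the container (docker restart prom), or POST to the reload endpoint if Prometheus was started with --web.enable-lifecycle:

curl -X POST localhost:9090/-/reload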

Troubleshooting

  • Target DOWN: Check the Status → Targets page and container networking. On Linux, host.docker.internal only resolves if the container is started with --add-host=host.docker.internal:host-gateway; alternatively, run Prometheus with --network host and target localhost:8000.
  • Empty /metrics: Ensure counters/histograms are registered and the middleware runs.
  • Empty Grafana panels: Try the same PromQL in the Prometheus UI; if no series returns, review the scrape settings.
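
To test connectivity from inside the Prometheus container itself, you can reuse its busybox tooling (an assumption about the prom/prometheus base image; adjust the target URL to match your scrape config):

docker exec prom wget -qO- http://host.docker.internal:8000/metrics | head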

Conclusion

With this setup, you can monitor request rate, latency, and error ratio for your ML service, build dashboards, and catch issues with basic alert rules.

Next Steps

  • Add model quality metrics (accuracy/recall) via batch jobs
  • Configure Alertmanager to send Slack/Teams notifications
  • On Kubernetes, use Prometheus Operator + ServiceMonitor

(Admin metadata suggestion)

  • Slug: model-monitoring-izleme-loglama
  • Category: MLOps & AI Ops / Model Monitoring
  • Keywords: mlops, model monitoring, prometheus, grafana, exporter, observability, alerting
  • Summary (short): “Collect metrics and build dashboards for latency, request rate, and 5xx error ratio using Prometheus and Grafana for ML services.”