Introduction
Model Monitoring ensures that machine learning models running in production are accurate, efficient, and reliable.
It tracks system resources, data quality, latency, and performance over time to maintain consistent behavior.
Why It’s Critical
- Models experience drift (changes in data distribution or target variables)
- Inaccurate predictions erode user trust
- Latency or cost increases raise operational risk
- Lack of alerting leads to unnoticed failures
“You can’t optimize what you don’t measure.” — Model Monitoring brings this principle to machine learning.
Prerequisites
Before implementing model monitoring, ensure the following components are in place:
- MLOps pipeline: model build → deploy → monitor workflow defined
- Prometheus: for collecting model, system, and API metrics
- Grafana: for dashboard visualization
- Alertmanager: for alert routing (email, Telegram, Slack, etc.)
- Loki / ELK: for centralized log collection
- Inference logging: API call logging enabled
- Namespace separation: prod / staging isolation in Kubernetes
1️⃣ Identifying Metrics to Monitor
Model monitoring should include behavioral and functional metrics, not just infrastructure usage.
a) System Metrics
| Metric | Description | Tool |
|---|---|---|
| CPU / RAM Usage | Resource consumption of the model service | Node Exporter |
| Disk I/O & Network | Latency and packet loss | Prometheus |
| Container Health | Pod restarts and statuses | Kubernetes metrics |
b) Application Metrics
| Metric | Description | Tool |
|---|---|---|
| Request Count / Latency | API traffic and latency | FastAPI / Prometheus |
| Error Rate | 5xx response ratio | Grafana |
| Throughput (RPS) | Requests per second | PromQL |
c) Model Metrics
| Metric | Description | Tool |
|---|---|---|
| Accuracy / F1 / Recall | Model performance | Test results |
| Drift Rate | Shift in data distribution | Evidently AI / custom scripts |
| Data Freshness | Timeliness of input data | Airflow / MLflow logs |
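These model-level metrics can be surfaced through the same Prometheus/Grafana stack as everything else. A minimal sketch, assuming the evaluation or drift job runs in the same process as the /metrics endpoint shown in section 2.2; the gauge names model_accuracy and model_drift_share are illustrative, not a standard:

```python
from prometheus_client import Gauge

# Illustrative gauges; names and semantics are assumptions, not a convention
MODEL_ACCURACY = Gauge("model_accuracy", "Latest offline accuracy of the deployed model")
MODEL_DRIFT_SHARE = Gauge("model_drift_share", "Share of features flagged as drifting")

def publish_model_metrics(accuracy: float, drift_share: float) -> None:
    """Call after each evaluation / drift run so Grafana can plot the values."""
    MODEL_ACCURACY.set(accuracy)
    MODEL_DRIFT_SHARE.set(drift_share)
```

A batch job that does not serve /metrics itself would need a Prometheus Pushgateway (or a file-based exporter) instead.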
2️⃣ Prometheus Configuration
2.1 prometheus.yml Example
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'model-api'
    static_configs:
      - targets: ['10.1.2.20:8000']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['10.1.2.21:9100']
```
2.2 Model Metrics Endpoint
Expose a /metrics endpoint in your model API so Prometheus can scrape it:

```python
import time

from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = FastAPI()

REQUESTS = Counter("model_requests_total", "Total number of requests")
LATENCY = Histogram("model_latency_seconds", "Request latency in seconds")

@app.get("/metrics")
def metrics():
    # Expose all registered metrics in the Prometheus text exposition format
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.post("/predict")
def predict(input: dict):
    REQUESTS.inc()
    start = time.time()
    result = {"prediction": "ok"}  # replace with real model inference
    LATENCY.observe(time.time() - start)
    return result
```
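As a side note, prometheus_client histograms also expose a time() helper, so the manual time.time() bookkeeping above can be written as a context manager; a sketch of the same handler using it:

```python
@app.post("/predict")
def predict(input: dict):
    REQUESTS.inc()
    with LATENCY.time():               # records the duration of everything in this block
        result = {"prediction": "ok"}  # replace with real model inference
    return result
```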
3️⃣ Grafana Dashboard Setup
3.1 Example Dashboards
- System Overview: CPU, RAM, Disk I/O
- Application Metrics: Latency, Throughput, Error Rate
- Model Performance: Accuracy, Drift, Confidence
3.2 Datasource Configuration
Open Grafana at http://10.1.2.22:3000 and add Prometheus as a data source.
Query examples:
```promql
# Requests per second over the last minute
rate(model_requests_total[1m])

# 95th-percentile request latency over the last 5 minutes
histogram_quantile(0.95, sum(rate(model_latency_seconds_bucket[5m])) by (le))
```
3.3 Alert Rule Example
Prometheus alerting rules are defined inside a rule group; the group and alert names below are placeholders:

```yaml
groups:
  - name: model-api-alerts
    rules:
      - alert: ModelServiceDown
        expr: rate(model_requests_total[5m]) < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "Model service might be down!"
```
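For Prometheus to evaluate this rule, the group has to be saved to a rule file and referenced from prometheus.yml; a minimal sketch, assuming the group above is stored as model_alerts.yml (the file name is an assumption):

```yaml
rule_files:
  - "model_alerts.yml"   # illustrative file name; path is relative to prometheus.yml
```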
4️⃣ Model Drift & Data Quality Monitoring
4.1 Drift Detection
Drift occurs when the distribution of live input data (or of the model's predictions) deviates from the distributions seen at training time.
Tool: Evidently AI
```bash
pip install evidently
```
Code example:

```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# train_df: reference data used at training time; prod_df: recent production data (pandas DataFrames)
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=prod_df)
report.save_html("drift_report.html")
```
If the share of drifting features exceeds 30%, retraining should be triggered automatically (a sketch of such a trigger follows below).
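A minimal sketch of that trigger, assuming the legacy Evidently Report API used above; the exact keys returned by as_dict() vary between Evidently versions, and trigger_retraining() is a hypothetical hook into your pipeline (e.g. an Airflow DAG run or a CI job):

```python
DRIFT_THRESHOLD = 0.30  # retrain once 30% of features are drifting

def drifted_share(report) -> float:
    """Pull the share of drifting columns out of the Evidently report, if present."""
    for metric in report.as_dict().get("metrics", []):
        result = metric.get("result", {})
        if isinstance(result, dict) and "share_of_drifted_columns" in result:
            return float(result["share_of_drifted_columns"])
    return 0.0

if drifted_share(report) > DRIFT_THRESHOLD:
    trigger_retraining()  # hypothetical hook; wire this to your retraining workflow
```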
4.2 Data Quality
- Missing value ratio and distribution change
- Class imbalance detection
- Outlier and anomaly frequency monitoring (a pandas sketch of these checks follows below)
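A minimal pandas sketch of these checks, assuming prod_df is a recent slice of production data and "label" is the target column (both names, and the |z| > 3 outlier rule, are illustrative assumptions):

```python
import pandas as pd

def data_quality_summary(prod_df: pd.DataFrame, label_col: str = "label") -> dict:
    """Basic data-quality signals; column names and thresholds are illustrative."""
    missing_ratio = prod_df.isna().mean()                       # per-column missing-value ratio
    class_share = prod_df[label_col].value_counts(normalize=True)
    numeric = prod_df.select_dtypes(include="number")
    z_scores = (numeric - numeric.mean()) / numeric.std()
    outlier_rate = (z_scores.abs() > 3).mean().mean()           # share of cells with |z| > 3
    return {
        "worst_missing_ratio": float(missing_ratio.max()),
        "minority_class_share": float(class_share.min()),
        "outlier_rate": float(outlier_rate),
    }
```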
5️⃣ Alertmanager Configuration
alertmanager.yml Example:
```yaml
global:
  smtp_smarthost: 'smtp.office365.com:587'
  smtp_from: 'alerts@hmyn.net'
  smtp_auth_username: 'alerts@hmyn.net'
  smtp_auth_password: '********'

route:
  receiver: 'team-hmyn'

receivers:
  - name: 'team-hmyn'
    email_configs:
      - to: 'devops@hmyn.net'
```
Optional Telegram webhook:
```yaml
receivers:
  - name: 'telegram'
    webhook_configs:
      - url: 'http://10.1.2.22:5678/telegram'
```
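For the Telegram receiver to fire at all, the route tree must point at it; a minimal sketch, assuming only critical alerts should go to Telegram while everything else stays on e-mail (the matchers syntax requires a recent Alertmanager; older versions use match: instead):

```yaml
route:
  receiver: 'team-hmyn'            # default: e-mail
  routes:
    - matchers:
        - severity = "critical"    # assumes alert rules set this label
      receiver: 'telegram'
```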
6️⃣ Anomaly Detection & Incident Response
6.1 Runtime Detection
- Falco: syscall-level intrusion monitoring
- Prometheus Alerts: resource anomaly detection
- Grafana Alerts: custom metric-based triggers
6.2 Incident Management
| Stage | Tool | Responsible |
|---|---|---|
| Alert Notification | Alertmanager / Telegram | DevOps |
| Root Cause Analysis | Grafana / Loki | SRE / MLOps |
| Post-mortem Report | Confluence / GitHub Wiki | Team Lead |
6.3 Self-Healing Example
```bash
kubectl rollout restart deployment model-api -n prod
```
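A manual rollout restart is a stop-gap; letting Kubernetes restart unhealthy pods on its own is the more durable form of self-healing. A minimal sketch of a liveness probe for the model-api container, assuming the service exposes a /health endpoint on port 8000 (the path and timings are assumptions):

```yaml
# Fragment of the model-api Deployment spec (inside the container definition)
livenessProbe:
  httpGet:
    path: /health          # assumed health-check endpoint on the model API
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3      # restart the container after 3 consecutive failures
```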
7️⃣ Production Checklist
| Category | Check Item | Status |
|---|---|---|
| System Monitoring | CPU, RAM, Disk, Network metrics collected? | ✅ |
| App Monitoring | Latency, throughput, error rate tracked? | ✅ |
| Model Monitoring | Accuracy / Drift / Confidence logged? | ✅ |
| Logging | Centralized inference logs stored? | ✅ |
| Alerting | Alerts delivered via email / Telegram? | ✅ |
| Data Quality | Drift analysis scheduled weekly? | ✅ |
| SLA Tracking | Model response < 500ms? | ⚙️ |
| Incident Response | Runbook and on-call defined? | ⚙️ |
| Security | Grafana & Prometheus RBAC enforced? | ✅ |
Conclusion
Model monitoring is essential for understanding a model's real-world performance and acts as an early warning system for issues.
A well-structured observability stack combines metrics, logs, and alerts for full visibility.
Summary:
- Prometheus & Grafana form the core observability layer
- Drift tracking triggers retraining workflows
- Alertmanager ensures timely notifications
- Checklists bring standardization across MLOps teams
“A strong monitoring system keeps the heartbeat of every production model.”