Prometheus Metrics

All Shoehorn services expose Prometheus-compatible metrics for monitoring and alerting.

Each service exposes metrics on its metrics port (default: 9090):

http://<service>:9090/metrics
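
To inspect what a service exports, the endpoint can be read with any HTTP client. The following is a minimal sketch of a parser for the Prometheus text format, using only the Python standard library. It is illustrative rather than complete: it ignores timestamps and does not unescape label values. The function names and the sample values are hypothetical, and the label names `method`, `path`, and `status` are inferred from the `http_requests_total` description on this page.

```python
import re
import urllib.request

# One sample line: metric name, optional {labels}, then the value.
# Trailing timestamps, if present, are ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match is None:  # not a sample line this sketch understands
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url, timeout=5.0):
    """Fetch a /metrics endpoint and parse the response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Hypothetical sample output; real values and label sets will differ.
SAMPLE_TEXT = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api",status="200"} 1027
http_requests_total{method="GET",path="/api",status="500"} 3
http_requests_in_flight 4
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE_TEXT):
        print(name, labels, value)
```

Against a running service, `fetch_metrics("http://<service>:9090/metrics")` would return the same kind of tuple list, with `<service>` replaced by the service's hostname.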
| Metric | Type | Description |
| --- | --- | --- |
| http_requests_total | Counter | Total HTTP requests by method, path, status |
| http_request_duration_seconds | Histogram | Request latency distribution |
| http_requests_in_flight | Gauge | Current active requests |

| Metric | Type | Description |
| --- | --- | --- |
| grpc_server_handled_total | Counter | Total gRPC calls by method and status |
| grpc_server_handling_seconds | Histogram | gRPC call latency |

| Metric | Type | Description |
| --- | --- | --- |
| db_connections_open | Gauge | Open database connections |
| db_connections_in_use | Gauge | Active database connections |
| db_query_duration_seconds | Histogram | Query execution time |

| Metric | Type | Description |
| --- | --- | --- |
| cache_hits_total | Counter | Cache hit count |
| cache_misses_total | Counter | Cache miss count |

| Metric | Type | Description |
| --- | --- | --- |
| entities_total | Gauge | Total entity count |
| k8s_agent_pushes_total | Counter | Agent data push count |
| search_queries_total | Counter | Search query count |
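
The metric type determines how a value should be read. A gauge such as `db_connections_in_use` can be used directly, while a counter such as `cache_hits_total` only ever increases (until the process restarts), so it is normally compared across two scrapes. The sketch below illustrates both cases in Python. The function names and input numbers are hypothetical, and the reset handling is a simplification of what Prometheus does for `rate()` and `increase()`.

```python
def counter_increase(previous, current):
    """Increase between two samples of a counter.

    A drop in value is treated as a restart from zero, so the whole
    current value counts as the increase (simplified reset handling).
    """
    if current < previous:
        return current
    return current - previous


def cache_hit_ratio(hits_prev, hits_curr, misses_prev, misses_curr):
    """Hit ratio over the window between two scrapes, or None if idle."""
    hits = counter_increase(hits_prev, hits_curr)
    misses = counter_increase(misses_prev, misses_curr)
    total = hits + misses
    return hits / total if total else None


def pool_utilization(in_use, open_connections):
    """Fraction of open connections in use; gauges are read as-is."""
    return in_use / open_connections if open_connections else None


if __name__ == "__main__":
    # Hypothetical samples taken from two consecutive scrapes.
    print(cache_hit_ratio(100, 190, 20, 30))
    print(pool_utilization(9, 10))
```

The same ratios would normally be computed in PromQL; the Python version is only meant to make the arithmetic explicit.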

If you use the Prometheus Operator, scrape Shoehorn with a ServiceMonitor:

# Automatically configured when monitoring.serviceMonitor.enabled=true
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: shoehorn
spec:
  selector:
    matchLabels:
      app: shoehorn
  endpoints:
    - port: metrics
      interval: 15s

Example Prometheus alert rules:

groups:
  - name: shoehorn
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
      - alert: SlowResponses
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 1s on {{ $labels.service }}"
      - alert: DatabaseConnectionPoolExhausted
        expr: db_connections_in_use / db_connections_open > 0.9
        for: 2m
        labels:
          severity: critical
      - alert: APIServerErrors
        expr: increase(http_5xx_errors_total[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "HTTP 5xx errors detected"
          description: "{{ $value }} server errors in the last 5 minutes on {{ $labels.path }}"
      - alert: APIHighErrorRate
        expr: |
          sum(rate(http_5xx_errors_total[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 1%"
          description: "{{ $value | humanizePercentage }} of requests are failing"
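
The `SlowResponses` rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from individual request timings. The sketch below shows the idea in Python, assuming linear interpolation inside the bucket that contains the target rank. It is a simplified illustration, not Prometheus's actual implementation; the bucket boundaries and counts are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs, as carried
    by the `le` label of a *_bucket series. The last bucket must be +Inf.
    Assumes non-negative observations such as latencies.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The rank falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    if count == below:
        return upper
    # Linear interpolation between the bucket's lower and upper bounds.
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    # Hypothetical latency buckets in seconds: (le, cumulative count).
    sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

With these sample buckets the 95th percentile lands between 0.5 s and 1.0 s, so the rule's `> 1` threshold would not fire. Because the result is interpolated, its precision depends on how finely the buckets are spaced around the threshold.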

Add Shoehorn to your prometheus.yml:

scrape_configs:
  - job_name: shoehorn
    static_configs:
      - targets: ['shoehorn-api:9090']
    scrape_interval: 15s