Prometheus Metrics
All Shoehorn services expose Prometheus-compatible metrics for monitoring and alerting.
Metrics Endpoint
Each service exposes metrics on its metrics port (default: 9090):

    http://<service>:9090/metrics

Key Metrics
HTTP Metrics
| Metric | Type | Description |
|---|---|---|
| `http_requests_total` | Counter | Total HTTP requests, labeled by method, path, and status |
| `http_request_duration_seconds` | Histogram | Request latency distribution |
| `http_requests_in_flight` | Gauge | Requests currently being served |

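
A Histogram such as `http_request_duration_seconds` is exposed as cumulative bucket counters, and latency percentiles are estimated from those buckets rather than measured directly. The sketch below illustrates the kind of linear interpolation PromQL's `histogram_quantile` performs. It is a simplified approximation, not Prometheus's implementation, and the bucket boundaries and counts are made-up illustration values, not Shoehorn's actual buckets.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no quantile to report
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the overflow bucket:
                # report the highest finite bound instead of infinity.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


# Hypothetical buckets: (upper bound in seconds, cumulative request count).
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example_buckets))  # 0.8125
```

Because the estimate interpolates within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.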
gRPC Metrics
| Metric | Type | Description |
|---|---|---|
| `grpc_server_handled_total` | Counter | Total gRPC calls by method and status |
| `grpc_server_handling_seconds` | Histogram | gRPC call latency |

Database Metrics
| Metric | Type | Description |
|---|---|---|
| `db_connections_open` | Gauge | Open database connections |
| `db_connections_in_use` | Gauge | Database connections currently in use |
| `db_query_duration_seconds` | Histogram | Query execution time |

Cache Metrics
| Metric | Type | Description |
|---|---|---|
| `cache_hits_total` | Counter | Cache hit count |
| `cache_misses_total` | Counter | Cache miss count |

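
Both cache metrics are Counters, so the raw values only ever grow. The useful signal is the hit ratio over a window, computed from the increase between two scrapes. The sketch below shows that arithmetic, assuming you already have two samples of each counter. The function names are hypothetical, and the reset handling is a simplified version of what PromQL's `increase()` does.

```python
def counter_increase(previous, current):
    """Return how much a counter grew between two scrapes.

    A counter that went down is assumed to have been reset by a restart,
    so the current value is treated as the growth since that reset.
    """
    if current < previous:
        return current
    return current - previous


def cache_hit_ratio(prev_hits, cur_hits, prev_misses, cur_misses):
    """Return hits / (hits + misses) for the window, or None with no traffic."""
    hits = counter_increase(prev_hits, cur_hits)
    misses = counter_increase(prev_misses, cur_misses)
    total = hits + misses
    if total == 0:
        return None  # no lookups in the window, so the ratio is undefined
    return hits / total


# Hypothetical samples taken one scrape interval apart.
print(cache_hit_ratio(prev_hits=100, cur_hits=190, prev_misses=20, cur_misses=30))  # 0.9
```

Returning `None` for an idle window avoids reporting a misleading 0% or 100% hit ratio when the cache saw no lookups.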
Business Metrics
| Metric | Type | Description |
|---|---|---|
| `entities_total` | Gauge | Total entity count |
| `k8s_agent_pushes_total` | Counter | Agent data push count |
| `search_queries_total` | Counter | Search query count |

Scrape Configuration
If you use the Prometheus Operator, Shoehorn is scraped through a ServiceMonitor:

    # Automatically configured when monitoring.serviceMonitor.enabled=true
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: shoehorn
    spec:
      selector:
        matchLabels:
          app: shoehorn
      endpoints:
        - port: metrics
          interval: 15s

Alert Rules
Example Prometheus alert rules:

    groups:
      - name: shoehorn
        rules:
          - alert: HighErrorRate
            expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High error rate on {{ $labels.service }}"

          - alert: SlowResponses
            expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "P95 latency above 1s on {{ $labels.service }}"

          - alert: DatabaseConnectionPoolExhausted
            expr: db_connections_in_use / db_connections_open > 0.9
            for: 2m
            labels:
              severity: critical

          - alert: APIServerErrors
            expr: increase(http_5xx_errors_total[5m]) > 0
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: "HTTP 5xx errors detected"
              description: "{{ $value }} server errors in the last 5 minutes on {{ $labels.path }}"

          - alert: APIHighErrorRate
            expr: |
              sum(rate(http_5xx_errors_total[5m]))
              / sum(rate(http_requests_total[5m])) > 0.01
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "API error rate above 1%"
              description: "{{ $value | humanizePercentage }} of requests are failing"

Prometheus Scrape Configuration
If you run Prometheus without the operator, add Shoehorn to your prometheus.yml:

    scrape_configs:
      - job_name: shoehorn
        static_configs:
          - targets: ['shoehorn-api:9090']
        scrape_interval: 15s
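
Before pointing Prometheus at a target, it can help to confirm that the endpoint returns parseable metrics. The sketch below fetches a metrics page and groups samples by metric name using only the standard library. The helper names `fetch_metrics` and `parse_metrics` are hypothetical. The parser handles only simple sample lines of the Prometheus text exposition format and keeps label sets as raw strings, so treat it as a smoke check rather than a full parser.

```python
import re
from urllib.request import urlopen

# name, optional {labels}, value, optional timestamp
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$'
)


def fetch_metrics(url, timeout=5.0):
    """Download the raw metrics text from a /metrics endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Return {metric_name: [(raw_label_string, value), ...]}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        name, labels, value = match.groups()
        samples.setdefault(name, []).append((labels or "", float(value)))
    return samples
```

With the static target above, `parse_metrics(fetch_metrics("http://shoehorn-api:9090/metrics"))` should contain keys such as `http_requests_total` if the service is reachable from where the script runs.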