Prometheus Metrics

All Shoehorn services expose Prometheus-compatible metrics for monitoring and alerting.

Metrics Endpoint

Each service exposes metrics on its metrics port (default: 9090):

http://<service>:9090/metrics

Key Metrics

HTTP Metrics

Metric	Type	Description
`http_requests_total`	Counter	Total HTTP requests by method, path, status
`http_request_duration_seconds`	Histogram	Request latency distribution
`http_requests_in_flight`	Gauge	Current active requests

gRPC Metrics

Metric	Type	Description
`grpc_server_handled_total`	Counter	Total gRPC calls by method and status
`grpc_server_handling_seconds`	Histogram	gRPC call latency
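
As a rough sketch of how these two metrics are usually queried, the recording rules below derive a per-method error ratio and P95 latency. The label names `grpc_method` and `grpc_code` are assumptions borrowed from the common go-grpc-prometheus convention; this page only says calls are labeled by method and status, so check the labels your services actually emit. The group and rule names are illustrative as well.

```yaml
groups:
  - name: shoehorn-grpc-recording   # hypothetical group name
    rules:
      # Fraction of calls per method that did not finish with code OK.
      # Assumes labels grpc_method and grpc_code (not confirmed by this page).
      - record: grpc_method:grpc_server_errors:ratio_rate5m
        expr: |
          sum by (grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK"}[5m]))
          /
          sum by (grpc_method) (rate(grpc_server_handled_total[5m]))

      # P95 handling latency per method, estimated from the histogram buckets.
      - record: grpc_method:grpc_server_handling_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (le, grpc_method) (rate(grpc_server_handling_seconds_bucket[5m])))
```

A method with no failed calls in the window produces no series for the ratio rather than a value of 0.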

Database Metrics

Metric	Type	Description
`db_connections_open`	Gauge	Open database connections
`db_connections_in_use`	Gauge	Active database connections
`db_query_duration_seconds`	Histogram	Query execution time
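
Because `db_query_duration_seconds` is a histogram, Prometheus also exposes its `_sum`, `_count`, and `_bucket` series. The sketch below uses them to record mean and P99 query time. It assumes a `service` label is present on these series (the alert annotations on this page reference one); the group and rule names are hypothetical.

```yaml
groups:
  - name: shoehorn-db-recording   # hypothetical group name
    rules:
      # Mean query time over 5 minutes: total seconds spent / number of queries.
      - record: service:db_query_duration_seconds:mean5m
        expr: |
          sum by (service) (rate(db_query_duration_seconds_sum[5m]))
          /
          sum by (service) (rate(db_query_duration_seconds_count[5m]))

      # P99 query time, estimated from the histogram buckets.
      - record: service:db_query_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (rate(db_query_duration_seconds_bucket[5m])))
```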

Cache Metrics

Metric	Type	Description
`cache_hits_total`	Counter	Cache hit count
`cache_misses_total`	Counter	Cache miss count
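
The two counters are most useful combined into a hit ratio. A minimal sketch as a recording rule follows, assuming both counters carry a `service` label (an assumption, as is the rule name):

```yaml
groups:
  - name: shoehorn-cache-recording   # hypothetical group name
    rules:
      # Hit ratio over 5 minutes: hits / (hits + misses), per service.
      - record: service:cache_hit_ratio:rate5m
        expr: |
          sum by (service) (rate(cache_hits_total[5m]))
          /
          (
            sum by (service) (rate(cache_hits_total[5m]))
            +
            sum by (service) (rate(cache_misses_total[5m]))
          )
```

When a service has no cache traffic in the window, the expression evaluates to NaN (0 divided by 0), so dashboards may want to treat NaN as "no data".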

Business Metrics

Metric	Type	Description
`entities_total`	Gauge	Total entity count
`k8s_agent_pushes_total`	Counter	Agent data push count
`search_queries_total`	Counter	Search query count
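
Counters such as `search_queries_total` and `k8s_agent_pushes_total` only ever increase, so they are normally read through `rate()` or `increase()`, while the `entities_total` gauge can be graphed directly. The recording rules below are one possible sketch; the group and rule names are made up for illustration.

```yaml
groups:
  - name: shoehorn-business-recording   # hypothetical group name
    rules:
      # Search queries per second across all instances, averaged over 5 minutes.
      - record: shoehorn:search_queries:rate5m
        expr: sum(rate(search_queries_total[5m]))

      # Number of agent pushes received in the last 15 minutes.
      - record: shoehorn:k8s_agent_pushes:increase15m
        expr: sum(increase(k8s_agent_pushes_total[15m]))
```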

Scrape Configuration

If you run the Prometheus Operator, scraping is configured through a ServiceMonitor:

# Automatically configured when monitoring.serviceMonitor.enabled=true
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: shoehorn
spec:
  selector:
    matchLabels:
      app: shoehorn
  endpoints:
    - port: metrics
      interval: 15s
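
The comment at the top of the manifest names the setting `monitoring.serviceMonitor.enabled`. Assuming this is a Helm chart value (the page does not say so explicitly), the corresponding values file entry would look roughly like this:

```yaml
# values.yaml (assumed to be a Helm values file; key path taken from the comment above)
monitoring:
  serviceMonitor:
    enabled: true
```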

Alert Rules

Example Prometheus alert rules:

groups:
  - name: shoehorn
    rules:
      # Fires when any single series sustains more than 0.1 5xx responses per second.
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"

      # P95 latency per series, estimated from the histogram buckets over 5 minutes.
      - alert: SlowResponses
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 1s on {{ $labels.service }}"

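      # Ratio of in-use to open connections; values near 1 mean almost no idle connections remain.
      - alert: DatabaseConnectionPoolExhausted
        expr: db_connections_in_use / db_connections_open > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool above 90% in use on {{ $labels.service }}"
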
      # Uses the documented http_requests_total counter, filtered to 5xx statuses.
      - alert: APIServerErrors
        expr: increase(http_requests_total{status=~"5.."}[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "HTTP 5xx errors detected"
          description: "{{ $value }} server errors in the last 5 minutes on {{ $labels.path }}"

      # Share of all requests that returned a 5xx status, across every service.
      - alert: APIHighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 1%"
          description: "{{ $value | humanizePercentage }} of requests are failing"

Prometheus Scrape Configuration

If you do not use the Prometheus Operator, add Shoehorn as a static scrape job in your prometheus.yml:

scrape_configs:
  - job_name: shoehorn
    static_configs:
      - targets: ['shoehorn-api:9090']
    scrape_interval: 15s
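
Several alert annotations on this page reference `{{ $labels.service }}`. The page does not say whether the services emit that label themselves. If they do not, one option is to attach it per target in the static configuration, as in this sketch; the label value `shoehorn-api` is an assumption chosen to match the target name.

```yaml
scrape_configs:
  - job_name: shoehorn
    scrape_interval: 15s
    static_configs:
      - targets: ['shoehorn-api:9090']
        labels:
          service: shoehorn-api   # assumed value; attached to every series scraped from this target
```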