Agent Health and Readiness
The K8s agent exposes HTTP endpoints for Kubernetes liveness and readiness probes and for Prometheus metrics scraping. The Helm chart configures the probes automatically.
Endpoints
| Endpoint | Purpose | Probe type |
|---|---|---|
| /healthz | Is the process alive? | Liveness |
| /livez | Alias for /healthz | Liveness |
| /readyz | Is the agent ready to serve traffic? | Readiness |
| /metrics | Prometheus metrics | Scrape target |
All endpoints return JSON and are rate-limited (100 req/s by default).
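To make the rate limit concrete, here is a minimal token-bucket sketch. The source does not say which algorithm the agent uses, so the token bucket itself, the `TokenBucket` class, and the burst size are assumptions chosen for illustration; only the 100 req/s default comes from the text above.

```python
class TokenBucket:
    """Illustrative token bucket; the agent's real limiter may differ."""

    def __init__(self, rate_per_sec: float = 100.0, burst: float = 100.0) -> None:
        self.rate = rate_per_sec   # tokens added per second (documented default: 100)
        self.capacity = burst      # assumed burst size, not documented
        self.tokens = burst        # start with a full bucket
        self.updated = 0.0         # timestamp of the last refill

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        # Spend one token per request; reject when the bucket is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Under these assumptions, a kubelet probing every few seconds or a Prometheus scrape every 30 seconds stays far below the limit; only a tight polling loop would be rejected.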
Liveness
/healthz always returns 200 OK as long as the process is running. It does not check downstream dependencies. If this endpoint stops responding, the kubelet restarts the pod.
Readiness
/readyz reflects whether the agent is able to do useful work. The response depends on the pod’s role.
Leader
A leader pod is marked ready when OnStartedLeading fires. After that, readiness depends on continued operational success: if no successful push or heartbeat has been recorded in the last 5 minutes, the endpoint returns 503 Service Unavailable with status degraded. This removes the pod from the Service’s endpoint set, which is the correct signal to Kubernetes that the agent is not functioning.
Once the pusher or heartbeat records a success, the pod returns to 200 OK with status ready.
Follower
A follower pod is marked ready when leader election determines it is not the leader. Followers do not run the pusher or heartbeat, so they never record operational successes. To avoid false degradation, followers skip the 5-minute success check entirely. A follower remains ready as long as the process is alive.
This means follower pods stay in the Service’s endpoint set indefinitely, which is the desired behavior: they are healthy standby replicas waiting for failover.
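The leader and follower rules above can be sketched as a small state holder. This is a hypothetical model, not the agent's actual code: the `Readiness` class, its method names, and the choice to start the 5-minute window when leadership is acquired are all assumptions. Only the statuses, HTTP codes, and the 5-minute threshold come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

SUCCESS_WINDOW_SECONDS = 5 * 60  # the 5-minute window described above


@dataclass
class Readiness:
    """Hypothetical model of the /readyz decision; not the agent's real code."""
    role: Optional[str] = None            # None until leader election resolves
    last_success: Optional[float] = None  # time of last push/heartbeat success

    def on_started_leading(self, now: float) -> None:
        # Assumption: the success window starts when leadership is acquired.
        self.role = "leader"
        self.last_success = now

    def on_new_leader(self) -> None:
        # Another pod won the election, so this pod is a standby follower.
        self.role = "follower"

    def record_success(self, now: float) -> None:
        # Called by the pusher or heartbeat after a successful operation.
        self.last_success = now

    def check(self, now: float) -> Tuple[int, str]:
        """Return (HTTP status code, status string) for /readyz."""
        if self.role is None:
            return 503, "not_ready"
        if self.role == "follower":
            return 200, "ready"  # followers skip the success check
        if self.last_success is None or now - self.last_success > SUCCESS_WINDOW_SECONDS:
            return 503, "degraded"
        return 200, "ready"
```

Passing the clock in as `now` keeps the sketch deterministic: a leader that last succeeded at time 0 is still ready at 300 seconds and degraded just after, while a follower stays ready regardless of elapsed time.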
Readiness State Machine

    Pod starts
        |
        v
    [not_ready] -- 503
        |
        |-- OnStartedLeading --> [ready (leader)] -- 200
        |                              |
        |                              +-- no success in 5m --> [degraded] -- 503
        |                              |                            |
        |                              +------- success recorded ---+
        |
        +-- OnNewLeader (someone else) --> [ready (follower)] -- 200 (indefinitely)

Prometheus Metrics
The /metrics endpoint exposes standard Prometheus metrics. All agent metrics live under the shoehorn_ namespace. Key agent-specific metrics:
| Metric | Type | Description |
|---|---|---|
| shoehorn_health_checks_total | Counter | Health check requests by endpoint and status |
| shoehorn_health_rate_limit_hits_total | Counter | Rate limit rejections by endpoint |
| shoehorn_k8s_watcher_events_total{kind,type} | Counter | Kubernetes events processed by the watcher |
| shoehorn_k8s_watcher_events_dropped_total | Counter | Events dropped because the event channel was full |
| shoehorn_k8s_watcher_events_filtered_total{reason} | Counter | Events filtered by namespace, label, or rate limit |
| shoehorn_k8s_watcher_drop_rate | Gauge | Current event drop rate (per second, last minute) |
| shoehorn_batcher_batches_dropped_total{reason} | Counter | Batches the pusher could not accept in time |
| shoehorn_pusher_payload_size_too_large_total | Counter | Payloads rejected for exceeding the size limit |
| shoehorn_pusher_payload_size_bytes | Histogram | Payload size distribution |
The Helm chart can create a ServiceMonitor for automatic Prometheus scraping.
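For ad-hoc checks without a Prometheus server, the counters can be read straight from a scrape. The sketch below assumes the scrape body uses the standard Prometheus text exposition format; `sum_counters` is a hypothetical helper, and it sums every sample of a metric across its label sets, which is meaningful for counters but not for histogram series.

```python
from typing import Dict


def sum_counters(exposition: str, prefix: str = "shoehorn_") -> Dict[str, float]:
    """Sum each metric's samples across label sets, keyed by metric name."""
    totals: Dict[str, float] = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample looks like: name{label="x"} value [timestamp]
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            name, _, rest = line.partition(" ")
        if not name.startswith(prefix):
            continue  # ignore runtime and other non-agent metrics
        fields = rest.split()
        if not fields:
            continue  # malformed sample without a value
        totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals
```

A non-zero total for shoehorn_k8s_watcher_events_dropped_total in the result is the same condition the first alert rule below watches for, observed at a single point in time rather than as a rate.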
Example alert rules

    groups:
      - name: shoehorn-k8s-agent
        rules:
          - alert: ShoehornAgentDroppingEvents
            expr: rate(shoehorn_k8s_watcher_events_dropped_total[5m]) > 0
            for: 10m
            annotations:
              summary: K8s agent is dropping events
              description: |
                The agent's event channel is overflowing. Raise
                agent.eventChannelSize or stretch agent.resyncPeriod
                (in the Helm chart) and redeploy.
          - alert: ShoehornAgentBatchesDropped
            expr: increase(shoehorn_batcher_batches_dropped_total[15m]) > 0
            annotations:
              summary: K8s agent is dropping batches before they reach the platform
          - alert: ShoehornReconcileGuardSkipping
            expr: rate(shoehorn_k8s_reconcile_skipped_total[30m]) > 0
            for: 1h
            annotations:
              summary: Reconciliation sweeps repeatedly skipped by a guard
              description: |
                The platform's partial-sync or empty-sync guard is firing.
                Stale catalogue rows may be accumulating. Check the agent's
                drop counters.

Monitoring Considerations
- Leader degradation alerts: If the leader pod enters degraded state, it means pushes and heartbeats have stopped succeeding. Investigate API connectivity, token validity, and circuit breaker state in the logs.
- Frequent leader transitions: If OnNewLeader fires repeatedly in logs, check for resource pressure, pod evictions, or network partitions affecting the leader election lease.
- All pods not-ready: If all replicas show not_ready, leader election may not be completing. Check that the agent has permission to create Leases in the configured namespace.
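The status-based checks in this list can be expressed as a small lookup, which may be handy in a runbook script. This is a sketch: `triage` is a hypothetical helper, and it assumes you have already collected the /readyz status string from each replica. Leader transitions are not covered because they show up in logs rather than in /readyz.

```python
from typing import List


def triage(statuses: List[str]) -> str:
    """Map per-replica /readyz statuses to the first check suggested above."""
    if statuses and all(s == "not_ready" for s in statuses):
        # No replica has resolved its role, so election is likely stuck.
        return "check Lease permissions: leader election may not be completing"
    if "degraded" in statuses:
        # Only a leader can be degraded; pushes and heartbeats are failing.
        return "check API connectivity, token validity, and circuit breaker state"
    return "no action suggested"
```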