Agent Health and Readiness

The K8s agent exposes HTTP endpoints for Kubernetes liveness, readiness, and metrics probes. The Helm chart configures these probes automatically.

| Endpoint | Purpose | Probe type |
| --- | --- | --- |
| `/healthz` | Is the process alive? | Liveness |
| `/livez` | Alias for `/healthz` | Liveness |
| `/readyz` | Is the agent ready to serve traffic? | Readiness |
| `/metrics` | Prometheus metrics | Scrape target |

All endpoints return JSON and are rate-limited (100 req/s by default).

/healthz always returns 200 OK as long as the process is running. It does not check downstream dependencies. If this endpoint stops responding, the kubelet restarts the pod.
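The Helm chart wires these endpoints into the pod spec automatically. Configured by hand, the probes would look roughly like this — the port and timing values below are illustrative, not the chart's actual defaults:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080           # illustrative; use the agent's actual HTTP port
  periodSeconds: 10
  failureThreshold: 3    # kubelet restarts the pod after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3    # pod is removed from Service endpoints, not restarted
```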

/readyz reflects whether the agent is able to do useful work. The response depends on the pod’s role.

A leader pod is marked ready when OnStartedLeading fires. After that, readiness depends on continued operational success: if no successful push or heartbeat has been recorded in the last 5 minutes, the endpoint returns 503 Service Unavailable with status degraded. This removes the pod from the Service’s endpoint set, which is the correct signal to Kubernetes that the agent is not functioning.

Once the pusher or heartbeat records a success, the pod returns to 200 OK with status ready.

A follower pod is marked ready when leader election determines it is not the leader. Followers do not run the pusher or heartbeat, so they never record operational successes. To avoid false degradation, followers skip the 5-minute success check entirely. A follower remains ready as long as the process is alive.

This means follower pods stay in the Service’s endpoint set indefinitely, which is the desired behavior — they are healthy standby replicas waiting for failover.

```
Pod starts
    |
    v
[not_ready] -- 503
    |
    |-- OnStartedLeading --> [ready (leader)] -- 200
    |                              |
    |                              +-- no success in 5m --> [degraded] -- 503
    |                              |                             |
    |                              +------ success recorded -----+
    |
    +-- OnNewLeader (someone else) --> [ready (follower)] -- 200 (indefinitely)
```

The /metrics endpoint exposes standard Prometheus metrics. Key agent-specific metrics:

| Metric | Type | Description |
| --- | --- | --- |
| `health_checks_total` | Counter | Health check requests by endpoint and status |
| `health_rate_limit_hits_total` | Counter | Rate limit rejections by endpoint |

The Helm chart can create a ServiceMonitor for automatic Prometheus scraping.

Troubleshooting

  • Leader degradation alerts: If the leader pod enters degraded state, pushes and heartbeats have stopped succeeding. Investigate API connectivity, token validity, and circuit breaker state in the logs.
  • Frequent leader transitions: If OnNewLeader fires repeatedly in the logs, check for resource pressure, pod evictions, or network partitions affecting the leader election lease.
  • All pods not-ready: If all replicas show not_ready, leader election may not be completing. Check that the agent has permission to create Leases in the configured namespace.
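Because `health_checks_total` is labeled by endpoint and status, the degraded-leader condition can be alerted on directly. A sketch of a Prometheus alerting rule, assuming the `status` label carries the HTTP status code (the alert name and annotations are illustrative):

```yaml
groups:
  - name: agent-health
    rules:
      - alert: AgentLeaderDegraded
        # /readyz answering 503 on the leader means pushes and
        # heartbeats have stopped succeeding for 5+ minutes.
        expr: increase(health_checks_total{endpoint="/readyz", status="503"}[5m]) > 0
        for: 5m
        annotations:
          summary: "Agent /readyz returning 503; investigate API connectivity and token validity"
```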