Agent Health and Readiness
The K8s agent exposes HTTP endpoints for Kubernetes liveness and readiness probes and for Prometheus metrics scraping. The Helm chart configures the probes automatically.
Endpoints
| Endpoint | Purpose | Probe type |
|---|---|---|
| /healthz | Is the process alive? | Liveness |
| /livez | Alias for /healthz | Liveness |
| /readyz | Is the agent ready to serve traffic? | Readiness |
| /metrics | Prometheus metrics | Scrape target |
All endpoints return JSON and are rate-limited (100 req/s by default).
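The source does not say which algorithm enforces the 100 req/s limit. As an illustration only, the sketch below uses a token bucket, one common way to implement a per-second cap. The class name `TokenBucket`, the `burst` parameter, and the injectable `clock` are all hypothetical and are not taken from the agent.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `burst`.

    Illustrative only: the agent's actual limiter is not documented here.
    """

    def __init__(self, rate=100.0, burst=100.0, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.burst = burst        # maximum tokens the bucket can hold
        self.clock = clock        # injectable so the limiter can be tested
        self.tokens = burst       # start with a full bucket
        self.updated = clock()

    def allow(self):
        """Return True if the request may proceed, False if it is rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With the defaults above, a burst of requests arriving at the same instant is admitted until the bucket is empty, and further requests are rejected until enough time has passed to refill it.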
Liveness
/healthz always returns 200 OK as long as the process is running. It does not check downstream dependencies. If this endpoint stops responding, the liveness probe fails and the kubelet restarts the container.
Readiness
/readyz reflects whether the agent is able to do useful work. The response depends on the pod’s role.
Leader
A leader pod is marked ready when OnStartedLeading fires. After that, readiness depends on continued operational success: if no successful push or heartbeat has been recorded in the last 5 minutes, the endpoint returns 503 Service Unavailable with status degraded. This removes the pod from the Service’s endpoint set, which is the correct signal to Kubernetes that the agent is not functioning.
Once the pusher or heartbeat records a success, the pod returns to 200 OK with status ready.
Follower
A follower pod is marked ready when leader election determines it is not the leader. Followers do not run the pusher or heartbeat, so they never record operational successes. To avoid false degradation, followers skip the 5-minute success check entirely. A follower remains ready as long as the process is alive.
This means follower pods stay in the Service’s endpoint set indefinitely, which is the desired behavior — they are healthy standby replicas waiting for failover.
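The leader and follower rules above can be condensed into one decision function. The following is a minimal sketch, not the agent's code: `Role`, `readyz`, `leader_since`, and `last_success` are hypothetical names. It also assumes that a new leader's 5-minute window starts when OnStartedLeading fires, and that a gap of exactly 5 minutes still counts as ready. The source specifies neither detail.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional, Tuple

SUCCESS_WINDOW = timedelta(minutes=5)


class Role(Enum):
    UNKNOWN = "unknown"    # leader election has not resolved yet
    LEADER = "leader"      # OnStartedLeading has fired
    FOLLOWER = "follower"  # another pod holds the lease


def readyz(
    role: Role,
    now: datetime,
    leader_since: Optional[datetime] = None,
    last_success: Optional[datetime] = None,
) -> Tuple[int, str]:
    """Return (HTTP status code, status string) for a /readyz request."""
    if role is Role.UNKNOWN:
        return 503, "not_ready"
    if role is Role.FOLLOWER:
        # Followers never push or heartbeat, so the success check is skipped.
        return 200, "ready"
    if leader_since is None:
        raise ValueError("leader_since is required for a leader")
    # Assumption: the window is measured from the later of "became leader"
    # and "last successful push or heartbeat".
    baseline = leader_since if last_success is None else max(leader_since, last_success)
    if now - baseline > SUCCESS_WINDOW:
        return 503, "degraded"
    return 200, "ready"
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the rule easy to test against fixed timestamps.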
Readiness State Machine

    Pod starts
        |
        v
    [not_ready] -- 503
        |
        |-- OnStartedLeading --> [ready (leader)] -- 200
        |                              |
        |                              +-- no success in 5m --> [degraded] -- 503
        |                              |                            |
        |                              +------- success recorded ---+
        |
        +-- OnNewLeader (someone else) --> [ready (follower)] -- 200 (indefinitely)

Prometheus Metrics
The /metrics endpoint exposes standard Prometheus metrics. Key agent-specific metrics:
| Metric | Type | Description |
|---|---|---|
| health_checks_total | Counter | Health check requests by endpoint and status |
| health_rate_limit_hits_total | Counter | Rate limit rejections by endpoint |
The Helm chart can create a ServiceMonitor for automatic Prometheus scraping.
Monitoring Considerations
- Leader degradation alerts: If the leader pod enters the degraded state, it means pushes and heartbeats have stopped succeeding. Investigate API connectivity, token validity, and circuit breaker state in the logs.
- Frequent leader transitions: If OnNewLeader fires repeatedly in logs, check for resource pressure, pod evictions, or network partitions affecting the leader election lease.
- All pods not-ready: If all replicas show not_ready, leader election may not be completing. Check that the agent has permission to create Leases in the configured namespace.
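The checks in this list could be encoded in a small triage helper, for example inside an on-call script. This is a sketch under stated assumptions: the function `triage`, its inputs, and the `max_leader_changes` threshold of 3 are hypothetical. How the per-replica statuses and the leader-change count are collected (probe results, logs, or metrics) is left to the caller.

```python
def triage(statuses, leader_changes=0, max_leader_changes=3):
    """Map observed symptoms to the investigation hints listed above.

    statuses: /readyz status strings, one per replica
              ("ready", "degraded", or "not_ready").
    leader_changes: number of OnNewLeader events seen in the observation
                    window (the window length is the caller's choice).
    max_leader_changes: hypothetical threshold for "frequent" transitions.
    """
    hints = []
    if statuses and all(s == "not_ready" for s in statuses):
        hints.append(
            "leader election may not be completing: "
            "check permission to create Leases in the configured namespace"
        )
    if "degraded" in statuses:
        # Only a leader can be degraded; followers skip the success check.
        hints.append(
            "leader degraded: "
            "check API connectivity, token validity, and circuit breaker state"
        )
    if leader_changes > max_leader_changes:
        hints.append(
            "frequent leader transitions: "
            "check resource pressure, pod evictions, and network partitions"
        )
    return hints
```

For instance, `triage(["degraded", "ready", "ready"])` returns only the leader-degradation hint, while a healthy set of replicas with few leader changes returns an empty list.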