K8s Agent Reference
The Shoehorn K8s Agent is a lightweight, push-based agent that discovers Kubernetes workloads and syncs them to the Shoehorn catalog. It runs inside each cluster you want to monitor.
For operational considerations and installation, see K8s Agent.
Architecture
```
Kubernetes Cluster
├── Watcher (client-go informers)
│   ├── Deployments, StatefulSets, DaemonSets
│   ├── CronJobs, Jobs, Services, Ingresses, Pods
│   ├── Events, Namespaces, NetworkPolicies
│   └── Filters: namespace, label selector, annotations
│
├── GitOps Watcher (optional)
│   ├── ArgoCD Applications
│   ├── FluxCD Kustomizations, HelmReleases
│   └── Event cache (max 20 events, last 1h)
│
├── Metrics Collector (optional)
│   ├── Samples metrics-server every 5m
│   ├── 7-day rolling window (p95, max, avg)
│   └── Graceful degradation if unavailable
│
├── Batcher
│   ├── Time-based flush: every 30s
│   ├── Size-based flush: at 100 events
│   └── 2000-event channel buffer
│
├── Pusher
│   ├── POST /api/v1/k8s/agents/push
│   ├── Circuit breaker (5 failures, 60s recovery)
│   ├── Retry with exponential backoff (3 attempts)
│   └── Max payload: 10MB
│
├── Heartbeat
│   └── POST /api/v1/k8s/agents/heartbeat (every 5m)
│
└── Leader Election (HA mode)
    ├── Kubernetes Leases (coordination.k8s.io)
    └── Only leader runs watchers and pushers
```

Configuration
All settings are environment variables prefixed with `SHOEHORN_`.
Required
| Variable | Description |
|---|---|
| `SHOEHORN_API_ENDPOINT` | Shoehorn API URL |
| `SHOEHORN_API_TOKEN` | Agent bearer token (from cluster registration) |
| `SHOEHORN_CLUSTER_ID` | Unique cluster slug (lowercase, hyphens allowed) |
| `SHOEHORN_CLUSTER_NAME` | Display name (defaults to cluster ID) |
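
To illustrate how these settings fit together, here is a minimal Python sketch of loading and validating them. It is not the agent's own code. The function name `load_required`, the returned key names, and the exact slug pattern (lowercase letters, digits, and hyphens) are assumptions made for illustration; the source only states "lowercase, hyphens allowed".

```python
import re

# The three variables without a documented default.
REQUIRED = ("SHOEHORN_API_ENDPOINT", "SHOEHORN_API_TOKEN", "SHOEHORN_CLUSTER_ID")

# Assumed slug pattern: lowercase letters and digits separated by single hyphens.
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def load_required(env):
    """Read the required settings from a mapping such as os.environ."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))

    cluster_id = env["SHOEHORN_CLUSTER_ID"]
    if not SLUG_RE.match(cluster_id):
        raise ValueError(f"invalid cluster ID: {cluster_id!r}")

    return {
        "api_endpoint": env["SHOEHORN_API_ENDPOINT"],
        "api_token": env["SHOEHORN_API_TOKEN"],
        "cluster_id": cluster_id,
        # The display name falls back to the cluster ID when unset.
        "cluster_name": env.get("SHOEHORN_CLUSTER_NAME") or cluster_id,
    }
```

Passing `os.environ` as `env` reads the process environment; passing a plain dict keeps the function easy to test.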
Agent Behavior
| Variable | Default | Range | Description |
|---|---|---|---|
| `SHOEHORN_BATCH_INTERVAL` | 30s | 1s-10m | Event batch flush interval |
| `SHOEHORN_BATCH_SIZE` | 100 | 1-10000 | Max events per batch |
| `SHOEHORN_PUSH_RETRIES` | 3 | 1-10 | Retry attempts on push failure |
| `SHOEHORN_PUSH_TIMEOUT` | 30s | 1s-5m | HTTP timeout for push |
| `SHOEHORN_HEARTBEAT_INTERVAL` | 5m | 1m-30m | Heartbeat frequency |
| `SHOEHORN_HEALTH_PORT` | 8080 | - | Health check server port |
| `SHOEHORN_LOG_LEVEL` | info | debug/info/warn/error | Log verbosity |
| `SHOEHORN_LOG_FORMAT` | json | json/console | Log output format |
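
The interaction between `SHOEHORN_BATCH_INTERVAL` and `SHOEHORN_BATCH_SIZE` can be sketched as a batcher that flushes on whichever limit is hit first. This is a simplified Python model under stated assumptions, not the agent's implementation: the class name `Batcher` and its methods are hypothetical, time is passed in explicitly so the behavior is deterministic, and the channel buffer is not modeled.

```python
class Batcher:
    """Collects events and hands back a batch when a flush is due."""

    def __init__(self, batch_size=100, batch_interval=30.0):
        self.batch_size = batch_size          # size-based flush threshold
        self.batch_interval = batch_interval  # time-based flush threshold, in seconds
        self.events = []
        self.last_flush = 0.0

    def add(self, event, now):
        """Queue an event; return a batch if the size limit was reached."""
        self.events.append(event)
        if len(self.events) >= self.batch_size:
            return self._flush(now)
        return None

    def tick(self, now):
        """Call periodically; return a batch if the interval has elapsed."""
        if self.events and now - self.last_flush >= self.batch_interval:
            return self._flush(now)
        return None

    def _flush(self, now):
        # Swap out the pending list and restart the interval timer.
        batch, self.events = self.events, []
        self.last_flush = now
        return batch
```

In this model a size-based flush also resets the interval timer, so a busy cluster flushes mostly on size and a quiet one mostly on time. Whether the agent resets its timer the same way is an assumption.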
Namespace Filtering
| Variable | Default | Description |
|---|---|---|
| `SHOEHORN_NAMESPACES` | - (all) | Whitelist (comma-separated) |
| `SHOEHORN_EXCLUDE_NAMESPACES` | - | Blacklist (comma-separated) |
| `SHOEHORN_LABEL_SELECTOR` | - | Kubernetes label selector |
| `SHOEHORN_WATCHED_KINDS` | - | Specific resource kinds |
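
As a rough illustration of how the whitelist and blacklist might combine, the Python sketch below parses both comma-separated values and decides whether a namespace is watched. The function names are hypothetical, and the precedence rule (an excluded namespace is dropped even if it is also whitelisted) is an assumption, since the table does not say which list wins.

```python
def parse_list(value):
    """Split a comma-separated setting into trimmed, non-empty items."""
    return [item.strip() for item in (value or "").split(",") if item.strip()]


def namespace_allowed(namespace, include="", exclude=""):
    """Return True if the namespace passes both filters.

    include mirrors SHOEHORN_NAMESPACES (empty means all namespaces).
    exclude mirrors SHOEHORN_EXCLUDE_NAMESPACES.
    """
    included = parse_list(include)
    excluded = parse_list(exclude)
    # Assumption: exclusion takes precedence over inclusion.
    if namespace in excluded:
        return False
    return not included or namespace in included
```

For example, `namespace_allowed("kube-system", exclude="kube-system,kube-public")` returns `False`, while an empty `include` lets every other namespace through.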
Monitoring Control
| Variable | Default | Description |
|---|---|---|
| `SHOEHORN_ANNOTATION_DEFAULT_BEHAVIOR` | monitor-all | One of monitor-all, require-annotation, monitor-none |
| `SHOEHORN_ANNOTATION_DEFAULT_LEVEL` | basic | Default monitoring level |
Monitoring Levels
Per-resource annotation: `shoehorn.dev/monitoring-level`

| Level | Collected Data |
|---|---|
| `basic` | Workload status, replicas, image info |
| `detailed` | Basic + restart counts, container states, pod events |
| `full` | Detailed + CPU/memory usage, resource limits, QoS class |
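
One way to picture how the default behavior, the default level, and the per-resource annotations interact is the Python sketch below. The semantics are assumptions inferred from the setting names: `monitor-all` watches everything unless `shoehorn.dev/monitor` is `false`, `require-annotation` watches only resources with `shoehorn.dev/monitor` set to `true`, and `monitor-none` ignores annotations entirely. The function name and the fallback for unrecognized levels are also hypothetical.

```python
BEHAVIORS = ("monitor-all", "require-annotation", "monitor-none")
LEVELS = ("basic", "detailed", "full")


def monitoring_level(annotations, default_behavior="monitor-all", default_level="basic"):
    """Return the effective monitoring level, or None if not monitored."""
    if default_behavior not in BEHAVIORS:
        raise ValueError(f"unknown behavior: {default_behavior!r}")
    if default_behavior == "monitor-none":
        return None

    flag = annotations.get("shoehorn.dev/monitor", "").lower()
    if default_behavior == "require-annotation" and flag != "true":
        return None  # opt-in mode: no explicit opt-in found
    if default_behavior == "monitor-all" and flag == "false":
        return None  # opt-out mode: resource explicitly opted out

    level = annotations.get("shoehorn.dev/monitoring-level", default_level)
    # Assumption: an unrecognized level falls back to the default.
    return level if level in LEVELS else default_level
```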
Metrics Collection
| Variable | Default | Range | Description |
|---|---|---|---|
| `SHOEHORN_METRICS_SAMPLE_INTERVAL` | 5m | 1m-30m | Metrics-server sample rate |
| `SHOEHORN_METRICS_WINDOW_HOURS` | 168 (7d) | 1-720 | Rolling window size in hours |
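
With the defaults, a 168-hour window sampled every 5 minutes holds 168 × 60 / 5 = 2016 samples per series. The Python sketch below works through that arithmetic and shows one way to compute the p95, max, and avg summary over a rolling window. The class and function names are hypothetical, and the nearest-rank percentile method is an assumption; the agent may compute p95 differently.

```python
from collections import deque


def window_capacity(window_hours=168, sample_interval_minutes=5):
    """Number of samples a full window holds."""
    return window_hours * 60 // sample_interval_minutes


class RollingStats:
    """Fixed-size window that drops the oldest sample when full."""

    def __init__(self, capacity):
        self.samples = deque(maxlen=capacity)

    def add(self, value):
        self.samples.append(value)

    def summary(self):
        """Return p95, max, and avg, or None when no samples exist."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        # Nearest-rank p95 using integer arithmetic: ceil(0.95 * n).
        rank = (95 * len(ordered) + 99) // 100
        return {
            "p95": ordered[rank - 1],
            "max": ordered[-1],
            "avg": sum(ordered) / len(ordered),
        }
```

Returning `None` for an empty window is one simple way to model degraded operation when metrics-server has not produced any samples yet.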
GitOps Integration
| Variable | Default | Description |
|---|---|---|
| `SHOEHORN_GITOPS_TOOL` | - (disabled) | argocd or fluxcd |
| `SHOEHORN_GITOPS_ARGOCD_NAMESPACE` | argocd | ArgoCD install namespace |
| `SHOEHORN_GITOPS_ARGOCD_SERVER_URL` | - | ArgoCD server URL (for UI links) |
| `SHOEHORN_GITOPS_ARGOCD_TOKEN` | - | ArgoCD API token |
| `SHOEHORN_GITOPS_WATCH_ALL_NAMESPACES` | false | Watch GitOps CRDs cluster-wide |
| `SHOEHORN_GITOPS_COMMAND_POLL_INTERVAL` | 30s | Command polling frequency |
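
The architecture overview notes that the GitOps watcher keeps an event cache bounded to 20 events from the last hour. A bounded cache of that shape could look like the Python sketch below. The class name, the explicit timestamps, and the choice of a single shared cache (rather than one per application) are assumptions for illustration.

```python
from collections import deque


class EventCache:
    """Keeps at most max_events entries and hides those older than max_age_seconds."""

    def __init__(self, max_events=20, max_age_seconds=3600):
        self.max_age_seconds = max_age_seconds
        # deque(maxlen=...) discards the oldest entry once the cache is full.
        self.events = deque(maxlen=max_events)

    def add(self, event, timestamp):
        self.events.append((timestamp, event))

    def recent(self, now):
        """Return cached events that are still inside the age window."""
        cutoff = now - self.max_age_seconds
        return [event for ts, event in self.events if ts >= cutoff]
```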
RBAC Permissions
The Helm chart creates a ClusterRole with these permissions:
Core (Always Required)
```yaml
# Workloads
- apiGroups: ["apps"]
  resources: [deployments, statefulsets, daemonsets]
  verbs: [get, list, watch]

# Batch
- apiGroups: ["batch"]
  resources: [cronjobs, jobs]
  verbs: [get, list, watch]

# Core resources
- apiGroups: [""]
  resources: [namespaces, pods, services, events]
  verbs: [get, list, watch]

# Networking
- apiGroups: ["networking.k8s.io"]
  resources: [ingresses, networkpolicies]
  verbs: [get, list, watch]

# Cilium (if present)
- apiGroups: ["cilium.io"]
  resources: [ciliumnetworkpolicies, ciliumclusterwidenetworkpolicies]
  verbs: [get, list, watch]

# Metrics
- apiGroups: ["metrics.k8s.io"]
  resources: [pods]
  verbs: [get, list]

# Leader election
- apiGroups: ["coordination.k8s.io"]
  resources: [leases]
  verbs: [get, list, watch, create, update, patch]
```

ArgoCD (When Enabled)
```yaml
- apiGroups: ["argoproj.io"]
  resources: [applications]
  verbs: [get, list, watch]
```

FluxCD (When Enabled)
```yaml
- apiGroups: ["kustomize.toolkit.fluxcd.io"]
  resources: [kustomizations]
  verbs: [get, list, watch, patch]

- apiGroups: ["helm.toolkit.fluxcd.io"]
  resources: [helmreleases]
  verbs: [get, list, watch, patch]

- apiGroups: ["source.toolkit.fluxcd.io"]
  resources: [gitrepositories, helmrepositories]
  verbs: [get, list, watch]
```

Team Ownership Inference
The agent infers team ownership for discovered workloads, checked in this order:
- Annotation `shoehorn.dev/team` on the workload
- Annotation `shoehorn.dev/owner` on the workload
- Label `owner` on the workload
- Label `shoehorn.dev/team` on the namespace
- Namespace name pattern extraction (e.g., `payments-prod` -> `payments`)
- Default: `unassigned`
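
The lookup order above can be sketched as a first-match chain. The Python below is an illustrative model, not the agent's code: the function name `infer_team` is hypothetical, and the set of environment suffixes stripped from namespace names is an assumption, since only the `payments-prod` example is documented.

```python
import re

# Assumed environment suffixes; the agent's real extraction pattern is not documented here.
ENV_SUFFIX = re.compile(r"-(prod|staging|dev|test)$")


def infer_team(workload_annotations, workload_labels, namespace_name, namespace_labels):
    """Return the first ownership hint found, following the documented order."""
    candidates = (
        workload_annotations.get("shoehorn.dev/team"),
        workload_annotations.get("shoehorn.dev/owner"),
        workload_labels.get("owner"),
        namespace_labels.get("shoehorn.dev/team"),
    )
    for value in candidates:
        if value:
            return value

    # Fall back to the namespace name with its environment suffix removed.
    match = ENV_SUFFIX.search(namespace_name)
    if match:
        return namespace_name[: match.start()]

    return "unassigned"
```

A workload-level annotation always wins in this model, so a single service can override the team assigned to its namespace.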
Label namespaces for zero-config team assignment:
```shell
kubectl label namespace payments shoehorn.dev/team=payments-team
```

Annotations Reference
| Annotation | Description |
|---|---|
| `shoehorn.dev/monitor` | true/false - opt in or out of monitoring |
| `shoehorn.dev/monitoring-level` | basic, detailed, or full |
| `shoehorn.dev/team` | Team slug for ownership |
| `shoehorn.dev/owner` | Alternative to `shoehorn.dev/team` |
| `shoehorn.dev/entityFile` | Path to `.shoehorn/catalog.yaml` for entity enrichment |
See Annotations Reference for the complete list.
Health Endpoints
| Endpoint | Purpose |
|---|---|
| `/healthz` | Liveness probe (checks leader eligibility) |
| `/readyz` | Readiness probe (checks leader status + push health) |
| `/livez` | Live status |
| `/metrics` | Prometheus metrics |
See Agent Health for the full readiness model.
High Availability
For production clusters, run 2-3 replicas with leader election:

```yaml
replicaCount: 3
leaderElection:
  enabled: true
podDisruptionBudget:
  minAvailable: 2
```

Only the leader runs watchers and pushes data. Followers are hot standbys that pass readiness probes and take over within seconds if the leader dies.
Security
- Read-only RBAC: no create/update/delete on workloads
- Non-root container: runs as UID 1000, drops all capabilities
- Read-only filesystem: `readOnlyRootFilesystem: true`
- HTTPS enforced: warns if API endpoint uses `http://`
- Token redaction: API token cannot appear in logs or traces
- No redirect following: prevents bearer token leakage
- Annotation sanitization: strips `kubectl.kubernetes.io/last-applied-configuration`
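
Two of these safeguards are simple enough to sketch in a few lines of Python. The helpers below are hypothetical illustrations rather than the agent's code: one drops the `kubectl.kubernetes.io/last-applied-configuration` annotation before data leaves the cluster, and the other detects an API endpoint that uses plain HTTP so a warning can be logged.

```python
from urllib.parse import urlsplit

# Annotations removed before a resource is pushed.
STRIPPED_ANNOTATIONS = frozenset({"kubectl.kubernetes.io/last-applied-configuration"})


def sanitize_annotations(annotations):
    """Return a copy of the annotations without the stripped keys."""
    return {k: v for k, v in annotations.items() if k not in STRIPPED_ANNOTATIONS}


def uses_plain_http(endpoint):
    """Return True when the endpoint URL would send the token unencrypted."""
    return urlsplit(endpoint).scheme == "http"
```

Stripping `last-applied-configuration` matters because it holds a full copy of the manifest as last applied, which can be large and may contain values that should not leave the cluster.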
See Also
- K8s Agent - Operational overview and installation
- Registering Clusters - Cluster registration workflow
- Entity Enrichment - Enrichment modes
- Annotations Reference - All supported annotations
- GitOps Integration - ArgoCD and FluxCD support
- Agent Health - Health and readiness model
- Network Observer - eBPF network topology