
K8s Agent Reference

The Shoehorn K8s Agent is a lightweight, push-based agent that discovers Kubernetes workloads and syncs them to the Shoehorn catalog. It runs inside each cluster you want to monitor.

For operational considerations and installation, see K8s Agent.

Kubernetes Cluster
├── Watcher (client-go informers)
│ ├── Deployments, StatefulSets, DaemonSets
│ ├── CronJobs, Jobs, Services, Ingresses, Pods
│ ├── Events, Namespaces, NetworkPolicies
│ └── Filters: namespace, label selector, annotations
├── GitOps Watcher (optional)
│ ├── ArgoCD Applications
│ ├── FluxCD Kustomizations, HelmReleases
│ └── Event cache (max 20 events, last 1h)
├── Metrics Collector (optional)
│ ├── Samples metrics-server every 5m
│ ├── 7-day rolling window (p95, max, avg)
│ └── Graceful degradation if unavailable
├── Batcher
│ ├── Time-based flush: every 30s
│ ├── Size-based flush: at 100 events
│ └── 2000-event channel buffer
├── Pusher
│ ├── POST /api/v1/k8s/agents/push
│ ├── Circuit breaker (5 failures, 60s recovery)
│ ├── Retry with exponential backoff (3 attempts)
│ └── Max payload: 10MB
├── Heartbeat
│ └── POST /api/v1/k8s/agents/heartbeat (every 5m)
└── Leader Election (HA mode)
  ├── Kubernetes Leases (coordination.k8s.io)
  └── Only leader runs watchers and pushers
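To make the Batcher's two flush triggers concrete, here is a minimal Python sketch of a buffer that flushes at 100 events or after 30 seconds, whichever comes first. It illustrates the behavior listed in the diagram and is not the agent's own code: the `Batcher` class, its `tick` method, and the injectable `clock` are hypothetical names, and the 2000-event channel buffer is not modeled.

```python
import time
from typing import Callable, List


class Batcher:
    """Buffers events and flushes on size or elapsed time, whichever comes first."""

    def __init__(self, flush: Callable[[List[dict]], None],
                 max_size: int = 100, interval_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush
        self._max_size = max_size
        self._interval_s = interval_s
        self._clock = clock
        self._events: List[dict] = []
        self._last_flush = clock()

    def add(self, event: dict) -> None:
        """Size-based trigger: flush as soon as the batch is full."""
        self._events.append(event)
        if len(self._events) >= self._max_size:
            self.flush_now()

    def tick(self) -> None:
        """Time-based trigger: call periodically; flushes once the interval has elapsed."""
        if self._clock() - self._last_flush >= self._interval_s:
            self.flush_now()

    def flush_now(self) -> None:
        """Hand the buffered events to the flush callback and restart the timer."""
        batch, self._events = self._events, []
        self._last_flush = self._clock()
        if batch:  # an empty interval sends nothing
            self._flush(batch)
```

Passing the clock in as a parameter keeps the sketch testable without sleeping; a real agent would more likely drive the time-based flush from a ticker.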

All settings are environment variables prefixed with SHOEHORN_.

| Variable | Description |
| --- | --- |
| SHOEHORN_API_ENDPOINT | Shoehorn API URL |
| SHOEHORN_API_TOKEN | Agent bearer token (from cluster registration) |
| SHOEHORN_CLUSTER_ID | Unique cluster slug (lowercase, hyphens allowed) |
| SHOEHORN_CLUSTER_NAME | Display name (defaults to cluster ID) |

| Variable | Default | Range | Description |
| --- | --- | --- | --- |
| SHOEHORN_BATCH_INTERVAL | 30s | 1s-10m | Event batch flush interval |
| SHOEHORN_BATCH_SIZE | 100 | 1-10000 | Max events per batch |
| SHOEHORN_PUSH_RETRIES | 3 | 1-10 | Retry attempts on push failure |
| SHOEHORN_PUSH_TIMEOUT | 30s | 1s-5m | HTTP timeout for push |
| SHOEHORN_HEARTBEAT_INTERVAL | 5m | 1m-30m | Heartbeat frequency |
| SHOEHORN_HEALTH_PORT | 8080 | | Health check server port |
| SHOEHORN_LOG_LEVEL | info | debug/info/warn/error | Log verbosity |
| SHOEHORN_LOG_FORMAT | json | json/console | Log output format |
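The push retry setting interacts with the circuit breaker from the architecture overview (5 failures, 60s recovery). The Python sketch below shows one way the two could combine. It rests on assumptions the reference does not state: that every failed attempt counts toward the breaker threshold, that the backoff starts at 1 second and doubles, and that an open breaker skips the push entirely. `CircuitBreaker` and `push_with_retry` are hypothetical names.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; permits attempts again after `recovery_s`."""

    def __init__(self, threshold: int = 5, recovery_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.recovery_s = recovery_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: permit attempts again once the recovery window has passed.
        return self._clock() - self._opened_at >= self.recovery_s

    def record(self, ok: bool) -> None:
        if ok:
            self._failures, self._opened_at = 0, None
            return
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial


def push_with_retry(send: Callable[[], bool], breaker: CircuitBreaker,
                    attempts: int = 3, base_delay_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try `send` up to `attempts` times, doubling the delay between tries."""
    for attempt in range(attempts):
        if not breaker.allow():
            return False  # circuit open: skip the push instead of hammering the API
        ok = send()
        breaker.record(ok)
        if ok:
            return True
        if attempt < attempts - 1:
            sleep(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s, ... (assumed base delay)
    return False
```

Under these assumptions, two consecutive failed pushes of three attempts each are enough to open the breaker: the fifth failed attempt trips it and the sixth is skipped.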
| Variable | Default | Description |
| --- | --- | --- |
| SHOEHORN_NAMESPACES | (all) | Whitelist (comma-separated) |
| SHOEHORN_EXCLUDE_NAMESPACES | | Blacklist (comma-separated) |
| SHOEHORN_LABEL_SELECTOR | | Kubernetes label selector |
| SHOEHORN_WATCHED_KINDS | | Specific resource kinds |

| Variable | Default | Description |
| --- | --- | --- |
| SHOEHORN_ANNOTATION_DEFAULT_BEHAVIOR | monitor-all | monitor-all, require-annotation, monitor-none |
| SHOEHORN_ANNOTATION_DEFAULT_LEVEL | basic | Default monitoring level |

Per-resource annotation: shoehorn.dev/monitoring-level

| Level | Collected Data |
| --- | --- |
| basic | Workload status, replicas, image info |
| detailed | Basic + restart counts, container states, pod events |
| full | Detailed + CPU/memory usage, resource limits, QoS class |

| Variable | Default | Range | Description |
| --- | --- | --- | --- |
| SHOEHORN_METRICS_SAMPLE_INTERVAL | 5m | 1m-30m | Metrics-server sample rate |
| SHOEHORN_METRICS_WINDOW_HOURS | 168 (7d) | 1-720 | Rolling window size in hours |
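With the defaults above, each metric series holds at most 168 h × 12 samples/h = 2016 samples. The Python sketch below shows one way to keep such a rolling window and derive the p95, max, and avg values named in the architecture overview. It is illustrative only: the `RollingWindow` class is hypothetical, and the nearest-rank percentile is an assumed choice, since the reference does not say which percentile method the agent uses.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class RollingWindow:
    """Keeps (timestamp, value) samples newer than `window_s` seconds."""

    def __init__(self, window_s: float = 168 * 3600) -> None:
        self.window_s = window_s
        self._samples: Deque[Tuple[float, float]] = deque()

    def add(self, ts: float, value: float) -> None:
        """Append a sample and evict everything that has aged out of the window."""
        self._samples.append((ts, value))
        while self._samples and self._samples[0][0] <= ts - self.window_s:
            self._samples.popleft()

    def summary(self) -> Dict[str, float]:
        """Return p95, max, and avg over the retained samples (empty dict if none)."""
        values = sorted(v for _, v in self._samples)
        if not values:
            return {}
        n = len(values)
        # Nearest-rank p95 using integer math: rank = ceil(0.95 * n).
        rank = (95 * n + 99) // 100
        return {
            "p95": values[rank - 1],
            "max": values[-1],
            "avg": sum(values) / n,
        }
```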
| Variable | Default | Description |
| --- | --- | --- |
| SHOEHORN_GITOPS_TOOL | (disabled) | argocd or fluxcd |
| SHOEHORN_GITOPS_ARGOCD_NAMESPACE | argocd | ArgoCD install namespace |
| SHOEHORN_GITOPS_ARGOCD_SERVER_URL | | ArgoCD server URL (for UI links) |
| SHOEHORN_GITOPS_ARGOCD_TOKEN | | ArgoCD API token |
| SHOEHORN_GITOPS_WATCH_ALL_NAMESPACES | false | Watch GitOps CRDs cluster-wide |
| SHOEHORN_GITOPS_COMMAND_POLL_INTERVAL | 30s | Command polling frequency |

The Helm chart creates a ClusterRole with these permissions:

# Workloads
- apiGroups: ["apps"]
  resources: [deployments, statefulsets, daemonsets]
  verbs: [get, list, watch]
# Batch
- apiGroups: ["batch"]
  resources: [cronjobs, jobs]
  verbs: [get, list, watch]
# Core resources
- apiGroups: [""]
  resources: [namespaces, pods, services, events]
  verbs: [get, list, watch]
# Networking
- apiGroups: ["networking.k8s.io"]
  resources: [ingresses, networkpolicies]
  verbs: [get, list, watch]
# Cilium (if present)
- apiGroups: ["cilium.io"]
  resources: [ciliumnetworkpolicies, ciliumclusterwidenetworkpolicies]
  verbs: [get, list, watch]
# Metrics
- apiGroups: ["metrics.k8s.io"]
  resources: [pods]
  verbs: [get, list]
# Leader election
- apiGroups: ["coordination.k8s.io"]
  resources: [leases]
  verbs: [get, list, watch, create, update, patch]
# GitOps: ArgoCD
- apiGroups: ["argoproj.io"]
  resources: [applications]
  verbs: [get, list, watch]
# GitOps: FluxCD
- apiGroups: ["kustomize.toolkit.fluxcd.io"]
  resources: [kustomizations]
  verbs: [get, list, watch, patch]
- apiGroups: ["helm.toolkit.fluxcd.io"]
  resources: [helmreleases]
  verbs: [get, list, watch, patch]
- apiGroups: ["source.toolkit.fluxcd.io"]
  resources: [gitrepositories, helmrepositories]
  verbs: [get, list, watch]

The agent infers team ownership for discovered workloads, checked in this order:

  1. Annotation shoehorn.dev/team on the workload
  2. Annotation shoehorn.dev/owner on the workload
  3. Label owner on the workload
  4. Label shoehorn.dev/team on the namespace
  5. Namespace name pattern extraction (e.g., payments-prod -> payments)
  6. Default: unassigned
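
The lookup order above can be sketched as a single Python function. This is an illustration rather than the agent's implementation: `resolve_team` is a hypothetical name, and the environment-suffix pattern used for step 5 is an assumption, since the reference only gives the `payments-prod -> payments` example.

```python
import re
from typing import Mapping

# Assumed environment suffixes for step 5; the agent's real pattern is not documented here.
_ENV_SUFFIX = re.compile(r"-(prod|production|staging|stage|dev|test|qa)$")


def resolve_team(annotations: Mapping[str, str], labels: Mapping[str, str],
                 namespace: str, namespace_labels: Mapping[str, str]) -> str:
    """Return the first team found, following the documented precedence."""
    candidates = (
        annotations.get("shoehorn.dev/team"),       # 1. workload annotation
        annotations.get("shoehorn.dev/owner"),      # 2. workload annotation
        labels.get("owner"),                        # 3. workload label
        namespace_labels.get("shoehorn.dev/team"),  # 4. namespace label
    )
    for value in candidates:
        if value:
            return value
    # 5. Namespace name pattern, e.g. payments-prod -> payments
    stripped = _ENV_SUFFIX.sub("", namespace)
    if stripped and stripped != namespace:
        return stripped
    return "unassigned"                             # 6. default
```

For example, a workload with no ownership annotations or labels in an unlabeled `payments-prod` namespace resolves to `payments`, while the same workload in `default` resolves to `unassigned`.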

Label namespaces for zero-config team assignment:

kubectl label namespace payments shoehorn.dev/team=payments-team

| Annotation | Description |
| --- | --- |
| shoehorn.dev/monitor | true/false - opt in or out of monitoring |
| shoehorn.dev/monitoring-level | basic, detailed, or full |
| shoehorn.dev/team | Team slug for ownership |
| shoehorn.dev/owner | Alternative to shoehorn.dev/team |
| shoehorn.dev/entityFile | Path to .shoehorn/catalog.yaml for entity enrichment |

See Annotations Reference for the complete list.

| Endpoint | Purpose |
| --- | --- |
| /healthz | Liveness probe (checks leader eligibility) |
| /readyz | Readiness probe (checks leader status + push health) |
| /livez | Live status |
| /metrics | Prometheus metrics |

See Agent Health for the full readiness model.

For production clusters, run 2-3 replicas with leader election:

replicaCount: 3
leaderElection:
  enabled: true
podDisruptionBudget:
  minAvailable: 2

Only the leader runs watchers and pushes data. Followers are hot standbys that pass readiness probes and take over within seconds if the leader dies.
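
To show why a standby can take over only after the leader stops renewing, here is a simplified in-memory Python model of lease-based election. It is a sketch under stated assumptions: the `Lease` dataclass stands in for a coordination.k8s.io Lease object, the 15-second duration is a hypothetical value, and the model ignores the API server's optimistic concurrency (resourceVersion conflicts), which a real election relies on to settle races between candidates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """In-memory stand-in for a coordination.k8s.io Lease object."""
    holder: Optional[str] = None
    renew_time: float = 0.0
    duration_s: float = 15.0  # hypothetical lease duration


def try_acquire_or_renew(lease: Lease, identity: str, now: float) -> bool:
    """Return True if `identity` is the leader after this attempt."""
    expired = lease.holder is None or now - lease.renew_time > lease.duration_s
    if lease.holder == identity or expired:
        # The current holder renews; anyone may take over an expired lease.
        lease.holder = identity
        lease.renew_time = now
        return True
    return False
```

In this model a follower that keeps calling `try_acquire_or_renew` becomes leader on its first attempt after the lease duration has elapsed without a renewal, so failover time is bounded by the lease duration plus the follower's retry period.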

  • Read-only RBAC: no create/update/delete on workloads
  • Non-root container: runs as UID 1000, drops all capabilities
  • Read-only filesystem: sets readOnlyRootFilesystem: true
  • HTTPS enforced: warns if API endpoint uses http://
  • Token redaction: API token cannot appear in logs or traces
  • No redirect following: prevents bearer token leakage
  • Annotation sanitization: strips kubectl.kubernetes.io/last-applied-configuration
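
As an illustration of the token redaction item, the Python sketch below scrubs a known secret from log records before they are emitted. It shows the general technique only and is not how the agent implements redaction; `RedactTokenFilter` and the `[REDACTED]` placeholder are hypothetical.

```python
import logging


class RedactTokenFilter(logging.Filter):
    """Replaces occurrences of the secret in log messages before they are emitted."""

    def __init__(self, secret: str) -> None:
        super().__init__()
        self._secret = secret

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # message with %-style args already applied
        if self._secret and self._secret in message:
            record.msg = message.replace(self._secret, "[REDACTED]")
            record.args = ()  # args are already merged into msg
        return True  # never drop the record, only rewrite it
```

Attaching the filter to a handler (`handler.addFilter(RedactTokenFilter(token))`) covers every record that handler emits, including records propagated from child loggers.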