K8s Agent Reference

The Shoehorn K8s Agent is a lightweight, push-based agent that discovers Kubernetes workloads and syncs them to the Shoehorn catalog. It runs inside each cluster you want to monitor.

For operational considerations and installation, see K8s Agent.

Architecture

Kubernetes Cluster
├── Watcher (client-go informers)
│   ├── Deployments, StatefulSets, DaemonSets
│   ├── CronJobs, Jobs, Services, Ingresses, Pods
│   ├── Events, Namespaces, NetworkPolicies
│   └── Filters: namespace, label selector, annotations
│
├── GitOps Watcher (optional)
│   ├── ArgoCD Applications
│   ├── FluxCD Kustomizations, HelmReleases
│   └── Event cache (max 20 events, last 1h)
│
├── Metrics Collector (optional)
│   ├── Samples metrics-server every 5m
│   ├── 7-day rolling window (p95, max, avg)
│   └── Graceful degradation if unavailable
│
├── Batcher
│   ├── Time-based flush: every 30s
│   ├── Size-based flush: at 100 events
│   └── 2000-event channel buffer
│
├── Pusher
│   ├── POST /api/v1/k8s/agents/push
│   ├── Circuit breaker (5 failures, 60s recovery)
│   ├── Retry with exponential backoff (3 attempts)
│   └── Max payload: 10MB
│
├── Heartbeat
│   └── POST /api/v1/k8s/agents/heartbeat (every 5m)
│
└── Leader Election (HA mode)
    ├── Kubernetes Leases (coordination.k8s.io)
    └── Only leader runs watchers and pushers
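
The Batcher's two flush triggers can be illustrated with a minimal sketch. It is not the agent's implementation: the class name `Batcher`, the `tick` method, and the injectable `clock` are hypothetical, and the 2000-event channel buffer is not modeled. Only the 30s interval and the 100-event threshold come from the diagram above.

```python
import time
from typing import Any, Callable, List


class Batcher:
    """Buffers events and flushes on size or elapsed time, whichever comes first."""

    def __init__(
        self,
        flush: Callable[[List[Any]], None],
        max_size: int = 100,      # size-based flush threshold
        interval: float = 30.0,   # time-based flush interval, in seconds
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ) -> None:
        self._flush = flush
        self._max_size = max_size
        self._interval = interval
        self._clock = clock
        self._events: List[Any] = []
        self._last_flush = clock()

    def add(self, event: Any) -> None:
        """Queue one event; flush immediately once the batch is full."""
        self._events.append(event)
        if len(self._events) >= self._max_size:
            self.flush_now()

    def tick(self) -> None:
        """Call periodically; flushes a non-empty batch once the interval has elapsed."""
        if self._events and self._clock() - self._last_flush >= self._interval:
            self.flush_now()

    def flush_now(self) -> None:
        """Hand the current batch to the flush callback and reset the timer."""
        batch, self._events = self._events, []
        self._last_flush = self._clock()
        if batch:
            self._flush(batch)
```

In this sketch a busy cluster flushes on size well before the interval expires, while a quiet cluster still pushes changes within one interval.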

Configuration

All settings are environment variables prefixed with SHOEHORN_.

Required

Variable	Description
`SHOEHORN_API_ENDPOINT`	Shoehorn API URL
`SHOEHORN_API_TOKEN`	Agent bearer token (from cluster registration)
`SHOEHORN_CLUSTER_ID`	Unique cluster slug (lowercase, hyphens allowed)
`SHOEHORN_CLUSTER_NAME`	Display name (optional; defaults to the cluster ID)
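
As an illustration of how these settings fit together, the following sketch loads and validates them. The `RequiredConfig` type, the `load_required` function, and the exact slug pattern (digits allowed, hyphens only between alphanumeric runs) are assumptions for the example, not the agent's actual validation rules.

```python
import os
import re
from dataclasses import dataclass
from typing import Mapping

# Assumed slug rule: lowercase letters and digits, hyphens only between them.
_SLUG = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

_REQUIRED = ("SHOEHORN_API_ENDPOINT", "SHOEHORN_API_TOKEN", "SHOEHORN_CLUSTER_ID")


@dataclass(frozen=True)
class RequiredConfig:
    api_endpoint: str
    api_token: str
    cluster_id: str
    cluster_name: str


def load_required(env: Mapping[str, str] = os.environ) -> RequiredConfig:
    """Read the required settings, failing fast on missing or malformed values."""
    missing = [key for key in _REQUIRED if not env.get(key)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    cluster_id = env["SHOEHORN_CLUSTER_ID"]
    if not _SLUG.match(cluster_id):
        raise ValueError(f"invalid cluster ID: {cluster_id!r}")
    return RequiredConfig(
        api_endpoint=env["SHOEHORN_API_ENDPOINT"],
        api_token=env["SHOEHORN_API_TOKEN"],
        cluster_id=cluster_id,
        # The display name falls back to the cluster ID when unset.
        cluster_name=env.get("SHOEHORN_CLUSTER_NAME") or cluster_id,
    )
```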

Agent Behavior

Variable	Default	Range	Description
`SHOEHORN_BATCH_INTERVAL`	`30s`	1s-10m	Event batch flush interval
`SHOEHORN_BATCH_SIZE`	`100`	1-10000	Max events per batch
`SHOEHORN_PUSH_RETRIES`	`3`	1-10	Retry attempts on push failure
`SHOEHORN_PUSH_TIMEOUT`	`30s`	1s-5m	HTTP timeout for push
`SHOEHORN_HEARTBEAT_INTERVAL`	`5m`	1m-30m	Heartbeat frequency
`SHOEHORN_HEALTH_PORT`	`8080`	-	Health check server port
`SHOEHORN_LOG_LEVEL`	`info`	debug/info/warn/error	Log verbosity
`SHOEHORN_LOG_FORMAT`	`json`	json/console	Log output format
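
The retry setting interacts with the Pusher's circuit breaker (5 failures, 60s recovery). The sketch below shows one plausible way the two combine. It is an assumption-laden illustration: the names `CircuitBreaker` and `push_with_retry`, the 1s base delay, and the choice to count every failed attempt toward the breaker threshold are not taken from the agent.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the push is skipped."""


class CircuitBreaker:
    def __init__(self, threshold: int = 5, recovery: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.recovery = recovery
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True when closed, or when the recovery period has passed (half-open)."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.recovery

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        # Assumption: every failed attempt counts toward the threshold.
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()


def push_with_retry(send: Callable[[], Any], breaker: CircuitBreaker,
                    attempts: int = 3, base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call send() up to `attempts` times, doubling the delay between tries."""
    if not breaker.allow():
        raise CircuitOpenError("circuit open; skipping push")
    for attempt in range(attempts):
        try:
            result = send()
        except Exception:
            breaker.record_failure()
            # Give up on the last attempt or once the breaker has opened.
            if attempt == attempts - 1 or not breaker.allow():
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        else:
            breaker.record_success()
            return result
```

With these defaults a persistent outage opens the breaker during the second failed push, after which pushes are skipped until the recovery period allows a trial request.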

Namespace Filtering

Variable	Default	Description
`SHOEHORN_NAMESPACES`	- (all)	Whitelist (comma-separated)
`SHOEHORN_EXCLUDE_NAMESPACES`	-	Blacklist (comma-separated)
`SHOEHORN_LABEL_SELECTOR`	-	Kubernetes label selector
`SHOEHORN_WATCHED_KINDS`	-	Specific resource kinds
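
A minimal sketch of how the two namespace lists might combine. The precedence shown here (exclusions always win, and an empty whitelist means all namespaces) is an assumption, and the helper names `parse_list` and `namespace_allowed` are hypothetical. Label selectors and watched kinds are not modeled.

```python
from typing import FrozenSet


def parse_list(value: str) -> FrozenSet[str]:
    """Split a comma-separated setting, ignoring blanks and surrounding spaces."""
    return frozenset(item.strip() for item in value.split(",") if item.strip())


def namespace_allowed(namespace: str, include: str = "", exclude: str = "") -> bool:
    """Decide whether a namespace is watched.

    `include` mirrors SHOEHORN_NAMESPACES and `exclude` mirrors
    SHOEHORN_EXCLUDE_NAMESPACES.
    """
    included = parse_list(include)
    excluded = parse_list(exclude)
    if namespace in excluded:
        return False
    # An empty include list means every namespace is eligible.
    return not included or namespace in included
```

For example, `namespace_allowed("kube-system", exclude="kube-system,kube-public")` returns `False`, while `namespace_allowed("payments")` returns `True`.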

Monitoring Control

Variable	Default	Description
`SHOEHORN_ANNOTATION_DEFAULT_BEHAVIOR`	`monitor-all`	`monitor-all`, `require-annotation`, `monitor-none`
`SHOEHORN_ANNOTATION_DEFAULT_LEVEL`	`basic`	Default monitoring level

Monitoring Levels

Per-resource annotation: shoehorn.dev/monitoring-level

Level	Collected Data
`basic`	Workload status, replicas, image info
`detailed`	Basic + restart counts, container states, pod events
`full`	Detailed + CPU/memory usage, resource limits, QoS class
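
The default behavior, the default level, and the per-resource annotations can be combined roughly as follows. This is a sketch under stated assumptions: the function name `monitoring_level` is hypothetical, `monitor-none` is assumed to override any opt-in annotation, and an unrecognized level is assumed to fall back to the default.

```python
from typing import Mapping, Optional

BEHAVIORS = ("monitor-all", "require-annotation", "monitor-none")
LEVELS = ("basic", "detailed", "full")


def monitoring_level(
    annotations: Mapping[str, str],
    default_behavior: str = "monitor-all",
    default_level: str = "basic",
) -> Optional[str]:
    """Return the level to collect at, or None if the resource is not monitored."""
    if default_behavior not in BEHAVIORS:
        raise ValueError(f"unknown default behavior: {default_behavior!r}")

    flag = annotations.get("shoehorn.dev/monitor", "").strip().lower()
    if default_behavior == "monitor-none":
        return None  # assumption: nothing is monitored, even with an opt-in
    if default_behavior == "require-annotation":
        monitored = flag == "true"   # explicit opt-in required
    else:
        monitored = flag != "false"  # monitored unless explicitly opted out
    if not monitored:
        return None

    level = annotations.get("shoehorn.dev/monitoring-level", default_level)
    # Assumption: unrecognized levels fall back to the default level.
    return level if level in LEVELS else default_level
```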

Metrics Collection

Variable	Default	Range	Description
`SHOEHORN_METRICS_SAMPLE_INTERVAL`	`5m`	1m-30m	Metrics-server sample rate
`SHOEHORN_METRICS_WINDOW_HOURS`	`168` (7d)	1-720	Rolling window size in hours
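
To make the rolling window concrete, the sketch below keeps timestamped samples, drops those older than the window, and reports avg, max, and p95. The `RollingWindow` class is hypothetical, and the nearest-rank percentile method is an assumption; the agent may compute p95 differently.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class RollingWindow:
    """Keeps (timestamp, value) samples that fall inside a fixed time window."""

    def __init__(self, window_hours: int = 168) -> None:
        self._window = window_hours * 3600.0  # window length in seconds
        self._samples: Deque[Tuple[float, float]] = deque()

    def add(self, timestamp: float, value: float) -> None:
        """Record a sample and evict everything older than the window."""
        self._samples.append((timestamp, value))
        cutoff = timestamp - self._window
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def summary(self) -> Dict[str, float]:
        """Return avg, max, and nearest-rank p95, or {} when empty."""
        values = sorted(value for _, value in self._samples)
        if not values:
            return {}
        # Nearest-rank p95: ceil(0.95 * n), computed with integer arithmetic.
        rank = (95 * len(values) + 99) // 100
        return {
            "avg": sum(values) / len(values),
            "max": values[-1],
            "p95": values[rank - 1],
        }
```

At the default 5m sample interval, a full 168-hour window holds about 2016 samples per tracked series, which is worth keeping in mind when widening the window.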

GitOps Integration

Variable	Default	Description
`SHOEHORN_GITOPS_TOOL`	- (disabled)	`argocd` or `fluxcd`
`SHOEHORN_GITOPS_ARGOCD_NAMESPACE`	`argocd`	ArgoCD install namespace
`SHOEHORN_GITOPS_ARGOCD_SERVER_URL`	-	ArgoCD server URL (for UI links)
`SHOEHORN_GITOPS_ARGOCD_TOKEN`	-	ArgoCD API token
`SHOEHORN_GITOPS_WATCH_ALL_NAMESPACES`	`false`	Watch GitOps CRDs cluster-wide
`SHOEHORN_GITOPS_COMMAND_POLL_INTERVAL`	`30s`	Command polling frequency
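
The Architecture overview lists a GitOps event cache capped at 20 events from the last hour. A bounded cache of that shape could look like the sketch below. The `EventCache` class and its methods are hypothetical, and whether the cap applies per resource or globally is not stated in this reference.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class EventCache:
    """Keeps at most `max_events` recent events no older than `max_age` seconds."""

    def __init__(self, max_events: int = 20, max_age: float = 3600.0) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._events: Deque[Tuple[float, Any]] = deque(maxlen=max_events)
        self._max_age = max_age

    def add(self, timestamp: float, event: Any) -> None:
        self._events.append((timestamp, event))

    def recent(self, now: float) -> List[Any]:
        """Return cached events that are still within the age limit."""
        cutoff = now - self._max_age
        return [event for ts, event in self._events if ts >= cutoff]
```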

RBAC Permissions

The Helm chart creates a ClusterRole with these permissions:

Core (Always Required)

# Workloads
- apiGroups: ["apps"]
  resources: [deployments, statefulsets, daemonsets]
  verbs: [get, list, watch]

# Batch
- apiGroups: ["batch"]
  resources: [cronjobs, jobs]
  verbs: [get, list, watch]

# Core resources
- apiGroups: [""]
  resources: [namespaces, pods, services, events]
  verbs: [get, list, watch]

# Networking
- apiGroups: ["networking.k8s.io"]
  resources: [ingresses, networkpolicies]
  verbs: [get, list, watch]

# Cilium (if present)
- apiGroups: ["cilium.io"]
  resources: [ciliumnetworkpolicies, ciliumclusterwidenetworkpolicies]
  verbs: [get, list, watch]

# Metrics
- apiGroups: ["metrics.k8s.io"]
  resources: [pods]
  verbs: [get, list]

# Leader election
- apiGroups: ["coordination.k8s.io"]
  resources: [leases]
  verbs: [get, list, watch, create, update, patch]

ArgoCD (When Enabled)

- apiGroups: ["argoproj.io"]
  resources: [applications]
  verbs: [get, list, watch]

FluxCD (When Enabled)

- apiGroups: ["kustomize.toolkit.fluxcd.io"]
  resources: [kustomizations]
  verbs: [get, list, watch, patch]

- apiGroups: ["helm.toolkit.fluxcd.io"]
  resources: [helmreleases]
  verbs: [get, list, watch, patch]

- apiGroups: ["source.toolkit.fluxcd.io"]
  resources: [gitrepositories, helmrepositories]
  verbs: [get, list, watch]

Team Ownership Inference

The agent infers team ownership for discovered workloads, checked in this order:

1. Annotation shoehorn.dev/team on the workload
2. Annotation shoehorn.dev/owner on the workload
3. Label owner on the workload
4. Label shoehorn.dev/team on the namespace
5. Namespace name pattern extraction (e.g., payments-prod -> payments)
6. Default: unassigned

Label namespaces for zero-config team assignment:

kubectl label namespace payments shoehorn.dev/team=payments-team
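
The lookup order described in this section can be sketched as a first-match chain. The function names are hypothetical, and the namespace pattern rule (strip one trailing environment suffix from an assumed list) is a guess based on the payments-prod example, not the agent's actual pattern.

```python
from typing import Mapping, Optional

# Assumed environment suffixes; the real extraction pattern is not documented here.
ENV_SUFFIXES = ("prod", "staging", "dev", "test")


def team_from_namespace(namespace: str) -> Optional[str]:
    """Strip a trailing environment suffix, e.g. payments-prod -> payments."""
    prefix, sep, suffix = namespace.rpartition("-")
    if sep and prefix and suffix in ENV_SUFFIXES:
        return prefix
    return None


def infer_team(
    workload_annotations: Mapping[str, str],
    workload_labels: Mapping[str, str],
    namespace: str,
    namespace_labels: Mapping[str, str],
) -> str:
    """Return the first non-empty ownership hint, in the documented order."""
    candidates = (
        workload_annotations.get("shoehorn.dev/team"),
        workload_annotations.get("shoehorn.dev/owner"),
        workload_labels.get("owner"),
        namespace_labels.get("shoehorn.dev/team"),
        team_from_namespace(namespace),
    )
    for candidate in candidates:
        if candidate:
            return candidate
    return "unassigned"
```

Because workload annotations are checked first, a single workload can override the team assigned by its namespace label.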

Annotations Reference

Annotation	Description
`shoehorn.dev/monitor`	`true`/`false` - opt in or out of monitoring
`shoehorn.dev/monitoring-level`	`basic`, `detailed`, or `full`
`shoehorn.dev/team`	Team slug for ownership
`shoehorn.dev/owner`	Alternative to `shoehorn.dev/team`
`shoehorn.dev/entityFile`	Path to `.shoehorn/catalog.yaml` for entity enrichment

See Annotations Reference for the complete list.

Health Endpoints

Endpoint	Purpose
`/healthz`	Liveness probe (checks leader eligibility)
`/readyz`	Readiness probe (checks leader status + push health)
`/livez`	Live status
`/metrics`	Prometheus metrics

See Agent Health for the full readiness model.

High Availability

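For production clusters, run 2-3 replicas with leader election. The example uses three replicas; with only two, lower minAvailable to 1 so that node drains are not blocked: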

replicaCount: 3
leaderElection:
  enabled: true
podDisruptionBudget:
  minAvailable: 2

Only the leader runs watchers and pushes data. Followers are hot standbys that pass readiness probes and take over within seconds if the leader dies.
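
Lease-based election follows a simple rule: a candidate becomes leader if the lease is unheld, already its own, or has not been renewed within the lease duration. The sketch below simulates that rule in memory. The `Lease` type, the function name, and the 15s duration are illustrative assumptions; a real agent updates a coordination.k8s.io Lease object through the API server, which rejects conflicting writes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None  # identity of the current leader, if any
    renew_time: float = 0.0       # when the holder last renewed, in seconds
    duration: float = 15.0        # assumed lease duration, in seconds


def try_acquire_or_renew(lease: Lease, identity: str, now: float) -> bool:
    """Return True if `identity` holds the lease after this call."""
    expired = lease.holder is None or now - lease.renew_time > lease.duration
    if lease.holder == identity or expired:
        lease.holder = identity
        lease.renew_time = now
        return True
    return False
```

In this model, failover time is bounded by the lease duration plus the follower's retry period: a follower takes over on its first attempt after the old leader's last renewal has expired.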

Security

- Read-only RBAC: no create/update/delete on workloads
- Non-root container: runs as UID 1000, drops all capabilities
- Read-only filesystem: readOnlyRootFilesystem: true
- HTTPS enforced: warns if the API endpoint uses http://
- Token redaction: the API token is redacted from logs and traces
- No redirect following: prevents bearer token leakage
- Annotation sanitization: strips kubectl.kubernetes.io/last-applied-configuration
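
Three of these measures are simple enough to sketch. The helper names below are hypothetical and the logic is a simplified illustration of annotation sanitization, token redaction, and the plain-HTTP check, not the agent's code.

```python
from typing import Dict, Mapping
from urllib.parse import urlsplit

# Annotations removed before data leaves the cluster; this one can embed
# a full copy of the applied manifest.
STRIPPED_ANNOTATIONS = frozenset({"kubectl.kubernetes.io/last-applied-configuration"})


def sanitize_annotations(annotations: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the annotations without the stripped keys."""
    return {k: v for k, v in annotations.items() if k not in STRIPPED_ANNOTATIONS}


def redact(message: str, token: str) -> str:
    """Replace any occurrence of the API token in a log message."""
    return message.replace(token, "[REDACTED]") if token else message


def is_plain_http(endpoint: str) -> bool:
    """True when the endpoint would send the bearer token unencrypted."""
    return urlsplit(endpoint).scheme == "http"
```

Redirect handling is not shown; an HTTP client that refuses redirects avoids resending the bearer token to a different host.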