Changelog

Notable changes across the Shoehorn platform. Format follows Keep a Changelog; versions follow SemVer.

Per-component release notes live in their own repos.

[Unreleased]

Added

Helm release detection. The Kubernetes agent can now spot the Helm releases running in your clusters and show which workloads each release manages, right in the Operations view. It’s off by default. Turn it on in the agent’s Helm chart or with the Terraform module. See Helm Release Detection.
Attach documentation to a team. Add related_teams: [your-team] to a doc’s frontmatter and it shows up on that team’s Documentation tab, under “Bound to this team”, even when the doc isn’t tied to a single service. Good for team charters, on-call policies, and ways of working. The tab now splits docs bound to the team from docs that come in through the team’s services.
Catalog list now has a Tier column. A small set of bars shows each service’s criticality (Critical, High, Medium, Low) at a glance, and you can sort by it to bring the most critical services to the top.
Notification subscriptions. Route alerts to Slack and email destinations alongside the existing internal inbox.
Notification channel secrets. Paste a Slack or webhook secret and Shoehorn stores it encrypted, or point a subscription at a secret you’ve already configured. Choose per field in the wizard.
Team notifications tab. Team admins see their team’s subscriptions in one table, can fire a test send, and resume a paused subscription.
Tenant feature flags gitops and metrics. They control whether the GitOps signal card and the resource-usage section show up in the Operations drawer. Turn them on once you’ve connected a GitOps tool or metrics-server.
Knowledge Risk tile on the home dashboard. Shows how many repositories lean on a single maintainer, so you can spot bus-factor risk at a glance. Click through to the full breakdown.
Admin → Directory → Users now lists every user synced from your IdP, not only people who have signed in. A new “Signed in” column shows whether each user has completed at least one login.
In-place upgrades for the bundled Meilisearch search engine. When a release moves to a newer search version, you can migrate the existing data instead of rebuilding the index from scratch. See Upgrading Shoehorn for the steps.
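For the "Attach documentation to a team" entry above, here is a minimal sketch of checking locally which teams a doc is bound to before you push it. It assumes frontmatter is delimited by `---` lines and only handles the inline list form shown in the entry. The function name `related_teams` and the `title` key are illustrative, not part of Shoehorn.

```python
import re


def related_teams(doc_text: str) -> list[str]:
    """Return the team slugs listed under related_teams in a doc's frontmatter.

    Handles only the inline form (related_teams: [team-a, team-b]);
    block-style YAML lists would need a real YAML parser.
    """
    # Assumption: frontmatter is the block between the first two '---' lines.
    match = re.match(r"^---\s*\n(.*?)\n---\s*(?:\n|$)", doc_text, re.DOTALL)
    if not match:
        return []
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "related_teams":
            inner = value.strip().strip("[]")
            return [t.strip().strip("'\"") for t in inner.split(",") if t.strip()]
    return []


doc = """---
title: On-call policy
related_teams: [your-team]
---
# On-call policy
"""

print(related_teams(doc))
```

A doc with no frontmatter, or without the key, yields an empty list, so it would not be expected under "Bound to this team".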

Changed

The Repositories page got an overview strip showing totals, public/private split, languages, and repos without a license. Click a repository to see its details in a side panel, including license, topics, and ownership. You can also filter by license now. The direct link to GitHub moved to its own column.
Service tier now reads as Critical, High, Medium, or Low. If your manifests use the older Bronze, Silver, Gold, or Platinum labels, they still work and show up as Low, Medium, High, and Critical respectively.
Home dashboard KPI cards. Each card now shows a 14-day sparkline and trend, built from a daily snapshot. History starts when you upgrade, so cards show just the number until there are two days to chart.
Operations resource table has a new Signals column. A row with nothing firing stays empty.
Visual refresh across the Operations page, Entity page, Dashboard, and Insights.
Operations is now one page. Workloads, Overview, and GitOps merged into a single Resources view: workloads aggregated across clusters, a detail drawer per resource, and filters that live in the URL so a filtered view is shareable. The old sub-page URLs redirect to it.
The Governance page now groups actions by urgency, so overdue work sits at the top. You can resolve or dismiss several at once, filter to overdue, critical, or high priority, and reach resolved actions in their own group. The duplicate completeness card is gone (it still lives on the Insights overview).
Free tier no longer caps node count. Licensing is now per active K8s agent. Free stays at 1 active agent (one cluster). If you were running more than 5 nodes on Free, your agent heartbeats stop returning a 402.
Admin → Directory → Users provider column now shows the real IdP (okta, zitadel, entra-id) instead of always showing “local”. “local” is reserved for users with no IdP identity row.
K8s agent Prometheus metrics renamed to the shoehorn_ namespace prefix to match the rest of the platform. k8s_watcher_events_total becomes shoehorn_k8s_watcher_events_total, and the same applies to events_dropped_total, events_dropped_current, events_filtered_total, noop_updates_total, and drop_rate. Dashboards and alert rules that referenced the bare names need updating.
Health checks on Eventbus and worker binaries no longer advertise unregistered gRPC services. Probes hitting the empty-key (overall) health still report SERVING; per-service probes for eventbus.v1.EventBusService and worker-service now return NOT_FOUND.
K8s-imported catalog entities now show their real workload kind as the type. A Deployment shows as deployment, a StatefulSet as statefulset, a CronJob as cronjob, the same classification the Operations page already uses. Until now they all showed as service (DaemonSets as resource), so every k8s entity looked alike in the catalog. If you have scorecard rules or list filters matching type = 'service' or type = 'resource' on k8s entities, switch them to the kind value. To classify a workload differently, set the shoehorn.dev/type annotation (a kebab-case value up to 32 characters) and Shoehorn uses that instead.
Dark theme: primary buttons and focus rings no longer glow. The blue was tuned down so Save, Review, and other primary actions read as confident, not neon.
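The service tier entry above maps the older labels onto the new names. Here is a small sketch of that mapping, plus the "most critical first" sort from the Tier column. The helper names are illustrative, and the case-insensitive matching is an assumption, not documented behavior.

```python
# Legacy label -> current tier, in the order the changelog entry lists them.
LEGACY_TIERS = {
    "bronze": "Low",
    "silver": "Medium",
    "gold": "High",
    "platinum": "Critical",
}

# Current tier names, most critical first.
CURRENT_TIERS = ("Critical", "High", "Medium", "Low")


def display_tier(label: str) -> str:
    """Map a manifest tier label (old or new) to the name shown in the UI."""
    normalized = label.strip().lower()  # assumption: labels match case-insensitively
    if normalized in LEGACY_TIERS:
        return LEGACY_TIERS[normalized]
    for tier in CURRENT_TIERS:
        if tier.lower() == normalized:
            return tier
    raise ValueError(f"unknown tier label: {label!r}")


def sort_most_critical_first(services):
    """Sort (name, tier_label) pairs so the most critical services come first."""
    return sorted(services, key=lambda s: CURRENT_TIERS.index(display_tier(s[1])))


services = [("billing", "Bronze"), ("auth", "Platinum"), ("search", "Medium")]
print(sort_most_critical_first(services))
```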

Removed

The standalone GitOps fleet page. Its ArgoCD/Flux sync-status view isn’t in the unified Operations page yet; for now it lives on the entity GitOps tab.
The Pipelines page under Code, along with its home dashboard widget and KPI tile.
The optional network observer add-on. Shoehorn now maps connections between services from your cluster’s configuration, so there’s no separate component to install.

Fixed

Org chart now shows team managers by name. Previously, clicking a team opened a drawer showing the manager’s raw Okta ID instead of their name, and the team’s box on the chart had the same issue. Both now show the name, and fall back to “Unknown user” when the person isn’t in the directory.
Stale cloud resources no longer linger in the catalog. When the crawler stops seeing a cloud resource (a deleted UpCloud Kubernetes cluster, database, or server), it marks the entry end-of-life after three missed syncs, hides it from the catalog list right away, and removes it from the database after a 7-day grace window. If the resource comes back during that window it reappears, and deep links to a hidden entry still resolve until it’s removed.
Going back from a service’s detail page no longer resets the catalog. Your filters, search, and place in the list are kept when you return.
Team dashboard widgets now link to the team they’re scoped to. “View all” on Team Members opens the team page; Repositories, Pipelines, and Governance open their respective list page filtered by ?owner=<teamId>. Before, all four jumped to unfiltered global pages.
K8s reconciliation now runs in production. Bugs in the agent’s resource count and push behavior were causing the startup sweep to be skipped, so out-of-scope catalog entries were never pruned. The count now reflects only kinds that land in the catalog as instances (Deployment, StatefulSet, DaemonSet, CronJob, Pod), matching what the platform actually stamps. The agent no longer POSTs empty batches that the platform rejects with 400. The completing sync marker still goes through unchanged. Both fixed in the K8s agent; no platform change.
K8s agent no longer watches individual Jobs. CronJob is the catalog entity; an individual Job run is its execution and was creating catalog churn (each CronJob tick = one new instance row that immediately got deleted on completion) plus tripping reconciliation guards when a Job completed mid-sync. If you relied on seeing standalone Jobs in the catalog, the CronJob view is the supported replacement.
Stale Kubernetes resources now get cleaned out of the catalog. On startup the agent sends a full cluster snapshot, and the platform prunes anything no longer there: workloads in namespaces you’ve added to SHOEHORN_EXCLUDE_NAMESPACES, and resources deleted while the agent was offline. Before, the agent only sent incremental add/update/delete events, so those resources stayed in the catalog indefinitely. Reconciliation kicks in once both the platform and the K8s agent are on this version; an older agent keeps working unchanged, just without the cleanup.
Bootstrap admin on first login: TENANT_ADMIN_USER again grants the admin role. v0.5.3’s email_verified gate locked operators out when their IdP didn’t surface the claim (Okta API-created users, sandbox accounts). The operator already controls both the IdP and TENANT_ADMIN_USER, so the IdP itself is the gatekeeper; the extra gate was an overcorrection for the enterprise IdPs we support.
OIDC callback now reads email_verified from the verified ID token instead of the access token. Okta puts the claim only in the ID token, per the OIDC Core spec.
Helm install on clusters without a default StorageClass now works. PVCs for postgres, meilisearch, and redpanda no longer emit storageClassName: ""; they omit the field when no class is configured, so the cluster default applies. Valkey already worked; the three other statefulsets now match its pattern.
Valkey now enforces the password it was wired up to use. The server starts with --requirepass $(VALKEY_PASSWORD) when valkey.passwordSecretRef.key is set, and the liveness probe authenticates via REDISCLI_AUTH so it fails on a NOAUTH reply.
Chart no longer renders :latest image tags when image.tag is unset. The shoehorn.componentImage helper fails with a clear error message instead.
RBAC sync Job (--set rbac.sync.enabled=true --set rbac.sync.mode=automatic) now renders. It previously referenced undefined helpers, the wrong postgresql.sslMode path, and a missing rbac.defaultRole value; it also used a hard-coded valkey:9.0.0 image and had an indentation bug that produced invalid YAML.
Chart README TL;DR helm install command now includes the required values (global.domain, organization slug, Zitadel projectId/clientId/externalUrl) and no longer references a custom-values.yaml file that was never explained.
global.domain placeholder now blocks render. The chart fails fast if you forget to set your hostname instead of silently deploying with idp.example.com.
Forge service now honors postgresql.tls.enabled. It was hard-coded to DB_SSLMODE=disable regardless of TLS config, which silently disabled TLS for forge while the other services used it.
PodDisruptionBudgets only render for components with 2+ replicas (or autoscaling enabled). With a single replica and minAvailable: 1, the PDB blocked node drains indefinitely, which served no real purpose.
RBAC sync Job name is now deterministic (uses Release.Revision). The previous {{ now | date ... }} suffix made helm diff think every render changed the job and broke idempotent apply.
Install guide image-tag example updated to v0.5.22 (current Chart appVersion). The docs were stuck on v0.7.0, a version that doesn’t exist yet.
API keys admin page now deep-clones the SSR data prop before assigning to $state. Without the clone, a child component mutating a row could crash the page with state_unsafe_mutation.
The 30-second pod-status poller in K8sWorkloadsCard now stops when you navigate to a different entity, not only on component destroy. Old pollers no longer write back into a stale entity’s state.
Profile dropdown in the top bar is no longer hidden behind page headers. The dropdown’s z-index was clamped by the top bar’s stacking context, so page-level sticky sub-headers on catalog, admin, forge, and other section pages painted over it. The top bar now sits above page content.
Templates that referenced shoehorn.io/ for labels and annotations now use shoehorn.dev/, the actual org domain.
Helm install on a cluster without Traefik now fails fast with a clear message instead of producing IngressRoute manifests Kubernetes can’t accept. The check is skipped during helm template dry-runs.
Meilisearch env var standardized to MEILISEARCH_MASTER_KEY across api, worker, eventbus, crawler, and forge, matching upstream Meilisearch naming. Existing deployments using MEILISEARCH_API_KEY still work via a fallback in the client code; new installs use the canonical name.
Removed the unused global.tracing.sampleRate value. It was declared but no backend consumed it. If sampling gets wired up later, it’ll surface as OTEL_TRACES_SAMPLER_ARG, with docs.
Meilisearch pod now sets seccompProfile: RuntimeDefault, matching the other backend pods.
“Bootstrap admin” renamed to “initial admin” in get-started docs. The bootstrap jargon stays internal-only.
Get-started prerequisites table now says Helm 4.0+ (was incorrectly Helm 3.12+); matches the chart README and install guide.
Drawer types in the operations UI (DeploymentDetailDrawer, SLODetailDrawer) now use named LinkedIncident and SLOViolation interfaces instead of any[]. Backend shape changes surface as type errors at build time instead of silently breaking the drawer.
getEntityDocs now distinguishes “empty docs” from “API returned no data”. The latter logs a warning via the structured logger.
GITHUB_FORGE_PRIVATE_KEY_PATH env var on api + forge pods is only set when auth.github.forge.appId is configured. Stops the pod from advertising a path to a file that’s not mounted on default installs.
Startup fatal errors now flush the log entry before exiting. The five service binaries (api, forge, worker, eventbus, crawler) used log.Fatal, which can lose buffered stdout on k8s pipes; they now go through a logger.Die helper that calls Sync() first.
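For the PodDisruptionBudget entry above, this sketch shows why a single replica with minAvailable: 1 blocks drains, and the render condition the entry describes. The function names are illustrative; the chart implements this in Helm templates, not Python. The disruption arithmetic follows the standard Kubernetes rule that allowed disruptions equal healthy pods minus the required minimum.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a PDB permits right now (never negative)."""
    return max(0, healthy_pods - min_available)


def should_render_pdb(replicas: int, autoscaling_enabled: bool = False) -> bool:
    """Render a PDB only for 2+ replicas, or when autoscaling is enabled."""
    return autoscaling_enabled or replicas >= 2


# One replica with minAvailable: 1 leaves no room to evict the pod,
# so a node drain waits forever.
print(allowed_disruptions(healthy_pods=1, min_available=1))

# Two replicas allow one eviction at a time.
print(allowed_disruptions(healthy_pods=2, min_available=1))
```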

Security

Agent push API now rejects dashboardUrl values whose scheme isn’t http or https. Closes a stored-XSS vector where a malicious or misconfigured SHOEHORN_DASHBOARD_URL (e.g. javascript:..., data:text/html,...) would render as a clickable link in the UI. Empty dashboardUrl stays valid.
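The scheme check described above can be sketched as follows, for example to pre-validate a SHOEHORN_DASHBOARD_URL value before deploying an agent. This is an illustration of the rule, not the platform's code. `is_valid_dashboard_url` is a hypothetical name, and rejecting scheme-less values is an assumption that follows from "scheme isn’t http or https".

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_valid_dashboard_url(value: str) -> bool:
    """Accept an empty value, or a URL whose scheme is http or https."""
    if value == "":
        return True  # empty dashboardUrl stays valid
    # urlsplit lowercases the scheme; strip() guards against padded input.
    scheme = urlsplit(value.strip()).scheme
    return scheme in ALLOWED_SCHEMES


for candidate in [
    "https://grafana.example.com/d/abc",
    "javascript:alert(1)",
    "data:text/html,<script>alert(1)</script>",
    "",
]:
    print(repr(candidate), is_valid_dashboard_url(candidate))
```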

Docs

Chart README now points to a Traefik install command for clusters without an ingress controller, and includes a PowerShell-native equivalent of the openssl rand credential generator.

[v0.5.3] - 2026-05-13

Changed

Pricing model: clusters and nodes replace entities. Free covers 1 cluster with up to 5 nodes. Standard covers 3 clusters at €299/month with everything else unlimited. Extra clusters cost €99/month each.
Existing licenses auto-convert. Beta customers keep going. When the Beta period ends, you get a 30-day read-only window to switch to Standard or drop back to Free. Catalog reads, dashboards, and agent traffic stay live during the window. Only catalog writes pause.
After the 30-day window, tenants still over the Free limit get auto-trimmed to one cluster (smallest node count, oldest first). Data stays in place. The disconnected agents stop authenticating until reactivated under a paid license.
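As a worked example of the Standard pricing in this release: 3 clusters are included at €299/month, and each extra cluster adds €99/month. The sketch below only restates those figures. Taxes and discounts are not modeled, and the function name is illustrative.

```python
STANDARD_BASE_EUR = 299          # covers the first 3 clusters
STANDARD_INCLUDED_CLUSTERS = 3
EXTRA_CLUSTER_EUR = 99           # per cluster beyond the included 3


def standard_monthly_cost(clusters: int) -> int:
    """Monthly Standard price in EUR for a given number of clusters."""
    if clusters < 1:
        raise ValueError("clusters must be at least 1")
    extra = max(0, clusters - STANDARD_INCLUDED_CLUSTERS)
    return STANDARD_BASE_EUR + extra * EXTRA_CLUSTER_EUR


# 5 clusters: 299 + 2 * 99 = 497
print(standard_monthly_cost(5))
```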

Added

Pending cluster registrations panel. When an agent registers past your licensed cluster count, it shows up here so an admin can request a license upgrade or dismiss the row.

[v0.5.22] - 2026-05-09

Added

Filter for profile entities by query, type, and team.

Fixed

Infinite login loop when a user without any assigned roles signed in. They’re now redirected to an error page explaining the missing access.

[v0.5.21] - 2026-05-08

Added

Public launch of the Shoehorn Platform.