k3s-cluster/platform-engineer/README.md
2026-06-27 00:09:39 +02:00


Platform Engineer Agent — Deployment Plan

An autonomous Hermes Agent that runs inside the k3s cluster, watches its health on a schedule, tries to fix simple problems, and notifies me (via Discord) when something needs my attention or a fix failed.

Docs: https://hermes-agent.nousresearch.com/docs/user-guide/docker


1. Goal & operating model

  • One Hermes container in a new namespace platform-engineer, scheduled on the powerful amd64 node (roger-nucbox-evo-x2, 24 GiB RAM).
  • Hermes runs in gateway mode under s6 supervision (command: gateway run), so the built-in cron scheduler is active and survives restarts.
  • The agent talks to the cluster with kubectl from inside the container (terminal backend = local). We give the pod a ServiceAccount + ClusterRole scoped to read-mostly + restart/scale/delete-pod permissions.
  • LLM calls are routed through the in-cluster LiteLLM proxy (litellm.rogi.casa) — no external API keys needed in the cluster.
  • Notifications go to Discord (reuse the pattern from myorg-assistant).
  • A set of cron jobs (Hermes-native, not Kubernetes CronJobs) makes the agent run periodic checks. Watchdog checks use [SILENT] so it only pings me when something is wrong.

Why Hermes-native cron (not k8s CronJobs):

  • Hermes cron ticks inside the gateway, runs in an isolated agent session, supports [SILENT] suppression, deliver="discord", workdir, and context_from chaining — far less plumbing than spawning a fresh pod per run.
  • Cron jobs live in ~/.hermes/cron/jobs.json on the PVC, so they survive pod restarts and can be edited live via hermes cron edit without redeploying.

2. Files to create (this directory)

platform-engineer/
├── namespace.yaml              # namespace platform-engineer
├── rbac.yaml                    # ServiceAccount + ClusterRole (+binding)
├── configmap.yaml               # hermes config.yaml + SOUL.md + cron seed script
├── secret.yaml                  # DISCORD bot token, LITELLM_API_KEY, kubeconfig-less SA token
├── pvc.yaml                     # persistent /opt/data (HERMES_HOME)
├── dockerfile                   # derived image: hermes-agent + kubectl + helm
├── deployment.yaml              # Deployment, schedules on amd64, mounts kube SA token
├── ingress.yaml                 # hermes.rogi.casa → dashboard (optional)
└── README.md                    # this file

Then add a line to argocd/gen-apps.sh APPS=(...):

"platform-engineer|platform-engineer|platform-engineer|true|true"

and re-run ./argocd/gen-apps.sh to generate argocd/apps/platform-engineer.yaml so ArgoCD reconciles it like every other app in the repo.


3. RBAC — least privilege

ServiceAccount platform-engineer in ns platform-engineer, bound to a ClusterRole scoped to platform engineer actions:

Read (get/list/watch): nodes, pods, services, deployments, statefulsets, daemonsets, replicasets, jobs, cronjobs, events, configmaps, secrets, PVCs, ingresses, namespaces.

Act (patch/update on an allowlist):

  • pods → delete (force-restart a stuck pod), patch (/evict, annotations)
  • deployments, statefulsets, daemonsets, replicasets → patch (restart via kubectl rollout restart / scale), update
  • jobs, cronjobs → delete, patch
  • pods/exec (subresource) → create (only if we want the agent to kubectl exec into pods for log-style debugging; optional, keep off initially)
  • events → get/list/watch only

No cluster-scoped writes (no creating namespaces, no node taints, no RBAC edits, no CRDs). The agent can propose those and tell me; it cannot do them itself. All mutating calls are auditable via Kubernetes audit logs, and the effective permissions can be checked ahead of time with kubectl auth can-i --as=system:serviceaccount:platform-engineer:platform-engineer.

The pod uses the k3s in-cluster ServiceAccount token (/var/run/secrets/... /serviceaccount/token) + the KUBERNETES_SERVICE_HOST/PORT env vars k3s already injects — no kubeconfig file, no long-lived token on disk.
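
The read and act lists above could translate into a rbac.yaml along these lines. This is a minimal sketch, not the final manifest: the object names simply reuse platform-engineer, and the pods/log and */scale subresources are assumptions added because kubectl logs and kubectl scale go through them.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: platform-engineer
  namespace: platform-engineer
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: platform-engineer
rules:
  # Read-only access (get/list/watch) to the resources listed in this section.
  - apiGroups: [""]
    resources: [nodes, pods, services, events, configmaps, secrets, persistentvolumeclaims, namespaces]
    verbs: [get, list, watch]
  # Assumption: pods/log is not in the list above, but kubectl logs needs it.
  - apiGroups: [""]
    resources: [pods/log]
    verbs: [get]
  - apiGroups: [apps]
    resources: [deployments, statefulsets, daemonsets, replicasets]
    verbs: [get, list, watch]
  - apiGroups: [batch]
    resources: [jobs, cronjobs]
    verbs: [get, list, watch]
  - apiGroups: [networking.k8s.io]
    resources: [ingresses]
    verbs: [get, list, watch]
  # Act allowlist: delete/patch pods, restart or scale workloads, clean up jobs.
  - apiGroups: [""]
    resources: [pods]
    verbs: [delete, patch]
  - apiGroups: [apps]
    resources: [deployments, statefulsets, daemonsets, replicasets]
    verbs: [patch, update]
  # Assumption: kubectl scale uses the scale subresource.
  - apiGroups: [apps]
    resources: [deployments/scale, statefulsets/scale, replicasets/scale]
    verbs: [get, patch, update]
  - apiGroups: [batch]
    resources: [jobs, cronjobs]
    verbs: [delete, patch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-engineer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: platform-engineer
subjects:
  - kind: ServiceAccount
    name: platform-engineer
    namespace: platform-engineer
```

Some of the checks planned later touch resources outside this list (PersistentVolumes, metrics.k8s.io for kubectl top, ArgoCD Applications, cert-manager Certificates), so those would likely need extra read-only rules in the same style.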


4. Image — thin derived Dockerfile

FROM nousresearch/hermes-agent:latest
USER root
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl gnupg \
 && curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.35/deb/Release.key \
    | gpg --dearmor -o /usr/share/keyrings/kubernetes-apt-keyring.gpg \
 && echo 'deb [signed-by=/usr/share/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.35/deb/ /' \
    > /etc/apt/sources.list.d/kubernetes.list \
 && apt-get update \
 && apt-get install -y --no-install-recommends kubectl \
 && curl -fsSL https://get.helm.sh/helm-v3.16.0-linux-amd64.tar.gz \
    | tar -xz -C /usr/local/bin --strip-components=1 linux-amd64/helm \
 && rm -rf /var/lib/apt/lists/*
USER hermes

Note: the cluster is mixed arch (arm64/amd64/arm). The agent pod is pinned to the amd64 node, so linux-amd64 helm + kubectl packages are fine. If you later want it portable, switch to a multi-arch build with TARGETARCH and install matching helm arch.

Build & push to your Gitea registry (git.rogi.casa/roger/...) — same imagePullSecrets: gitea-registry pattern as gym-tracker. Tag with the hermes version + a short git sha.


5. Hermes configuration (mounted via ConfigMap → /opt/data/config.yaml)

# config.yaml (seeded into the PVC on first boot)
model:
  provider: openai-api
  default: claude-4.5-haiku
  base_url: "https://litellm.rogi.casa/v1"
  api_mode: chat_completions

# Use a cheap, fast model for auxiliary tasks (titling, compression)
auxiliary:
  compression:
    provider: openai-api
    model: gemini-3-flash
  title_generation:
    provider: openai-api
    model: gemini-3-flash

terminal:
  backend: local
  cwd: /workspace            # a working dir for any kubectl output / scratch
  timeout: 180
  home_mode: profile        # isolate tool credentials under HERMES_HOME/home

# Unattended gateway → circuit-breaker on tool-call loops
tool_loop_guardrails:
  hard_stop_enabled: true
  hard_stop_after:
    exact_failure: 5
    idempotent_no_progress: 5

sessions:
  auto_prune: true
  retention_days: 90

cron:
  wrap_response: false      # cleaner Discord messages

memory:
  memory_enabled: true
  user_profile_enabled: true

.env (from Secret, mounted to /opt/data/.env):

OPENAI_API_KEY=<LITELLM_API_KEY value, i.e. sk-...>
OPENAI_BASE_URL=https://litellm.rogi.casa/v1
DISCORD_BOT_TOKEN=<new dedicated bot token>
DISCORD_HOME_CHANNEL=<your user/channel id for alerts>
# Dashboard auth (homelab, trusted LAN behind ingress)
HERMES_DASHBOARD_BASIC_AUTH_USERNAME=roger
HERMES_DASHBOARD_BASIC_AUTH_PASSWORD=<strong password>

Why OPENAI_API_KEY + OPENAI_BASE_URL: the openai-api provider honours OPENAI_BASE_URL, so this is the simplest way to point Hermes at the in-cluster LiteLLM. claude-4.5-haiku / gemini-3-flash are the model names already exposed by your litellm/litellm.yaml ConfigMap.

SOUL.md (personality + guardrails) — see configmap.yaml. Key points:

  • Identity: "Platform Engineer for the rogi.casa k3s cluster."
  • Knows the cluster layout (3 nodes, ArgoCD GitOps, Traefik+cert-manager, LiteLLM, services list).
  • Operating rules: read-first; only act on the allowlisted verbs; never edit RBAC / taints / namespaces / CRDs; when in doubt, notify instead of acting; always cite the resource and the command used.
  • How to reach me: deliver="discord".

6. Deployment

  • replicas: 1 (Hermes data dir is single-writer — never scale >1).
  • nodeSelector: kubernetes.io/arch: amd64 + preferred hardware: high-memory affinity → lands on the NUC.
  • resources: requests 512Mi/250m, limits 2Gi/1 core (Hermes recommends 2 to 4 GiB; 1 GiB is fine without browser tools, which we keep off).
  • Volume: PVC mounted at /opt/data (HERMES_HOME), RWX not needed (single pod).
  • Ports: 8642 (gateway API, internal only) and 9119 (dashboard) → exposed via Ingress hermes.rogi.casa with TLS + basic-auth (already enforced by the HERMES_DASHBOARD_BASIC_AUTH_* env vars).
  • imagePullSecrets: gitea-registry.
  • env from Secret; HERMES_DASHBOARD=1.
  • Init: on first boot the s6 01-hermes-setup hook seeds config/SOUL/.env from the ConfigMap if the volume is empty. We mount the ConfigMap as a read-only projection at /opt/seed/ and run a tiny initContainer that copies it into /opt/data only when /opt/data/config.yaml doesn't exist (so ArgoCD self-heal never fights the agent's live-edited config).
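
The bullets above could be sketched as the following deployment.yaml. Treat it as a starting point under stated assumptions: the Secret, ConfigMap and PVC names (hermes-env, hermes-seed, hermes-data), the busybox init image and the app: hermes label are hypothetical, and gateway run is passed as args on the assumption that the image entrypoint (s6) must stay in place.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hermes
  namespace: platform-engineer
spec:
  replicas: 1                 # single-writer data dir: never scale above 1
  strategy:
    type: Recreate            # stop the old pod before starting a new one
  selector:
    matchLabels:
      app: hermes
  template:
    metadata:
      labels:
        app: hermes
    spec:
      serviceAccountName: platform-engineer
      nodeSelector:
        kubernetes.io/arch: amd64
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: hardware
                    operator: In
                    values: [high-memory]
      imagePullSecrets:
        - name: gitea-registry
      initContainers:
        - name: seed-config   # copies the seed only when the PVC has no config yet
          image: busybox:1.36 # hypothetical helper image
          command:
            - sh
            - -c
            - |
              if [ ! -f /opt/data/config.yaml ]; then
                cp -L /opt/seed/* /opt/data/
              fi
          volumeMounts:
            - { name: data, mountPath: /opt/data }
            - { name: seed, mountPath: /opt/seed, readOnly: true }
      containers:
        - name: hermes
          image: "git.rogi.casa/roger/hermes-agent:<tag>"
          args: ["gateway", "run"]   # assumption: args, so the s6 entrypoint is kept
          envFrom:
            - secretRef:
                name: hermes-env     # hypothetical Secret name
          env:
            - name: HERMES_DASHBOARD
              value: "1"
          ports:
            - { name: gateway, containerPort: 8642 }
            - { name: dashboard, containerPort: 9119 }
          resources:
            requests: { memory: 512Mi, cpu: 250m }
            limits: { memory: 2Gi, cpu: "1" }
          volumeMounts:
            - { name: data, mountPath: /opt/data }
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: hermes-data   # hypothetical PVC name
        - name: seed
          configMap:
            name: hermes-seed        # hypothetical ConfigMap name
```

One thing to verify when using a sketch like this: files copied by the init container are owned by that container's user, so the hermes user in the main container may need a matching securityContext (or a chown in the init step) to edit them later.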

7. Cron jobs to seed (Hermes-native)

These are written by an init script (one-shot Job hermes-cron-seed) that runs hermes cron create ... against the gateway on first install, and is idempotent (it checks existing job names). All deliver to Discord. Examples:

| Name | Schedule | Prompt (abbreviated) |
| --- | --- | --- |
| cluster-health-check | every 15m | Run kubectl get nodes, then kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded (the phase selector applies to pods only), and kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp, keeping only events from the last 20m. If everything is healthy, reply with only [SILENT]. Otherwise summarize failures and root-cause briefly. |
| pod-restart-loop | every 10m | Find pods in CrashLoopBackOff/ImagePullBackOff across all namespaces. For CrashLoopBackOff, fetch logs and if a clear transient cause (OOM, config parse, missing secret) is visible, attempt kubectl rollout restart <deploy>; otherwise notify me with the log excerpt. Reply [SILENT] if none found. |
| pvc-pressure | every 30m | kubectl get pv plus node disk state (kubectl top nodes reports only CPU and memory, so disk needs the node's DiskPressure condition or kubelet stats). Alert if any PVC is Bound to a near-full volume or node disk >85%. [SILENT] otherwise. |
| argocd-sync-health | every 1h | kubectl get applications -n argocd -o wide (or argocd app sync --dry-run if CLI present). Report any OutOfSync/Degraded app. [SILENT] if all Synced+Healthy. |
| cert-expiry | every 1d at 09:00 | List cert-manager Certificate resources with expiry < 21 days. Notify only if any. [SILENT] otherwise. |
| node-resource-drift | every 30m | kubectl top nodes. Alert if any node CPU>90% or mem>90% sustained, or any node NotReady. [SILENT] otherwise. |
| daily-cluster-report | 0 8 * * * | Summarize: node count/status, top 5 pods by CPU/mem, # pods not Running, # ArgoCD apps OutOfSync, cert warnings. Always deliver (no [SILENT]). |

Design rules baked into SOUL.md:

  • Read-only checks run frequently (every 10 to 30m) and stay silent unless something is wrong.
  • Mutating actions are restricted to safe idempotent ones (rollout restart, delete stuck pod so controller recreates). Anything riskier → notify me with a proposed command and wait for me to run it (I can reply in Discord to the continuable thread).
  • Cron sessions are isolated and cannot create new cron jobs (Hermes disables that inside cron runs) → no runaway loops.

8. Safety & guardrails

  1. RBAC is the real boundary. Even if the agent goes rogue, the SA can't touch other namespaces' secrets beyond read, can't change RBAC, can't taint nodes, can't create namespaces.
  2. tool_loop_guardrails.hard_stop_enabled: true — circuit-breaks a stuck gateway (recommended in the Docker doc for unattended deployments).
  3. skills.write_approval: false but memory.write_approval: true (so the agent can build skills/memories but I review memory writes lazily — flip this if it gets noisy).
  4. No pods/exec subresource initially (keep the agent from shelling into workloads). Enable later only if you want log-grep-style debugging.
  5. Dashboard behind ingress TLS + basic auth (the June-2026 hardening makes auth mandatory on non-loopback binds; we satisfy it with the bundled basic-auth provider).
  6. Single replica / single-writer PVC — the Docker doc is explicit that two gateways on the same /opt/data corrupt session/memory stores. Use a podAntiAffinity so an accidental scale-up doesn't co-run.
  7. ArgoCD interaction: keep syncPolicy.automated.prune+selfHeal but exclude the live-edited hermes state. Practically: Argo owns the manifests (deployment, configmap, secret, pvc), while /opt/data (config.yaml, cron/jobs.json, SOUL.md edits made via the dashboard) is runtime state on the PVC and is not reconciled by Argo. The ConfigMap only seeds it on first boot. Document this clearly in the README so future-you doesn't expect Argo to reset the agent's personality.
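
To make point 7 concrete, the generated Application might look roughly like the sketch below. The repoURL and project are hypothetical (the real values come from argocd/gen-apps.sh); syncPolicy.automated.prune and selfHeal are the standard Argo CD fields named above.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-engineer
  namespace: argocd
spec:
  project: default                                        # hypothetical project
  source:
    repoURL: https://git.rogi.casa/roger/k3s-cluster.git  # hypothetical repo URL
    targetRevision: HEAD
    path: platform-engineer
  destination:
    server: https://kubernetes.default.svc
    namespace: platform-engineer
  syncPolicy:
    automated:
      prune: true      # delete objects that were removed from git
      selfHeal: true   # revert manual edits to tracked Kubernetes objects
```

Argo only compares Kubernetes objects against git. Files inside the PVC are not objects, so self-heal has nothing to revert there, which is why the live-edited Hermes state survives syncs.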

9. Rollout plan

  1. Build & push the derived image to git.rogi.casa/roger/hermes-agent (tag v1.35-<sha>).
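  2. Create the namespace + RBAC + Secret + ConfigMap + PVC: kubectl apply -f namespace.yaml -f rbac.yaml -f secret.yaml -f configmap.yaml -f pvc.yaml (run from platform-engineer/, so the Deployment waits for step 4).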
  3. Create the platform-engineer Discord bot, invite it, put its token + your channel id in secret.yaml (base64).
  4. Apply the Deployment; wait for the pod to go Running.
  5. kubectl exec in and run the one-shot cron seed: hermes cron create ... (or apply the cron-seed Job).
  6. Trigger the first cluster-health-check manually: hermes cron run cluster-health-check.
  7. Add the app to argocd/gen-apps.sh, regenerate, commit, push.

10. Decisions (locked in)

  1. Notifications: dedicated platform-engineer Discord bot → its own token in secret.yaml (DISCORD_BOT_TOKEN, DISCORD_HOME_CHANNEL).
  2. Dashboard: public at hermes.rogi.casa (Traefik TLS + cert-manager + the bundled Hermes basic-auth provider). Reach the dashboard on port 9119; the gateway API on 8642 is ClusterIP-only.
  3. Image: derived image pushed to git.rogi.casa/roger/hermes-agent, pulled via the existing gitea-registry imagePullSecret (must also exist in the platform-engineer ns — see deploy steps).
  4. Model: qwen-3.6:27b via the in-cluster Ollama box (10.88.20.12:11434), exposed through LiteLLM as qwen-3.6:27b. Added to litellm/litellm.yaml. Hermes reaches LiteLLM at https://litellm.rogi.casa/v1 (never Ollama directly).
  5. pods/exec: granted (pods/exec → create in the ClusterRole) so the agent can kubectl exec/kubectl logs for debugging. This supersedes the "keep off initially" default in sections 3 and 8.
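
Decision 2 could be expressed as a Service plus Ingress like the sketch below. Assumptions to check before use: the cluster-issuer name (letsencrypt-prod), the TLS secret name, the app: hermes selector, and the traefik ingress class (the default name for the Traefik bundled with k3s).

```yaml
apiVersion: v1
kind: Service
metadata:
  name: hermes
  namespace: platform-engineer
spec:
  type: ClusterIP            # gateway API stays cluster-internal
  selector:
    app: hermes              # hypothetical pod label
  ports:
    - { name: gateway, port: 8642, targetPort: 8642 }
    - { name: dashboard, port: 9119, targetPort: 9119 }
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hermes
  namespace: platform-engineer
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # hypothetical issuer name
spec:
  ingressClassName: traefik
  tls:
    - hosts: [hermes.rogi.casa]
      secretName: hermes-tls                           # hypothetical secret name
  rules:
    - host: hermes.rogi.casa
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hermes
                port:
                  number: 9119   # only the dashboard is routed; 8642 is not exposed
```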

11. Deployment checklist (do in this order)

  1. Add the Ollama model to LiteLLM (already done in litellm/litellm.yaml): the qwen-3.6:27b entry points at http://10.88.20.12:11434. Make sure qwen3.6:27b is actually pulled on that Ollama host (ollama pull qwen3.6:27b). Apply: kubectl apply -f litellm/ and restart the LiteLLM pod so the new config takes effect.
  2. Create the gitea-registry secret in the new namespace (ArgoCD won't create it — it's not in the repo):
    kubectl create namespace platform-engineer
    kubectl create secret docker-registry gitea-registry \
      --docker-server=git.rogi.casa \
      --docker-username=<your-gitea-user> \
      --docker-password=<gitea-access-token> \
      --docker-email=<your-email> \
      -n platform-engineer
    
  3. Build & push the image: ./platform-engineer/build-and-push.sh (after docker login git.rogi.casa).
  4. Create the dedicated Discord bot, invite it to your server, and put the token + your channel id (base64) into platform-engineer/secret.yaml. Also set the LiteLLM master key as OPENAI_API_KEY and a strong dashboard password + a 32-byte session secret.
  5. Commit & push the whole change. ArgoCD will create the namespace resources, deploy the pod, and bring up the ingress at hermes.rogi.casa.
  6. Seed the cron jobs: kubectl apply -f platform-engineer/cron-seed.yaml (one-shot Job) — it waits for the hermes pod, then runs hermes cron create ... for each watchdog. Re-run it any time you want to re-seed after a wipe.
  7. Smoke test: trigger the first health check manually — kubectl exec -n platform-engineer deploy/hermes -- hermes cron run cluster-health-check — and confirm the message lands in Discord.
  8. ArgoCD: the Application (argocd/apps/platform-engineer.yaml) is already generated. After commit, Argo will reconcile it like every other app.
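
For step 4, one option that avoids hand-encoding base64 is the stringData field of a Secret, which the API server encodes on write. A minimal sketch, assuming a hypothetical Secret name (hermes-env) and reusing the variable names from the .env listing in section 5:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hermes-env               # hypothetical name
  namespace: platform-engineer
type: Opaque
stringData:                      # plain text here; stored base64-encoded under data
  OPENAI_API_KEY: "<LiteLLM master key>"
  OPENAI_BASE_URL: "https://litellm.rogi.casa/v1"
  DISCORD_BOT_TOKEN: "<dedicated bot token>"
  DISCORD_HOME_CHANNEL: "<channel id>"
  HERMES_DASHBOARD_BASIC_AUTH_USERNAME: "roger"
  HERMES_DASHBOARD_BASIC_AUTH_PASSWORD: "<strong password>"
```

The placeholders in angle brackets are left unfilled on purpose. If secret.yaml keeps using the data field instead, each value has to be base64-encoded as the checklist says.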

12. What ArgoCD owns vs. what is runtime state

  • ArgoCD owns (in git): namespace, RBAC, Secret, ConfigMap (seed), PVC, Deployment, Service, Ingress, cron-seed Job.
  • Runtime state (on the PVC, NOT reconciled): config.yaml, SOUL.md, .env, cron/jobs.json, sessions/, memories/, skills/. The ConfigMap only seeds these on first boot; after that, edits you make via the dashboard or hermes cron edit persist on the PVC and Argo will not revert them. If you ever want a hard reset, delete the PVC and re-apply.
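
For reference, the PVC that holds this runtime state could be as small as the sketch below. The claim name is hypothetical, and no storageClassName is set, so the cluster default applies (on a stock k3s install that is local-path, which ties the volume to one node).

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hermes-data              # hypothetical name
  namespace: platform-engineer
spec:
  accessModes: [ReadWriteOnce]   # single pod, so RWX is not needed
  resources:
    requests:
      storage: 5Gi
```

Deleting this claim is what the hard reset amounts to: the next sync recreates an empty volume and the seed step runs again.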

Files in this directory

| File | Purpose |
| --- | --- |
| namespace.yaml | namespace platform-engineer |
| rbac.yaml | ServiceAccount + ClusterRole (+binding), least-privilege |
| configmap.yaml | seed config.yaml + SOUL.md |
| secret.yaml | Discord token, LiteLLM key, dashboard auth (PLACEHOLDERS: fill in) |
| pvc.yaml | 5 Gi PVC for /opt/data |
| dockerfile | derived image: hermes-agent + kubectl + helm (linux/amd64) |
| build-and-push.sh | builds & pushes the image to the Gitea registry |
| deployment.yaml | Deployment (1 replica, Recreate, pinned to amd64 NUC) + Service |
| ingress.yaml | hermes.rogi.casa → dashboard (TLS + basic auth) |
| cron-seed.yaml | one-shot Job that creates the Hermes cron schedule |

Also changed outside this directory:

  • litellm/litellm.yaml — added qwen-3.6:27b model entry.
  • argocd/gen-apps.sh + argocd/apps/platform-engineer.yaml — ArgoCD Application for this folder.