# Platform Engineer Agent — Deployment Plan
An autonomous **Hermes Agent** that runs inside the k3s cluster, watches its
health on a schedule, tries to fix simple problems, and notifies me (via
Discord) when something needs my attention or a fix failed.
Docs: https://hermes-agent.nousresearch.com/docs/user-guide/docker

---
## 1. Goal & operating model
- **One Hermes container** in a new namespace `platform-engineer`, scheduled on
the powerful amd64 node (`roger-nucbox-evo-x2`, 24 GiB RAM).
- Hermes runs in **gateway mode** under s6 supervision (`command: gateway run`),
so the built-in **cron scheduler** is active and survives restarts.
- The agent talks to the cluster with `kubectl` from *inside* the container
(terminal backend = `local`). We give the pod a **ServiceAccount + ClusterRole**
scoped to read-mostly + restart/scale/delete-pod permissions.
- LLM calls are routed through the in-cluster **LiteLLM** proxy
(`litellm.rogi.casa`) — no external API keys needed in the cluster.
- Notifications go to **Discord** (reuse the pattern from `myorg-assistant`).
- A set of **cron jobs** (Hermes-native, not Kubernetes CronJobs) make the agent
run periodic checks. Watchdog checks use `[SILENT]` so it only pings me when
something is wrong.
Why Hermes-native cron (not k8s CronJobs):
- Hermes cron ticks inside the gateway, runs in an isolated agent session,
supports `[SILENT]` suppression, `deliver="discord"`, `workdir`, and
`context_from` chaining — far less plumbing than spawning a fresh pod per run.
- Cron jobs live in `~/.hermes/cron/jobs.json` on the PVC, so they survive pod
restarts and can be edited live via `hermes cron edit` without redeploying.
---
## 2. Files to create (this directory)
```
platform-engineer/
├── namespace.yaml # namespace platform-engineer
├── rbac.yaml # ServiceAccount + ClusterRole (+binding)
├── configmap.yaml # hermes config.yaml + SOUL.md + cron seed script
├── secret.yaml # DISCORD bot token, LITELLM_API_KEY, kubeconfig-less SA token
├── pvc.yaml # persistent /opt/data (HERMES_HOME)
├── dockerfile # derived image: hermes-agent + kubectl + helm
├── deployment.yaml # Deployment, schedules on amd64, mounts kube SA token
├── ingress.yaml # hermes.rogi.casa → dashboard (optional)
└── README.md # this file
```
Then add a line to `argocd/gen-apps.sh` `APPS=(...)`:
```
"platform-engineer|platform-engineer|platform-engineer|true|true"
```
and re-run `./argocd/gen-apps.sh` to generate `argocd/apps/platform-engineer.yaml`
so ArgoCD reconciles it like every other app in the repo.

---
## 3. RBAC — least privilege
ServiceAccount `platform-engineer` in ns `platform-engineer`, bound to a
**ClusterRole** scoped to *platform engineer* actions:
**Read (get/list/watch):** nodes, pods, services, deployments, statefulsets,
daemonsets, replicasets, jobs, cronjobs, events, configmaps, secrets, PVCs,
ingresses, namespaces.
**Act (patch/update on an allowlist):**
- `pods` → `delete` (force-restart a stuck pod), `patch` (`/evict`, annotations)
- `deployments`, `statefulsets`, `daemonsets`, `replicasets` → `patch` (restart
via `kubectl rollout restart` / scale), `update`
- `jobs`, `cronjobs` → `delete`, `patch`
- `pods/exec` (subresource) → `create` (only if we want the agent to `kubectl
exec` into pods for log-style debugging — optional; keep off initially)
- `events` → `get/list/watch` only
**No cluster-scoped writes** (no creating namespaces, no node taints, no RBAC
edits, no CRDs). The agent can *propose* those and tell me; it cannot do them
itself. All mutating calls are auditable via Kubernetes audit logs and
`kubectl auth can-i --as=system:serviceaccount:platform-engineer:platform-engineer`.
The pod uses the k3s in-cluster ServiceAccount token (`/var/run/secrets/...
/serviceaccount/token`) + the `KUBERNETES_SERVICE_HOST/PORT` env vars k3s already
injects — **no kubeconfig file, no long-lived token on disk**.
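
A minimal sketch of how the rules above might look in `rbac.yaml`. The ClusterRole and binding names reuse `platform-engineer` as an assumption. The `pods/log` and `*/scale` entries are also assumptions, added because `kubectl logs` and `kubectl scale` go through those subresources. The optional `pods/exec` rule is left out here.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: platform-engineer
  namespace: platform-engineer
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: platform-engineer            # assumed name
rules:
  # Read-only access (get/list/watch)
  - apiGroups: [""]
    resources: [nodes, pods, pods/log, services, events, configmaps, secrets, persistentvolumeclaims, namespaces]
    verbs: [get, list, watch]
  - apiGroups: [apps]
    resources: [deployments, statefulsets, daemonsets, replicasets]
    verbs: [get, list, watch]
  - apiGroups: [batch]
    resources: [jobs, cronjobs]
    verbs: [get, list, watch]
  - apiGroups: [networking.k8s.io]
    resources: [ingresses]
    verbs: [get, list, watch]
  # Allowlisted mutations
  - apiGroups: [""]
    resources: [pods]
    verbs: [delete, patch]
  - apiGroups: [apps]
    resources: [deployments, statefulsets, daemonsets, replicasets]
    verbs: [patch, update]
  - apiGroups: [apps]
    resources: [deployments/scale, statefulsets/scale, replicasets/scale]   # assumption: used by kubectl scale
    verbs: [get, patch, update]
  - apiGroups: [batch]
    resources: [jobs, cronjobs]
    verbs: [delete, patch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-engineer            # assumed name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: platform-engineer
subjects:
  - kind: ServiceAccount
    name: platform-engineer
    namespace: platform-engineer
```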

---
## 4. Image — thin derived Dockerfile
```dockerfile
FROM nousresearch/hermes-agent:latest
USER root
# Install kubectl from the upstream apt repo and helm from the release tarball
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl gnupg \
    && curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.35/deb/Release.key \
       | gpg --dearmor -o /usr/share/keyrings/kubernetes-apt-keyring.gpg \
    && echo 'deb [signed-by=/usr/share/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.35/deb/ /' \
       > /etc/apt/sources.list.d/kubernetes.list \
    && apt-get update \
    && apt-get install -y --no-install-recommends kubectl \
    && curl -fsSL https://get.helm.sh/helm-v3.16.0-linux-amd64.tar.gz \
       | tar -xz -C /usr/local/bin --strip-components=1 linux-amd64/helm \
    && rm -rf /var/lib/apt/lists/*
USER hermes
```
> Note: the cluster is mixed arch (arm64/amd64/arm). The agent pod is pinned to
> the amd64 node, so `linux-amd64` helm + `kubectl` packages are fine. If you
> later want it portable, switch to a multi-arch build with
> `TARGETARCH` and install matching helm arch.
Build & push to your Gitea registry (`git.rogi.casa/roger/...`) — same
`imagePullSecrets: gitea-registry` pattern as `gym-tracker`. Tag with the
hermes version + a short git sha.

---
## 5. Hermes configuration (mounted via ConfigMap → /opt/data/config.yaml)
```yaml
# config.yaml (seeded into the PVC on first boot)
model:
  provider: openai-api
  default: claude-4.5-haiku
  base_url: "https://litellm.rogi.casa/v1"
  api_mode: chat_completions
# Use a cheap, fast model for auxiliary tasks (titling, compression)
auxiliary:
  compression:
    provider: openai-api
    model: gemini-3-flash
  title_generation:
    provider: openai-api
    model: gemini-3-flash
terminal:
  backend: local
  cwd: /workspace        # a working dir for any kubectl output / scratch
  timeout: 180
  home_mode: profile     # isolate tool credentials under HERMES_HOME/home
# Unattended gateway → circuit-breaker on tool-call loops
tool_loop_guardrails:
  hard_stop_enabled: true
  hard_stop_after:
    exact_failure: 5
    idempotent_no_progress: 5
sessions:
  auto_prune: true
  retention_days: 90
cron:
  wrap_response: false   # cleaner Discord messages
memory:
  memory_enabled: true
  user_profile_enabled: true
```
`.env` (from Secret, mounted to `/opt/data/.env`):
```
OPENAI_API_KEY=<LITELLM_API_KEY value, i.e. sk-...>
OPENAI_BASE_URL=https://litellm.rogi.casa/v1
DISCORD_BOT_TOKEN=<new dedicated bot token>
DISCORD_HOME_CHANNEL=<your user/channel id for alerts>
# Dashboard auth (homelab, trusted LAN behind ingress)
HERMES_DASHBOARD_BASIC_AUTH_USERNAME=roger
HERMES_DASHBOARD_BASIC_AUTH_PASSWORD=<strong password>
```
> Why `OPENAI_API_KEY` + `OPENAI_BASE_URL`: the `openai-api` provider honours
> `OPENAI_BASE_URL`, so this is the simplest way to point Hermes at the
> in-cluster LiteLLM. `claude-4.5-haiku` / `gemini-3-flash` are the model names
> already exposed by your `litellm/litellm.yaml` ConfigMap. Section 10 later
> locks in `qwen-3.6:27b` as the main model, so set `model.default` to match.
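
One way the values above could be carried in `secret.yaml` is a Secret that uses `stringData`, which accepts plain text and avoids hand-encoding base64. This is a sketch only: the Secret name `hermes-env` is hypothetical, and every value is a placeholder to fill in.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hermes-env                 # hypothetical name
  namespace: platform-engineer
type: Opaque
stringData:                        # plain text in the manifest; stored base64-encoded by the API server
  OPENAI_API_KEY: "<LITELLM_API_KEY value>"
  OPENAI_BASE_URL: "https://litellm.rogi.casa/v1"
  DISCORD_BOT_TOKEN: "<new dedicated bot token>"
  DISCORD_HOME_CHANNEL: "<your user/channel id for alerts>"
  HERMES_DASHBOARD_BASIC_AUTH_USERNAME: "roger"
  HERMES_DASHBOARD_BASIC_AUTH_PASSWORD: "<strong password>"
```
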
`SOUL.md` (personality + guardrails) — see `configmap.yaml`. Key points:
- Identity: "Platform Engineer for the rogi.casa k3s cluster."
- Knows the cluster layout (3 nodes, ArgoCD GitOps, Traefik+cert-manager,
LiteLLM, services list).
- Operating rules: read-first; only act on the allowlisted verbs; never edit
RBAC / taints / namespaces / CRDs; when in doubt, notify instead of acting;
always cite the resource and the command used.
- How to reach me: `deliver="discord"`.
---
## 6. Deployment
- `replicas: 1` (Hermes data dir is single-writer — never scale >1).
- `nodeSelector: kubernetes.io/arch: amd64` + preferred `hardware: high-memory`
affinity → lands on the NUC.
- `resources`: requests 512Mi/250m, limits 2Gi/1 core (Hermes recommends
2-4 GiB; 1 GiB is fine without browser tools, which we keep off).
- Volume: PVC mounted at `/opt/data` (HERMES_HOME), RWX not needed (single pod).
- Ports: 8642 (gateway API, internal only) and 9119 (dashboard) → exposed via
Ingress `hermes.rogi.casa` with TLS + basic-auth (already enforced by the
`HERMES_DASHBOARD_BASIC_AUTH_*` env vars).
- `imagePullSecrets: gitea-registry`.
- env from Secret; `HERMES_DASHBOARD=1`.
- Init: on first boot the s6 `01-hermes-setup` hook seeds config/SOUL/.env from
the ConfigMap if the volume is empty. We mount the ConfigMap as a readonly
projection at `/opt/seed/` and run a tiny initContainer to copy it into
`/opt/data` only when `/opt/data/config.yaml` doesn't exist (so ArgoCD
self-heal never fights the agent's live-edited config).
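
The bullets above could translate into a Deployment roughly like the sketch below. Several names are assumptions rather than anything fixed by this plan: the PVC `hermes-data`, the ConfigMap `hermes-seed`, the Secret `hermes-env`, the `app: hermes` label, the `busybox` helper image, and passing `gateway run` as container `args`. The init container copies as root, so file ownership may need adjusting for the `hermes` user.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hermes
  namespace: platform-engineer
spec:
  replicas: 1                          # single-writer data dir, never scale above 1
  strategy:
    type: Recreate                     # stop the old pod before starting a new one
  selector:
    matchLabels: {app: hermes}
  template:
    metadata:
      labels: {app: hermes}
    spec:
      serviceAccountName: platform-engineer
      imagePullSecrets:
        - name: gitea-registry
      nodeSelector:
        kubernetes.io/arch: amd64
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - {key: hardware, operator: In, values: [high-memory]}
      initContainers:
        - name: seed-config
          image: busybox:1.36          # assumed helper image
          command:
            - sh
            - -c
            # Copy the seed files only when no config exists yet on the PVC
            - '[ -f /opt/data/config.yaml ] || cp -L /opt/seed/* /opt/data/'
          volumeMounts:
            - {name: data, mountPath: /opt/data}
            - {name: seed, mountPath: /opt/seed, readOnly: true}
      containers:
        - name: hermes
          image: "git.rogi.casa/roger/hermes-agent:<tag>"
          args: ["gateway", "run"]     # assumption: maps the plan's `command: gateway run`
          env:
            - {name: HERMES_DASHBOARD, value: "1"}
          envFrom:
            - secretRef: {name: hermes-env}
          ports:
            - {name: gateway, containerPort: 8642}
            - {name: dashboard, containerPort: 9119}
          resources:
            requests: {memory: 512Mi, cpu: 250m}
            limits: {memory: 2Gi, cpu: "1"}
          volumeMounts:
            - {name: data, mountPath: /opt/data}
      volumes:
        - name: data
          persistentVolumeClaim: {claimName: hermes-data}
        - name: seed
          configMap: {name: hermes-seed}
```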
---
## 7. Cron jobs to seed (Hermes-native)
These are written by an init script (one-shot Job `hermes-cron-seed`) that runs
`hermes cron create ...` against the gateway on first install, and is idempotent
(it checks existing job names). All deliver to Discord. Examples:
| Name | Schedule | Prompt (abbreviated) |
|------|----------|------------------------|
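| `cluster-health-check` | `every 15m` | Run `kubectl get nodes`, `kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded`, and `kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp`, considering only events from the last 20 minutes (`kubectl get events` has no `--since` flag, and nodes do not support the `status.phase` field selector). If everything is healthy, reply with only `[SILENT]`. Otherwise summarize failures and root-cause briefly. |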
| `pod-restart-loop` | `every 10m` | Find pods in `CrashLoopBackOff`/`ImagePullBackOff` across all namespaces. For `CrashLoopBackOff`, fetch logs and if a clear transient cause (OOM, config parse, missing secret) is visible, attempt `kubectl rollout restart <deploy>`; otherwise notify me with the log excerpt. Reply `[SILENT]` if none found. |
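| `pvc-pressure` | `every 30m` | `kubectl get pv` + node disk pressure via the `DiskPressure` condition in `kubectl describe nodes` (`kubectl top nodes` reports only CPU and memory, not disk). Alert if any PVC is `Bound` to a near-full volume or any node reports `DiskPressure`. `[SILENT]` otherwise. |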
| `argocd-sync-health` | `every 1h` | `kubectl get applications -n argocd -o wide` (or `argocd app sync --dry-run` if CLI present). Report any `OutOfSync`/`Degraded` app. `[SILENT]` if all `Synced`+`Healthy`. |
| `cert-expiry` | `every 1d at 09:00` | List cert-manager `Certificate` resources with expiry < 21 days. Notify only if any. `[SILENT]` otherwise. |
| `node-resource-drift` | `every 30m` | `kubectl top nodes`. Alert if any node CPU>90% or mem>90% sustained, or any node `NotReady`. `[SILENT]` otherwise. |
| `daily-cluster-report` | `0 8 * * *` | Summarize: node count/status, top 5 pods by CPU/mem, # pods not Running, # ArgoCD apps OutOfSync, cert warnings. Always deliver (no `[SILENT]`). |
Design rules baked into SOUL.md:
- **Read-only checks** run frequently (10-30m) and stay silent unless something is wrong.
- **Mutating actions** are restricted to safe idempotent ones (rollout restart,
delete stuck pod so controller recreates). Anything riskier → notify me with
a proposed command and wait for me to run it (I can reply in Discord to the
continuable thread).
- Cron sessions are isolated and **cannot create new cron jobs** (Hermes
disables that inside cron runs) → no runaway loops.
---
## 8. Safety & guardrails
1. **RBAC is the real boundary.** Even if the agent goes rogue, the SA can't
touch other namespaces' secrets beyond read, can't change RBAC, can't taint
nodes, can't create namespaces.
2. **`tool_loop_guardrails.hard_stop_enabled: true`** — circuit-breaks a stuck
gateway (recommended in the Docker doc for unattended deployments).
3. **`skills.write_approval: false` but `memory.write_approval: true`** (so the
agent can build skills/memories but I review memory writes lazily — flip
this if it gets noisy).
4. **No `pods/exec` subresource** initially (keep the agent from shelling into
workloads). Enable later only if you want log-grep-style debugging; note that
decision 5 in section 10 does grant it.
5. **Dashboard behind ingress TLS + basic auth** (the June-2026 hardening makes
auth mandatory on non-loopback binds; we satisfy it with the bundled
basic-auth provider).
6. **Single replica / single-writer PVC** — the Docker doc is explicit that two
gateways on the same `/opt/data` corrupt session/memory stores. Use a
`podAntiAffinity` so an accidental scale-up doesn't co-run.
7. **ArgoCD interaction:** keep `syncPolicy.automated.prune+selfHeal` but
exclude the live-edited hermes state. Practically: Argo owns the *manifests*
(deployment, configmap, secret, pvc), while `/opt/data` (config.yaml,
cron/jobs.json, SOUL.md edits made via the dashboard) is runtime state on the
PVC and is *not* reconciled by Argo. The ConfigMap only *seeds* it on first
boot. Document this clearly in the README so future-you doesn't expect Argo
to reset the agent's personality.
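
For reference, an ArgoCD `Application` with the automated sync policy described in item 7 might look like the sketch below. What `argocd/gen-apps.sh` actually generates is not shown in this plan, so the `project` value and the `repoURL` placeholder are assumptions. Files under `/opt/data` are not Kubernetes objects, so Argo never diffs them regardless of `selfHeal`.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-engineer
  namespace: argocd
spec:
  project: default                         # assumed project
  source:
    repoURL: "<git repo URL>"              # placeholder
    targetRevision: HEAD
    path: platform-engineer
  destination:
    server: https://kubernetes.default.svc
    namespace: platform-engineer
  syncPolicy:
    automated:
      prune: true                          # delete resources removed from git
      selfHeal: true                       # revert manual drift on managed manifests
```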
---
## 9. Rollout plan
1. Build & push the derived image to `git.rogi.casa/roger/hermes-agent` (tag
`v1.35-<sha>`).
2. Create the namespace + RBAC + Secret + ConfigMap + PVC:
`kubectl apply -f platform-engineer/`.
3. Create the `platform-engineer` Discord bot, invite it, put its token + your
channel id in `secret.yaml` (base64).
4. Apply the Deployment; wait for the pod to go Running.
5. `kubectl exec` in and run the one-shot cron seed:
`hermes cron create ...` (or apply the `cron-seed` Job).
6. Trigger the first `cluster-health-check` manually: `hermes cron run cluster-health-check`.
7. Add the app to `argocd/gen-apps.sh`, regenerate, commit, push.
---
## 10. Decisions (locked in)
1. **Notifications:** dedicated `platform-engineer` Discord bot → its own token
in `secret.yaml` (`DISCORD_BOT_TOKEN`, `DISCORD_HOME_CHANNEL`).
2. **Dashboard:** public at `hermes.rogi.casa` (Traefik TLS + cert-manager + the
bundled Hermes basic-auth provider). Reach the dashboard on port 9119; the
gateway API on 8642 is ClusterIP-only.
3. **Image:** derived image pushed to `git.rogi.casa/roger/hermes-agent`, pulled
via the existing `gitea-registry` imagePullSecret (must also exist in the
`platform-engineer` ns — see deploy steps).
4. **Model:** `qwen-3.6:27b` via the in-cluster Ollama box (`10.88.20.12:11434`),
exposed through LiteLLM as `qwen-3.6:27b`. Added to `litellm/litellm.yaml`.
Hermes reaches LiteLLM at `https://litellm.rogi.casa/v1` (never Ollama directly).
5. **pods/exec:** granted (`pods/exec` → `create` in the ClusterRole) so the
agent can `kubectl exec`/`kubectl logs` for debugging.
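
Decision 2 could be expressed in `ingress.yaml` along these lines. This is a sketch under assumptions: the Service name `hermes`, the TLS secret name `hermes-tls`, the `traefik` ingress class name, and the cert-manager issuer placeholder are not specified in this plan.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hermes
  namespace: platform-engineer
  annotations:
    cert-manager.io/cluster-issuer: "<issuer-name>"   # placeholder
spec:
  ingressClassName: traefik          # assumed class name
  tls:
    - hosts: [hermes.rogi.casa]
      secretName: hermes-tls         # hypothetical; cert-manager writes the cert here
  rules:
    - host: hermes.rogi.casa
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hermes         # assumed Service name
                port:
                  number: 9119       # dashboard only; 8642 stays ClusterIP-only
```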
---
## 11. Deployment checklist (do in this order)
1. **Add the Ollama model to LiteLLM** (already done in `litellm/litellm.yaml`):
the `qwen-3.6:27b` entry points at `http://10.88.20.12:11434`. Make sure
`qwen3.6:27b` is actually pulled on that Ollama host
(`ollama pull qwen3.6:27b`). Apply: `kubectl apply -f litellm/` and restart
the LiteLLM pod so the new config takes effect.
2. **Create the `gitea-registry` secret in the new namespace** (ArgoCD won't
create it — it's not in the repo):
```
kubectl create namespace platform-engineer
kubectl create secret docker-registry gitea-registry \
--docker-server=git.rogi.casa \
--docker-username=<your-gitea-user> \
--docker-password=<gitea-access-token> \
--docker-email=<your-email> \
-n platform-engineer
```
3. **Build & push the image:** `./platform-engineer/build-and-push.sh`
(after `docker login git.rogi.casa`).
4. **Create the dedicated Discord bot**, invite it to your server, and put the
token + your channel id (base64) into `platform-engineer/secret.yaml`. Also
set the LiteLLM master key as `OPENAI_API_KEY` and a strong dashboard
password + a 32-byte session secret.
5. **Commit & push** the whole change. ArgoCD will create the namespace
resources, deploy the pod, and bring up the ingress at `hermes.rogi.casa`.
6. **Seed the cron jobs:**
`kubectl apply -f platform-engineer/cron-seed.yaml` (one-shot Job) — it waits
for the hermes pod, then runs `hermes cron create ...` for each watchdog.
Re-run it any time you want to re-seed after a wipe.
7. **Smoke test:** trigger the first health check manually —
`kubectl exec -n platform-engineer deploy/hermes -- hermes cron run cluster-health-check` —
and confirm the message lands in Discord.
8. **ArgoCD:** the `Application` (`argocd/apps/platform-engineer.yaml`) is
already generated. After commit, Argo will reconcile it like every other app.
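
For step 1, the model entry in `litellm/litellm.yaml` might look roughly like this. It is a sketch assuming LiteLLM's `model_list` config format and its `ollama/` provider prefix; the actual entry in the repo may differ.

```yaml
model_list:
  - model_name: "qwen-3.6:27b"             # name Hermes requests through LiteLLM
    litellm_params:
      model: "ollama/qwen3.6:27b"          # assumed provider prefix + the tag pulled on the Ollama host
      api_base: "http://10.88.20.12:11434"
```
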
## 12. What ArgoCD owns vs. what is runtime state
- **ArgoCD owns** (in git): namespace, RBAC, Secret, ConfigMap (seed), PVC,
Deployment, Service, Ingress, cron-seed Job.
- **Runtime state (on the PVC, NOT reconciled):** `config.yaml`, `SOUL.md`,
`.env`, `cron/jobs.json`, `sessions/`, `memories/`, `skills/`. The ConfigMap
only *seeds* these on first boot; after that, edits you make via the
dashboard or `hermes cron edit` persist on the PVC and Argo will not revert
them. If you ever want a hard reset, delete the PVC and re-apply.
---
## Files in this directory
| File | Purpose |
|------|---------|
| `namespace.yaml` | namespace `platform-engineer` |
| `rbac.yaml` | ServiceAccount + ClusterRole (+binding), least-privilege |
| `configmap.yaml` | seed `config.yaml` + `SOUL.md` |
| `secret.yaml` | Discord token, LiteLLM key, dashboard auth (PLACEHOLDERS — fill in) |
| `pvc.yaml` | 5 Gi PVC for `/opt/data` |
| `dockerfile` | derived image: hermes-agent + kubectl + helm (linux/amd64) |
| `build-and-push.sh` | builds & pushes the image to the Gitea registry |
| `deployment.yaml` | Deployment (1 replica, Recreate, pinned to amd64 NUC) + Service |
| `ingress.yaml` | `hermes.rogi.casa` → dashboard (TLS + basic auth) |
| `cron-seed.yaml` | one-shot Job that creates the Hermes cron schedule |
Also changed outside this directory:
- `litellm/litellm.yaml` — added `qwen-3.6:27b` model entry.
- `argocd/gen-apps.sh` + `argocd/apps/platform-engineer.yaml` — ArgoCD
Application for this folder.