new platform engineer agent
74
platform-engineer/cron-seed.yaml
Normal file
@@ -0,0 +1,74 @@
# One-shot Job that seeds Hermes' built-in cron schedule on first install.
# Idempotent: skips job names that already exist.
#
# The agent's own cron jobs live in /opt/data/cron/jobs.json on the PVC and are
# NOT reconciled by ArgoCD (runtime state). A completed Job does not run again,
# so to re-seed after a wipe, delete the old Job and re-apply this manifest:
#   kubectl delete job hermes-cron-seed -n platform-engineer --ignore-not-found
#   kubectl apply -f platform-engineer/cron-seed.yaml
---
apiVersion: batch/v1
kind: Job
metadata:
  name: hermes-cron-seed
  namespace: platform-engineer
  labels:
    app: hermes
spec:
  backoffLimit: 4
  ttlSecondsAfterFinished: 86400
  template:
    metadata:
      labels:
        app: hermes
    spec:
      serviceAccountName: platform-engineer
      restartPolicy: OnFailure
      containers:
        - name: seed
          image: bitnami/kubectl:1.35
          command: ["sh", "-c"]
          args:
            - |
              set -e
              echo "Waiting for hermes pod to be Ready..."
              # This Job's own pod is also labelled app=hermes, so exclude Job pods
              # (label set by the Job controller on Kubernetes 1.27+); otherwise the
              # selector can match the seed pod itself.
              SELECTOR='app=hermes,!batch.kubernetes.io/job-name'
              kubectl -n platform-engineer wait --for=condition=Ready pod -l "$SELECTOR" --timeout=300s || true

              POD=$(kubectl -n platform-engineer get pod -l "$SELECTOR" -o jsonpath='{.items[0].metadata.name}')
              echo "Using pod: $POD"

              # Succeeds if a cron job with the given name is already listed.
              exists() { kubectl -n platform-engineer exec "$POD" -- hermes cron list 2>/dev/null | grep -qi "name=$1\| $1 "; }

              # create NAME SCHEDULE DELIVER PROMPT: create the job unless it exists.
              create() {
                name="$1"; schedule="$2"; deliver="$3"; prompt="$4"
                if exists "$name"; then
                  echo "cron job '$name' already exists, skipping"
                else
                  echo "creating cron job '$name' ..."
                  kubectl -n platform-engineer exec "$POD" -- hermes cron create "$schedule" "$prompt" --name "$name" --deliver "$deliver"
                fi
              }

              # ---- Watchdog checks (silent unless something is wrong) ----
              create "cluster-health-check" "every 15m" "discord" \
                "Run: kubectl get nodes; kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded; kubectl get events -A --field-selector type=Warning (consider only events from the last 20 minutes). If everything is healthy and there are no recent Warning events, reply with exactly [SILENT]. Otherwise give a concise per-resource summary of what is wrong (node name, pod ns/name, phase, last event)."

              create "pod-restart-loop" "every 10m" "discord" \
                "Find pods in CrashLoopBackOff or ImagePullBackOff across all namespaces (kubectl get pods -A). For each, fetch kubectl logs (previous) and describe. If the cause is clearly transient (OOM kill, a one-off config parse error that will retry cleanly, a missing Secret the controller will recreate), attempt ONE safe remediation: kubectl rollout restart of the owning Deployment/StatefulSet/DaemonSet, OR delete the single stuck pod. Report what you did in one line per resource. If the cause is not clearly transient (bad image, missing config, auth failure), do NOT act; post the log excerpt and the proposed command and wait for Roger. If no such pods exist, reply [SILENT]."

              create "pvc-pressure" "every 30m" "discord" \
                "Check cluster storage health: kubectl get pv,pvc -A; kubectl top nodes. Alert if any PVC is Pending/Lost or any node filesystem usage is over 85%. If all healthy, reply [SILENT]."

              create "argocd-sync-health" "every 1h" "discord" \
                "Run: kubectl get applications -n argocd -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status. If every app is Synced and Healthy, reply [SILENT]. Otherwise list the OutOfSync/Degraded apps with their status. Do NOT hand-edit resources to fix them (Argo will revert); just report."

              create "cert-expiry" "0 9 * * *" "discord" \
                "List all cert-manager Certificate resources (kubectl get certificates -A). For each, check notAfter. Alert on any certificate expiring in under 21 days. If none, reply [SILENT]."

              create "node-resource-drift" "every 30m" "discord" \
                "Run kubectl top nodes. If any node CPU or memory usage is over 90%, or any node is NotReady, report it with the numbers. Otherwise reply [SILENT]."

              # ---- Daily report (always delivered) ----
              create "daily-cluster-report" "0 8 * * *" "discord" \
                "Produce a daily cluster report for Roger: (1) node count + Ready/NotReady; (2) top 5 pods by CPU and by memory across all namespaces (kubectl top pods -A --sort-by=cpu and --sort-by=memory); (3) count of pods not Running; (4) ArgoCD apps OutOfSync or Degraded; (5) any certificates expiring within 30 days; (6) any recent Warning events (last 24h). Keep it under 1800 chars. Always deliver (no [SILENT])."

              echo "Done. Listing all cron jobs:"
              kubectl -n platform-engineer exec "$POD" -- hermes cron list
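
The skip-if-exists behaviour of `create` can be tried locally without a cluster. What follows is a minimal sketch, not the seed script itself: `hermes_stub`, its `name=... schedule=...` line format, and the temp-file state are hypothetical stand-ins for `kubectl exec "$POD" -- hermes cron list/create`, whose real output format is not shown in this commit.

```shell
#!/bin/sh
# Local sketch of the seed script's skip-if-exists pattern.
# ASSUMPTION: hermes_stub is a hypothetical replacement for
# `kubectl exec "$POD" -- hermes cron ...`; it stores one job per line in a temp file.
set -e

STATE=$(mktemp)

hermes_stub() {
  case "$1" in
    list)   cat "$STATE" ;;
    create) printf 'name=%s schedule=%s\n' "$2" "$3" >> "$STATE" ;;
  esac
}

# Anchor on the whole name field so "cert" does not match "cert-expiry".
exists() { hermes_stub list | grep -q "^name=$1 "; }

create() {
  name="$1"; schedule="$2"
  if exists "$name"; then
    echo "cron job '$name' already exists, skipping"
  else
    echo "creating cron job '$name' ..."
    hermes_stub create "$name" "$schedule"
  fi
}

create "cert-expiry" "0 9 * * *"
create "cert-expiry" "0 9 * * *"   # second call finds the job and skips it
create "cert" "0 9 * * *"          # different name, so it is still created

JOB_COUNT=$(wc -l < "$STATE" | tr -d ' ')
echo "jobs stored: $JOB_COUNT"
rm -f "$STATE"
```

Running the sketch twice in a row gives the same result each time, which is the property the Job relies on when it is re-applied after a wipe. The anchored match here is stricter than the substring match in the manifest; whether that is appropriate for the real `hermes cron list` output depends on its format.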