Compare commits
35 Commits
fe2f1b85f8...main
| SHA1 |
|---|
| 734962d198 |
| 4d9195b32d |
| 54579df4b3 |
| 3f3467cb13 |
| 6e02d9a885 |
| d8012dfb6c |
| bf1387dc3e |
| 2eab82b430 |
| 3cdd40153f |
| 9f74a88be7 |
| 586e95a57d |
| 9f7e34ef78 |
| b43874bdcd |
| da2bae6fa5 |
| e77e170421 |
| ec947bd58a |
| 3e57da467d |
| 9eecedc396 |
| ab6b5dc407 |
| 723693eb07 |
| 3ed4acd7ec |
| 1bcfc13047 |
| b49918ed67 |
| 66433ff0b1 |
| 872d2d0622 |
| 67732d0898 |
| 47ab20dd55 |
| c5e2a06c54 |
| a6ac71c6b5 |
| 139bb366bb |
| f6562df066 |
| 01321bf50c |
| 153cf16194 |
| ce178d06c0 |
| e359984c73 |
132
README.md
@@ -25,18 +25,25 @@ This K3s cluster manages the following services:
 ```
 .
 ├── README.md                # This file
-├── ingress.yaml             # Main Ingress configuration (Traefik)
-├── nas.yaml                 # External service for the NAS
+├── argocd-bootstrap.yaml    # App-of-apps: seed for ArgoCD (apply once)
 ├── <application>/           # Each application has its own directory
 │   ├── deployment.yaml      # Deployment definition
 │   ├── service.yaml         # Service definition
-│   ├── ingress.yaml         # Ingress configuration (optional)
+│   ├── ingress.yaml         # Ingress configuration for the application
 │   ├── namespace.yaml       # Dedicated namespace (optional)
 │   ├── configmap.yaml       # ConfigMaps (optional)
 │   └── pvc.yaml             # PersistentVolumeClaims (optional)
-└── monitoring/              # Full monitoring stack
+├── argocd/                  # ArgoCD
+│   ├── ingress.yaml         # ArgoCD Ingress
+│   ├── apps/                # Declarative Applications + AppProject
+│   └── gen-apps.sh          # Generates argocd/apps/* and argocd-bootstrap.yaml
+└── nas/                     # External service for the NAS
+    ├── transport.yaml       # Traefik ServersTransport
+    └── ingress.yaml         # NAS Ingress
 ```
 
+> **Note**: Each application has its own `ingress.yaml` inside its directory. There is no longer a centralized `ingress.yaml` at the repository root.
+
 ## 🚀 Deployment
 
 ### Prerequisites
@@ -60,15 +67,17 @@ kubectl apply -f <application>/<file>.yaml
 
 ### Deploy the Whole Cluster
 
+The recommended approach is to let ArgoCD sync the repository (see the [ArgoCD (GitOps)](#-argocd-gitops) section).
+For a manual deployment without ArgoCD:
+
 ```bash
 # Deploy all applications
 for dir in */; do
   kubectl apply -f "$dir"
 done
 
-# Or apply global resources first
-kubectl apply -f ingress.yaml
-kubectl apply -f nas.yaml
+# Or apply global resources first (optional)
+kubectl apply -f nas/
 ```
 
 ### Remove an Application
@@ -83,18 +92,103 @@ kubectl delete -f <application>/<file>.yaml
 
 ## 🌐 Ingress and Networking
 
-### Main Ingress Configuration
+### Per-Application Ingress Configuration
 
-The [ingress.yaml](ingress.yaml) file contains the centralized Ingress configuration using **Traefik** (the default K3s controller). Features:
+Each application has its own `ingress.yaml` file inside its directory, following the model of [pihole/ingress.yaml](pihole/ingress.yaml). Features:
 
-- **TLS/SSL**: Wildcard `*.rogi.casa` certificates managed by cert-manager
-- **Cloudflare Origin Issuer**: Used to generate certificates
-- **HTTPS redirect**: Automatic redirects from HTTP to HTTPS
-- **Compression**: Enabled by default
+- **Traefik**: Default K3s controller (`ingressClassName: traefik`)
+- **TLS/SSL**: Per-host certificates managed by cert-manager with the `letsencrypt-prod` cluster issuer
+- **Per-application Secret**: Each ingress has its own `<application>-tls`
+- **Dedicated namespace**: Each ingress belongs to its application's namespace
 
-### Applications with a Dedicated Ingress
+Example (`pihole/ingress.yaml`):
 
-Some applications have their own `ingress.yaml` file inside their directory for specific configurations.
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: pihole
+  namespace: pihole
+  annotations:
+    cert-manager.io/cluster-issuer: letsencrypt-prod
+spec:
+  ingressClassName: traefik
+  tls:
+    - hosts:
+        - pihole.rogi.casa
+      secretName: pihole-tls
+  rules:
+    - host: pihole.rogi.casa
+      http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: pihole-web
+                port:
+                  number: 80
+```
+
+## 🐙 ArgoCD (GitOps)
+
+All applications in the repository are deployed declaratively with ArgoCD. There is one `Application` per application directory, grouped under an `AppProject` named `k3s-cluster`.
+
+### Structure
+
+- [`argocd/apps/project.yaml`](argocd/apps/project.yaml): the `AppProject` `k3s-cluster` (sync-wave -1).
+- [`argocd/apps/<app>.yaml`](argocd/apps/): one `Application` per application (sync-wave 0), each pointing to its directory in the repository.
+- [`argocd-bootstrap.yaml`](argocd-bootstrap.yaml): the "app-of-apps" `Application` that syncs the whole `argocd/apps/` directory. It is the only resource that must be applied by hand.
+- [`argocd/gen-apps.sh`](argocd/gen-apps.sh): regenerates all of the files above from a list of applications.
+
+### Flow
+
+1. The `AppProject` and all `Application` resources are versioned in `argocd/apps/`.
+2. The `k3s-cluster-root` app (in `argocd-bootstrap.yaml`) reads `argocd/apps/` and creates or updates the project and all applications.
+3. Each `Application` syncs its directory (e.g. `pihole/`) to its namespace, with `prune` and `selfHeal` enabled.
+
+### Bootstrap (one time only)
+
+Prerequisites:
+
+1. ArgoCD installed in the cluster (namespace `argocd`).
+2. `cert-manager` installed (see [`cert-manager/install.sh`](cert-manager/install.sh)); the `ClusterIssuer` depends on its CRDs.
+3. The repository registered in ArgoCD (`Settings → Repositories`). If the repository is public on GitHub, HTTPS works without credentials.
+
+Apply the seed:
+
+```bash
+kubectl apply -f argocd-bootstrap.yaml
+```
+
+From here on, ArgoCD creates the `k3s-cluster` project and all the `Application` resources, and syncs them automatically. Any change to the repository propagates on its own (self-heal).
+
+### Recreate the cluster from scratch
+
+```bash
+# 1. Install K3s
+# 2. Install ArgoCD
+# 3. Install cert-manager
+kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
+kubectl wait --for=condition=available --timeout=120s deployment/cert-manager -n cert-manager
+# 4. Register the repo in ArgoCD (or leave it public)
+# 5. Apply the seed
+kubectl apply -f argocd-bootstrap.yaml
+```
+
+### Add or remove an application
+
+1. Create or delete the application's directory.
+2. Add or remove the corresponding line in the `APPS` array of [`argocd/gen-apps.sh`](argocd/gen-apps.sh), using the format `name|namespace|path|recurse|validate`.
+3. Run `./argocd/gen-apps.sh` to regenerate the manifests.
+4. Commit and push; ArgoCD syncs the change on its own.
+
+### Notes
+
+- The `Application`/`AppProject` resources belong to the `argocd` namespace (ArgoCD's own resources).
+- The `argocd` app has `recurse: false` on the `argocd/` directory so that it manages only `ingress.yaml` and not its own manifests under `argocd/apps/`.
+- The `phoenix` app uses `Validate=false` to tolerate the `ServiceMonitor` CRD when the Prometheus Operator is not yet installed.
+- Secrets that are not in the repository (e.g. `gitea-registry` for `gym-tracker`) must be created manually in their namespace; Argo neither manages nor deletes them.
 
 ## 💾 Data Persistence
 
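Note on the hunk above: the README describes `APPS` entries in the format `name|namespace|path|recurse|validate`. A minimal sketch of a sanity check for one such entry follows. The function `check_app_entry` is hypothetical and not part of the repository; it only assumes the five-field, pipe-separated layout stated in the README and that the last two fields are the literal strings `true` or `false`.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: checks one APPS entry of
# the form name|namespace|path|recurse|validate and prints its parsed fields.
check_app_entry() {
  local entry="$1" name ns path recurse validate extra
  # Split on '|'; any sixth field lands in "extra" and marks the entry invalid.
  IFS='|' read -r name ns path recurse validate extra <<EOF
$entry
EOF
  if [ -z "$name" ] || [ -z "$ns" ] || [ -z "$path" ] || [ -n "$extra" ]; then
    echo "invalid entry: $entry" >&2
    return 1
  fi
  # recurse and validate must be the literal strings true or false.
  case "$recurse" in true|false) ;; *) echo "bad recurse: $entry" >&2; return 1 ;; esac
  case "$validate" in true|false) ;; *) echo "bad validate: $entry" >&2; return 1 ;; esac
  echo "name=$name namespace=$ns path=$path recurse=$recurse validate=$validate"
}

# prints: name=nas namespace=nas-proxy path=nas recurse=true validate=true
check_app_entry "nas|nas-proxy|nas|true|true"
```

Running such a check over each entry before regenerating the manifests would catch a missing or misspelled field early; whether that is worth adding to `gen-apps.sh` itself is left to the author.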
@@ -218,7 +312,7 @@ kubectl get pv
 
 ## 📝 Best Practices
 
-1. **Namespaces**: Complex applications use dedicated namespaces (n8n, monitoring, phoenix)
+1. **Namespaces**: Every application has a dedicated namespace; none is left in the `default` namespace
 2. **Labels**: All resources use consistent labels to ease management
 3. **Resource Limits**: Configure CPU/memory limits to avoid overconsumption
 4. **Health Checks**: Implement liveness and readiness probes whenever possible
@@ -245,7 +339,7 @@ kubectl rollout undo deployment/<name> -n <namespace>
 ## 🌟 External Services
 
 ### NAS
-The [nas.yaml](nas.yaml) file configures an external service that points to the local NAS (10.88.88.238:5000) without deploying pods inside the cluster.
+The [nas/nas.yaml](nas/nas.yaml) file configures an external service that points to the local NAS (10.88.88.238:5000) without deploying pods inside the cluster. The corresponding Ingress is in [nas/ingress.yaml](nas/ingress.yaml).
 
 ## 📚 Additional Resources
 
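Note on the README diff above: the per-application Ingress convention it describes (Traefik class, `letsencrypt-prod` cluster issuer, one `<application>-tls` secret per ingress, ingress in the application's namespace) could be sketched as a small generator. The function `render_ingress` is hypothetical and not part of the repository. The `<application>.rogi.casa` host pattern is an assumption extrapolated from the single `pihole` example; the YAML structure follows that example.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: prints an Ingress manifest
# following the pihole/ingress.yaml model shown in the README.
# Assumption: the host is always <app>.rogi.casa.
render_ingress() {
  local app="$1" namespace="$2" service="$3" port="$4"
  local host="${app}.rogi.casa"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${app}
  namespace: ${namespace}
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - ${host}
      secretName: ${app}-tls
  rules:
    - host: ${host}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ${service}
                port:
                  number: ${port}
EOF
}

# Reproduces the README example for pihole on standard output.
render_ingress pihole pihole pihole-web 80
```

Applications whose host or backend does not follow this pattern would still need a hand-written `ingress.yaml`.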
24
argocd-bootstrap.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: k3s-cluster-root
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "-1"
+spec:
+  project: default
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: argocd/apps
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: argocd
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/argocd.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: argocd
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: argocd
+    directory:
+      recurse: false
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: argocd
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/cert-manager.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: cert-manager
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: cert-manager
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: cert-manager
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/fava.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: fava
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: fava
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: fava
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/gitea.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: gitea
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: gitea
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: gitea
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/glance.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: glance
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: glance
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: glance
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/gym-tracker.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: gym-tracker
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: gym-tracker
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: gym-tracker
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/homeassistant.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: homeassistant
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: homeassistant
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: home-assistant
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/jellyfin.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: jellyfin
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: jellyfin
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: jellyfin
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/litellm.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: litellm
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: litellm
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: litellm
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/minecraft-server.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: minecraft-server
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: minecraft-server
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: minecraft
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/monitoring.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: monitoring
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: monitoring
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: monitoring
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/myorg-assistant.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: myorg-assistant
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: myorg-assistant
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: myorg-assistant
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/n8n.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: n8n
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: n8n
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: n8n
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/nas.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: nas
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: nas
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: nas-proxy
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/openwebui.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: openwebui
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: openwebui
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: openwebui
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
25
argocd/apps/phoenix.yaml
Normal file
@@ -0,0 +1,25 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: phoenix
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: phoenix
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: phoenix
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
+      - Validate=false
24
argocd/apps/pihole.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: pihole
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: pihole
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: pihole
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/platform-engineer.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: platform-engineer
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: platform-engineer
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: platform-engineer
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
17
argocd/apps/project.yaml
Normal file
@@ -0,0 +1,17 @@
+apiVersion: argoproj.io/v1alpha1
+kind: AppProject
+metadata:
+  name: k3s-cluster
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "-1"
+spec:
+  description: Applications for the rogi.casa K3s cluster (managed in Git)
+  sourceRepos:
+    - https://git.rogi.casa/roger/k3s-cluster.git
+  destinations:
+    - server: https://kubernetes.default.svc
+      namespace: "*"
+  clusterResourceWhitelist:
+    - group: "*"
+      kind: "*"
24
argocd/apps/qbittorrent.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: qbittorrent
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: qbittorrent
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: qbittorrent
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
24
argocd/apps/vaultwarden.yaml
Normal file
@@ -0,0 +1,24 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: vaultwarden
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: https://git.rogi.casa/roger/k3s-cluster.git
+    targetRevision: main
+    path: vaultwarden
+    directory:
+      recurse: true
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: vaultwarden
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false
141
argocd/gen-apps.sh
Executable file
@@ -0,0 +1,141 @@
+#!/usr/bin/env bash
+# Generates ArgoCD Application manifests (one per app folder) + an AppProject.
+#
+# Layout produced:
+#   argocd/apps/project.yaml           -> AppProject "k3s-cluster" (sync-wave -1)
+#   argocd/apps/<app>.yaml             -> Application for that app folder
+#   argocd-bootstrap.yaml (repo root)  -> app-of-apps: syncs everything in argocd/apps/
+#
+# Bootstrap (one-time, after ArgoCD + cert-manager are installed):
+#   kubectl apply -f argocd-bootstrap.yaml
+#
+# Re-run this script after adding/removing an app folder to regenerate the manifests.
+set -euo pipefail
+
+cd "$(dirname "$0")/.." # repo root
+
+REPO="${REPO:-https://git.rogi.casa/roger/k3s-cluster.git}"
+REV="${REV:-main}"
+APPS_DIR="argocd/apps"
+mkdir -p "$APPS_DIR"
+
+# app-name | namespace | path | recurse | validate
+APPS=(
+  "argocd|argocd|argocd|false|true"
+  "cert-manager|cert-manager|cert-manager|true|true"
+  "fava|fava|fava|true|true"
+  "gitea|gitea|gitea|true|true"
+  "glance|glance|glance|true|true"
+  "gym-tracker|gym-tracker|gym-tracker|true|true"
+  "homeassistant|home-assistant|homeassistant|true|true"
+  "jellyfin|jellyfin|jellyfin|true|true"
+  "litellm|litellm|litellm|true|true"
+  "minecraft-server|minecraft|minecraft-server|true|true"
+  "monitoring|monitoring|monitoring|true|true"
+  "myorg-assistant|myorg-assistant|myorg-assistant|true|true"
+  "n8n|n8n|n8n|true|true"
+  "nas|nas-proxy|nas|true|true"
+  "openwebui|openwebui|openwebui|true|true"
+  "phoenix|phoenix|phoenix|true|false"
+  "pihole|pihole|pihole|true|true"
+  "platform-engineer|platform-engineer|platform-engineer|true|true"
+  "qbittorrent|qbittorrent|qbittorrent|true|true"
+  "vaultwarden|vaultwarden|vaultwarden|true|true"
+)
+
+# ---------------------------------------------------------------------------
+# AppProject
+# ---------------------------------------------------------------------------
+cat > "$APPS_DIR/project.yaml" <<EOF
+apiVersion: argoproj.io/v1alpha1
+kind: AppProject
+metadata:
+  name: k3s-cluster
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "-1"
+spec:
+  description: Applications for the rogi.casa K3s cluster (managed in Git)
+  sourceRepos:
+    - ${REPO}
+  destinations:
+    - server: https://kubernetes.default.svc
+      namespace: "*"
+  clusterResourceWhitelist:
+    - group: "*"
+      kind: "*"
+EOF
+
+# ---------------------------------------------------------------------------
+# One Application per app folder
+# ---------------------------------------------------------------------------
+gen_app() {
+  local name="$1" ns="$2" path="$3" recurse="$4" validate="$5"
+  local recurse_yaml validate_opts=""
+  [ "$recurse" = "true" ] && recurse_yaml="      recurse: true" || recurse_yaml="      recurse: false"
+  [ "$validate" = "false" ] && validate_opts=$'\n      - Validate=false'
+
+  cat > "$APPS_DIR/${name}.yaml" <<EOF
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: ${name}
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "0"
+spec:
+  project: k3s-cluster
+  source:
+    repoURL: ${REPO}
+    targetRevision: ${REV}
+    path: ${path}
+    directory:
+${recurse_yaml}
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: ${ns}
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=false${validate_opts}
+EOF
+}
+
+for line in "${APPS[@]}"; do
+  IFS='|' read -r name ns path recurse validate <<< "$line"
+  gen_app "$name" "$ns" "$path" "$recurse" "$validate"
+done
+
+# ---------------------------------------------------------------------------
+# Root "app-of-apps" bootstrap Application (uses the built-in default project)
+# ---------------------------------------------------------------------------
+cat > argocd-bootstrap.yaml <<EOF
+apiVersion: argoproj.io/v1alpha1
+kind: Application
metadata:
|
||||||
|
name: k3s-cluster-root
|
||||||
|
namespace: argocd
|
||||||
|
annotations:
|
||||||
|
argocd.argoproj.io/sync-wave: "-1"
|
||||||
|
spec:
|
||||||
|
project: default
|
||||||
|
source:
|
||||||
|
repoURL: ${REPO}
|
||||||
|
targetRevision: ${REV}
|
||||||
|
path: argocd/apps
|
||||||
|
directory:
|
||||||
|
recurse: true
|
||||||
|
destination:
|
||||||
|
server: https://kubernetes.default.svc
|
||||||
|
namespace: argocd
|
||||||
|
syncPolicy:
|
||||||
|
automated:
|
||||||
|
prune: true
|
||||||
|
selfHeal: true
|
||||||
|
syncOptions:
|
||||||
|
- CreateNamespace=false
|
||||||
|
EOF
|
||||||
|
|
||||||
|
echo "Generated $(find "$APPS_DIR" -name '*.yaml' | wc -l) files in $APPS_DIR/ and argocd-bootstrap.yaml"
|
||||||
@@ -35,7 +35,7 @@ data:
|
|||||||
# Clone or update the repository
|
# Clone or update the repository
|
||||||
if [ ! -d "/data/contabilitat/.git" ]; then
|
if [ ! -d "/data/contabilitat/.git" ]; then
|
||||||
echo "Cloning repository..."
|
echo "Cloning repository..."
|
||||||
git clone https://${GITEA_USERNAME}:${GITEA_PASSWORD}@gitea.rogi.casa/${GITEA_USERNAME}/contabilitat.git /data/contabilitat
|
git clone https://${GITEA_USERNAME}:${GITEA_PASSWORD}@git.rogi.casa/roger/contabilitat.git /data/contabilitat
|
||||||
else
|
else
|
||||||
echo "Repository exists, pulling latest changes..."
|
echo "Repository exists, pulling latest changes..."
|
||||||
cd /data/contabilitat
|
cd /data/contabilitat
|
||||||
|
|||||||
@@ -7,9 +7,7 @@ metadata:
|
|||||||
annotations:
|
annotations:
|
||||||
kubernetes.io/ingress.class: "traefik"
|
kubernetes.io/ingress.class: "traefik"
|
||||||
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
||||||
cert-manager.io/issuer: prod-issuer
|
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||||
cert-manager.io/issuer-kind: OriginIssuer
|
|
||||||
cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
|
|
||||||
spec:
|
spec:
|
||||||
tls:
|
tls:
|
||||||
- hosts:
|
- hosts:
|
||||||
|
|||||||
@@ -92,16 +92,7 @@ metadata:
|
|||||||
name: gitea-runner-config
|
name: gitea-runner-config
|
||||||
namespace: gitea
|
namespace: gitea
|
||||||
data:
|
data:
|
||||||
GITEA_INSTANCE_URL: "http://gitea.rogi.casa"
|
GITEA_INSTANCE_URL: "http://git.rogi.casa"
|
||||||
---
|
|
||||||
apiVersion: v1
|
|
||||||
kind: Secret
|
|
||||||
metadata:
|
|
||||||
name: gitea-runner-secret
|
|
||||||
namespace: gitea
|
|
||||||
type: Opaque
|
|
||||||
stringData:
|
|
||||||
GITEA_RUNNER_REGISTRATION_TOKEN: "BqkIGoAiwSYUFm2CPXlvvKAdSw5fl6ayCAb60zsM"
|
|
||||||
---
|
---
|
||||||
apiVersion: apps/v1
|
apiVersion: apps/v1
|
||||||
kind: Deployment
|
kind: Deployment
|
||||||
|
|||||||
@@ -1,4 +1,3 @@
|
|||||||
# gitea-ingress.yaml
|
|
||||||
apiVersion: networking.k8s.io/v1
|
apiVersion: networking.k8s.io/v1
|
||||||
kind: Ingress
|
kind: Ingress
|
||||||
metadata:
|
metadata:
|
||||||
@@ -2,6 +2,7 @@ apiVersion: v1
|
|||||||
kind: ConfigMap
|
kind: ConfigMap
|
||||||
metadata:
|
metadata:
|
||||||
name: glance-config
|
name: glance-config
|
||||||
|
namespace: glance
|
||||||
data:
|
data:
|
||||||
glance.yml: |
|
glance.yml: |
|
||||||
pages:
|
pages:
|
||||||
|
|||||||
@@ -1,7 +1,13 @@
|
|||||||
|
apiVersion: v1
|
||||||
|
kind: Namespace
|
||||||
|
metadata:
|
||||||
|
name: glance
|
||||||
|
---
|
||||||
apiVersion: apps/v1
|
apiVersion: apps/v1
|
||||||
kind: Deployment
|
kind: Deployment
|
||||||
metadata:
|
metadata:
|
||||||
name: glance
|
name: glance
|
||||||
|
namespace: glance
|
||||||
spec:
|
spec:
|
||||||
replicas: 1
|
replicas: 1
|
||||||
selector:
|
selector:
|
||||||
@@ -29,7 +35,7 @@ apiVersion: v1
|
|||||||
kind: Service
|
kind: Service
|
||||||
metadata:
|
metadata:
|
||||||
name: glance-service
|
name: glance-service
|
||||||
namespace: default
|
namespace: glance
|
||||||
spec:
|
spec:
|
||||||
type: ClusterIP
|
type: ClusterIP
|
||||||
selector:
|
selector:
|
||||||
|
|||||||
24
glance/ingress.yaml
Normal file
@@ -0,0 +1,24 @@
|
|||||||
|
apiVersion: networking.k8s.io/v1
|
||||||
|
kind: Ingress
|
||||||
|
metadata:
|
||||||
|
name: glance
|
||||||
|
namespace: glance
|
||||||
|
annotations:
|
||||||
|
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||||
|
spec:
|
||||||
|
ingressClassName: traefik
|
||||||
|
tls:
|
||||||
|
- hosts:
|
||||||
|
- glance.rogi.casa
|
||||||
|
secretName: glance-tls
|
||||||
|
rules:
|
||||||
|
- host: glance.rogi.casa
|
||||||
|
http:
|
||||||
|
paths:
|
||||||
|
- path: /
|
||||||
|
pathType: Prefix
|
||||||
|
backend:
|
||||||
|
service:
|
||||||
|
name: glance-service
|
||||||
|
port:
|
||||||
|
number: 80
|
||||||
@@ -1,7 +1,13 @@
|
|||||||
|
apiVersion: v1
|
||||||
|
kind: Namespace
|
||||||
|
metadata:
|
||||||
|
name: gym-tracker
|
||||||
|
---
|
||||||
apiVersion: apps/v1
|
apiVersion: apps/v1
|
||||||
kind: Deployment
|
kind: Deployment
|
||||||
metadata:
|
metadata:
|
||||||
name: gym-tracker
|
name: gym-tracker
|
||||||
|
namespace: gym-tracker
|
||||||
labels:
|
labels:
|
||||||
app: gym-tracker
|
app: gym-tracker
|
||||||
spec:
|
spec:
|
||||||
@@ -18,7 +24,7 @@ spec:
|
|||||||
- name: gitea-registry
|
- name: gitea-registry
|
||||||
containers:
|
containers:
|
||||||
- name: gym-tracker
|
- name: gym-tracker
|
||||||
image: gitea.rogi.casa/roger/gym-tracker/gym-tracker:3ba68d6
|
image: git.rogi.casa/roger/gym-tracker/gym-tracker:945910a
|
||||||
imagePullPolicy: Always
|
imagePullPolicy: Always
|
||||||
ports:
|
ports:
|
||||||
- containerPort: 80
|
- containerPort: 80
|
||||||
@@ -67,6 +73,7 @@ apiVersion: v1
|
|||||||
kind: Service
|
kind: Service
|
||||||
metadata:
|
metadata:
|
||||||
name: gym-tracker
|
name: gym-tracker
|
||||||
|
namespace: gym-tracker
|
||||||
labels:
|
labels:
|
||||||
app: gym-tracker
|
app: gym-tracker
|
||||||
spec:
|
spec:
|
||||||
@@ -87,6 +94,7 @@ apiVersion: v1
|
|||||||
kind: PersistentVolumeClaim
|
kind: PersistentVolumeClaim
|
||||||
metadata:
|
metadata:
|
||||||
name: gym-tracker-data
|
name: gym-tracker-data
|
||||||
|
namespace: gym-tracker
|
||||||
labels:
|
labels:
|
||||||
app: gym-tracker
|
app: gym-tracker
|
||||||
spec:
|
spec:
|
||||||
|
|||||||
24
gym-tracker/ingress.yaml
Normal file
@@ -0,0 +1,24 @@
|
|||||||
|
apiVersion: networking.k8s.io/v1
|
||||||
|
kind: Ingress
|
||||||
|
metadata:
|
||||||
|
name: gym-tracker
|
||||||
|
namespace: gym-tracker
|
||||||
|
annotations:
|
||||||
|
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||||
|
spec:
|
||||||
|
ingressClassName: traefik
|
||||||
|
tls:
|
||||||
|
- hosts:
|
||||||
|
- gym.rogi.casa
|
||||||
|
secretName: gym-tracker-tls
|
||||||
|
rules:
|
||||||
|
- host: gym.rogi.casa
|
||||||
|
http:
|
||||||
|
paths:
|
||||||
|
- path: /
|
||||||
|
pathType: Prefix
|
||||||
|
backend:
|
||||||
|
service:
|
||||||
|
name: gym-tracker
|
||||||
|
port:
|
||||||
|
number: 80
|
||||||
@@ -32,7 +32,9 @@ data:
|
|||||||
http:
|
http:
|
||||||
use_x_forwarded_for: true
|
use_x_forwarded_for: true
|
||||||
trusted_proxies:
|
trusted_proxies:
|
||||||
- 10.88.88.0/24
|
- 10.42.0.0/16 # k3s pod CIDR (Traefik pod lives here)
|
||||||
|
- 10.43.0.0/16 # k3s service CIDR
|
||||||
|
- 10.88.20.0/24 # node subnet (Traefik runs hostNetwork-ish, forwards from 10.88.20.11)
|
||||||
---
|
---
|
||||||
apiVersion: apps/v1
|
apiVersion: apps/v1
|
||||||
kind: Deployment
|
kind: Deployment
|
||||||
|
|||||||
24
homeassistant/ingress.yaml
Normal file
@@ -0,0 +1,24 @@
|
|||||||
|
apiVersion: networking.k8s.io/v1
|
||||||
|
kind: Ingress
|
||||||
|
metadata:
|
||||||
|
name: homeassistant
|
||||||
|
namespace: home-assistant
|
||||||
|
annotations:
|
||||||
|
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||||
|
spec:
|
||||||
|
ingressClassName: traefik
|
||||||
|
tls:
|
||||||
|
- hosts:
|
||||||
|
- homeassistant.rogi.casa
|
||||||
|
secretName: homeassistant-tls
|
||||||
|
rules:
|
||||||
|
- host: homeassistant.rogi.casa
|
||||||
|
http:
|
||||||
|
paths:
|
||||||
|
- path: /
|
||||||
|
pathType: Prefix
|
||||||
|
backend:
|
||||||
|
service:
|
||||||
|
name: home-assistant
|
||||||
|
port:
|
||||||
|
number: 80
|
||||||
307
ingress.yaml
@@ -1,307 +0,0 @@
|
|||||||
apiVersion: networking.k8s.io/v1
|
|
||||||
kind: Ingress
|
|
||||||
metadata:
|
|
||||||
name: rogicasa-ingress
|
|
||||||
namespace: default
|
|
||||||
annotations:
|
|
||||||
# Use Traefik as the ingress controller (default in k3s)
|
|
||||||
kubernetes.io/ingress.class: "traefik"
|
|
||||||
# Enable SSL redirect
|
|
||||||
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
|
||||||
# Optional: enable compression
|
|
||||||
traefik.ingress.kubernetes.io/compress: "true"
|
|
||||||
cert-manager.io/issuer: prod-issuer
|
|
||||||
cert-manager.io/issuer-kind: OriginIssuer
|
|
||||||
cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
|
|
||||||
spec:
|
|
||||||
tls:
|
|
||||||
- hosts:
|
|
||||||
- "*.rogi.casa"
|
|
||||||
secretName: rogicasa-tls
|
|
||||||
rules:
|
|
||||||
- host: glance.rogi.casa
|
|
||||||
http:
|
|
||||||
paths:
|
|
||||||
- path: /
|
|
||||||
pathType: Prefix
|
|
||||||
backend:
|
|
||||||
service:
|
|
||||||
name: glance-service
|
|
||||||
port:
|
|
||||||
number: 80
|
|
||||||
- host: pihole.rogi.casa
|
|
||||||
http:
|
|
||||||
paths:
|
|
||||||
- path: /
|
|
||||||
pathType: Prefix
|
|
||||||
backend:
|
|
||||||
service:
|
|
||||||
name: pihole-web
|
|
||||||
port:
|
|
||||||
number: 80
|
|
||||||
- host: litellm.rogi.casa
|
|
||||||
http:
|
|
||||||
paths:
|
|
||||||
- path: /
|
|
||||||
pathType: Prefix
|
|
||||||
backend:
|
|
||||||
service:
|
|
||||||
name: litellm-service
|
|
||||||
port:
|
|
||||||
number: 80
|
|
||||||
- host: openai.rogi.casa
|
|
||||||
http:
|
|
||||||
paths:
|
|
||||||
- path: /
|
|
||||||
pathType: Prefix
|
|
||||||
backend:
|
|
||||||
service:
|
|
||||||
name: open-webui-service
|
|
||||||
port:
|
|
||||||
number: 80
|
|
||||||
- host: gym.rogi.casa
|
|
||||||
http:
|
|
||||||
paths:
|
|
||||||
- path: /
|
|
||||||
pathType: Prefix
|
|
||||||
backend:
|
|
||||||
service:
|
|
||||||
name: gym-tracker
|
|
||||||
port:
|
|
||||||
number: 80
|
|
||||||
---
|
|
||||||
apiVersion: networking.k8s.io/v1
|
|
||||||
kind: Ingress
|
|
||||||
metadata:
|
|
||||||
name: gitea-ingress
|
|
||||||
namespace: gitea
|
|
||||||
annotations:
|
|
||||||
# Use Traefik as the ingress controller (default in k3s)
|
|
||||||
kubernetes.io/ingress.class: "traefik"
|
|
||||||
# Enable SSL redirect
|
|
||||||
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
|
||||||
# Optional: enable compression
|
|
||||||
traefik.ingress.kubernetes.io/compress: "true"
|
|
||||||
cert-manager.io/issuer: prod-issuer
|
|
||||||
cert-manager.io/issuer-kind: OriginIssuer
|
|
||||||
cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
|
|
||||||
spec:
|
|
||||||
tls:
|
|
||||||
- hosts:
|
|
||||||
- "*.rogi.casa"
|
|
||||||
secretName: rogicasa-tls
|
|
||||||
rules:
|
|
||||||
- host: gitea.rogi.casa
|
|
||||||
http:
|
|
||||||
paths:
|
|
||||||
- path: /
|
|
||||||
pathType: Prefix
|
|
||||||
backend:
|
|
||||||
service:
|
|
||||||
name: gitea
|
|
||||||
port:
|
|
||||||
number: 80
|
|
||||||
---
|
|
||||||
apiVersion: networking.k8s.io/v1
|
|
||||||
kind: Ingress
|
|
||||||
metadata:
|
|
||||||
name: monitoring-ingress
|
|
||||||
namespace: monitoring
|
|
||||||
annotations:
|
|
||||||
# Use Traefik as the ingress controller (default in k3s)
|
|
||||||
kubernetes.io/ingress.class: "traefik"
|
|
||||||
# Enable SSL redirect
|
|
||||||
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
|
||||||
# Optional: enable compression
|
|
||||||
traefik.ingress.kubernetes.io/compress: "true"
|
|
||||||
cert-manager.io/issuer: prod-issuer
|
|
||||||
cert-manager.io/issuer-kind: OriginIssuer
|
|
||||||
cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
|
|
||||||
spec:
|
|
||||||
tls:
|
|
||||||
- hosts:
|
|
||||||
- "*.rogi.casa"
|
|
||||||
secretName: rogicasa-tls
|
|
||||||
rules:
|
|
||||||
- host: grafana.rogi.casa
|
|
||||||
http:
|
|
||||||
paths:
|
|
||||||
- path: /
|
|
||||||
pathType: Prefix
|
|
||||||
backend:
|
|
||||||
service:
|
|
||||||
name: grafana
|
|
||||||
port:
|
|
||||||
number: 80
|
|
||||||
- host: prometheus.rogi.casa
|
|
||||||
http:
|
|
||||||
paths:
|
|
||||||
- path: /
|
|
||||||
pathType: Prefix
|
|
||||||
backend:
|
|
||||||
service:
|
|
||||||
name: prometheus-k8s
|
|
||||||
port:
|
|
||||||
number: 80
|
|
||||||
---
|
|
||||||
apiVersion: networking.k8s.io/v1
|
|
||||||
kind: Ingress
|
|
||||||
metadata:
|
|
||||||
name: vaultwarden-ingress
|
|
||||||
namespace: vaultwarden
|
|
||||||
annotations:
|
|
||||||
# Use Traefik as the ingress controller (default in k3s)
|
|
||||||
kubernetes.io/ingress.class: "traefik"
|
|
||||||
# Enable SSL redirect
|
|
||||||
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
|
||||||
# Optional: enable compression
|
|
||||||
traefik.ingress.kubernetes.io/compress: "true"
|
|
||||||
cert-manager.io/issuer: prod-issuer
|
|
||||||
cert-manager.io/issuer-kind: OriginIssuer
|
|
||||||
cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
|
|
||||||
spec:
|
|
||||||
tls:
|
|
||||||
- hosts:
|
|
||||||
- "*.rogi.casa"
|
|
||||||
secretName: rogicasa-tls
|
|
||||||
rules:
|
|
||||||
- host: vaultwarden.rogi.casa
|
|
||||||
http:
|
|
||||||
paths:
|
|
||||||
- path: /
|
|
||||||
pathType: Prefix
|
|
||||||
backend:
|
|
||||||
service:
|
|
||||||
name: vaultwarden
|
|
||||||
port:
|
|
||||||
number: 80
|
|
||||||
---
|
|
||||||
apiVersion: networking.k8s.io/v1
|
|
||||||
kind: Ingress
|
|
||||||
metadata:
|
|
||||||
name: homeassistant-ingress
|
|
||||||
namespace: home-assistant
|
|
||||||
annotations:
|
|
||||||
# Use Traefik as the ingress controller (default in k3s)
|
|
||||||
kubernetes.io/ingress.class: "traefik"
|
|
||||||
# Enable SSL redirect
|
|
||||||
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
|
||||||
# Optional: enable compression
|
|
||||||
traefik.ingress.kubernetes.io/compress: "true"
|
|
||||||
cert-manager.io/issuer: prod-issuer
|
|
||||||
cert-manager.io/issuer-kind: OriginIssuer
|
|
||||||
cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
|
|
||||||
spec:
|
|
||||||
tls:
|
|
||||||
- hosts:
|
|
||||||
- "*.rogi.casa"
|
|
||||||
secretName: rogicasa-tls
|
|
||||||
rules:
|
|
||||||
- host: homeassistant.rogi.casa
|
|
||||||
http:
|
|
||||||
paths:
|
|
||||||
- path: /
|
|
||||||
pathType: Prefix
|
|
||||||
backend:
|
|
||||||
service:
|
|
||||||
name: home-assistant
|
|
||||||
port:
|
|
||||||
number: 80
|
|
||||||
---
|
|
||||||
apiVersion: networking.k8s.io/v1
|
|
||||||
kind: Ingress
|
|
||||||
metadata:
|
|
||||||
name: minecraft-ingress
|
|
||||||
namespace: minecraft
|
|
||||||
annotations:
|
|
||||||
# Use Traefik as the ingress controller (default in k3s)
|
|
||||||
kubernetes.io/ingress.class: "traefik"
|
|
||||||
# Enable SSL redirect
|
|
||||||
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
|
||||||
# Optional: enable compression
|
|
||||||
traefik.ingress.kubernetes.io/compress: "true"
|
|
||||||
cert-manager.io/issuer: prod-issuer
|
|
||||||
cert-manager.io/issuer-kind: OriginIssuer
|
|
||||||
cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
|
|
||||||
spec:
|
|
||||||
tls:
|
|
||||||
- hosts:
|
|
||||||
- "*.rogi.casa"
|
|
||||||
secretName: rogicasa-tls
|
|
||||||
rules:
|
|
||||||
- host: minecraft.rogi.casa
|
|
||||||
http:
|
|
||||||
paths:
|
|
||||||
- path: /
|
|
||||||
pathType: Prefix
|
|
||||||
backend:
|
|
||||||
service:
|
|
||||||
name: minecraft-server
|
|
||||||
port:
|
|
||||||
number: 25565
|
|
||||||
---
|
|
||||||
apiVersion: networking.k8s.io/v1
|
|
||||||
kind: Ingress
|
|
||||||
metadata:
|
|
||||||
name: argocd-ingress
|
|
||||||
namespace: argocd
|
|
||||||
annotations:
|
|
||||||
# Use Traefik as the ingress controller (default in k3s)
|
|
||||||
kubernetes.io/ingress.class: "traefik"
|
|
||||||
# Enable SSL redirect
|
|
||||||
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
|
||||||
# Optional: enable compression
|
|
||||||
traefik.ingress.kubernetes.io/compress: "true"
|
|
||||||
cert-manager.io/issuer: prod-issuer
|
|
||||||
cert-manager.io/issuer-kind: OriginIssuer
|
|
||||||
cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
|
|
||||||
spec:
|
|
||||||
tls:
|
|
||||||
- hosts:
|
|
||||||
- "*.rogi.casa"
|
|
||||||
secretName: rogicasa-tls
|
|
||||||
rules:
|
|
||||||
- host: argocd.rogi.casa
|
|
||||||
http:
|
|
||||||
paths:
|
|
||||||
- path: /
|
|
||||||
pathType: Prefix
|
|
||||||
backend:
|
|
||||||
service:
|
|
||||||
name: argocd-server
|
|
||||||
port:
|
|
||||||
number: 80
|
|
||||||
---
|
|
||||||
apiVersion: networking.k8s.io/v1
|
|
||||||
kind: Ingress
|
|
||||||
metadata:
|
|
||||||
name: nas-ingress
|
|
||||||
namespace: default
|
|
||||||
annotations:
|
|
||||||
# Use Traefik as the ingress controller (default in k3s)
|
|
||||||
kubernetes.io/ingress.class: "traefik"
|
|
||||||
# Enable SSL redirect
|
|
||||||
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
|
||||||
# Optional: enable compression
|
|
||||||
traefik.ingress.kubernetes.io/compress: "true"
|
|
||||||
# Allow large file uploads (5GB) for NAS
|
|
||||||
traefik.ingress.kubernetes.io/max-request-body-bytes: "5368709120"
|
|
||||||
cert-manager.io/issuer: prod-issuer
|
|
||||||
cert-manager.io/issuer-kind: OriginIssuer
|
|
||||||
cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
|
|
||||||
spec:
|
|
||||||
tls:
|
|
||||||
- hosts:
|
|
||||||
- "*.rogi.casa"
|
|
||||||
secretName: rogicasa-tls
|
|
||||||
rules:
|
|
||||||
- host: nas.rogi.casa
|
|
||||||
http:
|
|
||||||
paths:
|
|
||||||
- path: /
|
|
||||||
pathType: Prefix
|
|
||||||
backend:
|
|
||||||
service:
|
|
||||||
name: external-ip
|
|
||||||
port:
|
|
||||||
number: 80
|
|
||||||
@@ -7,14 +7,12 @@ metadata:
|
|||||||
kubernetes.io/ingress.class: "traefik"
|
kubernetes.io/ingress.class: "traefik"
|
||||||
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
||||||
traefik.ingress.kubernetes.io/compress: "true"
|
traefik.ingress.kubernetes.io/compress: "true"
|
||||||
cert-manager.io/issuer: prod-issuer
|
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||||
cert-manager.io/issuer-kind: OriginIssuer
|
|
||||||
cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
|
|
||||||
spec:
|
spec:
|
||||||
tls:
|
tls:
|
||||||
- hosts:
|
- hosts:
|
||||||
- "*.rogi.casa"
|
- jellyfin.rogi.casa
|
||||||
secretName: rogicasa-tls
|
secretName: jellyfin-tls
|
||||||
rules:
|
rules:
|
||||||
- host: jellyfin.rogi.casa
|
- host: jellyfin.rogi.casa
|
||||||
http:
|
http:
|
||||||
|
|||||||
@@ -115,7 +115,7 @@ spec:
|
|||||||
accessModes:
|
accessModes:
|
||||||
- ReadWriteMany
|
- ReadWriteMany
|
||||||
nfs:
|
nfs:
|
||||||
server: 10.88.88.238
|
server: 10.88.30.10
|
||||||
path: /volume1/jellyfin/media
|
path: /volume1/jellyfin/media
|
||||||
persistentVolumeReclaimPolicy: Retain
|
persistentVolumeReclaimPolicy: Retain
|
||||||
---
|
---
|
||||||
|
|||||||
24
litellm/ingress.yaml
Normal file
@@ -0,0 +1,24 @@
|
|||||||
|
apiVersion: networking.k8s.io/v1
|
||||||
|
kind: Ingress
|
||||||
|
metadata:
|
||||||
|
name: litellm
|
||||||
|
namespace: litellm
|
||||||
|
annotations:
|
||||||
|
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||||
|
spec:
|
||||||
|
ingressClassName: traefik
|
||||||
|
tls:
|
||||||
|
- hosts:
|
||||||
|
- litellm.rogi.casa
|
||||||
|
secretName: litellm-tls
|
||||||
|
rules:
|
||||||
|
- host: litellm.rogi.casa
|
||||||
|
http:
|
||||||
|
paths:
|
||||||
|
- path: /
|
||||||
|
pathType: Prefix
|
||||||
|
backend:
|
||||||
|
service:
|
||||||
|
name: litellm-service
|
||||||
|
port:
|
||||||
|
number: 80
|
||||||
@@ -1,7 +1,13 @@
|
|||||||
apiVersion: v1
|
apiVersion: v1
|
||||||
|
kind: Namespace
|
||||||
|
metadata:
|
||||||
|
name: litellm
|
||||||
|
---
|
||||||
|
apiVersion: v1
|
||||||
kind: ConfigMap
|
kind: ConfigMap
|
||||||
metadata:
|
metadata:
|
||||||
name: litellm-config-file
|
name: litellm-config-file
|
||||||
|
namespace: litellm
|
||||||
data:
|
data:
|
||||||
config.yaml: |
|
config.yaml: |
|
||||||
model_list:
|
model_list:
|
||||||
@@ -20,7 +26,12 @@ data:
|
|||||||
- model_name: glm-4.7-flash
|
- model_name: glm-4.7-flash
|
||||||
litellm_params:
|
litellm_params:
|
||||||
model: ollama/glm-4.7-flash
|
model: ollama/glm-4.7-flash
|
||||||
api_base: http://10.88.88.235:11434
|
api_base: http://10.88.20.12:11434
|
||||||
|
# Used by the platform-engineer Hermes agent (deployed in ns platform-engineer).
|
||||||
|
- model_name: qwen-3.6:27b
|
||||||
|
litellm_params:
|
||||||
|
model: ollama/qwen3.6:27b
|
||||||
|
api_base: http://10.88.20.12:11434
|
||||||
litellm_settings:
|
litellm_settings:
|
||||||
#set_verbose: True # Uncomment this if you want to see verbose logs; not recommended in production
|
#set_verbose: True # Uncomment this if you want to see verbose logs; not recommended in production
|
||||||
callbacks: ["arize_phoenix"]
|
callbacks: ["arize_phoenix"]
|
||||||
@@ -50,6 +61,7 @@ apiVersion: apps/v1
|
|||||||
kind: Deployment
|
kind: Deployment
|
||||||
metadata:
|
metadata:
|
||||||
name: litellm-deployment
|
name: litellm-deployment
|
||||||
|
namespace: litellm
|
||||||
labels:
|
labels:
|
||||||
app: litellm
|
app: litellm
|
||||||
spec:
|
spec:
|
||||||
@@ -88,7 +100,7 @@ apiVersion: v1
|
|||||||
kind: Service
|
kind: Service
|
||||||
metadata:
|
metadata:
|
||||||
name: litellm-service
|
name: litellm-service
|
||||||
namespace: default
|
namespace: litellm
|
||||||
spec:
|
spec:
|
||||||
type: ClusterIP
|
type: ClusterIP
|
||||||
selector:
|
selector:
|
||||||
|
|||||||
@@ -18,6 +18,7 @@ apiVersion: v1
|
|||||||
kind: PersistentVolumeClaim
|
kind: PersistentVolumeClaim
|
||||||
metadata:
|
metadata:
|
||||||
name: postgres-volume-claim
|
name: postgres-volume-claim
|
||||||
|
namespace: litellm
|
||||||
labels:
|
labels:
|
||||||
app: postgres
|
app: postgres
|
||||||
spec:
|
spec:
|
||||||
@@ -32,6 +33,7 @@ apiVersion: apps/v1
|
|||||||
kind: Deployment
|
kind: Deployment
|
||||||
metadata:
|
metadata:
|
||||||
name: postgres
|
name: postgres
|
||||||
|
namespace: litellm
|
||||||
spec:
|
spec:
|
||||||
replicas: 1
|
replicas: 1
|
||||||
selector:
|
selector:
|
||||||
@@ -63,6 +65,7 @@ apiVersion: v1
|
|||||||
kind: Service
|
kind: Service
|
||||||
metadata:
|
metadata:
|
||||||
name: postgres
|
name: postgres
|
||||||
|
namespace: litellm
|
||||||
labels:
|
labels:
|
||||||
app: postgres
|
app: postgres
|
||||||
spec:
|
spec:
|
||||||
|
|||||||
@@ -12,10 +12,12 @@ metadata:
|
|||||||
labels:
|
labels:
|
||||||
app: minecraft-server
|
app: minecraft-server
|
||||||
spec:
|
spec:
|
||||||
type: ClusterIP
|
type: LoadBalancer
|
||||||
|
loadBalancerIP: 10.88.20.103
|
||||||
ports:
|
ports:
|
||||||
- name: minecraft
|
- name: minecraft
|
||||||
port: 25565
|
port: 25565
|
||||||
targetPort: 25565
|
targetPort: 25565
|
||||||
|
protocol: TCP
|
||||||
selector:
|
selector:
|
||||||
app: minecraft-server
|
app: minecraft-server
|
||||||
|
|||||||
354
monitoring/dashboard-ideas.md
Normal file
@@ -0,0 +1,354 @@
|
|||||||
|
# Dashboard Ideas
|
||||||
|
|
||||||
|
This file collects ideas for additional Grafana dashboards to build for the
|
||||||
|
`rogi.casa` k3s cluster. Each idea notes the **data source** (metrics already
|
||||||
|
available vs. metrics that need to be enabled) and a rough panel layout.
|
||||||
|
|
||||||
|
To actually add a dashboard, create a `grafana-dashboard-<name>.yaml` ConfigMap
|
||||||
|
in this folder, mount it in `grafana-deployment.yaml` (add a volume +
|
||||||
|
volumeMount under `/var/lib/grafana/dashboards/<name>`), commit and push.
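
As a rough sketch of the first step, the ConfigMap manifest could be rendered with a small shell helper. The `gen_dashboard_cm` function, the ConfigMap name, the `monitoring` namespace, and the `<name>.json` data key are assumptions made for illustration, not something this repository defines. The volume and volumeMount in `grafana-deployment.yaml` still have to be added by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a ConfigMap manifest wrapping a dashboard JSON file.
# Assumed (not taken from the repo): ConfigMap name, namespace, and data key.
gen_dashboard_cm() {
  local name="$1" json_file="$2"
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-${name}
  namespace: monitoring
data:
  ${name}.json: |
$(sed 's/^/    /' "$json_file")
EOF
}

# Demo with a throwaway dashboard file.
demo_json="$(mktemp)"
printf '{\n  "title": "Traefik"\n}\n' > "$demo_json"
gen_dashboard_cm traefik "$demo_json"
rm -f "$demo_json"
```

Redirecting the output to `grafana-dashboard-traefik.yaml` in this folder would give a file that follows the naming convention described above.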
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Already-scraped services (ready to dashboard now)
|
||||||
|
|
||||||
|
These exporters/services are **already being scraped by Prometheus** — dashboards
|
||||||
|
can be built immediately with no infra changes.
|
||||||
|
|
||||||
|
### 1. Traefik (Ingress) — `traefik_*`
|
||||||
|
Traefik is scraped via the `kubernetes-pods` job (pod annotation on
|
||||||
|
`traefik-9bcdbbd9-x8zq4` in `kube-system`). It exposes request counters, entry
|
||||||
|
point latency, TLS handshakes, config reloads.
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Requests/sec by entrypoint (web / websecure / traefik) — `rate(traefik_entrypoint_requests_total[5m])`
|
||||||
|
- Request latency p50/p95/p99 — `histogram_quantile(0.95, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le, entrypoint))`
|
||||||
|
- HTTP status code distribution (2xx/3xx/4xx/5xx) — `sum by (code) (rate(traefik_entrypoint_requests_total{code=~"[2-5].."}[5m]))`
|
||||||
|
- TLS requests/sec by version and cipher — `sum by (tls_version, tls_cipher) (rate(traefik_entrypoint_requests_tls_total[5m]))`
|
||||||
|
- Config reloads + last reload success — `traefik_config_reloads_total`, `traefik_config_last_reload_success`
|
||||||
|
- Top routes/services by request volume — `topk(10, sum by (service) (rate(traefik_service_requests_total[5m])))`
|
||||||
|
- Bytes transferred in/out — `rate(traefik_entrypoint_requests_bytes_total[5m])`, `rate(traefik_entrypoint_responses_bytes_total[5m])`
|
||||||
|
|
||||||
|
**Why useful:** This is your front door. Knowing which routes get hit most,
|
||||||
|
latency per ingress, and 5xx spikes is the single most valuable app-level
|
||||||
|
dashboard in the cluster.
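
Traefik reports the numeric status code in the `code` label (for example `200` or `404`), so a status-class panel has to bucket codes by their first digit. A minimal shell sketch of that bucketing, where `status_class` is a hypothetical helper name used only for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a numeric HTTP status code to its class (2xx, 4xx, ...).
status_class() {
  case "$1" in
    [1-5][0-9][0-9]) echo "${1%??}xx" ;;   # keep the first digit, e.g. 404 -> 4xx
    *)               echo "unknown" ;;     # anything that is not a three-digit 1xx-5xx code
  esac
}

for code in 200 301 404 503; do
  echo "$code -> $(status_class "$code")"
done
```

In PromQL the same grouping is usually expressed with a regex matcher such as `code=~"5.."`.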
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 2. CoreDNS (cluster DNS) — `coredns_*`
|
||||||
|
Scraped via `kube-dns` Service annotation. Exposes query rate, cache hits,
|
||||||
|
error types, response duration.
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- DNS queries/sec by zone / type — `rate(coredns_dns_requests_total[5m])`
|
||||||
|
- Cache hit ratio — `sum(rate(coredns_cache_hits_total[5m])) / sum(rate(coredns_cache_requests_total[5m]))`
|
||||||
|
- DNS query latency p95 — `histogram_quantile(0.95, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))`
|
||||||
|
- Queries by response code (NOERROR / NXDOMAIN / SERVFAIL) — `rate(coredns_dns_responses_total[5m])`
|
||||||
|
- Cache size — `coredns_cache_entries`
|
||||||
|
- Forward requests/sec (upstream DNS) — `rate(coredns_forward_requests_total[5m])`
|
||||||
|
|
||||||
|
**Why useful:** DNS issues cause cascading failures (ImagePullBackOff, cert
|
||||||
|
challenges, etc.). A spike in NXDOMAIN/SERVFAIL is an early warning.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 3. MetalLB (LoadBalancer) — `metallb_*`
|
||||||
|
Scraped via pod annotation on `speaker-*` and `controller` in `metallb-system`.
|
||||||
|
Exposes IP allocation usage, BGP/session state.
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- IP addresses in use vs. total — `metallb_allocator_addresses_in_use_total` / `metallb_allocator_addresses_total`
|
||||||
|
- IP pool utilization % (gauge) — `metallb_allocator_addresses_in_use_total / metallb_allocator_addresses_total * 100`
|
||||||
|
- BGP session up per speaker — `metallb_bgp_session_up`
|
||||||
|
- Config loaded / stale status — `metallb_k8s_client_config_loaded_bool`, `metallb_k8s_client_config_stale_bool`
|
||||||
|
- Announcements per speaker — `rate(metallb_bgp_announcements_total[5m])`
|
||||||
|
|
||||||
|
**Why useful:** If MetalLB runs out of IPs, new LoadBalancer services will
|
||||||
|
hang in `<pending>`. Knowing pool utilization lets you act before that happens.
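
The utilization figure is simply in-use addresses divided by pool size. A small sketch of the arithmetic, using made-up numbers and a hypothetical `pool_utilization_pct` helper (integer math, so the result is rounded down):

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of a MetalLB address pool that is allocated.
# Integer arithmetic, so the result is rounded down.
pool_utilization_pct() {
  local in_use="$1" total="$2"
  if [ "$total" -le 0 ]; then
    echo "n/a"          # avoid dividing by zero for an empty pool
    return 0
  fi
  echo $(( in_use * 100 / total ))
}

# Illustrative values only: 7 of 10 addresses handed out.
echo "pool utilization: $(pool_utilization_pct 7 10)%"
```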
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 4. cert-manager (TLS certificates) — `certmanager_*`
|
||||||
|
Scraped via pod annotations on cert-manager pods. Exposes certificate
|
||||||
|
expiration, renewal, ready status, ACME challenges.
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Certificate expiration (days remaining, sorted) — table of `(certmanager_certificate_not_after_timestamp_seconds - time()) / 86400`
|
||||||
|
- Certificates not Ready — `certmanager_certificate_ready_status{condition!="True"} == 1`
|
||||||
|
- Upcoming renewals (next 14 days) — `certmanager_certificate_renewal_timestamp_seconds`
|
||||||
|
- ACME challenge status — `certmanager_certificate_challenge_status`
|
||||||
|
- Failed renewals counter — `rate(certmanager_certificate_renewal_total{condition="Failed"}[1h])`
|
||||||
|
|
||||||
|
**Why useful:** A cert about to expire (or silently failing to renew) is the
|
||||||
|
kind of thing that takes down `*.rogi.casa` HTTPS with no warning. This is a
|
||||||
|
must-have alert/dashboard.
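
The days-remaining column is the expiry timestamp minus the current time, divided by 86400 seconds. A sketch of the same calculation in shell, with `days_remaining` as a hypothetical helper and purely illustrative timestamps:

```shell
#!/usr/bin/env bash
# Hypothetical helper: whole days between "now" and a certificate's expiry,
# both given as Unix timestamps. Mirrors (not_after - time()) / 86400,
# truncated toward zero by the shell's integer division.
days_remaining() {
  local not_after="$1" now="${2:-$(date +%s)}"
  echo $(( (not_after - now) / 86400 ))
}

# Illustrative values: a certificate expiring 30 days after a fixed "now".
now=1700000000
not_after=$(( now + 30 * 86400 ))
echo "days remaining: $(days_remaining "$not_after" "$now")"
```

A negative result means the certificate has already expired.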
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 5. Phoenix (trace store) — `phoenix_*`
|
||||||
|
Already scraped via the `phoenix` Service annotation. Exposes bulk loader
|
||||||
|
ingestion rates, span insertion times, retention sweeper, exceptions.
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Span ingestion rate — `rate(phoenix_bulk_loader_span_insertion_time_seconds_count[5m])`
|
||||||
|
- Span insertion latency p95 — `histogram_quantile(0.95, sum(rate(phoenix_bulk_loader_span_insertion_time_seconds_bucket[5m])) by (le))`
|
||||||
|
- Span exceptions/sec — `rate(phoenix_bulk_loader_span_exceptions_total[5m])`
|
||||||
|
- Retention sweeper last run — `phoenix_retention_sweeper_last_run_seconds`
|
||||||
|
- Last activity timestamp — `phoenix_bulk_loader_last_activity_timestamp_seconds`
|
||||||
|
|
||||||
|
**Why useful:** Phoenix stores the traces that LiteLLM sends through its `arize_phoenix` callback. Tracking
|
||||||
|
ingestion health tells you whether traces are landing.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Infrastructure dashboards (compose from existing metrics)
|
||||||
|
|
||||||
|
### 6. Storage & PVC Health (KSM + kubelet + node-exporter)
|
||||||
|
Cross-source dashboard combining `kube_persistentvolumeclaim_*` (KSM),
|
||||||
|
`kubelet_volume_stats_*` (kubelet), and `node_filesystem_*` (node-exporter).
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- PVC usage % per claim — `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100`
|
||||||
|
- PVC requested vs. capacity — `kube_persistentvolumeclaim_resource_requests_storage_bytes` vs actual
|
||||||
|
- Node disk usage % (all mounts) — `(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100`
|
||||||
|
- Inode usage % per mount — `(1 - node_filesystem_files_free / node_filesystem_files) * 100`
|
||||||
|
- Volume binding status (Bound/Pending) — `kube_persistentvolumeclaim_status_phase`
|
||||||
|
- Top 10 PVCs by usage (table)
|
||||||
|
|
||||||
|
**Why useful:** The `local-path` provisioner fills up node disks. Catching a
|
||||||
|
PVC at 95% before it errors is a lifesaver.
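
As a sketch of how the 95% threshold could become an alert, the rule below reuses the PVC usage expression from the panel list. The group and alert names are assumptions, and it is worth verifying that `kubelet_volume_stats_*` is actually reported for your `local-path` volumes, since not every volume type exposes these stats.

```yaml
# Hypothetical rule file; names and the 10m hold time are assumptions.
groups:
  - name: storage
    rules:
      - alert: PersistentVolumeClaimAlmostFull
        # Same ratio as the "PVC usage % per claim" panel.
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 95
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is above 95% full"
```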
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 7. Workload Health (KSM)
|
||||||
|
Uses kube-state-metrics to show deployment/StatefulSet/CronJob health cluster-wide.
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Deployments with unavailable replicas — `kube_deployment_status_replicas_available < kube_deployment_status_replicas`
|
||||||
|
- Pods not in Running phase by namespace — `sum by (namespace, phase) (kube_pod_status_phase{phase!="Running"} == 1)`
|
||||||
|
- Container restarts (last 1h) — `increase(kube_pod_container_status_restarts_total[1h])`
|
||||||
|
- Pods stuck in CrashLoopBackOff / ImagePullBackOff — `kube_pod_container_status_waiting_reason{reason=~"CrashLoopBackOff|ImagePullBackOff"} == 1`
|
||||||
|
- Job failures — `kube_job_failed`
|
||||||
|
- CronJob schedule heatmap — `kube_cronjob_status_active`
|
||||||
|
- HPA status (if any autoscaled) — `kube_horizontalpodautoscaler_status_current_replicas` vs desired
|
||||||
|
|
||||||
|
**Why useful:** This is the "is anything broken" board. Notice you already have
|
||||||
|
some pods in `ImagePullBackOff` (myorg-assistant) — this dashboard surfaces that.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 8. etcd / Control Plane Health (if exposed)
|
||||||
|
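k3s uses SQLite (via kine) by default on a single server and embedded etcd
when started with `--cluster-init`. etcd metrics exist only in the embedded
etcd case and require exposing the etcd `/metrics` endpoint (in k3s, the
`--etcd-expose-metrics` server flag; on upstream etcd, `--listen-metrics-urls`).
**Requires config change to enable.**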
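
A rough sketch of that config change for k3s, assuming the cluster runs embedded etcd and that the k3s server reads `/etc/rancher/k3s/config.yaml`. The scrape job name and the target placeholder are hypothetical; the port shown is the one k3s is generally documented to use for etcd metrics, so confirm it on your nodes.

```yaml
# /etc/rancher/k3s/config.yaml on each server node (assumes embedded etcd).
# Equivalent to passing --etcd-expose-metrics to `k3s server`; restart k3s after editing.
etcd-expose-metrics: true
---
# Hypothetical extra job for the Prometheus config; replace the placeholder target.
scrape_configs:
  - job_name: etcd
    static_configs:
      - targets: ["<server-node-ip>:2381"]
```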
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Leader changes — `etcd_server_leader_changes_seen_total`
|
||||||
|
- Proposal commits/sec — `rate(etcd_server_proposals_committed_total[5m])`
|
||||||
|
- Proposal failures/sec — `rate(etcd_server_proposals_failed_total[5m])`
|
||||||
|
- DB size — `etcd_mvcc_db_total_size_in_bytes`
|
||||||
|
- RPC latency p99 — `histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[5m])) by (le))` (typically needs etcd `--metrics=extensive`)
|
||||||
|
- Active watchers — `etcd_debugging_mvcc_watcher_total`
|
||||||
|
|
||||||
|
**Why useful:** etcd is the brain of the cluster. Slow commits or a flipping
|
||||||
|
leader indicates control-plane trouble.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## App-service dashboards (require enabling metrics first)
|
||||||
|
|
||||||
|
Most of your apps don't expose `/metrics` yet. Below is the per-service setup
|
||||||
|
plus the dashboard idea once metrics are on. To enable scraping for any of
|
||||||
|
these, annotate the Service with:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
metadata:
|
||||||
|
annotations:
|
||||||
|
prometheus.io/scrape: "true"
|
||||||
|
prometheus.io/port: "<port>"
|
||||||
|
```
|
||||||
|
|
||||||
|
The existing `kubernetes-service-endpoints` scrape job will pick them up
|
||||||
|
automatically — **no Prometheus config edit needed**.
|
||||||
|
|
||||||
|
### 9. LiteLLM (LLM gateway) — needs enabling
|
||||||
|
LiteLLM exposes Prometheus metrics on its API port (`/metrics`). Annotate the
|
||||||
|
`litellm` Service.
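
For illustration, an annotated Service might look like the sketch below. The namespace, selector label, and port `4000` are assumptions, so substitute the values from your own manifest. LiteLLM may also need its Prometheus callback enabled in its own config before `/metrics` returns data, and `prometheus.io/path` only takes effect if the scrape job's relabeling honors it.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: litellm
  namespace: litellm            # assumed namespace
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "4000"  # assumed API port
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: litellm                # assumed pod label
  ports:
    - name: http
      port: 4000
      targetPort: 4000
```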
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Requests/sec by model — `rate(litellm_requests_total[5m])` by `model`
|
||||||
|
- Token usage (prompt/completion/total) — `rate(litellm_total_tokens_total[5m])`
|
||||||
|
- Spend by model — `litellm_spend_total` (if cost tracking enabled)
|
||||||
|
- Latency p95 per model — `histogram_quantile(0.95, ...)`
|
||||||
|
- Error rate by model — `rate(litellm_requests_total{status=~"5.."}[5m])`
|
||||||
|
- Rate-limit / quota hits
|
||||||
|
|
||||||
|
**Why useful:** LiteLLM is the gateway for all your AI apps (open-webui,
|
||||||
|
myorg-assistant, etc.). Token spend + per-model latency is the single best
|
||||||
|
cost/quality lever in the cluster.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 10. Gitea (git + CI) — needs enabling
|
||||||
|
Gitea exposes metrics at `/metrics` when `ENABLED = true` is set in the
`[metrics]` section of `app.ini`. Annotate the `gitea-http` Service (port 3000
inside, 80 via svc).
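
One way to switch this on without editing `app.ini` by hand, assuming the official Gitea container image (which maps `GITEA__section__KEY` environment variables onto `app.ini` settings), is sketched below. The fragments are illustrative only. With the usual endpoints-based relabeling, the port annotation is applied to the pod address, so the container port (3000) is the likelier value than the Service port.

```yaml
# Fragment of the Gitea container spec (Deployment or StatefulSet).
env:
  - name: GITEA__metrics__ENABLED   # sets ENABLED under [metrics] in app.ini
    value: "true"
---
# Fragment of the gitea-http Service metadata.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3000"      # container port, assuming pod-IP scraping
```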
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Git push/clone/fetch rate — `gitea_actions_total` by `action`
|
||||||
|
- Active users / repos / orgs — `gitea_users_total`, `gitea_repos_total`
|
||||||
|
- Issues / PRs open — `gitea_issues_total`, `gitea_pulls_total`
|
||||||
|
- HTTP request rate + latency
|
||||||
|
- Gitea Actions runner job duration — if runner metrics exposed
|
||||||
|
|
||||||
|
**Why useful:** Gitea hosts the cluster's own GitOps repo + CI. Tracking push
|
||||||
|
rate and runner throughput catches CI storms.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 11. Home Assistant — needs enabling
|
||||||
|
HA exposes Prometheus metrics via the `prometheus` integration (add to
|
||||||
|
`configuration.yaml`). Then annotate the Service.
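
A sketch of both halves follows. Home Assistant serves these metrics at `/api/prometheus` and requires authentication by default, so an annotation-only scrape will likely fail unless auth is relaxed or a bearer token is configured in the scrape job. The `requires_auth: false` line reflects my understanding of the integration's options and should be checked against the Home Assistant docs; port `8123` is the default HA port, assumed here.

```yaml
# configuration.yaml (Home Assistant)
prometheus:
  requires_auth: false   # assumption: only acceptable on a trusted cluster network
---
# Fragment of the Home Assistant Service metadata.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8123"            # assumed default HA port
    prometheus.io/path: "/api/prometheus"
```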
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Active entities / sensors by domain
|
||||||
|
- State change events/sec — `homeassistant_entity_states_total`
|
||||||
|
- Automation triggers/sec — `homeassistant_automation_triggered_total`
|
||||||
|
- Integrations loaded + errors
|
||||||
|
- Database size / recorder queue depth
|
||||||
|
- Zigbee/Z-Wave mesh health (if exposed)
|
||||||
|
|
||||||
|
**Why useful:** HA is a home-critical service. Event/sec spikes often indicate
|
||||||
|
sensor flapping or runaway automations.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 12. Jellyfin — limited
|
||||||
|
Jellyfin doesn't ship first-class Prometheus metrics, but you can scrape it
|
||||||
|
via a sidecar (`jellyfin-prometheus-exporter`) or build a blackbox-style
|
||||||
|
dashboard on the `/health` endpoint.
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Active streams — from exporter
|
||||||
|
- Transcode sessions + hw accel usage
|
||||||
|
- Library size by media type
|
||||||
|
- Playback errors
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 13. Pi-hole — needs enabling
|
||||||
|
Pi-hole exposes metrics on its FTL web API; the `pihole-exporter` sidecar
|
||||||
|
converts them to Prometheus format. Add as a sidecar container + annotate.
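
A hedged sketch of such a sidecar, assuming the community `ekofr/pihole-exporter` image and its usual environment variables and port. The Secret name and key are hypothetical, and the exporter's options differ between Pi-hole API versions, so treat the variable names as something to confirm in the exporter's README.

```yaml
# Extra container in the Pi-hole pod spec (all names below are assumptions).
- name: pihole-exporter
  image: ekofr/pihole-exporter:latest
  env:
    - name: PIHOLE_HOSTNAME
      value: "127.0.0.1"          # same pod, so localhost reaches Pi-hole
    - name: PIHOLE_PASSWORD
      valueFrom:
        secretKeyRef:
          name: pihole-secret     # hypothetical Secret
          key: password
  ports:
    - name: metrics
      containerPort: 9617         # exporter's usual listen port
```

The Service annotation would then use `prometheus.io/port: "9617"`.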
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- DNS queries/sec (total, blocked, cached, forwarded)
|
||||||
|
- Block list size
|
||||||
|
- Top blocked domains
|
||||||
|
- Top permitted domains
|
||||||
|
- Clients by query volume
|
||||||
|
- Cache hit ratio
|
||||||
|
|
||||||
|
**Why useful:** Pi-hole is your network-wide adblock. Block rate + cache ratio
|
||||||
|
are the headline metrics, and query spikes reveal misbehaving clients.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 14. PostgreSQL (litellm + phoenix + n8n) — needs enabling
|
||||||
|
You have two Postgres instances (`postgres` in `litellm` and `phoenix`).
|
||||||
|
Add `prometheus-postgres-exporter` as a sidecar or Deployment per DB.
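
As a sketch of the sidecar variant, the fragment below uses the Prometheus community `postgres-exporter` image, which reads its connection string from `DATA_SOURCE_NAME` and listens on port 9187 by default. The Secret name and key are hypothetical placeholders.

```yaml
# Extra container in each Postgres pod spec (Secret name and key are assumptions).
- name: postgres-exporter
  image: quay.io/prometheuscommunity/postgres-exporter:latest
  env:
    - name: DATA_SOURCE_NAME
      valueFrom:
        secretKeyRef:
          name: postgres-exporter-dsn   # hypothetical Secret holding a postgresql:// URI
          key: dsn
  ports:
    - name: metrics
      containerPort: 9187
```

Running it as a sidecar keeps the connection on localhost; a separate Deployment per DB is the alternative if you would rather not touch the database pods.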
|
||||||
|
|
||||||
|
**Panels (per DB):**
|
||||||
|
- Connections (active / idle / max) — `pg_stat_activity_count`
|
||||||
|
- Transactions/sec — `rate(pg_stat_database_xact_commit[5m])`
|
||||||
|
- Cache hit ratio — `pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read)`
|
||||||
|
- Table + index bloat
|
||||||
|
- Replication lag (if replicas)
|
||||||
|
- Slow queries (if `pg_stat_statements` enabled)
|
||||||
|
- DB size growth — `pg_database_size_bytes`
|
||||||
|
|
||||||
|
**Why useful:** DB connection exhaustion and cache ratio collapse are the two
|
||||||
|
most common causes of slow app performance.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 15. Minecraft — limited
|
||||||
|
The Minecraft server exposes metrics via RCON + an exporter
|
||||||
|
(`minecraft-exporter`). Add as sidecar using the existing `RCON_PASSWORD`.
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Players online — `minecraft_players_online`
|
||||||
|
- TPS (ticks per second) — `minecraft_tps` (server health)
|
||||||
|
- Entities loaded — `minecraft_entities_total`
|
||||||
|
- Chunk count — `minecraft_chunks_loaded`
|
||||||
|
- Memory used by JVM
|
||||||
|
|
||||||
|
**Why useful:** TPS < 20 means lag. Player count vs. server load is the only
|
||||||
|
real signal a Minecraft server needs.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 16. qBittorrent — limited
|
||||||
|
No native metrics. Options: a `qbittorrent-exporter` sidecar (uses the WebUI
|
||||||
|
API), or a blackbox probe on the WebUI.
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Download/upload speed
|
||||||
|
- Active torrents
|
||||||
|
- Torrent count by state (downloading/seeding/paused)
|
||||||
|
- Disk usage in download dir
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Cluster meta dashboards
|
||||||
|
|
||||||
|
### 17. Network Topology / Service Map
|
||||||
|
Composite view: for each namespace, list services, their pods, scrape status,
|
||||||
|
and request volume (from Traefik logs + cAdvisor network). A "what talks to
|
||||||
|
what" overview.
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Service → pod → container resource table
|
||||||
|
- Cross-namespace network flows (if network policy logging enabled)
|
||||||
|
- Scrape health matrix (every target up/down)
|
||||||
|
- Ingress route → backend service map
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 18. Backup / Snapshot Status
|
||||||
|
If you take Velero snapshots or local-path snapshots, build a dashboard on
|
||||||
|
`velero_*` or CRD status. **Requires Velero.**
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Last successful backup per namespace
|
||||||
|
- Failed backups
|
||||||
|
- Backup size growth
|
||||||
|
- Restore test status
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 19. Cost / Capacity Planning
|
||||||
|
Composite: per-namespace CPU/memory requests vs. actual usage, projected
|
||||||
|
growth, node saturation forecast.
|
||||||
|
|
||||||
|
**Panels:**
|
||||||
|
- Requests vs. limits vs. actual (per namespace) — KSM + cAdvisor
|
||||||
|
- Node capacity vs. allocatable
|
||||||
|
- PVC growth trend + 30-day forecast
|
||||||
|
- "What if I removed node X" simulation (capacity headroom)
|
||||||
|
|
||||||
|
**Why useful:** Tells you when you'll need another node before you hit the wall.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recommended priority order
|
||||||
|
|
||||||
|
If you only build a few, do them in this order (highest value-to-effort first):
|
||||||
|
|
||||||
|
1. **Traefik Ingress** (#1) — already scraped, your front door
|
||||||
|
2. **Storage & PVC Health** (#6) — local-path fills disks; high blast radius
|
||||||
|
3. **Workload Health** (#7) — surfaces CrashLoopBackOff / ImagePullBackOff
|
||||||
|
4. **cert-manager** (#4) — prevents silent cert expiry outages
|
||||||
|
5. **CoreDNS** (#2) — early warning for DNS cascades
|
||||||
|
6. **LiteLLM** (#9) — needs `prometheus.io/scrape` annotation only; big insights
|
||||||
|
7. **MetalLB** (#3) — small but catches LoadBalancer IP exhaustion
|
||||||
|
|
||||||
|
Items 8–19 are nice-to-have or require additional exporters/config.
|
||||||
331
monitoring/grafana-dashboard-cluster-overview.yaml
Normal file
@@ -0,0 +1,331 @@
|
|||||||
|
apiVersion: v1
|
||||||
|
kind: ConfigMap
|
||||||
|
metadata:
|
||||||
|
name: grafana-dashboard-cluster-overview
|
||||||
|
namespace: monitoring
|
||||||
|
labels:
|
||||||
|
app: grafana
|
||||||
|
grafana_dashboard: "1"
|
||||||
|
data:
|
||||||
|
cluster-overview.json: |
|
||||||
|
{
|
||||||
|
"annotations": {"list": []},
|
||||||
|
"editable": true,
|
||||||
|
"fiscalYearStartMonth": 0,
|
||||||
|
"graphTooltip": 1,
|
||||||
|
"id": null,
|
||||||
|
"links": [],
|
||||||
|
"liveNow": false,
|
||||||
|
"panels": [
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"thresholds": {
|
||||||
|
"mode": "absolute",
|
||||||
|
"steps": [
|
||||||
|
{"color": "green", "value": null}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"unit": "s"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 5, "w": 4, "x": 0, "y": 0},
|
||||||
|
"id": 1,
|
||||||
|
"options": {
|
||||||
|
"colorMode": "value",
|
||||||
|
"graphMode": "area",
|
||||||
|
"justifyMode": "auto",
|
||||||
|
"orientation": "auto",
|
||||||
|
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
|
||||||
|
"textMode": "auto"
|
||||||
|
},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "time() - max(process_start_time_seconds{job=\"prometheus\"})", "refId": "A"}],
|
||||||
|
"title": "Prometheus Uptime",
|
||||||
|
"type": "stat"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"thresholds": {
|
||||||
|
"mode": "absolute",
|
||||||
|
"steps": [
|
||||||
|
{"color": "red", "value": null},
|
||||||
|
{"color": "green", "value": 1}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 5, "w": 4, "x": 4, "y": 0},
|
||||||
|
"id": 2,
|
||||||
|
"options": {
|
||||||
|
"colorMode": "background",
|
||||||
|
"graphMode": "none",
|
||||||
|
"justifyMode": "center",
|
||||||
|
"orientation": "horizontal",
|
||||||
|
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
|
||||||
|
"textMode": "value_and_name"
|
||||||
|
},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(kubelet_running_pods)", "refId": "A"}],
|
||||||
|
"title": "Running Pods (total)",
|
||||||
|
"type": "stat"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"thresholds": {
|
||||||
|
"mode": "absolute",
|
||||||
|
"steps": [
|
||||||
|
{"color": "green", "value": null}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 5, "w": 4, "x": 8, "y": 0},
|
||||||
|
"id": 3,
|
||||||
|
"options": {
|
||||||
|
"colorMode": "background",
|
||||||
|
"graphMode": "none",
|
||||||
|
"justifyMode": "center",
|
||||||
|
"orientation": "horizontal",
|
||||||
|
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
|
||||||
|
"textMode": "value_and_name"
|
||||||
|
},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(kubelet_running_containers)", "refId": "A"}],
|
||||||
|
"title": "Running Containers",
|
||||||
|
"type": "stat"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"mappings": [
|
||||||
|
{"options": {"0": {"text": "Down", "color": "red"}, "1": {"text": "Up", "color": "green"}}, "type": "value"}
|
||||||
|
],
|
||||||
|
"thresholds": {
|
||||||
|
"mode": "absolute",
|
||||||
|
"steps": [
|
||||||
|
{"color": "red", "value": null},
|
||||||
|
{"color": "green", "value": 1}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 5, "w": 12, "x": 12, "y": 0},
|
||||||
|
"id": 4,
|
||||||
|
"options": {
|
||||||
|
"colorMode": "background",
|
||||||
|
"graphMode": "none",
|
||||||
|
"justifyMode": "center",
|
||||||
|
"orientation": "horizontal",
|
||||||
|
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
|
||||||
|
"textMode": "value_and_name"
|
||||||
|
},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "up{job=\"kubernetes-apiservers\"}", "refId": "A"}, {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "up{job=\"kubernetes-nodes\"}", "refId": "B"}],
|
||||||
|
"title": "Control Plane & Node Exporters",
|
||||||
|
"type": "stat"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"axisCenteredZero": false,
|
||||||
|
"axisColorMode": "text",
|
||||||
|
"axisLabel": "",
|
||||||
|
"axisPlacement": "auto",
|
||||||
|
"barAlignment": 0,
|
||||||
|
"drawStyle": "line",
|
||||||
|
"fillOpacity": 10,
|
||||||
|
"gradientMode": "none",
|
||||||
|
"hideFrom": {"legend": false, "tooltip": false, "viz": false},
|
||||||
|
"insertNulls": false,
|
||||||
|
"lineInterpolation": "linear",
|
||||||
|
"lineWidth": 1,
|
||||||
|
"pointSize": 5,
|
||||||
|
"scaleDistribution": {"type": "linear"},
|
||||||
|
"showPoints": "never",
|
||||||
|
"spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"},
|
||||||
|
"thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "bytes"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 5},
|
||||||
|
"id": 10,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(container_memory_working_set_bytes{container!=\"\",container!=\"POD\"}) by (namespace)", "legendFormat": "{{namespace}}", "refId": "A"}],
|
||||||
|
"title": "Memory Usage by Namespace",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "core"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 5},
|
||||||
|
"id": 11,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\",container!=\"POD\"}[5m])) by (namespace)", "legendFormat": "{{namespace}}", "refId": "A"}],
|
||||||
|
"title": "CPU Usage by Namespace",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "Bps"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 14},
|
||||||
|
"id": 12,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_network_receive_bytes_total[5m])) by (namespace)", "legendFormat": "RX {{namespace}}", "refId": "A"},
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_network_transmit_bytes_total[5m])) by (namespace)", "legendFormat": "TX {{namespace}}", "refId": "B"}
|
||||||
|
],
|
||||||
|
"title": "Network RX/TX by Namespace",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "decbytes"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 14},
|
||||||
|
"id": 13,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(container_fs_usage_bytes) by (instance)", "legendFormat": "{{instance}}", "refId": "A"}],
|
||||||
|
"title": "Filesystem Usage by Node",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 24, "x": 0, "y": 23},
|
||||||
|
"id": 20,
|
||||||
|
"options": {
|
||||||
|
"showHeader": true,
|
||||||
|
"cellHeight": "sm",
|
||||||
|
"footer": {"show": false, "reducer": ["sum"], "countRows": false, "fields": ""}
|
||||||
|
},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sort_desc(sum(container_memory_working_set_bytes{container!=\"\",container!=\"POD\"}) by (namespace,pod))", "format": "table", "instant": true, "refId": "A"}],
|
||||||
|
"title": "Pods by Memory (live)",
|
||||||
|
"type": "table",
|
||||||
|
"transformations": [
|
||||||
|
{"id": "organize", "options": {"excludeByName": {"Time": true}, "renameByName": {"Value": "Memory (bytes)"}}}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "orange", "value": 1}, {"color": "red", "value": 5}]},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 24, "x": 0, "y": 32},
|
||||||
|
"id": 30,
|
||||||
|
"options": {
|
||||||
|
"showHeader": true,
|
||||||
|
"cellHeight": "sm",
|
||||||
|
"footer": {"show": false, "reducer": ["sum"], "countRows": false, "fields": ""}
|
||||||
|
},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(kube_pod_status_phase{phase=\"Running\"}) by (namespace)", "format": "table", "instant": true, "refId": "A"},
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(kube_pod_status_phase{phase=\"Pending\"}) by (namespace)", "format": "table", "instant": true, "refId": "B"},
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(kube_pod_status_phase{phase=\"Failed\"}) by (namespace)", "format": "table", "instant": true, "refId": "C"},
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace)", "format": "table", "instant": true, "refId": "D"}
|
||||||
|
],
|
||||||
|
"title": "Pod Health by Namespace (KSM)",
|
||||||
|
"type": "table",
|
||||||
|
"transformations": [
|
||||||
|
{"id": "merge", "options": {}},
|
||||||
|
{"id": "groupBy", "options": {"fields": {"Value #A": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #B": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #C": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #D": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "namespace": {"aggregations": [], "operation": "groupby"}}}},
{"id": "organize", "options": {"excludeByName": {"Time": true}, "renameByName": {"Value #A": "Running", "Value #B": "Pending", "Value #C": "Failed", "Value #D": "Restarts (1h)"}}}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"refresh": "30s",
|
||||||
|
"schemaVersion": 38,
|
||||||
|
"style": "dark",
|
||||||
|
"tags": ["k3s", "overview"],
|
||||||
|
"templating": {"list": []},
|
||||||
|
"time": {"from": "now-6h", "to": "now"},
|
||||||
|
"timepicker": {},
|
||||||
|
"timezone": "",
|
||||||
|
"title": "Cluster Overview",
|
||||||
|
"uid": "k3s-cluster-overview",
|
||||||
|
"version": 2,
|
||||||
|
"weekStart": ""
|
||||||
|
}
|
||||||
209
monitoring/grafana-dashboard-control-plane.yaml
Normal file
@@ -0,0 +1,209 @@
|
|||||||
|
apiVersion: v1
|
||||||
|
kind: ConfigMap
|
||||||
|
metadata:
|
||||||
|
name: grafana-dashboard-control-plane
|
||||||
|
namespace: monitoring
|
||||||
|
labels:
|
||||||
|
app: grafana
|
||||||
|
grafana_dashboard: "1"
|
||||||
|
data:
|
||||||
|
control-plane.json: |
|
||||||
|
{
|
||||||
|
"annotations": {"list": []},
|
||||||
|
"editable": true,
|
||||||
|
"graphTooltip": 1,
|
||||||
|
"id": null,
|
||||||
|
"links": [],
|
||||||
|
"liveNow": false,
|
||||||
|
"panels": [
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "reqps"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 0},
|
||||||
|
"id": 1,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(apiserver_request_total[5m])) by (verb)", "legendFormat": "{{verb}}", "refId": "A"}],
|
||||||
|
"title": "API Server Requests by Verb",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "s"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 0},
|
||||||
|
"id": 2,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "histogram_quantile(0.95, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))", "legendFormat": "p95 {{verb}}", "refId": "A"}],
|
||||||
|
"title": "API Server Request Latency p95",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "ops"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 9},
|
||||||
|
"id": 3,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(apiserver_request_total{code=~\"5..\"}[5m])) by (verb)", "legendFormat": "{{verb}}", "refId": "A"}],
|
||||||
|
"title": "API Server 5xx Errors",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "s"
},
"overrides": []
},
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 9},
"id": 4,
"options": {
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "histogram_quantile(0.95, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))", "legendFormat": "pod start p95", "refId": "A"}],
|
||||||
|
"title": "Kubelet Pod Start Latency",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "s"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 18},
|
||||||
|
"id": 5,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "histogram_quantile(0.95, sum(rate(kubelet_cgroup_manager_duration_seconds_bucket[5m])) by (le, instance))", "legendFormat": "{{instance}}", "refId": "A"}],
|
||||||
|
"title": "Kubelet Cgroup Manager Duration p95",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 18},
|
||||||
|
"id": 6,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "rate(kubelet_pleg_relist_duration_seconds_count[5m])", "legendFormat": "relists/s {{instance}}", "refId": "A"}],
|
||||||
|
"title": "Kubelet PLEG Relist Rate",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"mappings": [
|
||||||
|
{"options": {"0": {"text": "Down", "color": "red"}, "1": {"text": "Up", "color": "green"}}, "type": "value"}
|
||||||
|
],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}, {"color": "green", "value": 1}]}
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 6, "w": 24, "x": 0, "y": 27},
|
||||||
|
"id": 7,
|
||||||
|
"options": {
|
||||||
|
"colorMode": "background",
|
||||||
|
"graphMode": "none",
|
||||||
|
"justifyMode": "center",
|
||||||
|
"orientation": "horizontal",
|
||||||
|
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
|
||||||
|
"textMode": "value_and_name"
|
||||||
|
},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "up", "refId": "A"}],
|
||||||
|
"title": "All Scrape Targets Status",
|
||||||
|
"type": "stat"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"refresh": "30s",
|
||||||
|
"schemaVersion": 38,
|
||||||
|
"style": "dark",
|
||||||
|
"tags": ["k3s", "control-plane"],
|
||||||
|
"templating": {"list": []},
|
||||||
|
"time": {"from": "now-6h", "to": "now"},
|
||||||
|
"timepicker": {},
|
||||||
|
"timezone": "",
|
||||||
|
"title": "Control Plane & API Server",
|
||||||
|
"uid": "k3s-control-plane",
|
||||||
|
"version": 1,
|
||||||
|
"weekStart": ""
|
||||||
|
}
|
||||||
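Several panels in the dashboard above chart `histogram_quantile(0.95, ...)` over `_bucket` series. As a rough illustration of what that function computes, here is a minimal Python sketch of linear interpolation over cumulative buckets. The function name and the sample bucket counts are hypothetical, and real PromQL handles more edge cases (NaN samples, non-positive bounds, native histograms) than this simplified model.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified model of classic-histogram interpolation: the observations
    inside a bucket are assumed to be spread evenly between its bounds.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if math.isinf(upper):
        # The quantile falls beyond the last finite bound; report that bound.
        return buckets[-2][0]

    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev_count = buckets[index - 1][1] if index > 0 else 0.0
    in_bucket = count - prev_count
    if in_bucket == 0:
        return lower
    return lower + (upper - lower) * (rank - prev_count) / in_bucket


# Hypothetical cumulative counts for le = 0.1, 0.5, 1.0 and +Inf.
sample = [(0.1, 10), (0.5, 30), (1.0, 40), (math.inf, 40)]
print(histogram_quantile(0.95, sample))
```

The estimate can never be finer than the bucket boundaries, which is worth keeping in mind when reading the p95 panels.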
279
monitoring/grafana-dashboard-nodes.yaml
Normal file
@@ -0,0 +1,279 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-nodes
  namespace: monitoring
  labels:
    app: grafana
    grafana_dashboard: "1"
data:
  nodes.json: |
|
||||||
|
{
|
||||||
|
"annotations": {"list": []},
|
||||||
|
"editable": true,
|
||||||
|
"graphTooltip": 1,
|
||||||
|
"id": null,
|
||||||
|
"links": [],
|
||||||
|
"liveNow": false,
|
||||||
|
"panels": [
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 6, "w": 6, "x": 0, "y": 0},
|
||||||
|
"id": 1,
|
||||||
|
"options": {
|
||||||
|
"colorMode": "value",
|
||||||
|
"graphMode": "area",
|
||||||
|
"justifyMode": "auto",
|
||||||
|
"orientation": "auto",
|
||||||
|
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
|
||||||
|
"textMode": "value_and_name"
|
||||||
|
},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "kubelet_running_pods", "refId": "A"}, {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "kubelet_running_containers", "refId": "B"}],
|
||||||
|
"title": "Pods / Containers per Node",
|
||||||
|
"type": "stat"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "orange", "value": 70}, {"color": "red", "value": 90}]},
|
||||||
|
"unit": "percent",
|
||||||
|
"min": 0,
|
||||||
|
"max": 100
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 6, "w": 18, "x": 6, "y": 0},
|
||||||
|
"id": 2,
|
||||||
|
"options": {
|
||||||
|
"colorMode": "background",
|
||||||
|
"graphMode": "area",
|
||||||
|
"justifyMode": "auto",
|
||||||
|
"orientation": "horizontal",
|
||||||
|
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
|
||||||
|
"textMode": "value_and_name"
|
||||||
|
},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)", "legendFormat": "{{instance}}", "refId": "A"}],
|
||||||
|
"title": "Node CPU Usage %",
|
||||||
|
"type": "stat"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "percent",
|
||||||
|
"min": 0,
|
||||||
|
"max": 100
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 6},
|
||||||
|
"id": 3,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)", "legendFormat": "{{instance}}", "refId": "A"}],
|
||||||
|
"title": "Node CPU Usage % (over time)",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "percent",
|
||||||
|
"min": 0,
|
||||||
|
"max": 100
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 6},
|
||||||
|
"id": 4,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100", "legendFormat": "{{instance}}", "refId": "A"}],
|
||||||
|
"title": "Node Memory Usage %",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "bytes"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 15},
|
||||||
|
"id": 5,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(container_memory_working_set_bytes{container!=\"\",container!=\"POD\"}) by (instance)", "legendFormat": "used {{instance}}", "refId": "A"}, {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "node_memory_MemTotal_bytes", "legendFormat": "total {{instance}}", "refId": "B"}],
|
||||||
|
"title": "Node Memory (used vs total)",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "Bps"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 15},
|
||||||
|
"id": 6,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (instance) (rate(node_network_receive_bytes_total{device!~\"lo|veth.*|docker.*|br-.*|cni.*|flannel.*\"}[5m]))", "legendFormat": "RX {{instance}}", "refId": "A"},
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (instance) (rate(node_network_transmit_bytes_total{device!~\"lo|veth.*|docker.*|br-.*|cni.*|flannel.*\"}[5m]))", "legendFormat": "TX {{instance}}", "refId": "B"}
|
||||||
|
],
|
||||||
|
"title": "Node Network Traffic",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "percent",
|
||||||
|
"min": 0,
|
||||||
|
"max": 100
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 24},
|
||||||
|
"id": 7,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "(1 - (node_filesystem_avail_bytes{fstype!~\"tmpfs|overlay|squashfs\"} / node_filesystem_size_bytes{fstype!~\"tmpfs|overlay|squashfs\"})) * 100", "legendFormat": "{{instance}} {{mountpoint}}", "refId": "A"}],
|
||||||
|
"title": "Node Disk Usage %",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 24},
|
||||||
|
"id": 8,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "node_load1", "legendFormat": "1m {{instance}}", "refId": "A"},
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "node_load5", "legendFormat": "5m {{instance}}", "refId": "B"},
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "node_load15", "legendFormat": "15m {{instance}}", "refId": "C"}
|
||||||
|
],
|
||||||
|
"title": "Node Load Average",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 24, "x": 0, "y": 33},
|
||||||
|
"id": 9,
|
||||||
|
"options": {
|
||||||
|
"showHeader": true,
|
||||||
|
"cellHeight": "sm",
|
||||||
|
"footer": {"show": false, "reducer": ["sum"], "countRows": false, "fields": ""}
|
||||||
|
},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "kubelet_running_pods", "format": "table", "instant": true, "refId": "A"},
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "kubelet_running_containers", "format": "table", "instant": true, "refId": "B"},
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)", "format": "table", "instant": true, "refId": "C"},
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100", "format": "table", "instant": true, "refId": "D"}
|
||||||
|
],
|
||||||
|
"title": "Node Summary (live)",
|
||||||
|
"type": "table",
|
||||||
|
"transformations": [
|
||||||
|
{"id": "merge", "options": {}},
|
||||||
|
{"id": "groupBy", "options": {"fields": {"Value #A": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #B": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #C": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #D": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "instance": {"aggregations": [], "operation": "groupby"}}}},
{"id": "organize", "options": {"excludeByName": {"Time": true}, "renameByName": {"Value #A": "Pods", "Value #B": "Containers", "Value #C": "CPU %", "Value #D": "Memory %"}}}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"refresh": "30s",
|
||||||
|
"schemaVersion": 38,
|
||||||
|
"style": "dark",
|
||||||
|
"tags": ["k3s", "nodes"],
|
||||||
|
"templating": {"list": []},
|
||||||
|
"time": {"from": "now-6h", "to": "now"},
|
||||||
|
"timepicker": {},
|
||||||
|
"timezone": "",
|
||||||
|
"title": "Nodes",
|
||||||
|
"uid": "k3s-nodes",
|
||||||
|
"version": 2,
|
||||||
|
"weekStart": ""
|
||||||
|
}
|
||||||
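Each panel above is placed with a `gridPos` object (`x`, `y`, `w`, `h`) on Grafana's 24-column grid. The following is a minimal sketch, not part of the repository, of how such a layout could be sanity-checked before committing. The helper names are hypothetical, and the sample data copies the `gridPos` values of the first four panels of this dashboard.

```python
from itertools import combinations

GRID_COLUMNS = 24  # Grafana lays panels out on a 24-column grid


def rectangles_overlap(a, b):
    """Return True when two gridPos rectangles share any cell."""
    return (
        a["x"] < b["x"] + b["w"]
        and b["x"] < a["x"] + a["w"]
        and a["y"] < b["y"] + b["h"]
        and b["y"] < a["y"] + a["h"]
    )


def check_layout(panels):
    """Collect human-readable layout problems for a list of panels."""
    problems = []
    ids = [panel["id"] for panel in panels]
    if len(ids) != len(set(ids)):
        problems.append("duplicate panel ids")
    for panel in panels:
        pos = panel["gridPos"]
        if pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f"panel {panel['id']} exceeds the grid width")
    for first, second in combinations(panels, 2):
        if rectangles_overlap(first["gridPos"], second["gridPos"]):
            problems.append(f"panels {first['id']} and {second['id']} overlap")
    return problems


# gridPos values taken from panels 1 to 4 of the nodes dashboard.
nodes_panels = [
    {"id": 1, "gridPos": {"h": 6, "w": 6, "x": 0, "y": 0}},
    {"id": 2, "gridPos": {"h": 6, "w": 18, "x": 6, "y": 0}},
    {"id": 3, "gridPos": {"h": 9, "w": 12, "x": 0, "y": 6}},
    {"id": 4, "gridPos": {"h": 9, "w": 12, "x": 12, "y": 6}},
]
print(check_layout(nodes_panels))
```

In practice the panel list would be read from the `panels` array of the JSON embedded in the ConfigMap rather than typed by hand.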
312
monitoring/grafana-dashboard-pods.yaml
Normal file
@@ -0,0 +1,312 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-pods
  namespace: monitoring
  labels:
    app: grafana
    grafana_dashboard: "1"
data:
  pods.json: |
|
||||||
|
{
|
||||||
|
"annotations": {"list": []},
|
||||||
|
"editable": true,
|
||||||
|
"graphTooltip": 1,
|
||||||
|
"id": null,
|
||||||
|
"links": [],
|
||||||
|
"liveNow": false,
|
||||||
|
"panels": [
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "normal"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "core"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 24, "x": 0, "y": 0},
|
||||||
|
"id": 1,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\",container!=\"POD\",namespace=~\"$namespace\"}[5m])) by (pod)", "legendFormat": "{{pod}}", "refId": "A"}],
|
||||||
|
"title": "CPU Usage per Pod",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "normal"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "bytes"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 24, "x": 0, "y": 9},
|
||||||
|
"id": 2,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(container_memory_working_set_bytes{container!=\"\",container!=\"POD\",namespace=~\"$namespace\"}) by (pod)", "legendFormat": "{{pod}}", "refId": "A"}],
|
||||||
|
"title": "Memory Usage per Pod",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "Bps"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 18},
|
||||||
|
"id": 3,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_network_receive_bytes_total{namespace=~\"$namespace\"}[5m])) by (pod)", "legendFormat": "RX {{pod}}", "refId": "A"}
|
||||||
|
],
|
||||||
|
"title": "Network RX per Pod",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "Bps"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 18},
|
||||||
|
"id": 4,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_network_transmit_bytes_total{namespace=~\"$namespace\"}[5m])) by (pod)", "legendFormat": "TX {{pod}}", "refId": "A"}
|
||||||
|
],
|
||||||
|
"title": "Network TX per Pod",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "bytes"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 27},
|
||||||
|
"id": 5,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(container_fs_usage_bytes{namespace=~\"$namespace\"}) by (pod)", "legendFormat": "{{pod}}", "refId": "A"}],
|
||||||
|
"title": "Filesystem Usage per Pod",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "percent"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 27},
|
||||||
|
"id": 6,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum(rate(container_cpu_cfs_throttled_periods_total{namespace=~\"$namespace\"}[5m])) by (pod) / sum(rate(container_cpu_cfs_periods_total{namespace=~\"$namespace\"}[5m])) by (pod) * 100", "legendFormat": "{{pod}}", "refId": "A"}],
|
||||||
|
"title": "CPU Throttling % per Pod",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 1}, {"color": "red", "value": 5}]},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 36},
|
||||||
|
"id": 7,
|
||||||
|
"options": {
|
||||||
|
"showHeader": true,
|
||||||
|
"cellHeight": "sm",
|
||||||
|
"footer": {"show": false, "reducer": ["sum"], "countRows": false, "fields": ""}
|
||||||
|
},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (namespace, pod) (container_memory_working_set_bytes{container!=\"\",container!=\"POD\",namespace=~\"$namespace\"})", "format": "table", "instant": true, "refId": "A"},
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=\"\",container!=\"POD\",namespace=~\"$namespace\"}[5m]))", "format": "table", "instant": true, "refId": "B"},
|
||||||
|
{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (namespace, pod) (rate(container_network_receive_bytes_total{namespace=~\"$namespace\"}[5m]))", "format": "table", "instant": true, "refId": "C"}
|
||||||
|
],
|
||||||
|
"title": "Pod Resource Summary (live)",
|
||||||
|
"type": "table",
|
||||||
|
"transformations": [
|
||||||
|
{"id": "merge", "options": {}},
|
||||||
|
{"id": "groupBy", "options": {"fields": {"Value #A": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #B": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "Value #C": {"aggregations": ["lastNotNull"], "operation": "aggregate"}, "namespace": {"aggregations": [], "operation": "groupby"}, "pod": {"aggregations": [], "operation": "groupby"}}}},
{"id": "organize", "options": {"excludeByName": {"Time": true}, "renameByName": {"Value #A": "Memory (bytes)", "Value #B": "CPU (cores)", "Value #C": "Network RX (Bps)"}}}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 46},
|
||||||
|
"id": 8,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (namespace, phase) (kube_pod_status_phase{phase=~\"Running|Pending|Failed\",namespace=~\"$namespace\"})", "legendFormat": "{{namespace}} {{phase}}", "refId": "A"}],
|
||||||
|
"title": "Pod Status by Namespace (KSM)",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 46},
|
||||||
|
"id": 9,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "sum by (namespace) (increase(kube_pod_container_status_restarts_total{namespace=~\"$namespace\"}[1h]))", "legendFormat": "{{namespace}}", "refId": "A"}],
|
||||||
|
"title": "Container Restarts (last 1h)",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {
|
||||||
|
"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true,
|
||||||
|
"stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}
|
||||||
|
},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "bytes"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 24, "x": 0, "y": 55},
|
||||||
|
"id": 10,
|
||||||
|
"options": {
|
||||||
|
"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true},
|
||||||
|
"tooltip": {"mode": "multi", "sort": "desc"}
|
||||||
|
},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace=~\"$namespace\"}", "legendFormat": "{{namespace}}/{{persistentvolumeclaim}}", "refId": "A"}],
|
||||||
|
"title": "PVC Storage Requests by Claim (KSM)",
|
||||||
|
"type": "timeseries"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"refresh": "30s",
|
||||||
|
"schemaVersion": 38,
|
||||||
|
"style": "dark",
|
||||||
|
"tags": ["k3s", "pods"],
|
||||||
|
"templating": {
|
||||||
|
"list": [
|
||||||
|
{
|
||||||
|
"allValue": ".*",
|
||||||
|
"current": {"selected": true, "text": "All", "value": "$__all"},
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"definition": "label_values(container_cpu_usage_seconds_total, namespace)",
|
||||||
|
"hide": 0,
|
||||||
|
"includeAll": true,
|
||||||
|
"multi": true,
|
||||||
|
"name": "namespace",
|
||||||
|
"options": [],
|
||||||
|
"query": "label_values(container_cpu_usage_seconds_total, namespace)",
|
||||||
|
"refresh": 2,
|
||||||
|
"regex": "",
|
||||||
|
"skipUrlSync": false,
|
||||||
|
"sort": 1,
|
||||||
|
"type": "query"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"time": {"from": "now-6h", "to": "now"},
|
||||||
|
"timepicker": {},
|
||||||
|
"timezone": "",
|
||||||
|
"title": "Pods & Services",
|
||||||
|
"uid": "k3s-pods",
|
||||||
|
"version": 2,
|
||||||
|
"weekStart": ""
|
||||||
|
}
|
||||||
218
monitoring/grafana-dashboard-prometheus.yaml
Normal file
@@ -0,0 +1,218 @@
|
|||||||
|
apiVersion: v1
|
||||||
|
kind: ConfigMap
|
||||||
|
metadata:
|
||||||
|
name: grafana-dashboard-prometheus
|
||||||
|
namespace: monitoring
|
||||||
|
labels:
|
||||||
|
app: grafana
|
||||||
|
grafana_dashboard: "1"
|
||||||
|
data:
|
||||||
|
prometheus.json: |
|
||||||
|
{
|
||||||
|
"annotations": {"list": []},
|
||||||
|
"editable": true,
|
||||||
|
"graphTooltip": 1,
|
||||||
|
"id": null,
|
||||||
|
"links": [],
|
||||||
|
"liveNow": false,
|
||||||
|
"panels": [
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}, {"color": "green", "value": 1}]},
|
||||||
|
"mappings": [{"options": {"0": {"text": "DOWN", "color": "red"}, "1": {"text": "UP", "color": "green"}}, "type": "value"}]
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 5, "w": 6, "x": 0, "y": 0},
|
||||||
|
"id": 1,
|
||||||
|
"options": {"colorMode": "background", "graphMode": "none", "justifyMode": "center", "orientation": "horizontal", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "textMode": "value"},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "up{job=\"prometheus\"}", "refId": "A"}],
|
||||||
|
"title": "Prometheus Status",
|
||||||
|
"type": "stat"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "bytes"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 5, "w": 6, "x": 6, "y": 0},
|
||||||
|
"id": 2,
|
||||||
|
"options": {"colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "textMode": "auto"},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "process_resident_memory_bytes{job=\"prometheus\"}", "refId": "A"}],
|
||||||
|
"title": "Prometheus RSS Memory",
|
||||||
|
"type": "stat"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 5, "w": 6, "x": 12, "y": 0},
|
||||||
|
"id": 3,
|
||||||
|
"options": {"colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "textMode": "auto"},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "prometheus_tsdb_head_series", "refId": "A"}],
|
||||||
|
"title": "Active Series",
|
||||||
|
"type": "stat"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "thresholds"},
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 5, "w": 6, "x": 18, "y": 0},
|
||||||
|
"id": 4,
|
||||||
|
"options": {"colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "textMode": "auto"},
|
||||||
|
"pluginVersion": "10.2.3",
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "count(up)", "refId": "A"}],
|
||||||
|
"title": "Scrape Targets",
|
||||||
|
"type": "stat"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "bytes"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 5},
|
||||||
|
"id": 10,
|
||||||
|
"options": {"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "process_resident_memory_bytes{job=\"prometheus\"}", "legendFormat": "RSS", "refId": "A"}, {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "prometheus_tsdb_head_memory_postings_total", "legendFormat": "postings", "refId": "B"}],
|
||||||
|
"title": "Prometheus Memory",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "core"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 5},
|
||||||
|
"id": 11,
|
||||||
|
"options": {"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "rate(process_cpu_seconds_total{job=\"prometheus\"}[5m])", "legendFormat": "prometheus", "refId": "A"}],
|
||||||
|
"title": "Prometheus CPU",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "short"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 14},
|
||||||
|
"id": 12,
|
||||||
|
"options": {"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "rate(prometheus_tsdb_head_samples_appended_total[5m])", "legendFormat": "samples/s", "refId": "A"}],
|
||||||
|
"title": "Ingestion Rate (samples/s)",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "s"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 14},
|
||||||
|
"id": 13,
|
||||||
|
"options": {"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "scrape_duration_seconds", "legendFormat": "{{job}} {{instance}}", "refId": "A"}],
|
||||||
|
"title": "Scrape Duration by Job",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "bytes"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 23},
|
||||||
|
"id": 14,
|
||||||
|
"options": {"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "prometheus_tsdb_head_series", "legendFormat": "head series", "refId": "A"}, {"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "prometheus_tsdb_head_chunks", "legendFormat": "head chunks", "refId": "B"}],
|
||||||
|
"title": "TSDB Head Series & Chunks",
|
||||||
|
"type": "timeseries"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"datasource": {"type": "prometheus", "uid": "Prometheus"},
|
||||||
|
"fieldConfig": {
|
||||||
|
"defaults": {
|
||||||
|
"color": {"mode": "palette-classic"},
|
||||||
|
"custom": {"drawStyle": "line", "fillOpacity": 10, "lineInterpolation": "linear", "lineWidth": 1, "showPoints": "never", "spanNulls": true, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "off"}},
|
||||||
|
"mappings": [],
|
||||||
|
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
|
||||||
|
"unit": "s"
|
||||||
|
},
|
||||||
|
"overrides": []
|
||||||
|
},
|
||||||
|
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 23},
|
||||||
|
"id": 15,
|
||||||
|
"options": {"legend": {"calcs": ["lastNotNull"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
|
||||||
|
"targets": [{"datasource": {"type": "prometheus", "uid": "Prometheus"}, "expr": "rate(prometheus_http_request_duration_seconds_sum[5m]) / rate(prometheus_http_request_duration_seconds_count[5m])", "legendFormat": "avg HTTP req", "refId": "A"}],
|
||||||
|
"title": "Prometheus HTTP Request Duration",
|
||||||
|
"type": "timeseries"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"refresh": "30s",
|
||||||
|
"schemaVersion": 38,
|
||||||
|
"style": "dark",
|
||||||
|
"tags": ["k3s", "prometheus"],
|
||||||
|
"templating": {"list": []},
|
||||||
|
"time": {"from": "now-6h", "to": "now"},
|
||||||
|
"timepicker": {},
|
||||||
|
"timezone": "",
|
||||||
|
"title": "Prometheus Self-Monitoring",
|
||||||
|
"uid": "k3s-prometheus",
|
||||||
|
"version": 1,
|
||||||
|
"weekStart": ""
|
||||||
|
}
|
||||||
20
monitoring/grafana-dashboard-provider.yaml
Normal file
@@ -0,0 +1,20 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-provider
  namespace: monitoring
  labels:
    app: grafana
data:
  provider.yaml: |
    apiVersion: 1
    providers:
      - name: 'k3s-dashboards'
        orgId: 1
        folder: 'K3s Cluster'
        type: file
        disableDeletion: false
        updateIntervalSeconds: 30
        allowUiUpdates: true
        options:
          path: /var/lib/grafana/dashboards
@@ -33,6 +33,18 @@ spec:
|
|||||||
mountPath: /var/lib/grafana
|
mountPath: /var/lib/grafana
|
||||||
- name: grafana-datasources
|
- name: grafana-datasources
|
||||||
mountPath: /etc/grafana/provisioning/datasources
|
mountPath: /etc/grafana/provisioning/datasources
|
||||||
|
- name: grafana-dashboard-provider
|
||||||
|
mountPath: /etc/grafana/provisioning/dashboards
|
||||||
|
- name: dashboards-cluster-overview
|
||||||
|
mountPath: /var/lib/grafana/dashboards/cluster-overview
|
||||||
|
- name: dashboards-pods
|
||||||
|
mountPath: /var/lib/grafana/dashboards/pods
|
||||||
|
- name: dashboards-nodes
|
||||||
|
mountPath: /var/lib/grafana/dashboards/nodes
|
||||||
|
- name: dashboards-control-plane
|
||||||
|
mountPath: /var/lib/grafana/dashboards/control-plane
|
||||||
|
- name: dashboards-prometheus
|
||||||
|
mountPath: /var/lib/grafana/dashboards/prometheus
|
||||||
resources:
|
resources:
|
||||||
requests:
|
requests:
|
||||||
memory: "256Mi"
|
memory: "256Mi"
|
||||||
@@ -47,3 +59,21 @@ spec:
|
|||||||
- name: grafana-datasources
|
- name: grafana-datasources
|
||||||
configMap:
|
configMap:
|
||||||
name: grafana-datasources
|
name: grafana-datasources
|
||||||
|
- name: grafana-dashboard-provider
|
||||||
|
configMap:
|
||||||
|
name: grafana-dashboard-provider
|
||||||
|
- name: dashboards-cluster-overview
|
||||||
|
configMap:
|
||||||
|
name: grafana-dashboard-cluster-overview
|
||||||
|
- name: dashboards-pods
|
||||||
|
configMap:
|
||||||
|
name: grafana-dashboard-pods
|
||||||
|
- name: dashboards-nodes
|
||||||
|
configMap:
|
||||||
|
name: grafana-dashboard-nodes
|
||||||
|
- name: dashboards-control-plane
|
||||||
|
configMap:
|
||||||
|
name: grafana-dashboard-control-plane
|
||||||
|
- name: dashboards-prometheus
|
||||||
|
configMap:
|
||||||
|
name: grafana-dashboard-prometheus
|
||||||
|
|||||||
35
monitoring/ingress.yaml
Normal file
@@ -0,0 +1,35 @@
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: monitoring
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - grafana.rogi.casa
        - prometheus.rogi.casa
      secretName: monitoring-tls
  rules:
    - host: grafana.rogi.casa
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 80
    - host: prometheus.rogi.casa
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus
                port:
                  number: 9090
118
monitoring/kube-state-metrics.yaml
Normal file
@@ -0,0 +1,118 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
  - apiGroups: [""]
    resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - services
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "daemonsets", "deployments", "replicasets"]
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources: ["cronjobs", "jobs"]
    verbs: ["list", "watch"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "volumeattachments"]
    verbs: ["list", "watch"]
  - apiGroups: ["certificates.k8s.io"]
    resources: ["certificatesigningrequests"]
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
        - name: kube-state-metrics
          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.1
          ports:
            - containerPort: 8080
              name: http-metrics
            - containerPort: 8081
              name: telemetry
          readinessProbe:
            httpGet:
              path: /
              port: 8081
            initialDelaySeconds: 5
            timeoutSeconds: 5
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  selector:
    app: kube-state-metrics
  ports:
    - name: http-metrics
      port: 8080
      targetPort: http-metrics
    - name: telemetry
      port: 8081
      targetPort: telemetry
112
monitoring/node-exporter.yaml
Normal file
@@ -0,0 +1,112 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-exporter
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9100"
spec:
  selector:
    app: node-exporter
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-exporter
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-exporter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-exporter
subjects:
  - kind: ServiceAccount
    name: node-exporter
    namespace: monitoring
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      serviceAccountName: node-exporter
      hostPID: true
      hostNetwork: true
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.7.0
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --path.rootfs=/host/root
            - --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+)($|/)
            - --collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
          ports:
            - containerPort: 9100
              hostPort: 9100
              name: metrics
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /host/root
              readOnly: true
          resources:
            requests:
              memory: "64Mi"
              cpu: "50m"
            limits:
              memory: "128Mi"
              cpu: "200m"
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
@@ -2,7 +2,7 @@ apiVersion: v1
|
|||||||
kind: ConfigMap
|
kind: ConfigMap
|
||||||
metadata:
|
metadata:
|
||||||
name: myorg-assistant-config
|
name: myorg-assistant-config
|
||||||
namespace: default
|
namespace: myorg-assistant
|
||||||
data:
|
data:
|
||||||
# LiteLLM Configuration
|
# LiteLLM Configuration
|
||||||
LITELLM_ENDPOINT: "http://litellm-service.default.svc.cluster.local:4000"
|
LITELLM_ENDPOINT: "http://litellm-service.default.svc.cluster.local:4000"
|
||||||
|
|||||||
@@ -2,7 +2,7 @@ apiVersion: batch/v1
|
|||||||
kind: CronJob
|
kind: CronJob
|
||||||
metadata:
|
metadata:
|
||||||
name: myorg-deadline-checker
|
name: myorg-deadline-checker
|
||||||
namespace: default
|
namespace: myorg-assistant
|
||||||
labels:
|
labels:
|
||||||
app: myorg-assistant
|
app: myorg-assistant
|
||||||
job: deadline-checker
|
job: deadline-checker
|
||||||
|
|||||||
@@ -2,7 +2,7 @@ apiVersion: batch/v1
|
|||||||
kind: CronJob
|
kind: CronJob
|
||||||
metadata:
|
metadata:
|
||||||
name: myorg-evening-summary
|
name: myorg-evening-summary
|
||||||
namespace: default
|
namespace: myorg-assistant
|
||||||
labels:
|
labels:
|
||||||
app: myorg-assistant
|
app: myorg-assistant
|
||||||
job: evening-summary
|
job: evening-summary
|
||||||
|
|||||||
@@ -2,7 +2,7 @@ apiVersion: batch/v1
|
|||||||
kind: CronJob
|
kind: CronJob
|
||||||
metadata:
|
metadata:
|
||||||
name: myorg-git-sync
|
name: myorg-git-sync
|
||||||
namespace: default
|
namespace: myorg-assistant
|
||||||
labels:
|
labels:
|
||||||
app: myorg-assistant
|
app: myorg-assistant
|
||||||
job: git-sync
|
job: git-sync
|
||||||
|
|||||||
@@ -2,7 +2,7 @@ apiVersion: batch/v1
|
|||||||
kind: CronJob
|
kind: CronJob
|
||||||
metadata:
|
metadata:
|
||||||
name: myorg-morning-briefing
|
name: myorg-morning-briefing
|
||||||
namespace: default
|
namespace: myorg-assistant
|
||||||
labels:
|
labels:
|
||||||
app: myorg-assistant
|
app: myorg-assistant
|
||||||
job: morning-briefing
|
job: morning-briefing
|
||||||
|
|||||||
@@ -2,7 +2,7 @@ apiVersion: batch/v1
|
|||||||
kind: CronJob
|
kind: CronJob
|
||||||
metadata:
|
metadata:
|
||||||
name: myorg-waiting-followup
|
name: myorg-waiting-followup
|
||||||
namespace: default
|
namespace: myorg-assistant
|
||||||
labels:
|
labels:
|
||||||
app: myorg-assistant
|
app: myorg-assistant
|
||||||
job: waiting-followup
|
job: waiting-followup
|
||||||
|
|||||||
@@ -1,8 +1,13 @@
|
|||||||
|
apiVersion: v1
|
||||||
|
kind: Namespace
|
||||||
|
metadata:
|
||||||
|
name: myorg-assistant
|
||||||
|
---
|
||||||
apiVersion: apps/v1
|
apiVersion: apps/v1
|
||||||
kind: Deployment
|
kind: Deployment
|
||||||
metadata:
|
metadata:
|
||||||
name: myorg-assistant
|
name: myorg-assistant
|
||||||
namespace: default
|
namespace: myorg-assistant
|
||||||
labels:
|
labels:
|
||||||
app: myorg-assistant
|
app: myorg-assistant
|
||||||
spec:
|
spec:
|
||||||
@@ -58,7 +63,7 @@ spec:
|
|||||||
- name: gitea-registry
|
- name: gitea-registry
|
||||||
containers:
|
containers:
|
||||||
- name: myorg-assistant
|
- name: myorg-assistant
|
||||||
image: gitea.rogi.casa/roger/myorg-assistant/myorg-assistant:5215cd9
|
image: git.rogi.casa/roger/myorg-assistant/myorg-assistant:fcf79bf
|
||||||
imagePullPolicy: Always
|
imagePullPolicy: Always
|
||||||
command: ["./start.sh"]
|
command: ["./start.sh"]
|
||||||
ports:
|
ports:
|
||||||
|
|||||||
@@ -2,7 +2,7 @@ apiVersion: networking.k8s.io/v1
|
|||||||
kind: Ingress
|
kind: Ingress
|
||||||
metadata:
|
metadata:
|
||||||
name: myorg-ingress
|
name: myorg-ingress
|
||||||
namespace: default
|
namespace: myorg-assistant
|
||||||
annotations:
|
annotations:
|
||||||
# Use Traefik as the ingress controller (default in k3s)
|
# Use Traefik as the ingress controller (default in k3s)
|
||||||
kubernetes.io/ingress.class: "traefik"
|
kubernetes.io/ingress.class: "traefik"
|
||||||
@@ -10,14 +10,12 @@ metadata:
|
|||||||
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
||||||
# Optional: enable compression
|
# Optional: enable compression
|
||||||
traefik.ingress.kubernetes.io/compress: "true"
|
traefik.ingress.kubernetes.io/compress: "true"
|
||||||
cert-manager.io/issuer: prod-issuer
|
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||||
cert-manager.io/issuer-kind: OriginIssuer
|
|
||||||
cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
|
|
||||||
spec:
|
spec:
|
||||||
tls:
|
tls:
|
||||||
- hosts:
|
- hosts:
|
||||||
- "*.rogi.casa"
|
- myorg.rogi.casa
|
||||||
secretName: rogicasa-tls
|
secretName: myorg-tls
|
||||||
rules:
|
rules:
|
||||||
- host: myorg.rogi.casa
|
- host: myorg.rogi.casa
|
||||||
http:
|
http:
|
||||||
|
|||||||
@@ -2,7 +2,7 @@ apiVersion: v1
|
|||||||
kind: PersistentVolumeClaim
|
kind: PersistentVolumeClaim
|
||||||
metadata:
|
metadata:
|
||||||
name: myorg-assistant-pvc
|
name: myorg-assistant-pvc
|
||||||
namespace: default
|
namespace: myorg-assistant
|
||||||
spec:
|
spec:
|
||||||
accessModes:
|
accessModes:
|
||||||
- ReadWriteOnce
|
- ReadWriteOnce
|
||||||
|
|||||||
@@ -2,7 +2,7 @@ apiVersion: v1
|
|||||||
kind: Service
|
kind: Service
|
||||||
metadata:
|
metadata:
|
||||||
name: myorg-assistant-service
|
name: myorg-assistant-service
|
||||||
namespace: default
|
namespace: myorg-assistant
|
||||||
labels:
|
labels:
|
||||||
app: myorg-assistant
|
app: myorg-assistant
|
||||||
spec:
|
spec:
|
||||||
|
|||||||
@@ -10,14 +10,12 @@ metadata:
|
|||||||
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
||||||
# Optional: enable compression
|
# Optional: enable compression
|
||||||
traefik.ingress.kubernetes.io/compress: "true"
|
traefik.ingress.kubernetes.io/compress: "true"
|
||||||
cert-manager.io/issuer: prod-issuer
|
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||||
cert-manager.io/issuer-kind: OriginIssuer
|
|
||||||
cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
|
|
||||||
spec:
|
spec:
|
||||||
tls:
|
tls:
|
||||||
- hosts:
|
- hosts:
|
||||||
- "*.rogi.casa"
|
- n8n.rogi.casa
|
||||||
secretName: rogicasa-tls
|
secretName: n8n-tls
|
||||||
rules:
|
rules:
|
||||||
- host: n8n.rogi.casa
|
- host: n8n.rogi.casa
|
||||||
http:
|
http:
|
||||||
|
|||||||
45
nas.yaml
@@ -1,45 +0,0 @@
#apiVersion: networking.k8s.io/v1
#kind: Ingress
#metadata:
# name: nas-redirect
# annotations:
# nginx.ingress.kubernetes.io/permanent-redirect: "http://10.88.88.238:5000"
#spec:
# rules:
# - host: nas.rogi.casa
# http:
# paths:
# - path: /
# pathType: Prefix
# backend:
# service:
# name: dummy-service
# port:
# number: 80
apiVersion: v1
kind: Service
metadata:
  name: external-ip
spec:
  ports:
    - name: app
      port: 80
      protocol: TCP
      targetPort: 5000
  clusterIP: None
  type: ClusterIP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-ip
subsets:
  - addresses:
      - ip: 10.88.88.238
    ports:
      - name: app
        port: 5000
        protocol: TCP
77
nas/ingress.yaml
Normal file
@@ -0,0 +1,77 @@
|
|||||||
|
apiVersion: v1
|
||||||
|
kind: Namespace
|
||||||
|
metadata:
|
||||||
|
name: nas-proxy
|
||||||
|
---
|
||||||
|
# Standalone cert-manager Certificate for nas.rogi.casa (not owned by an Ingress,
|
||||||
|
# since cert-manager's ingress-shim would otherwise create one owned by the
|
||||||
|
# Ingress below and tie its lifecycle to it; keeping it standalone is cleaner).
|
||||||
|
apiVersion: cert-manager.io/v1
|
||||||
|
kind: Certificate
|
||||||
|
metadata:
|
||||||
|
name: nas-tls
|
||||||
|
namespace: nas-proxy
|
||||||
|
spec:
|
||||||
|
secretName: nas-tls
|
||||||
|
dnsNames:
|
||||||
|
- nas.rogi.casa
|
||||||
|
issuerRef:
|
||||||
|
group: cert-manager.io
|
||||||
|
kind: ClusterIssuer
|
||||||
|
name: letsencrypt-prod
|
||||||
|
usages:
|
||||||
|
- digital signature
|
||||||
|
- key encipherment
|
||||||
|
---
|
||||||
|
# Selector-less Service + manual Endpoints pointing at the NAS.
|
||||||
|
# (Endpoints is no longer excluded in argocd-cm, so ArgoCD manages it.)
|
||||||
|
apiVersion: v1
|
||||||
|
kind: Service
|
||||||
|
metadata:
|
||||||
|
name: synology-nas
|
||||||
|
namespace: nas-proxy
|
||||||
|
spec:
|
||||||
|
type: ClusterIP
|
||||||
|
clusterIP: None
|
||||||
|
ports:
|
||||||
|
- port: 5001
|
||||||
|
targetPort: 5001
|
||||||
|
protocol: TCP
|
||||||
|
---
|
||||||
|
apiVersion: v1
|
||||||
|
kind: Endpoints
|
||||||
|
metadata:
|
||||||
|
name: synology-nas
|
||||||
|
namespace: nas-proxy
|
||||||
|
subsets:
|
||||||
|
- addresses:
|
||||||
|
- ip: 10.88.30.10
|
||||||
|
ports:
|
||||||
|
- port: 5001
|
||||||
|
protocol: TCP
|
||||||
|
---
|
||||||
|
# Traefik IngressRoute (CRD provider) where scheme: https is a first-class
|
||||||
|
# field. The standard kubernetes Ingress `service.serversscheme` annotation is
|
||||||
|
# ignored for selector-less/Endpoints-backed services in Traefik v3, which
|
||||||
|
# caused Traefik to dial the NAS with plain HTTP -> 400 from DSM's nginx.
|
||||||
|
apiVersion: traefik.io/v1alpha1
|
||||||
|
kind: IngressRoute
|
||||||
|
metadata:
|
||||||
|
name: nas
|
||||||
|
namespace: nas-proxy
|
||||||
|
spec:
|
||||||
|
entryPoints:
|
||||||
|
- websecure
|
||||||
|
routes:
|
||||||
|
- match: Host(`nas.rogi.casa`)
|
||||||
|
kind: Rule
|
||||||
|
services:
|
||||||
|
- kind: Service
|
||||||
|
name: synology-nas
|
||||||
|
namespace: nas-proxy
|
||||||
|
port: 5001
|
||||||
|
scheme: https
|
||||||
|
serversTransport: skip-verify
|
||||||
|
passHostHeader: true
|
||||||
|
tls:
|
||||||
|
secretName: nas-tls
|
||||||
8
nas/transport.yaml
Normal file
@@ -0,0 +1,8 @@
|
|||||||
|
# nas-transport.yaml
|
||||||
|
apiVersion: traefik.io/v1alpha1
|
||||||
|
kind: ServersTransport
|
||||||
|
metadata:
|
||||||
|
name: skip-verify
|
||||||
|
namespace: nas-proxy
|
||||||
|
spec:
|
||||||
|
insecureSkipVerify: true
|
||||||
24
openwebui/ingress.yaml
Normal file
@@ -0,0 +1,24 @@
|
|||||||
|
apiVersion: networking.k8s.io/v1
|
||||||
|
kind: Ingress
|
||||||
|
metadata:
|
||||||
|
name: openwebui
|
||||||
|
namespace: openwebui
|
||||||
|
annotations:
|
||||||
|
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||||
|
spec:
|
||||||
|
ingressClassName: traefik
|
||||||
|
tls:
|
||||||
|
- hosts:
|
||||||
|
- openai.rogi.casa
|
||||||
|
secretName: openwebui-tls
|
||||||
|
rules:
|
||||||
|
- host: openai.rogi.casa
|
||||||
|
http:
|
||||||
|
paths:
|
||||||
|
- path: /
|
||||||
|
pathType: Prefix
|
||||||
|
backend:
|
||||||
|
service:
|
||||||
|
name: open-webui-service
|
||||||
|
port:
|
||||||
|
number: 80
|
||||||
@@ -1,7 +1,13 @@
|
|||||||
apiVersion: v1
|
apiVersion: v1
|
||||||
|
kind: Namespace
|
||||||
|
metadata:
|
||||||
|
name: openwebui
|
||||||
|
---
|
||||||
|
apiVersion: v1
|
||||||
kind: PersistentVolumeClaim
|
kind: PersistentVolumeClaim
|
||||||
metadata:
|
metadata:
|
||||||
name: openwebui-pvc
|
name: openwebui-pvc
|
||||||
|
namespace: openwebui
|
||||||
spec:
|
spec:
|
||||||
accessModes:
|
accessModes:
|
||||||
- ReadWriteOnce
|
- ReadWriteOnce
|
||||||
@@ -15,6 +21,7 @@ metadata:
|
|||||||
labels:
|
labels:
|
||||||
app: open-webui
|
app: open-webui
|
||||||
name: open-webui
|
name: open-webui
|
||||||
|
namespace: openwebui
|
||||||
spec:
|
spec:
|
||||||
replicas: 1
|
replicas: 1
|
||||||
selector:
|
selector:
|
||||||
@@ -84,6 +91,7 @@ metadata:
|
|||||||
labels:
|
labels:
|
||||||
app: open-webui
|
app: open-webui
|
||||||
name: open-webui-service
|
name: open-webui-service
|
||||||
|
namespace: openwebui
|
||||||
spec:
|
spec:
|
||||||
ports:
|
ports:
|
||||||
- name: http
|
- name: http
|
||||||
|
|||||||
@@ -10,14 +10,12 @@ metadata:
|
|||||||
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
traefik.ingress.kubernetes.io/redirect-entry-point: https
|
||||||
# Optional: enable compression
|
# Optional: enable compression
|
||||||
traefik.ingress.kubernetes.io/compress: "true"
|
traefik.ingress.kubernetes.io/compress: "true"
|
||||||
cert-manager.io/issuer: prod-issuer
|
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||||
cert-manager.io/issuer-kind: OriginIssuer
|
|
||||||
cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
|
|
||||||
spec:
|
spec:
|
||||||
tls:
|
tls:
|
||||||
- hosts:
|
- hosts:
|
||||||
- "*.rogi.casa"
|
- phoenix.rogi.casa
|
||||||
secretName: rogicasa-tls
|
secretName: phoenix-tls
|
||||||
rules:
|
rules:
|
||||||
- host: phoenix.rogi.casa
|
- host: phoenix.rogi.casa
|
||||||
http:
|
http:
|
||||||
|
|||||||
@@ -1,18 +0,0 @@
|
|||||||
# Optional: ServiceMonitor for Prometheus Operator
|
|
||||||
# Only apply this if you have Prometheus Operator installed
|
|
||||||
apiVersion: monitoring.coreos.com/v1
|
|
||||||
kind: ServiceMonitor
|
|
||||||
metadata:
|
|
||||||
name: phoenix-metrics
|
|
||||||
namespace: phoenix
|
|
||||||
labels:
|
|
||||||
app: phoenix
|
|
||||||
spec:
|
|
||||||
selector:
|
|
||||||
matchLabels:
|
|
||||||
app: phoenix
|
|
||||||
endpoints:
|
|
||||||
- port: metrics
|
|
||||||
path: /metrics
|
|
||||||
interval: 30s
|
|
||||||
scrapeTimeout: 10s
|
|
||||||
24
pihole/ingress.yaml
Normal file
@@ -0,0 +1,24 @@
|
|||||||
|
apiVersion: networking.k8s.io/v1
|
||||||
|
kind: Ingress
|
||||||
|
metadata:
|
||||||
|
name: pihole
|
||||||
|
namespace: pihole
|
||||||
|
annotations:
|
||||||
|
cert-manager.io/cluster-issuer: letsencrypt-prod
|
||||||
|
spec:
|
||||||
|
ingressClassName: traefik
|
||||||
|
tls:
|
||||||
|
- hosts:
|
||||||
|
- pihole.rogi.casa
|
||||||
|
secretName: pihole-tls
|
||||||
|
rules:
|
||||||
|
- host: pihole.rogi.casa
|
||||||
|
http:
|
||||||
|
paths:
|
||||||
|
- path: /
|
||||||
|
pathType: Prefix
|
||||||
|
backend:
|
||||||
|
service:
|
||||||
|
name: pihole-web
|
||||||
|
port:
|
||||||
|
number: 80
|
||||||
@@ -1,8 +1,14 @@
|
|||||||
|
---
|
||||||
|
apiVersion: v1
|
||||||
|
kind: Namespace
|
||||||
|
metadata:
|
||||||
|
name: pihole
|
||||||
|
---
|
||||||
apiVersion: v1
|
apiVersion: v1
|
||||||
kind: PersistentVolumeClaim
|
kind: PersistentVolumeClaim
|
||||||
metadata:
|
metadata:
|
||||||
name: pihole-pvc
|
name: pihole-pvc
|
||||||
namespace: default
|
namespace: pihole
|
||||||
spec:
|
spec:
|
||||||
accessModes:
|
accessModes:
|
||||||
- ReadWriteOnce
|
- ReadWriteOnce
|
||||||
@@ -14,7 +20,7 @@ apiVersion: apps/v1
|
|||||||
kind: Deployment
|
kind: Deployment
|
||||||
metadata:
|
metadata:
|
||||||
name: pihole
|
name: pihole
|
||||||
namespace: default
|
namespace: pihole
|
||||||
labels:
|
labels:
|
||||||
app: pihole
|
app: pihole
|
||||||
spec:
|
spec:
|
||||||
@@ -92,7 +98,7 @@ apiVersion: v1
|
|||||||
kind: Service
|
kind: Service
|
||||||
metadata:
|
metadata:
|
||||||
name: pihole-web
|
name: pihole-web
|
||||||
namespace: default
|
namespace: pihole
|
||||||
labels:
|
labels:
|
||||||
app: pihole
|
app: pihole
|
||||||
spec:
|
spec:
|
||||||
@@ -109,7 +115,7 @@ apiVersion: v1
|
|||||||
kind: Service
|
kind: Service
|
||||||
metadata:
|
metadata:
|
||||||
name: pihole-dns
|
name: pihole-dns
|
||||||
namespace: default
|
namespace: pihole
|
||||||
labels:
|
labels:
|
||||||
app: pihole
|
app: pihole
|
||||||
spec:
|
spec:
|
||||||
|
|||||||
367
platform-engineer/README.md
Normal file
@@ -0,0 +1,367 @@
|
|||||||
|
# Platform Engineer Agent — Deployment Plan
|
||||||
|
|
||||||
|
An autonomous **Hermes Agent** that runs inside the k3s cluster, watches its
|
||||||
|
health on a schedule, tries to fix simple problems, and notifies me (via
|
||||||
|
Discord) when something needs my attention or a fix failed.
|
||||||
|
|
||||||
|
Docs: https://hermes-agent.nousresearch.com/docs/user-guide/docker
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Goal & operating model
|
||||||
|
|
||||||
|
- **One Hermes container** in a new namespace `platform-engineer`, scheduled on
|
||||||
|
the powerful amd64 node (`roger-nucbox-evo-x2`, 24 GiB RAM).
|
||||||
|
- Hermes runs in **gateway mode** under s6 supervision (`command: gateway run`),
|
||||||
|
so the built-in **cron scheduler** is active and survives restarts.
|
||||||
|
- The agent talks to the cluster with `kubectl` from *inside* the container
|
||||||
|
(terminal backend = `local`). We give the pod a **ServiceAccount + ClusterRole**
|
||||||
|
scoped to read-mostly + restart/scale/delete-pod permissions.
|
||||||
|
- LLM calls are routed through the in-cluster **LiteLLM** proxy
|
||||||
|
(`litellm.rogi.casa`) — no external API keys needed in the cluster.
|
||||||
|
- Notifications go to **Discord** (reuse the pattern from `myorg-assistant`).
|
||||||
|
- A set of **cron jobs** (Hermes-native, not Kubernetes CronJobs) make the agent
|
||||||
|
run periodic checks. Watchdog checks use `[SILENT]` so it only pings me when
|
||||||
|
something is wrong.
|
||||||
|
|
||||||
|
Why Hermes-native cron (not k8s CronJobs):
|
||||||
|
- Hermes cron ticks inside the gateway, runs in an isolated agent session,
|
||||||
|
supports `[SILENT]` suppression, `deliver="discord"`, `workdir`, and
|
||||||
|
`context_from` chaining — far less plumbing than spawning a fresh pod per run.
|
||||||
|
- Cron jobs live in `~/.hermes/cron/jobs.json` on the PVC, so they survive pod
|
||||||
|
restarts and can be edited live via `hermes cron edit` without redeploying.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Files to create (this directory)
|
||||||
|
|
||||||
|
```
|
||||||
|
platform-engineer/
|
||||||
|
├── namespace.yaml # namespace platform-engineer
|
||||||
|
├── rbac.yaml # ServiceAccount + ClusterRole (+binding)
|
||||||
|
├── configmap.yaml # hermes config.yaml + SOUL.md + cron seed script
|
||||||
|
├── secret.yaml # DISCORD bot token, LITELLM_API_KEY, kubeconfig-less SA token
|
||||||
|
├── pvc.yaml # persistent /opt/data (HERMES_HOME)
|
||||||
|
├── dockerfile # derived image: hermes-agent + kubectl + helm
|
||||||
|
├── deployment.yaml # Deployment, schedules on amd64, mounts kube SA token
|
||||||
|
├── ingress.yaml # hermes.rogi.casa → dashboard (optional)
|
||||||
|
└── README.md # this file
|
||||||
|
```
|
||||||
|
|
||||||
|
Then add a line to `argocd/gen-apps.sh` `APPS=(...)`:
|
||||||
|
```
|
||||||
|
"platform-engineer|platform-engineer|platform-engineer|true|true"
|
||||||
|
```
|
||||||
|
and re-run `./argocd/gen-apps.sh` to generate `argocd/apps/platform-engineer.yaml`
|
||||||
|
so ArgoCD reconciles it like every other app in the repo.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. RBAC — least privilege
|
||||||
|
|
||||||
|
ServiceAccount `platform-engineer` in ns `platform-engineer`, bound to a
|
||||||
|
**ClusterRole** scoped to *platform engineer* actions:
|
||||||
|
|
||||||
|
**Read (get/list/watch):** nodes, pods, services, deployments, statefulsets,
|
||||||
|
daemonsets, replicasets, jobs, cronjobs, events, configmaps, secrets, PVCs,
|
||||||
|
ingresses, namespaces.
|
||||||
|
|
||||||
|
**Act (patch/update on an allowlist):**
|
||||||
|
- `pods` → `delete` (force-restart a stuck pod), `patch` (`/evict`, annotations)
|
||||||
|
- `deployments`, `statefulsets`, `daemonsets`, `replicasets` → `patch` (restart
|
||||||
|
via `kubectl rollout restart` / scale), `update`
|
||||||
|
- `jobs`, `cronjobs` → `delete`, `patch`
|
||||||
|
- `pods/exec` (subresource) → `create` (only if we want the agent to `kubectl
|
||||||
|
exec` into pods for log-style debugging — optional; keep off initially)
|
||||||
|
- `events` → `get/list/watch` only
|
||||||
|
|
||||||
|
**No cluster-scoped writes** (no creating namespaces, no node taints, no RBAC
|
||||||
|
edits, no CRDs). The agent can *propose* those and tell me; it cannot do them
|
||||||
|
itself. All mutating calls are auditable via Kubernetes audit logs and
|
||||||
|
`kubectl auth can-i --as=system:serviceaccount:platform-engineer:platform-engineer`.
|
||||||
|
|
||||||
|
The pod uses the k3s in-cluster ServiceAccount token (`/var/run/secrets/...
|
||||||
|
/serviceaccount/token`) + the `KUBERNETES_SERVICE_HOST/PORT` env vars k3s already
|
||||||
|
injects — **no kubeconfig file, no long-lived token on disk**.
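
The `kubectl auth can-i` audit mentioned above could be scripted roughly as follows. This is a minimal sketch, not part of the repo: the `check_rbac` function and the `KUBECTL` override variable are hypothetical names, and the ServiceAccount subject is the one named in this section. It reads `verb resource` pairs from stdin and prints the answer for each.

```shell
#!/bin/sh
# Hypothetical helper: audit what the agent's ServiceAccount may do.
# KUBECTL can be overridden (for example with "k3s kubectl"); this is an assumption.
KUBECTL="${KUBECTL:-kubectl}"
SA="system:serviceaccount:platform-engineer:platform-engineer"

# Reads "verb resource" pairs from stdin, one per line, and prints
# "verb resource -> yes|no" using impersonation (--as).
check_rbac() {
  while read -r verb resource; do
    [ -n "$verb" ] || continue
    # `kubectl auth can-i` exits non-zero on "no", so the exit code is ignored here.
    answer="$($KUBECTL auth can-i "$verb" "$resource" --as="$SA" 2>/dev/null || true)"
    echo "${verb} ${resource} -> ${answer:-unknown}"
  done
}
```

Usage, assuming the caller is allowed to impersonate the ServiceAccount: `printf 'delete pods\ncreate namespaces\n' | check_rbac`. Any `yes` on a verb outside the allowlist would indicate that the ClusterRole is broader than intended.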
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Image — thin derived Dockerfile
|
||||||
|
|
||||||
|
```dockerfile
|
||||||
|
FROM nousresearch/hermes-agent:latest
|
||||||
|
USER root
|
||||||
|
RUN apt-get update \
|
||||||
|
&& apt-get install -y --no-install-recommends curl gnupg \
|
||||||
|
&& curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.35/deb/Release.key \
|
||||||
|
| gpg --dearmor -o /usr/share/keyrings/kubernetes-apt-keyring.gpg \
|
||||||
|
&& echo 'deb [signed-by=/usr/share/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.35/deb/ /' \
|
||||||
|
> /etc/apt/sources.list.d/kubernetes.list \
|
||||||
|
&& apt-get update \
|
||||||
|
&& apt-get install -y --no-install-recommends kubectl \
|
||||||
|
&& curl -fsSL https://get.helm.sh/helm-v3.16.0-linux-amd64.tar.gz \
|
||||||
|
| tar -xz -C /usr/local/bin --strip-components=1 linux-amd64/helm \
|
||||||
|
&& rm -rf /var/lib/apt/lists/*
|
||||||
|
USER hermes
|
||||||
|
```
|
||||||
|
|
||||||
|
> Note: the cluster is mixed arch (arm64/amd64/arm). The agent pod is pinned to
|
||||||
|
> the amd64 node, so `linux-amd64` helm + `kubectl` packages are fine. If you
|
||||||
|
> later want it portable, switch to a multi-arch build with
|
||||||
|
> `TARGETARCH` and install matching helm arch.
|
||||||
|
|
||||||
|
Build & push to your Gitea registry (`git.rogi.casa/roger/...`) — same
|
||||||
|
`imagePullSecrets: gitea-registry` pattern as `gym-tracker`. Tag with the
|
||||||
|
hermes version + a short git sha.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Hermes configuration (mounted via ConfigMap → /opt/data/config.yaml)
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
# config.yaml (seeded into the PVC on first boot)
|
||||||
|
model:
|
||||||
|
provider: openai-api
|
||||||
|
default: claude-4.5-haiku
|
||||||
|
base_url: "https://litellm.rogi.casa/v1"
|
||||||
|
api_mode: chat_completions
|
||||||
|
|
||||||
|
# Use a cheap, fast model for auxiliary tasks (titling, compression)
|
||||||
|
auxiliary:
|
||||||
|
compression:
|
||||||
|
provider: openai-api
|
||||||
|
model: gemini-3-flash
|
||||||
|
title_generation:
|
||||||
|
provider: openai-api
|
||||||
|
model: gemini-3-flash
|
||||||
|
|
||||||
|
terminal:
|
||||||
|
backend: local
|
||||||
|
cwd: /workspace # a working dir for any kubectl output / scratch
|
||||||
|
timeout: 180
|
||||||
|
home_mode: profile # isolate tool credentials under HERMES_HOME/home
|
||||||
|
|
||||||
|
# Unattended gateway → circuit-breaker on tool-call loops
|
||||||
|
tool_loop_guardrails:
|
||||||
|
hard_stop_enabled: true
|
||||||
|
hard_stop_after:
|
||||||
|
exact_failure: 5
|
||||||
|
idempotent_no_progress: 5
|
||||||
|
|
||||||
|
sessions:
|
||||||
|
auto_prune: true
|
||||||
|
retention_days: 90
|
||||||
|
|
||||||
|
cron:
|
||||||
|
wrap_response: false # cleaner Discord messages
|
||||||
|
|
||||||
|
memory:
|
||||||
|
memory_enabled: true
|
||||||
|
user_profile_enabled: true
|
||||||
|
```
|
||||||
|
|
||||||
|
`.env` (from Secret, mounted to `/opt/data/.env`):
|
||||||
|
```
|
||||||
|
OPENAI_API_KEY=<LITELLM_API_KEY value, i.e. sk-...>
|
||||||
|
OPENAI_BASE_URL=https://litellm.rogi.casa/v1
|
||||||
|
DISCORD_BOT_TOKEN=<new dedicated bot token>
|
||||||
|
DISCORD_HOME_CHANNEL=<your user/channel id for alerts>
|
||||||
|
# Dashboard auth (homelab, trusted LAN behind ingress)
|
||||||
|
HERMES_DASHBOARD_BASIC_AUTH_USERNAME=roger
|
||||||
|
HERMES_DASHBOARD_BASIC_AUTH_PASSWORD=<strong password>
|
||||||
|
```
|
||||||
|
|
||||||
|
> Why `OPENAI_API_KEY` + `OPENAI_BASE_URL`: the `openai-api` provider honours
|
||||||
|
> `OPENAI_BASE_URL`, so this is the simplest way to point Hermes at the
|
||||||
|
> in-cluster LiteLLM. `claude-4.5-haiku` / `gemini-3-flash` are the model names
|
||||||
|
> already exposed by your `litellm/litellm.yaml` ConfigMap.
|
||||||
|
|
||||||
|
`SOUL.md` (personality + guardrails) — see `configmap.yaml`. Key points:
|
||||||
|
- Identity: "Platform Engineer for the rogi.casa k3s cluster."
|
||||||
|
- Knows the cluster layout (3 nodes, ArgoCD GitOps, Traefik+cert-manager,
|
||||||
|
LiteLLM, services list).
|
||||||
|
- Operating rules: read-first; only act on the allowlisted verbs; never edit
|
||||||
|
RBAC / taints / namespaces / CRDs; when in doubt, notify instead of acting;
|
||||||
|
always cite the resource and the command used.
|
||||||
|
- How to reach me: `deliver="discord"`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Deployment
|
||||||
|
|
||||||
|
- `replicas: 1` (Hermes data dir is single-writer — never scale >1).
|
||||||
|
- `nodeSelector: kubernetes.io/arch: amd64` + preferred `hardware: high-memory`
|
||||||
|
affinity → lands on the NUC.
|
||||||
|
- `resources`: requests 512Mi/250m, limits 2Gi/1 core (Hermes recommends
|
||||||
|
2–4 GiB; 1 GiB is fine without browser tools, which we keep off).
|
||||||
|
- Volume: PVC mounted at `/opt/data` (HERMES_HOME), RWX not needed (single pod).
|
||||||
|
- Ports: 8642 (gateway API, internal only) and 9119 (dashboard) → exposed via
|
||||||
|
Ingress `hermes.rogi.casa` with TLS + basic-auth (already enforced by the
|
||||||
|
`HERMES_DASHBOARD_BASIC_AUTH_*` env vars).
|
||||||
|
- `imagePullSecrets: gitea-registry`.
|
||||||
|
- env from Secret; `HERMES_DASHBOARD=1`.
|
||||||
|
- Init: on first boot the s6 `01-hermes-setup` hook seeds config/SOUL/.env from
|
||||||
|
the ConfigMap if the volume is empty. We mount the ConfigMap as a readonly
|
||||||
|
projection at `/opt/seed/` and run a tiny initContainer to copy it into
|
||||||
|
`/opt/data` only when `/opt/data/config.yaml` doesn't exist (so ArgoCD
|
||||||
|
self-heal never fights the agent's live-edited config).
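
The seed-only-once behaviour of the initContainer described above might be sketched like this. It is an illustration under assumptions: the `seed_if_empty` function name is hypothetical, only `config.yaml` and `SOUL.md` are copied, and the default paths are the ones named in this section.

```shell
#!/bin/sh
# Hypothetical initContainer script: copy seed files into the PVC only
# when no config.yaml exists yet, so live edits are never overwritten.
seed_if_empty() {
  seed_dir="${1:-/opt/seed}"
  data_dir="${2:-/opt/data}"

  # Already seeded: leave the runtime state untouched.
  if [ -e "${data_dir}/config.yaml" ]; then
    echo "skip: ${data_dir}/config.yaml already exists"
    return 0
  fi

  mkdir -p "${data_dir}"
  # ConfigMap mounts expose files as symlinks; plain cp copies their content.
  for name in config.yaml SOUL.md; do
    if [ -f "${seed_dir}/${name}" ]; then
      cp "${seed_dir}/${name}" "${data_dir}/${name}"
    fi
  done
  echo "seeded: ${data_dir}"
}
```

In the initContainer this would be invoked as `seed_if_empty /opt/seed /opt/data`. Keying the check on `config.yaml` alone keeps the logic simple, at the cost of not re-seeding a `SOUL.md` that was deleted by hand.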
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. Cron jobs to seed (Hermes-native)
|
||||||
|
|
||||||
|
These are written by an init script (one-shot Job `hermes-cron-seed`) that runs
|
||||||
|
`hermes cron create ...` against the gateway on first install, and is idempotent
|
||||||
|
(it checks existing job names). All deliver to Discord. Examples:
|
||||||
|
|
||||||
|
| Name | Schedule | Prompt (abbreviated) |
|
||||||
|
|------|----------|------------------------|
|
||||||
|
| `cluster-health-check` | `every 15m` | Run `kubectl get nodes`, `kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded`, and `kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp`, considering only events from the last 20 minutes. If everything is healthy, reply with only `[SILENT]`. Otherwise summarize failures and root-cause briefly. |
|
||||||
|
| `pod-restart-loop` | `every 10m` | Find pods in `CrashLoopBackOff`/`ImagePullBackOff` across all namespaces. For `CrashLoopBackOff`, fetch logs and if a clear transient cause (OOM, config parse, missing secret) is visible, attempt `kubectl rollout restart <deploy>`; otherwise notify me with the log excerpt. Reply `[SILENT]` if none found. |
|
||||||
|
| `pvc-pressure` | `every 30m` | `kubectl get pv` + node disk via `kubectl top nodes`. Alert if any PVC `Bound` to a near-full volume or node disk >85%. `[SILENT]` otherwise. |
|
||||||
|
| `argocd-sync-health` | `every 1h` | `kubectl get applications -n argocd -o wide` (or `argocd app sync --dry-run` if CLI present). Report any `OutOfSync`/`Degraded` app. `[SILENT]` if all `Synced`+`Healthy`. |
|
||||||
|
| `cert-expiry` | `every 1d at 09:00` | List cert-manager `Certificate` resources with expiry < 21 days. Notify only if any. `[SILENT]` otherwise. |
|
||||||
|
| `node-resource-drift` | `every 30m` | `kubectl top nodes`. Alert if any node CPU>90% or mem>90% sustained, or any node `NotReady`. `[SILENT]` otherwise. |
|
||||||
|
| `daily-cluster-report` | `0 8 * * *` | Summarize: node count/status, top 5 pods by CPU/mem, # pods not Running, # ArgoCD apps OutOfSync, cert warnings. Always deliver (no `[SILENT]`). |
|
||||||
|
|
||||||
|
Design rules baked into SOUL.md:
|
||||||
|
- **Read-only checks** run frequently (10–30m) and stay silent unless wrong.
|
||||||
|
- **Mutating actions** are restricted to safe idempotent ones (rollout restart,
|
||||||
|
delete stuck pod so controller recreates). Anything riskier → notify me with
|
||||||
|
a proposed command and wait for me to run it (I can reply in Discord to the
|
||||||
|
continuable thread).
|
||||||
|
- Cron sessions are isolated and **cannot create new cron jobs** (Hermes
|
||||||
|
disables that inside cron runs) → no runaway loops.
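
The idempotency check mentioned for the `hermes-cron-seed` Job (it "checks existing job names") could be sketched as below. The `missing_jobs` function is a hypothetical helper, and how the list of existing names is obtained from Hermes is deliberately left out, since the exact CLI call is not shown in this plan.

```shell
#!/bin/sh
# Hypothetical helper for an idempotent seed: print the desired job names
# that do not appear in the list of existing job names.
# $1: newline-separated existing names, $2: newline-separated desired names.
missing_jobs() {
  existing="$1"
  printf '%s\n' "$2" | while read -r name; do
    [ -n "$name" ] || continue
    # -F fixed string, -x whole-line match, -q quiet
    if ! printf '%s\n' "$existing" | grep -Fxq -- "$name"; then
      echo "$name"
    fi
  done
}
```

The seed script would then create only the names this function prints, so re-running the Job after a partial seed does not produce duplicate jobs.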
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Safety & guardrails
|
||||||
|
|
||||||
|
1. **RBAC is the real boundary.** Even if the agent goes rogue, the SA can't
|
||||||
|
touch other namespaces' secrets beyond read, can't change RBAC, can't taint
|
||||||
|
nodes, can't create namespaces.
|
||||||
|
2. **`tool_loop_guardrails.hard_stop_enabled: true`** — circuit-breaks a stuck
|
||||||
|
gateway (recommended in the Docker doc for unattended deployments).
|
||||||
|
3. **`skills.write_approval: false` but `memory.write_approval: true`** (so the
|
||||||
|
agent can build skills/memories but I review memory writes lazily — flip
|
||||||
|
this if it gets noisy).
|
||||||
|
4. **No `pods/exec` subresource** initially (keep the agent from shelling into
|
||||||
|
workloads). Enable later only if you want log-grep-style debugging.
|
||||||
|
5. **Dashboard behind ingress TLS + basic auth** (the June-2026 hardening makes
|
||||||
|
auth mandatory on non-loopback binds; we satisfy it with the bundled
|
||||||
|
basic-auth provider).
|
||||||
|
6. **Single replica / single-writer PVC** — the Docker doc is explicit that two
|
||||||
|
gateways on the same `/opt/data` corrupt session/memory stores. Use a
|
||||||
|
`podAntiAffinity` so an accidental scale-up doesn't co-run.
|
||||||
|
7. **ArgoCD interaction:** keep `syncPolicy.automated.prune+selfHeal` but
|
||||||
|
exclude the live-edited hermes state. Practically: Argo owns the *manifests*
|
||||||
|
(deployment, configmap, secret, pvc), while `/opt/data` (config.yaml,
|
||||||
|
cron/jobs.json, SOUL.md edits made via the dashboard) is runtime state on the
|
||||||
|
PVC and is *not* reconciled by Argo. The ConfigMap only *seeds* it on first
|
||||||
|
boot. Document this clearly in the README so future-you doesn't expect Argo
|
||||||
|
to reset the agent's personality.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 9. Rollout plan
|
||||||
|
|
||||||
|
1. Build & push the derived image to `git.rogi.casa/roger/hermes-agent` (tag
|
||||||
|
`v1.35-<sha>`).
|
||||||
|
2. Create the namespace + RBAC + Secret + ConfigMap + PVC:
|
||||||
|
`kubectl apply -f platform-engineer/`.
|
||||||
|
3. Create the `platform-engineer` Discord bot, invite it, put its token + your
|
||||||
|
channel id in `secret.yaml` (base64).
|
||||||
|
4. Apply the Deployment; wait for the pod to go Running.
|
||||||
|
5. `kubectl exec` in and run the one-shot cron seed:
|
||||||
|
`hermes cron create ...` (or apply the `cron-seed` Job).
|
||||||
|
6. Trigger the first `cluster-health-check` manually: `hermes cron run cluster-health-check`.
|
||||||
|
7. Add the app to `argocd/gen-apps.sh`, regenerate, commit, push.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 10. Decisions (locked in)
|
||||||
|
|
||||||
|
1. **Notifications:** dedicated `platform-engineer` Discord bot → its own token
|
||||||
|
in `secret.yaml` (`DISCORD_BOT_TOKEN`, `DISCORD_HOME_CHANNEL`).
|
||||||
|
2. **Dashboard:** public at `hermes.rogi.casa` (Traefik TLS + cert-manager + the
|
||||||
|
bundled Hermes basic-auth provider). Reach the dashboard on port 9119; the
|
||||||
|
gateway API on 8642 is ClusterIP-only.
|
||||||
|
3. **Image:** derived image pushed to `git.rogi.casa/roger/hermes-agent`, pulled
|
||||||
|
via the existing `gitea-registry` imagePullSecret (must also exist in the
|
||||||
|
`platform-engineer` ns — see deploy steps).
|
||||||
|
4. **Model:** `qwen-3.6:27b` via the in-cluster Ollama box (`10.88.20.12:11434`),
|
||||||
|
exposed through LiteLLM as `qwen-3.6:27b`. Added to `litellm/litellm.yaml`.
|
||||||
|
Hermes reaches LiteLLM at `https://litellm.rogi.casa/v1` (never Ollama directly).
|
||||||
|
5. **pods/exec:** granted (`pods/exec` → `create` in the ClusterRole) so the
|
||||||
|
agent can `kubectl exec`/`kubectl logs` for debugging.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 11. Deployment checklist (do in this order)
|
||||||
|
|
||||||
|
1. **Add the Ollama model to LiteLLM** (already done in `litellm/litellm.yaml`):
|
||||||
|
the `qwen-3.6:27b` entry points at `http://10.88.20.12:11434`. Make sure
|
||||||
|
`qwen3.6:27b` is actually pulled on that Ollama host
|
||||||
|
(`ollama pull qwen3.6:27b`). Apply: `kubectl apply -f litellm/` and restart
|
||||||
|
the LiteLLM pod so the new config takes effect.
|
||||||
|
2. **Create the `gitea-registry` secret in the new namespace** (ArgoCD won't
|
||||||
|
create it — it's not in the repo):
|
||||||
|
```
|
||||||
|
kubectl create namespace platform-engineer
|
||||||
|
kubectl create secret docker-registry gitea-registry \
|
||||||
|
--docker-server=git.rogi.casa \
|
||||||
|
--docker-username=<your-gitea-user> \
|
||||||
|
--docker-password=<gitea-access-token> \
|
||||||
|
--docker-email=<your-email> \
|
||||||
|
-n platform-engineer
|
||||||
|
```
|
||||||
|
3. **Build & push the image:** `./platform-engineer/build-and-push.sh`
|
||||||
|
(after `docker login git.rogi.casa`).
|
||||||
|
4. **Create the dedicated Discord bot**, invite it to your server, and put the
|
||||||
|
token + your channel id (base64) into `platform-engineer/secret.yaml`. Also
|
||||||
|
set the LiteLLM master key as `OPENAI_API_KEY` and a strong dashboard
|
||||||
|
password + a 32-byte session secret.
|
||||||
|
5. **Commit & push** the whole change. ArgoCD will create the namespace
|
||||||
|
resources, deploy the pod, and bring up the ingress at `hermes.rogi.casa`.
|
||||||
|
6. **Seed the cron jobs:**
|
||||||
|
`kubectl apply -f platform-engineer/cron-seed.yaml` (one-shot Job) — it waits
|
||||||
|
for the hermes pod, then runs `hermes cron create ...` for each watchdog.
|
||||||
|
Re-run it any time you want to re-seed after a wipe.
|
||||||
|
7. **Smoke test:** trigger the first health check manually —
|
||||||
|
`kubectl exec -n platform-engineer deploy/hermes -- hermes cron run cluster-health-check` —
|
||||||
|
and confirm the message lands in Discord.
|
||||||
|
8. **ArgoCD:** the `Application` (`argocd/apps/platform-engineer.yaml`) is
|
||||||
|
already generated. After commit, Argo will reconcile it like every other app.
|
||||||
|
|
||||||
|
## 12. What ArgoCD owns vs. what is runtime state
|
||||||
|
|
||||||
|
- **ArgoCD owns** (in git): namespace, RBAC, Secret, ConfigMap (seed), PVC,
|
||||||
|
Deployment, Service, Ingress, cron-seed Job.
|
||||||
|
- **Runtime state (on the PVC, NOT reconciled):** `config.yaml`, `SOUL.md`,
|
||||||
|
`.env`, `cron/jobs.json`, `sessions/`, `memories/`, `skills/`. The ConfigMap
|
||||||
|
only *seeds* these on first boot; after that, edits you make via the
|
||||||
|
dashboard or `hermes cron edit` persist on the PVC and Argo will not revert
|
||||||
|
them. If you ever want a hard reset, delete the PVC and re-apply.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Files in this directory
|
||||||
|
|
||||||
|
| File | Purpose |
|
||||||
|
|------|---------|
|
||||||
|
| `namespace.yaml` | namespace `platform-engineer` |
|
||||||
|
| `rbac.yaml` | ServiceAccount + ClusterRole (+binding), least-privilege |
|
||||||
|
| `configmap.yaml` | seed `config.yaml` + `SOUL.md` |
|
||||||
|
| `secret.yaml` | Discord token, LiteLLM key, dashboard auth (PLACEHOLDERS — fill in) |
|
||||||
|
| `pvc.yaml` | 5 Gi PVC for `/opt/data` |
|
||||||
|
| `dockerfile` | derived image: hermes-agent + kubectl + helm (linux/amd64) |
|
||||||
|
| `build-and-push.sh` | builds & pushes the image to the Gitea registry |
|
||||||
|
| `deployment.yaml` | Deployment (1 replica, Recreate, pinned to amd64 NUC) + Service |
|
||||||
|
| `ingress.yaml` | `hermes.rogi.casa` → dashboard (TLS + basic auth) |
|
||||||
|
| `cron-seed.yaml` | one-shot Job that creates the Hermes cron schedule |
|
||||||
|
|
||||||
|
Also changed outside this directory:
|
||||||
|
- `litellm/litellm.yaml` — added `qwen-3.6:27b` model entry.
|
||||||
|
- `argocd/gen-apps.sh` + `argocd/apps/platform-engineer.yaml` — ArgoCD
|
||||||
|
Application for this folder.
|
||||||
|
|
||||||
43
platform-engineer/build-and-push.sh
Executable file
@@ -0,0 +1,43 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
# Build & push the derived Hermes image (kubectl + helm).
|
||||||
|
#
|
||||||
|
# Two modes:
|
||||||
|
# ./build-and-push.sh push # build + push to the Gitea registry
|
||||||
|
# ./build-and-push.sh local # build + import directly into the NUC's k3s containerd
|
||||||
|
# # (no registry needed; pod is pinned to this node)
|
||||||
|
#
|
||||||
|
# Default (no arg): push.
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
# Docker registry pushes can't go through the Cloudflare proxy (100 MB cap),
|
||||||
|
# so push to the DNS-only registry hostname instead of git.rogi.casa.
|
||||||
|
# Override with: REGISTRY=git.rogi.casa ./build-and-push.sh push (if grey-clouded)
|
||||||
|
REGISTRY="${REGISTRY:-registry.rogi.casa}"
|
||||||
|
REPO="roger/hermes-agent"
|
||||||
|
TAG="${TAG:-v1.35-1}"
|
||||||
|
IMAGE="${REGISTRY}/${REPO}:${TAG}"
|
||||||
|
MODE="${1:-push}"
|
||||||
|
|
||||||
|
cd "$(dirname "$0")"
|
||||||
|
|
||||||
|
echo "==> Building ${IMAGE}"
|
||||||
|
docker build --platform linux/amd64 -t "${IMAGE}" -f dockerfile .
|
||||||
|
|
||||||
|
case "$MODE" in
|
||||||
|
push)
|
||||||
|
echo "==> Pushing ${IMAGE}"
|
||||||
|
docker push "${IMAGE}"
|
||||||
|
echo "==> Done. If the pod can't pull, create the gitea-registry secret in the namespace."
|
||||||
|
;;
|
||||||
|
local)
|
||||||
|
# Requires k3s + being run on the node the pod schedules to (roger-nucbox-evo-x2).
|
||||||
|
echo "==> Importing into k3s containerd (requires sudo)"
|
||||||
|
docker save "${IMAGE}" | sudo k3s ctr images import -
|
||||||
|
echo "==> Done. Verify: sudo k3s ctr images ls | grep hermes-agent"
|
||||||
|
echo " deployment.yaml is set to imagePullPolicy: IfNotPresent"
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
echo "Usage: $0 {push|local}" >&2
|
||||||
|
exit 1
|
||||||
|
;;
|
||||||
|
esac
|
||||||
115
platform-engineer/configmap.yaml
Normal file
@@ -0,0 +1,115 @@
|
|||||||
|
# Hermes configuration, SOUL.md, and the cron-seed script.
|
||||||
|
# Seeded into the PVC (/opt/data) by the initContainer on first boot only.
|
||||||
|
---
|
||||||
|
apiVersion: v1
|
||||||
|
kind: ConfigMap
|
||||||
|
metadata:
|
||||||
|
name: hermes-seed
|
||||||
|
namespace: platform-engineer
|
||||||
|
data:
|
||||||
|
config.yaml: |
|
||||||
|
model:
|
||||||
|
provider: openai-api
|
||||||
|
default: qwen-3.6:27b
|
||||||
|
base_url: "https://litellm.rogi.casa/v1"
|
||||||
|
api_mode: chat_completions
|
||||||
|
|
||||||
|
# Cheap/fast model for auxiliary tasks (titling, compression).
|
||||||
|
auxiliary:
|
||||||
|
compression:
|
||||||
|
provider: openai-api
|
||||||
|
model: qwen-3.6:27b
|
||||||
|
base_url: "https://litellm.rogi.casa/v1"
|
||||||
|
title_generation:
|
||||||
|
provider: openai-api
|
||||||
|
model: qwen-3.6:27b
|
||||||
|
base_url: "https://litellm.rogi.casa/v1"
|
||||||
|
|
||||||
|
terminal:
|
||||||
|
backend: local
|
||||||
|
cwd: /workspace
|
||||||
|
timeout: 180
|
||||||
|
home_mode: profile
|
||||||
|
|
||||||
|
# Unattended gateway → circuit-break on stuck tool-call loops.
|
||||||
|
tool_loop_guardrails:
|
||||||
|
hard_stop_enabled: true
|
||||||
|
hard_stop_after:
|
||||||
|
exact_failure: 5
|
||||||
|
idempotent_no_progress: 5
|
||||||
|
|
||||||
|
sessions:
|
||||||
|
auto_prune: true
|
||||||
|
retention_days: 90
|
||||||
|
|
||||||
|
cron:
|
||||||
|
wrap_response: false
|
||||||
|
|
||||||
|
memory:
|
||||||
|
memory_enabled: true
|
||||||
|
user_profile_enabled: true
|
||||||
|
write_approval: false
|
||||||
|
|
||||||
|
skills:
|
||||||
|
write_approval: false
|
||||||
|
|
||||||
|
SOUL.md: |
|
||||||
|
# Platform Engineer — rogi.casa k3s cluster
|
||||||
|
|
||||||
|
You are the autonomous Platform Engineer for the `rogi.casa` K3s cluster.
|
||||||
|
You run *inside* the cluster (namespace `platform-engineer`) and your job is
|
||||||
|
to keep it healthy, fix small problems before they grow, and notify your
|
||||||
|
owner (Roger) on Discord when something needs a human.
|
||||||
|
|
||||||
|
## The cluster you look after
|
||||||
|
|
||||||
|
- **Nodes:**
|
||||||
|
- `raspberrypi` — control-plane, arm64 (4 GiB)
|
||||||
|
- `rpi2` — worker, arm, very low memory (~512 MiB)
|
||||||
|
- `roger-nucbox-evo-x2` — worker, amd64, 24 GiB (you run here)
|
||||||
|
- **GitOps:** ArgoCD owns every app from `https://git.rogi.casa/roger/k3s-cluster.git`.
|
||||||
|
Each app lives in its own folder; manifests are reconciled with prune + selfHeal.
|
||||||
|
- **Ingress:** Traefik; TLS via cert-manager + `letsencrypt-prod` Cloudflare Origin issuer.
|
||||||
|
- **LLM gateway:** LiteLLM at `https://litellm.rogi.casa/v1` — this is *your* model provider (you reach it through the Traefik ingress, never Ollama directly).
|
||||||
|
- **Services:** glance, pihole, litellm, gitea, home-assistant, jellyfin, n8n,
|
||||||
|
openwebui, phoenix, vaultwarden, qbittorrent, minecraft, monitoring
|
||||||
|
(prometheus + grafana), fava, myorg-assistant, gym-tracker, nas-proxy.
|
||||||
|
- **Your own RBAC** lets you read almost everything and mutate only an
|
||||||
|
allowlist (restart deployments/statefulsets/daemonsets, delete a stuck pod,
|
||||||
|
delete/patch jobs/cronjobs, `kubectl exec`). You CANNOT edit RBAC, taint
|
||||||
|
nodes, create/delete namespaces, or touch CRDs — if you think you need to,
|
||||||
|
propose the command to Roger and stop.
|
||||||
|
|
||||||
|
## Operating rules
|
||||||
|
|
||||||
|
1. **Read first, act second.** Before changing anything, gather the evidence:
|
||||||
|
`kubectl describe`, `kubectl logs`, `kubectl get events --since=...`,
|
||||||
|
`kubectl top`. Cite the exact resource (ns/name) and the exact command in
|
||||||
|
every report.
|
||||||
|
2. **Only safe, idempotent remediations.** Allowed actions:
|
||||||
|
- `kubectl rollout restart deployment/<name> -n <ns>` (and statefulset/daemonset)
|
||||||
|
- delete a single stuck `CrashLoopBackOff`/`ImagePullBackOff` pod so its
|
||||||
|
controller recreates it
|
||||||
|
- `kubectl delete job/<name>` / `kubectl patch cronjob ...`
|
||||||
|
Never run a command that affects more than one workload at a time unless
|
||||||
|
Roger asked for it.
|
||||||
|
3. **When in doubt, notify, don't act.** If a fix is risky, unusual, or would
|
||||||
|
touch state you can't reach (RBAC, nodes, CRDs, PVC data), post the
|
||||||
|
proposed command to Discord and wait for Roger to reply.
|
||||||
|
4. **Be quiet when healthy.** Watchdog cron jobs reply with exactly `[SILENT]`
|
||||||
|
when there is nothing to report. Failed jobs always deliver regardless.
|
||||||
|
5. **No runaway loops.** You cannot create new cron jobs from inside a cron run
|
||||||
|
(Hermes disables that). Do not try.
|
||||||
|
6. **Talk like an engineer.** Short, concrete, with resource names and
|
||||||
|
commands. No filler. When you fixed something, say what you did in one line.
|
||||||
|
7. **Respect GitOps.** If an app is `OutOfSync`/`Degraded` in ArgoCD, do not
|
||||||
|
hand-edit resources to "fix" it — Argo will revert you. Report it so Roger
|
||||||
|
can fix the source repo.
|
||||||
|
|
||||||
|
## How you reach Roger
|
||||||
|
|
||||||
|
Notifications go to Discord (your home channel). Cron jobs deliver there by
|
||||||
|
default (`deliver="discord"`). Keep messages under ~1800 chars; attach
|
||||||
|
longer logs as `kubectl logs ... > /opt/data/cron/output/<file>` and link
|
||||||
|
the path.
|
||||||
|
```
|
||||||
93
platform-engineer/cron-seed.yaml
Normal file
@@ -0,0 +1,93 @@
# One-shot Job that seeds Hermes' built-in cron schedule on first install.
# Idempotent: skips job names that already exist.
#
# The agent's own cron jobs live in /opt/data/cron/jobs.json on the PVC and are
# NOT reconciled by ArgoCD (runtime state). To re-seed after a wipe, delete the
# Job and re-apply this file: kubectl delete job hermes-cron-seed -n platform-engineer
---
apiVersion: batch/v1
kind: Job
metadata:
  name: hermes-cron-seed
  namespace: platform-engineer
  labels:
    app: hermes
spec:
  backoffLimit: 4
  ttlSecondsAfterFinished: 86400
  template:
    metadata:
      labels:
        app: hermes-cron-seed  # distinct from the agent pod so `-l app=hermes` and the Service never match this Job's pod
    spec:
      serviceAccountName: platform-engineer
      restartPolicy: OnFailure
      containers:
        - name: seed
          # alpine is tiny and always available; we install curl + download the
          # right-arch kubectl binary at runtime (bitnami/kubectl tags are
          # inconsistent across versions, so we avoid depending on them).
          image: alpine:3.20
          command: ["sh", "-c"]
          args:
            - |
              set -e

              # Install curl, then download kubectl for this node's architecture.
              apk add --no-cache curl
              ARCH=$(uname -m)
              case "$ARCH" in
                x86_64)  KARCH=amd64 ;;
                aarch64) KARCH=arm64 ;;
                armv7l)  KARCH=arm ;;
                *) echo "unsupported arch: $ARCH" >&2; exit 1 ;;
              esac
              echo "Downloading kubectl for linux/$KARCH ..."
              curl -fsSL -o /usr/local/bin/kubectl \
                "https://dl.k8s.io/release/v1.35.0/bin/linux/${KARCH}/kubectl"
              chmod +x /usr/local/bin/kubectl
              kubectl version --client

              echo "Waiting for hermes pod to be Ready..."
              kubectl -n platform-engineer wait --for=condition=Ready pod -l app=hermes --timeout=300s || true

              POD=$(kubectl -n platform-engineer get pod -l app=hermes -o jsonpath='{.items[0].metadata.name}')
              echo "Using pod: $POD"

              exists() { kubectl -n platform-engineer exec "$POD" -- hermes cron list 2>/dev/null | grep -qi " $1 "; }

              create() {
                name="$1"; schedule="$2"; deliver="$3"; prompt="$4"
                if exists "$name"; then
                  echo "cron job '$name' already exists — skipping"
                else
                  echo "creating cron job '$name' ..."
                  kubectl -n platform-engineer exec "$POD" -- hermes cron create "$schedule" "$prompt" --name "$name" --deliver "$deliver"
                fi
              }

              # ---- Watchdog checks (silent unless something is wrong) ----
              create "cluster-health-check" "every 15m" "discord" \
                "Run: kubectl get nodes; kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded; kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp (consider only events from the last 20 minutes). If everything is healthy and there are no recent Warning events, reply with exactly [SILENT]. Otherwise give a concise per-resource summary of what is wrong (node name, pod ns/name, phase, last event)."

              create "pod-restart-loop" "every 10m" "discord" \
                "Find pods in CrashLoopBackOff or ImagePullBackOff across all namespaces (kubectl get pods -A). For each, fetch kubectl logs (previous) and describe. If the cause is clearly transient (OOM kill, a one-off config parse error that will retry cleanly, a missing Secret the controller will recreate), attempt ONE safe remediation: kubectl rollout restart of the owning Deployment/StatefulSet/DaemonSet, OR delete the single stuck pod. Report what you did in one line per resource. If the cause is not clearly transient (bad image, missing config, auth failure), do NOT act — post the log excerpt and the proposed command and wait for Roger. If no such pods exist, reply [SILENT]."

              create "pvc-pressure" "every 30m" "discord" \
                "Check cluster storage health: kubectl get pv,pvc -A; kubectl top nodes. Alert if any PVC is Pending/Lost or any node filesystem usage is over 85%. If all healthy, reply [SILENT]."

              create "argocd-sync-health" "every 1h" "discord" \
                "Run: kubectl get applications -n argocd -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status. If every app is Synced and Healthy, reply [SILENT]. Otherwise list the OutOfSync/Degraded apps with their status. Do NOT hand-edit resources to fix them (Argo will revert) — just report."

              create "cert-expiry" "0 9 * * *" "discord" \
                "List all cert-manager Certificate resources (kubectl get certificates -A). For each, check notAfter. Alert on any certificate expiring in under 21 days. If none, reply [SILENT]."

              create "node-resource-drift" "every 30m" "discord" \
                "Run kubectl top nodes. If any node CPU or memory usage is over 90%, or any node is NotReady, report it with the numbers. Otherwise reply [SILENT]."

              # ---- Daily report (always delivered) ----
              create "daily-cluster-report" "0 8 * * *" "discord" \
                "Produce a daily cluster report for Roger: (1) node count + Ready/NotReady; (2) top 5 pods by CPU and by memory across all namespaces (kubectl top pods -A --sort-by); (3) count of pods not Running; (4) ArgoCD apps OutOfSync or Degraded; (5) any certificates expiring within 30 days; (6) any recent Warning events (last 24h). Keep it under 1800 chars. Always deliver (no [SILENT])."

              echo "Done. Listing all cron jobs:"
              kubectl -n platform-engineer exec "$POD" -- hermes cron list
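The seed Job stays idempotent because `create` first asks `exists` whether a job name already appears in the listing. A minimal sketch of that create-if-missing check follows. The `list_jobs` stub stands in for `hermes cron list`, whose real output format is not shown in the manifest, so the column layout here is an assumption, and `ensure_job` is a hypothetical wrapper that only reports what it would do:

```shell
#!/bin/sh
# Stub for `hermes cron list`. Assumed layout: the job name is a
# space-delimited column on each line.
list_jobs() {
  printf '%s\n' \
    "1  cluster-health-check  every 15m  discord" \
    "2  pvc-pressure  every 30m  discord"
}

# Same test as the Job's exists(): the name must be surrounded by spaces,
# so a prefix such as "pvc" does not match "pvc-pressure".
exists() { list_jobs | grep -qi " $1 "; }

# Hypothetical wrapper: prints the decision instead of creating anything.
ensure_job() {
  if exists "$1"; then
    echo "skip $1"
  else
    echo "create $1"
  fi
}

ensure_job "pvc-pressure"   # already listed
ensure_job "cert-expiry"    # not listed yet
ensure_job "pvc"            # prefix of an existing name, still treated as new
```

The match is only as reliable as the listing format: if a name is the last token on its line with no trailing space, the pattern would miss it and the job would be created a second time.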
172
platform-engineer/deployment.yaml
Normal file
@@ -0,0 +1,172 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hermes
  namespace: platform-engineer
  labels:
    app: hermes
spec:
  replicas: 1          # MUST be 1 — Hermes' /opt/data is single-writer.
  strategy:
    type: Recreate     # never run two pods against the same PVC
  selector:
    matchLabels:
      app: hermes
  template:
    metadata:
      labels:
        app: hermes
    spec:
      serviceAccountName: platform-engineer
      # No imagePullSecrets — using the public stock Hermes image from Docker Hub.

      # Pin to the powerful amd64 node (image is linux/amd64; the NUC has 24 GiB).
      nodeSelector:
        kubernetes.io/arch: amd64
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: hardware
                    operator: In
                    values: ["high-memory"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: hermes
                topologyKey: kubernetes.io/hostname

      initContainers:
        # Download kubectl + helm into a shared emptyDir so the stock Hermes image
        # (which doesn't ship kubectl) can still drive the cluster. Avoids building
        # and pushing a custom image through a slow / size-capped registry.
        - name: install-tools
          image: curlimages/curl:8.12.1
          command: ["sh", "-c"]
          args:
            - |
              set -e
              echo "Downloading kubectl v1.35.0..."
              curl -fsSL -o /tools/kubectl \
                https://dl.k8s.io/release/v1.35.0/bin/linux/amd64/kubectl
              chmod +x /tools/kubectl
              echo "Downloading helm v3.16.3..."
              curl -fsSL https://get.helm.sh/helm-v3.16.3-linux-amd64.tar.gz \
                | tar -xz -C /tools --strip-components=1 linux-amd64/helm
              chmod +x /tools/helm
              echo "Tools installed:"; ls -la /tools
          volumeMounts:
            - name: tools
              mountPath: /tools

        # Seed /opt/data with config.yaml + SOUL.md on first boot only.
        # ArgoCD owns the manifests; the PVC is runtime state and is NOT reconciled.
        - name: seed-data
          image: busybox:1.36
          command: ["sh", "-c"]
          args:
            - |
              set -e
              if [ ! -f /opt/data/config.yaml ]; then
                echo "First boot: seeding /opt/data from ConfigMap..."
                cp /seed/config.yaml /opt/data/config.yaml
                cp /seed/SOUL.md /opt/data/SOUL.md
                chmod 600 /opt/data/config.yaml
              else
                echo "/opt/data already initialized — leaving runtime state intact."
              fi
              mkdir -p /opt/data/home/.kube /opt/data/cron/output /opt/data/scripts /workspace
          volumeMounts:
            - name: data
              mountPath: /opt/data
            - name: seed
              mountPath: /seed

      containers:
        - name: hermes
          image: nousresearch/hermes-agent:latest
          imagePullPolicy: Always
          # IMPORTANT: do NOT set `command:` — it would override the image's
          # ENTRYPOINT (/init, s6-overlay), which sets up the hermes user, seeds
          # config on first boot, and supervises the gateway. The image's CMD
          # (main-wrapper.sh) already routes `gateway run` through s6.
          args: ["gateway", "run"]
          ports:
            - name: gateway
              containerPort: 8642
            - name: dashboard
              containerPort: 9119
          envFrom:
            - secretRef:
                name: hermes-env
          env:
            # k3s injects KUBERNETES_SERVICE_HOST/PORT + the SA token automatically;
            # kubectl inside the pod authenticates as the platform-engineer SA.
            - name: HERMES_HOME
              value: /opt/data
            # Put the initContainer-installed kubectl/helm on PATH for the hermes user.
            - name: PATH
              value: /opt/hermes/bin:/opt/hermes/.venv/bin:/tools:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
          volumeMounts:
            - name: data
              mountPath: /opt/data
            - name: workspace
              mountPath: /workspace
            - name: tools
              mountPath: /tools
              readOnly: true
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          livenessProbe:
            # Probe the dashboard port (9119, always enabled via HERMES_DASHBOARD=1
            # and binds 0.0.0.0). The gateway API on 8642 is off by default
            # (API_SERVER_ENABLED not set), so 9119 is the reliable liveness signal.
            # s6 auto-restarts the gateway itself; this probe only catches a wedged
            # container.
            tcpSocket:
              port: 9119
            initialDelaySeconds: 90
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 5
          securityContext:
            allowPrivilegeEscalation: false

      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: hermes-data
        - name: workspace
          emptyDir: {}
        - name: tools
          emptyDir: {}
        - name: seed
          configMap:
            name: hermes-seed
---
apiVersion: v1
kind: Service
metadata:
  name: hermes
  namespace: platform-engineer
spec:
  type: ClusterIP
  selector:
    app: hermes
  ports:
    - name: gateway
      port: 80
      targetPort: 8642
    - name: dashboard
      port: 9119
      targetPort: 9119
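The `seed-data` initContainer copies the ConfigMap files only when `/opt/data/config.yaml` is absent, so later boots leave runtime state alone. A minimal sketch of that first-boot guard, run against throwaway directories instead of `/seed` and `/opt/data`; the `seed_once` function and the `demo_*` names are hypothetical:

```shell
#!/bin/sh
set -eu

# seed_once SEED_DIR DATA_DIR: copy config.yaml and SOUL.md only when
# DATA_DIR/config.yaml does not exist yet, mirroring the initContainer logic.
seed_once() {
  seed_dir="$1"; data_dir="$2"
  if [ ! -f "$data_dir/config.yaml" ]; then
    cp "$seed_dir/config.yaml" "$data_dir/config.yaml"
    cp "$seed_dir/SOUL.md" "$data_dir/SOUL.md"
    chmod 600 "$data_dir/config.yaml"
    echo "seeded"
  else
    echo "kept"
  fi
}

# Demo in temporary directories (stand-ins for /seed and /opt/data).
demo_seed=$(mktemp -d)
demo_data=$(mktemp -d)
echo "model: from-configmap" > "$demo_seed/config.yaml"
echo "# soul" > "$demo_seed/SOUL.md"

seed_once "$demo_seed" "$demo_data"                          # first boot: copies files
echo "model: edited-at-runtime" > "$demo_data/config.yaml"   # simulate a runtime edit
seed_once "$demo_seed" "$demo_data"                          # later boot: leaves the edit alone
cat "$demo_data/config.yaml"

rm -rf "$demo_seed" "$demo_data"
```

One consequence of this guard is that later edits to the `hermes-seed` ConfigMap do not reach a PVC that is already initialized; the file on the volume would have to be removed or updated by hand for new seed values to apply.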
31
platform-engineer/dockerfile
Normal file
@@ -0,0 +1,31 @@
# Derived Hermes Agent image with kubectl + helm so the agent can drive the
# k3s cluster from inside the container (terminal backend = local).
#
# Build & push to the Gitea registry (DNS-only hostname; see build-and-push.sh):
#   docker build -t registry.rogi.casa/roger/hermes-agent:v1.35-1 -f dockerfile .
#   docker push registry.rogi.casa/roger/hermes-agent:v1.35-1
#
# This image targets linux/amd64 (the agent pod is pinned to the amd64 NUC).
FROM nousresearch/hermes-agent:latest

USER root

# kubectl (v1.35 to match the cluster's k3s version)
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl gnupg ca-certificates \
    && curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.35/deb/Release.key \
       | gpg --dearmor -o /usr/share/keyrings/kubernetes-apt-keyring.gpg \
    && echo 'deb [signed-by=/usr/share/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.35/deb/ /' \
       > /etc/apt/sources.list.d/kubernetes.list \
    && apt-get update \
    && apt-get install -y --no-install-recommends kubectl \
    # helm
    && curl -fsSL https://get.helm.sh/helm-v3.16.3-linux-amd64.tar.gz \
       | tar -xz -C /usr/local/bin --strip-components=1 linux-amd64/helm \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Hermes' own CLI/kubeconfig helper dir for tool subprocesses
RUN mkdir -p /opt/data/home/.kube

USER hermes
24
platform-engineer/ingress.yaml
Normal file
@@ -0,0 +1,24 @@
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hermes
  namespace: platform-engineer
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - hermes.rogi.casa
      secretName: hermes-tls
  rules:
    - host: hermes.rogi.casa
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hermes
                port:
                  number: 9119   # dashboard
4
platform-engineer/namespace.yaml
Normal file
@@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
  name: platform-engineer
11
platform-engineer/pvc.yaml
Normal file
@@ -0,0 +1,11 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hermes-data
  namespace: platform-engineer
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
111
platform-engineer/rbac.yaml
Normal file
@@ -0,0 +1,111 @@
# Least-privilege RBAC for the Platform Engineer Hermes agent.
#
# The agent can READ almost everything cluster-wide, but can only MUTATE a
# narrow allowlist of safe, idempotent resources (restart deployments, delete a
# stuck pod so its controller recreates it, etc.). It CANNOT touch RBAC, nodes,
# namespaces, CRDs, or other namespaces' Secrets beyond read.
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: platform-engineer
  namespace: platform-engineer
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: platform-engineer
rules:
  # ---- Broad read access (cluster-wide) ----
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
      - pods/log
      - configmaps
      - secrets
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - events
      - replicationcontrollers
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources:
      - deployments
      - statefulsets
      - daemonsets
      - replicasets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - jobs
      - cronjobs
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - apiGroups: ["autoscaling"]
    resources:
      - horizontalpodautoscalers
    verbs: ["get", "list", "watch"]
  - apiGroups: ["argoproj.io"]
    resources:
      - applications
      - appprojects
    verbs: ["get", "list", "watch"]
  - apiGroups: ["cert-manager.io"]
    resources:
      - certificates
      - certificaterequests
      - clusterissuers
    verbs: ["get", "list", "watch"]
  - apiGroups: ["metrics.k8s.io"]
    resources:
      - pods
      - nodes
    verbs: ["get", "list"]

  # ---- Metrics / health endpoints ----
  - nonResourceURLs: ["/metrics", "/metrics/*"]
    verbs: ["get"]

  # ---- Narrow mutate allowlist (idempotent, safe remediation) ----
  # Restart a stuck pod by deleting it (its controller recreates it).
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["delete", "patch"]
  # `kubectl rollout restart` and scaling for the apps/batch controllers.
  - apiGroups: ["apps"]
    resources:
      - deployments
      - statefulsets
      - daemonsets
      - replicasets
    verbs: ["patch", "update"]
  - apiGroups: ["batch"]
    resources:
      - jobs
      - cronjobs
    verbs: ["patch", "update", "delete"]
  # Exec into pods for log-style / debug inspection (granted per request #5).
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-engineer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: platform-engineer
subjects:
  - kind: ServiceAccount
    name: platform-engineer
    namespace: platform-engineer
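One way to confirm that the ClusterRole in rbac.yaml grants what its comments describe is `kubectl auth can-i` with impersonation of the ServiceAccount. The sketch below is a dry run by default and only prints the commands; the `run` and `can_i` helpers and the `DRY_RUN` switch are hypothetical, and the "expected" answers in the comments are read off the rules rather than observed on the cluster:

```shell
#!/bin/sh
SA="system:serviceaccount:platform-engineer:platform-engineer"

# Print the command by default; set DRY_RUN=0 to execute it against the cluster.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# can_i VERB RESOURCE [extra kubectl flags]: ask whether the ServiceAccount
# may perform VERB on RESOURCE in any namespace.
can_i() {
  verb="$1"; resource="$2"; shift 2
  run kubectl auth can-i "$verb" "$resource" --as="$SA" --all-namespaces "$@"
}

# Expected "yes" according to the mutate allowlist.
can_i delete pods
can_i patch deployments
can_i create pods --subresource=exec

# Expected "no": the role only grants read access here, or nothing at all.
can_i create namespaces
can_i update clusterroles
```

Running with `DRY_RUN=0` requires a kubeconfig whose user is allowed to impersonate ServiceAccounts, which a cluster-admin context normally is.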
@@ -2,18 +2,17 @@ apiVersion: networking.k8s.io/v1
 kind: Ingress
 metadata:
   name: qbittorrent-ingress
+  namespace: qbittorrent
   annotations:
     kubernetes.io/ingress.class: "traefik"
     traefik.ingress.kubernetes.io/redirect-entry-point: https
     traefik.ingress.kubernetes.io/compress: "true"
-    cert-manager.io/issuer: prod-issuer
-    cert-manager.io/issuer-kind: OriginIssuer
-    cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
+    cert-manager.io/cluster-issuer: letsencrypt-prod
 spec:
   tls:
     - hosts:
-        - "*.rogi.casa"
-      secretName: rogicasa-tls
+        - qbittorrent.rogi.casa
+      secretName: qbittorrent-tls
   rules:
     - host: qbittorrent.rogi.casa
       http:
@@ -1,7 +1,13 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: qbittorrent
+---
 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: qbittorrent
+  namespace: qbittorrent
   labels:
     app: qbittorrent
 spec:
@@ -48,6 +54,7 @@ apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
   name: qbittorrent-config
+  namespace: qbittorrent
   labels:
     app: qbittorrent
 spec:
@@ -68,7 +75,7 @@ spec:
   accessModes:
     - ReadWriteMany
   nfs:
-    server: 10.88.88.238
+    server: 10.88.30.10
     path: /volume1/jellyfin/media/downloads
   persistentVolumeReclaimPolicy: Retain
 ---
@@ -76,6 +83,7 @@ apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
   name: qbittorrent-downloads
+  namespace: qbittorrent
   labels:
     app: qbittorrent
 spec:
@@ -91,6 +99,7 @@ apiVersion: v1
 kind: Service
 metadata:
   name: qbittorrent
+  namespace: qbittorrent
   labels:
     app: qbittorrent
 spec:
24
vaultwarden/ingress.yaml
Normal file
@@ -0,0 +1,24 @@
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vaultwarden
  namespace: vaultwarden
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - vaultwarden.rogi.casa
      secretName: vaultwarden-tls
  rules:
    - host: vaultwarden.rogi.casa
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vaultwarden
                port:
                  number: 80