monitoring: add dashboard ideas doc
Survey of dashboards that could be built from existing and not-yet-enabled metrics across the cluster's services (traefik, coredns, metallb, cert-manager, phoenix, litellm, gitea, postgres, etc.), with per-service enable steps and a recommended priority order.
354
monitoring/dashboard-ideas.md
Normal file
@@ -0,0 +1,354 @@
# Dashboard Ideas

This file collects ideas for additional Grafana dashboards to build for the
`rogi.casa` k3s cluster. Each idea notes the **data source** (metrics already
available vs. metrics that need to be enabled) and a rough panel layout.

To actually add a dashboard, create a `grafana-dashboard-<name>.yaml` ConfigMap
in this folder, mount it in `grafana-deployment.yaml` (add a volume +
volumeMount under `/var/lib/grafana/dashboards/<name>`), commit and push.

---
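As a sketch of the ConfigMap-plus-mount step described in the introduction, the fragment below shows one possible shape. The `monitoring` namespace, the `traefik` dashboard name, the `grafana` container name, and the placeholder JSON are assumptions for illustration; the real `grafana-deployment.yaml` may be laid out differently.

```yaml
# grafana-dashboard-traefik.yaml (hypothetical file and ConfigMap name)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-traefik
  namespace: monitoring            # assumption: namespace Grafana runs in
data:
  traefik.json: |
    {
      "title": "Traefik Ingress",
      "uid": "traefik-ingress",
      "panels": []
    }
---
# Fragment to merge into grafana-deployment.yaml (not a complete Deployment)
spec:
  template:
    spec:
      containers:
        - name: grafana            # assumption: existing container name
          volumeMounts:
            - name: dashboard-traefik
              mountPath: /var/lib/grafana/dashboards/traefik
      volumes:
        - name: dashboard-traefik
          configMap:
            name: grafana-dashboard-traefik
```

Each dashboard gets its own volume and mount path, so adding one never touches the files of another.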
## Already-scraped services (ready to dashboard now)

These exporters/services are **already being scraped by Prometheus** — dashboards
can be built immediately with no infra changes.

### 1. Traefik (Ingress) — `traefik_*`
Traefik is scraped via the `kubernetes-pods` job (pod annotation on
`traefik-9bcdbbd9-x8zq4` in `kube-system`). It exposes request counters,
entrypoint latency, TLS request counts, and config reloads.

**Panels:**
- Requests/sec by entrypoint (web / websecure / traefik) — `sum by (entrypoint) (rate(traefik_entrypoint_requests_total[5m]))`
- Request latency p50/p95/p99 — `histogram_quantile(0.95, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le, entrypoint))` (swap `0.95` for `0.50` / `0.99`)
- HTTP status code distribution (2xx/3xx/4xx/5xx) — `sum by (code) (rate(traefik_entrypoint_requests_total[5m]))`; the `code` label holds the literal status (e.g. `200`, `503`), so select a class with a regex such as `code=~"5.."`
- TLS requests/sec — `rate(traefik_entrypoint_requests_tls_total[5m])`
- Config reloads + last reload success — `traefik_config_reloads_total`, `traefik_config_last_reload_success`
- Top routes/services by request volume — `topk(10, sum by (service) (rate(traefik_service_requests_total[5m])))`
- Bytes transferred in/out — `rate(traefik_entrypoint_requests_bytes_total[5m])` (in), `rate(traefik_entrypoint_responses_bytes_total[5m])` (out)

**Why useful:** This is your front door. A view of which routes get hit most,
latency per entrypoint, and 5xx spikes is the single most valuable app-level
dashboard in the cluster.

---
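For the Traefik dashboard (#1), the request-rate and 5xx panels can be backed by Prometheus recording rules so the ratio is precomputed. The following is a minimal sketch; the group name and the recorded series names are made up for illustration, and it assumes the `code` label carries literal status codes.

```yaml
groups:
  - name: traefik-dashboard        # hypothetical group name
    rules:
      # Requests per second, per entrypoint
      - record: entrypoint:traefik_requests:rate5m
        expr: sum by (entrypoint) (rate(traefik_entrypoint_requests_total[5m]))
      # Share of requests answered with a 5xx status, per entrypoint (0..1)
      - record: entrypoint:traefik_5xx_ratio:rate5m
        expr: |
          sum by (entrypoint) (rate(traefik_entrypoint_requests_total{code=~"5.."}[5m]))
          /
          sum by (entrypoint) (rate(traefik_entrypoint_requests_total[5m]))
```

A ratio is easier to read on a dashboard than raw 5xx counts, because it stays comparable as traffic volume changes.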
### 2. CoreDNS (cluster DNS) — `coredns_*`
Scraped via the `kube-dns` Service annotation. Exposes query rate, cache hits,
error types, and response duration.

**Panels:**
- DNS queries/sec by zone / type — `sum by (zone, type) (rate(coredns_dns_requests_total[5m]))`
- Cache hit ratio — `sum(rate(coredns_cache_hits_total[5m])) / sum(rate(coredns_cache_requests_total[5m]))` (aggregate both sides, since the two series carry different labels)
- DNS query latency p95 — `histogram_quantile(0.95, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))`
- Queries by response code (NOERROR / NXDOMAIN / SERVFAIL) — `sum by (rcode) (rate(coredns_dns_responses_total[5m]))`
- Cache size — `coredns_cache_entries`
- Forward requests/sec (upstream DNS) — `rate(coredns_forward_requests_total[5m])`

**Why useful:** DNS issues cause cascading failures (ImagePullBackOff, cert
challenges, etc.). A spike in NXDOMAIN/SERVFAIL is an early warning.

---
### 3. MetalLB (LoadBalancer) — `metallb_*`
Scraped via pod annotation on `speaker-*` and `controller` in `metallb-system`.
Exposes IP allocation usage and BGP session state.

**Panels:**
- IP addresses in use vs. total — `metallb_allocator_addresses_in_use_total` vs. `metallb_allocator_addresses_total`
- IP pool utilization % (gauge) — `metallb_allocator_addresses_in_use_total / metallb_allocator_addresses_total * 100`
- BGP session up per speaker (BGP mode only) — `metallb_bgp_session_up`
- Config loaded / stale status — `metallb_k8s_client_config_loaded_bool`, `metallb_k8s_client_config_stale_bool`
- Announcements per speaker (BGP mode only) — `rate(metallb_bgp_announcements_total[5m])`

**Why useful:** If MetalLB runs out of IPs, new LoadBalancer services will
hang in `<pending>`. Knowing pool utilization lets you act before that happens.

---
### 4. cert-manager (TLS certificates) — `certmanager_*`
Scraped via pod annotations on cert-manager pods. Exposes certificate
expiration, renewal, ready status, and ACME challenges.

**Panels:**
- Certificate expiration (days remaining, sorted) — table of `(certmanager_certificate_not_after_timestamp_seconds - time()) / 86400`
- Certificates not Ready — `certmanager_certificate_ready_status{condition!="True"} == 1` (the `condition` label holds `True` / `False` / `Unknown`)
- Upcoming renewals (next 14 days) — `certmanager_certificate_renewal_timestamp_seconds - time() < 14 * 86400`
- ACME challenge status — `certmanager_certificate_challenge_status`
- Failed renewals counter — `rate(certmanager_certificate_renewal_total{condition="Failed"}[1h])`

**Why useful:** A cert about to expire (or silently failing to renew) is the
kind of thing that takes down `*.rogi.casa` HTTPS with no warning. This is a
must-have alert/dashboard.

---
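Because the cert-manager section (#4) calls expiry a must-have alert, here is a minimal sketch of matching Prometheus alerting rules. The group name, alert names, thresholds, and `for` durations are assumptions; older cert-manager releases export the expiry time as `certmanager_certificate_expiration_timestamp_seconds` instead, so check which series your version provides.

```yaml
groups:
  - name: cert-manager-expiry      # hypothetical group name
    rules:
      - alert: CertificateExpiringSoon
        # Days until expiry, using the same expression as the dashboard table
        expr: (certmanager_certificate_not_after_timestamp_seconds - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} expires in under 14 days"
      - alert: CertificateNotReady
        expr: certmanager_certificate_ready_status{condition!="True"} == 1
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Certificate {{ $labels.name }} is not Ready"
```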
### 5. Phoenix (trace store) — `phoenix_*`
Already scraped via the `phoenix` Service annotation. Exposes bulk loader
ingestion rates, span insertion times, retention sweeper, exceptions.

**Panels:**
- Span ingestion rate — `rate(phoenix_bulk_loader_span_insertion_time_seconds_count[5m])`
- Span insertion latency p95 — `histogram_quantile(0.95, sum(rate(phoenix_bulk_loader_span_insertion_time_seconds_bucket[5m])) by (le))`
- Span exceptions/sec — `rate(phoenix_bulk_loader_span_exceptions_total[5m])`
- Retention sweeper last run — `phoenix_retention_sweeper_last_run_seconds`
- Last activity timestamp — `phoenix_bulk_loader_last_activity_timestamp_seconds`

**Why useful:** Phoenix is your observability backend's own backend. Tracking
ingestion health tells you whether traces are landing.

---
## Infrastructure dashboards (compose from existing metrics)

### 6. Storage & PVC Health (KSM + kubelet + node-exporter)
Cross-source dashboard combining `kube_persistentvolumeclaim_*` (KSM),
`kubelet_volume_stats_*` (kubelet), and `node_filesystem_*` (node-exporter).
`local-path` volumes are plain host directories, so kubelet may not report
`kubelet_volume_stats_*` for them; in that case rely on the node filesystem panels.

**Panels:**
- PVC usage % per claim — `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100`
- PVC requested vs. capacity — `kube_persistentvolumeclaim_resource_requests_storage_bytes` vs. `kubelet_volume_stats_capacity_bytes`
- Node disk usage % (all mounts) — `(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100`
- Inode usage % per mount — `(1 - node_filesystem_files_free / node_filesystem_files) * 100`
- Volume binding status (Bound/Pending) — `kube_persistentvolumeclaim_status_phase`
- Top 10 PVCs by usage (table) — `topk(10, kubelet_volume_stats_used_bytes)`

**Why useful:** `local-path` volumes live on the node's own disk, so a growing
PVC fills the node. Catching a PVC at 95% before it errors is a lifesaver.

---
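The "catch it at 95%" idea from the Storage & PVC Health section (#6) could also be expressed as alerting rules. This is a sketch under assumptions: the group and alert names, the 90% node threshold, the excluded filesystem types, and the 15-minute hold are illustrative choices, not values taken from the cluster.

```yaml
groups:
  - name: storage-health           # hypothetical group name
    rules:
      - alert: NodeFilesystemAlmostFull
        # Percentage of each real filesystem that is in use
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} is over 90% full"
      - alert: PVCAlmostFull
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 95
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is over 95% full"
```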
### 7. Workload Health (KSM)
Uses kube-state-metrics to show Deployment/StatefulSet/CronJob health cluster-wide.

**Panels:**
- Deployments with unavailable replicas — `kube_deployment_status_replicas_available < kube_deployment_status_replicas`
- Pods not in Running phase by namespace — `sum by (namespace, phase) (kube_pod_status_phase{phase!="Running"} == 1)` (KSM emits a `0` series for every inactive phase, hence the `== 1`)
- Container restarts (last 1h) — `increase(kube_pod_container_status_restarts_total[1h])`
- Pods stuck in CrashLoopBackOff / ImagePullBackOff — `kube_pod_container_status_waiting_reason{reason=~"CrashLoopBackOff|ImagePullBackOff"}`
- Job failures — `kube_job_failed`
- CronJob schedule heatmap — `kube_cronjob_status_active`
- HPA status (if any autoscaled) — `kube_horizontalpodautoscaler_status_current_replicas` vs. `kube_horizontalpodautoscaler_status_desired_replicas`

**Why useful:** This is the "is anything broken" board. Some pods are already
in `ImagePullBackOff` (myorg-assistant), and this dashboard surfaces that.

---
### 8. etcd / Control Plane Health (if exposed)
k3s runs embedded etcd only when configured for it; a default single-server
install uses SQLite, which has no etcd metrics. With embedded etcd, the
`/metrics` endpoint has to be exposed first (k3s has the `--etcd-expose-metrics`
server flag; upstream etcd uses `--listen-metrics-urls`).
**Requires config change to enable.**

**Panels:**
- Leader changes — `etcd_server_leader_changes_seen_total`
- Proposal commits/sec — `rate(etcd_server_proposals_committed_total[5m])`
- Proposal failures/sec — `rate(etcd_server_proposals_failed_total[5m])`
- DB size — `etcd_mvcc_db_total_size_in_bytes`
- RPC latency p99 — `histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[5m])) by (le))` (this histogram is only exported when etcd runs with `--metrics=extensive`)
- Active watchers — `etcd_debugging_mvcc_watcher_total`

**Why useful:** etcd is the brain of the cluster. Slow commits or a flapping
leader indicate control-plane trouble.

---
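For the etcd dashboard (#8), enabling the metrics endpoint on k3s might look like the sketch below. It only applies when the cluster runs embedded etcd. The job name and target address are placeholders, and port `2381` is the usual etcd metrics port, which should be confirmed on the node before relying on it.

```yaml
# /etc/rancher/k3s/config.yaml on each server node, followed by a k3s restart
etcd-expose-metrics: true
---
# Extra Prometheus scrape job (fragment of the scrape_configs list)
scrape_configs:
  - job_name: k3s-etcd                 # hypothetical job name
    static_configs:
      - targets: ["192.0.2.10:2381"]   # placeholder: server node IP and metrics port
```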
## App-service dashboards (require enabling metrics first)

Most of your apps don't expose `/metrics` yet. Below is the per-service setup
plus the dashboard idea once metrics are on. To enable scraping for any of
these, annotate the Service with:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "<port>"
```

The existing `kubernetes-service-endpoints` scrape job will pick them up
automatically — **no Prometheus config edit needed**.
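To show why the annotations alone are enough, here is a sketch of how a `kubernetes-service-endpoints` job typically honors them. It assumes the cluster's job follows the widely used example configuration; the actual job in this repo may differ in detail.

```yaml
scrape_configs:
  - job_name: kubernetes-service-endpoints
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only endpoints whose Service has prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Optional prometheus.io/path overrides the default /metrics
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # prometheus.io/port replaces the port of the discovered address
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
```

Because discovery is endpoint-based, the scrape goes to the pod IP, so `prometheus.io/port` should name the container port rather than the Service port when the two differ.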
### 9. LiteLLM (LLM gateway) — needs enabling
LiteLLM exposes Prometheus metrics on its API port (`/metrics`). Depending on
the LiteLLM version, the exporter first has to be switched on through the
`prometheus` callback in the proxy config. Then annotate the `litellm` Service.
Metric names differ between LiteLLM releases, so confirm the ones below against
the live `/metrics` output.

**Panels:**
- Requests/sec by model — `sum by (model) (rate(litellm_requests_total[5m]))`
- Token usage (prompt/completion/total) — `rate(litellm_total_tokens_total[5m])`
- Spend by model — `litellm_spend_total` (if cost tracking enabled)
- Latency p95 per model — `histogram_quantile(0.95, ...)`
- Error rate by model — `rate(litellm_requests_total{status=~"5.."}[5m])`
- Rate-limit / quota hits

**Why useful:** LiteLLM is the gateway for all your AI apps (open-webui,
myorg-assistant, etc.). Token spend and per-model latency are the best
cost/quality levers in the cluster.

---
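For LiteLLM (#9), turning the exporter on might look like the following sketch. It assumes a LiteLLM version where the `prometheus` callback is available, and that the proxy listens on port `4000`; both should be checked against the deployed version and manifest.

```yaml
# LiteLLM proxy config.yaml (fragment)
litellm_settings:
  callbacks: ["prometheus"]
---
# Annotations to add to the existing litellm Service (fragment)
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "4000"     # assumption: LiteLLM's listening port
```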
### 10. Gitea (git + CI) — needs enabling
Gitea exposes metrics at `/metrics` once `ENABLED = true` is set in the
`[metrics]` section of `app.ini`. Annotate the `gitea-http` Service (container
port 3000, exposed as port 80 on the Service). Confirm the metric names below
against the live `/metrics` output, since Gitea's built-in collector mostly
reports object counts (users, repositories, issues) rather than request rates.

**Panels:**
- Git push/clone/fetch rate — `gitea_actions_total` by `action`
- Active users / repos / orgs — `gitea_users_total`, `gitea_repos_total`
- Issues / PRs open — `gitea_issues_total`, `gitea_pulls_total`
- HTTP request rate + latency
- Gitea Actions runner job duration — if runner metrics exposed

**Why useful:** Gitea hosts the cluster's own GitOps repo + CI. Tracking push
rate and runner throughput catches CI storms.

---
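For Gitea (#10), the `app.ini` setting can also be supplied as an environment variable, which the official container image maps to `[metrics] ENABLED`. The sketch below assumes that image is in use and that the container listens on port `3000`.

```yaml
# Fragment for the Gitea container spec
env:
  - name: GITEA__metrics__ENABLED  # maps to ENABLED in the [metrics] section of app.ini
    value: "true"
---
# Annotations to add to the gitea-http Service (fragment)
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3000"     # container port, not the Service's port 80
```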
### 11. Home Assistant — needs enabling
HA exposes Prometheus metrics via the `prometheus` integration (add
`prometheus:` to `configuration.yaml`). The endpoint is `/api/prometheus` and
requires a long-lived access token, so the Service annotation alone is not
enough: the scrape also needs the custom path and a bearer token.

**Panels:**
- Active entities / sensors by domain
- State change events/sec — `homeassistant_entity_states_total`
- Automation triggers/sec — `homeassistant_automation_triggered_total`
- Integrations loaded + errors
- Database size / recorder queue depth
- Zigbee/Z-Wave mesh health (if exposed)

**Why useful:** HA is a home-critical service. Event/sec spikes often indicate
sensor flapping or runaway automations.

---
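For Home Assistant (#11), the metrics endpoint lives at `/api/prometheus` and expects a long-lived access token, so a dedicated scrape job is one way to wire it up. In this sketch the job name, the target address, and the token file path are hypothetical.

```yaml
# Home Assistant configuration.yaml (fragment): enables the integration
prometheus:
---
# Dedicated Prometheus scrape job (fragment of the scrape_configs list)
scrape_configs:
  - job_name: home-assistant           # hypothetical job name
    metrics_path: /api/prometheus
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/ha-token   # hypothetical path to the token
    static_configs:
      - targets: ["home-assistant.home-assistant.svc:8123"] # hypothetical Service address
```

Keeping the token in a mounted file rather than inline keeps it out of the Git repo.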
### 12. Jellyfin — limited
Jellyfin doesn't ship first-class Prometheus metrics, but you can scrape it
via a sidecar (`jellyfin-prometheus-exporter`) or build a blackbox-style
dashboard on the `/health` endpoint.

**Panels:**
- Active streams — from exporter
- Transcode sessions + hw accel usage
- Library size by media type
- Playback errors

---
### 13. Pi-hole — needs enabling
Pi-hole exposes its statistics through the FTL web API; the `pihole-exporter`
sidecar converts them to Prometheus format. Add it as a sidecar container and
annotate the Service.

**Panels:**
- DNS queries/sec (total, blocked, cached, forwarded)
- Block list size
- Top blocked domains
- Top permitted domains
- Clients by query volume
- Cache hit ratio

**Why useful:** Pi-hole is your network-wide ad blocker. Block rate and cache
ratio are the headline metrics, and query spikes reveal misbehaving clients.

---
### 14. PostgreSQL (litellm + phoenix + n8n) — needs enabling
You have two Postgres instances (`postgres` in `litellm` and `phoenix`).
Add `postgres_exporter` (Helm chart: `prometheus-postgres-exporter`) as a
sidecar or a Deployment per DB.

**Panels (per DB):**
- Connections (active / idle / max) — `pg_stat_activity_count`
- Transactions/sec — `rate(pg_stat_database_xact_commit[5m])`
- Cache hit ratio — `sum(rate(pg_stat_database_blks_hit[5m])) / (sum(rate(pg_stat_database_blks_hit[5m])) + sum(rate(pg_stat_database_blks_read[5m])))`
- Table + index bloat
- Replication lag (if replicas)
- Slow queries (if `pg_stat_statements` enabled)
- DB size growth — `pg_database_size_bytes`

**Why useful:** DB connection exhaustion and cache ratio collapse are the two
most common causes of slow app performance.

---
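For PostgreSQL (#14), the sidecar variant could look like this sketch. The Secret name and key are hypothetical, the image should be pinned to a specific tag, and `9187` is the exporter's default listening port.

```yaml
# Extra container for the existing postgres pod spec (fragment)
containers:
  - name: postgres-exporter
    image: quay.io/prometheuscommunity/postgres-exporter   # pin a tag in practice
    env:
      - name: DATA_SOURCE_NAME         # connection string read by the exporter
        valueFrom:
          secretKeyRef:
            name: postgres-exporter-dsn   # hypothetical Secret
            key: dsn
    ports:
      - name: metrics
        containerPort: 9187
---
# Annotations to add to the postgres Service (fragment)
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9187"
```

A sidecar shares the pod's network namespace, so the connection string can point at `localhost` and no extra Service is needed.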
### 15. Minecraft — limited
The Minecraft server exposes metrics via RCON + an exporter
(`minecraft-exporter`). Add as sidecar using the existing `RCON_PASSWORD`.

**Panels:**
- Players online — `minecraft_players_online`
- TPS (ticks per second) — `minecraft_tps` (server health)
- Entities loaded — `minecraft_entities_total`
- Chunk count — `minecraft_chunks_loaded`
- Memory used by JVM

**Why useful:** TPS < 20 means lag. Player count vs. server load is the only
real signal a Minecraft server needs.

---
### 16. qBittorrent — limited
No native metrics. Options: a `qbittorrent-exporter` sidecar (uses the WebUI
API), or a blackbox probe on the WebUI.

**Panels:**
- Download/upload speed
- Active torrents
- Torrent count by state (downloading/seeding/paused)
- Disk usage in download dir

---
## Cluster meta dashboards

### 17. Network Topology / Service Map
Composite view: for each namespace, list services, their pods, scrape status,
and request volume (from Traefik logs + cAdvisor network). A "what talks to
what" overview.

**Panels:**
- Service → pod → container resource table
- Cross-namespace network flows (if network policy logging enabled)
- Scrape health matrix (every target up/down)
- Ingress route → backend service map

---
### 18. Backup / Snapshot Status
If you take Velero snapshots or local-path snapshots, build a dashboard on
`velero_*` or CRD status. **Requires Velero.**

**Panels:**
- Last successful backup per namespace
- Failed backups
- Backup size growth
- Restore test status

---
### 19. Cost / Capacity Planning
Composite: per-namespace CPU/memory requests vs. actual usage, projected
growth, node saturation forecast.

**Panels:**
- Requests vs. limits vs. actual (per namespace) — KSM + cAdvisor
- Node capacity vs. allocatable
- PVC growth trend + 30-day forecast
- "What if I removed node X" simulation (capacity headroom)

**Why useful:** Tells you when you'll need another node before you hit the wall.

---
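The forecast and requests-vs-actual panels from the Cost / Capacity Planning idea (#19) can be sketched as recording rules. The group and series names are invented for illustration, the 7-day lookback assumes Prometheus retains at least that much data, and the CPU ratio assumes kube-state-metrics v2 label names.

```yaml
groups:
  - name: capacity-forecast        # hypothetical group name
    rules:
      # Linear projection of PVC usage 30 days ahead, based on the last 7 days
      - record: pvc:kubelet_volume_stats_used_bytes:predict_30d
        expr: predict_linear(kubelet_volume_stats_used_bytes[7d], 30 * 86400)
      # Actual CPU usage as a fraction of requested CPU, per namespace
      - record: namespace:cpu_usage_vs_requests:ratio
        expr: |
          sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
          /
          sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
```

Comparing the 30-day projection with `kubelet_volume_stats_capacity_bytes` shows which claims are on track to fill up.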
## Recommended priority order

If you only build a few, do them in this order (highest value-to-effort first):

1. **Traefik Ingress** (#1) — already scraped, your front door
2. **Storage & PVC Health** (#6) — local-path fills disks; high blast radius
3. **Workload Health** (#7) — surfaces CrashLoopBackOff / ImagePullBackOff
4. **cert-manager** (#4) — prevents silent cert expiry outages
5. **CoreDNS** (#2) — early warning for DNS cascades
6. **LiteLLM** (#9) — only needs its metrics endpoint enabled and the `prometheus.io/scrape` annotation; big insights
7. **MetalLB** (#3) — small but catches LoadBalancer IP exhaustion

The remaining ideas (#5, #8, and #10–#19) are nice-to-have or require additional exporters/config.