Roger Oriol d8012dfb6c monitoring: add dashboard ideas doc

Survey of dashboards that could be built from existing and not-yet-enabled
metrics across the cluster's services (traefik, coredns, metallb, cert-manager,
phoenix, litellm, gitea, postgres, etc.), with per-service enable steps and
a recommended priority order.

2026-06-26 20:22:54 +02:00

Dashboard Ideas

This file collects ideas for additional Grafana dashboards to build for the rogi.casa k3s cluster. Each idea notes the data source (metrics already available vs. metrics that need to be enabled) and a rough panel layout.

To add a dashboard, create a grafana-dashboard-<name>.yaml ConfigMap in this folder, mount it in grafana-deployment.yaml (add a volume and a volumeMount at /var/lib/grafana/dashboards/<name>), then commit and push.
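
As a rough sketch of those two steps, the manifests might look like the following. The dashboard name `traefik`, the `monitoring` namespace, the container name `grafana`, and the JSON body are placeholders rather than values taken from this repo. It also assumes Grafana's dashboard provisioning is already configured to read from /var/lib/grafana/dashboards.

```yaml
# grafana-dashboard-traefik.yaml (hypothetical name following the pattern above)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-traefik
  namespace: monitoring            # assumption: the namespace Grafana runs in
data:
  # Dashboard JSON exported from Grafana; this stub is only a placeholder.
  traefik.json: |
    {
      "title": "Traefik Ingress",
      "uid": "traefik-ingress",
      "panels": []
    }
---
# Fragment to merge into grafana-deployment.yaml (not a complete Deployment).
spec:
  template:
    spec:
      containers:
        - name: grafana            # assumption: existing container name
          volumeMounts:
            - name: dashboard-traefik
              mountPath: /var/lib/grafana/dashboards/traefik
              readOnly: true
      volumes:
        - name: dashboard-traefik
          configMap:
            name: grafana-dashboard-traefik
```

Keeping one ConfigMap per dashboard keeps each file small and makes a broken dashboard easy to revert on its own.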

Already-scraped services (ready to dashboard now)

These exporters/services are already being scraped by Prometheus — dashboards can be built immediately with no infra changes.

1. Traefik (Ingress) — `traefik_*`

Traefik is scraped via the kubernetes-pods job (pod annotation on traefik-9bcdbbd9-x8zq4 in kube-system). It exposes request counters, entry point latency, TLS handshakes, config reloads.

Panels:

Requests/sec by entrypoint (web / websecure / traefik) — rate(traefik_entrypoint_requests_total[5m])
Request latency p50/p95/p99 — histogram_quantile(0.95, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le, entrypoint))
HTTP status code distribution (2xx/3xx/4xx/5xx) — sum by (code) (rate(traefik_entrypoint_requests_total{code=~"[2-5].."}[5m])) (the code label holds the full status, e.g. 200, so group into classes in the panel)
TLS handshakes/sec — rate(traefik_entrypoint_requests_tls_total[5m])
Config reloads + last reload success — traefik_config_reloads_total, traefik_config_last_reload_success
Top routes/services by request volume — topk(10, sum by (service) (rate(traefik_service_requests_total[5m])))
Bytes transferred in/out — rate(traefik_entrypoint_requests_bytes_total[5m]) (in), rate(traefik_entrypoint_responses_bytes_total[5m]) (out)

Why useful: This is your front door. Knowing which routes get hit most, latency per ingress, and 5xx spikes is the single most valuable app-level dashboard in the cluster.

2. CoreDNS (cluster DNS) — `coredns_*`

Scraped via kube-dns Service annotation. Exposes query rate, cache hits, error types, response duration.

Panels:

DNS queries/sec by zone / type — rate(coredns_dns_requests_total[5m])
Cache hit ratio — sum(rate(coredns_cache_hits_total[5m])) / sum(rate(coredns_cache_requests_total[5m]))
DNS query latency p95 — histogram_quantile(0.95, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))
Queries by response code (NOERROR / NXDOMAIN / SERVFAIL) — rate(coredns_dns_responses_total[5m])
Cache size — coredns_cache_entries
Forward requests/sec (upstream DNS) — rate(coredns_forward_requests_total[5m])

Why useful: DNS issues cause cascading failures (ImagePullBackOff, cert challenges, etc.). A spike in NXDOMAIN/SERVFAIL is an early warning.

3. MetalLB (LoadBalancer) — `metallb_*`

Scraped via pod annotations on the speaker-* and controller pods in metallb-system. Exposes IP allocation usage and BGP session state.

Panels:

IP addresses in use vs. total — metallb_allocator_addresses_in_use_total and metallb_allocator_addresses_total, plotted as two series per pool
IP pool utilization % (gauge) — metallb_allocator_addresses_in_use_total / metallb_allocator_addresses_total * 100
BGP session up per speaker — metallb_bgp_session_up
Config loaded / stale status — metallb_k8s_client_config_loaded_bool, metallb_k8s_client_config_stale_bool
Announcements per speaker — rate(metallb_bgp_announcements_total[5m])

Why useful: If MetalLB runs out of IPs, new LoadBalancer services will hang in <pending>. Knowing pool utilization lets you act before that happens.

4. cert-manager (TLS certificates) — `certmanager_*`

Scraped via pod annotations on cert-manager pods. Exposes certificate expiration, renewal, ready status, ACME challenges.

Panels:

Certificate expiration (days remaining, sorted) — table of (certmanager_certificate_not_after_timestamp_seconds - time()) / 86400
Certificates not Ready — certmanager_certificate_ready_status{condition!="True"} == 1 (the condition label carries True/False/Unknown)
Upcoming renewals (next 14 days) — certmanager_certificate_renewal_timestamp_seconds
ACME challenge status — certmanager_certificate_challenge_status
Failed renewals counter — rate(certmanager_certificate_renewal_total{condition="Failed"}[1h])

Why useful: A cert about to expire (or silently failing to renew) is the kind of thing that takes down *.rogi.casa HTTPS with no warning. This is a must-have alert/dashboard.
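
Since this section is flagged as a must-have alert as well as a dashboard, a minimal Prometheus rule file could look like the sketch below. The group name, alert names, thresholds, and `for` durations are illustrative choices. It uses `certmanager_certificate_expiration_timestamp_seconds`, which differs from the metric in the panel list above, so check which one your cert-manager version actually exposes. How rule files get loaded into this cluster's Prometheus is not covered here.

```yaml
groups:
  - name: cert-manager               # hypothetical group name
    rules:
      - alert: CertificateExpiringSoon
        # Days remaining until expiry, computed the same way as the table panel.
        expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expires in under 14 days"
      - alert: CertificateNotReady
        # A series with condition False or Unknown set to 1 means the cert is not Ready.
        expr: certmanager_certificate_ready_status{condition!="True"} == 1
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} is not Ready"
```

cert-manager normally renews well before the 14-day mark, so the first alert firing usually means renewal has been failing for a while.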

5. Phoenix (trace store) — `phoenix_*`

Already scraped via the phoenix Service annotation. Exposes bulk loader ingestion rates, span insertion times, retention sweeper, exceptions.

Panels:

Span ingestion rate — rate(phoenix_bulk_loader_span_insertion_time_seconds_count[5m])
Span insertion latency p95 — histogram_quantile(0.95, sum(rate(phoenix_bulk_loader_span_insertion_time_seconds_bucket[5m])) by (le))
Span exceptions/sec — rate(phoenix_bulk_loader_span_exceptions_total[5m])
Retention sweeper last run — phoenix_retention_sweeper_last_run_seconds
Last activity timestamp — phoenix_bulk_loader_last_activity_timestamp_seconds

Why useful: Phoenix is your observability backend's own backend. Tracking ingestion health tells you whether traces are landing.

Infrastructure dashboards (compose from existing metrics)

6. Storage & PVC Health (KSM + kubelet + node-exporter)

Cross-source dashboard combining kube_persistentvolumeclaim_* (KSM), kubelet_volume_stats_* (kubelet), and node_filesystem_* (node-exporter).

Panels:

PVC usage % per claim — kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100
PVC requested vs. capacity — kube_persistentvolumeclaim_resource_requests_storage_bytes vs actual
Node disk usage % (all mounts) — (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
Inode usage % per mount — (1 - node_filesystem_files_free / node_filesystem_files) * 100
Volume binding status (Bound/Pending) — kube_persistentvolumeclaim_status_phase
Top 10 PVCs by usage (table)

Why useful: The local-path provisioner fills up node disks. Catching a PVC at 95% before it errors is a lifesaver.
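
To catch a filling volume before it errors, the two usage expressions above can be turned into alert rules. This is a sketch under assumptions: the 90% threshold, the `for` durations, and the excluded filesystem types are arbitrary starting points, and with local-path volumes the kubelet volume stats may reflect the underlying node filesystem rather than a per-claim quota.

```yaml
groups:
  - name: storage                    # hypothetical group name
    rules:
      - alert: PVCAlmostFull
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is over 90% full"
      - alert: NodeDiskAlmostFull
        # tmpfs and overlay mounts are excluded because they rarely reflect real disk pressure.
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is over 90% full"
```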

7. Workload Health (KSM)

Uses kube-state-metrics to show deployment/StatefulSet/CronJob health cluster-wide.

Panels:

Deployments with unavailable replicas — kube_deployment_status_replicas_available < kube_deployment_status_replicas
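Pods not in Running phase by namespace — sum by (namespace, phase) (kube_pod_status_phase{phase!~"Running|Succeeded"} == 1) (the metric emits a 0/1 series per phase; Succeeded is excluded so completed Jobs don't show up as problems)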
Container restarts (last 1h) — increase(kube_pod_container_status_restarts_total[1h])
Pods stuck in CrashLoopBackOff / ImagePullBackOff — kube_pod_container_status_waiting_reason{reason=~"CrashLoopBackOff|ImagePullBackOff"}
Job failures — kube_job_failed
CronJob schedule heatmap — kube_cronjob_status_active
HPA status (if any autoscaled) — kube_horizontalpodautoscaler_status_current_replicas vs desired

Why useful: This is the "is anything broken" board. Some pods are already in ImagePullBackOff (myorg-assistant), and this dashboard would surface that.

8. etcd / Control Plane Health (if exposed)

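k3s uses embedded etcd on multi-server clusters and SQLite (via kine) on a default single-server install, in which case there are no etcd metrics to scrape. With embedded etcd, the /metrics endpoint has to be exposed first (in k3s, the --etcd-expose-metrics server flag; on upstream etcd, --listen-metrics-urls). Requires a config change to enable.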
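
A sketch of what that config change could look like on k3s with embedded etcd. The metrics port 2381 and the static scrape job are assumptions to verify against your k3s version, and `<server-node-ip>` is a placeholder.

```yaml
# /etc/rancher/k3s/config.yaml on each server node (embedded etcd only).
# Restart the k3s service after changing it.
etcd-expose-metrics: true
---
# Assumed extra scrape job for Prometheus; annotation-based discovery will not
# find etcd, since it is not a pod or Service.
- job_name: k3s-etcd
  static_configs:
    - targets: ["<server-node-ip>:2381"]
```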

Panels:

Leader changes — etcd_server_leader_changes_seen_total
Proposal commits/sec — rate(etcd_server_proposals_committed_total[5m])
Proposal failures/sec — rate(etcd_server_proposals_failed_total[5m])
DB size — etcd_mvcc_db_total_size_in_bytes
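RPC latency p99 — histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[5m])) by (le)) (gRPC latency histograms are typically only exported when etcd runs with --metrics=extensive)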
Active watchers — etcd_debugging_mvcc_watcher_total

Why useful: etcd is the brain of the cluster. Slow commits or a flipping leader indicates control-plane trouble.

App-service dashboards (require enabling metrics first)

Most of your apps don't expose /metrics yet. Below is the per-service setup plus the dashboard idea once metrics are on. To enable scraping for any of these, annotate the Service with:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "<port>"

The existing kubernetes-service-endpoints scrape job will pick them up automatically — no Prometheus config edit needed.
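
For reference, annotation-driven discovery of this kind is usually implemented with relabeling rules like the ones below. This is the widely used pattern from the Prometheus Kubernetes examples, shown as a sketch. The job actually deployed in this cluster may differ in detail, so treat it as an illustration of why the annotations work rather than a copy of the live config.

```yaml
- job_name: kubernetes-service-endpoints
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    # Keep only endpoints whose Service has prometheus.io/scrape: "true".
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # Optional prometheus.io/path annotation overrides the default /metrics.
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # Rewrite the target address to use the port from prometheus.io/port.
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
```

If a service serves metrics on a non-default path, adding a `prometheus.io/path` annotation next to the two shown above is enough under this pattern.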

9. LiteLLM (LLM gateway) — needs enabling

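LiteLLM exposes Prometheus metrics on its API port (/metrics). Annotate the litellm Service. Depending on the LiteLLM version, the prometheus callback may also need to be enabled in the proxy config before /metrics returns data.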

Panels:

Requests/sec by model — sum by (model) (rate(litellm_requests_total[5m]))
Token usage (prompt/completion/total) — rate(litellm_total_tokens_total[5m])
Spend by model — litellm_spend_total (if cost tracking enabled)
Latency p95 per model — histogram_quantile(0.95, ...)
Error rate by model — sum by (model) (rate(litellm_requests_total{status=~"5.."}[5m]))
Rate-limit / quota hits

Why useful: LiteLLM is the gateway for all your AI apps (open-webui, myorg-assistant, etc.). Token spend + per-model latency is the single best cost/quality lever in the cluster.

10. Gitea (git + CI) — needs enabling

Gitea exposes metrics at /metrics when ENABLED = true is set in the [metrics] section of app.ini. Annotate the gitea-http Service (port 3000 inside the pod, 80 via the Service).

Panels:

Git push/clone/fetch rate — gitea_actions_total by action
Active users / repos / orgs — gitea_users_total, gitea_repos_total
Issues / PRs open — gitea_issues_total, gitea_pulls_total
HTTP request rate + latency
Gitea Actions runner job duration — if runner metrics exposed

Why useful: Gitea hosts the cluster's own GitOps repo + CI. Tracking push rate and runner throughput catches CI storms.

11. Home Assistant — needs enabling

HA exposes Prometheus metrics via the prometheus integration (add to configuration.yaml). Then annotate the Service.

Panels:

Active entities / sensors by domain
State change events/sec — homeassistant_entity_states_total
Automation triggers/sec — homeassistant_automation_triggered_total
Integrations loaded + errors
Database size / recorder queue depth
Zigbee/Z-Wave mesh health (if exposed)

Why useful: HA is a home-critical service. Event/sec spikes often indicate sensor flapping or runaway automations.
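
A minimal sketch of the Home Assistant side, assuming the default port 8123 and a hypothetical Service named `home-assistant`. HA serves metrics at /api/prometheus and requires an access token by default. Since annotation-based scraping cannot supply one, this sketch turns authentication off for that endpoint, which is only reasonable if the endpoint is not reachable from outside the cluster.

```yaml
# configuration.yaml (Home Assistant)
prometheus:
  namespace: hass          # optional prefix for exported metric names
  requires_auth: false     # assumption: endpoint is reachable from inside the cluster only
---
# Service annotations; the name and port are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: home-assistant
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8123"
    prometheus.io/path: "/api/prometheus"
spec:
  selector:
    app: home-assistant
  ports:
    - port: 8123
      targetPort: 8123
```

If disabling auth is not acceptable, a dedicated scrape job with a bearer token is the alternative, at the cost of a Prometheus config edit.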

12. Jellyfin — limited

Jellyfin doesn't ship first-class Prometheus metrics, but you can scrape it via a sidecar (jellyfin-prometheus-exporter) or build a blackbox-style dashboard on the /health endpoint.

Panels:

Active streams — from exporter
Transcode sessions + hw accel usage
Library size by media type
Playback errors

13. Pi-hole — needs enabling

Pi-hole exposes metrics on its FTL web API; the pihole-exporter sidecar converts them to Prometheus format. Add as a sidecar container + annotate.

Panels:

DNS queries/sec (total, blocked, cached, forwarded)
Block list size
Top blocked domains
Top permitted domains
Clients by query volume
Cache hit ratio

Why useful: Pi-hole is your network-wide adblock. Block rate + cache ratio are the headline metrics, and query spikes reveal misbehaving clients.

14. PostgreSQL (litellm + phoenix + n8n) — needs enabling

You have two Postgres instances (postgres in litellm and phoenix). Add prometheus-postgres-exporter as a sidecar or Deployment per DB.

Panels (per DB):

Connections (active / idle / max) — pg_stat_activity_count
Transactions/sec — rate(pg_stat_database_xact_commit[5m])
Cache hit ratio — pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read)
Table + index bloat
Replication lag (if replicas)
Slow queries (if pg_stat_statements enabled)
DB size growth — pg_database_size_bytes

Why useful: DB connection exhaustion and cache ratio collapse are the two most common causes of slow app performance.
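
The sidecar option could be sketched as below. The Secret name and key are hypothetical, the image should be pinned to a specific tag, and 9187 is the exporter's default listen port. Because the exporter runs inside the Postgres pod, it would be picked up through pod annotations (the kubernetes-pods job) rather than Service annotations.

```yaml
# Fragment to merge into the pod template of a Postgres workload.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9187"
spec:
  containers:
    - name: postgres-exporter
      image: quay.io/prometheuscommunity/postgres-exporter   # pin a tag in practice
      ports:
        - name: metrics
          containerPort: 9187
      env:
        # Connection string, e.g. postgresql://user:pass@localhost:5432/postgres?sslmode=disable
        - name: DATA_SOURCE_NAME
          valueFrom:
            secretKeyRef:
              name: postgres-exporter-dsn    # hypothetical Secret
              key: dsn
```

A read-only database role with access to the pg_stat views is generally enough for the exporter, so it does not need the application's credentials.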

15. Minecraft — limited

The Minecraft server exposes metrics via RCON + an exporter (minecraft-exporter). Add as sidecar using the existing RCON_PASSWORD.

Panels:

Players online — minecraft_players_online
TPS (ticks per second) — minecraft_tps (server health)
Entities loaded — minecraft_entities_total
Chunk count — minecraft_chunks_loaded
Memory used by JVM

Why useful: TPS < 20 means lag. Player count vs. server load is the only real signal a Minecraft server needs.

16. qBittorrent — limited

No native metrics. Options: a qbittorrent-exporter sidecar (uses the WebUI API), or a blackbox probe on the WebUI.

Panels:

Download/upload speed
Active torrents
Torrent count by state (downloading/seeding/paused)
Disk usage in download dir

Cluster meta dashboards

17. Network Topology / Service Map

Composite view: for each namespace, list services, their pods, scrape status, and request volume (from Traefik logs + cAdvisor network). A "what talks to what" overview.

Panels:

Service → pod → container resource table
Cross-namespace network flows (if network policy logging enabled)
Scrape health matrix (every target up/down)
Ingress route → backend service map

18. Backup / Snapshot Status

If you take Velero snapshots or local-path snapshots, build a dashboard on velero_* or CRD status. Requires Velero.

Panels:

Last successful backup per namespace
Failed backups
Backup size growth
Restore test status

19. Cost / Capacity Planning

Composite: per-namespace CPU/memory requests vs. actual usage, projected growth, node saturation forecast.

Panels:

Requests vs. limits vs. actual (per namespace) — KSM + cAdvisor
Node capacity vs. allocatable
PVC growth trend + 30-day forecast
"What if I removed node X" simulation (capacity headroom)

Why useful: Tells you when you'll need another node before you hit the wall.

Recommended priority order

If you only build a few, do them in this order (highest value-to-effort first):

Traefik Ingress (#1) — already scraped, your front door
Storage & PVC Health (#6) — local-path fills disks; high blast radius
Workload Health (#7) — surfaces CrashLoopBackOff / ImagePullBackOff
cert-manager (#4) — prevents silent cert expiry outages
CoreDNS (#2) — early warning for DNS cascades
LiteLLM (#9) — needs prometheus.io/scrape annotation only; big insights
MetalLB (#3) — small but catches LoadBalancer IP exhaustion

Everything else (#5, #8, and #10–19) is nice-to-have or requires additional exporters/config.
