Survey of dashboards that could be built from existing and not-yet-enabled metrics across the cluster's services (traefik, coredns, metallb, cert-manager, phoenix, litellm, gitea, postgres, etc.), with per-service enable steps and a recommended priority order.
Dashboard Ideas
This file collects ideas for additional Grafana dashboards to build for the
rogi.casa k3s cluster. Each idea notes the data source (metrics already
available vs. metrics that need to be enabled) and a rough panel layout.
To actually add a dashboard, create a grafana-dashboard-<name>.yaml ConfigMap
in this folder, mount it in grafana-deployment.yaml (add a volume +
volumeMount under /var/lib/grafana/dashboards/<name>), commit and push.
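As a rough sketch of those steps, the manifests below pair a dashboard ConfigMap with the matching mount. The dashboard name traefik, the monitoring namespace, the grafana container name, and the stub dashboard JSON are placeholders, not values taken from this repo.

```yaml
# grafana-dashboard-traefik.yaml (hypothetical example name)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-traefik
  namespace: monitoring          # assumption: use Grafana's actual namespace
data:
  traefik.json: |
    {
      "title": "Traefik Ingress",
      "uid": "traefik-ingress",
      "panels": []
    }
---
# Fragment to merge into grafana-deployment.yaml (not a complete Deployment)
spec:
  template:
    spec:
      containers:
        - name: grafana          # assumption: existing container name
          volumeMounts:
            - name: dashboard-traefik
              mountPath: /var/lib/grafana/dashboards/traefik
      volumes:
        - name: dashboard-traefik
          configMap:
            name: grafana-dashboard-traefik
```

Exporting the dashboard JSON from the Grafana UI and pasting it under traefik.json is usually easier than writing panels by hand.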
Already-scraped services (ready to dashboard now)
These exporters/services are already being scraped by Prometheus — dashboards can be built immediately with no infra changes.
1. Traefik (Ingress) — traefik_*
Traefik is scraped via the kubernetes-pods job (pod annotation on
traefik-9bcdbbd9-x8zq4 in kube-system). It exposes request counters, entry
point latency, TLS handshakes, config reloads.
Panels:
- Requests/sec by entrypoint (web / websecure / traefik): rate(traefik_entrypoint_requests_total[5m])
- Request latency p50/p95/p99: histogram_quantile(0.95, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le, entrypoint)), with 0.50 and 0.99 for the other quantiles
- HTTP status code distribution (2xx/3xx/4xx/5xx): sum by (code) (rate(traefik_entrypoint_requests_total[5m])); the code label holds the numeric status (e.g. 200, 503), so select a class with a regex such as code=~"5.."
- TLS handshakes/sec: rate(traefik_entrypoint_requests_tls_total[5m])
- Config reloads + last reload success: traefik_config_reloads_total, traefik_config_last_reload_success
- Top routes/services by request volume: topk(10, sum by (service) (rate(traefik_service_requests_total[5m])))
- Bytes transferred in/out: rate(traefik_entrypoint_requests_bytes_total[5m]) for inbound, rate(traefik_entrypoint_responses_bytes_total[5m]) for outbound
Why useful: This is your front door. Knowing which routes get hit most, latency per ingress, and 5xx spikes is the single most valuable app-level dashboard in the cluster.
2. CoreDNS (cluster DNS) — coredns_*
Scraped via kube-dns Service annotation. Exposes query rate, cache hits,
error types, response duration.
Panels:
- DNS queries/sec by zone / type: rate(coredns_dns_requests_total[5m])
- Cache hit ratio: sum(rate(coredns_cache_hits_total[5m])) / sum(rate(coredns_cache_requests_total[5m])) (aggregate both sides first, since the two metrics carry different label sets)
- DNS query latency p95: histogram_quantile(0.95, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))
- Queries by response code (NOERROR / NXDOMAIN / SERVFAIL): rate(coredns_dns_responses_total[5m])
- Cache size: coredns_cache_entries
- Forward requests/sec (upstream DNS): rate(coredns_forward_requests_total[5m])
Why useful: DNS issues cause cascading failures (ImagePullBackOff, cert challenges, etc.). A spike in NXDOMAIN/SERVFAIL is an early warning.
3. MetalLB (LoadBalancer) — metallb_*
Scraped via pod annotation on speaker-* and controller in metallb-system.
Exposes IP allocation usage, BGP/session state.
Panels:
- IP addresses in use vs. total: metallb_allocator_addresses_in_use_total vs. metallb_allocator_addresses_total
- IP pool utilization % (gauge): metallb_allocator_addresses_in_use_total / metallb_allocator_addresses_total * 100
- BGP session up per speaker: metallb_bgp_session_up
- Config loaded / stale status: metallb_k8s_client_config_loaded_bool, metallb_k8s_client_config_stale_bool
- Announcements per speaker: rate(metallb_bgp_announcements_total[5m])
Why useful: If MetalLB runs out of IPs, new LoadBalancer services will
hang in <pending>. Knowing pool utilization lets you act before that happens.
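To act before the pool runs dry, the utilization query can also back an alert. This is a minimal sketch that assumes Prometheus loads rule files (rule_files in its config); the group name, alert name, 80% threshold, and 10m hold time are illustrative choices.

```yaml
groups:
  - name: metallb                      # hypothetical group name
    rules:
      - alert: MetalLBPoolNearlyExhausted
        # Same ratio as the utilization gauge panel, evaluated per pool
        expr: metallb_allocator_addresses_in_use_total / metallb_allocator_addresses_total * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MetalLB pool {{ $labels.pool }} is over 80% allocated"
```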
4. cert-manager (TLS certificates) — certmanager_*
Scraped via pod annotations on cert-manager pods. Exposes certificate expiration, renewal, ready status, ACME challenges.
Panels:
- Certificate expiration (days remaining, sorted): table of (certmanager_certificate_not_after_timestamp_seconds - time()) / 86400
- Certificates not Ready: certmanager_certificate_ready_status{condition!="True"} == 1 (the condition label carries True / False / Unknown)
- Upcoming renewals (next 14 days): certmanager_certificate_renewal_timestamp_seconds
- ACME challenge status: certmanager_certificate_challenge_status
- Failed renewals counter: rate(certmanager_certificate_renewal_total{condition="Failed"}[1h])
Why useful: A cert about to expire (or silently failing to renew) is the
kind of thing that takes down *.rogi.casa HTTPS with no warning. This is a
must-have alert/dashboard.
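For the alert half, a minimal rule sketch might look like the following. It reuses the expiry expression from the panel list, and assumes that Prometheus loads rule files and that the metric name matches your cert-manager version; the 14-day threshold and the alert name are illustrative.

```yaml
groups:
  - name: cert-manager                 # hypothetical group name
    rules:
      - alert: CertificateExpiringSoon
        # Days until expiry, same expression as the expiration table panel
        expr: (certmanager_certificate_not_after_timestamp_seconds - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expires in under 14 days"
```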
5. Phoenix (trace store) — phoenix_*
Already scraped via the phoenix Service annotation. Exposes bulk loader
ingestion rates, span insertion times, retention sweeper, exceptions.
Panels:
- Span ingestion rate: rate(phoenix_bulk_loader_span_insertion_time_seconds_count[5m])
- Span insertion latency p95: histogram_quantile(0.95, sum(rate(phoenix_bulk_loader_span_insertion_time_seconds_bucket[5m])) by (le))
- Span exceptions/sec: rate(phoenix_bulk_loader_span_exceptions_total[5m])
- Retention sweeper last run: phoenix_retention_sweeper_last_run_seconds
- Last activity timestamp: phoenix_bulk_loader_last_activity_timestamp_seconds
Why useful: Phoenix is your observability backend's own backend. Tracking ingestion health tells you whether traces are landing.
Infrastructure dashboards (compose from existing metrics)
6. Storage & PVC Health (KSM + kubelet + node-exporter)
Cross-source dashboard combining kube_persistentvolumeclaim_* (KSM),
kubelet_volume_stats_* (kubelet), and node_filesystem_* (node-exporter).
Panels:
- PVC usage % per claim: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100
- PVC requested vs. capacity: kube_persistentvolumeclaim_resource_requests_storage_bytes vs. actual
- Node disk usage % (all mounts): (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
- Inode usage % per mount: (1 - node_filesystem_files_free / node_filesystem_files) * 100
- Volume binding status (Bound/Pending): kube_persistentvolumeclaim_status_phase
- Top 10 PVCs by usage (table)
Why useful: The local-path provisioner fills up node disks. Catching a
PVC at 95% before it errors is a lifesaver.
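A sketch of how "catching it early" could be expressed as rules, assuming Prometheus loads rule files and that the kubelet actually reports volume stats for these claims. The thresholds, time windows, and alert names are illustrative.

```yaml
groups:
  - name: storage                      # hypothetical group name
    rules:
      - alert: PVCAlmostFull
        # Same ratio as the "PVC usage % per claim" panel
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is over 90% full"
      - alert: PVCFillingUp
        # Linear extrapolation of free space over the last 6h, 4 days ahead
        expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is on track to fill within 4 days"
```

The second rule is the more useful one for slow leaks, since it fires on the trend before the percentage threshold is reached.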
7. Workload Health (KSM)
Uses kube-state-metrics to show deployment/StatefulSet/CronJob health cluster-wide.
Panels:
- Deployments with unavailable replicas: kube_deployment_status_replicas_available < kube_deployment_status_replicas
- Pods not in Running phase by namespace: sum by (namespace, phase) (kube_pod_status_phase{phase!="Running"} == 1) (the == 1 filter drops the zero-valued series kube-state-metrics emits for inactive phases)
- Container restarts (last 1h): increase(kube_pod_container_status_restarts_total[1h])
- Pods stuck in CrashLoopBackOff / ImagePullBackOff: kube_pod_container_status_waiting_reason{reason=~"CrashLoopBackOff|ImagePullBackOff"}
- Job failures: kube_job_failed
- CronJob schedule heatmap: kube_cronjob_status_active
- HPA status (if any autoscaled): kube_horizontalpodautoscaler_status_current_replicas vs. kube_horizontalpodautoscaler_status_desired_replicas
Why useful: This is the "is anything broken" board. Notice you already have
some pods in ImagePullBackOff (myorg-assistant) — this dashboard surfaces that.
8. etcd / Control Plane Health (if exposed)
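k3s uses embedded etcd (or SQLite by default on a single server), so these
metrics only exist on etcd-backed clusters. They also require exposing etcd's
/metrics endpoint: plain etcd uses --listen-metrics-urls, while k3s servers
use --etcd-expose-metrics. Either way, this needs a config change on the
control plane node.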
Panels:
- Leader changes: etcd_server_leader_changes_seen_total
- Proposal commits/sec: rate(etcd_server_proposals_committed_total[5m])
- Proposal failures/sec: rate(etcd_server_proposals_failed_total[5m])
- DB size: etcd_mvcc_db_total_size_in_bytes
- RPC latency p99: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[5m])) by (le)) (assuming etcd's unary gRPC latency histogram is exposed under this name)
- Active watchers: etcd_debugging_mvcc_watcher_total
Why useful: etcd is the brain of the cluster. Slow commits or a flipping leader indicates control-plane trouble.
App-service dashboards (require enabling metrics first)
Most of your apps don't expose /metrics yet. Below is the per-service setup
plus the dashboard idea once metrics are on. To enable scraping for any of
these, annotate the Service with:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "<port>"
The existing kubernetes-service-endpoints scrape job will pick them up
automatically — no Prometheus config edit needed.
9. LiteLLM (LLM gateway) — needs enabling
LiteLLM exposes Prometheus metrics on its API port (/metrics). Annotate the
litellm Service.
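A sketch of what enabling this might involve. It assumes the proxy still needs its Prometheus callback turned on in the LiteLLM config (availability can depend on the LiteLLM version), that the proxy listens on port 4000, and that the scrape job honours prometheus.io/path. The namespace, selector, and port name are placeholders.

```yaml
# Fragment for the LiteLLM proxy config file
litellm_settings:
  callbacks: ["prometheus"]
---
apiVersion: v1
kind: Service
metadata:
  name: litellm
  namespace: litellm                  # assumption
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "4000"        # assumption: LiteLLM's API port
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: litellm                      # assumption: match the existing pod labels
  ports:
    - name: http
      port: 4000
      targetPort: 4000
```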
Panels:
- Requests/sec by model: sum by (model) (rate(litellm_requests_total[5m]))
- Token usage (prompt/completion/total): rate(litellm_total_tokens_total[5m])
- Spend by model: litellm_spend_total (if cost tracking enabled)
- Latency p95 per model: histogram_quantile(0.95, ...)
- Error rate by model: rate(litellm_requests_total{status=~"5.."}[5m])
- Rate-limit / quota hits
Why useful: LiteLLM is the gateway for all your AI apps (open-webui, myorg-assistant, etc.). Token spend + per-model latency is the single best cost/quality lever in the cluster.
10. Gitea (git + CI) — needs enabling
Gitea exposes metrics at /metrics when ENABLED = true is set in the [metrics]
section of app.ini. Annotate the gitea-http Service (port 3000 inside, 80 via svc).
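If Gitea runs from the official container image, the setting can also be injected as an environment variable instead of editing app.ini. The fragments below are a sketch under that assumption. The annotation uses the container port (3000) because annotation-based jobs typically rewrite the pod endpoint's port, not the Service port.

```yaml
# Fragment for the Gitea container spec
env:
  - name: GITEA__metrics__ENABLED     # maps to [metrics] ENABLED in app.ini
    value: "true"
---
# Fragment for the gitea-http Service
metadata:
  name: gitea-http
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3000"
```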
Panels:
- Git push/clone/fetch rate: gitea_actions_total by action
- Active users / repos / orgs: gitea_users_total, gitea_repos_total
- Issues / PRs open: gitea_issues_total, gitea_pulls_total
- HTTP request rate + latency
- Gitea Actions runner job duration (if runner metrics exposed)
Why useful: Gitea hosts the cluster's own GitOps repo + CI. Tracking push rate and runner throughput catches CI storms.
11. Home Assistant — needs enabling
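HA exposes Prometheus metrics via the prometheus integration (add to
configuration.yaml). They are served at /api/prometheus and require an access
token by default. Then annotate the Service, including the metrics path.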
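One way this could fit the annotation-based scraping is sketched below. It assumes HA listens on its default port 8123 and that the scrape job honours prometheus.io/path. Disabling auth on the metrics endpoint is a trade-off, not a recommendation; the alternative is a dedicated scrape job that sends a long-lived access token.

```yaml
# Fragment for Home Assistant's configuration.yaml
prometheus:
  requires_auth: false    # lets the annotation-based job scrape without a token
---
# Fragment for the Home Assistant Service
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8123"
    prometheus.io/path: "/api/prometheus"
```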
Panels:
- Active entities / sensors by domain
- State change events/sec: homeassistant_entity_states_total
- Automation triggers/sec: homeassistant_automation_triggered_total
- Integrations loaded + errors
- Database size / recorder queue depth
- Zigbee/Z-Wave mesh health (if exposed)
Why useful: HA is a home-critical service. Event/sec spikes often indicate sensor flapping or runaway automations.
12. Jellyfin — limited
Jellyfin doesn't ship first-class Prometheus metrics, but you can scrape it
via a sidecar (jellyfin-prometheus-exporter) or build a blackbox-style
dashboard on the /health endpoint.
Panels:
- Active streams — from exporter
- Transcode sessions + hw accel usage
- Library size by media type
- Playback errors
13. Pi-hole — needs enabling
Pi-hole exposes metrics on its FTL web API; the pihole-exporter sidecar
converts them to Prometheus format. Add as a sidecar container + annotate.
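A sidecar sketch, assuming the community ekofr/pihole-exporter image with its PIHOLE_HOSTNAME and PIHOLE_PASSWORD settings and its default port 9617. The Secret name and key are hypothetical, and the exporter options should be checked against the Pi-hole version in use.

```yaml
# Fragment for the Pi-hole pod template (merge into the existing spec)
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9617"
    spec:
      containers:
        - name: pihole-exporter
          image: ekofr/pihole-exporter        # assumption: community exporter image
          env:
            - name: PIHOLE_HOSTNAME
              value: "127.0.0.1"              # same pod, so Pi-hole is on localhost
            - name: PIHOLE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: pihole-admin          # hypothetical Secret
                  key: password
          ports:
            - name: metrics
              containerPort: 9617
```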
Panels:
- DNS queries/sec (total, blocked, cached, forwarded)
- Block list size
- Top blocked domains
- Top permitted domains
- Clients by query volume
- Cache hit ratio
Why useful: Pi-hole is your network-wide adblock. Block rate + cache ratio are the headline metrics, and query spikes reveal misbehaving clients.
14. PostgreSQL (litellm + phoenix + n8n) — needs enabling
You have two Postgres instances (postgres in litellm and phoenix).
Add prometheus-postgres-exporter as a sidecar or Deployment per DB.
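The sidecar variant could be sketched as follows, using the prometheus-community postgres-exporter image with its DATA_SOURCE_NAME setting and default port 9187. The Secret holding the connection string is hypothetical; pod annotations are used so the existing kubernetes-pods job picks up the exporter port.

```yaml
# Fragment for each Postgres pod template (merge into the existing spec)
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9187"
    spec:
      containers:
        - name: postgres-exporter
          image: quay.io/prometheuscommunity/postgres-exporter
          env:
            - name: DATA_SOURCE_NAME            # connection string for the local DB
              valueFrom:
                secretKeyRef:
                  name: postgres-exporter-dsn   # hypothetical Secret
                  key: dsn
          ports:
            - name: metrics
              containerPort: 9187
```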
Panels (per DB):
- Connections (active / idle / max): pg_stat_activity_count
- Transactions/sec: rate(pg_stat_database_xact_commit[5m])
- Cache hit ratio: pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read)
- Table + index bloat
- Replication lag (if replicas)
- Slow queries (if pg_stat_statements enabled)
- DB size growth: pg_database_size_bytes
Why useful: DB connection exhaustion and cache ratio collapse are the two most common causes of slow app performance.
15. Minecraft — limited
The Minecraft server exposes metrics via RCON + an exporter
(minecraft-exporter). Add as sidecar using the existing RCON_PASSWORD.
Panels:
- Players online: minecraft_players_online
- TPS (ticks per second): minecraft_tps (server health)
- Entities loaded: minecraft_entities_total
- Chunk count: minecraft_chunks_loaded
- Memory used by JVM
Why useful: TPS < 20 means lag. Player count vs. server load is the only real signal a Minecraft server needs.
16. qBittorrent — limited
No native metrics. Options: a qbittorrent-exporter sidecar (uses the WebUI
API), or a blackbox probe on the WebUI.
Panels:
- Download/upload speed
- Active torrents
- Torrent count by state (downloading/seeding/paused)
- Disk usage in download dir
Cluster meta dashboards
17. Network Topology / Service Map
Composite view: for each namespace, list services, their pods, scrape status, and request volume (from Traefik logs + cAdvisor network). A "what talks to what" overview.
Panels:
- Service → pod → container resource table
- Cross-namespace network flows (if network policy logging enabled)
- Scrape health matrix (every target up/down)
- Ingress route → backend service map
18. Backup / Snapshot Status
If you take Velero snapshots or local-path snapshots, build a dashboard on
velero_* or CRD status. Requires Velero.
Panels:
- Last successful backup per namespace
- Failed backups
- Backup size growth
- Restore test status
19. Cost / Capacity Planning
Composite: per-namespace CPU/memory requests vs. actual usage, projected growth, node saturation forecast.
Panels:
- Requests vs. limits vs. actual (per namespace) — KSM + cAdvisor
- Node capacity vs. allocatable
- PVC growth trend + 30-day forecast
- "What if I removed node X" simulation (capacity headroom)
Why useful: Tells you when you'll need another node before you hit the wall.
Recommended priority order
If you only build a few, do them in this order (highest value-to-effort first):
- Traefik Ingress (#1): already scraped, your front door
- Storage & PVC Health (#6): local-path fills disks; high blast radius
- Workload Health (#7): surfaces CrashLoopBackOff / ImagePullBackOff
- cert-manager (#4): prevents silent cert expiry outages
- CoreDNS (#2): early warning for DNS cascades
- LiteLLM (#9): needs prometheus.io/scrape annotation only; big insights
- MetalLB (#3): small but catches LoadBalancer IP exhaustion
Everything else (#5, #8, and #10–19) is nice-to-have or requires additional exporters/config.