A Practical Observability Playbook: Golden Signals, RED/USE, and SLO-Driven Alerting
If you want monitoring that actually helps during incidents (instead of becoming dashboard wallpaper), anchor everything to a few proven frameworks:
- Golden Signals: latency, traffic, errors, saturation
- RED (request-driven services): Rate, Errors, Duration
- USE (infrastructure): Utilization, Saturation, Errors
These frameworks keep you focused on what matters, across UI, services, and infrastructure—without inventing a new taxonomy for every team.
Data plumbing: Metrics + Logs + Traces (and why exemplars matter)
A modern “full stack” setup often looks like:
- Metrics: Prometheus / OpenTelemetry → Grafana
- Logs: Loki
- Traces: Tempo
- Bonus: Exemplars connect them.
Exemplars are the bridge from “something looks slow” (a metric datapoint) to the exact trace responsible. In Grafana, that means you can click a slow latency spike (say P99) and jump straight to the corresponding trace in Tempo—hugely reducing time to root cause.
Rule of thumb:
- Metrics answer: “Is this getting worse? How often?”
- Logs answer: “What exactly happened?”
- Traces answer: “Where did the time go across services?”
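The exemplar mechanism is easy to sketch without any client library: each histogram bucket keeps, besides its count, the trace ID of one recent observation that landed in it. A minimal toy version (class and method names here are illustrative, not any specific SDK's API — a real client such as prometheus_client with OpenMetrics exposition does this for you):

```python
from dataclasses import dataclass, field

@dataclass
class ExemplarHistogram:
    """Toy latency histogram whose buckets carry exemplar trace IDs.

    This only illustrates the idea; real exemplar support lives in your
    metrics client and exposition format, not in application code.
    """
    buckets: tuple = (0.1, 0.3, 1.0, float("inf"))  # upper bounds in seconds
    counts: dict = field(default_factory=dict)
    exemplars: dict = field(default_factory=dict)   # bucket bound -> trace_id

    def observe(self, seconds: float, trace_id: str) -> None:
        for le in self.buckets:
            if seconds <= le:
                self.counts[le] = self.counts.get(le, 0) + 1
                self.exemplars[le] = trace_id  # remember the latest trace seen here
                break  # non-cumulative buckets here, for simplicity

h = ExemplarHistogram()
h.observe(0.05, trace_id="abc123")    # fast request
h.observe(2.40, trace_id="deadbeef")  # slow request -> exemplar for the tail bucket
```

When the tail bucket spikes on a dashboard, the stored `trace_id` is exactly the link you click through to Tempo.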
Metrics that won’t lie: histograms + percentiles (not averages)
Prefer histograms over summaries
In Prometheus-world, histograms are generally the better default because they:
- aggregate safely across instances/regions,
- support SLO math and burn-rate alerting cleanly,
- work naturally with exemplars.
Alert on P95/P99 (or SLO compliance), not averages
Averages hide pain. Users don’t experience the mean—they experience the tail.
Practical approach:
- track distributions as histograms
- visualize P50/P95/P99
- alert on P95/P99 or (even better) SLO error budget burn
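A tiny worked example of why the mean hides tail pain — 95 fast requests and 5 very slow ones (numbers invented for illustration):

```python
import math

def percentile(data, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    s = sorted(data)
    rank = math.ceil(p / 100 * len(s))       # 1-based nearest rank
    return s[min(len(s) - 1, rank - 1)]

latencies_ms = [100] * 95 + [3000] * 5       # 95 fast requests, 5 very slow ones

mean = sum(latencies_ms) / len(latencies_ms) # 245 ms: looks "fine" on a dashboard
p50 = percentile(latencies_ms, 50)           # 100 ms: the typical user
p99 = percentile(latencies_ms, 99)           # 3000 ms: what the unlucky user actually feels
```

An average-based alert at, say, 500 ms never fires here, while 5% of users wait three seconds.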
Choose buckets that reflect user expectations
Buckets should match what “fast enough” means for your product. Example for web-ish APIs:
- 10–50–100–300–1000 ms (and maybe 2s / 5s if you have heavy endpoints)
Buckets that are too coarse blur the story; buckets that are too fine raise cost and noise. Pick buckets that line up with real thresholds.
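Cumulative (Prometheus-style `le`) buckets over the thresholds above can be sketched in a few lines of stdlib Python (the latency values are made up; a real service would use its metrics client's histogram type):

```python
# Bucket upper bounds matching the user-facing thresholds above (seconds).
BOUNDS = [0.01, 0.05, 0.1, 0.3, 1.0, 2.0, 5.0]

def bucketize(latencies_s):
    """Return cumulative counts per 'le' bound, Prometheus-histogram style:
    every observation increments all buckets whose bound it fits under."""
    counts = {le: 0 for le in BOUNDS}
    counts[float("inf")] = 0                 # +Inf bucket = total observations
    for v in latencies_s:
        counts[float("inf")] += 1
        for le in BOUNDS:
            if v <= le:
                counts[le] += 1
    return counts

c = bucketize([0.004, 0.07, 0.25, 0.9, 4.2])
```

Because the counts are cumulative, they aggregate safely across instances: summing two services' `le=0.3` buckets still answers "how many requests were under 300 ms?"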
Labels you’ll care about (and cardinality you must control)
The most useful low-cardinality labels for slicing incidents:
- `service`
- `route` (use normalized templates like `/users/:id`, not raw URLs)
- `method`
- `status_code_class` (2xx/4xx/5xx instead of full status if you need to limit series)
- `env`
- `region`
- `version` (critical for deploy correlation)
Cardinality rule: never label on user IDs, request IDs, full query strings, or unbounded values. If you need per-request detail, that’s what traces/logs are for.
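Route normalization is the piece most teams get wrong. A hypothetical normalizer (the patterns and template names are illustrative — many web frameworks expose the matched route template directly, which is better than regex guessing):

```python
import re

# Collapse unbounded path segments (UUIDs first, then numeric IDs) into
# templates before using the path as a metric label.
_PATTERNS = [
    (re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"), "/:uuid"),
    (re.compile(r"/\d+"), "/:id"),
]

def normalize_route(path: str) -> str:
    path = path.split("?", 1)[0]              # never label on query strings
    for pattern, template in _PATTERNS:
        path = pattern.sub(template, path)
    return path

normalize_route("/users/42/orders/137?page=2")  # -> "/users/:id/orders/:id"
```

Whatever requests come in, the label set stays bounded by the number of route templates, not the number of users.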
UI Observability (Frontend: RUM + Synthetics)
UI monitoring is where “it’s fine in the backend” goes to die. You need both:
- RUM (Real User Monitoring): what real users experience
- Synthetics: controlled checks for critical paths
What to track
Availability & errors
- Page view success rate, SPA route change success rate
- JS error rate (uncaught exceptions per 1k sessions)
- Resource load failures (images/fonts/third-party scripts)
- Frontend → API failure rate (by domain/endpoint)
Performance (user-perceived)
Core Web Vitals
- LCP (Largest Contentful Paint)
- INP (Interaction to Next Paint; successor to FID)
- CLS (Cumulative Layout Shift)
Plus the “why is it slow” helpers:
- TTFB, FCP, TTI, total blocking time
- long task count/time, SPA route transition time
- third-party dependency latency distribution
UX / behavior (optional but powerful)
- session count, bounce rate, funnel conversion, checkout success
- rage clicks, scroll jank / frame drops (if you collect it)
- Service Worker / cache hit ratio
Coverage breakdowns
- browser, device class, OS, country/region, app version/build
How to collect
- RUM SDK (Grafana Faro or OTel web SDK) → metrics/logs/traces via gateway or Grafana Cloud
- Synthetics: scheduled login/checkout flows using k6 or the blackbox exporter. Store Apdex plus per-step latencies so you can see which step regressed.
SLO ideas & alert patterns
- Availability SLO: ≥ 99.9% of page views complete without a JS error or an HTTP status ≥ 500
- Performance SLOs: P75 LCP < 2.5s; P75 INP < 200ms; P95 route change < 500ms
- Anomaly alerts: error rate > baseline + 3σ for 10 minutes; third-party domain failure > 5%
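The "baseline + 3σ" rule above is just rolling statistics. A minimal sketch (window size and sample values invented; a real alert would also require the condition to persist, e.g. for 10 minutes):

```python
from statistics import mean, stdev

def is_anomalous(current_rate: float, baseline_window: list, sigmas: float = 3.0) -> bool:
    """Flag when the current error rate exceeds baseline mean + N standard deviations."""
    threshold = mean(baseline_window) + sigmas * stdev(baseline_window)
    return current_rate > threshold

# Last five 10-minute JS-error-rate samples (errors per 1k sessions, made up):
baseline = [1.0, 1.2, 0.8, 1.0, 1.0]

is_anomalous(2.0, baseline)   # far above baseline -> True
is_anomalous(1.1, baseline)   # within normal variation -> False
```

The sigma threshold adapts to each page's normal noise level, which a fixed percentage threshold cannot do.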
API / Backend Observability (Services + Dependencies)
Backend monitoring should be RED-driven, plus saturation signals so you can tell “slow because demand” vs “slow because capacity.”
What to track (RED + saturation)
Requests (RED)
- request rate (by `service`, `route`, `method`)
- error rate split: 5xx (server), 4xx (client), timeouts/cancelled
- latency distributions (histograms): overall + per route + per dependency
Saturation & capacity
- in-flight requests
- worker/thread pool utilization
- event loop lag (Node), goroutine spikes (Go), GC pause (JVM/Go), JVM heap/GC
- queue depth/lag (Kafka/SQS/RabbitMQ), backlog age, DLQ ingress
- DB pools: in-use vs max, wait time; HTTP client pools
- rate limiting: allowed vs limited vs dropped
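Of these, in-flight requests is the cheapest signal to add yourself. A minimal stdlib sketch (a real service would use its metrics client's gauge instead of this hand-rolled class):

```python
import threading
from contextlib import contextmanager

class InFlightGauge:
    """Minimal in-flight request gauge with a high-water mark."""
    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.high_water = 0   # peak concurrency since start: a cheap saturation signal

    @contextmanager
    def track(self):
        with self._lock:
            self.current += 1
            self.high_water = max(self.high_water, self.current)
        try:
            yield
        finally:
            with self._lock:
                self.current -= 1

gauge = InFlightGauge()
with gauge.track():
    pass   # handle the request here
```

When `current` hovers near your worker-pool size, you are saturated on capacity even if latency hasn't visibly degraded yet.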
Key dependencies
Database
- query latency, lock waits, deadlocks, slow query rate
- replication lag, cache/buffer hit ratio
- WAL pressure / write saturation (where relevant)
Cache (Redis/Memcached)
- hit ratio, command latency, evictions, memory fragmentation
External calls
- per-upstream rate/errors/latency
- retry counts, circuit breaker state
Correctness / quality (often overlooked, often critical)
- idempotency/key collisions
- job retry rate, saga/compensation events
- schema migration duration/failures during deploys
SLO ideas & alerting that scales
Example SLOs
- Latency: 99% of `GET /search` < 300 ms; 99% of `POST /checkout` < 800 ms
- Availability: 99.95% success (non-5xx) over 30 days
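It's worth internalizing what those availability numbers mean as an error budget. The arithmetic:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full failure the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60

error_budget_minutes(0.9995)   # 99.95% over 30 days -> ~21.6 minutes
error_budget_minutes(0.999)    # 99.9%  over 30 days -> ~43.2 minutes
```

At 99.95%, a single 25-minute outage blows the whole month's budget — which is exactly why the burn-rate alerts below matter.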
Burn-rate alerts (multi-window)
This is one of the cleanest ways to avoid noisy alerting:
- fast burn: 5m / 1h window catches “it’s on fire”
- slow burn: 30m / 6h window catches “it’s steadily degrading”
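The burn-rate multipliers behind those windows come from simple arithmetic: a burn rate of 1.0 consumes exactly the whole budget over the SLO period, so the fraction consumed in an alert window is `rate × window / period`. The classic thresholds from the Google SRE Workbook recipe check out:

```python
def budget_consumed(burn_rate: float, alert_window_h: float,
                    slo_period_h: float = 30 * 24) -> float:
    """Fraction of the whole error budget consumed if this burn rate
    persists for the alert window (burn_rate 1.0 = exactly on budget)."""
    return burn_rate * alert_window_h / slo_period_h

budget_consumed(14.4, 1)   # fast burn: 2% of a 30-day budget gone in one hour
budget_consumed(6, 6)      # slow burn: 5% of the budget gone in six hours
```

The short window in each pair (5m, 30m) only gates the alert so it stops firing quickly once the burn ends.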
Deploy safety gates (high ROI)
If error rate or P95 latency increases by ~50% within 10 minutes of a new version, flag it:
- page the on-call
- stop rollout / auto-rollback (if you have it)
- annotate the dashboard with the deploy event
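The gate itself is a one-line comparison once you have the numbers. A sketch (in practice the pre/post values would come from your metrics backend, comparing the 10 minutes after rollout against the preceding baseline; the 1.5× factor is the ~50% regression threshold above):

```python
def should_flag_deploy(pre_error_rate: float, post_error_rate: float,
                       pre_p95_ms: float, post_p95_ms: float,
                       regression: float = 1.5) -> bool:
    """Flag a new version if its error rate or P95 latency regressed by ~50%.

    Note: a pre_error_rate of exactly 0 would always flag; real gates
    add a small floor to both baselines.
    """
    return (post_error_rate >= regression * pre_error_rate
            or post_p95_ms >= regression * pre_p95_ms)

should_flag_deploy(0.01, 0.02, 200, 210)   # errors doubled -> flag it
should_flag_deploy(0.01, 0.011, 200, 210)  # within noise -> ship it
```
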
Infrastructure Observability (K8s/VMs/Network/Storage/DNS)
Infra monitoring is USE-driven, with special attention to “hidden saturation” (throttling, queues, IO wait).
Kubernetes / Containers
Node (USE)
- CPU/memory utilization vs allocatable
- pressure signals (memory/cpu/io), throttling
- inode & disk usage, kubelet health
Pod/container
- restarts, CrashLoopBackOff, OOM kills
- CPU throttling, memory working set vs limits
- open FDs, probe failures, image pull timeouts
Control plane
- pending pods, scheduling latency
- etcd quorum health + operation latency
Autoscaling
- desired vs current replicas
- scale events
- pending pods vs quotas
Network & edge
- LB/Ingress: rate, 4xx/5xx at edge, upstream connection errors
- TLS handshake time, cert days-to-expiry
- retransmits/drops, DNS latency/error rate
Storage
- PV/disk: IOPS, throughput, read/write latency, queue depth, %util
- fs usage + inodes, PVC binding timeouts
- managed DB infra signals: CPU, memory, storage, I/O credits, connections, replication lag, failovers
SLO ideas & alerts
- Cluster SLO: < 1% of ingress requests return 5xx due to upstream unavailable
- Infra saturation: node CPU > 90% for 15m and pods pending due to CPU → page
- Storage regression: PV latency P99 > 50ms for 15m → investigate
Closing: How to implement this without boiling the ocean
If you’re starting or cleaning up an existing mess, implement in this order:
- Golden Signals dashboards per tier (UI / API / infra)
- Histograms + meaningful buckets for latency (plus exemplars if possible)
- Low-cardinality labeling with `version` baked in
- SLOs + burn-rate alerts (fewer alerts, more signal)
- Drill-down workflow: Alert → dashboard → exemplars → trace → logs
That’s the path from “we have Grafana” to “we can debug production quickly.”
If you want, paste your current metric names (or a sample Prometheus scrape) and I’ll map them into a concrete set of Grafana dashboards + recording rules + SLO/burn-rate alerts using your exact labels.