A Practical Observability Playbook: Golden Signals, RED/USE, and SLO-Driven Alerting
If you want monitoring that actually helps during incidents (instead of becoming dashboard wallpaper), anchor everything to a few proven frameworks:
- Golden Signals: latency, traffic, errors, saturation
- RED (request-driven services): Rate, Errors, Duration
- USE (infrastructure): Utilization, Saturation, Errors
These frameworks keep you focused on what matters, across UI, services, and infrastructure—without inventing a new taxonomy for every team.
Data plumbing: Metrics + Logs + Traces (and why exemplars matter)
A modern “full stack” setup often looks like:
- Metrics: Prometheus / OpenTelemetry → Grafana
- Logs: Loki
- Traces: Tempo
- Bonus: Exemplars connect them.
Exemplars are the bridge from “something looks slow” (a metric datapoint) to the exact trace responsible. In Grafana, that means you can click a slow latency spike (say P99) and jump straight to the corresponding trace in Tempo—hugely reducing time to root cause.
Rule of thumb:
- Metrics answer: “Is this getting worse? How often?”
- Logs answer: “What exactly happened?”
- Traces answer: “Where did the time go across services?”
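The exemplar mechanism is easy to sketch without any client library: each histogram bucket keeps, besides its count, the trace ID of one recent observation that landed in it. A minimal toy version (class and method names here are illustrative, not any specific SDK's API — a real client such as prometheus_client with OpenMetrics exposition does this for you):

```python
from dataclasses import dataclass, field

@dataclass
class ExemplarHistogram:
    """Toy latency histogram whose buckets carry exemplar trace IDs.

    This only illustrates the idea; real exemplar support lives in your
    metrics client and exposition format, not in application code.
    """
    buckets: tuple = (0.1, 0.3, 1.0, float("inf"))  # upper bounds in seconds
    counts: dict = field(default_factory=dict)
    exemplars: dict = field(default_factory=dict)   # bucket bound -> trace_id

    def observe(self, seconds: float, trace_id: str) -> None:
        for le in self.buckets:
            if seconds <= le:
                self.counts[le] = self.counts.get(le, 0) + 1
                self.exemplars[le] = trace_id  # remember the latest trace seen here
                break  # non-cumulative buckets here, for simplicity

h = ExemplarHistogram()
h.observe(0.05, trace_id="abc123")    # fast request
h.observe(2.40, trace_id="deadbeef")  # slow request -> exemplar for the tail bucket
```

When the tail bucket spikes on a dashboard, the stored `trace_id` is exactly the link you click through to Tempo.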
Metrics that won’t lie: histograms + percentiles (not averages)
Prefer histograms over summaries
In Prometheus-world, histograms are generally the better default because they:
- aggregate safely across instances/regions,
- support SLO math and burn-rate alerting cleanly,
- work naturally with exemplars.
Alert on P95/P99 (or SLO compliance), not averages
Averages hide pain. Users don’t experience the mean—they experience the tail.
Practical approach:
- track distributions as histograms
- visualize P50/P95/P99
- alert on P95/P99 or (even better) SLO error budget burn
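A tiny worked example of why the mean hides tail pain — 95 fast requests and 5 very slow ones (numbers invented for illustration):

```python
import math

def percentile(data, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    s = sorted(data)
    rank = math.ceil(p / 100 * len(s))       # 1-based nearest rank
    return s[min(len(s) - 1, rank - 1)]

latencies_ms = [100] * 95 + [3000] * 5       # 95 fast requests, 5 very slow ones

mean = sum(latencies_ms) / len(latencies_ms) # 245 ms: looks "fine" on a dashboard
p50 = percentile(latencies_ms, 50)           # 100 ms: the typical user
p99 = percentile(latencies_ms, 99)           # 3000 ms: what the unlucky user actually feels
```

An average-based alert at, say, 500 ms never fires here, while 5% of users wait three seconds.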
Choose buckets that reflect user expectations
Buckets should match what “fast enough” means for your product. Example for web-ish APIs:
- 10–50–100–300–1000 ms (and maybe 2s / 5s if you have heavy endpoints)
Buckets that are too coarse blur the story; buckets that are too fine raise cost and noise. Pick buckets that line up with real thresholds.
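Cumulative (Prometheus-style `le`) buckets over the thresholds above can be sketched in a few lines of stdlib Python (the latency values are made up; a real service would use its metrics client's histogram type):

```python
# Bucket upper bounds matching the user-facing thresholds above (seconds).
BOUNDS = [0.01, 0.05, 0.1, 0.3, 1.0, 2.0, 5.0]

def bucketize(latencies_s):
    """Return cumulative counts per 'le' bound, Prometheus-histogram style:
    every observation increments all buckets whose bound it fits under."""
    counts = {le: 0 for le in BOUNDS}
    counts[float("inf")] = 0                 # +Inf bucket = total observations
    for v in latencies_s:
        counts[float("inf")] += 1
        for le in BOUNDS:
            if v <= le:
                counts[le] += 1
    return counts

c = bucketize([0.004, 0.07, 0.25, 0.9, 4.2])
```

Because the counts are cumulative, they aggregate safely across instances: summing two services' `le=0.3` buckets still answers "how many requests were under 300 ms?"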
Labels you’ll care about (and cardinality you must control)
The most useful low-cardinality labels for slicing incidents:
- `service`
- `route` (use normalized templates like `/users/:id`, not raw URLs)
- `method`
- `status_code_class` (2xx/4xx/5xx instead of full status if you need to limit series)
- `env`
- `region`
- `version` (critical for deploy correlation)
Cardinality rule: never label on user IDs, request IDs, full query strings, or unbounded values. If you need per-request detail, that’s what traces/logs are for.
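Route normalization is the piece most teams get wrong. A hypothetical normalizer (the patterns and template names are illustrative — many web frameworks expose the matched route template directly, which is better than regex guessing):

```python
import re

# Collapse unbounded path segments (UUIDs first, then numeric IDs) into
# templates before using the path as a metric label.
_PATTERNS = [
    (re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"), "/:uuid"),
    (re.compile(r"/\d+"), "/:id"),
]

def normalize_route(path: str) -> str:
    path = path.split("?", 1)[0]              # never label on query strings
    for pattern, template in _PATTERNS:
        path = pattern.sub(template, path)
    return path

normalize_route("/users/42/orders/137?page=2")  # -> "/users/:id/orders/:id"
```

Whatever requests come in, the label set stays bounded by the number of route templates, not the number of users.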
UI Observability (Frontend: RUM + Synthetics)
UI monitoring is where “it’s fine in the backend” goes to die. You need both:
- RUM (Real User Monitoring): what real users experience
- Synthetics: controlled checks for critical paths
What to track
Availability & errors
- Page view success rate, SPA route change success rate
- JS error rate (uncaught exceptions per 1k sessions)
- Resource load failures (images/fonts/third-party scripts)
- Frontend → API failure rate (by domain/endpoint)
Performance (user-perceived)
Core Web Vitals
- LCP (Largest Contentful Paint)
- INP (Interaction to Next Paint; successor to FID)
- CLS (Cumulative Layout Shift)
Plus the “why is it slow” helpers:
- TTFB, FCP, TTI, total blocking time
- long task count/time, SPA route transition time
- third-party dependency latency distribution
UX / behavior (optional but powerful)
- session count, bounce rate, funnel conversion, checkout success
- rage clicks, scroll jank / frame drops (if you collect it)
- Service Worker / cache hit ratio
Coverage breakdowns
- browser, device class, OS, country/region, app version/build
How to collect
- RUM SDK (Grafana Faro or OTel web SDK) → metrics/logs/traces via gateway or Grafana Cloud
- Synthetics: scheduled login/checkout flows using k6 or the blackbox exporter. Store Apdex plus per-step latencies so you can see which step regressed.
SLO ideas & alert patterns
- Availability SLO: ≥ 99.9% of page views complete without a JS error or an HTTP status ≥ 500
- Performance SLOs: P75 LCP < 2.5s; P75 INP < 200ms; P95 route change < 500ms
- Anomaly alerts: error rate > baseline + 3σ for 10 minutes; third-party domain failure > 5%
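The "baseline + 3σ" rule above is just rolling statistics. A minimal sketch (window size and sample values invented; a real alert would also require the condition to persist, e.g. for 10 minutes):

```python
from statistics import mean, stdev

def is_anomalous(current_rate: float, baseline_window: list, sigmas: float = 3.0) -> bool:
    """Flag when the current error rate exceeds baseline mean + N standard deviations."""
    threshold = mean(baseline_window) + sigmas * stdev(baseline_window)
    return current_rate > threshold

# Last five 10-minute JS-error-rate samples (errors per 1k sessions, made up):
baseline = [1.0, 1.2, 0.8, 1.0, 1.0]

is_anomalous(2.0, baseline)   # far above baseline -> True
is_anomalous(1.1, baseline)   # within normal variation -> False
```

The sigma threshold adapts to each page's normal noise level, which a fixed percentage threshold cannot do.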
API / Backend Observability (Services + Dependencies)
Backend monitoring should be RED-driven, plus saturation signals so you can tell “slow because demand” vs “slow because capacity.”
What to track (RED + saturation)
Requests (RED)
- request rate (by `service`, `route`, `method`)
- error rate split: 5xx (server), 4xx (client), timeouts/cancelled
- latency distributions (histograms): overall + per route + per dependency
Saturation & capacity
- in-flight requests
- worker/thread pool utilization
- event loop lag (Node), goroutine spikes (Go), GC pause (JVM/Go), JVM heap/GC
- queue depth/lag (Kafka/SQS/RabbitMQ), backlog age, DLQ ingress
- DB pools: in-use vs max, wait time; HTTP client pools
- rate limiting: allowed vs limited vs dropped
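Of these, in-flight requests is the cheapest signal to add yourself. A minimal stdlib sketch (a real service would use its metrics client's gauge instead of this hand-rolled class):

```python
import threading
from contextlib import contextmanager

class InFlightGauge:
    """Minimal in-flight request gauge with a high-water mark."""
    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.high_water = 0   # peak concurrency since start: a cheap saturation signal

    @contextmanager
    def track(self):
        with self._lock:
            self.current += 1
            self.high_water = max(self.high_water, self.current)
        try:
            yield
        finally:
            with self._lock:
                self.current -= 1

gauge = InFlightGauge()
with gauge.track():
    pass   # handle the request here
```

When `current` hovers near your worker-pool size, you are saturated on capacity even if latency hasn't visibly degraded yet.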
Key dependencies
Database
- query latency, lock waits, deadlocks, slow query rate
- replication lag, cache/buffer hit ratio
- WAL pressure / write saturation (where relevant)
Cache (Redis/Memcached)
- hit ratio, command latency, evictions, memory fragmentation
External calls
- per-upstream rate/errors/latency
- retry counts, circuit breaker state
Correctness / quality (often overlooked, often critical)
- idempotency/key collisions
- job retry rate, saga/compensation events
- schema migration duration/failures during deploys
SLO ideas & alerting that scales
Example SLOs
- Latency: 99% of `GET /search` < 300 ms; 99% of `POST /checkout` < 800 ms
- Availability: 99.95% success (non-5xx) over 30 days
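It's worth internalizing what those availability numbers mean as an error budget. The arithmetic:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full failure the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60

error_budget_minutes(0.9995)   # 99.95% over 30 days -> ~21.6 minutes
error_budget_minutes(0.999)    # 99.9%  over 30 days -> ~43.2 minutes
```

At 99.95%, a single 25-minute outage blows the whole month's budget — which is exactly why the burn-rate alerts below matter.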
Burn-rate alerts (multi-window)
This is one of the cleanest ways to avoid noisy alerting:
- fast burn: 5m / 1h window catches “it’s on fire”
- slow burn: 30m / 6h window catches “it’s steadily degrading”
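The burn-rate multipliers behind those windows come from simple arithmetic: a burn rate of 1.0 consumes exactly the whole budget over the SLO period, so the fraction consumed in an alert window is `rate × window / period`. The classic thresholds from the Google SRE Workbook recipe check out:

```python
def budget_consumed(burn_rate: float, alert_window_h: float,
                    slo_period_h: float = 30 * 24) -> float:
    """Fraction of the whole error budget consumed if this burn rate
    persists for the alert window (burn_rate 1.0 = exactly on budget)."""
    return burn_rate * alert_window_h / slo_period_h

budget_consumed(14.4, 1)   # fast burn: 2% of a 30-day budget gone in one hour
budget_consumed(6, 6)      # slow burn: 5% of the budget gone in six hours
```

The short window in each pair (5m, 30m) only gates the alert so it stops firing quickly once the burn ends.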
Deploy safety gates (high ROI)
If error rate or P95 latency increases by ~50% within 10 minutes of a new version, flag it:
- page the on-call
- stop rollout / auto-rollback (if you have it)
- annotate the dashboard with the deploy event
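The gate itself is a one-line comparison once you have the numbers. A sketch (in practice the pre/post values would come from your metrics backend, comparing the 10 minutes after rollout against the preceding baseline; the 1.5× factor is the ~50% regression threshold above):

```python
def should_flag_deploy(pre_error_rate: float, post_error_rate: float,
                       pre_p95_ms: float, post_p95_ms: float,
                       regression: float = 1.5) -> bool:
    """Flag a new version if its error rate or P95 latency regressed by ~50%.

    Note: a pre_error_rate of exactly 0 would always flag; real gates
    add a small floor to both baselines.
    """
    return (post_error_rate >= regression * pre_error_rate
            or post_p95_ms >= regression * pre_p95_ms)

should_flag_deploy(0.01, 0.02, 200, 210)   # errors doubled -> flag it
should_flag_deploy(0.01, 0.011, 200, 210)  # within noise -> ship it
```
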
Infrastructure Observability (K8s/VMs/Network/Storage/DNS)
Infra monitoring is USE-driven, with special attention to “hidden saturation” (throttling, queues, IO wait).
Kubernetes / Containers
Node (USE)
- CPU/memory utilization vs allocatable
- pressure signals (memory/cpu/io), throttling
- inode & disk usage, kubelet health
Pod/container
- restarts, CrashLoopBackOff, OOM kills
- CPU throttling, memory working set vs limits
- open FDs, probe failures, image pull timeouts
Control plane
- pending pods, scheduling latency
- etcd quorum health + operation latency
Autoscaling
- desired vs current replicas
- scale events
- pending pods vs quotas
Network & edge
- LB/Ingress: rate, 4xx/5xx at edge, upstream connection errors
- TLS handshake time, cert days-to-expiry
- retransmits/drops, DNS latency/error rate
Storage
- PV/disk: IOPS, throughput, read/write latency, queue depth, %util
- fs usage + inodes, PVC binding timeouts
- managed DB infra signals: CPU, memory, storage, I/O credits, connections, replication lag, failovers
SLO ideas & alerts
- Cluster SLO: < 1% of ingress requests return 5xx due to upstream unavailable
- Infra saturation: node CPU > 90% for 15m and pods pending due to CPU → page
- Storage regression: PV latency P99 > 50ms for 15m → investigate
Closing: How to implement this without boiling the ocean
If you’re starting or cleaning up an existing mess, implement in this order:
- Golden Signals dashboards per tier (UI / API / infra)
- Histograms + meaningful buckets for latency (plus exemplars if possible)
- Low-cardinality labeling with `version` baked in
- SLOs + burn-rate alerts (fewer alerts, more signal)
- Drill-down workflow: Alert → dashboard → exemplars → trace → logs
That’s the path from “we have Grafana” to “we can debug production quickly.”
If you want, paste your current metric names (or a sample Prometheus scrape) and I’ll map them into a concrete set of Grafana dashboards + recording rules + SLO/burn-rate alerts using your exact labels.