
A Practical Observability Playbook: Golden Signals, RED/USE, and SLO-Driven Alerting

If you want monitoring that actually helps during incidents (instead of becoming dashboard wallpaper), anchor everything to a few proven frameworks:

  • Golden Signals: latency, traffic, errors, saturation
  • RED (request-driven services): Rate, Errors, Duration
  • USE (infrastructure): Utilization, Saturation, Errors

These frameworks keep you focused on what matters, across UI, services, and infrastructure—without inventing a new taxonomy for every team.


Data plumbing: Metrics + Logs + Traces (and why exemplars matter)

A modern “full stack” setup often looks like:

  • Metrics: Prometheus / OpenTelemetry → Grafana
  • Logs: Loki
  • Traces: Tempo
  • Bonus: Exemplars connect them.

Exemplars are the bridge from “something looks slow” (a metric datapoint) to the exact trace responsible. In Grafana, that means you can click a slow latency spike (say P99) and jump straight to the corresponding trace in Tempo—hugely reducing time to root cause.

Rule of thumb:

  • Metrics answer: “Is this getting worse? How often?”
  • Logs answer: “What exactly happened?”
  • Traces answer: “Where did the time go across services?”

Metrics that won’t lie: histograms + percentiles (not averages)

Prefer histograms over summaries

In the Prometheus world, histograms are generally the better default because they:

  • aggregate safely across instances/regions,
  • support SLO math and burn-rate alerting cleanly,
  • work naturally with exemplars.

Alert on P95/P99 (or SLO compliance), not averages

Averages hide pain. Users don’t experience the mean—they experience the tail.

Practical approach:

  • track distributions as histograms
  • visualize P50/P95/P99
  • alert on P95/P99 or (even better) SLO error budget burn
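A tiny numeric illustration of why the mean hides tail pain (nearest-rank percentiles over a synthetic latency distribution):

```python
# Sketch: 980 fast requests plus 20 slow ones look fine on average,
# while P99 tells the real story.
def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

latencies_ms = [50] * 980 + [5000] * 20  # 2% of requests take 5 seconds

mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print(mean)  # 149.0 ms -> looks healthy
print(p50)   # 50 ms   -> the median is happy too
print(p99)   # 5000 ms -> users in the tail are suffering
```

An alert on the mean here would never fire; an alert on P99 (or on the SLO budget those 5-second requests consume) would.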

Choose buckets that reflect user expectations

Buckets should match what “fast enough” means for your product. Example for web-ish APIs:

  • 10–50–100–300–1000 ms (and maybe 2s / 5s if you have heavy endpoints)

Buckets that are too coarse blur the story; buckets that are too fine raise cost and noise. Pick buckets that line up with real thresholds.
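Declaring such a histogram with `prometheus_client` might look like this (metric and label names are illustrative; buckets follow the thresholds above, in seconds):

```python
# Sketch: a latency histogram whose buckets line up with user-facing
# thresholds (10/50/100/300/1000 ms, plus 2s and 5s for heavy endpoints).
from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()
api_latency = Histogram(
    "http_request_duration_seconds",
    "API request latency in seconds",
    labelnames=["service", "route", "method"],
    buckets=[0.010, 0.050, 0.100, 0.300, 1.0, 2.0, 5.0],
    registry=registry,
)

# A 42 ms request lands in the le="0.05" bucket (and all larger ones).
api_latency.labels("checkout", "/orders/:id", "GET").observe(0.042)
```

Note the `route` label uses a normalized template, not a raw URL, which matters for the cardinality discussion below.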


Labels you’ll care about (and cardinality you must control)

The most useful low-cardinality labels for slicing incidents:

  • service
  • route (use normalized templates like /users/:id, not raw URLs)
  • method
  • status_code_class (2xx/4xx/5xx instead of full status if you need to limit series)
  • env
  • region
  • version (critical for deploy correlation)

Cardinality rule: never label on user IDs, request IDs, full query strings, or unbounded values. If you need per-request detail, that’s what traces/logs are for.
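A minimal sketch of route normalization before labeling, assuming regex-based cleanup (in a real service, prefer taking the matched route template straight from your router):

```python
# Sketch: collapsing raw URLs into bounded route templates so the
# `route` label stays low-cardinality. Patterns are illustrative.
import re

def normalize_route(path: str) -> str:
    # Replace UUID-like segments first, then plain numeric IDs.
    path = re.sub(r"/[0-9a-fA-F-]{36}", "/:uuid", path)
    path = re.sub(r"/\d+", "/:id", path)
    # Drop query strings entirely -- they are unbounded.
    return path.split("?")[0]

print(normalize_route("/users/12345?ref=email"))
# -> /users/:id
print(normalize_route("/orders/550e8400-e29b-41d4-a716-446655440000"))
# -> /orders/:uuid
```

Every distinct label value becomes a distinct time series, so this one function is the difference between a handful of series per endpoint and millions.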


UI Observability (Frontend: RUM + Synthetics)

UI monitoring is where “it’s fine in the backend” goes to die. You need both:

  • RUM (Real User Monitoring): what real users experience
  • Synthetics: controlled checks for critical paths

What to track

Availability & errors

  • Page view success rate, SPA route change success rate
  • JS error rate (uncaught exceptions per 1k sessions)
  • Resource load failures (images/fonts/third-party scripts)
  • Frontend → API failure rate (by domain/endpoint)

Performance (user-perceived)

Core Web Vitals

  • LCP (Largest Contentful Paint)
  • INP (Interaction to Next Paint; successor to FID)
  • CLS (Cumulative Layout Shift)

Plus the “why is it slow” helpers:

  • TTFB, FCP, TTI, total blocking time
  • long task count/time, SPA route transition time
  • third-party dependency latency distribution

UX / behavior (optional but powerful)

  • session count, bounce rate, funnel conversion, checkout success
  • rage clicks, scroll jank / frame drops (if you collect it)
  • Service Worker / cache hit ratio

Coverage breakdowns

  • browser, device class, OS, country/region, app version/build

How to collect

  • RUM SDK (Grafana Faro or OTel web SDK) → metrics/logs/traces via gateway or Grafana Cloud
  • Synthetics: scheduled login/checkout flows using k6 or the blackbox exporter. Store Apdex scores and per-step latencies so you can see which step regressed.

SLO ideas & alert patterns

  • Availability SLO: ≥ 99.9% page views without JS error or HTTP ≥ 500
  • Performance SLOs: P75 LCP < 2.5s; P75 INP < 200ms; P95 route change < 500ms
  • Anomaly alerts: error rate > baseline + 3σ for 10 minutes; third-party domain failure > 5%
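The "baseline + 3σ" anomaly check from the last bullet can be sketched like this (in practice the baseline window comes from your TSDB; here it is an in-memory list of recent per-minute error rates, purely illustrative):

```python
# Sketch: flag the current error rate as anomalous when it exceeds
# the baseline mean plus three standard deviations.
from statistics import mean, stdev

def is_anomalous(current, baseline, sigmas=3.0):
    threshold = mean(baseline) + sigmas * stdev(baseline)
    return current > threshold

baseline = [0.8, 1.0, 1.1, 0.9, 1.2, 1.0, 0.9]  # JS errors per 1k sessions

print(is_anomalous(1.3, baseline))  # within noise -> False
print(is_anomalous(3.5, baseline))  # well above baseline -> True
```

The "for 10 minutes" part of the alert maps to a `for:` clause in Prometheus alerting rules, which keeps one-off blips from paging anyone.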

API / Backend Observability (Services + Dependencies)

Backend monitoring should be RED-driven, plus saturation signals so you can tell "slow because of demand" from "slow because of capacity."

What to track (RED + saturation)

Requests (RED)

  • request rate (by service, route, method)
  • error rate split: 5xx (server), 4xx (client), timeouts/cancelled
  • latency distributions (histograms): overall + per route + per dependency

Saturation & capacity

  • in-flight requests
  • worker/thread pool utilization
  • event loop lag (Node), goroutine spikes (Go), GC pause (JVM/Go), JVM heap/GC
  • queue depth/lag (Kafka/SQS/RabbitMQ), backlog age, DLQ ingress
  • DB pools: in-use vs max, wait time; HTTP client pools
  • rate limiting: allowed vs limited vs dropped
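The first of these, in-flight requests, is cheap to instrument. A sketch with `prometheus_client`'s built-in `track_inprogress()` helper (gauge name is illustrative):

```python
# Sketch: an in-flight request gauge as a saturation signal.
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()
in_flight = Gauge(
    "http_requests_in_flight",
    "Requests currently being handled",
    registry=registry,
)

def handle_request():
    # track_inprogress() increments the gauge on entry and decrements
    # on exit, even if the handler raises.
    with in_flight.track_inprogress():
        # ... actual request handling would happen here ...
        return registry.get_sample_value("http_requests_in_flight")

print(handle_request())  # 1.0 while the request is live
print(registry.get_sample_value("http_requests_in_flight"))  # back to 0.0
```

A sustained climb in this gauge with flat request rate is a classic capacity (not demand) signal.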

Key dependencies

Database

  • query latency, lock waits, deadlocks, slow query rate
  • replication lag, cache/buffer hit ratio
  • WAL pressure / write saturation (where relevant)

Cache (Redis/Memcached)

  • hit ratio, command latency, evictions, memory fragmentation

External calls

  • per-upstream rate/errors/latency
  • retry counts, circuit breaker state

Correctness / quality (often overlooked, often critical)

  • idempotency/key collisions
  • job retry rate, saga/compensation events
  • schema migration duration/failures during deploys

SLO ideas & alerting that scales

Example SLOs

  • Latency: 99% of GET /search < 300 ms; 99% of POST /checkout < 800 ms
  • Availability: 99.95% success (non-5xx) over 30 days

Burn-rate alerts (multi-window)

This is one of the cleanest ways to avoid noisy alerting:

  • fast burn: 5m / 1h window catches “it’s on fire”
  • slow burn: 30m / 6h window catches “it’s steadily degrading”
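The arithmetic behind multiwindow burn rates, sketched for a 99.9% availability SLO. The 14.4 and 6 thresholds follow the common pattern from the Google SRE Workbook; the per-window error rates would normally be PromQL query results, shown here as plain numbers:

```python
# Sketch: multiwindow burn-rate evaluation for a 99.9% SLO.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(error_rate):
    """How many times faster than 'sustainable' we are burning budget."""
    return error_rate / ERROR_BUDGET

def should_page(err_5m, err_1h, err_30m, err_6h):
    # Fast burn: both the short and long window must agree, which
    # filters out momentary spikes.
    if burn_rate(err_5m) > 14.4 and burn_rate(err_1h) > 14.4:
        return "fast-burn"   # "it's on fire"
    if burn_rate(err_30m) > 6 and burn_rate(err_6h) > 6:
        return "slow-burn"   # "steadily degrading"
    return None

print(should_page(0.02, 0.02, 0.001, 0.001))        # fast-burn
print(should_page(0.001, 0.001, 0.008, 0.008))      # slow-burn
print(should_page(0.0005, 0.0005, 0.0005, 0.0005))  # None, within budget
```

Requiring both windows to exceed the threshold is the key trick: the short window gives fast detection, the long window proves it is not a blip.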

Deploy safety gates (high ROI)

If error rate or P95 latency increases by ~50% within 10 minutes of a new version, flag it:

  • page the on-call
  • stop rollout / auto-rollback (if you have it)
  • annotate the dashboard with the deploy event
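The gate itself is a simple relative comparison against the pre-deploy baseline. A sketch with illustrative thresholds (the ~50% figure from above):

```python
# Sketch: flag a rollout when error rate or P95 latency regresses by
# more than max_increase (50% here) relative to the pre-deploy baseline.
def deploy_regressed(baseline_err, current_err,
                     baseline_p95_ms, current_p95_ms,
                     max_increase=0.5):
    err_jump = current_err > baseline_err * (1 + max_increase)
    latency_jump = current_p95_ms > baseline_p95_ms * (1 + max_increase)
    return err_jump or latency_jump

# Error rate jumped from 0.5% to 1.2% after the rollout -> flag it.
print(deploy_regressed(0.005, 0.012, 220, 230))  # True

# Small wobble within tolerance -> rollout continues.
print(deploy_regressed(0.005, 0.006, 220, 250))  # False
```

The `version` label recommended earlier is what makes the baseline/current split possible: query the same metric grouped by version and compare the old and new series.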

Infrastructure Observability (K8s/VMs/Network/Storage/DNS)

Infra monitoring is USE-driven, with special attention to “hidden saturation” (throttling, queues, IO wait).

Kubernetes / Containers

Node (USE)

  • CPU/memory utilization vs allocatable
  • pressure signals (memory/cpu/io), throttling
  • inode & disk usage, kubelet health

Pod/container

  • restarts, CrashLoopBackOff, OOM kills
  • CPU throttling, memory working set vs limits
  • open FDs, probe failures, image pull timeouts

Control plane

  • pending pods, scheduling latency
  • etcd quorum health + operation latency

Autoscaling

  • desired vs current replicas
  • scale events
  • pending pods vs quotas

Network & edge

  • LB/Ingress: rate, 4xx/5xx at edge, upstream connection errors
  • TLS handshake time, cert days-to-expiry
  • retransmits/drops, DNS latency/error rate

Storage

  • PV/disk: IOPS, throughput, read/write latency, queue depth, %util
  • fs usage + inodes, PVC binding timeouts
  • managed DB infra signals: CPU, memory, storage, I/O credits, connections, replication lag, failovers

SLO ideas & alerts

  • Cluster SLO: < 1% of ingress requests return 5xx due to upstream unavailable
  • Infra saturation: node CPU > 90% for 15m and pods pending due to CPU → page
  • Storage regression: PV latency P99 > 50ms for 15m → investigate

Closing: How to implement this without boiling the ocean

If you’re starting or cleaning up an existing mess, implement in this order:

  1. Golden Signals dashboards per tier (UI / API / infra)
  2. Histograms + meaningful buckets for latency (plus exemplars if possible)
  3. Low-cardinality labeling with version baked in
  4. SLOs + burn-rate alerts (fewer alerts, more signal)
  5. Drill-down workflow: Alert → dashboard → exemplars → trace → logs

That’s the path from “we have Grafana” to “we can debug production quickly.”

A good next step from here: take your current metric names (or a sample Prometheus scrape) and map them into a concrete set of Grafana dashboards, recording rules, and SLO/burn-rate alerts using your exact labels.