Observability Engineering: Achieving Production Excellence
Charity Majors, Liz Fong-Jones, and George Miranda
- Observability lets you map and explore the full “state space” of a system in granular detail—especially the weird, long-tail, unpredictable behaviors you didn’t plan for.
- Monitoring gives broad-brush approximations of overall health (useful, but limited). It’s better for “known-unknowns,” while observability is for “unknown-unknowns.”
1. The path to observability
- In control theory, observability is how well you can infer a system’s internal state from its external outputs.
- Practical definition: if you can understand any bizarre or novel production state without shipping new code, you have observability.
- Modern cloud-native systems (microservices + polyglot persistence) make debugging harder because the challenge is often finding where the problematic code lives, not understanding the code itself.
- Observability is about preserving context around each request so you can later reconstruct what happened and why.
- Two core data properties:
  - High cardinality matters most (request IDs, user IDs, cart IDs, build numbers, container/host, spans). You can downsample high-cardinality → low, but not the reverse.
  - Dimensionality (“wide events”): telemetry should be arbitrarily wide structured events with potentially hundreds/thousands of key-value pairs to capture rich context.
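A quick sketch of why downsampling only works one way (the events and field names here are invented for illustration):

```python
from collections import Counter

# Hypothetical wide events: each carries high-cardinality fields
# (user_id, request_id) alongside low-cardinality ones (region).
events = [
    {"user_id": "u-1001", "request_id": "r-1", "region": "us-east-1", "duration_ms": 12},
    {"user_id": "u-1002", "request_id": "r-2", "region": "us-east-1", "duration_ms": 340},
    {"user_id": "u-1001", "request_id": "r-3", "region": "eu-west-1", "duration_ms": 18},
]

# High-cardinality -> low-cardinality: collapse raw events into a
# per-region request count (what a metrics system would store).
requests_per_region = Counter(e["region"] for e in events)
print(requests_per_region)  # Counter({'us-east-1': 2, 'eu-west-1': 1})

# The reverse is impossible: from the counter alone you can no longer
# answer "which user issued the slow request in us-east-1?"
```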
- Core workflow: iterative investigation—ask a question, learn from the data, ask the next question, repeat until you find the needle.
- Key capability: explorability—the ability to understand previously unseen system states via open-ended exploration.
2. How debugging practices differ between observability and monitoring
- Metrics are numeric representations of system state over a time interval.
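A small sketch of what "numeric representation over a time interval" means in practice (the latency numbers are made up):

```python
# Per-request latencies observed during one collection interval.
latencies_ms = [12, 15, 11, 980, 14, 13]

# A metric collapses them into a few numbers for the interval...
count = len(latencies_ms)
avg_ms = sum(latencies_ms) / count
max_ms = max(latencies_ms)
print(count, round(avg_ms, 1), max_ms)  # 6 174.2 980

# ...which is enough to trigger an alert, but the context of the one
# 980 ms outlier (which user? which endpoint?) is already gone.
```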
- Monitoring data is often optimized for machines: triggering alerts, declaring recovery, and driving automated decisions.
- A cultural smell: if the “best debugger” is always the most senior/longest-tenured engineer, your system knowledge lives in institutional memory, not in accessible tooling.
- Observability flips that:
  - The best debugger becomes the most curious engineer—someone who can interrogate the system, explore, and follow evidence through open-ended questions.
3. Lessons from scaling without observability
- Example failure mode: with Ruby + a fixed pool of API workers, if a backend gets slightly slower, request queues fill up fast; if it becomes very slow/unresponsive, pools saturate in seconds → system-wide failure.
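A back-of-envelope version of that failure mode using Little’s Law (average concurrency = arrival rate × latency); the pool size, traffic rate, and latencies are hypothetical:

```python
pool_size = 100       # fixed number of API workers
arrival_rate = 500.0  # requests per second

def workers_needed(latency_s: float) -> float:
    # Little's Law: average in-flight requests = arrival rate * latency.
    # When this exceeds pool_size, queues build up and the pool saturates.
    return arrival_rate * latency_s

print(workers_needed(0.1))   # 50.0   -> healthy at 100 ms per request
print(workers_needed(0.25))  # 125.0  -> a "slightly slower" backend already exceeds the pool
print(workers_needed(5.0))   # 2500.0 -> an unresponsive backend saturates it in seconds
```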
- Architectural principle: don’t add unnecessary complexity. Identify the problems you truly need to solve, keep options open, and usually pick boring technology.
- Operational priority: put the bulk of attention/tooling on production first; staging has value, but it’s secondary.
- Traditional “monitoring-era” incident loop:
  - investigate → retrospective → runbook → custom dashboard → “resolved”
  - Works for monoliths where novel failures are rarer.
  - Breaks down in modern systems where novel problems are the norm.
- With observability + high-cardinality slicing, investigation time drops dramatically: you can start from symptoms and follow a breadcrumb trail to root cause—wherever it leads.
4. How observability relates to DevOps, SRE, and cloud native
- Like testability, observability is a system property that requires continuous investment, not a one-time add-on.
- Cloud native (per CNCF) emphasizes loosely coupled systems that are resilient, manageable, and observable, enabling frequent, predictable change with minimal toil.
- As teams ship independently (Agile + cloud native), observability becomes a prerequisite for safe autonomy.
- Observability shifts incident approach:
  - Mature DevOps/SRE teams start from user-visible symptoms of pain and then drill down using observability tooling—rather than relying on a giant catalog of alerts that guess causes.
- Distributed tracing helps replace “keep the whole system graph in your head” with “understand dependencies as they affect this request,” including versioned call paths and compatibility issues.
- Practices that depend on observability:
  - Chaos engineering / continuous verification (compare behavior vs steady state)
  - Feature flags (too many runtime combinations to test exhaustively; need per-user visibility)
  - Progressive delivery (canary, blue/green—know when to halt and why)
  - Incident analysis + blameless postmortems (reconstruct both system behavior and operator understanding via an evidence trail)
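One way to make the feature-flag point concrete: attach the active flag set to each event so you can slice by flag combination after the fact (the flag names and resolution logic below are invented):

```python
# Hypothetical per-user flag resolution; in a real system this would
# come from your feature-flag service.
def resolve_flags(user_id: str) -> dict:
    return {"new_checkout": user_id.endswith("7"), "dark_mode": True}

def build_event(user_id: str, duration_ms: int) -> dict:
    event = {"user_id": user_id, "duration_ms": duration_ms}
    # Record every active flag as a field on the event: there are too
    # many flag combinations to test exhaustively up front, but you can
    # always query "errors where flag.new_checkout=true" afterwards.
    for name, value in resolve_flags(user_id).items():
        event[f"flag.{name}"] = value
    return event

print(build_event("u-1337", 42))
```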
Part 2 — Fundamentals of observability
5. Structured events are the building blocks of observability
- An event is a record of everything that occurred while one particular request interacted with your service.
- “Structured logs” as structured events: ideally one event captures the full unit of work:
  - inputs
  - derived/resolved attributes discovered along the way
  - service conditions during execution
  - result details
- The point is to support extremely specific queries that isolate real-world edge cases—often requiring many high-cardinality dimensions layered together (user, device, language pack, install date, shard, region, etc.).
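A sketch of the “one wide event per unit of work” idea: accumulate fields as the request proceeds, emit once at the end, then run a very specific query layering several dimensions (all field names and values are invented):

```python
def handle_request(user_id: str, device: str, region: str) -> dict:
    # Start the event with the inputs.
    event = {"user_id": user_id, "device": device, "region": region}
    # Add attributes resolved along the way and service conditions.
    event["shard"] = hash(user_id) % 8
    event["cache_hit"] = False
    # Finish with result details, then emit the single wide event.
    event.update({"status": 500, "duration_ms": 1200})
    return event

events = [
    handle_request("u-1", "android", "eu-west-1"),
    handle_request("u-2", "ios", "us-east-1"),
]

# A "needle" query: errors, on one device type, in one region.
needles = [e for e in events
           if e["status"] == 500
           and e["device"] == "android"
           and e["region"] == "eu-west-1"]
print(len(needles))  # 1
```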
6. Stitching events into traces
- A distributed trace is an interrelated series of events describing a single request’s journey through multiple services.
- Tracing is a classic debugging technique: record execution details to diagnose problems.
- Distributed tracing matters because modern requests routinely cross process/machine/network boundaries—including hybrid cases like on-prem → cloud → SaaS → back.
- Desired outcome of tracing: clearly see relationships among services so you can pinpoint where failures happen and what contributes to performance issues.
- A waterfall visualization shows how a request’s total time is built up from intermediate steps, ending in a final cumulative duration (the full request time).
- Each “bar” (chunk) in the waterfall is a trace span (or span). A trace is the collection of these related spans for a single request.
- Spans form a hierarchy:
  - The root span is the top-level span for the request and typically covers the entire end-to-end duration.
  - Spans nested inside the root span are child operations (and may themselves have children), creating a parent → child tree.
  - Example relationship: if Service A calls Service B, which calls Service C, then Span A is the parent of Span B, and Span B is the parent of Span C; Span C can be a leaf span with no children.
- Waterfall views make two things easy to see at a glance:
  - Structure (who called whom) via nesting/indentation.
  - Performance contribution via bar lengths and timing—e.g., a root span of ~86s containing child spans like UpdateIntercom (~20s), UpdateSalesforce (~64s), and UpdateStripe (~1.6s) clearly reveals which work dominates the overall request latency.
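The span relationships above can be sketched as plain data; the span names and ~86s example timings are taken from the notes, while the id scheme is invented:

```python
# Each span records its own id, its parent's id, and its duration.
spans = [
    {"id": "root", "parent": None,   "name": "handle_request",   "duration_s": 86.0},
    {"id": "s1",   "parent": "root", "name": "UpdateIntercom",   "duration_s": 20.0},
    {"id": "s2",   "parent": "root", "name": "UpdateSalesforce", "duration_s": 64.0},
    {"id": "s3",   "parent": "root", "name": "UpdateStripe",     "duration_s": 1.6},
]

# The parent -> child tree: direct children of the root span.
children = [s for s in spans if s["parent"] == "root"]

# The waterfall question "which work dominates the request?" is just
# a max over the children's durations.
slowest = max(children, key=lambda s: s["duration_s"])
print(slowest["name"])  # UpdateSalesforce
```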
7. Instrumentation with OpenTelemetry
- In the mid-to-late 2010s, two major open standards for collecting telemetry emerged: OpenTracing (under the CNCF) and OpenCensus (sponsored by Google), both aiming to provide language libraries that collect telemetry and send it to a backend of your choice in real time.
- Rather than continue as competing standards, the two efforts merged in 2019 into OpenTelemetry, a unified project under the CNCF.
Decided to come back to the book later; notes stop early in chapter 7.