Observability Engineering: Achieving Production Excellence
Charity Majors, Liz Fong-Jones, and George Miranda
- Observability lets you map and explore the full “state space” of a system in granular detail—especially the weird, long-tail, unpredictable behaviors you didn’t plan for.
- Monitoring gives broad-brush approximations of overall health (useful, but limited). It’s better for “known-unknowns,” while observability is for “unknown-unknowns.”
1. The path to observability
- In control theory, observability is how well you can infer a system’s internal state from its external outputs.
- Practical definition: if you can understand any bizarre or novel production state without shipping new code, you have observability.
- Modern cloud-native systems (microservices + polyglot persistence) make debugging harder because the challenge is often finding where the problematic code lives, not understanding the code itself.
- Observability is about preserving context around each request so you can later reconstruct what happened and why.
- Two core data properties:
  - High cardinality matters most (request IDs, user IDs, cart IDs, build numbers, container/host, spans). You can downsample high-cardinality → low, but not the reverse.
  - Dimensionality (“wide events”): telemetry should be arbitrarily wide structured events with potentially hundreds/thousands of key-value pairs to capture rich context.
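A quick sketch of why downsampling only works one way (the events and field names here are invented for illustration):

```python
from collections import Counter

# Hypothetical wide events: each carries high-cardinality fields
# (user_id, request_id) alongside low-cardinality ones (region).
events = [
    {"user_id": "u-1001", "request_id": "r-1", "region": "us-east-1", "duration_ms": 12},
    {"user_id": "u-1002", "request_id": "r-2", "region": "us-east-1", "duration_ms": 340},
    {"user_id": "u-1001", "request_id": "r-3", "region": "eu-west-1", "duration_ms": 18},
]

# High-cardinality -> low-cardinality: collapse raw events into a
# per-region request count (what a metrics system would store).
requests_per_region = Counter(e["region"] for e in events)
print(requests_per_region)  # Counter({'us-east-1': 2, 'eu-west-1': 1})

# The reverse is impossible: from the counter alone you can no longer
# answer "which user issued the slow request in us-east-1?"
```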
- Core workflow: iterative investigation—ask a question, learn from the data, ask the next question, repeat until you find the needle.
- Key capability: explorability—the ability to understand previously unseen system states via open-ended exploration.
2. How debugging practices differ between observability and monitoring
- Metrics are numeric representations of system state over a time interval.
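A small sketch of what "numeric representation over a time interval" means in practice (the latency numbers are made up):

```python
# Per-request latencies observed during one collection interval.
latencies_ms = [12, 15, 11, 980, 14, 13]

# A metric collapses them into a few numbers for the interval...
count = len(latencies_ms)
avg_ms = sum(latencies_ms) / count
max_ms = max(latencies_ms)
print(count, round(avg_ms, 1), max_ms)  # 6 174.2 980

# ...which is enough to trigger an alert, but the context of the one
# 980 ms outlier (which user? which endpoint?) is already gone.
```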
- Monitoring data is often optimized for machines: triggering alerts, declaring recovery, and driving automated decisions.
- A cultural smell: if the “best debugger” is always the most senior/longest-tenured engineer, your system knowledge lives in institutional memory, not in accessible tooling.
- Observability flips that:
  - The best debugger becomes the most curious engineer—someone who can interrogate the system, explore, and follow evidence through open-ended questions.
3. Lessons from scaling without observability
- Example failure mode: with Ruby + a fixed pool of API workers, if a backend gets slightly slower, request queues fill up fast; if it becomes very slow/unresponsive, pools saturate in seconds → system-wide failure.
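A back-of-envelope version of that failure mode using Little’s Law (average concurrency = arrival rate × latency); the pool size, traffic rate, and latencies are hypothetical:

```python
pool_size = 100       # fixed number of API workers
arrival_rate = 500.0  # requests per second

def workers_needed(latency_s: float) -> float:
    # Little's Law: average in-flight requests = arrival rate * latency.
    # When this exceeds pool_size, queues build up and the pool saturates.
    return arrival_rate * latency_s

print(workers_needed(0.1))   # 50.0   -> healthy at 100 ms per request
print(workers_needed(0.25))  # 125.0  -> a "slightly slower" backend already exceeds the pool
print(workers_needed(5.0))   # 2500.0 -> an unresponsive backend saturates it in seconds
```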
- Architectural principle: don’t add unnecessary complexity. Identify the problems you truly need to solve, keep options open, and usually pick boring technology.
- Operational priority: put the bulk of attention/tooling on production first; staging has value, but it’s secondary.
- Traditional “monitoring-era” incident loop:
  - investigate → retrospective → runbook → custom dashboard → “resolved”
  - Works for monoliths where novel failures are rarer.
  - Breaks down in modern systems where novel problems are the norm.
- With observability + high-cardinality slicing, investigation time drops dramatically: you can start from symptoms and follow a breadcrumb trail to root cause—wherever it leads.
4. How observability relates to DevOps, SRE, and cloud native
- Like testability, observability is a system property that requires continuous investment, not a one-time add-on.
- Cloud native (per CNCF) emphasizes loosely coupled systems that are resilient, manageable, and observable, enabling frequent, predictable change with minimal toil.
- As teams ship independently (Agile + cloud native), observability becomes a prerequisite for safe autonomy.
- Observability shifts incident approach:
  - Mature DevOps/SRE teams start from user-visible symptoms of pain and then drill down using observability tooling—rather than relying on a giant catalog of alerts that guess causes.
- Distributed tracing helps replace “keep the whole system graph in your head” with “understand dependencies as they affect this request,” including versioned call paths and compatibility issues.
- Practices that depend on observability:
  - Chaos engineering / continuous verification (compare behavior vs steady state)
  - Feature flags (too many runtime combinations to test exhaustively; need per-user visibility)
  - Progressive delivery (canary, blue/green—know when to halt and why)
  - Incident analysis + blameless postmortems (reconstruct both system behavior and operator understanding via an evidence trail)
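One way to make the feature-flag point concrete: attach the active flag set to each event so you can slice by flag combination after the fact (the flag names and resolution logic below are invented):

```python
# Hypothetical per-user flag resolution; in a real system this would
# come from your feature-flag service.
def resolve_flags(user_id: str) -> dict:
    return {"new_checkout": user_id.endswith("7"), "dark_mode": True}

def build_event(user_id: str, duration_ms: int) -> dict:
    event = {"user_id": user_id, "duration_ms": duration_ms}
    # Record every active flag as a field on the event: there are too
    # many flag combinations to test exhaustively up front, but you can
    # always query "errors where flag.new_checkout=true" afterwards.
    for name, value in resolve_flags(user_id).items():
        event[f"flag.{name}"] = value
    return event

print(build_event("u-1337", 42))
```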
Part 2 — Fundamentals of observability
5. Structured events are the building blocks of observability
- An event is a record of everything that occurred while one particular request interacted with your service.
- “Structured logs” as structured events: ideally one event captures the full unit of work:
  - inputs
  - derived/resolved attributes discovered along the way
  - service conditions during execution
  - result details
- The point is to support extremely specific queries that isolate real-world edge cases—often requiring many high-cardinality dimensions layered together (user, device, language pack, install date, shard, region, etc.).
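A sketch of the “one wide event per unit of work” idea: accumulate fields as the request proceeds, emit once at the end, then run a very specific query layering several dimensions (all field names and values are invented):

```python
def handle_request(user_id: str, device: str, region: str) -> dict:
    # Start the event with the inputs.
    event = {"user_id": user_id, "device": device, "region": region}
    # Add attributes resolved along the way and service conditions.
    event["shard"] = hash(user_id) % 8
    event["cache_hit"] = False
    # Finish with result details, then emit the single wide event.
    event.update({"status": 500, "duration_ms": 1200})
    return event

events = [
    handle_request("u-1", "android", "eu-west-1"),
    handle_request("u-2", "ios", "us-east-1"),
]

# A "needle" query: errors, on one device type, in one region.
needles = [e for e in events
           if e["status"] == 500
           and e["device"] == "android"
           and e["region"] == "eu-west-1"]
print(len(needles))  # 1
```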
6. Stitching events into traces
- A distributed trace is an interrelated series of events describing a single request’s journey through multiple services.
- Tracing is a classic debugging technique: record execution details to diagnose problems.
- Distributed tracing matters because modern requests routinely cross process/machine/network boundaries—including hybrid cases like on-prem → cloud → SaaS → back.
- Desired outcome of tracing: clearly see relationships among services so you can pinpoint where failures happen and what contributes to performance issues.
- A waterfall visualization shows how a request’s total time is built up from intermediate steps, ending in a final cumulative duration (the full request time).
- Each “bar” (chunk) in the waterfall is a trace span (or span). A trace is the collection of these related spans for a single request.
- Spans form a hierarchy:
  - The root span is the top-level span for the request and typically covers the entire end-to-end duration.
  - Spans nested inside the root span are child operations (and may themselves have children), creating a parent → child tree.
  - Example relationship: if Service A calls Service B, which calls Service C, then Span A is the parent of Span B, and Span B is the parent of Span C; Span C can be a leaf span with no children.
- Waterfall views make two things easy to see at a glance:
  - Structure (who called whom) via nesting/indentation.
  - Performance contribution via bar lengths and timing—e.g., a root span of ~86s containing child spans like UpdateIntercom (~20s), UpdateSalesforce (~64s), and UpdateStripe (~1.6s) clearly reveals which work dominates the overall request latency.
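The span relationships above can be sketched as plain data; the span names and ~86s example timings are taken from the notes, while the id scheme is invented:

```python
# Each span records its own id, its parent's id, and its duration.
spans = [
    {"id": "root", "parent": None,   "name": "handle_request",   "duration_s": 86.0},
    {"id": "s1",   "parent": "root", "name": "UpdateIntercom",   "duration_s": 20.0},
    {"id": "s2",   "parent": "root", "name": "UpdateSalesforce", "duration_s": 64.0},
    {"id": "s3",   "parent": "root", "name": "UpdateStripe",     "duration_s": 1.6},
]

# The parent -> child tree: direct children of the root span.
children = [s for s in spans if s["parent"] == "root"]

# The waterfall question "which work dominates the request?" is just
# a max over the children's durations.
slowest = max(children, key=lambda s: s["duration_s"])
print(slowest["name"])  # UpdateSalesforce
```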
7. Instrumentation with OpenTelemetry
- In the mid-to-late 2010s, two major open standards for collecting telemetry emerged: OpenTracing (under the CNCF) and OpenCensus (sponsored by Google), both aiming to provide language libraries that collect telemetry and send it to a backend of your choice in real time.
- Rather than continue as competing standards, the two efforts merged in 2019 into OpenTelemetry, a unified project under the CNCF.
Decided to come back to the book later; notes stop early in chapter 7.