
Observability Engineering: Achieving Production Excellence

Charity Majors, Liz Fong-Jones, and George Miranda

  • Observability lets you map and explore the full “state space” of a system in granular detail—especially the weird, long-tail, unpredictable behaviors you didn’t plan for.
  • Monitoring gives broad-brush approximations of overall health (useful, but limited). It’s better for “known unknowns,” while observability is for “unknown unknowns.”

1. The path to observability

  • In control theory, observability is how well you can infer a system’s internal state from its external outputs.

  • Practical definition: if you can understand any bizarre or novel production state without shipping new code, you have observability.

  • Modern cloud-native systems (microservices + polyglot persistence) make debugging harder because the challenge is often finding where the problematic code lives, not understanding the code itself.

  • Observability is about preserving context around each request so you can reconstruct what happened and why later.

  • Two core data properties:

    • High cardinality matters most (request IDs, user IDs, cart IDs, build numbers, container/host, spans). You can always aggregate high-cardinality data down to low-cardinality views, but you can’t recover detail in the other direction.
    • Dimensionality (“wide events”): telemetry should be arbitrarily wide structured events with potentially hundreds/thousands of key-value pairs to capture rich context.
  • Core workflow: iterative investigation—ask a question, learn from the data, ask the next question, repeat until you find the needle.

  • Key capability: explorability—the ability to understand previously unseen system states via open-ended exploration.
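The two data properties above can be made concrete with a toy wide event. This is a sketch, not a prescribed schema: every field name, and the helper `build_wide_event`, is illustrative.

```python
import json
import time
import uuid

def build_wide_event(request, response, timings):
    """Assemble one arbitrarily wide structured event for a single request.

    All field names here are illustrative, not a prescribed schema."""
    return {
        # High-cardinality identifiers: nearly unique per request/user.
        "request_id": str(uuid.uuid4()),
        "user_id": request["user_id"],
        "cart_id": request.get("cart_id"),
        "build_number": "2024.06.1-rc3",
        "hostname": "api-7f9c",
        # Request inputs and attributes derived along the way.
        "endpoint": request["endpoint"],
        "shard": request["user_id"] % 16,
        # Service conditions and result details.
        "duration_ms": timings["total_ms"],
        "db_duration_ms": timings["db_ms"],
        "status_code": response["status"],
        "timestamp": time.time(),
    }

event = build_wide_event(
    request={"user_id": 41234, "cart_id": "c-88", "endpoint": "/checkout"},
    response={"status": 200},
    timings={"total_ms": 312.5, "db_ms": 121.0},
)
print(json.dumps(event, indent=2))
```

Adding another dimension is just adding another key, which is why width (dimensionality) is cheap to grow while cardinality must be preserved from the start.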


2. How debugging practices differ between observability and monitoring

  • Metrics are numeric representations of system state over a time interval.

  • Monitoring data is often optimized for machines: triggering alerts, declaring recovery, and driving automated decisions.

  • A cultural smell: if the “best debugger” is always the most senior/longest-tenured engineer, your system knowledge lives in institutional memory, not in accessible tooling.

  • Observability flips that:

    • The best debugger becomes the most curious engineer—someone who can interrogate the system, explore, and follow evidence through open-ended questions.

3. Lessons from scaling without observability

  • Example failure mode: with Ruby + a fixed pool of API workers, if a backend gets slightly slower, request queues fill up fast; if it becomes very slow/unresponsive, pools saturate in seconds → system-wide failure.

  • Architectural principle: don’t add unnecessary complexity. Identify the problems you truly need to solve, keep options open, and usually pick boring technology.

  • Operational priority: put the bulk of attention/tooling on production first; staging has value, but it’s secondary.

  • Traditional “monitoring-era” incident loop:

    • investigate → retrospective → runbook → custom dashboard → “resolved”
    • Works for monoliths where novel failures are rarer.
    • Breaks down in modern systems where novel problems are the norm.
  • With observability + high-cardinality slicing, investigation time drops dramatically: you can start from symptoms and follow a breadcrumb trail to root cause—wherever it leads.
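The fixed-pool failure mode above follows directly from Little's law (L = λW): in-flight requests equal arrival rate times backend latency. A toy model, with invented numbers, shows how a modest pool survives a slight slowdown but saturates in under a second when the backend hangs:

```python
def time_to_saturation(pool_size, arrival_rate_per_s, backend_latency_s):
    """Seconds until a fixed worker pool is fully occupied, assuming every
    in-flight request holds a worker for backend_latency_s.

    Toy model: steady arrivals, no queue shedding, no timeouts."""
    in_flight = arrival_rate_per_s * backend_latency_s  # Little's law: L = lambda * W
    if in_flight <= pool_size:
        return None  # pool keeps up; it never saturates
    # Workers fill at the arrival rate until all pool_size are busy.
    return pool_size / arrival_rate_per_s

# 100 workers at 200 req/s: fine at 100 ms, catastrophic at 5 s.
print(time_to_saturation(100, 200, 0.1))  # None -- only 20 in flight
print(time_to_saturation(100, 200, 5.0))  # 0.5 -- saturated in half a second
```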


4. How observability relates to DevOps, SRE, and cloud native practices

  • Like testability, observability is a system property that requires continuous investment, not a one-time add-on.

  • Cloud native (per CNCF) emphasizes loosely coupled systems that are resilient, manageable, and observable, enabling frequent, predictable change with minimal toil.

  • As teams ship independently (Agile + cloud native), observability becomes a prerequisite for safe autonomy.

  • Observability shifts incident approach:

    • Mature DevOps/SRE teams start from user-visible symptoms of pain and then drill down using observability tooling—rather than relying on a giant catalog of alerts that guess causes.
  • Distributed tracing helps replace “keep the whole system graph in your head” with “understand dependencies as they affect this request,” including versioned call paths and compatibility issues.

  • Practices that depend on observability:

    • Chaos engineering / continuous verification (compare behavior vs steady state)
    • Feature flags (too many runtime combinations to test exhaustively; need per-user visibility)
    • Progressive delivery (canary, blue/green—know when to halt and why)
    • Incident analysis + blameless postmortems (reconstruct both system behavior and operator understanding via an evidence trail)
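One way the feature-flag point plays out: record every evaluated flag on the request's telemetry event, so behavior can later be sliced per user and per flag combination. A minimal sketch; the flag names, rollout logic, and event shape are all invented for illustration.

```python
def flags_for(user_id, rollout_pct=10):
    """Toy flag evaluation: deterministic percentage rollout per user."""
    return {
        "new_checkout": user_id % 100 < rollout_pct,
        "fast_search": True,  # fully rolled out
    }

def handle_request(user_id):
    flags = flags_for(user_id)
    event = {"user_id": user_id, "status": 200}
    # Annotate the event with every evaluated flag, so queries like
    # "errors where flag.new_checkout=true" become possible later.
    for name, enabled in flags.items():
        event[f"flag.{name}"] = enabled
    return event

print(handle_request(104))  # new_checkout enabled (104 % 100 = 4 < 10)
print(handle_request(150))  # new_checkout disabled (150 % 100 = 50)
```

Because the runtime combinations of flags can't be tested exhaustively, this per-request record is what makes a bad combination findable after the fact.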

Part 2 — Fundamentals of observability

5. Structured events are the building blocks of observability

  • An event is a record of everything that occurred while one particular request interacted with your service.

  • “Structured logs” as structured events: ideally one event captures the full unit of work:

    • inputs
    • derived/resolved attributes discovered along the way
    • service conditions during execution
    • result details
  • The point is to support extremely specific queries that isolate real-world edge cases—often requiring many high-cardinality dimensions layered together (user, device, language pack, install date, shard, region, etc.).
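Layering those dimensions amounts to conjunctive filtering over raw events. A toy in-memory version (real systems query a columnar store, and these field names are illustrative):

```python
# Toy event store: isolate one real-world edge case by layering
# several high-cardinality dimensions at once.
events = [
    {"user_id": 1, "device": "ios", "language_pack": "fr",
     "region": "eu-1", "error": True},
    {"user_id": 1, "device": "ios", "language_pack": "en",
     "region": "eu-1", "error": False},
    {"user_id": 2, "device": "android", "language_pack": "fr",
     "region": "eu-1", "error": False},
]

def query(events, **conditions):
    """Return events matching every key=value condition."""
    return [e for e in events
            if all(e.get(k) == v for k, v in conditions.items())]

matches = query(events, device="ios", language_pack="fr",
                region="eu-1", error=True)
print(matches)  # the single failing combination
```

Each added condition is another dimension layered on, which is exactly what pre-aggregated metrics can't do after the fact.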


6. Stitching events into traces

  • A distributed trace is an interrelated series of events describing a single request’s journey through multiple services.
  • Tracing is a classic debugging technique: record execution details to diagnose problems.
  • Distributed tracing matters because modern requests routinely cross process/machine/network boundaries—including hybrid cases like on-prem → cloud → SaaS → back.
  • Desired outcome of tracing: clearly see relationships among services so you can pinpoint where failures happen and what contributes to performance issues.

A waterfall visualization shows how a request’s total time is built up from intermediate steps, ending in a final cumulative duration (the full request time).

Each “bar” (chunk) in the waterfall is a trace span (or span). A trace is the collection of these related spans for a single request.

Spans form a hierarchy:

  • The root span is the top-level span for the request and typically covers the entire end-to-end duration.

  • Spans nested inside the root span are child operations (and may themselves have children), creating a parent → child tree.

  • Example relationship: if Service A calls Service B, which calls Service C, then Span A is the parent of Span B, and Span B is the parent of Span C; Span C can be a leaf span with no children.

Waterfall views make two things easy to see at a glance:

  • Structure (who called whom) via nesting/indentation.

  • Performance contribution via bar lengths and timing—e.g., a root span of ~86s containing child spans like UpdateIntercom (~20s), UpdateSalesforce (~64s), and UpdateStripe (~1.6s) clearly reveals which work dominates the overall request latency.
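That waterfall can be modeled as a toy span tree. The child durations come from the example above; the root span's name and the `Span` class itself are invented for illustration, not a tracing library.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_s: float
    children: list = field(default_factory=list)

# Root span covers the whole request; children are the nested operations.
root = Span("batch_update", 86.0, [
    Span("UpdateIntercom", 20.0),
    Span("UpdateSalesforce", 64.0),
    Span("UpdateStripe", 1.6),
])

def dominant_child(span):
    """The child span contributing the most to the parent's latency."""
    return max(span.children, key=lambda s: s.duration_s)

print(dominant_child(root).name)  # UpdateSalesforce
```

Reading the tree this way is the programmatic analogue of scanning the waterfall for the longest bar.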

7. Instrumentation with OpenTelemetry

In the mid–late 2010s, two major open standards for collecting telemetry emerged:

OpenTracing (under the CNCF) and OpenCensus (sponsored by Google), both aiming to provide language libraries that collect telemetry and send it to a backend of your choice in real time.

Rather than continue as competing standards, the efforts merged in 2019 into OpenTelemetry, a unified project under the CNCF.

Decided to come back later to the book, notes stop early on in chapter 7.