The 3 Questions Monitoring Answers
Good monitoring is basically a structured way to answer three escalating questions:
- Is the service on? Basic availability: is anything responding at all?
- Is it functioning as expected? Correctness: are requests succeeding, are core flows working, are error rates acceptable?
- Is it functioning well? Performance and experience: is it fast enough, stable enough, and within SLOs?
A useful mental model: monitoring doesn’t “fix” incidents—it gives you telemetry that helps you detect and localize problems so humans (or automation) can resolve them.
Key Incident Metrics: MTTD and MTTR
- MTTD (Mean Time to Detection): how long it takes to notice something is wrong.
- MTTR (Mean Time to Resolution): how long it takes to restore service.
Monitoring primarily drives MTTD down (fast detection). Strong observability (especially traces + logs + good context) is what typically drives MTTR down (faster root cause and verification).
Levels of Monitoring: UI, Service, and Infrastructure
Monitoring works best when you layer it:
1) UI Layer (User Experience)
You’re validating what users feel: loading, responsiveness, stability.
Common standard: Core Web Vitals
- LCP (Largest Contentful Paint): “How long before the user feels the page has loaded?”
- FID (First Input Delay): perceived responsiveness (delay before the first interaction is handled). (Note: INP, Interaction to Next Paint, has since replaced FID in Core Web Vitals, but the intent is the same: interaction latency.)
- CLS (Cumulative Layout Shift): visual stability (unexpected layout movement).
2) Service Layer (APIs / microservices)
Here you monitor request flows and backend behavior.
Best practice: RED method (and often Golden Signals too)
- Rate: throughput (requests/sec)
- Errors: failed requests
- Duration: latency (response time / transaction time)
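To make the RED method concrete, here is a minimal sketch that computes all three signals from a batch of request records. The records, the window length, and the crude nearest-rank p95 are all illustrative assumptions; production systems derive these from streaming metrics and histograms, not in-memory lists.

```python
# Hypothetical request records for one 60-second window:
# (route, status_code, duration_seconds)
requests = [
    ("/api/users", 200, 0.12),
    ("/api/users", 200, 0.30),
    ("/api/users", 500, 0.90),
    ("/api/orders", 200, 0.05),
]
window_seconds = 60

# Rate: throughput over the window (requests/sec).
rate = len(requests) / window_seconds

# Errors: fraction of requests that failed.
error_rate = sum(1 for _, status, _ in requests if status >= 500) / len(requests)

# Duration: a crude nearest-rank p95; real systems use histograms.
durations = sorted(d for _, _, d in requests)
p95 = durations[min(int(0.95 * len(durations)), len(durations) - 1)]

print(f"rate={rate:.3f}/s error_rate={error_rate:.0%} p95={p95}s")
```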
3) Infrastructure Layer (hosts, nodes, clusters)
Here you monitor capacity and resource pressure.
Best practice: USE method
- Utilization: percent busy (CPU, disk, memory)
- Saturation: “how queued up is it?” (e.g., run queue length, network queues) — closer to zero is generally healthier
- Errors: hardware / IO failures (disk write errors, NIC errors) — zero is the goal
Four Golden Signals
The Four Golden Signals are a classic cross-layer service reliability set:
- Latency
- Traffic
- Errors
- Saturation
You can think of it as RED + Saturation (and it works well across service and infrastructure views).
Monitoring vs Observability (and how they fit)
A clean relationship:
- Monitoring is a subset of observability.
- Monitoring tells you something is wrong (often via thresholds / alerts).
- Observability is collecting actionable, high-context telemetry to tell you when, where, and why an issue occurs.
If monitoring is “the smoke alarm,” observability is “the fire investigator + building blueprint.”
MELT: The Four Telemetry Pillars
A widely used framework for observability telemetry is MELT:
Metrics
- Aggregated measurements over time (counts, rates, percentiles).
- Best for: alerting, dashboards, trend analysis.
Events
- Something that happened at a point in time.
- Best for: audits, state changes, feature flags, deployments.
Logs
- Detailed records of events (often text or structured JSON).
- Best for: deep debugging, error details, edge cases.
Traces
- The end-to-end path of a request across services.
- Best for: distributed debugging, pinpointing bottlenecks, dependency mapping.
Metrics Collection Models: Push vs Scrape
Push
Apps send metrics to a remote endpoint (TCP/UDP/HTTP).
Pros: simpler for short-lived jobs and constrained networks.
Cons: harder to scale safely; backpressure and reliability can get tricky; more moving parts at the sender.
Scrape (Pull)
Services expose a /metrics endpoint, and the collector reads it on an interval.
Example: Prometheus (the canonical scrape system)
Why scraping is often more scalable:
- the collector controls polling rate,
- failures are isolated to the collector,
- you standardize exposure and discovery,
- it avoids “everyone pushing everywhere.”
Practical Prometheus components you’ll see:
- Exporters: expose metrics for systems that don’t natively (e.g., node exporter)
- Pushgateway: supports push-style for short-lived batch jobs, while Prometheus still scrapes the gateway
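The text a scraper reads from /metrics is just one series per line: metric name, labels, value. A minimal sketch of rendering that exposition format (render_metrics and the sample counters are hypothetical; real services normally use a client library such as prometheus_client rather than hand-rolling this):

```python
def render_metrics(counters):
    """Render counters in Prometheus text exposition style.

    counters: {(metric_name, ((label, value), ...)): count}
    """
    lines = []
    for (name, labels), value in sorted(counters.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

counters = {
    ("http_requests_total", (("method", "GET"), ("status", "200"))): 42,
    ("http_requests_total", (("method", "GET"), ("status", "500"))): 3,
}

# A scraper (e.g. Prometheus) would GET this text from /metrics on an interval.
print(render_metrics(counters))
```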
Prometheus Basics: Time Series + Labels
Prometheus stores time series, uniquely identified by:
- a metric name
- optional labels (key-value pairs)
Format:
<metric_name>{key="value", key="value"}
Labels are extremely powerful for slicing data (service, instance, route, status code), but also dangerous if you create high-cardinality labels (userId, requestId, full URL) that blow up storage and query cost.
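The cardinality risk is multiplicative: in the worst case you get one time series per combination of label values. A back-of-the-envelope sketch (the label counts are invented for illustration):

```python
from math import prod

# Distinct values per label (illustrative numbers).
safe_labels = {"service": 10, "route": 30, "status": 5}
risky_labels = {**safe_labels, "user_id": 100_000}

def max_series(label_cardinalities):
    # Worst case: one time series per combination of label values.
    return prod(label_cardinalities.values())

print(max_series(safe_labels))   # 1,500 potential series
print(max_series(risky_labels))  # 150,000,000 potential series
```

One unbounded label turns a cheap metric into a storage and query problem, which is why identifiers like userId belong in logs or traces, not metric labels.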
PromQL Concepts You Actually Use
Data types (in practice)
Prometheus values are numeric (stored as floats). Strings exist in limited contexts, but most PromQL work is numeric.
Vectors: instant vs range
- Instant vector: one value per series at an evaluation time.
  Example: http_requests_total
- Range vector: a window of samples per series.
  Example: http_requests_total[5m]
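Range vectors are what functions like rate() consume: given a counter's samples over the window, rate() yields the per-second increase. A rough sketch of the idea (simple_rate is a hypothetical helper; real PromQL rate() also handles counter resets and extrapolates to the window boundaries):

```python
def simple_rate(samples):
    """Per-second increase of a counter over a range-vector window.

    samples: list of (unix_timestamp, counter_value), oldest first.
    Sketch only: no counter-reset handling, no extrapolation.
    """
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

# http_requests_total sampled every 60s over a 5m window:
window = [(0, 100), (60, 130), (120, 160), (180, 190), (240, 220)]
print(simple_rate(window))  # 0.5 requests/sec
```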
Operators
PromQL supports:
- arithmetic: + - * / % ^
- comparisons: == != < <= > >=
Two behaviors worth remembering:
- Comparisons filter by default: vector == 10 keeps only series whose current sample equals 10. Add the bool modifier to keep every series and return 1 (true) or 0 (false) instead.
- Binary operations between two instant vectors are pairwise: only series with matching label sets combine.
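The filter-vs-bool distinction is easy to see in a toy model. Here an instant vector is just a dict from label sets to values (the labels and values are invented; this mimics PromQL semantics, it is not PromQL):

```python
# Instant vector as {label_set: value}, labels as a frozenset of pairs.
vector = {
    frozenset({("route", "/a")}): 10.0,
    frozenset({("route", "/b")}): 7.0,
}

def compare_filter(vec, threshold):
    # Default PromQL behavior: keep only series whose sample passes.
    return {k: v for k, v in vec.items() if v == threshold}

def compare_bool(vec, threshold):
    # With the `bool` modifier: keep every series, values become 1 or 0.
    return {k: (1.0 if v == threshold else 0.0) for k, v in vec.items()}

print(compare_filter(vector, 10.0))  # only the /a series survives
print(compare_bool(vector, 10.0))    # /a -> 1.0, /b -> 0.0
```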
Set operators
- and: intersection
- or: union
- unless: left-side exclusion
These are lifesavers for “show me series that are missing / different / only present in one place.”
Matchers / selectors
Label matchers:
- = : exact match
- != : not equal
- =~ : regex match
- !~ : regex non-match
Example pattern:
http_requests_total{job="api", status=~"5.."}
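One detail that trips people up: PromQL regex matchers are fully anchored, so the pattern must match the entire label value. Python's re.fullmatch gives the same behavior (label_matches is a hypothetical helper for illustration):

```python
import re

def label_matches(value, pattern):
    # PromQL regex matchers (=~, !~) are fully anchored:
    # the pattern must match the entire label value.
    return re.fullmatch(pattern, value) is not None

assert label_matches("500", "5..")
assert label_matches("503", "5..")
assert not label_matches("200", "5..")
assert not label_matches("5000", "5..")  # anchoring: no partial match
print('status=~"5.." matches exactly the 5xx statuses')
```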
Aggregations
Aggregations reduce series count by combining values:
Common ones:
- sum, min, max, avg, count
- topk, bottomk
- stddev, stdvar
- count_values
- group (every result value becomes 1; useful for grouping series by labels)
Grouping modifiers:
- by (...): keep only these labels
- without (...): drop these labels
Time shifting
offset 10m lets you compare current behavior to the past.
Example use-case: “is error rate worse than 10 minutes ago?”
Useful functions (high-signal)
- absent() / absent_over_time(): great for "this metric disappeared" alerts.
- clamp_min, clamp_max, clamp: useful for bounding values and filtering weirdness.
- delta / idelta: use on gauges (not counters).
- *_over_time (on range vectors): avg_over_time, sum_over_time, min_over_time, max_over_time, count_over_time
- sort, sort_desc
- time() and timestamp()
Grafana Note
- Grafana config files (.ini) use a semicolon (;) to comment out a line.
(And in practice: Grafana becomes the “lens” over Prometheus—dashboards, alerting, and correlations across sources.)
A Practical “Put It All Together” Monitoring Recipe
If you want one actionable checklist for a service:
Service (RED + Golden Signals)
- Rate by route/status
- Error rate by route/status
- Latency p50/p95/p99 by route
- Saturation indicators for dependencies (DB pool, queue depth)
Infrastructure (USE)
- CPU utilization + run queue (saturation)
- Disk IO latency + disk errors
- Network retransmits / queue + errors
UI (Core Web Vitals)
- LCP / interaction responsiveness / CLS by page template
- correlate spikes with deploys, CDN changes, JS bundle size, API latency
And always tie alerts back to: MTTD down, MTTR down.