The 3 Questions Monitoring Answers
Good monitoring is basically a structured way to answer three escalating questions:
- Is the service on? Basic availability: is anything responding at all?
- Is it functioning as expected? Correctness: are requests succeeding, are core flows working, are error rates acceptable?
- Is it functioning well? Performance and experience: is it fast enough, stable enough, and within SLOs?
A useful mental model: monitoring doesn’t “fix” incidents—it gives you telemetry that helps you detect and localize problems so humans (or automation) can resolve them.
Key Incident Metrics: MTTD and MTTR
- MTTD (Mean Time to Detection): how long it takes to notice something is wrong.
- MTTR (Mean Time to Resolution): how long it takes to restore service.
Monitoring primarily drives MTTD down (fast detection). Strong observability (especially traces + logs + good context) is what typically drives MTTR down (faster root cause and verification).
Levels of Monitoring: UI, Service, and Infrastructure
Monitoring works best when you layer it:
1) UI Layer (User Experience)
You’re validating what users feel: loading, responsiveness, stability.
Common standard: Core Web Vitals
- LCP (Largest Contentful Paint): “How long before the user feels the page has loaded?”
- FID (First Input Delay): perceived responsiveness (delay before the first interaction is handled). (Note: INP, Interaction to Next Paint, has since replaced FID in Core Web Vitals, but the intent is the same: interaction latency.)
- CLS (Cumulative Layout Shift): visual stability (unexpected layout movement).
2) Service Layer (APIs / microservices)
Here you monitor request flows and backend behavior.
Best practice: RED method (and often Golden Signals too)
- Rate: throughput (requests/sec)
- Errors: failed requests
- Duration: latency (response time / transaction time)
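To make the RED method concrete, here is a minimal sketch that computes all three signals from a batch of request records. The records, the window length, and the crude nearest-rank p95 are all illustrative assumptions; production systems derive these from streaming metrics and histograms, not in-memory lists.

```python
# Hypothetical request records for one 60-second window:
# (route, status_code, duration_seconds)
requests = [
    ("/api/users", 200, 0.12),
    ("/api/users", 200, 0.30),
    ("/api/users", 500, 0.90),
    ("/api/orders", 200, 0.05),
]
window_seconds = 60

# Rate: throughput over the window (requests/sec).
rate = len(requests) / window_seconds

# Errors: fraction of requests that failed.
error_rate = sum(1 for _, status, _ in requests if status >= 500) / len(requests)

# Duration: a crude nearest-rank p95; real systems use histograms.
durations = sorted(d for _, _, d in requests)
p95 = durations[min(int(0.95 * len(durations)), len(durations) - 1)]

print(f"rate={rate:.3f}/s error_rate={error_rate:.0%} p95={p95}s")
```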
3) Infrastructure Layer (hosts, nodes, clusters)
Here you monitor capacity and resource pressure.
Best practice: USE method
- Utilization: percent busy (CPU, disk, memory)
- Saturation: “how queued up is it?” (e.g., run queue length, network queues) — closer to zero is generally healthier
- Errors: hardware / IO failures (disk write errors, NIC errors) — zero is the goal
Four Golden Signals
The Four Golden Signals are a classic cross-layer service reliability set:
- Latency
- Traffic
- Errors
- Saturation
You can think of it as RED + Saturation (and it works well across service and infrastructure views).
Monitoring vs Observability (and how they fit)
A clean relationship:
- Monitoring is a subset of observability.
- Monitoring tells you something is wrong (often via thresholds / alerts).
- Observability is collecting actionable, high-context telemetry to tell you when, where, and why an issue occurs.
If monitoring is “the smoke alarm,” observability is “the fire investigator + building blueprint.”
MELT: The Four Telemetry Pillars
A widely used framework for observability telemetry is MELT:
Metrics
- Aggregated measurements over time (counts, rates, percentiles).
- Best for: alerting, dashboards, trend analysis.
Events
- Something that happened at a point in time.
- Best for: audits, state changes, feature flags, deployments.
Logs
- Detailed records of events (often text or structured JSON).
- Best for: deep debugging, error details, edge cases.
Traces
- The end-to-end path of a request across services.
- Best for: distributed debugging, pinpointing bottlenecks, dependency mapping.
Metrics Collection Models: Push vs Scrape
Push
Apps send metrics to a remote endpoint (TCP/UDP/HTTP).
Pros: simpler for short-lived jobs and constrained networks.
Cons: harder to scale safely; backpressure and reliability can get tricky; more moving parts at the sender.
Scrape (Pull)
Services expose a /metrics endpoint, and the collector reads it on an interval.
Example: Prometheus (the canonical scrape system)
Why scraping is often more scalable:
- the collector controls polling rate,
- failures are isolated to the collector,
- you standardize exposure and discovery,
- it avoids “everyone pushing everywhere.”
Practical Prometheus components you’ll see:
- Exporters: expose metrics for systems that don’t natively (e.g., node exporter)
- Pushgateway: supports push-style for short-lived batch jobs, while Prometheus still scrapes the gateway
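The text a scraper reads from /metrics is just one series per line: metric name, labels, value. A minimal sketch of rendering that exposition format (render_metrics and the sample counters are hypothetical; real services normally use a client library such as prometheus_client rather than hand-rolling this):

```python
def render_metrics(counters):
    """Render counters in Prometheus text exposition style.

    counters: {(metric_name, ((label, value), ...)): count}
    """
    lines = []
    for (name, labels), value in sorted(counters.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

counters = {
    ("http_requests_total", (("method", "GET"), ("status", "200"))): 42,
    ("http_requests_total", (("method", "GET"), ("status", "500"))): 3,
}

# A scraper (e.g. Prometheus) would GET this text from /metrics on an interval.
print(render_metrics(counters))
```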
Prometheus Basics: Time Series + Labels
Prometheus stores time series, uniquely identified by:
- a metric name
- optional labels (key-value pairs)
Format:
<metric_name>{key="value", key="value"}
Labels are extremely powerful for slicing data (service, instance, route, status code), but also dangerous if you create high-cardinality labels (userId, requestId, full URL) that blow up storage and query cost.
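The cardinality risk is multiplicative: in the worst case you get one time series per combination of label values. A back-of-the-envelope sketch (the label counts are invented for illustration):

```python
from math import prod

# Distinct values per label (illustrative numbers).
safe_labels = {"service": 10, "route": 30, "status": 5}
risky_labels = {**safe_labels, "user_id": 100_000}

def max_series(label_cardinalities):
    # Worst case: one time series per combination of label values.
    return prod(label_cardinalities.values())

print(max_series(safe_labels))   # 1,500 potential series
print(max_series(risky_labels))  # 150,000,000 potential series
```

One unbounded label turns a cheap metric into a storage and query problem, which is why identifiers like userId belong in logs or traces, not metric labels.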
PromQL Concepts You Actually Use
Data types (in practice)
Prometheus values are numeric (stored as floats). Strings exist in limited contexts, but most PromQL work is numeric.
Vectors: instant vs range
- Instant vector: one value per series at an evaluation time.
  Example: http_requests_total
- Range vector: a window of samples per series.
  Example: http_requests_total[5m]
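Range vectors are what functions like rate() consume: given a counter's samples over the window, rate() yields the per-second increase. A rough sketch of the idea (simple_rate is a hypothetical helper; real PromQL rate() also handles counter resets and extrapolates to the window boundaries):

```python
def simple_rate(samples):
    """Per-second increase of a counter over a range-vector window.

    samples: list of (unix_timestamp, counter_value), oldest first.
    Sketch only: no counter-reset handling, no extrapolation.
    """
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

# http_requests_total sampled every 60s over a 5m window:
window = [(0, 100), (60, 130), (120, 160), (180, 190), (240, 220)]
print(simple_rate(window))  # 0.5 requests/sec
```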
Operators
PromQL supports:
- arithmetic: + - * / % ^
- comparisons: == != < <= > >=
Two behaviors worth remembering:
- Comparisons filter by default: vector == 10 keeps only series whose current sample equals 10. Add the bool modifier to keep every series and return 1 (true) or 0 (false) instead.
- Binary operations between two instant vectors are pairwise: only series with matching label sets combine.
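The filter-vs-bool distinction is easy to see in a toy model. Here an instant vector is just a dict from label sets to values (the labels and values are invented; this mimics PromQL semantics, it is not PromQL):

```python
# Instant vector as {label_set: value}, labels as a frozenset of pairs.
vector = {
    frozenset({("route", "/a")}): 10.0,
    frozenset({("route", "/b")}): 7.0,
}

def compare_filter(vec, threshold):
    # Default PromQL behavior: keep only series whose sample passes.
    return {k: v for k, v in vec.items() if v == threshold}

def compare_bool(vec, threshold):
    # With the `bool` modifier: keep every series, values become 1 or 0.
    return {k: (1.0 if v == threshold else 0.0) for k, v in vec.items()}

print(compare_filter(vector, 10.0))  # only the /a series survives
print(compare_bool(vector, 10.0))    # /a -> 1.0, /b -> 0.0
```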
Set operators
- and: intersection
- or: union
- unless: left-side exclusion
These are lifesavers for “show me series that are missing / different / only present in one place.”
Matchers / selectors
Label matchers:
- = : exact match
- != : not equal
- =~ : regex match
- !~ : regex non-match
Example pattern:
http_requests_total{job="api", status=~"5.."}
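One detail that trips people up: PromQL regex matchers are fully anchored, so the pattern must match the entire label value. Python's re.fullmatch gives the same behavior (label_matches is a hypothetical helper for illustration):

```python
import re

def label_matches(value, pattern):
    # PromQL regex matchers (=~, !~) are fully anchored:
    # the pattern must match the entire label value.
    return re.fullmatch(pattern, value) is not None

assert label_matches("500", "5..")
assert label_matches("503", "5..")
assert not label_matches("200", "5..")
assert not label_matches("5000", "5..")  # anchoring: no partial match
print('status=~"5.." matches exactly the 5xx statuses')
```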
Aggregations
Aggregations reduce series count by combining values:
Common ones:
- sum, min, max, avg, count
- topk, bottomk
- stddev, stdvar
- count_values
- group (every result value becomes 1; useful for grouping series by labels)
Grouping modifiers:
- by (...): keep only these labels
- without (...): drop these labels
Time shifting
offset 10m lets you compare current behavior to the past.
Example use-case: “is error rate worse than 10 minutes ago?”
Useful functions (high-signal)
- absent() / absent_over_time(): great for "this metric disappeared" alerts.
- clamp_min, clamp_max, clamp: useful for bounding values and filtering weirdness.
- delta / idelta: use on gauges (not counters).
- *_over_time (on range vectors): avg_over_time, sum_over_time, min_over_time, max_over_time, count_over_time
- sort, sort_desc
- time() and timestamp()
Grafana Note
- Grafana config files (.ini) use a semicolon (;) to comment out a line.
(And in practice: Grafana becomes the “lens” over Prometheus—dashboards, alerting, and correlations across sources.)
A Practical “Put It All Together” Monitoring Recipe
If you want one actionable checklist for a service:
Service (RED + Golden Signals)
- Rate by route/status
- Error rate by route/status
- Latency p50/p95/p99 by route
- Saturation indicators for dependencies (DB pool, queue depth)
Infrastructure (USE)
- CPU utilization + run queue (saturation)
- Disk IO latency + disk errors
- Network retransmits / queue + errors
UI (Core Web Vitals)
- LCP / interaction responsiveness / CLS by page template
- correlate spikes with deploys, CDN changes, JS bundle size, API latency
And always tie alerts back to: MTTD down, MTTR down.