What is Observability?

Observability is the ability to understand a system's internal state from its external outputs: logs, metrics, and traces correlated per request.

What is observability?

Observability is the property of a system that lets you understand its internal state from its external outputs without having to ship new code to investigate. The three external outputs are logs (what events happened), metrics (numeric time series such as requests per second, latency, and CPU usage), and traces (request-level breakdowns across services). A system is observable when an engineer with no prior context can answer arbitrary questions about a production incident from those three signals alone.

The term comes from control theory: a system is observable if its internal state can be fully determined from a finite history of its outputs. In software, the practical bar is lower but the principle holds. If a customer reports a slow checkout and your only response is to add log lines, redeploy, and wait for the slowdown to recur, your system is not observable. If you can pull up the slow trace, find the slow span, drill into the query behind it, and confirm root cause inside 15 minutes, it is.

Observability vs monitoring

Monitoring and observability overlap but are not the same:

  • Monitoring asks pre-defined questions: is the homepage up, is CPU under 80%, is error rate under 1%? You set dashboards and alerts in advance against failure modes you predict. Best for known failure modes.
  • Observability asks arbitrary questions of high-cardinality data after the fact: "show me all requests from the last hour slower than p99, grouped by customer_id and region, where the downstream call to stripe.com exceeded 800 ms." Best for unknown failure modes.

A monitoring system can tell you that something is broken. An observable system lets you find out what broke and why. Healthy production needs both: monitoring for the alerting layer, observability for the investigation layer.
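As a rough illustration of an after-the-fact question, the sketch below filters and groups a handful of per-request events in memory. The field names (duration_ms, stripe_ms) and the sample records are hypothetical, and the p99 filter from the quoted query is left out for brevity; a real observability backend would run this against stored trace or event data.

```python
from collections import defaultdict

# Hypothetical per-request events; field names and values are illustrative only.
EVENTS = [
    {"customer_id": "c1", "region": "eu", "duration_ms": 1200, "stripe_ms": 950},
    {"customer_id": "c1", "region": "eu", "duration_ms": 1100, "stripe_ms": 900},
    {"customer_id": "c2", "region": "us", "duration_ms": 300, "stripe_ms": 120},
    {"customer_id": "c3", "region": "us", "duration_ms": 1500, "stripe_ms": 400},
]


def slow_downstream_by_group(events, threshold_ms=800):
    """Count requests whose downstream call exceeded threshold_ms,
    grouped by (customer_id, region)."""
    counts = defaultdict(int)
    for event in events:
        if event["stripe_ms"] > threshold_ms:
            counts[(event["customer_id"], event["region"])] += 1
    return dict(counts)


print(slow_downstream_by_group(EVENTS))
```

The point is that the grouping keys are chosen at query time, not at instrumentation time, which is only possible when each event keeps its high-cardinality fields.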

The three pillars (and the cardinality trap)

  • Logs: discrete event records. Easy to write, expensive to query at scale, weak at correlation across services unless you propagate trace IDs into every log line.
  • Metrics: numeric time series. Cheap to store, fast to query, but lose individual request context. Cardinality (number of unique tag combinations) is the cost driver: high-cardinality metrics get expensive fast.
  • Traces: per-request span trees across services. The strongest signal for incident investigation. Often sampled (1-10%) because storing every trace is prohibitively expensive.
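One common way to sample traces is to decide from the trace ID itself, so every service that sees the same ID makes the same keep-or-drop decision. The function below is a minimal sketch of that idea, assuming a 32-character hex trace ID; the name keep_trace and the use of the low 64 bits are illustrative choices, not any particular vendor's sampler.

```python
def keep_trace(trace_id_hex: str, sample_rate: float) -> bool:
    """Deterministic head sampling: keep the trace when the low 64 bits
    of its ID fall below sample_rate * 2**64."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < int(sample_rate * 2**64)


# The same trace ID always yields the same decision at a given rate.
example_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(keep_trace(example_id, 0.05) == keep_trace(example_id, 0.05))
```

Because the decision depends only on the ID and the rate, a trace is either kept in full or dropped in full, instead of ending up with spans missing from some services.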

Modern observability tools (Honeycomb, Datadog, Grafana, Splunk, New Relic, Elastic) try to unify the three around a shared trace ID so a single click moves from a metric spike to the slow trace to the corresponding logs. The cardinality trap: as soon as you tag metrics with user_id or request_id, every distinct value creates a new time series, and storage and query costs in the time-series database explode. Use traces and logs for high-cardinality dimensions, and keep metrics for aggregate dashboards and alerts.
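The arithmetic behind the cardinality trap can be sketched in a few lines. The label names and counts below are made-up numbers for illustration; the calculation gives an upper bound, since not every label combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-latency metric.
aggregate = series_count({"endpoint": 50, "region": 5, "status_class": 5})
with_user = series_count(
    {"endpoint": 50, "region": 5, "status_class": 5, "user_id": 100_000}
)

print(aggregate)
print(with_user)
```

Under these assumed numbers, adding a single user_id label multiplies the series count by the number of users, which is why per-user detail belongs in traces and logs.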

What observability covers in practice

  1. Incident investigation: production page on fire, on-call engineer correlates the metric spike to the trace to the slow query to the bad deploy in under 10 minutes.
  2. SLO measurement: error rate and latency percentiles per endpoint, per customer tier, sliced any way you need to defend the SLO target.
  3. Performance regression detection: compare p95 latency for endpoint X before and after a deploy, scoped to a single customer cohort.
  4. Customer support escalation: "this user reports their checkout was slow at 14:32 UTC" mapped to the exact trace and the actual upstream call that delayed it.
  5. Capacity planning: historical metrics correlated to traffic patterns to project when the current cluster will saturate. See capacity testing.
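The regression check in item 3 can be sketched as a percentile comparison scoped to one endpoint and one cohort. The nearest-rank percentile below is one of several percentile definitions, and the request records, field names, and latencies are invented for the example.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def p95_by_deploy(requests, endpoint, cohort):
    """p95 latency per deploy label, restricted to one endpoint and cohort."""
    buckets = {}
    for r in requests:
        if r["endpoint"] == endpoint and r["cohort"] == cohort:
            buckets.setdefault(r["deploy"], []).append(r["ms"])
    return {deploy: percentile(ms, 95) for deploy, ms in buckets.items()}


def _rows(deploy, latencies, cohort="enterprise"):
    # Helper that builds hypothetical request records for the demo.
    return [
        {"endpoint": "/checkout", "cohort": cohort, "deploy": deploy, "ms": ms}
        for ms in latencies
    ]


REQUESTS = (
    _rows("before", range(100, 120))
    + _rows("after", list(range(100, 118)) + [400, 450])
    + _rows("after", [900] * 5, cohort="free")
)

print(p95_by_deploy(REQUESTS, "/checkout", "enterprise"))
```

In this made-up data the median barely moves after the deploy while the p95 jumps, which is the kind of tail regression an average would hide.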

How to get observable

Start by propagating a trace ID through every request: HTTP header (W3C traceparent), into queue messages, into background jobs. Then instrument with an OpenTelemetry SDK (vendor-neutral) emitting traces and metrics to your backend of choice. Then standardize log lines to include trace_id and customer_id so you can pivot from any of the three signals to the others.
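To make the propagation step concrete, here is a dependency-free sketch of the W3C traceparent header (version 00: a 32-hex trace ID, a 16-hex parent span ID, and 2-hex flags) plus a JSON log line that carries the trace ID. The helper names and the log fields are assumptions for illustration; in practice an OpenTelemetry SDK handles generation and propagation for you.

```python
import json
import re
import secrets

# Version 00 layout: 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace at the edge of the system."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return (trace_id, span_id, flags) or raise ValueError."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("all-zero trace or span ID is invalid")
    return trace_id, span_id, flags


def child_traceparent(parent_header):
    """Keep the trace ID and flags, mint a new span ID for an outgoing
    HTTP call, queue message, or background job."""
    trace_id, _parent_span, flags = parse_traceparent(parent_header)
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def log_line(message, traceparent, customer_id):
    """Structured log line that can be joined to traces by trace_id."""
    trace_id, span_id, _flags = parse_traceparent(traceparent)
    return json.dumps(
        {"msg": message, "trace_id": trace_id, "span_id": span_id,
         "customer_id": customer_id}
    )


incoming = new_traceparent()
print(log_line("checkout started", incoming, "cust_42"))
print(child_traceparent(incoming))
```

The same child_traceparent value would be attached to the outgoing request header or embedded in the queue message, so the downstream service logs the same trace_id.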

Observability complements load testing. Run load tests or soak tests against a fully instrumented staging environment and the same dashboards you use in production become your test results: which endpoint slowed first, which query started timing out, which downstream call became the bottleneck. See also latency for why percentile thinking is the foundation.

If your team needs production-shape load test runs correlated against your existing observability stack, LoadFocus offers load testing services where engineers design the scenarios, run the matrix, and write up the breakdown against your APM and tracing data.
