Observability
You cannot reliably operate systems you cannot observe. This sounds obvious, and it is. The non-obvious part is that observability is not a list of features you turn on. It is a property of the system: how cheaply can the team ask new questions about the system's behavior in production?
The "three pillars" framing (logs, metrics, traces) is a useful starting point. The deeper definition, articulated most clearly by the modern observability community, is that observability is about being able to investigate failure modes the team did not anticipate when the system was built. Logs and metrics that answer pre-defined questions are monitoring. Observability lets you ask questions you did not know you would need to ask. [1]
The Components
A practical observability stack typically includes:
- Logs. Structured records of what the system did, when. Most useful when the structure is machine-readable, the context is rich (request IDs, user IDs, relevant state), and the volume is manageable.
- Metrics. Numerical measurements aggregated over time. Cheap to store, fast to query, well-suited to dashboards and alerting. Limited in resolution: a metric tells you what happened in aggregate, not which request was affected.
- Traces. End-to-end timelines of how a single request flowed through the system. Especially valuable in distributed systems, where the same request crosses many services and produces failures that no single service's logs can explain.
- Dashboards. Visualizations of metrics over time, organized around what the team needs to see at a glance. A dashboard is useful when it is consulted regularly and updated when it stops being useful.
- Alerting. Automated detection of conditions that require human attention. Good alerts are rare, actionable, and tied to user-visible problems. Bad alerts are noisy and vague, and over time the team learns to ignore them.
- Uptime checks. External probes that verify the system is reachable and behaving correctly from outside. Catch a class of failure that internal monitoring cannot see.
- Error tracking. Aggregated reporting of exceptions and errors, grouped by signature, with the context needed to diagnose. Often where the team learns about defects before the customer does.
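As a rough illustration of the "machine-readable structure, rich context" point about logs, here is a minimal sketch using only Python's standard `logging` module. The service name `checkout` and the context fields (`request_id`, `user_id`, `route`, `duration_ms`) are assumptions made for the example, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical context fields; use whatever your system actually carries.
    CONTEXT_FIELDS = ("request_id", "user_id", "route", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info(
    "payment authorized",
    extra={"request_id": "req-123", "user_id": "u-42", "route": "/pay", "duration_ms": 87},
)

# Because the line is JSON, it can be parsed back without hand-written regexes.
event = json.loads(buffer.getvalue())
print(event["request_id"], event["duration_ms"])
```

Writing to an in-memory buffer keeps the sketch self-contained; in a real service the handler would more likely write to stdout or a log shipper.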
The three pillars are necessary but not sufficient. What makes a system genuinely observable is the cardinality and dimensionality of the data the team can query: how many distinct values a field can hold (request IDs, user IDs), how many dimensions can be used to slice the behavior, and how cheaply new dimensions can be added when the team needs them.
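To make "slicing behavior by dimension" concrete, the sketch below groups a handful of per-request events by whatever fields the caller names. The events, field names, and the `slice_by` helper are all hypothetical; a real system would run the equivalent query in its observability backend rather than in application code.

```python
from collections import defaultdict
from statistics import median

# Hypothetical per-request events; every field is a dimension that can be queried.
EVENTS = [
    {"route": "/pay", "region": "eu", "app_version": "2.3.1", "status": 200, "duration_ms": 80},
    {"route": "/pay", "region": "eu", "app_version": "2.4.0", "status": 500, "duration_ms": 950},
    {"route": "/pay", "region": "us", "app_version": "2.4.0", "status": 500, "duration_ms": 1020},
    {"route": "/pay", "region": "us", "app_version": "2.3.1", "status": 200, "duration_ms": 75},
    {"route": "/cart", "region": "eu", "app_version": "2.4.0", "status": 200, "duration_ms": 40},
    {"route": "/pay", "region": "eu", "app_version": "2.4.0", "status": 503, "duration_ms": 990},
]


def slice_by(events, *dimensions):
    """Group events by any combination of fields and summarize each group."""
    groups = defaultdict(list)
    for event in events:
        key = tuple(event.get(dim) for dim in dimensions)
        groups[key].append(event)

    summary = {}
    for key, members in groups.items():
        errors = sum(1 for e in members if e["status"] >= 500)
        summary[key] = {
            "count": len(members),
            "error_rate": errors / len(members),
            "median_ms": median(e["duration_ms"] for e in members),
        }
    return summary


# A question nobody planned for: "is this tied to the new version, and on which route?"
by_version = slice_by(EVENTS, "app_version")
by_route_and_version = slice_by(EVENTS, "route", "app_version")

for key, stats in sorted(by_route_and_version.items()):
    print(key, stats)
```

The point of the sketch is that adding a new dimension to the question costs one extra argument. A pre-aggregated metric that was never tagged with `app_version` cannot answer the same question after the fact.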
What Observability Actually Buys
A few specific capabilities:
- Fast root cause. When something is broken, the team can find out what, where, and why without a heroic debugging effort.
- Confidence in change. A change deployed to a well-observed system is a change the team can verify worked. A change deployed to an unobservable system is a change the team has to wait for support tickets to validate.
- Capacity planning grounded in reality. What the system is actually doing, rather than what the team thinks it is doing, is the basis for sensible scaling decisions.
- Postmortems that produce learning. An incident investigation in an observable system can reconstruct what happened with evidence. An incident in an unobservable system produces speculation and competing theories.
- Faster onboarding. A new engineer can read dashboards, query logs, and trace requests to build a model of the system. Without those, the only model they have access to is the one in someone else's head.
Common Anti-Patterns
- Alert fatigue. So many alerts that the team learns to ignore them. The cost is paid when a real alert arrives and is missed in the noise.
- Dashboards as decoration. Many dashboards, none of them consulted in actual incidents. The team can tell the difference by asking when each dashboard was last looked at.
- Logs without structure. Free-form log lines that have to be parsed by hand to extract useful information. Investment in structured logging pays for itself in the first incident.
- Vanity metrics. Metrics chosen because they go up and to the right, not because they describe anything the team cares about. The fix is to track only metrics that would change what the team does.
- Observability as someone else's job. Engineers ship code and assume "operations will instrument it." The instrumentation lives in the same codebase as the code that needs it, written by the same people. Anything else is the handoff trap.
- Monitoring instead of observability. Comprehensive dashboards for known problems, no ability to investigate unknown ones. Easy to mistake one for the other until the day you need the second.
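One way to put a number on alert fatigue is to audit recent pages and flag alerts that rarely led to action. The sketch below assumes the team records, for each page, whether the responder had to do anything. The `Page` record, the sample data, and both thresholds are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    alert_name: str
    actionable: bool  # did the responder actually have to do something?


# Hypothetical paging history for one review period.
PAGES = [
    Page("disk_usage_80pct", False),
    Page("disk_usage_80pct", False),
    Page("disk_usage_80pct", False),
    Page("disk_usage_80pct", False),
    Page("checkout_error_budget_burn", True),
    Page("checkout_error_budget_burn", True),
    Page("checkout_error_budget_burn", True),
    Page("cpu_high", False),
    Page("cpu_high", False),
]


def noisy_alerts(pages, min_pages=3, max_actionable_ratio=0.5):
    """Return (alert, page_count, actionable_ratio) for alerts that mostly cry wolf.

    Both thresholds are illustrative assumptions; tune them to the team's tolerance.
    """
    total = Counter(p.alert_name for p in pages)
    actionable = Counter(p.alert_name for p in pages if p.actionable)

    report = []
    for name, count in total.items():
        ratio = actionable[name] / count
        if count >= min_pages and ratio < max_actionable_ratio:
            report.append((name, count, ratio))
    # Least actionable alerts first.
    return sorted(report, key=lambda row: row[2])


for name, count, ratio in noisy_alerts(PAGES):
    print(f"{name}: {count} pages, {ratio:.0%} actionable")
```

With this sample data only `disk_usage_80pct` is flagged. `cpu_high` was never actionable either, but it falls below the minimum page count, which is a reminder that the thresholds encode a judgment call.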
What This Looks Like in Practice
- Instrument as part of writing the feature. Logs, metrics, and traces are added in the same PR as the code they describe. Adding them later is much harder and much rarer.
- Define service-level objectives (SLOs) before alerts. What is the system supposed to do, in user-visible terms? Alerts should fire when those objectives are at risk, not when arbitrary internal thresholds are crossed. [2]
- Treat noisy alerts as a quality bug. An alert that pages on-call without being actionable is a defect in the alerting system. Investigate and fix it like any other defect.
- Make production data queryable. A team that has to file a ticket to investigate a production issue does not have observability; it has a process for getting at observability when someone approves.
- Review the observability story in design. Asking "how would we know if this was broken?" early in design produces systems that are observable by construction.
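A minimal sketch of alerting on an objective rather than on an internal threshold: compute how fast the error budget implied by an SLO is being consumed, and page only when both a short and a long window agree. The 99.9% target, the window sizes, the request counts, and the paging threshold of 10 are assumptions chosen for illustration.

```python
def burn_rate(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows.

    1.0 means the budget is being spent exactly as fast as the SLO permits;
    values above 1.0 mean the budget will run out before the SLO period ends.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 0.0  # no traffic, nothing consumed
    observed_error_rate = failed_requests / total_requests
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def should_page(short_window_burn: float, long_window_burn: float, threshold: float = 10.0) -> bool:
    """Page only if both windows exceed the threshold.

    Requiring both filters out brief blips (short window only) and problems
    that have already recovered (long window only). The threshold is illustrative.
    """
    return short_window_burn >= threshold and long_window_burn >= threshold


SLO_TARGET = 0.999  # hypothetical: 99.9% of requests succeed

# Hypothetical counts for the last five minutes and the last hour.
short_burn = burn_rate(SLO_TARGET, total_requests=10_000, failed_requests=200)
long_burn = burn_rate(SLO_TARGET, total_requests=120_000, failed_requests=1_500)

print(f"short-window burn: {short_burn:.1f}x, long-window burn: {long_burn:.1f}x")
print("page on-call:", should_page(short_burn, long_burn))
```

The alert condition here is expressed entirely in terms of what users experience (failed requests against a stated objective), so a CPU spike that harms nobody never pages anyone.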
Key principle
Mature systems are continuously observed. You cannot operate, debug, scale, or improve a system you cannot see. Observability is the precondition for treating production as a place to learn rather than a place to fear.
See also: Incident Response, Continuous Improvement, Deployment Strategies, Fix Forward, Quality Is Designed In.
[1] Charity Majors, Liz Fong-Jones, and George Miranda, Observability Engineering (O'Reilly, 2022). The modern reference text on what observability actually means as distinct from monitoring, with a focus on the role of high-cardinality data and the ability to investigate unknown failure modes. The book's foundational claim, that observability is about asking new questions rather than pre-defining old ones, has become the dominant articulation of the discipline.
[2] Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (eds.), Site Reliability Engineering: How Google Runs Production Systems (O'Reilly, 2016). The canonical articulation of service-level objectives, error budgets, and the discipline of alerting on what users see rather than on what the system internally measures.