Is It Time To Version Observability? (Signs Point To Yes)

Augh! I am so behind on so much writing, I’m even behind on writing shit that I need to reference in order to write other pieces of writing. Like this one. So we’re just gonna do this quick and dirty on the personal blog, and not bother bringing it up to the editorial standards of…anyone else’s sites. 😬

What does observability mean? No one knows

In 2016, we first borrowed the term “observability” from the Wikipedia entry for control systems observability, where it is a measure of how well you can understand a system’s internal states just by observing its outputs. We (Honeycomb) then spent a couple of years trying to work out how that definition might apply to software systems. Many twitter threads, podcasts, blog posts, and lengthy laundry lists of technical criteria emerged from that work, including a whole ass book.

Metrics, logs, tracing, drama

In 2018, Peter Bourgon wrote a blog post proposing that “observability has three pillars: metrics, logs and traces.” Ben Sigelman did a masterful job of unpacking why metrics, logs and traces are just telemetry. However, lots of people latched on to the three pillars language: vendors, because they (coincidentally!) had metrics products, logging products, and tracing products to sell; engineers, because it described their daily reality.

Since then the industry has been stuck in kind of a weird space, where the language used to describe the problems and solutions has evolved, but the solutions themselves are largely the same ones as five years ago, or ten years ago. They’ve improved, of course — massively improved — but structurally they’re variations on the same old pre-aggregated metrics.

It has gotten harder and harder to speak clearly about different philosophical approaches and technical solutions without wading deep into the weeds, where no one but experts should reasonably have to go.

This is what semantic versioning was made for

Look, I am not here to be the language police. I stopped correcting people on twitter back in 2019. We all do observability! One big happy family. 👍

I AM here to help engineers think clearly and crisply about the problems in front of them. So here we go. Let’s call the metrics, logs and traces crowd — the “three pillars” generation of tooling — “Observability 1.0”. Tools like Honeycomb, which are built on arbitrarily-wide structured log events as a single source of truth — that’s “Observability 2.0”.

Here is the twitter thread where I first teased out the differences between these generations of tooling (all the way back in December, yes, that’s how long I’ve been meaning to write this 😅).

About two months ago I wrote this thread about how we lost the battle to define ✨observability✨ — to give it a real, specific, falsifiable technical definition, distinct from monitoring or telemetry.

I complained, I argued, I grieved…and now I’m over it. SO over it. 🙄 https://t.co/QH3U0iZZ8G

— Charity Majors (@mipsytipsy) December 22, 2023

This is literally the problem that semantic versioning was designed to solve, by the way. Major version bumps are reserved for backwards-incompatible, breaking changes, and that’s what this is. You cannot simultaneously store your data scattered across multiple pillars and in a single source of truth.

Incompatible. Breaking change. O11y 1.0, meet O11y 2.0.

small technical changes can unlock waves of powerful sociotechnical transformation

There are a LOT of ramifications and consequences that flow from this one small change in how your data gets stored. I don’t have the time or space to go into all of them here, but I will do a quick overview of the most important ones.

The Cloud: A Cautionary Tale

The historical analogue that keeps coming to mind for me is virtualization. VMs are old technology; they’ve been around since the ’70s. But it wasn’t until the late ’90s that VMware productized them, unlocking wave after wave of change, from cloud computing and SaaS to the DevOps movement itself.

I believe the shift to observability 2.0 holds similarly massive potential for change, based on what I see happening today with teams who have already made the leap. Why? In a word, precision. O11y 1.0 can only ever give you aggregates and random exemplars. O11y 2.0, on the other hand, can tell you precisely what happened when you flipped a flag, deployed to a canary, or made any other change in production.

Will these waves of sociotechnical transformation ever be realized? Who knows. The changes that get unlocked will depend to some extent on us (Honeycomb), and to an even greater extent on engineers like you. Anyway, I’ll talk about this more some other time. Right now, I just want to establish a baseline for this vocabulary.

1.0 vs 2.0: How does the data get stored?

1.0 💙 O11y 1.0 has many sources of truth, in many different formats. Typically, you end up storing your data across metrics, logs, traces, APM, RUM, profiling, and possibly other tools as well. Some folks even find themselves falling back to B.I. (business intelligence) tools like Tableau in a pinch to understand what’s happening on their systems.

Each of these tools is siloed, with no connective tissue, or only a few predefined links that connect, e.g., a specific metric to a specific log line. Aggregation is done at write time, so you have to decide up front which data points to collect and which questions you want to be able to ask. You may find yourself eyeballing graph shapes and assuming they must be the same data, or copy-pasting IDs from logging to tracing tools and back.

Wide Events are Kenough

2.0 💚 Data gets stored in arbitrarily-wide structured log events (often called “canonical logs”), often with trace and span IDs appended. You can visualize the events over time as a trace, or slice and dice your data to zoom in to individual events, or zoom out to a bird’s-eye view. You can interact with your data via group by, break down, etc.

You aggregate at read time, and preserve raw events for ad hoc querying. Hopefully, you derive your SLO data from the same data you query! Think of it as B.I. for systems/app/business data, all in one place. You can derive metrics, or logs, or traces, but it’s all the same data.
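(To make “arbitrarily-wide event” concrete: here is a rough sketch, in Python, of what emitting one canonical log event per unit of work can look like. The field names and the handle_checkout function are mine, purely for illustration, not a prescribed schema.)

```python
import json
import time
import uuid

def handle_checkout(cart, user, trace_id=None):
    # One dict per unit of work; append fields freely as the request executes.
    event = {
        "timestamp": time.time(),
        "name": "checkout",
        "trace.trace_id": trace_id or uuid.uuid4().hex,
        "trace.span_id": uuid.uuid4().hex[:16],
        "user.id": user["id"],          # high cardinality? doesn't matter
        "user.plan": user["plan"],
        "cart.item_count": len(cart["items"]),
        "cart.total_cents": sum(i["price_cents"] for i in cart["items"]),
    }
    start = time.perf_counter()
    try:
        # ... do the actual checkout work here ...
        event["error"] = None
    except Exception as exc:
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.perf_counter() - start) * 1000
        # Emit the whole wide event as one structured log line.
        print(json.dumps(event))

handle_checkout(
    cart={"items": [{"price_cents": 499}, {"price_cents": 1250}]},
    user={"id": "u_12345", "plan": "pro"},
)
```

Everything you ask for later, whether it looks like a trace, an SLO, or a metric-shaped rollup, gets derived from events like this one at read time.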

1.0 vs 2.0: on metrics vs logs

1.0 💙 The workhorse of o11y 1.0 is metrics. RUM tools are built on metrics to understand browser user sessions. APM tools are built on metrics to understand application performance. Long ago, the decision was made to use metrics as the source of truth for telemetry because they are cheap and fast, and hardware used to be incredibly expensive.

The more complex our systems get, the worse this tradeoff becomes. Metrics are a terrible building block for understanding rich data, because you have to discard all that valuable context at write time, and they don’t support high (or even medium!) cardinality data. All you can do to enrich the data is add tags.

Metrics are a great tool for cheaply summarizing vast quantities of data. They are not equipped to help you introspect or understand complex systems. You will go broke and go mad if you try.
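(For contrast, here is roughly what the 1.0 building block looks like using the prometheus_client Python library; the metric and label names are invented for illustration. Notice where the context goes to die.)

```python
from prometheus_client import Counter, Histogram

# All the context you'll ever get back out has to be declared up front as labels.
REQUESTS = Counter(
    "http_requests_total", "Count of HTTP requests",
    ["method", "endpoint", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency",
    ["endpoint"],
)

def record_request(method, endpoint, status, duration_s, user_id):
    REQUESTS.labels(method=method, endpoint=endpoint, status=str(status)).inc()
    LATENCY.labels(endpoint=endpoint).observe(duration_s)
    # user_id has nowhere to go: adding it as a label would mint a new time
    # series per user, so that context simply gets dropped at write time.

record_request("POST", "/checkout", 500, 0.842, user_id="u_12345")
```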

2.0 💚 The building block of o11y 2.0 is wide, structured log events. Logs are infinitely more powerful, useful, and cost-effective than metrics because they preserve context and relationships between data, and data is made valuable by context. Logs also allow you to capture high-cardinality data and data relationships/structures, which gives you the ability to compute outliers and identify related events.

1.0 vs 2.0: Who uses it, and how?

1.0 💙 Observability 1.0 is predominantly about how you operate your code. It centers around errors, incidents, crashes, bugs, user reports and problems. MTTR, MTTD, and reliability are top concerns.

O11y 1.0 is typically consumed using static dashboards — lots and lots of static dashboards. “Single pane of glass” is often mentioned as a holy grail. It’s easy to find something once you know what you’re looking for, but you need to know to look for it before you can find it.

2.0 💚 If o11y 1.0 is about how you operate your code, o11y 2.0 is about how you develop your code. O11y 2.0 is what underpins the entire software development lifecycle, enabling engineers to connect feedback loops end to end so they get fast feedback on the changes they make, while it’s still fresh in their heads. This is the foundation of your team’s ability to move swiftly, with confidence. It isn’t just about understanding bugs and outages, it’s about proactively understanding your software and how your users are experiencing it.

Thus, o11y 2.0 has a much more exploratory, open-ended interface. Any dashboards should be dynamic, allowing you to drill down into a question or follow a trail of breadcrumbs as part of the debugging/understanding process. The canonical question of o11y 2.0 is “here’s a thing I care about … why do I care about it? What are all of the ways it is different from all the other things I don’t care for?”

When it comes to understanding your software, it’s often harder to identify the question than the answer. Once you know what the question is, you probably know the answer too. With o11y 1.0, it’s very easy to find something once you know what you’re looking for. With o11y 2.0, that constraint is removed.
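(Here is the smallest possible sketch of “aggregate at read time, slice and dice on anything”: a handful of made-up raw events grouped on the fly in plain Python. A real o11y 2.0 backend does this over a columnar store at vastly larger scale, but the shape of the interaction is the same.)

```python
from collections import defaultdict

# Pretend these are raw events pulled back from storage, not pre-aggregated.
events = [
    {"endpoint": "/checkout", "user.plan": "pro",  "duration_ms": 90,  "error": None},
    {"endpoint": "/checkout", "user.plan": "free", "duration_ms": 870, "error": "Timeout"},
    {"endpoint": "/search",   "user.plan": "free", "duration_ms": 40,  "error": None},
    {"endpoint": "/checkout", "user.plan": "free", "duration_ms": 910, "error": "Timeout"},
]

def break_down(events, key):
    """Group raw events by any field and aggregate at read time."""
    groups = defaultdict(list)
    for e in events:
        groups[e[key]].append(e)
    for value, rows in groups.items():
        durations = sorted(r["duration_ms"] for r in rows)
        errors = sum(1 for r in rows if r["error"])
        print(f"{key}={value}: count={len(rows)} "
              f"max_ms={durations[-1]} error_rate={errors / len(rows):.0%}")

# The same raw data answers either question; nothing was decided at write time.
break_down(events, "endpoint")
break_down(events, "user.plan")
```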

1.0 vs 2.0: How do you interact with production?

1.0 💙 You deploy your code and wait to get paged. 🤞 Your job is done as a developer when you commit your code and tests pass.

2.0 💚 You practice observability-driven development: as you write your code, you instrument it. You deploy to production, then inspect your code through the lens of the instrumentation you just wrote. Is it behaving the way you expected it to? Does anything else look … weird?
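(A minimal sketch of what “instrument as you write” can look like with OpenTelemetry’s Python API. The span name and attributes are placeholders I made up, and you still need the SDK and an exporter configured for the data to actually go anywhere.)

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def apply_discount(order, code):
    # Wrap the new code path in a span and attach whatever context
    # you'll want to see when you watch it roll out to production.
    with tracer.start_as_current_span("apply_discount") as span:
        span.set_attribute("app.order_id", order["id"])
        span.set_attribute("app.discount_code", code)
        discount = 0.10 if code == "LAUNCH10" else 0.0
        span.set_attribute("app.discount_applied", discount)
        order["total"] = round(order["total"] * (1 - discount), 2)
        return order

print(apply_discount({"id": "o_789", "total": 42.00}, "LAUNCH10"))
```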

Your job as a developer isn’t done until you know it’s working in production. Deploying to production is the beginning of gaining confidence in your code, not the denouement.

1.0 vs 2.0: How do you debug?

1.0 💙 You flip from dashboard to dashboard, pattern-matching and looking for similar shapes with your eyeballs.

You lean heavily on intuition, educated guesses, past experience, and a mental model of the system. This means that the best debuggers are ALWAYS the engineers who have been there the longest and seen the most.

Your debugging sessions are search-first: you start by searching for something you know should exist.

2.0 💚 You check your instrumentation, or you watch your SLOs. If something looks off, you see what all the mysterious events have in common, or you start forming hypotheses, asking a question, considering the result, and forming another one based on the answer. You interrogate your systems, following the trail of breadcrumbs to the answer, every time.
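(One way to picture “what do all the mysterious events have in common”: take the events you care about, say the slow ones, and compare their attribute values against everything else. This is a toy version with invented data, checking a few fields by hand; real tooling does the comparison across every dimension at once.)

```python
from collections import Counter

events = [
    {"duration_ms": 950, "region": "us-east-1", "build_id": "abc123", "user.plan": "free"},
    {"duration_ms": 40,  "region": "us-east-1", "build_id": "abc122", "user.plan": "pro"},
    {"duration_ms": 880, "region": "us-east-1", "build_id": "abc123", "user.plan": "pro"},
    {"duration_ms": 35,  "region": "eu-west-1", "build_id": "abc122", "user.plan": "free"},
]

slow = [e for e in events if e["duration_ms"] > 500]       # the events I care about
baseline = [e for e in events if e["duration_ms"] <= 500]  # everything else

for field in ("region", "build_id", "user.plan"):
    print(field,
          "slow:", Counter(e[field] for e in slow).most_common(1),
          "baseline:", Counter(e[field] for e in baseline).most_common(1))

# build_id cleanly separates the slow events (all abc123) from the baseline
# (all abc122); the other fields don't split the two groups at all.
```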

You don’t have to guess or rely on elaborate, inevitably out-of-date mental models. The data is right there in front of your eyes. The best debuggers are the people who are the most curious.

Your debugging questions are analysis-first: you start with your user’s experience.

1.0 vs 2.0: The cost model

1.0 💙 You pay to store your data again and again and again and again, multiplied by all the different formats and tool types you are paying to store it in. Costs go up at a multiple of your traffic increase. I wrote a whole piece earlier this year on the cost crisis in observability tooling, so I won’t go into it in depth here.

As your costs increase, the value you get out of your tools actually decreases.

If you are using metrics-based products, your costs go up based on cardinality. “Custom metrics” is a euphemism for “cardinality”; “100 free custom metrics” actually means “100 free cardinality”, aka unique values.
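(The back-of-the-envelope arithmetic, with invented numbers, since the multiplication is the whole point:)

```python
# Each unique combination of tag values becomes its own time series,
# and metrics vendors bill per unique series ("custom metrics").
endpoints, status_codes, hosts = 20, 5, 50
series = endpoints * status_codes * hosts
print(series)            # 5,000 series for ONE metric

# Add a single high-cardinality tag, say 100k distinct user IDs,
# and that same metric explodes:
users = 100_000
print(series * users)    # 500,000,000 series
```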

2.0 💚 You pay to store your data once. As your costs go up, the value you get out goes up too. You have powerful, surgical options for controlling costs via head-based or tail-based dynamic sampling.
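(A sketch of what “surgical” can mean here: head-based sampling decides up front whether to keep an event, records the sample rate on the events it keeps, and lets read-time aggregation re-weight the survivors. The rates and field names below are illustrative, not anyone’s recommended config.)

```python
import random

def sample_rate_for(route, status):
    """Keep everything interesting, heavily sample the boring bulk."""
    if status >= 500:
        return 1        # keep every error
    if route == "/healthcheck":
        return 1000     # keep 1 in 1000 health checks
    return 20           # keep 1 in 20 of everything else

def maybe_emit(event):
    rate = sample_rate_for(event["route"], event["status"])
    if random.randrange(rate) == 0:
        event["sample_rate"] = rate   # stored so queries can re-weight
        return event                  # ship it
    return None                       # drop it, cost saved

# At read time, each kept event counts for `sample_rate` real ones,
# so counts and percentiles still come out (approximately) right.
kept = [e for e in (maybe_emit({"route": "/checkout", "status": 200})
                    for _ in range(10_000)) if e]
print(len(kept), "events stored,",
      sum(e["sample_rate"] for e in kept), "events represented")
```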

You can have infinite cardinality. You are encouraged to pack hundreds or thousands of dimensions in per event, and any or all of those dimensions can be any data type you want. This luxurious approach to cardinality and data is one of the least well understood aspects of the switch from o11y 1.0 to 2.0.

Many observability engineering teams have spent their entire careers massaging cardinality to control costs. What if you just … didn’t have to do that? What would you do with your lives? If you could just store and query on all the crazy strings you want, forever? 🌈

Metrics are a bridge to our past

Why are observability 1.0 tools so unbelievably, eyebleedingly expensive? As anyone who works with data can tell you, this is always what happens when you use the wrong tool for the job. Once again, metrics are a great tool for summarizing vast quantities of data. When it comes to understanding complex systems, they flail.

I wrote a whole whitepaper earlier this year that did a deep dive into exactly why tools built on top of metrics are so unavoidably costly. If you want the gnarly detail, download that.

The TLDR is this: tools built on metrics — whether RUM, APM, dashboards, etc — are a bridge to our past. If there’s one thing I’m certain of, it’s that tools built on top of wide, structured logs are the bridge to our future.

Wide, structured log events are the bridge to our future

Five years from now, I predict that the center of gravity will have swung dramatically; all modern engineering teams will be powering their telemetry off of tools backed by wide, structured log events, not metrics. It’s getting harder and harder and harder to try and wring relevant insights out of metrics-based observability tools. The end of the ZIRP era is bringing unprecedented cost pressure to bear, and it’s simply a matter of time.

The future belongs to tools built on wide, structured log events — a single source of truth that you can trace over time, or zoom in, zoom out, derive SLOs from, etc.

It’s the only way to understand our systems in all their skyrocketing complexity. This constant dance with cost vs. cardinality consumes entire teams’ worth of engineers and adds zero value. It adds negative value.

And here’s the weirdest part. The main thing holding most teams back psychologically from embracing o11y 2.0 seems to be the entrenched difficulties they have grappling with o11y 1.0, and their sense that they can’t adopt 2.0 until they get a handle on 1.0. Which gets things exactly backwards.

Because observability 2.0 is so much easier, simpler, and more cost effective than 1.0.

observability 1.0 *is* the hard way

It’s so fucking hard. We’ve been doing it so long that we are blind to just how HARD it is. But trying to teach teams of engineers to wrangle metrics, to squeeze the questions they want to ask into multiple abstract formats scattered across many different tools, with no visibility into what they’re doing until it eventually comes out in the form of a giant bill… it’s fucking hard.

Observability 2.0 is so much simpler. You want data, you just toss it in. Format? don’t care. Cardinality? don’t care.

You want to ask the question, you just ask it. Format? don’t care.

Teams are beating themselves up trying to master an archaic, unmasterable set of technical tradeoffs based on data types from the 80s. It’s an unwinnable war. We can’t understand today’s complex systems without context-rich, explorable data.

We need more options for observability 2.0 tooling

My hope is that by sketching out these technical differences between o11y 1.0 and 2.0, we can begin to collect and build up a vendor-neutral library of o11y 2.0 options for folks. The world needs more options for understanding complex systems besides just Honeycomb and Baselime.

The world desperately needs an open source analogue to Honeycomb — something built for wide structured events, stored in a columnar store (or even just ClickHouse), with an interactive interface. Even just a written piece on how you solved it at your company would help move the industry forward.

My other hope is that people will stop founding new observability startups built on metrics. Y’all, Datadog and Prometheus are the last, best metrics-backed tools that will ever be built. You can’t catch up to them or beat them at that; no one can. Do something different. Build for the next generation of software problems, not the last generation.

If anyone knows of anything along these lines, please send me links? I will happily collect them and signal boost. Honeycomb is a great, lifechanging tool (and we have a generous free tier, hint hint) but one option does not a movement make.

<3 charity

P.S. Here’s a great piece written by Ivan Burmistrov on his experience using observability 2.0 type tooling at Facebook — namely Scuba, which was the inspiration for Honeycomb. It’s a terrific piece and you should read it.

P.P.S. And if you’re curious, here’s the long twitter thread I wrote in October of 2023 on how we lost the battle to define observability:

So, we lost the battle to define observability. You know it, I know it. Observability was supposed to *mean* something, and in the early days, it did.

“Observability” once meant the kind of exploratory, open ended investigation our systems increasingly demand.

— Charity Majors (@mipsytipsy) October 30, 2023