Chapter 22. Where to Go from Here

In this book, we’ve looked at observability for software systems from many angles. We’ve covered what observability is and how that concept operates when adapted for software systems—from its functional requirements, to functional outcomes, to sociotechnical practices that must change to support its adoption.

To review, this is how we defined observability at the start of this book:

Observability for software systems is a measure of how well you can understand and explain any state your system can get into, no matter how novel or bizarre. You must be able to comparatively debug that bizarre or novel state across all dimensions of system state data, and combinations of dimensions, in an ad hoc iterative investigation, without being required to define or predict those debugging needs in advance. If you can understand any bizarre or novel state without needing to ship new code, you have observability.

Now that we’ve covered the many concepts and practices intertwined with observability in this book, we can tighten that definition a bit:

If you can understand any state of your software system, no matter how novel or bizarre, by arbitrarily slicing and dicing high-cardinality and high-dimensionality telemetry data into any view you need, and use the core analysis loop to comparatively debug and quickly isolate the correct source of issues, without being required to define or predict those debugging needs in advance, then you have observability.
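As a rough illustration of what comparatively debugging across dimensions can look like, the sketch below ranks field values by how much more common they are among outlier events than among baseline events. It is a minimal sketch, not how any particular tool implements the core analysis loop: the event fields (`endpoint`, `build_id`, `user_id`, `duration_ms`), the function name `dimension_differences`, and the 500 ms outlier threshold are all assumptions made for the example.

```python
from collections import Counter


def dimension_differences(events, is_outlier, ignore=()):
    """Rank (field, value) pairs by how over-represented they are in outliers.

    `events` is a list of flat dicts with hashable values. `is_outlier` is a
    predicate chosen by the investigator. Fields named in `ignore` are skipped.
    """
    outliers = [e for e in events if is_outlier(e)]
    baseline = [e for e in events if not is_outlier(e)]
    if not outliers or not baseline:
        return []

    def count_pairs(group):
        return Counter(
            (field, value)
            for event in group
            for field, value in event.items()
            if field not in ignore
        )

    out_counts = count_pairs(outliers)
    base_counts = count_pairs(baseline)

    diffs = []
    for pair in set(out_counts) | set(base_counts):
        out_share = out_counts[pair] / len(outliers)
        base_share = base_counts[pair] / len(baseline)
        diffs.append((pair, out_share - base_share))

    # Largest positive difference first: most over-represented in outliers.
    diffs.sort(key=lambda item: item[1], reverse=True)
    return diffs


# Hypothetical events; every field name and value here is illustrative.
events = [
    {"endpoint": "/cart", "build_id": "a1", "user_id": "u1", "duration_ms": 40},
    {"endpoint": "/cart", "build_id": "a1", "user_id": "u2", "duration_ms": 45},
    {"endpoint": "/cart", "build_id": "b2", "user_id": "u3", "duration_ms": 900},
    {"endpoint": "/home", "build_id": "b2", "user_id": "u4", "duration_ms": 950},
    {"endpoint": "/home", "build_id": "a1", "user_id": "u5", "duration_ms": 50},
]

ranked = dimension_differences(
    events,
    is_outlier=lambda e: e["duration_ms"] > 500,
    ignore=("duration_ms",),
)
top_pair, top_difference = ranked[0]
```

In this toy data set, the slow requests share a build ID rather than an endpoint or a user, so that field value rises to the top of the ranking. Nothing about the question had to be predicted in advance; the same function works for any predicate and any set of fields.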

Observability, Then Versus Now

We started writing this book more than three years ago. You might ask why on earth it has taken so long to get here.

First off, the state of observability has been a moving target. When we started writing this book, conversing with anyone about the topic required us to stop and first define “observability.” Nobody really understood what we meant whenever we talked about the cardinality of data or its dimensionality. We would frequently and passionately need to argue that the so-called three pillars view of observability was only about the data types, and that it completely ignores the analysis and practices needed to gain new insights.

As Cindy Sridharan states in the Foreword, the rise in prominence of the term “observability” has also led (inevitably and unfortunately) to it being used interchangeably with an adjacent concept: monitoring. We would frequently need to explain that “observability” is not a synonym for “monitoring,” or “telemetry,” or even “visibility.”

Back then, OpenTelemetry was in its infancy, and that was yet another thing to explain: how was it different from (or what had it inherited from) OpenTracing and OpenCensus? Why would you use a new open standard that required a bit more setup work instead of your vendor’s more mature agent that worked right away? Why should anyone care?

Now, many people we speak to don’t need those explanations. There’s more agreement on how observability is different from monitoring. More people understand the basic concepts and that data misses the point without analysis. They also understand the benefits and the so-called promised land of observability, because they hear about the results from many of their peers. What many of the people we speak to today are looking for is more sophisticated analyses and low-level, specific guidance on how to get from where they are today to a place where they’re successfully practicing observability.

Second, this book initially started with a much shorter list of chapters. It had more basic material and a smaller scope. As we started to better understand which concerns were common and which had successful emergent patterns, we added more depth and detail. As we encountered more and more organizations using observability at massive scale, we were able to learn comparatively and incorporate those lessons by inviting direct participation in this book (we’re looking at you, Slack!).

Third, this book has been a collaborative effort with several reviewers, including those who work for our competitors. We’ve revised our takes, incorporated broader viewpoints, and revisited concepts throughout the authoring process to ensure that we’re reflecting an inclusive state of the art in the world of observability. Although we (the authors of this book) all work for Honeycomb, our goal has always been to write an objective and inclusive book detailing how observability works in practice, regardless of specific tool choices. We thank our reviewers for keeping us honest and helping us develop a stronger narrative.

Based on your feedback, we added more content around the sociotechnical challenges in adopting observability. As with any technological shift that also requires changing associated practices, you can’t just buy a tool and achieve observability. Adopting observability practices means changing the way you think about understanding your software’s behavior and, in turn, changing the relationship you have with your customers. Observability lets you empathize and align with your customers by letting you understand exactly how the changes you make impact their experience, day after day. Illustrating the way that plays out in practice across multiple teams within an organization and sifting out useful advice for beginners has taken time as repeatable patterns have emerged (and as they continue to evolve).

So, where should you go from here? First, we’ll recommend additional resources to fill in essential topics that are outside the scope of this book. Then, we’ll make some predictions about what to expect in the world of observability.

Additional Resources

The following are some resources we recommend:

Site Reliability Engineering by Betsy Beyer et al. (O’Reilly)

We’ve referenced this book a few times within our own. Also known as “the Google SRE book,” this book details how Google implemented DevOps practices within its SRE teams. This book details several concepts and practices that are adjacent to using observability practices when managing production systems. It focuses on practices that make production software systems more scalable, reliable, and efficient. The book introduces SRE practices and details how they are different from conventional industry approaches. It explores both the theory and practice of building and operating large-scale distributed systems. It also covers management practices that can help guide your own SRE adoption initiatives. Many of the techniques described in this book are most valuable when managing distributed systems. If you haven’t started down the path of using SRE principles within your own organization, this book will help you establish practices that will be complemented by the information you’ve learned in our book.

Implementing Service Level Objectives by Alex Hidalgo (O’Reilly)

This book provides an in-depth exploration of SLOs, which our book only briefly touches on (see Chapter 12 and Chapter 13). Hidalgo is a site reliability engineer, an expert at all things related to SLOs, and a friend to Honeycomb. His book outlines many more concepts, philosophies, and definitions relevant to the SLO world to introduce fundamentals you need in order to take further steps. He covers the implementation of SLOs in great detail with mathematical and statistical models, which are helpful to further understand why observability data is so uniquely suited to SLOs (the basis of Chapter 13). His book also covers cultural practices that must shift as a result of adopting SLOs and that further illustrate some of the concepts introduced in our book.

Cloud Native Observability with OpenTelemetry by Alex Boten (Packt Publishing)

This book explores OTel with more depth and detail than we covered in this book. Boten’s book details core components of OTel (APIs, libraries, tools) as well as its base concepts and signal types. If you are interested in using pipelines to manage telemetry data, this book shows you how that’s done using the OpenTelemetry Collector. While we touched on the OpenTelemetry Collector, this book covers it in much greater detail. If you would like to dive deeper into OTel core concepts to discover more of what’s possible, we recommend picking up a copy.

Distributed Tracing in Practice by Austin Parker et al. (O’Reilly)

This book offers an in-depth guide to approaching application instrumentation for tracing, collecting the data that your instrumentation produces, and mining it for operational insights. While specific to tracing, this book covers instrumentation best practices and choosing span characteristics that lead to valuable traces. It is written by our friends at Lightstep, and it presents additional views on where distributed tracing is headed that are both informative and useful.

Honeycomb’s blog

Here, you can find more information from us regarding the latest in the moving target of emergent observability practices. This blog is occasionally specific to Honeycomb’s own observability tools. But, more often, it explores general observability concepts, advice (see “Ask Miss o11y,” our observability advice column), and write-ups from Honeycomb’s engineering team that often illustrate how observability shapes our own evolving practices.

Additionally, the footnotes and notes throughout this book lead to many more interesting and relevant resources from authors that we respect and look to when shaping our own views and opinions.

Predictions for Where Observability Is Going

It’s a bold move to commit predictions to print in publications and an even bolder move to circle back and see how well they aged. But, given our position in the center of the observability ecosystem, we feel relatively well equipped to make a few informed predictions about where this industry is headed in the coming years. These predictions are being generated in March 2022.

Three years from now, we think that OTel and observability will be successfully intertwined and may seem inseparable. We already see a lot of overlap among groups of people interested in developing and adopting OTel and people who are interested in further developing the category of tooling that fits the definition of observability as outlined in this book. The momentum and rise of OTel as the de facto solution for application instrumentation has been greatly helped by the fact that support for trace data is its most mature format. Metrics, log data, and profiling are in earlier stages of development, but we expect to see those reach the same level of maturity quickly, opening the door for even wider adoption in a variety of settings. We also believe that switching between different backend solutions with just a few configuration changes will become much easier than it is today (which is already fairly simple).1 We predict that how-to articles detailing how to switch from one vendor to another will proliferate and become hot commodities.

Most of this book focused on examples of debugging backend applications and infrastructure. However, we predict that observability will creep into more frontend applications as well. Today’s state of the art with understanding and testing browser applications involves either real user monitoring (RUM) or synthetic monitoring.

As the name indicates, RUM involves measuring and recording the experience of real application users from the browser. The focus of RUM is to determine the actual service-level quality delivered to users, to detect application errors or slowdowns, and to determine whether changes to your code have the intended effect on user experience. RUM works by collecting and recording web traffic without impacting code performance. In most cases, JavaScript snippets are injected into the page or native code within the application to provide feedback from the browser or client. To make this large volume of data manageable, RUM tools often use sampling or aggregation for consolidation. That consolidation often means that you can understand overall performance as a whole, but you can’t break it down to the level of understanding performance for any one given user in detail.

However, despite those limitations, RUM does have the advantage of measuring real user experiences. This means RUM tools can catch a broad range of unexpected real-world issues in application behavior. RUM can help you see anything from a regression that presents itself in only a new version of a niche mobile browser your development team has never heard of, to network delays for certain IP addresses in a specific country or region halfway across the globe. RUM can be helpful in identifying and troubleshooting last-mile issues. RUM differs from synthetic monitoring in that it relies on actual people clicking a page in order to take measurements.

Synthetic monitoring is a different approach that relies on automated tests going over a given set of test steps in order to take measurements. These tools take detailed application performance and experience measurements in a controlled environment. Behavioral scripts (or paths) are created to simulate the actions that customers might take with your applications. The performance of those paths is then continuously monitored at specified intervals. Somewhat like unit tests in your code, these paths are typically ones that a developer owns and runs themselves. These paths (simulations of typical user behavior when using your frontend application) must be developed and maintained, which takes effort and time. Commonly, only heavily used paths or business-critical processes are monitored for performance. Because synthetic tests must be scripted in advance, it’s simply not feasible to measure performance for every permutation of a navigational path that a user might take.

However, while synthetic monitoring tools don’t show you performance for real user experiences, they do have some advantages. They allow proactive testing for a wide array of known conditions you may care about (e.g., specific device types or browser versions). Because they typically create somewhat reproducible results, they can be included in automated regression test suites. That allows them to be run before code is deployed to real users, allowing them to catch performance issues before they can possibly create impacts to real users.

RUM and synthetic monitoring serve specific and different use cases. The use case for observability is to measure real user experiences in production—similar to RUM, but with a much higher degree of fidelity that allows you to debug individual customer experiences. As seen in Chapter 14, many teams use observability data in their CI/CD build pipelines (or in test suites). That means you can run end-to-end test scripts that exercise user paths in your system and monitor their performance by simply tagging originating test requests as such within your telemetry. We predict that within a few years, you won’t have to choose between RUM or synthetic monitoring for frontend applications. Instead, you’ll simply use observability for both use cases.
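To make the idea of tagging test-originated requests concrete, here is a minimal sketch. It assumes telemetry arrives as flat events with a `duration_ms` field, and it uses a hypothetical boolean field named `request.synthetic` to mark requests that came from an end-to-end test script. Neither the field name nor the helper functions come from any particular tool or standard.

```python
from statistics import median

# Hypothetical field name; choose whatever convention suits your telemetry.
SYNTHETIC_FIELD = "request.synthetic"


def tag_event(event, synthetic):
    """Return a copy of the event marked as test-originated or real traffic."""
    tagged = dict(event)
    tagged[SYNTHETIC_FIELD] = synthetic
    return tagged


def median_duration_by_origin(events):
    """Compare median latency for synthetic requests against real requests."""
    groups = {"synthetic": [], "real": []}
    for event in events:
        origin = "synthetic" if event.get(SYNTHETIC_FIELD) else "real"
        groups[origin].append(event["duration_ms"])
    # Omit an origin entirely when no events were seen for it.
    return {origin: median(values) for origin, values in groups.items() if values}


# Illustrative events: two from a test script, one from a real user.
events = [
    tag_event({"path": "/checkout", "duration_ms": 100}, synthetic=True),
    tag_event({"path": "/checkout", "duration_ms": 300}, synthetic=True),
    tag_event({"path": "/checkout", "duration_ms": 50}, synthetic=False),
]
summary = median_duration_by_origin(events)
```

Because the tag is just another dimension on the event, the same data can answer both kinds of question: filter to synthetic requests to watch a scripted path, or exclude them to see only what real users experienced.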

We also predict that within three years, OTel’s automatic instrumentation will have caught up and be comparable to the non-OTel auto-instrumentation packages offered in a variety of vendor-specific libraries and agents. Today, using OTel is still a choice for most teams because (depending on your language of choice) the automatic instrumentation included with OTel may not be up to par with the instrumentation offered by a specific vendor’s proprietary offerings. The open source nature of OTel, paired with its incredibly vibrant developer ecosystem, means that this will eventually no longer be the case. Automatic instrumentation with OTel will be at least as rich as alternative instrumentation that brings with it the trap of vendor lock-in. In time, using OTel will become a no-brainer and the de facto starting point for any application observability initiative (this is already starting to happen today).

You shouldn’t interpret that to mean that automatic instrumentation will become the only thing needed to generate useful telemetry for an observability tool. Custom instrumentation (see Chapter 7) will continue to be absolutely essential to debug the issues most relevant to the code driving your business logic. We predict that just as having code without comments is unthinkable, so too will having code without custom instrumentation. As engineers, we will all get accustomed to thinking in terms of instrumentation needed as we write new code.

In three years, build pipelines will be immensely faster, feedback loops will be shorter, and more teams will be automatically deploying changes to production at the final stage of their CI/CD pipelines (and they’ll really start practicing the D part of the CI/CD acronym). Continuous deployment can be a tricky affair, but the practice of decoupling feature releases from feature deployments will make it attainable for most organizations. Feature flags will continue to see more adoption, and deployment patterns like progressive delivery will become more common.
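Decoupling release from deployment can be sketched with a simple percentage-based feature flag. The example below is an assumption-laden illustration, not a reference to any feature-flag product: the flag name `new-checkout`, the `checkout` function, and the hashing scheme are all hypothetical. The new code path is deployed to production for everyone, but it is released only to the share of users selected by `rollout_percent`.

```python
import hashlib


def bucket(flag_name, user_id):
    """Map a (flag, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name, user_id, rollout_percent):
    """Return True when the user falls inside the current rollout percentage."""
    return bucket(flag_name, user_id) < rollout_percent


def checkout(user_id, rollout_percent):
    # Both code paths ship in the same deployment; the flag decides
    # which one a given user actually sees.
    if is_enabled("new-checkout", user_id, rollout_percent):
        return "new checkout flow"
    return "existing checkout flow"


users = [f"user-{i}" for i in range(200)]
at_zero = {checkout(u, 0) for u in users}
at_full = {checkout(u, 100) for u in users}
```

Hashing the user ID keeps each user in the same bucket as the percentage grows, so someone who sees the new flow at 10 percent still sees it at 50 percent. That stability is what lets a progressive rollout be widened, or rolled back, without a new deployment.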

The space that we will personally be watching closely is that of developer workflows. As an industry, we need more ways to connect observability to the act of writing and shipping code, as early as possible (see Chapter 11). As an industry, we need to continue to collapse the space between input (writing code) and output (running code). Few developers have the tooling today to get fast feedback on how the behavior of their code changes after each deployment. Developers need to closely understand how changes to their code impact users in production with each new iteration. Incredibly few developers have the ability to actually do that. But, for those developers who do, the difference in experience is transformational.

Anecdotally, we hear stories from these developers about how that experience is so foundationally necessary that they can’t imagine working without that ability ever again. Quantitatively, we’re starting to see tangible benefits materialize: the ability to move faster, waste less time, make fewer errors, and catch those few errors more swiftly when they do occur (see Chapter 19). In short, learning how to use observability helps them become better software engineers. We believe the industry needs fundamentally better developer workflows in production, and we predict observability will be the path for many to get there.

Can observability lower the bar to achieve that type of transformational experience for more developers? Will those dopamine hits of feeling like a freaking wizard every time you solve a previously unsolvable problem in production be enough to keep the surge of observability adoption going? Can we, as an industry, make observability more accessible to every engineering team?

Time will tell. Watch this space. And let us know how you’re progressing. You can always drop us a line on Twitter: @mipsytipsy, @lizthegrey, and @gmiranda23 (respectively).

Charity, Liz, and George

1 Vera Reynolds, for example, provides the tutorial “OpenTelemetry (OTel) Is Key to Avoiding Vendor Lock-in” on sending trace data to Honeycomb and New Relic by using OTel.
