Part V. Spreading Observability Culture

In Part IV, we focused on addressing challenges when practicing observability at scale. In this final section, we explore the cultural mechanisms you can use to help drive observability adoption efforts across your organization.

Observability often starts within one particular team or business unit in an organization. To spread a culture of observability, teams need support from various stakeholders across the business. This section breaks down how that support comes together to help proliferate an observability practice.

Chapter 19 examines material outcomes and organizational challenges, and shows how to make a business case grounded in the business benefits of adopting observability practices.

Chapter 20 looks at how teams beyond those in engineering can further their own goals with the use of observability tools. Helping adjacent teams learn how to understand and use observability data to achieve their goals will help create allies that can push your observability adoption initiatives forward.

Chapter 21 presents a maturity model, informed by industry-leading data, that can help you measure your progress toward the benefits of observability laid out in earlier chapters. The model is most useful as a rough guide rather than as a rigid prescription.

Finally, in Chapter 22, we illuminate the path forward that we hope you will join us in building.

Chapter 19. The Business Case for Observability

Observability often starts within one particular team or business unit in an organization. To spread a culture of observability, teams need support from various stakeholders across the business.

In this chapter, we’ll start breaking down how that support comes together by laying out the business case for observability. Some organizations adopt observability practices in response to dire challenges that cannot be addressed by traditional approaches. Others may need a more proactive approach to changing traditional practices. Regardless of where you may be in your observability journey, this chapter will show you how to make a business case for observability within your own company.

We start by looking at both the reactive and proactive approaches to instituting change. We’ll examine nonemergency situations to identify a set of circumstances that can point to a critical need to adopt observability outside the context of catastrophic service outages. Then we’ll cover the steps needed to support creation of an observability practice, evaluate various tools, and know when your organization has achieved a state of observability that is “good enough” to shift your focus to other initiatives.

The Reactive Approach to Introducing Change

Change is hard. Many organizations tend to follow the path of least resistance. Why fix the things that aren’t broken (or perceived to be)? Historically, production systems have operated just fine for decades without observability. Why rock the boat now?

Simpler systems could be reasoned about by engineers intimately familiar with the finer points of their architectures. As seen in Chapter 3, it isn’t until traditional approaches suddenly and drastically fall short that some organizations realize their now-critical need for observability. But introducing fundamental change into an organization in reactive, knee-jerk ways can have unintended consequences. The rush to fix mission-critical business problems often produces oversimplified approaches that rarely lead to useful outcomes.

Consider the case of reactive change introduced as the result of critical service outages. For example, an organization might perform a root-cause analysis to determine why an outage occurred, and the analysis might point to a singular reason. In mission-critical situations, executives are often tempted to use that reason to drive simplified remediations that demonstrate the problem has been swiftly dealt with. When the smoking gun for an outage can be pointed to as the line in the root-cause analysis that says, “We didn’t have backups,” that finding can be used to justify demoting the employee who deleted the important file and engaging consultants to introduce a new backup strategy; executives then breathe a sigh of relief, believing the appropriate gap has been closed.

While that approach might seem to offer a sense of security, it’s ultimately false. Why was that one file able to create a cascading system failure? Why was a file that critical so easily deleted? Could the situation have been better mitigated with more immutable infrastructure? Any number of approaches in this hypothetical scenario might better treat the underlying causes rather than the most obvious symptoms. In a rush to fix problems quickly, often the oversimplified approach is the most tempting to take.

Another reactive approach originates from an organization’s inability to recognize dysfunction that no longer has to be tolerated. The most common dysfunction needlessly tolerated with traditional tooling is an undue burden on software engineering and operations teams that prevents them from focusing on delivering innovative work.

As seen in Chapter 3, teams without observability frequently waste time chasing down incidents with identical symptoms (and underlying causes). Issues often repeatedly trigger fire drills, and those drills cause stress for engineering teams and the business. Engineering teams experience alert fatigue that leads to burnout and, eventually, churn—costing the business lost expertise among staff and the time it takes to rebuild that expertise. Customers experiencing issues will abandon their transactions—costing the business revenue and customer loyalty. Being stuck in this constant firefighting and high-stress mode creates a downward spiral that undermines engineering team confidence when making changes to production, which in turn creates more fragile systems, which in turn require more time to maintain, which in turn slows the delivery of new features that provide business value.

Unfortunately, many business leaders accept these hurdles as the normal state of operations. They introduce processes that they believe help mitigate these problems, such as change advisory boards or rules prohibiting teams from deploying code changes on a Friday. They expect on-call rotations to burn out engineers from time to time, so they allow on-call exemptions for their rockstar engineers. Many toxic cultural practices in engineering teams can be traced back to situations that start with a fundamental lack of understanding of their production systems.

Signs that your business may be hitting a breaking point without observability in its systems include—but are not limited to—some of the following scenarios:

· Customers discover and report critical bugs in production services long before they are detected and addressed internally.

· When minor incidents occur, detecting and recovering from them often takes so long that they escalate into prolonged service outages.

· The backlog of investigation necessary to troubleshoot incidents and bugs continues to grow because new problems pile up faster than they can be retrospected or triaged.

· The amount of time spent on break/fix operational work exceeds the amount of time your teams spend on delivering new features.

· Customer satisfaction with your services is low because of repeated poor performance that your support teams cannot verify, replicate, or resolve.

· New features are delayed by weeks or months because engineering teams are dealing with disproportionately large amounts of unexpected work necessary to figure out how various services are all interacting with one another.

Other factors contributing to these scenarios may require additional mitigation approaches. However, teams experiencing a multitude of these symptoms more than likely need to address a systemic lack of observability in their systems. Teams operating in these ways display a fundamental lack of understanding of their production systems’ behavior, such that it negatively impacts their ability to deliver against business goals.

The Return on Investment of Observability

At its core, observability is about enabling teams to answer previously unknowable questions, or to address unknown unknowns, as we commonly phrase it. The ability to debug application issues in a data-driven and repeatable manner with the core analysis loop (see Chapter 8) allows teams to effectively manage systems that commonly fail in unpredictable ways. Given the ubiquity of complex distributed systems as today’s de facto application architecture (heterogeneous environments comprising any mix of cloud infrastructure, on-premises systems, containers and orchestration platforms, serverless functions, various SaaS components, etc.), the ability to effectively debug unknown unknowns can make or break your company’s mission-critical digital services.

As observability tool vendors, we have learned through anecdotal feedback and industry research that companies adopting observability practices gain highly tangible business benefits. We engaged Forrester Research to quantify these benefits among our own customer base.1 While the measures in that study are specific to our own solution, we do believe that some of the traits can be universally expected regardless of the tool (presuming it has the same observability capabilities we’ve described in this book).

We believe observability universally impacts the bottom line in four important ways:

Higher incremental revenue

Observability tools help teams improve uptime and performance, leading to increased incremental revenue directly as a result of improving code quality.

Cost savings from faster incident response

Observability significantly reduces labor costs via faster mean time to detect (MTTD) and mean time to resolve (MTTR), improved query response times, the ability to find bottlenecks quicker, reduction of time spent on call, and time saved by avoiding rollbacks.

Cost savings from avoided incidents

Observability tools enable developers to find causes of problems before they become critical and long-lasting, which helps prevent incidents.

Cost savings from decreased employee churn

Implementing observability improves job satisfaction and decreases developer burnout, alert and on-call fatigue, and turnover.

Other quantifiable benefits may exist, depending on how tools are implemented. But the preceding benefits should be universal for businesses using tools that meet the functional requirements for observability (see Chapters 1 and 8)—and adopting the practices described in this book.
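To help quantify these savings when making a business case, a back-of-the-envelope model can translate improvements in MTTR and incident counts into dollar figures. The sketch below uses purely illustrative numbers; substitute baselines from your own organization:

```python
# Back-of-the-envelope incident-cost model for a business case.
# Every figure below is an illustrative assumption, not real data.

def annual_incident_cost(incidents_per_month, mttr_hours,
                         engineers_per_incident, hourly_cost,
                         revenue_loss_per_hour):
    """Rough yearly incident cost: responder labor plus lost revenue."""
    incidents_per_year = incidents_per_month * 12
    labor = incidents_per_year * mttr_hours * engineers_per_incident * hourly_cost
    lost_revenue = incidents_per_year * mttr_hours * revenue_loss_per_hour
    return labor + lost_revenue

# Baseline: 20 incidents/month, 4-hour MTTR, 3 responders at $100/hour,
# and an assumed $2,000/hour of revenue lost during an incident.
before = annual_incident_cost(20, 4.0, 3, 100, 2000)

# Projection: observability cuts MTTR to 1 hour and incidents by 25%.
after = annual_incident_cost(15, 1.0, 3, 100, 2000)

print(f"Baseline annual cost:  ${before:,.0f}")
print(f"Projected annual cost: ${after:,.0f}")
print(f"Projected savings:     ${before - after:,.0f}")
```

Even a rough model like this gives stakeholders a concrete anchor for the discussion; refine it with your own incident data as your baselines improve.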

The Proactive Approach to Introducing Change

A proactive approach to introducing change is to recognize the symptoms in the reactive situations outlined earlier as abnormal and preventable. An early way to gain traction and make a business case for observability is to highlight the impact that can be made in reducing common metrics, such as the time to detect (TTD) and time to resolve (TTR) issues within your services. While these measures are far from perfect, they are commonly in use in many organizations and often well understood by executive stakeholders.

NOTE

Adaptive Capacity Labs has a great take on moving past shallow incident data in a blog post written by John Allspaw, and observability can also demonstrate wins in those more nuanced ways. For the purposes of this chapter, however, we focus on the more flawed but more widely understood metrics of TTD and TTR.

An initial business case for introducing observability into your systems can be twofold. First, it provides your teams a way to find individual user issues that are typically hidden when using traditional monitoring tools, thereby lowering TTD (see Chapter 5). Second, automating the core analysis loop can dramatically reduce the time necessary to isolate the correct source of issues, thereby lowering TTR (see Chapter 8).

Once early gains in these areas are proven, it is easier to garner support for introducing more observability throughout your application stack and organization. Frequently, we see teams initially approach the world of observability from a reactive state—typically, seeking a better way to detect and resolve issues. Observability can immediately help in these cases. But second-order benefits should also be measured and presented when making a business case.

The upstream impact of detecting and resolving issues faster is that it reduces the amount of unexpected break/fix operational work for your teams. A qualitative improvement is often felt here by reducing the burden of triaging issues, which lowers on-call stress. This same ability to detect and resolve issues also leads to reducing the backlog of application issues, spending less time resolving bugs, and spending more time creating and delivering new features. Measuring this qualitative improvement—even just anecdotally—can help you build a business case that observability leads to happier and healthier engineering teams, which in turn creates greater employee retention and satisfaction.

A third-order benefit comes from the ability to understand the performance of individual user requests and the cause of bottlenecks: teams can quickly understand how best to optimize their services. More than half of mobile users will abandon transactions after three seconds of load time.2 Measuring the rate of successful user transactions and correlating it with gains in service performance is both possible to measure and likely to occur in an observable application. Another obvious business use case for observability is higher customer satisfaction and retention.

If the preceding outcomes matter to your business, you have a business case for introducing observability into your organization. Rather than waiting for a series of catastrophic failures to prompt your business to address the symptoms of nonobservable systems, the proactive approach introduces observability into your sociotechnical systems with small, achievable steps that have big impacts. Let’s examine how you can take those steps.

Introducing Observability as a Practice

Similar to introducing security or testability into your applications, observability is an ongoing practice whose responsibility is shared by everyone who develops and runs a production service. Building effectively observable systems is not a one-time effort. You cannot simply take a checkbox approach to introducing technical capabilities and declare that your organization has “achieved” observability any more than you can do that with security or testability. Observability must be introduced as a practice.

Observability begins as a capability that can be measured as a technical attribute of a system: can your system be observed or not (see Chapter 1)? As highlighted several times throughout this book, production systems are sociotechnical. Once a system has observability as a technical attribute, the next step is measured by how well your teams and the system operate together (see Part III). Just because a system can be observed does not mean that it is being observed effectively.

The goal of observability is to provide engineering teams the capability to develop, operate, thoroughly debug, and report on their systems. Teams must be empowered to explore their curiosity by asking arbitrary questions about their system to better understand its behavior. They must be incentivized to interrogate their systems proactively, both by their tools and with management support. A sophisticated analytics platform is useless if the team using it feels overwhelmed by the interface or is discouraged from querying for fear of running up a large bill.

A well-functioning observability practice not only empowers engineers to ask questions that help detect and resolve issues in production, but also should encourage them to begin answering business intelligence questions in real time (see Chapter 20). If nobody is using the new feature that the engineering team has built, or if one customer is at risk of churning because they are persistently experiencing issues, that is a risk to the health of your business. Practicing observability should encourage engineers to adopt a cross-functional approach to measuring service health beyond its performance and availability.

As DevOps practices continue to gain mainstream traction, forward-thinking engineering leadership teams remove barriers between engineering and operations teams. Removing these artificial barriers empowers teams to take more ownership of the development and operation of their software. Observability helps engineers lacking on-call experience to better understand where failures are occurring and how to mitigate them, eroding the artificial wall between software development and operations. Similarly, observability erodes the artificial wall between software development, operations, and business outcomes. Observability gives software engineering teams the appropriate tools to debug and understand how their systems are being used. It helps them shed their reliance on functional handoffs, excessive manual work, runbooks, guesswork, and external views of system health measures that impact business goals.

It is beyond the scope of this chapter to outline all of the practices and traits commonly shared by high-performing engineering teams. The DORA 2019 Accelerate State of DevOps Report describes many of the essential traits that separate elite teams from their low-performing counterparts. Similarly, teams introducing observability benefit from many of the practices described in the report.

When introducing an observability practice, engineering leaders should first ensure that they are creating a culture of psychological safety. Blameless culture fosters a psychologically safe environment that supports experimentation and rewards curious collaboration. Encouraging experimentation is necessary to evolve traditional practices. DORA’s year-over-year reporting demonstrates both the benefits of blameless culture and its inextricable link with high-performing teams.

NOTE

A longer-form guide to practicing blameless culture can be found in PagerDuty’s Blameless Postmortem documentation.

With a blameless culture in practice, business leaders should also ensure that a clear scope of work exists when introducing observability (for example, happening entirely within one introductory team or line of business). Baseline performance measures for TTD and TTR can be used as a benchmark to measure improvement within that scope. The infrastructure and platform work required should be identified, allocated, and budgeted in support of this effort. Only then should the technical work of instrumentation and analysis of that team’s software begin.
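As a sketch of how such a baseline might be derived, the following Python computes median TTD and TTR from incident records. The records, field names, and timestamps here are hypothetical; in practice, this data would come from your incident tracker:

```python
from datetime import datetime
from statistics import median

# Hypothetical incident records (all fields are illustrative).
incidents = [
    {"started": "2023-03-01T10:00", "detected": "2023-03-01T10:42", "resolved": "2023-03-01T13:30"},
    {"started": "2023-03-07T02:15", "detected": "2023-03-07T02:20", "resolved": "2023-03-07T03:05"},
    {"started": "2023-03-19T16:00", "detected": "2023-03-19T17:10", "resolved": "2023-03-19T21:00"},
]

def minutes_between(start, end):
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

# TTD: impact start until detection. TTR: detection until resolution.
ttd = [minutes_between(i["started"], i["detected"]) for i in incidents]
ttr = [minutes_between(i["detected"], i["resolved"]) for i in incidents]

# Medians are more robust than means against a single outlier incident.
print(f"Median TTD: {median(ttd):.0f} min")
print(f"Median TTR: {median(ttr):.0f} min")
```

Recomputing the same figures after the pilot gives a like-for-like improvement number to present to stakeholders.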

Using the Appropriate Tools

Although observability is primarily a cultural practice, it does require engineering teams to possess the technical capability to instrument their code, store the emitted telemetry data, and analyze that data in response to their questions. A large portion of the initial technical effort to introduce observability requires setting up tooling and instrumentation.

At this point, some teams attempt to roll their own observability solutions. As seen in Chapter 15, the ROI of building a bespoke observability platform that does not align with your company’s core competencies is rarely worthwhile. Most organizations find that building a bespoke solution can be prohibitively difficult, time-consuming, and expensive. Instead, a wide range of solutions are available with various trade-offs to consider, such as commercial versus open source, on-premises versus hosted, or a combination of buying and building a solution to meet your needs.

Instrumentation

The first step to consider is how your applications will emit telemetry data. Traditionally, vendor-specific agents and instrumentation libraries were your only choice, and those choices brought with them a large degree of vendor lock-in. Currently, for instrumentation of both frameworks and application code, OpenTelemetry is the emerging standard (see Chapter 7). It supports every open source metric and trace analytics platform, and is supported by almost every commercial vendor in the space. There is no longer a reason to lock into one specific vendor’s instrumentation framework, nor to roll your own agents and libraries.

OTel allows you to configure your instrumentation to send data to the analytics tool of your choice. By using a common standard, it’s possible to easily demo the capabilities of any analytics tool by simply sending your instrumentation data to multiple backends at the same time.

When considering the data that your team must analyze, it’s an oversimplification to break observability into categories like metrics, logging, and tracing. While those can be valid categories of observability data, achieving observability requires those data types to interact in ways that give your teams an appropriate view of their systems. While messaging that describes observability as three pillars is useful as a marketing headline, it misses the big picture. It is more useful to instead think about which data type or types are best suited to your use case, and which can be generated on demand from the others.
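For instance, metrics can often be derived on demand from event or trace data rather than collected through a separate pipeline. This standard-library sketch (with hypothetical field names and values) computes an error rate and a latency percentile directly from wide events:

```python
from statistics import quantiles

# Hypothetical wide events, one per request, as a tracing-first system
# might store them. All field names and values are illustrative.
events = [
    {"route": "/checkout", "duration_ms": 120, "status": 200},
    {"route": "/checkout", "duration_ms": 340, "status": 200},
    {"route": "/checkout", "duration_ms": 95,  "status": 500},
    {"route": "/search",   "duration_ms": 40,  "status": 200},
]

# Derive per-route metrics on demand from the underlying events.
checkout = [e for e in events if e["route"] == "/checkout"]
error_rate = sum(e["status"] >= 500 for e in checkout) / len(checkout)
p95 = quantiles([e["duration_ms"] for e in checkout], n=20, method="inclusive")[-1]

print(f"error rate: {error_rate:.2f}")
print(f"p95 latency: {p95:.0f} ms")
```

The reverse is not true: pre-aggregated metrics cannot regenerate the individual events behind them, which is one reason the choice of primary data type matters.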

Data Storage and Analytics

Once you have telemetry data, you need to consider the way it’s stored and analyzed. Data storage and analytics are often bundled into the same solution, but that depends on whether you decide to use open source or proprietary options.

Commercial vendors typically bundle storage and analytics. Each vendor has differentiating features for storage and analytics, and you should consider which of those best help your teams reach their observability goals. Vendors of proprietary all-in-one solutions at the time of writing include Honeycomb, Lightstep, New Relic, Splunk, Datadog, and others.

Open source solutions typically require separate approaches to data storage and analytics. Open source analytics frontends include Grafana, Prometheus, and Jaeger. While they handle analytics, they all require a separate data store in order to scale. Popular open source data storage layers include Cassandra, Elasticsearch, M3, and InfluxDB.

NOTE

Consider how the open source software you choose is licensed and how that impacts your usage. For example, both Elasticsearch and Grafana have recently made licensing changes you should consider before using these tools.

Having so many options is great. But you must also carefully weigh the operational load incurred by running your own data storage cluster. For example, the ELK stack is popular because it fulfills needs in the log management and analytics space. But end users frequently report that the care and maintenance of their ELK clusters gobbles up systems engineering time, and that the associated management and infrastructure costs grow quickly. As a result, you’ll find a competitive market for managed open source telemetry data storage (e.g., ELK as a service).

When considering data storage, we also caution against finding separate solutions for each category (or pillar) of observability data you need. Similarly, attempting to bolt modern observability functionality onto a traditional monitoring system is likely to be fraught with peril. Since observability arises from the way your engineers interact with your data to answer questions, having one cohesive solution that works seamlessly is better than maintaining three or four separate systems. Using disjointed systems for analysis places the burden of carrying context and translation between those systems on engineers and creates a poor usability and troubleshooting experience. For more details on how approaches can coexist, refer to Chapter 9.

Rolling Out Tools to Your Teams

When considering tooling options, it’s important to ensure that you are investing precious engineering cycles on differentiators that are core to your business needs. Consider whether your choice of tools is providing more innovation capacity or draining that capacity into managing bespoke solutions. Does your choice of tooling require creating a larger and separate team for management? Observability’s goal isn’t to create bespoke work within your engineering organization; it’s to save your business time and money while increasing quality.

That’s not to say that certain organizations should not create observability teams. However, especially in larger organizations, a good observability team will focus on helping each product team achieve observability in its platform or partner with those teams through the initial integration process. After evaluating which platform best fits the needs of your pilot team, an observability team can help make the same solutions more accessible to your engineering teams as a whole. For more details on structuring an observability team, refer to Chapter 15.

Knowing When You Have Enough Observability

Like security and testability, more work always remains to be done with observability. Business leaders may struggle with knowing when to make investing in observability a priority and when observability is “good enough” that other concerns can take precedence. While we encourage full instrumentation coverage in your applications, we also recognize that observability exists in a landscape with competing needs. From a pragmatic perspective, it helps to know how to recognize when you have enough observability as a useful checkpoint for determining the success of a pilot project.

If the symptom of teams flying blind without observability is excessive rework, teams with sufficient observability should have predictable delivery and sufficient reliability. Let’s examine how to recognize that milestone both in terms of cultural practices and key results.

Once observability has become a foundational practice within a team, the outside intervention required to maintain excellent observability should become a minimal and routine part of ongoing work. Just as a team wouldn’t think to check in new code without associated tests, so too should teams practicing observability treat associated instrumentation as part of any code-review process. Instead of merging code and shutting down their laptops at the end of the day, it should be second nature for teams practicing observability to watch how their code behaves as it reaches each stage of deployment.

Instead of code behavior in production being “someone else’s problem,” teams with enough observability should be excited to see how real users benefit from the features they are delivering. Every code review should consider whether the telemetry bundled with the change is appropriate to understand the impact this change will have in production. Observability should also not just be limited to engineers; bundled telemetry should empower product managers and customer success representatives to answer their own questions about production (see Chapter 20). Two useful measures indicating that enough observability is present are a marked improvement in self-serve fulfillment for one-off data requests about production behavior and a reduction in product management guesswork.

As teams reap the benefits of observability, their confidence level for understanding and operating in production should rise. The proportion of unresolved “mystery” incidents should decrease, and time to detect and resolve incidents will decrease across the organization. However, a frequent mistake for measuring success at this point is over-indexing on shallow metrics such as the overall number of incidents detected. Finding more incidents and comfortably digging into near misses is a positive step as your teams gain an increased understanding of the way production behaves. That often means previously undetected problems are now being more fully understood. You’ll know you’ve reached an equilibrium when your engineering teams live within their modern observability tooling to understand problems and when disjointed legacy tooling is no longer a primary troubleshooting method.

Whenever your teams encounter a new problem that poses questions your data cannot answer, they will find it easier to take the time to fill in that telemetry gap rather than attempting to guess at what might be wrong. For example, if a mystery trace span is taking too long for inexplicable reasons, they will add subspans to capture smaller units of work within it, or add attributes to understand what is triggering the slow behavior. Observability always requires some care and feeding as integrations are added or the surface area of your code changes. But even so, the right choice of observability platform will still drastically reduce your overall operational burdens and total cost of ownership (TCO).

Conclusion

The need for observability is recognized within teams for a variety of reasons. Whether that need arises reactively in response to a critical outage, or proactively by realizing how its absence is stifling innovation on your teams, it’s critical to create a business case in support of your observability initiative.

Similar to security and testability, observability must be approached as an ongoing practice. Teams practicing observability must make a habit of ensuring that any changes to code are bundled with proper instrumentation, just as they’re bundled with tests. Code reviews should ensure that the instrumentation for new code achieves proper observability standards, just as they ensure it also meets security standards. Observability requires ongoing care and maintenance, but you’ll know that observability has been achieved well enough by looking for the cultural behaviors and key results outlined in this chapter.

In the next chapter, we’ll look at how engineering teams can create alliances with other internal teams to help accelerate the adoption of observability culture.

1 You can reference a recap of Forrester Consulting’s Total Economic Impact (TEI) framework findings for Honeycomb in the blog post “What Is Honeycomb’s ROI? Forrester’s Study on the Benefits of Observability” by Evelyn Chea.

2 Tammy Everts, “Mobile Load Time and User Abandonment”, Akamai Developer Blog, September 9, 2016.
