Part III. Observability for Teams

In Part II, we examined various technical aspects of observability, how those concepts build on one another to enable the core analysis loop and debugging from first principles, and how that practice can coexist with traditional monitoring. In this part, we switch gears to look at the changes in social and cultural practices that help drive observability adoption across different teams.

Chapter 10 tackles many of the common challenges teams face when first starting down the path of observability. How and where you start will always depend on multiple factors, but this chapter recaps many of the techniques we’ve seen work effectively.

Chapter 11 focuses on how developer workflows change when using observability. Though we’ve referenced this topic in earlier chapters, here we walk through more concrete steps. You’ll learn about the benefits developers gain by adding custom instrumentation into their code early in the development phase, and how that’s used to debug their tests and to ensure that their code works correctly all the way through production.

Chapter 12 looks at the potential that observability unlocks when it comes to using more sophisticated methods for monitoring the health of your services in production. This chapter introduces service-level objectives (SLOs) and how they can be used for more effective alerting.

Chapter 13 builds on the preceding chapter by demonstrating why event data is a key part of creating SLO alerts that are more accurate, actionable, and debuggable than alerts driven by SLOs based on metrics data.

Chapter 14 looks at how teams can use observability to debug and better understand other parts of their stack, like their CI/CD build pipelines. This guest-contributed chapter is written by Frank Chen, senior staff software engineer at Slack.

This part of the book focuses on how team workflows can change and benefit from observability practices, detailing scenarios and use cases that address common pain points for engineering teams managing modern software systems at any scale. In Part IV, we’ll look at the specific and unique challenges that occur when using observability tools at scale.

Chapter 10. Applying Observability Practices in Your Team

Let’s switch gears to focus on the fundamentals of observability from a social and cultural practice perspective. In this chapter, we provide several tips to help you get started with observability practices. If you’re in a leadership role within your engineering team—such as a team lead, a manager, or maybe the resident observability fan/champion—the hardest part of an observability implementation strategy (after getting management approval) is knowing where to start.

For us, this is a particularly tricky chapter to write. Having helped many teams start down this path, we know that no universal recipe for success exists. How and where you get started will always depend on many factors. As unsatisfying as “it depends” can be for an answer, the truth is that your journey with observability depends on particulars including the problems most pertinent to you and your team, the gaps in your existing tooling, the level of support and buy-in from the rest of your organization, the size of your team, and other such considerations.

Whatever approach works best for you is, by definition, not wrong. The advice in this chapter is not intended to suggest that this is the one true way to get started with observability (there is no singular path!). That said, we have seen a few emergent patterns and, if you are struggling with where to begin, some of these suggestions may be helpful to you. Feel free to pick and choose from any of the tips in this chapter.

Join a Community Group

Observability is an emerging practice, and the approaches are still relatively young, with plenty of exploration left to do. Whenever the practices and technology behind our sociotechnical systems are rapidly evolving, one of the best ways to learn and improve is by participating in a community of people who are struggling with variations on the same themes as you. Community groups connect you with other professionals who can quickly become a helpful network of friends and acquaintances.

As you and your community face similar challenges, you’ll have an opportunity to learn a great deal very quickly just by hanging out in Slack groups and talking to other people who are banging against the same types of problems. Community groups allow you to connect with people beyond your normal circles from a variety of backgrounds. By actively participating and understanding how other teams handle some of the same challenges you have, you’ll make comparative observations and learn from the experiences of others.

Over time, you’ll also discover other community members who share similarities in tech stack, team size, organizational dynamics, and so forth. Those connections will give you someone to turn to as a sounding board, or for firsthand experience with solutions or approaches you might also be considering. Having that type of shared context before you pull the trigger on new experiments can be invaluable. Actively participating in a community group will save you a ton of time and heartbreak.

Participating in a community will also keep you attuned to developments you may have otherwise missed. Different providers of observability tools participate in different communities to better understand user challenges, gather feedback on new ideas, or just generally take the pulse of what’s happening. Joining a community specific to your observability tool of choice can also keep you abreast of new developments as they happen.

When joining a community, remember that community relationships are a two-way street. Don’t forget to do your share of chopping wood and carrying water: show up and start contributing by helping others first. Being a good community citizen means participating and helping the group for a while before dropping any heavy asks for help. In other words, don’t speak up only when you need something from others. Communities are only as strong as you make them.

If you need a place to start, we recommend checking out the CNCF Technical Advisory Group (TAG) for Observability. There you’ll find both Slack chat groups and regular online meetings. The OpenTelemetry Community page also lists useful resources. More generally, a lot of conversations around observability happen via Twitter (search for “observability” and you’ll find people and topics to follow). More specifically, product-focused Slack groups such as Honeycomb’s Pollinators Slack exist, where you’ll find a mix of general and vendor-specific information. We also recommend Michael Hausenblas’s newsletter o11y news.

Start with the Biggest Pain Points

Introducing any new technology can be risky. As a result, new technology initiatives often target small and inconspicuous services as a place to start. Counterintuitively, for observability, starting small is one of the bigger mistakes we often see people make.

Observability tools are designed to help you quickly find elusive problems. Starting with an unobtrusive and relatively unimportant service will have the exact opposite effect of proving the value of observability. If you start with a service that already works relatively well, your team will experience all of the work it takes to get started with observability and get none of the benefits.

When spearheading a new initiative, it’s important to get points on the board relatively quickly. Demonstrate value, pique curiosity, and garner interest by solving hard or elusive problems right off the bat. Has a flaky and flappy service been waking people up for weeks, yet nobody can figure out the right fix? Start there. Do you have a problem with constant database congestion, yet no one can figure out the cause? Start there. Are you running a service bogged down by an inexplicable load that’s being generated by a yet-to-be-identified user? That’s your best place to start.

Quickly demonstrating value will win over naysayers, create additional support, and further drive observability adoption. Don’t pick an easy problem to start with. Pick a hard problem that observability is designed to knock out of the park. Start with that service. Instrument the code, deploy it to production, explore with great curiosity, figure out how to find the answer, and then socialize that success. Show off your solution during your weekly team meeting. Write up your findings and methodologies; then share them with the company. Make sure whoever is on call for that service knows how to find that solution.

The fastest way to drive adoption is to solve the biggest pain points for teams responsible for managing their production services. Target those pains. Resist the urge to start small.

Buy Instead of Build

Similar to starting with the biggest pain points, the decision of whether to build your own observability tooling or buy a commercially available solution comes down to proving return on investment (ROI) quickly. We examine this argument more closely in Chapter 15. For now, frame the decision to favor putting in the least amount of effort to prove the greatest amount of value. Building an entire solution yourself is the greatest possible effort with the longest possible time to value.

Be prepared to try out multiple observability solutions to see if they meet the functional requirements laid out in Chapter 1. Remember that observability allows you to understand and explain any state your system can get into, no matter how novel or bizarre. You must be able to comparatively debug that bizarre or novel state across all dimensions of system state data, and combinations of dimensions, in an ad hoc manner, without being required to define or predict those debugging needs in advance.

Unfortunately, at the time of this writing, few tools exist that are able to deliver on those requirements. While the marketing departments of various vendors are happy to apply the observability label to tools in their product suites, few are able to unlock the workflows and benefits outlined in this book. Therefore, as a user, you should be prepared to try out many tools to see which actually deliver and which are simply repackaging the same traditional monitoring tools sold for years.

Your best way to do that is to instrument your applications by using OpenTelemetry (see Chapter 7). Using OTel may not be as fast and easy as another vendor’s proprietary agent or libraries, but it shouldn’t be slow and difficult to use either. The small up-front investment in time necessary to do this right from the start will pay extreme dividends later when you decide to try multiple solutions to see which best meets your needs.

That misinformation and uncertainty is an unfortunate reality of today’s vendor ecosystem. As a result, it can be tempting to bypass that entire mess by building your own solution. However, that choice is also fraught with a few unfortunate realities (see Chapter 15).

First, few off-the-shelf open source observability tools are available for you to run yourself, even if you wanted to. The three leading candidates you’ll come across are Prometheus, the ELK stack (Elasticsearch, Logstash, and Kibana), and Jaeger. While each provides a specific valuable solution, none offers a complete solution that can deliver on the functional requirements for observability outlined in this book.

Prometheus is a time-series database (TSDB) for metrics monitoring. While Prometheus is arguably one of the most advanced metrics monitoring systems, with a vibrant development community, it still operates solely in the world of metrics-based monitoring solutions. It carries with it the inherent limitations of trying to use coarse measures to discover finer-grained problems (see Chapter 2).

The ELK stack focuses on providing a log storage and querying solution. Log storage backends are optimized for plain-text search at the expense of other types of searching and aggregation. While useful when searching for known errors, plain-text search becomes impractical when searching for answers to compound questions like “Who is seeing this problem, and when are they seeing it?” Analyzing and identifying relevant patterns among a flood of plain-text logs, a capability critical for observability, is challenging in an entirely log-based solution (see Chapter 8).

Jaeger is an event-based distributed tracing tool. Jaeger is arguably one of the most advanced open source distributed tracing tools available today. As discussed in Chapter 6, trace events are a fundamental building block for observable systems. However, a necessary component is also the analytical capability to determine which trace events are of interest during your investigation. Jaeger has some support for filtering certain data, but it lacks sophisticated capabilities for analyzing and segmenting all of your trace data (see Chapter 8).

Each of these tools provides different parts of a system view that can be used to achieve observability. The challenges in using them today are either running disparate systems that place the burden of carrying context between them into the minds of their operators, or building your own bespoke solution for gluing those individual components together. Today, no open source tool exists to provide observability capabilities in one out-of-the-box solution. Hopefully, that will not be true forever. But it is today’s reality.

Lastly, whichever direction you decide to take, make sure that the end result is actually providing you with observability. Stress-test the solution. Again, resist the urge to start small. Tackle big, difficult problems. Observability requires an ability to debug system state across all high-cardinality fields, with high dimensionality, in interactive and performant real-time exploration. Can this solution deliver the types of analysis and iterative investigation described in earlier chapters? If the answer is yes, congratulations! You found a great observability solution. If the answer is no, the product you are using has likely been misleadingly labeled as an observability solution when it is not.

The key to getting started with observability is to move quickly and demonstrate value early in the process. Choosing to buy a solution will keep your team focused on solving problems with observability tooling rather than on building their own. If you instrument your applications with OTel, you can avoid the trap of vendor lock-in and send your telemetry data to any tool you ultimately decide to use.

Flesh Out Your Instrumentation Iteratively

Properly instrumenting your applications takes time. The automatic instrumentation included with projects like OTel is a good place to start. But the highest-value instrumentation will be specific to the needs of your individual application. Start with as much useful instrumentation as you can, but plan to develop your instrumentation as you go.

NOTE

For more examples of automatic versus manual instrumentation, we recommend “What Is Auto-Instrumentation?”, a blog post by Mike Goldsmith.

One of the best strategies for rolling out observability across an entire organization is to instrument a painful service or two as you first get started. Use that instrumentation exercise as a reference point and learning exercise for the rest of the pilot team. Once the pilot team is familiar with the new tooling, use any new debugging situation as a way to introduce more and more useful instrumentation.

Whenever an on-call engineer is paged about a problem in production, the first thing they should do is use the new tooling to instrument problem areas of your application. Use the new instrumentation to figure out where issues are occurring. After the second or third time people take this approach, they usually catch on to how much easier and less time-consuming it is to debug issues by introducing instrumentation first. Debugging from instrumentation first allows you to see what’s really happening.

Once the pilot team members are up to speed, they can help others learn. They can provide coaching on creating helpful instrumentation, suggest helpful queries, or point others toward more examples of helpful troubleshooting patterns. Each new debugging issue can be used to build out the instrumentation you need. You don’t need a fully developed set of instrumentation to get immediate value with observability.
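To make “instrument first” cheap during an incident, a pilot team’s tooling can be as small as a helper that wraps a suspect function and emits one structured event per call. This is a backend-agnostic, standard-library-only sketch with hypothetical field names; a real version would send events to your observability tool rather than to stderr.

```python
import functools
import json
import sys
import time

def instrumented(fn):
    # Emit one structured event per call: function name, duration, outcome.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        event = {"name": fn.__name__, "error": None}
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            event["error"] = repr(exc)
            raise
        finally:
            event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
            print(json.dumps(event), file=sys.stderr)
    return wrapper

@instrumented
def lookup_user(user_id):
    # The suspect code path you're debugging.
    return {"id": user_id}
```

Dropping a decorator like this onto a problem area takes seconds, which is exactly what makes instrumentation-first debugging stick as a habit.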

A NOTE ON INSTRUMENTATION CONVENTIONS

The focus when you get started with instrumentation should be to prove as much value as possible with your chosen observability tool. The fact that you are iteratively fleshing out your instrumentation is more important than how you do it.

However, keep in mind that as your instrumentation grows and as adoption spreads across teams, you should introduce naming conventions for custom telemetry data you are generating. For examples of organization-wide standards, see Chapter 14.

Look for Opportunities to Leverage Existing Efforts

One of the biggest barriers to adopting any new technology is the sunk-cost fallacy. Individuals and organizations commit the sunk-cost fallacy when they continue a behavior or endeavor as a result of previously invested time, money, or effort.1 How much time, money, and effort has your organization already invested in traditional approaches that are no longer serving your needs?

Resistance to fundamental change often hits a roadblock when the perception of wasted resources creeps in. What about all those years invested in understanding and instrumenting for the old solutions? Although the sunk-cost fallacy isn’t logical, the feelings behind it are real, and they’ll stop your efforts dead in their tracks if you don’t do anything about them.

Always be on the lookout for, and leap at, any chance to forklift other work into your observability initiatives. If there’s an existing stream of data you can tee to a secondary destination, or critical data that can be surfaced in another way, jump on the opportunity to ship that data into your observability solution. Examples of situations you could use to do this include:

· If you’re using an ELK stack—or even just the Logstash part—it’s trivial to add a snippet of code to fork the output of a source stream to a secondary destination. Send that stream to your observability tool. Invite users to compare the experience.

· If you’re already using structured logs, all you need to do is add a unique ID to log events as they propagate throughout your entire stack. You can keep those logs in your existing log analysis tool, while also sending them as trace events to your observability tool.

· Try running observability instrumentation (for example, Honeycomb’s Beelines or OTel) alongside your existing APM solution. Invite users to compare and contrast the experience.

· If you’re using Ganglia, you can leverage that data by parsing the Extensible Markup Language (XML) dump it writes to /var/tmp, with a once-a-minute cron job that shovels that data into your observability tool as events. That’s a less-than-optimal use of observability, but it certainly creates familiarity for Ganglia users.

· Re-create the most useful of your old monitoring dashboards as easily referenceable queries within your new observability tool. While dashboards certainly have their shortcomings (see Chapter 2), this gives new users a landing spot where they can understand the system performance they care about at a glance, and also gives them an opportunity to explore and know more.
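The structured-logs suggestion above can be sketched in a few lines: generate one unique ID at the edge of the stack, attach it to every log event as the request propagates, and the same records can be assembled into traces downstream. All field names here are hypothetical.

```python
import json
import uuid

def new_request_id() -> str:
    # Generated once at the edge of the stack; propagated to downstream
    # services (for example, via an HTTP header).
    return uuid.uuid4().hex

def log_event(request_id: str, service: str, message: str, **fields) -> str:
    # One structured record that can feed your existing log pipeline and,
    # keyed by request_id, double as a trace event in your observability tool.
    record = {"request_id": request_id, "service": service,
              "message": message, **fields}
    return json.dumps(record)

rid = new_request_id()
line = log_event(rid, "api-gateway", "request received", path="/checkout")
```

The same JSON line satisfies both worlds: your existing log analysis tool sees a normal structured log, while an observability tool can correlate every record sharing a request_id into a single trace.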

Anything you can do to blend worlds will help lower that barrier to adoption. Other people need to understand how their concerns map into the new solution. Help them see their current world in the new world you’re creating for them. It’s OK if this sneak peek isn’t perfect; you’re shooting for familiarity, not polish. Even if the experience is rough around the edges, familiar names and recognizable data will invite people to interact with the new tooling far more than a completely scratch data set would.

Prepare for the Hardest Last Push

Using the preceding strategies to tackle the biggest pain points first and adopt an iterative approach can help you make fast progress as you’re getting started. But those strategies don’t account for one of the hardest parts of implementing an observability solution: crossing the finish line. Now that you have some momentum going, you also need a strategy for polishing off the remaining work.

Depending on the scope of work and the size of your team, rolling out new instrumentation iteratively as part of your on-call approach will typically get you half to two-thirds of the way toward introducing observability into every part of the stack you intend to cover. Inevitably, most teams discover that some parts of their stack are under less-active development than others. For those rarely touched parts of the stack, you’ll need a solid completion plan, or your implementation efforts are likely to lag.

Even with the best of project management intentions, as some of the pain that was driving observability adoption begins to ease, so too can the urgency of completing the implementation work. The reality most teams live in is that engineering cycles are scarce, demands are always competing, and another pain to address is always waiting around the corner once they’ve dealt with the one in front of them.

The goal of a complete implementation is to have built a reliable go-to debugging solution that can be used to fully understand the state of your production applications whenever anything goes wrong. Before you get to that end state, you will likely have various bits of tooling that are best suited to solving different problems. During the implementation phase, that disconnect can be tolerable because you’re working toward a more cohesive future. But in the long-term, not completing your implementation could create a drain of time, cognitive capacity, and attention from your teams.

That’s when you need to make a timeline to chug through the rest quickly. Your target milestone should be completing the remaining instrumentation work necessary for your team to use your observability tool as its go-to option for debugging issues in production. Consider setting up a special push to get there, like a hack week culminating in a party with prizes (or something equally silly and fun), to carry the project over the finish line.

During this phase, your team should strive to genericize instrumentation as often as possible so that it can be reused in other applications or by other teams within your organization. A common strategy here is to avoid repeating the initial implementation work by creating generic observability libraries that let you swap out underlying solutions without digging into code internals, similar to the approach taken by OTel (see Chapter 7).
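As a sketch of what such a generic library might look like: application code talks to a thin facade, and the backend behind it can be swapped without touching call sites. All names here are hypothetical; in practice the backend would wrap OTel or a vendor SDK.

```python
from typing import Protocol

class TelemetryBackend(Protocol):
    # The narrow contract every backend must satisfy.
    def emit(self, name: str, attributes: dict) -> None: ...

class InMemoryBackend:
    # Stand-in backend for this sketch; a real one would forward events
    # to OTel or a vendor SDK.
    def __init__(self):
        self.events = []

    def emit(self, name: str, attributes: dict) -> None:
        self.events.append((name, attributes))

class Telemetry:
    # The only interface application code sees; swapping vendors means
    # passing a different backend here, not editing every call site.
    def __init__(self, backend: TelemetryBackend):
        self._backend = backend

    def event(self, name: str, **attributes) -> None:
        self._backend.emit(name, attributes)

backend = InMemoryBackend()
telemetry = Telemetry(backend)
telemetry.event("cache.miss", key="user:42", region="us-east-1")
```

Teams then depend on the facade rather than on any one vendor, which keeps the door open for the comparative tool evaluations recommended earlier in this chapter.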

Conclusion

Knowing exactly where and how to start your observability journey depends on the particulars of your team. Hopefully, these general recommendations are useful to help you figure out places to get started. Actively participating in a community of peers can be invaluable as your first place to dig in. As you get started, focus on solving the biggest pain points rather than starting in places that already operate smoothly enough. Throughout your implementation journey, remember to keep an inclination toward moving fast, demonstrating high value and ROI, and tackling work iteratively. Find opportunities to include as many parts of your organization as possible. And don’t forget to plan for completing the work in one big final push to get your implementation project over the finish line.

The tips in this chapter can help you complete the work it takes to get started with observability. Once that work is complete, using observability on a daily basis helps unlock other new ways of working by default. The rest of this part of the book explores those in detail. In the next chapter, we’ll examine how observability-driven development can revolutionize your understanding of the way new code behaves in production.

1 Hal R. Arkes and Catherine Blumer, “The Psychology of Sunk Costs,” Organizational Behavior and Human Decision Processes 35 (1985): 124–140.
