Livin' on the Edge Podcast

Charity Majors on Instrumenting Systems, Observability-Driven Development, and Honeycomb

About

When building microservice-based (distributed) systems, engineers must accept that most of the problems they will encounter in the future cannot be predicted today. Being able to observe the system, and to formulate and verify hypotheses about issues, is therefore vitally important. So is being able to ask ad hoc questions of your observability system without having to ship custom code or metrics updates. If engineers have to invest large amounts of time creating a custom dashboard for each issue they encounter, their workspace will be “littered with failed dashboards that are never seen again.”

Episode guests

Charity Majors

CTO at Honeycomb

Charity Majors is the CTO and co-founder of Honeycomb, a tool for software engineers to explore their code on production systems. Charity has been on call since age 17, a terrifying thought. She has been every sort of systems engineer and manager at Facebook, Parse, Linden Lab, etc., but somehow always ends up responsible for the databases. She likes free software, free speech and peaty single-malts.

In this episode of the Ambassador Livin’ on the Edge podcast, Charity Majors, CTO at Honeycomb and author of many great blog posts on observability and leadership, discusses the new approach needed when instrumenting microservices and distributed systems, the benefits of “observability-driven development (ODD)”, and how Honeycomb can help engineers with asking ad hoc questions about their production systems.

Be sure to check out the additional episodes of the “Livin' on the Edge” podcast.

Key takeaways from the podcast included

  • On-call alerting should be triggered by service level objectives (SLOs), rather than simply being triggered by an infrastructure failure or a monitoring threshold being breached. Engineers should only be woken up if the business is being impacted.
  • Engineers must move away from the classic approach of simply monitoring well-understood infrastructure metrics towards actively instrumenting code in order to be able to have more of a constant “conversation” with production systems.
  • Engineers should strive to understand what “normal” looks like in their system. By establishing baselines and scanning top-level metrics each day, an engineer should be able to quickly identify if something fundamental is going wrong after a release of their code.
  • The four metrics correlated with high-performing organizations, as published by Dr Nicole Forsgren in Accelerate, should always be tracked: lead time, deployment frequency, mean time to recovery, and change failure percentage.
  • Engineers work within a socio-technical system. Teamwork is vitally important, and so is the ability to rapidly develop and share mental models of issues. The UX of internal tooling is more important than many engineering teams realize.
  • Test-Driven Development (TDD) is a very useful methodology. A failing test that captures a requirement is created before any production code is written. However, due to the typical use of mocks and stubs to manage the interaction with external dependencies, TDD can effectively “end at the border of your laptop”.
  • Observability-Driven Development (ODD) is focused on defining instrumentation to determine what is happening in relation to a requirement before any code is written. “Just as you wouldn’t accept a pull-request without tests, you should never accept a pull-request unless you can answer the question, ‘how will I know when this isn’t working?’”
  • Developers need to understand “just enough” about the business requirements and the underlying infrastructure in order to be able to instrument their systems correctly.
  • Using modern release approaches like canary releases, dark launching, and feature flagging can help to mitigate the impact of any potential issues associated with the release.
  • Honeycomb is a tool for introspecting and interrogating your production systems. Honeycomb supports high-dimensionality of monitoring data. Engineers add a language-specific “Beeline” library or SDK to their application, and within their code they can add custom, business-specific metadata to each monitoring span, such as user ID or arbitrary customer data.
  • Honeycomb’s “BubbleUp” feature is intended to help explain how a selected subset of data points differs from the other points returned by a query. It surfaces potential places to look for “signals” within the data.
  • Although the “three pillars” observability model is useful, the primary goals of any observability system are to help an engineer to understand the underlying system, identify issues, and locate the cause of issues.
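
To make the first takeaway concrete, here is a minimal sketch of SLO-driven paging. It is an illustration only: the `SLO` class, the `burn_rate` and `should_page` helpers, the "checkout-availability" objective, and the paging threshold of 10 are all assumptions for the example, not something described in the episode. The idea is that an engineer is paged when the error budget is being spent much faster than the objective allows, rather than whenever a single infrastructure threshold is breached.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical SLO: the fraction of requests that must succeed."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def burn_rate(slo: SLO, total: int, failed: int) -> float:
    """How fast the error budget is being spent in the observed window.

    A value of 1.0 means errors arrive exactly at the rate the SLO allows.
    """
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo.target
    return (failed / total) / error_budget


def should_page(slo: SLO, total: int, failed: int, threshold: float = 10.0) -> bool:
    """Page only when the budget burns far faster than allowed.

    The default threshold is an assumption chosen for this example.
    """
    return burn_rate(slo, total, failed) >= threshold


checkout = SLO(name="checkout-availability", target=0.999)
print(should_page(checkout, total=10_000, failed=5))    # small blip, burn rate about 0.5
print(should_page(checkout, total=10_000, failed=200))  # burn rate about 20, users are affected
```

With these numbers the first call does not page and the second does, because only the second window consumes the error budget faster than the chosen threshold.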
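
The four metrics from Accelerate mentioned in the takeaways can be derived from a simple log of deployments. The sketch below assumes a hypothetical `Deployment` record with commit, deploy, and restore timestamps; the field names and the sample data are invented for illustration, and real teams may define each metric somewhat differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is fixed


def four_key_metrics(deploys: List[Deployment], window_days: int) -> dict:
    """Compute the four metrics over a window of `window_days` days."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    lead_times = [(d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys]
    failures = [d for d in deploys if d.failed]
    recoveries = [
        (d.restored_at - d.deployed_at).total_seconds() / 60
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "lead_time_hours": mean(lead_times),
        "deploys_per_day": len(deploys) / window_days,
        "mttr_minutes": mean(recoveries) if recoveries else 0.0,
        "change_failure_pct": 100 * len(failures) / len(deploys),
    }


t0 = datetime(2020, 5, 1, 9, 0)
h = timedelta(hours=1)
sample = [
    Deployment(committed_at=t0, deployed_at=t0 + 2 * h),
    Deployment(committed_at=t0 + 3 * h, deployed_at=t0 + 7 * h),
    Deployment(committed_at=t0 + 24 * h, deployed_at=t0 + 25 * h, failed=True,
               restored_at=t0 + 25 * h + timedelta(minutes=30)),
    Deployment(committed_at=t0 + 26 * h, deployed_at=t0 + 31 * h),
]
print(four_key_metrics(sample, window_days=2))
```

Scanning a summary like this each day is one way to build the sense of what “normal” looks like that the takeaways recommend.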
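
The observability-driven development takeaway, answering “how will I know when this isn’t working?” before merging, can be sketched with a small instrumentation helper. This is not the Honeycomb Beeline API: the `span` context manager, the `events` list used as a sink, and the `apply_discount` function with its field names are hypothetical stand-ins built only on the Python standard library, to show the shape of one wide event carrying business-specific metadata.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def span(name, sink, **fields):
    """Hypothetical helper: emit one wide JSON event per unit of work."""
    event = {"name": name, **fields}
    start = time.perf_counter()
    try:
        yield event  # the caller can attach more fields while working
    except Exception as exc:
        event["error"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        sink(json.dumps(event))


events = []  # stand-in for a real event pipeline


def apply_discount(user_id, cart_total_cents, coupon):
    """Business logic instrumented with the fields needed to debug it later."""
    with span("apply_discount", events.append, user_id=user_id, coupon=coupon) as ev:
        discount = cart_total_cents // 10 if coupon == "SPRING10" else 0
        ev["cart_total_cents"] = cart_total_cents
        ev["discount_applied"] = discount > 0
        return cart_total_cents - discount


apply_discount("user-42", 8000, "SPRING10")
print(events[0])
```

Because the user ID, coupon, and outcome travel with every event, a question such as “which users saw no discount for a valid coupon?” could be asked later without shipping new code.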
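
The idea behind comparing a subset of data against the rest, as described for BubbleUp, can be illustrated with a naive frequency comparison. This is explicitly not Honeycomb’s algorithm: the `attribute_differences` function, the `min_gap` cutoff, and the sample events are assumptions made for this sketch. It reports attribute values that are noticeably more or less common in the selected events than in the baseline.

```python
from collections import Counter


def attribute_differences(selected, baseline, min_gap=0.2):
    """Return (attribute, value, gap) where the share of `value` differs
    between the selected events and the baseline by at least `min_gap`."""
    if not selected or not baseline:
        raise ValueError("both groups need at least one event")
    findings = []
    keys = {k for e in selected + baseline for k in e}
    for key in sorted(keys):
        sel = Counter(e.get(key) for e in selected)
        base = Counter(e.get(key) for e in baseline)
        for value in set(sel) | set(base):
            gap = sel[value] / len(selected) - base[value] / len(baseline)
            if abs(gap) >= min_gap:
                findings.append((key, value, round(gap, 2)))
    # Largest differences first; ties are ordered by name for stable output.
    return sorted(findings, key=lambda f: (-abs(f[2]), f[0], str(f[1])))


slow = [  # hypothetical events selected because they were slow
    {"endpoint": "/export", "build": "v2", "region": "us"},
    {"endpoint": "/export", "build": "v2", "region": "eu"},
    {"endpoint": "/export", "build": "v1", "region": "us"},
    {"endpoint": "/home", "build": "v2", "region": "eu"},
]
rest = [  # hypothetical events from the same query that were not slow
    {"endpoint": "/home", "build": "v1", "region": "us"},
    {"endpoint": "/home", "build": "v2", "region": "eu"},
    {"endpoint": "/export", "build": "v1", "region": "us"},
    {"endpoint": "/home", "build": "v1", "region": "eu"},
]
for finding in attribute_differences(slow, rest):
    print(finding)
```

In this sample the endpoint and build attributes stand out while region does not, which suggests where an engineer might look first.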