DevOps Notes: Widespread Telemetry for Problem-Solving


Photo by Stephen Dawson on Unsplash

Ryan   |   Sept. 19, 2023,  Last edited Sept. 19, 2023   |   Johnson City, TN, US

     Chapter 14, “Create Telemetry to Enable Seeing and Solving Problems,” of Gene Kim, Jez Humble, Patrick Debois, and John Willis’s The DevOps Handbook is concerned with leveraging telemetry across the full application stack to detect or predict system problems early and with using scientific problem-solving to address them. The authors spend the chapter highlighting the advantages of effective telemetry, identifying and classifying different metrics, and discussing the processes needed to collect and add telemetry on a continuous basis. They also stress the need to make telemetry information understandable and easily accessible.

 

Telemetry for Better and Faster Diagnosis and Fixing

 

     In making their case for the wide-scale use of telemetry, the authors review evidence of substantial improvements to Mean Time To Repair (MTTR). They identify the top technical practices responsible for fast MTTR as version control in Operations and the use of telemetry and proactive monitoring in production environments, and they use this evidence to justify the assertion that the best-performing organizations are “much better at diagnosing and fixing service incidents.” Next, the authors identify the areas of the application stack where telemetry ought to be tracked: application features, application health, the database, the operating system, storage, networking, and security. They advocate for sufficient telemetry from both Development and Operations, stating that if a feature is important enough for an engineer to implement, it deserves enough production telemetry to confirm that it operates correctly and contributes to desired outcomes.
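To make the stack-wide coverage concrete, here is a minimal sketch of structured telemetry events tagged by stack layer. The `emit_event` helper, the layer names as field values, and the sample events are all illustrative assumptions, not from the book; the idea is simply that every layer emits events in one parseable format that downstream collectors can consume.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("telemetry")

def emit_event(layer, event, **fields):
    """Log one structured telemetry event tagged with its stack layer."""
    entry = {"layer": layer, "event": event, **fields}
    line = json.dumps(entry)
    logger.info(line)
    return line

# Events from different layers of the stack, echoing the areas the
# authors list (application, database, security, ...):
emit_event("application", "feature_used", feature="checkout", user_id=42)
emit_event("database", "query_slow", table="orders", duration_ms=830)
emit_event("security", "login_denied", username="alice", source_ip="10.0.0.5")
```

Because every event carries its layer, one log pipeline can serve application features, health checks, and security auditing alike.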

Metrics and Implementation

 

      After having explored the practices that reduce MTTR and enable effective diagnosis and resolution of service problems, the authors expand on the metrics that should be collected, suggesting that all potentially significant application events should generate logging entries. They give numerous examples of significant or potentially significant events, such as delays, resource consumption, data changes (as in CRUD operations), authentication/authorization decisions, and system and application changes. The authors state that telemetry enables the application of the scientific method: formulating hypotheses about what is causing a particular problem and what is required to solve it. For this approach to work, they suggest that Operations needs to create the infrastructure and libraries necessary for developers to add new metrics with ease. Ideally, a single line of code should be enough for a developer to create a new metric that can be widely accessed.
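The one-line-of-code bar can be sketched with a tiny, hypothetical metrics library in the style of StatsD's fire-and-forget counters. The `Metrics` class and metric name below are assumptions for illustration; real libraries typically ship each increment over UDP to a collector rather than counting in-process.

```python
from collections import defaultdict

class Metrics:
    """Toy metrics client: counters are created on first use,
    so no registration step stands between a developer and a new metric."""

    def __init__(self):
        self._counters = defaultdict(int)

    def increment(self, name, value=1):
        """Bump a counter, creating it if it does not exist yet."""
        self._counters[name] += value

    def get(self, name):
        """Read a counter's current value (0 if never incremented)."""
        return self._counters[name]

metrics = Metrics()

# A developer adds a brand-new metric with one line in the feature code:
metrics.increment("checkout.payment.succeeded")
```

The design choice that matters is create-on-first-use: because no schema change or registration is required, adding telemetry costs a developer almost nothing, which is exactly what the authors argue Operations should enable.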

 

     Once new metrics are being integrated and charted as part of daily work, the next step in using telemetry to optimize DevOps is to radiate metric information to the rest of the organization, so that everyone who wants access to the telemetry can get it easily. Metrics such as the count of automated tests, velocity, incidents, and continuous integration status are a few that may be useful for the entire organization to track. Beyond the considerable reduction in MTTR that effective telemetry and prolific radiation of telemetry information enable, the authors also advocate for leveraging telemetry to give customers transparency in the interest of establishing trust.
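An information radiator can be as simple as rendering organization-wide metrics in a plain-text format that any team, wall monitor, or status page can poll. The function name, metric names, and values below are assumptions for illustration; the format loosely mirrors the one-metric-per-line style of Prometheus exposition text.

```python
def render_radiator(metrics):
    """Format a dict of org-level metrics as sorted 'name value' lines."""
    return "\n".join(f"{name} {value}" for name, value in sorted(metrics.items()))

# Examples of org-wide metrics the chapter suggests radiating:
org_metrics = {
    "automated_test_count": 1842,
    "open_incidents": 3,
    "ci_build_passing": 1,   # 1 = green, 0 = red
}
print(render_radiator(org_metrics))
```

Serving this text from a well-known endpoint makes the telemetry pull-based: anyone who wants the numbers fetches them, with no gatekeeping between teams and the data.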

 

Finding and Filling Telemetry Gaps

 

     Having established the need for information radiation to the organization and its customers, the authors identify ways to “find and fill” telemetry gaps. They identify key metrics for each organizational level: the business level, the infrastructure level, the client software, and the deployment pipeline. This is all in the interest of detecting and correcting problems before they grow, so that fewer customers are impacted and the impacts are less severe. They offer a few tools—ZooKeeper, etcd, and Consul—that can dynamically discover links between services and their infrastructure, generating new metrics to fill areas of missing telemetry. Lastly, the authors suggest overlaying important events, such as production deployments, on metric charts to monitor how services are affected during those events.
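The overlay idea can be sketched as keeping a log of deployment events and pulling the metric samples that fall in a window around each one, which is what a charting tool does when it draws a vertical deploy marker. The data, the `samples_near` helper, and the 15-minute window are all illustrative assumptions, not from the book.

```python
from datetime import datetime, timedelta

# A log of production deployment events (hypothetical data):
deployments = [
    {"service": "orders", "at": datetime(2023, 9, 19, 14, 0)},
]

# Error-rate samples for the same service (hypothetical data):
error_samples = [
    {"at": datetime(2023, 9, 19, 13, 55), "errors_per_min": 2},
    {"at": datetime(2023, 9, 19, 14, 5), "errors_per_min": 40},
]

def samples_near(deploy, samples, window=timedelta(minutes=15)):
    """Return metric samples within a time window around a deployment event."""
    return [s for s in samples if abs(s["at"] - deploy["at"]) <= window]

# Reviewing error rates just before and after the deploy makes a
# deployment-induced regression easy to spot on a chart:
for s in samples_near(deployments[0], error_samples):
    print(s["at"].time(), s["errors_per_min"])
```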