DevOps Notes: Continual Learning and Experimentation in the Tech Industry

Ryan   |   Sept. 19, 2023,  Last edited Sept. 19, 2023   |   Johnson City, TN, US

     Chapter 4 of The DevOps Handbook by Gene Kim, Jez Humble, Patrick Debois, and John Willis, “The Third Way: The Principles of Continual Learning and Experimentation,” is concerned with promoting a work culture dedicated to improvement through continual post-mortem investigation, knowledge-sharing, building resilience into systems, and pursuing goals by scientific means. The chapter looks at different types of work culture and assesses their effectiveness in the technology value stream, with special attention to what may be called generative safety culture. The authors also advocate formalizing improvement in daily work on the basis of a scientific process, and they offer recommendations for building resilient cultures in which leadership enables the structured pursuit of goals through hypothesis-driven, tested iterations.

 

Work Cultures and their Viability

 

     The authors observe a trend in manufacturing: systemic problems with quality and safety typically emerge from organizations with rigidly defined and enforced rules for work. They explain that cultures with a general apathy toward improvement and learning are often the same cultures with high fear and low trust, where managers seek to blame and punish. As an alternative to such a fearful and stagnant culture, the authors describe a “safety culture” that facilitates organizational learning to mitigate risk.



The Problem of Error in Technology Value Streams



     A technology value stream is immensely complex, so it is not possible to predict every outcome of the actions taken within it. Unexpected and sometimes catastrophic outcomes will result. Some management teams tend to respond to failures by identifying and blaming individuals and by adding more processes and approvals. The authors warn that this makes organizations more bureaucratic, not more careful. Specifically, they warn that such a response is counterproductive in the technology sector because it can inadvertently discourage people from reporting problems in the first place.



Generative Safety Culture: An Optimal Culture for a Technical Industry



     As a response to the nature of error in a high-complexity value stream, the authors advocate for a generative culture over two other classifications of organizational culture: pathological and bureaucratic. A pathological culture is characterized by high fear and frequent threats; information is hoarded, protected, or even distorted for political reasons and self-preservation, which causes failures to be hidden. A bureaucratic culture is characterized by rules, processes, and siloed departments; failure is processed through an evaluation system that may choose to punish the individuals responsible. In sharp contrast to both, a generative culture is characterized by people actively seeking and sharing information, sharing responsibilities across work centers, and responding to failure with genuine inquiry and reflection rather than by placing blame on individuals. The fundamental process by which generative cultures improve is frequently conducting post-mortems on incidents and determining countermeasures, which feed into the organization’s knowledge.

 

Institutionalizing Improvement Daily

 

     The authors suggest that processes in technology organizations degrade over time because of technical debt, that is, the maintenance associated with suboptimal or problematic components of systems. The solution, then, is to institutionalize the practice of paying down technical debt on a daily basis.



Resilience Patterns in Daily Work



     The authors suggest that we introduce tension into our systems. We should be working daily on goals such as reducing deployment lead time, increasing test coverage, decreasing test execution time, and, at times, redesigning architecture. Furthermore, they argue that it is beneficial to periodically rehearse large-scale failures by turning off entire data centers or, like Netflix’s Chaos Monkey, by injecting faults into production systems. These practices, in turn, make the organization and its systems more resilient in responding to failures and anomalies.
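
     To make the idea of fault injection concrete, here is a minimal sketch in Python. It is not Chaos Monkey and does not reflect how Netflix implements it; every name in it (with_fault_injection, call_with_fallback, run_drill, and the recommendation functions) is hypothetical and invented for illustration. It assumes a service call that can be wrapped so that a chosen fraction of calls fail on purpose, and a fallback that should keep requests served while the rehearsal runs.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real call to simulate a failing dependency."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail on purpose (hypothetical helper)."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_fallback(primary, fallback, *args):
    """Resilience pattern under test: degrade gracefully when the primary fails."""
    try:
        return primary(*args)
    except InjectedFault:
        return fallback(*args)


def fetch_recommendations(user_id):
    # Stand-in for a call to a real downstream service (assumption).
    return [f"personalized-item-for-{user_id}"]


def default_recommendations(user_id):
    # Static fallback that needs no downstream service (assumption).
    return ["popular-item"]


def run_drill(calls=1000, failure_rate=0.2, seed=7):
    """Rehearse failures and count how many requests the fallback served."""
    rng = random.Random(seed)
    flaky = with_fault_injection(fetch_recommendations, failure_rate, rng)
    degraded = 0
    for user_id in range(calls):
        result = call_with_fallback(flaky, default_recommendations, user_id)
        if result == ["popular-item"]:
            degraded += 1
    return degraded
```

     Calling run_drill() returns the number of requests that were answered by the fallback instead of failing outright. In a real rehearsal the interesting question is the same one: did every request still get an answer while faults were being injected?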

 

Knowledge-Sharing

 

     In technology organizations, individual knowledge is highly tacit and ought to be codified as far as possible for the benefit of the entire organization. To improve organizational knowledge-sharing, post-mortem reports ought to be globally searchable, and code repositories embodying the best collective knowledge of the organization should also be globally available.
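
     As a rough illustration of what “globally searchable” post-mortems could look like at the smallest scale, the sketch below builds a keyword index over reports in Python. The PostMortem fields and the PostMortemIndex class are assumptions made for this example, not something the book prescribes; a real organization would more likely rely on a wiki or a search service.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PostMortem:
    # Field names are illustrative assumptions, not a standard schema.
    report_id: str
    title: str
    summary: str
    countermeasures: str


class PostMortemIndex:
    """Tiny inverted index: each word maps to the reports that contain it."""

    def __init__(self):
        self._reports = {}
        self._index = defaultdict(set)

    @staticmethod
    def _tokens(text):
        # Lowercase the text and split it into alphanumeric words.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, report):
        self._reports[report.report_id] = report
        text = " ".join([report.title, report.summary, report.countermeasures])
        for token in self._tokens(text):
            self._index[token].add(report.report_id)

    def search(self, query):
        """Return sorted ids of reports that contain every word in the query."""
        tokens = self._tokens(query)
        if not tokens:
            return []
        matches = None
        for token in tokens:
            ids = self._index.get(token, set())
            matches = ids if matches is None else matches & ids
        return sorted(matches)
```

     Anyone in the organization could then add a report once and let others find it later with a call such as index.search("database failover"), which returns only the reports that mention both words.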

 

Leadership’s Responsibility in Reinforcing Learning Culture

 

     The objective of leadership, the authors state, is to create the conditions for teams to discover greatness. Leaders ought to set big-picture “True North” goals for their teams, such as “sustain zero accidents” or “double throughput in a year.” Then they should require their teams to use the scientific method to pursue iterative short-term goals that make progress toward the big-picture goals. Teams should state the problem, develop a hypothesis about how a countermeasure will solve it, test the hypothesis, interpret the results, and feed what they learn into the next iteration. This is the structure, the authors state, that leaders ought to facilitate for all internal improvement processes.
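
     The iteration loop described above can be sketched as a small record-keeping structure in Python. This is only one possible reading of the process: the Experiment and ImprovementLog names are hypothetical, and the sketch assumes a single numeric metric where lower is better (for example, deployment lead time in minutes).

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    problem: str        # the problem being stated
    hypothesis: str     # how the countermeasure is expected to help
    predicted: float    # metric value the hypothesis predicts (lower is better)
    observed: float     # metric value measured after testing the countermeasure


@dataclass
class ImprovementLog:
    target: float       # long-term goal for the metric (assumed lower is better)
    baseline: float     # metric value before any iteration
    history: list = field(default_factory=list)

    def record(self, experiment):
        """Interpret one iteration and keep it as input for the next one."""
        self.history.append(experiment)
        supported = experiment.observed <= experiment.predicted
        return "hypothesis supported" if supported else "hypothesis refuted"

    def current(self):
        # The latest observation, or the baseline if nothing was tested yet.
        return self.history[-1].observed if self.history else self.baseline

    def reached_target(self):
        return self.current() <= self.target
```

     A team might start with ImprovementLog(target=30.0, baseline=120.0), record one Experiment per iteration, and use the returned interpretation and current() to decide what to try next. A refuted hypothesis is still recorded, since it is also knowledge that feeds the following iteration.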

 

Conclusion

 

     Among the different work cultures and organizational practices, generative safety culture, along with the right knowledge-sharing and learning processes, is highly viable for the technology industry. For a culture to make real improvements, management should not place blame on individuals but should focus on implementing countermeasures to problems and failures and on testing their results. Ideally, a technology organization ought to conduct frequent post-mortems and make those reports easily available and searchable throughout the organization. Along with post-mortem reports, internal code libraries showcasing collective organizational knowledge should be accessible throughout the organization. Furthermore, technology value streams should make improvement a regular part of daily work, introducing tension to build resilience and applying scientific methodology to iteratively chip away at improvement goals.