Back in 2023, I wrote “On-Call is an opportunity” on my personal blog, describing the motivation for On-Call Optimizer: to create a product that enables teams to develop an on-call culture that delivers both personal growth and improved team performance and velocity.
This post summarizes key insights that underpin the development and direction of the On-Call Optimizer product in achieving that goal.
Incidents are inevitable and present opportunities for growth
Complex software systems operate in constantly degraded modes and require human operators to maintain acceptable performance. Even with excellent engineering, surprises will occur at inconvenient times, necessitating a response.
Providing this type of on-call response can be stressful, which leads many teams to avoid the responsibility or attempt to transfer it elsewhere. In doing so, they deprive themselves of the lessons and experience that come from responding to unexpected situations.
Teams need to realize that, when managed sustainably, on-call represents an opportunity to grow and develop. Rather than attempting to avoid or eliminate the on-call role, teams should aim to develop practices and a culture that provide a stable, sustainable environment for on-call response, so that the opportunity for growth can be realized.
Creating a sustainable on-call environment
On-call can, and should, be an activity that all software engineers participate in and find fulfilling. Creating the environment to achieve that is not impossible, but it does take well-developed team practices. Without these practices, or with poorly managed on-call logistics, the on-call experience can be miserable and harmful. These outcomes are not inevitable!
Effective on-call practices require consideration of how on-call impacts all aspects of the team; it cannot be treated as merely an additional responsibility. This integration typically begins with incident management processes that provide guardrails against harm while enabling learning, but it needs to extend to scheduling, monitoring, alerting, training, knowledge distribution, and system improvement capabilities. Underlying everything must be a culture that values learning and supports those in the hot seat when dealing with incidents.
Growth through iteration
The challenge for most teams in transforming their approach to on-call is balancing the investment required to establish and sustain the necessary practices against all the other demands on their time.
A practical approach is to start with small changes, committing to iterative experiments and improvements towards the eventual destination of sustainable and fulfilling on-call.
Culture is defined by actions rather than by words. By steadily developing new actions and practices over time, teams can achieve remarkable changes in their on-call culture and sustainability from very small initial investments.
How On-Call Optimizer helps
On-Call Optimizer’s role is to lower the cost to the team of taking those initial incremental steps.
On-Call Optimizer delivers on-call practices, such as flexible scheduling, that have been tried and proven effective in teams large and small throughout my career at Google and in my work with other companies. This lowers the barrier to entry and makes it quick and easy for your team to take the first step towards iteratively improving your on-call practices and culture today.