Part of the Multi-team Software Delivery Assessment (README)
Copyright © 2018-2021 Conflux Digital Ltd
Licenced under CC BY-SA 4.0
Permalink: SoftwareDeliveryAssessment.com
Based on selected criteria from the following books:
- Site Reliability Engineering by Betsy Beyer, Chris Jones, Jennifer Petoff, & Niall Murphy
- The Site Reliability Workbook edited by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, & Stephen Thorne
- Team Guide to Software Operability by Matthew Skelton, Alex Moore, & Rob Thatcher
Definition of on-call: for this assessment, "on-call" means being available and responsible for diagnosing and fixing (through workarounds or updated code) any problems in the live/production systems that relate to software that you and your team create and evolve. You might be available during working hours or outside of working hours.
NOTE: The subject of on-call is very emotive and there is significant context and nuance behind the assessment criteria here. We recommend that you read at least these two articles:
Try to understand the social context in which the criteria for Tired and Inspired would make sense. At one extreme, paying people 3x or 4x normal salary to be on-call could incentivize more bugs reaching the live systems (because the more problems that occur in live, the more money they get paid for being on-call); conversely, having on-call open only to those people with compatible home lives could exclude many people with home care responsibilities, depriving them of valuable experience.
Purpose: Assess the approach to on-call support within the software system.
Method: Use the Spotify Squad Health Check approach to assess the team's answers to the following questions, and also capture the answers:
Question | Tired (1) | Inspired (5) |
---|---|---|
1. Purpose of on-call - How would you define "on-call"? | On-call is a way to get developers to fix problems that people in Support or Live Services don't know how to fix. | On-call is a sensing mechanism to help teams build better software. |
2. Benefits of on-call - What are some ways in which the software benefits by having developers on-call? | Bugs are fixed quickly. | The needs of all kinds of users can be better understood by having team members on-call. We can better empathise with primary/secondary/tertiary users by seeing the problems for ourselves. |
3. Reward - How are you rewarded for being on-call out of working hours? | Significant compensation/money - 3x or 4x normal salary - plus additional time off. The more bugs in the software that reach live/production, the more money we make. | We are recognized for our increasing skills as engineers: experience from on-call counts towards our performance reviews. We may also get some time off to recover from out-of-hours on-call and/or some additional money for out-of-hours on-call. Overall, on-call feels valuable for our careers. |
4. On-call UX - What is the User Experience (UX) / Developer Experience (DevEx) of being on-call at the moment? | It is painful and slow to diagnose problems. | The tools and access rights make diagnosis exciting and an opportunity to learn. |
5. Learning from on-call - What happens to knowledge gained during on-call? How is the software improved based on on-call experiences? | Little time is allocated to fix problems after they are discovered when on-call. | Learning from on-call is used to prioritise key aspects of the team's work. |
6. Attitude to on-call - Under what circumstances would on-call not be a burden? | On-call would not be a burden if we never had to do it. | On-call is not a burden - it's a privilege to be able to learn how the software actually works. |
7. Future on-call - What would be needed for this team/squad to be happy to be on-call? | We would want significant additional money/compensation. | We would want a great UX/DevEx and opportunity to learn when on-call. |
8. Tooling for on-call - What tooling or process is missing, ineffective, or insufficient at the moment in relation to on-call? | All aspects of the on-call experience are ineffective. | Only very small things feel like a problem. |
9. Improving on-call - How much time do you spend as a team improving the on-call experience? How often do you work on improvements to on-call? | We don't have time to improve the on-call experience. | We make improvements and tweaks to on-call every week - it's continuous and part of our remit. |
10. Flexibility of on-call - How flexible is the on-call rota or schedule? In what ways does the schedule meet the different needs of team members? | Everyone must follow the same on-call schedule, including out-of-hours work. | Team members have flexibility to fit on-call work around their personal commitments and/or can opt to do on-call work solely during working hours (office hours). |
11. Accessibility of on-call - How accessible is on-call? Specifically, what proportion of your team members are actually on-call regularly? | Only one or two people from our team typically go on-call. Other people find on-call too difficult or confusing. | Everyone on our team takes part regularly in the on-call rota, whether during office hours or out-of-hours. We all share our on-call experiences and learning. |
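
One possible way to capture the answers is sketched below. It assumes each team member gives a score from 1 (Tired) to 5 (Inspired) for each question, and it reports the lowest, median, and highest score per question so that disagreement within the team stays visible. The function name `summarise`, the data layout, and the choice of median are illustrative assumptions, not part of the assessment itself.

```python
from statistics import median


def summarise(scores):
    """Summarise per-question scores.

    `scores` maps a question number to a list of scores, one per team
    member, on the 1 (Tired) to 5 (Inspired) scale (assumed layout).
    Returns {question: (lowest, median, highest)}.
    """
    summary = {}
    for question, votes in scores.items():
        if not votes:
            raise ValueError(f"question {question} has no scores")
        for vote in votes:
            if not 1 <= vote <= 5:
                raise ValueError(f"question {question}: score {vote} is outside 1-5")
        summary[question] = (min(votes), median(votes), max(votes))
    return summary


if __name__ == "__main__":
    # Hypothetical scores from a four-person team for questions 1 and 4.
    example = {1: [2, 3, 3, 5], 4: [1, 1, 2, 2]}
    for question, (lowest, middle, highest) in summarise(example).items():
        print(f"Q{question}: lowest={lowest} median={middle} highest={highest}")
```

A wide gap between the lowest and highest score on a question may be worth discussing in its own right, since it suggests team members experience on-call very differently.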