1 private link
At best, making oncall the exclusive responsibility of an elite SRE class increases our tolerance for complexity.
See Simplicity.
Oncall is a form of toil – it needs to be done but it doesn’t leave our systems in a better state.
Stakeholders see high-profile incident response/oncall happening, and don’t demand clarity on what other work the group is undertaking.
To go further – incident command and management is a specific set of skills that you can definitely be good at, and where the business really, really needs a consistent and competent response, every time. At Twilio, we have a specific team that manages all incidents, follow-up actions, and operational insights around incidents company-wide. We’ve found that making sure that the data and insights around incidents and their followup flows back into the business is a full-time job. Relying on a rotation of variably interested volunteers to ensure this happens will get you mixed results.
It will be useful to have Chapter 11 (Being On-Call) from Google's SRE book available (it's one in a series).