1 private link
What is the origin of the word "mainframe", referring to a large, complex computer? Most sources agree that the term is related to the frame...
At best, making oncall the exclusive responsibility of an elite SRE class increases our tolerance for complexity.
See Simplicity.
Oncall is a form of toil – it needs to be done but it doesn’t leave our systems in a better state.
Stakeholders see high-profile incident response/oncall happening, and don’t demand clarity on what other work the group is undertaking.
To go further – incident command and management is a specific set of skills that you can definitely be good at, and where the business really, really needs a consistent and competent response, every time. At Twilio, we have a specific team that manages all incidents, follow-up actions, and operational insights around incidents company-wide. We’ve found that making sure that the data and insights around incidents and their followup flows back into the business is a full-time job. Relying on a rotation of variably interested volunteers to ensure this happens will get you mixed results.
It will be useful to have Chapter 11 (Being On-Call) from Google's SRE book available (it's one in a series).