At best, making oncall the exclusive responsibility of an elite SRE class increases our tolerance for complexity.
See Simplicity.
Oncall is a form of toil β it needs to be done but it doesnβt leave our systems in a better state.
Stakeholders see high-profile incident response/oncall happening, and donβt demand clarity on what other work the group is undertaking.
To go further β incident command and management is a specific set of skills that you can definitely be good at, and where the business really, really needs a consistent and competent response, every time. At Twilio, we have a specific team that manages all incidents, follow-up actions, and operational insights around incidents company-wide. Weβve found that making sure that the data and insights around incidents and their followup flows back into the business is a full-time job. Relying on a rotation of variably interested volunteers to ensure this happens will get you mixed results.
It will be useful to have Chapter 11 (Being On-Call) from Google's SRE book available (it's one in a series).