What I'd do differently.
Honest gaps from the timebox. Two are tactical next steps. The third is where this project points toward a platform.
Latency SLIs
I measured whether the service was available, not whether it was fast enough: a slow response that still returned 200 counted as good. Adding a latency SLI is the natural next step.
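A minimal sketch of the gap, not the pipeline's actual code. The `Request` record, the function names, and the 300 ms threshold are all assumptions for illustration. It shows how the same traffic can score well on availability and poorly once "good" also means "fast enough".

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are illustrative."""
    status: int        # HTTP status code
    latency_ms: float  # request duration in milliseconds


def availability_sli(requests: Sequence[Request]) -> float:
    """Share of requests that did not fail with a 5xx."""
    if not requests:
        raise ValueError("no requests in window")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 300.0) -> float:
    """Share of requests that succeeded AND finished within threshold_ms.

    The 300 ms default is an assumed threshold, not a measured one.
    """
    if not requests:
        raise ValueError("no requests in window")
    good = sum(
        1 for r in requests
        if r.status < 500 and r.latency_ms <= threshold_ms
    )
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 2500.0),  # slow-but-200: good for availability, bad for latency
    Request(500, 80.0),
    Request(200, 90.0),
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.5
```

The slow 200 is the only request the two SLIs disagree on, and that disagreement is the gap.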
Burn-rate alerting
Track how fast the error budget is being consumed and page on burn rate, not on individual errors. That would close the loop on the culture shift.
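A rough sketch of what paging on burn rate could look like. Burn rate here is the observed error rate divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO window. The two-window check and the 14.4 threshold follow the commonly cited multi-window pattern for a 30-day SLO; the function names and the example numbers are assumptions, not part of what I built.

```python
from typing import Tuple

# (bad_events, total_events) observed in one time window
Window = Tuple[int, int]


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the allowed error rate."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, no budget spent
    error_budget = 1.0 - slo_target
    return (bad / total) / error_budget


def should_page(
    long_window: Window,
    short_window: Window,
    slo_target: float,
    threshold: float = 14.4,  # assumed: fast-burn threshold for a 30-day SLO
) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, so a recovered incident stops paging.
    """
    long_burn = burn_rate(*long_window, slo_target)
    short_burn = burn_rate(*short_window, slo_target)
    return long_burn >= threshold and short_burn >= threshold


# Hypothetical numbers for a 99.9% SLO
print(should_page((20, 1000), (3, 100), slo_target=0.999))  # True
print(should_page((20, 1000), (0, 100), slo_target=0.999))  # False
```

In the second call the long window still looks bad, but the short window is clean, so nobody gets paged for an incident that has already ended.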
SLI definitions as code
Turn the one-off measurement system into a self-service reliability platform: any team declares an SLO the same way and gets the pipeline and dashboards for free. This is the platform-engineering version of what I built.
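One way "declares an SLO the same way" could look, as a sketch only. The `SLO` class, its field names, and the `checkout` service are hypothetical; a real platform would more likely load these declarations from versioned config files and feed them to the pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SLO:
    """Hypothetical declarative SLO spec; every field name is illustrative."""
    service: str
    sli: str                              # "availability" or "latency"
    target: float                         # e.g. 0.999
    window_days: int = 30
    threshold_ms: Optional[float] = None  # required for latency SLIs only

    def __post_init__(self) -> None:
        # Validate at declaration time so a bad spec fails in review,
        # not in the pipeline.
        if self.sli not in ("availability", "latency"):
            raise ValueError(f"unknown SLI type: {self.sli!r}")
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")
        if self.sli == "latency" and self.threshold_ms is None:
            raise ValueError("latency SLOs need threshold_ms")

    @property
    def error_budget(self) -> float:
        """Fraction of events allowed to be bad over the window."""
        return 1.0 - self.target

    @property
    def budget_minutes(self) -> float:
        """Error budget expressed as minutes of full outage in the window."""
        return self.window_days * 24 * 60 * self.error_budget


# Two teams declare SLOs with the same shape.
checkout_availability = SLO(service="checkout", sli="availability", target=0.999)
checkout_latency = SLO(
    service="checkout", sli="latency", target=0.99, threshold_ms=300.0
)
print(round(checkout_availability.budget_minutes, 1))
```

Because the declaration is plain data with validation, the pipeline and dashboards can be generated from it rather than hand-built per team.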
The pattern.
The pipeline is the artifact. The repeatable thing is spotting a reliability gap and closing it — with or without a mandate.
Platform Services Tech Lead; org-wide observability strategy.
Founded an SRE Working Group; became company-wide KPIs.
This system; the org's first factual reliability baseline.
Reliability as an on-call chore → a measured property of the system.