Caught something the service itself was mislabelling.
A 500-for-4xx mislabelling burned our budget for failures that weren't ours and hid a real client-facing problem — retroactive proof the black-box principle mattered.
Caught a service reporting client errors as server failures
Digging into an anomaly on the SLO dashboards, I found one service returning HTTP 500 for problems that were really 4xx-class bad-request issues. That mattered on two fronts: 5xx means we failed, 4xx means the client sent something invalid — so we were burning our error budget for failures that weren't ours, and the mislabelling hid a real client-facing problem behind a generic “server error.” Every self-reported service metric showed a legitimate 5xx spike. I dug into each failing case using the distributed tracing I'd set up — a wrapped OpenTelemetry library with trace-to-log correlation across our Go services — cross-referencing traces, logs and code paths to confirm it was malformed input rather than a real server fault. I re-classified those routes as a pragmatic fix and it drove the work to correct the status codes at source — so the metric measured what it was supposed to.
Shifted reliability culture
Teams stopped asking 'who's on-call?' and started asking 'is this worth burning budget for?' — reliability moved from an on-call behaviour to a property of the system with a shared vocabulary. Outlived my involvement.
First canonical source of truth
The org's first factual basis for whether the API and controllers were actually working — not an opinion, a number everyone referenced.
Visibility into the key user journeys
For the first time we could see, at minute granularity, whether the interactions that mattered to customers were actually working — surfaced issues that were invisible to the services' own metrics.
Durability without me
I left shortly after shipping. Pipeline has stayed solid; BigQuery data is queried and investigated frequently.