Optimised for reasoning, not purity.
Four trade-offs where I picked the choice that people could actually reason about and communicate, not the technically purest one.
Every trade-off below favours the choice non-engineers could hold in their heads, because a metric only shifts culture if sales, support, and executives can reason about it too. A technically purer number that nobody outside the team understands changes nothing.
Minute-bucketing as the unit
A 'good minute' = no 5xx responses AND no more than 5% of controllers unavailable. Not a request-success ratio.
Why: API traffic was spiky. A request-ratio SLI breaks when a window has zero requests. Minutes are stable, always defined, and easy to explain internally and to customers.
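A minimal sketch of the difference, assuming per-minute counts of total requests and 5xx responses are available. The function names and the sample data are hypothetical and only illustrate why a ratio is undefined for an idle window while a minute bucket is not.

```python
from typing import List, Optional, Tuple


def request_success_ratio(total: int, errors_5xx: int) -> Optional[float]:
    """Request-ratio SLI: returns None when the window saw no traffic (0/0)."""
    if total == 0:
        return None
    return (total - errors_5xx) / total


def api_minute_is_good(errors_5xx: int) -> bool:
    """Minute-bucket SLI for the API: good when the minute had no 5xx responses."""
    return errors_5xx == 0


# Hypothetical spiky traffic: (total requests, 5xx responses) per minute.
minutes: List[Tuple[int, int]] = [(120, 0), (0, 0), (3, 1), (0, 0)]

ratios = [request_success_ratio(total, errors) for total, errors in minutes]
good = [api_minute_is_good(errors) for _, errors in minutes]

# The share of good minutes is always defined, even with idle minutes.
good_fraction = sum(good) / len(good)
```

In this sample the ratio has no value for the two idle minutes, while every minute still gets a plain good or bad verdict.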
5% controller-unavailability floor
The controller SLI only fires if more than 5% of asset controllers are down in a minute.
Why: Rolling-update deploys briefly bring a small subset down by design. Without a floor we'd alert on planned rollout churn every deploy.
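The floor can be sketched as a single comparison. This assumes the number of unavailable controllers and the fleet size are known for each minute; the function name and the empty-fleet behaviour are assumptions, not part of the original design. Integer arithmetic avoids floating-point surprises at exactly 5%.

```python
FLOOR_PERCENT = 5  # from the text: more than 5% of asset controllers down


def controller_minute_is_bad(unavailable: int, total: int) -> bool:
    """Bad only when strictly more than 5% of controllers are unavailable."""
    if total == 0:
        # Assumption: an empty fleet has nothing that can be down.
        return False
    # Equivalent to unavailable / total > 0.05, without float rounding.
    return unavailable * 100 > FLOOR_PERCENT * total


# Hypothetical fleet of 200 controllers.
rolling_update = controller_minute_is_bad(unavailable=2, total=200)   # 1% down
at_the_floor = controller_minute_is_bad(unavailable=10, total=200)    # exactly 5%
real_outage = controller_minute_is_bad(unavailable=30, total=200)     # 15% down
```

A rolling update that takes 1% of the fleet down stays under the floor, so the minute is not counted as bad.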
Unified SLO = logical AND
A minute is only good if BOTH the API and the controllers are healthy. Never averaged.
Why: The conservative view. Averaging would let a healthy API mask controller problems (or vice versa). For a customer-facing promise, never overstate health.
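A small sketch of why AND is the conservative choice, using a hypothetical hour in which the API is healthy throughout and the controllers are unhealthy for half of it. All names and data here are illustrative.

```python
def unified_minute_is_good(api_good: bool, controllers_good: bool) -> bool:
    """A minute counts as good only if both signals are good."""
    return api_good and controllers_good


# Hypothetical hour: API good for all 60 minutes, controllers bad for the last 30.
api = [True] * 60
controllers = [True] * 30 + [False] * 30

unified_good = [unified_minute_is_good(a, c) for a, c in zip(api, controllers)]

# Logical AND: half the hour was bad for the customer.
unified_fraction = sum(unified_good) / len(unified_good)

# Averaging the two signals instead would hide part of the problem.
averaged_fraction = (sum(api) / len(api) + sum(controllers) / len(controllers)) / 2
```

Here the AND view reports 50% good minutes, while averaging would report 75% and overstate health.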
Calendar-month error budget
Month starts at 100%. Each bad minute burns it down. Resets on the 1st.
Why: Easy for non-technical stakeholders to reason about, aligns with contract and billing cadence, and gives sales a clean number to quote.
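The burn-down can be sketched as follows. The text does not state an SLO target, so the 99.9% used here is purely a hypothetical placeholder, as are the function names; the sketch also assumes every day has exactly 24 hours (UTC, no daylight-saving shifts).

```python
import calendar

# Hypothetical target, NOT from the original text.
ASSUMED_SLO_TARGET = 0.999


def minutes_in_month(year: int, month: int) -> int:
    """Total minutes in a calendar month, assuming 24-hour days."""
    days = calendar.monthrange(year, month)[1]
    return days * 24 * 60


def budget_remaining(year: int, month: int, bad_minutes: int,
                     slo_target: float = ASSUMED_SLO_TARGET) -> float:
    """Fraction of the month's error budget left: 1.0 on the 1st, 0.0 when spent."""
    allowed_bad = minutes_in_month(year, month) * (1 - slo_target)
    return max(0.0, 1 - bad_minutes / allowed_bad)


# April 2024 has 30 days, so 43,200 minutes in total.
total_minutes = minutes_in_month(2024, 4)
fresh_month = budget_remaining(2024, 4, bad_minutes=0)
overspent = budget_remaining(2024, 4, bad_minutes=100)
```

Because the budget is keyed to the calendar month, the calculation starts again from zero bad minutes on the 1st.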