03 · Design decisions

Optimised for reasoning, not purity.

Four trade-offs where I picked the choice that people could actually reason about and communicate, not the technically purest one.

The through-line

In every trade-off below, I picked the option non-engineers could hold in their heads, because a metric only shifts culture if sales, support, and executives can reason about it too. A technically purer number that nobody outside the team understands changes nothing.

Minute-bucketing as the unit

A 'good minute' = no 5xx responses AND no more than 5% of controllers unavailable. Not a request-success ratio.

Why: API traffic was spiky. A request-ratio SLI breaks when a window has zero requests. Minutes are stable, always defined, and easy to explain internally and to customers.
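
A minimal sketch of the contrast, for the API half only. The function names and the list-of-status-codes input are hypothetical, and treating a zero-request minute as good is my reading of "no 5xx", not something the pipeline is confirmed to do.

```python
from typing import Optional, Sequence


def is_5xx(status: int) -> bool:
    """True for HTTP server-error status codes."""
    return 500 <= status <= 599


def request_success_ratio(statuses: Sequence[int]) -> Optional[float]:
    """Request-ratio SLI: undefined (None) when the window has no requests."""
    if not statuses:
        return None  # 0 / 0: there is nothing to divide by
    ok = sum(1 for s in statuses if not is_5xx(s))
    return ok / len(statuses)


def api_minute_is_good(statuses: Sequence[int]) -> bool:
    """Minute-bucket SLI: good if the minute saw no 5xx. Always defined.

    Assumption: a minute with zero requests counts as good.
    """
    return not any(is_5xx(s) for s in statuses)


quiet_minute: list = []                 # spiky traffic: nothing arrived
busy_minute = [200, 200, 503, 201]      # one server error among four requests

print(request_success_ratio(quiet_minute), api_minute_is_good(quiet_minute))
print(request_success_ratio(busy_minute), api_minute_is_good(busy_minute))
```

The quiet minute has no ratio at all but still gets a clear good/bad answer, which is the property the minute bucket was chosen for.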

5% controller-unavailability floor

The controller SLI only fires if more than 5% of asset controllers are down in a minute.

Why: Rolling-update deploys briefly bring a small subset down by design. Without a floor we'd alert on planned rollout churn every deploy.
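
As a sketch of how the threshold behaves, assuming per-minute counts of unavailable and total controllers are available (the function name and inputs are hypothetical). Exactly 5% is treated as not firing, following the "more than 5%" wording; that boundary choice is an assumption.

```python
THRESHOLD_PERCENT = 5  # from the text: fires only above 5% unavailable


def controller_sli_fires(unavailable: int, total: int) -> bool:
    """Return True if more than 5% of controllers were down in this minute.

    Integer arithmetic avoids floating-point surprises at the boundary.
    """
    if total <= 0:
        raise ValueError("total must be a positive controller count")
    if not 0 <= unavailable <= total:
        raise ValueError("unavailable must be between 0 and total")
    return unavailable * 100 > total * THRESHOLD_PERCENT


# Hypothetical fleet of 200 controllers.
print(controller_sli_fires(4, 200))    # rolling update takes 2% down
print(controller_sli_fires(30, 200))   # 15% down: a real outage
```

With these illustrative numbers, the rollout minute stays quiet and the outage minute fires.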

Unified SLO = logical AND

A minute is only good if BOTH the API and the controllers are healthy. Never averaged.

Why: The conservative view. Averaging would let a healthy API mask controller problems (or vice versa). For a customer-facing promise, never overstate health.
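
To make the masking concrete, here is a small sketch with made-up data: ten minutes in which the API is healthy throughout and the controllers are unhealthy for four. The per-minute boolean lists and function names are hypothetical.

```python
from typing import Sequence


def unified_good_fraction(api: Sequence[bool], controllers: Sequence[bool]) -> float:
    """Logical AND: a minute counts only if both components were good."""
    if len(api) != len(controllers) or not api:
        raise ValueError("need two equal-length, non-empty minute series")
    good = sum(1 for a, c in zip(api, controllers) if a and c)
    return good / len(api)


def averaged_good_fraction(api: Sequence[bool], controllers: Sequence[bool]) -> float:
    """The rejected alternative: average the two components' good fractions."""
    if len(api) != len(controllers) or not api:
        raise ValueError("need two equal-length, non-empty minute series")
    return (sum(api) + sum(controllers)) / (2 * len(api))


api_minutes = [True] * 10                       # API healthy the whole time
controller_minutes = [True] * 6 + [False] * 4   # controllers bad for 4 minutes

print(unified_good_fraction(api_minutes, controller_minutes))
print(averaged_good_fraction(api_minutes, controller_minutes))
```

The AND reports 60% good minutes; the average reports 80%, because the healthy API hides half of the controller problem.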

Calendar-month error budget

Month starts at 100%. Each bad minute burns it down. Resets on the 1st.

Why: Easy for non-technical stakeholders to reason about, aligns with contract and billing cadence, and gives sales a clean number to quote.
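
The burn-down arithmetic can be sketched as below. The SLO target is not stated here, so `slo_target=0.999` is purely a placeholder; the function names are hypothetical, and the month is assumed to be counted in UTC so every day has 1,440 minutes.

```python
import calendar


def minutes_in_month(year: int, month: int) -> int:
    """Total minutes in a calendar month (assumes UTC, so no DST shifts)."""
    days = calendar.monthrange(year, month)[1]
    return days * 24 * 60


def budget_remaining(year: int, month: int, bad_minutes: int, slo_target: float) -> float:
    """Fraction of the month's error budget still unspent.

    1.0 means untouched (the 1st of the month); 0.0 means fully burned;
    a negative value means the budget is overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = minutes_in_month(year, month) * (1 - slo_target)
    return 1 - bad_minutes / allowed_bad


# Placeholder target of 99.9% over a 30-day month: about 43 bad minutes allowed.
print(minutes_in_month(2024, 4))
print(round(budget_remaining(2024, 4, bad_minutes=10, slo_target=0.999), 3))
```

Because the window is the calendar month rather than a rolling 30 days, the allowance shifts slightly with month length, and the counter returns to 100% on the 1st regardless of how the previous month ended.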