Define the terms, then show the pipeline.
Start with the service levels — that's where the impact is. Then the pipeline that measures them.
The Service Levels
What we promised and what we measure — SRE-book definitions, Fever's numbers.
SLA
“An explicit contract with users that includes consequences for missing the SLOs it contains.”
Fever's: 99% monthly availability of the Partner API + asset-control platform. Contractual, with commercial consequences attached. Downtime budget of roughly 7h 12m (30-day month) to 7h 26m (31-day month), excluding scheduled maintenance.
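The budget figures follow from simple arithmetic on month length. A minimal sketch, assuming the budget is 1% of all minutes in the month; the function names are illustrative, and the scheduled-maintenance exclusion is not modelled:

```python
def downtime_budget_minutes(days_in_month: int, availability_target: float) -> float:
    """Allowed downtime, in minutes, for a month of the given length."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1.0 - availability_target)


def format_hm(minutes: float) -> str:
    """Render a minute count as 'Xh Ym', truncating to whole minutes."""
    whole = int(minutes)
    return f"{whole // 60}h {whole % 60}m"


# 99% SLA over a 30-day and a 31-day month.
for days in (30, 31):
    budget = downtime_budget_minutes(days, 0.99)
    print(f"{days}-day month: {format_hm(budget)}")
```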
SLO
“A target value for a service level, measured by an SLI.”
Fever's: internal 99.9% target. A minute counts as good only if the API and the controllers are both good (logical AND). Deliberately tighter than the 99% SLA so the budget burns visibly before the contract is at risk. Budget of ~43 bad minutes per 30-day month.
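One way to picture the AND combination and the bad-minute budget. This is a sketch under assumptions: the per-minute boolean lists and all function names are hypothetical, and the 99.9% target is applied to every minute of the month:

```python
def combined_good_minutes(api_good: list[bool], controllers_good: list[bool]) -> list[bool]:
    """A minute is good only when BOTH signals are good (logical AND)."""
    if len(api_good) != len(controllers_good):
        raise ValueError("both signals must cover the same minutes")
    return [a and c for a, c in zip(api_good, controllers_good)]


def bad_minute_budget(total_minutes: int, slo_target: float) -> float:
    """Number of bad minutes the SLO tolerates over the window."""
    return total_minutes * (1.0 - slo_target)


def budget_remaining(good_minutes: list[bool], total_minutes: int, slo_target: float) -> float:
    """Budget left after subtracting the bad minutes observed so far."""
    bad = sum(1 for good in good_minutes if not good)
    return bad_minute_budget(total_minutes, slo_target) - bad


minutes_in_30_days = 30 * 24 * 60
print(round(bad_minute_budget(minutes_in_30_days, 0.999), 1))  # about 43 bad minutes
```

Because of the AND, a minute lost on either signal spends the same shared budget.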
SLI: Partner API
“A quantitative measure of one aspect of the service level provided.”
Fever's: fraction of minutes with zero 5xx observed at the ALB. Black-box — measured where the customer sees it, not what the service reports about itself.
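As a rough illustration of this SLI, assuming the ALB signal has already been reduced to one 5xx count per minute (the input shape and function name are hypothetical, not the production computation):

```python
def api_sli(five_xx_per_minute: list[int]) -> float:
    """Fraction of minutes in which the ALB observed zero 5xx responses."""
    if not five_xx_per_minute:
        raise ValueError("need at least one minute of data")
    good = sum(1 for count in five_xx_per_minute if count == 0)
    return good / len(five_xx_per_minute)


# Four minutes of data, one of them with 5xx responses.
print(api_sli([0, 0, 3, 0]))
```

A single 5xx marks the whole minute bad, so the measure is strict on purpose: it counts minutes, not requests.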
SLI: controllers
“A quantitative measure of one aspect of the service level provided.”
Fever's: fraction of minutes where at least 95% of expected controllers are running on Fargate. The 5% allowance below full strength accounts for rolling deploys.
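A sketch of the controller SLI, assuming each minute yields a (running, expected) pair of controller counts. The names and input shape are hypothetical; integer arithmetic keeps the 95% comparison exact:

```python
def controller_minute_is_good(running: int, expected: int, floor_percent: int = 95) -> bool:
    """True when at least floor_percent of expected controllers are running."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    # Compare as integers to avoid float rounding at the boundary.
    return running * 100 >= floor_percent * expected


def controller_sli(samples: list[tuple[int, int]], floor_percent: int = 95) -> float:
    """Fraction of minutes meeting the floor; samples are (running, expected) pairs."""
    if not samples:
        raise ValueError("need at least one minute of data")
    good = sum(1 for running, expected in samples
               if controller_minute_is_good(running, expected, floor_percent))
    return good / len(samples)


# With 20 expected controllers, 19 running still counts as good; 18 does not.
print(controller_sli([(20, 20), (19, 20), (18, 20), (20, 20)]))
```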
Trade-offs on each number — and why 99% SLA / 99.9% SLO / two SLIs — in presenter notes (press N). How each SLI is computed lives on the Decisions page.
Data Ingestion Pipeline
How a good minute actually gets counted — black-box signals in, queryable SQL out.
Why BigQuery + dbt instead of a metrics DB — replayable history, analyst-inspectable — in presenter notes (press N).
Rejected alternatives
Why not the off-the-shelf options.
Prometheus / metrics DB
No replayable history. If an SLI definition changes, you can't backfill — past months stay wrong forever.
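What replayable history buys can be sketched as follows, assuming raw per-minute rows are kept and the SLI definition is just a function applied to them. The row shape and the second definition are hypothetical, chosen only to illustrate a definition change:

```python
from typing import Callable, NamedTuple


class MinuteRow(NamedTuple):
    minute: str      # e.g. "2024-01-01T00:00" (illustrative timestamp format)
    requests: int
    five_xx: int


def replay(rows: list[MinuteRow], is_good: Callable[[MinuteRow], bool]) -> float:
    """Recompute the SLI over stored history under any definition."""
    if not rows:
        raise ValueError("need at least one row")
    return sum(1 for row in rows if is_good(row)) / len(rows)


def zero_5xx(row: MinuteRow) -> bool:
    """Definition from the SLI card: a good minute has no 5xx at all."""
    return row.five_xx == 0


def under_one_percent_5xx(row: MinuteRow) -> bool:
    """Hypothetical revised definition: 5xx rate below 1% of requests."""
    return row.requests == 0 or row.five_xx / row.requests < 0.01


history = [
    MinuteRow("2024-01-01T00:00", 1000, 0),
    MinuteRow("2024-01-01T00:01", 1000, 5),
    MinuteRow("2024-01-01T00:02", 1000, 50),
    MinuteRow("2024-01-01T00:03", 0, 0),
]
print(replay(history, zero_5xx), replay(history, under_one_percent_5xx))
```

Because the raw rows survive, past months can be restated under the new definition instead of staying frozen under the old one.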
Vendor SLO product (Grafana Cloud SLO, Nobl9, etc.)
White-box by default, per-seat cost at our scale, and lock-in on the definition layer we most wanted to own.
Synthetic probes
Wouldn't reflect real customer traffic — a synthetic 200 tells you nothing about the request that actually failed.