02 · Architecture

Define the terms, then show the pipeline.

Start with the service levels — that's where the impact is. Then the pipeline that measures them.

The Service Levels

What we promised and what we measure — SRE-book definitions, Fever's numbers.

SLA

99%

“An explicit contract with users that includes consequences for missing the SLOs it contains.”

Fever's: 99% monthly availability of the Partner API + asset-control platform. Contractual, with commercial consequences attached. Downtime budget: ~7h 12m in a 30-day month, ~7h 26m in a 31-day month (~6h 43m in a 28-day February), excluding scheduled maintenance.
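The budget figures are plain arithmetic on the month length. A minimal sketch, assuming the budget is 1% of the calendar minutes in the month and ignoring the scheduled-maintenance exclusion; `downtime_budget` and `fmt` are hypothetical helper names, not part of Fever's tooling:

```python
import calendar
from datetime import timedelta


def downtime_budget(year: int, month: int, target: float = 0.99) -> timedelta:
    """Allowed downtime for one calendar month at the given availability target."""
    days_in_month = calendar.monthrange(year, month)[1]
    total_seconds = days_in_month * 24 * 60 * 60
    # Round to whole seconds so float noise in (1 - target) does not leak out.
    return timedelta(seconds=round(total_seconds * (1 - target)))


def fmt(budget: timedelta) -> str:
    """Format a timedelta as 'Xh YYm', truncating leftover seconds."""
    minutes = int(budget.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


# 2025 is not a leap year, so February has 28 days.
for month in (2, 4, 1):
    print(calendar.month_name[month], fmt(downtime_budget(2025, month)))
```

This prints 6h 43m for February, 7h 12m for April (30 days), and 7h 26m for January (31 days).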

SLO

99.9%

“A target value for a service level, measured by an SLI.”

Fever's: internal target combining API + controllers via logical AND. Deliberately tighter than the 99% SLA so budget burns visibly before the contract is at risk. Budget: ~43 bad minutes per month (0.1% of the 43,200 minutes in a 30-day month).

SLI · API

no 5xx / min

“A quantitative measure of one aspect of the service level provided.”

Fever's: fraction of minutes with zero 5xx observed at the ALB. Black-box: measured where the customer sees it, not from what the service reports about itself.
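A minimal sketch of this SLI, assuming 5xx counts have already been aggregated per minute from ALB data and that the window has one entry for every minute. The function name, the input shape, and the sample values are illustrative assumptions, not the actual pipeline:

```python
from typing import Mapping


def api_sli(errors_per_minute: Mapping[str, int]) -> float:
    """Fraction of minutes in the window with zero 5xx responses.

    `errors_per_minute` maps a minute label to the 5xx count seen at the
    load balancer in that minute. Assumed to cover every minute in the window.
    """
    if not errors_per_minute:
        raise ValueError("window contains no minutes")
    good = sum(1 for count in errors_per_minute.values() if count == 0)
    return good / len(errors_per_minute)


# Illustrative data only.
window = {
    "2025-04-01T12:00": 0,
    "2025-04-01T12:01": 3,  # any 5xx makes the whole minute bad
    "2025-04-01T12:02": 0,
    "2025-04-01T12:03": 0,
}
print(api_sli(window))  # 0.75
```

Under this definition a minute with one 5xx and a minute with a thousand count the same: both are bad minutes.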

SLI · Controllers

≥95% up / min

“A quantitative measure of one aspect of the service level provided.”

Fever's: fraction of minutes where at least 95% of expected controllers are running on Fargate. The 5% headroom below 100% accounts for rolling deploys.
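A minimal sketch of the controller check, assuming a per-minute pair of (running, expected) task counts is available. The function names, the integer-percent threshold parameter, and the sample counts are illustrative assumptions:

```python
from typing import Iterable, Tuple


def controller_minute_ok(running: int, expected: int, floor_pct: int = 95) -> bool:
    """True if at least `floor_pct` percent of expected controllers are running."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    # Integer comparison avoids float rounding exactly at the 95% boundary.
    return running * 100 >= expected * floor_pct


def controller_sli(samples: Iterable[Tuple[int, int]], floor_pct: int = 95) -> float:
    """Fraction of minutes in which the controller fleet met the threshold."""
    results = [controller_minute_ok(r, e, floor_pct) for r, e in samples]
    if not results:
        raise ValueError("window contains no minutes")
    return sum(results) / len(results)


# Illustrative data only: (running, expected) for five consecutive minutes.
samples = [(20, 20), (19, 20), (18, 20), (20, 20), (20, 20)]
print(controller_sli(samples))  # 0.8
```

With 20 expected controllers, 19 running (95%) still passes, while 18 running (90%) makes the minute bad.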

Unified SLO rule: a minute is good only if both API and controller SLIs pass. Logical AND, not average — never overstate health.
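A small sketch of why AND is the conservative choice, assuming each SLI has already been reduced to one pass/fail flag per minute. The function name and the ten-minute sample are illustrative assumptions:

```python
from typing import List, Sequence


def unified_good_minutes(
    api_ok: Sequence[bool], controllers_ok: Sequence[bool]
) -> List[bool]:
    """A minute is good only if both SLIs pass in that same minute."""
    if len(api_ok) != len(controllers_ok):
        raise ValueError("both SLIs must cover the same minutes")
    return [a and c for a, c in zip(api_ok, controllers_ok)]


# Illustrative data only: ten minutes, each SLI fails in a different minute.
api_ok = [True] * 10
api_ok[2] = False
controllers_ok = [True] * 10
controllers_ok[7] = False

good = unified_good_minutes(api_ok, controllers_ok)
unified = sum(good) / len(good)
averaged = (sum(api_ok) / 10 + sum(controllers_ok) / 10) / 2

print(unified)   # 0.8
print(averaged)  # 0.9
```

Each SLI alone reads 90%, and so does their average, but two distinct minutes were bad for the customer. The AND rule reports 80% and charges both minutes against the budget.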

Trade-offs on each number — and why 99% SLA / 99.9% SLO / two SLIs — in presenter notes (press N). How each SLI is computed lives on the Decisions page.

Data Ingestion Pipeline

How a good minute actually gets counted — black-box signals in, queryable SQL out.

Black-box measurement pipeline


Why BigQuery + dbt instead of a metrics DB — replayable history, analyst-inspectable — in presenter notes (press N).

Rejected alternatives

Why not the off-the-shelf options.

Rejected

Prometheus / metrics DB

No replayable history. If an SLI definition changes, you can't backfill — past months stay wrong forever.

Rejected

Vendor SLO product (Grafana Cloud SLO, Nobl9, etc.)

White-box by default, per-seat cost at our scale, and lock-in on the definition layer we most wanted to own.

Rejected

Synthetic probes

Wouldn't reflect real customer traffic — a synthetic 200 tells you nothing about the request that actually failed.