02 · Architecture

Define the terms, then show the pipeline.

Start with the service levels — that's where the impact is. Then the pipeline that measures them.

The Service Levels

What we promised and what we measure — SRE-book definitions, Fever's numbers.

SLA
99%

An explicit contract with users that includes consequences for missing the SLOs it contains.

Fever's: 99% monthly availability of Partner API + asset-control platform. Contractual, commercial consequences attached. ~6h 43m–7h 26m downtime budget depending on month length (7h 12m in a 30-day month), excluding scheduled maintenance.

SLO
99.9%

A target value for a service level, measured by an SLI.

Fever's: internal target combining API + controllers via logical AND. Deliberately tighter than the 99% SLA so budget burns visibly before the contract is at risk. Budget of ~43 bad minutes per month.
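The SLA and SLO budgets above are plain arithmetic: minutes in the month times (1 - target). A minimal sketch, assuming calendar months and ignoring the scheduled-maintenance exclusion; the function names are illustrative, not part of Fever's pipeline.

```python
import calendar

def budget_minutes(target: float, year: int, month: int) -> float:
    """Allowed bad minutes in a calendar month for an availability target."""
    days = calendar.monthrange(year, month)[1]  # number of days in the month
    return days * 24 * 60 * (1.0 - target)

def fmt(minutes: float) -> str:
    """Render a minute count as hours and minutes, rounded to the minute."""
    whole = int(round(minutes))
    return f"{whole // 60}h {whole % 60:02d}m"

sla_30 = budget_minutes(0.99, 2025, 6)   # 30-day month, 99% SLA
sla_31 = budget_minutes(0.99, 2025, 7)   # 31-day month, 99% SLA
slo_30 = budget_minutes(0.999, 2025, 6)  # 30-day month, 99.9% SLO

print(fmt(sla_30), fmt(sla_31), round(slo_30))
```

The 10x gap between the two budgets is what lets the SLO burn visibly well before the SLA is in danger.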

SLI · API
no 5xx / min

A quantitative measure of one aspect of the service level provided.

Fever's: fraction of minutes with zero 5xx observed at the ALB. Black-box — measured where the customer sees it, not what the service reports about itself.
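As an illustration of the API SLI, the sketch below buckets requests by minute and marks a minute bad if it contains any 5xx. It assumes a simplified input of (ISO timestamp, status code) pairs and counts minutes without traffic as good, since they contain zero 5xx; the real computation lives in the pipeline, not in this code.

```python
from datetime import datetime

def minute_bucket(ts: str) -> str:
    """Truncate an ISO 8601 timestamp to its minute."""
    return datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H:%M")

def api_sli(requests, window_minutes: int) -> float:
    """Fraction of minutes in the window with zero 5xx responses."""
    bad = {minute_bucket(ts) for ts, status in requests if 500 <= status <= 599}
    return (window_minutes - len(bad)) / window_minutes

# Hypothetical sample: a 5-minute window with 5xx in two distinct minutes.
sample = [
    ("2025-06-01T00:00:05", 200),
    ("2025-06-01T00:00:40", 502),
    ("2025-06-01T00:01:10", 200),
    ("2025-06-01T00:03:02", 500),
    ("2025-06-01T00:03:59", 503),
]
sli = api_sli(sample, window_minutes=5)
```

Two 5xx in the same minute cost one bad minute, not two: the unit of failure is the minute, not the request.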

SLI · Controllers
≥95% up / min

A quantitative measure of one aspect of the service level provided.

Fever's: fraction of minutes where at least 95% of expected controllers are running on Fargate. The 5% tolerance below full strength accounts for rolling deploys.
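A sketch of the controller SLI under the same per-minute framing. It assumes one (running, expected) sample per minute, which is a simplification of however the Fargate counts are actually collected, and uses integer arithmetic so the 95% comparison is exact.

```python
def controller_minute_ok(running: int, expected: int, min_percent: int = 95) -> bool:
    """True if at least min_percent of expected controllers are running."""
    return running * 100 >= expected * min_percent

def controller_sli(samples, min_percent: int = 95) -> float:
    """Fraction of minutes that meet the controller floor."""
    good = sum(1 for running, expected in samples
               if controller_minute_ok(running, expected, min_percent))
    return good / len(samples)

# Hypothetical fleet of 20 controllers: 19/20 is exactly 95% and passes,
# 18/20 is 90% and fails.
samples = [(20, 20), (19, 20), (18, 20), (20, 20)]
ctrl_sli = controller_sli(samples)
```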

Unified SLO rule: a minute is good only if both API and controller SLIs pass. Logical AND, not average — never overstate health.
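To see why the rule is an AND rather than an average, the sketch below compares the two on the same hypothetical per-minute results. The lists and names are made up for illustration.

```python
def unified_good_minutes(api_ok, controllers_ok):
    """A minute is good only if both SLIs passed in that minute."""
    return [a and c for a, c in zip(api_ok, controllers_ok)]

# Hypothetical 4-minute window: each SLI fails in a different minute.
api_ok = [True, True, False, True]
controllers_ok = [True, False, True, True]

unified = unified_good_minutes(api_ok, controllers_ok)
and_sli = sum(unified) / len(unified)

# Averaging the two SLIs hides that the failures hit different minutes.
avg_sli = (sum(api_ok) / len(api_ok) + sum(controllers_ok) / len(controllers_ok)) / 2

print(and_sli, avg_sli)
```

Here the average reports 75% while only half of the minutes were actually good for the customer, which is the overstatement the AND rule avoids.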

Trade-offs on each number — and why 99% SLA / 99.9% SLO / two SLIs — in presenter notes (press N). How each SLI is computed lives on the Decisions page.

Data Ingestion Pipeline

How a good minute actually gets counted — black-box signals in, queryable SQL out.

Black-box measurement pipeline
Black-box SLI measurement pipeline: partner API access logs and Fargate controller counts converge through Lambda enrichment, an internal event bus, GCP PubSub, BigQuery raw tables, and layered dbt models into Grafana dashboards.

Lanes: API SLI · Controller SLI

Nodes: Partner API (ALB access logs) · Fargate (controller count) · Lambda (enrich + emit) · Topic (internal bus) · PubSub (GCP) · BigQuery (raw source tables) · dbt (SLI + SLO models) · Grafana (month + trend + orgs)
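One possible shape for the "enrich" step on the API lane, as a sketch only. It assumes the standard space-delimited ALB access log entry, where the second field is the timestamp and the ninth is the load balancer status code. The emitted event fields (`minute`, `status`, `is_5xx`) are hypothetical names, and publishing to the internal bus is left out.

```python
import shlex

# Zero-based positions in a standard ALB access log entry.
TIME_FIELD = 1
ELB_STATUS_FIELD = 8

def enrich(log_line: str) -> dict:
    """Turn one ALB access log entry into a per-minute event (hypothetical shape)."""
    fields = shlex.split(log_line)  # respects the quoted request and user-agent fields
    timestamp = fields[TIME_FIELD]
    status = int(fields[ELB_STATUS_FIELD])
    return {
        "minute": timestamp[:16],        # "YYYY-MM-DDTHH:MM"
        "status": status,
        "is_5xx": 500 <= status <= 599,
    }

# Shortened, made-up entry; real entries carry more trailing fields.
sample_line = (
    'https 2025-06-01T00:03:02.186641Z app/example-alb/0123456789abcdef '
    '203.0.113.10:2817 10.0.0.1:80 0.000 0.001 0.000 502 502 34 366 '
    '"GET https://api.example.com:443/v1/assets HTTP/1.1" "curl/8.0" - -'
)
event = enrich(sample_line)
```

Reading the status the load balancer returned, rather than anything the service logs about itself, is what keeps the measurement black-box.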

Why BigQuery + dbt instead of a metrics DB — replayable history, analyst-inspectable — in presenter notes (press N).
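What "replayable history" buys can be sketched as follows: if raw per-minute observations are kept, a revised SLI definition can be re-run over past data. The sketch uses Python for illustration only (the actual models are dbt over BigQuery), and the row fields and the 98% revision are hypothetical.

```python
# Hypothetical raw per-minute rows, as they might sit in a raw source table.
RAW_MINUTES = [
    {"minute": "2025-06-01T00:00", "count_5xx": 0, "running": 20, "expected": 20},
    {"minute": "2025-06-01T00:01", "count_5xx": 0, "running": 19, "expected": 20},
    {"minute": "2025-06-01T00:02", "count_5xx": 2, "running": 20, "expected": 20},
    {"minute": "2025-06-01T00:03", "count_5xx": 0, "running": 18, "expected": 20},
]

def definition_v1(row) -> bool:
    """Current rule: zero 5xx AND at least 95% of controllers running."""
    return row["count_5xx"] == 0 and row["running"] * 100 >= row["expected"] * 95

def definition_v2(row) -> bool:
    """Hypothetical revision: same rule with a 98% controller floor."""
    return row["count_5xx"] == 0 and row["running"] * 100 >= row["expected"] * 98

def replay(rows, definition) -> dict:
    """Recompute good/bad for every stored minute under a given definition."""
    return {row["minute"]: definition(row) for row in rows}

good_v1 = sum(replay(RAW_MINUTES, definition_v1).values())
good_v2 = sum(replay(RAW_MINUTES, definition_v2).values())
```

Because the definition is applied at query time to raw rows, changing it rewrites history consistently instead of leaving older months computed under the old rule.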

Rejected alternatives

Why not the off-the-shelf options.

Rejected

Prometheus / metrics DB

No replayable history. If an SLI definition changes, you can't backfill — past months stay wrong forever.

Rejected

Vendor SLO product (Grafana Cloud SLO, Nobl9, etc.)

White-box by default, per-seat cost at our scale, and lock-in on the definition layer we most wanted to own.

Rejected

Synthetic probes

Wouldn't reflect real customer traffic — a synthetic 200 tells you nothing about the request that actually failed.