
Reliability model

Reliability is the property we are most opinionated about. This page spells out the guarantees and the mechanisms behind them.

SLO targets

| Metric | Target |
| --- | --- |
| Platform uptime | 99.95% monthly |
| Median build start latency | < 30 seconds from PR push |
| Rollback time (click → live) | < 10 seconds |
| PR comment idempotency | 100% (no duplicate comments, ever) |
| Build determinism | 100% same-commit → same-artifact |

These targets are enforced internally and are part of Enterprise contracts.

Mechanisms

Atomic job claiming

Every job (build, screenshot, comparison, promote, rollback, cleanup) is claimed by exactly one worker at a time. We rely on BullMQ’s atomic Redis primitive rather than application-level locks, so two worker nodes running in parallel cannot both claim the same job.
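To make the claim semantics concrete, here is a minimal in-memory sketch. It is not BullMQ and not our queue code: the `ClaimTable` class and its method names are hypothetical, and the check-and-set is only atomic here because it runs inside a single JavaScript process. Across worker nodes, the same check-and-set has to happen atomically inside Redis, which is what the queue provides.

```typescript
// Hypothetical in-memory stand-in for a queue's claim step.
// All names here are illustrative assumptions, not part of BullMQ.
type JobId = string;
type WorkerId = string;

class ClaimTable {
  private owners = new Map<JobId, WorkerId>();

  // Returns true only for the first worker to claim a given job.
  tryClaim(jobId: JobId, workerId: WorkerId): boolean {
    if (this.owners.has(jobId)) return false;
    this.owners.set(jobId, workerId);
    return true;
  }

  // Releases a claim so the job can be retried; only the owner may release.
  release(jobId: JobId, workerId: WorkerId): boolean {
    if (this.owners.get(jobId) !== workerId) return false;
    this.owners.delete(jobId);
    return true;
  }

  // Reports which worker currently holds the job, if any.
  ownerOf(jobId: JobId): WorkerId | undefined {
    return this.owners.get(jobId);
  }
}
```

The property to notice is that a second `tryClaim` for the same job returns `false` until the owner releases it, so at most one worker ever holds a job.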

Idempotent retries

Every job is designed so that re-running it produces the same outcome:

  • Builds are pinned to a specific commit and lockfile. The same inputs produce the same outputs.
  • PR comments are matched by an embedded HTML marker (<!-- yofix-deploy:projectId={id} -->), so a retry always updates the existing comment in place rather than posting a new one.
  • Comparisons are pure functions of (baseline image, preview image, config). The same inputs produce the same diff.
  • Promote/rollback are atomic at the manifest layer; a half-applied swap is impossible.
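The marker-based matching for PR comments can be sketched as a pure planning function. The marker format is the one given above; the `PrComment` and `CommentAction` types and the `planCommentAction` helper are hypothetical names for illustration, and the actual call to the Git host's API is deliberately left out.

```typescript
// Minimal shape of a comment as returned by a Git host (assumed for this sketch).
interface PrComment {
  id: number;
  body: string;
}

type CommentAction =
  | { kind: "create"; body: string }
  | { kind: "update"; commentId: number; body: string };

// Builds the hidden marker that identifies a project's deploy comment.
function deployMarker(projectId: string): string {
  return `<!-- yofix-deploy:projectId=${projectId} -->`;
}

// Decides whether to create a comment or update the existing one.
// Re-running with the same inputs always yields the same decision.
function planCommentAction(
  existing: PrComment[],
  projectId: string,
  content: string
): CommentAction {
  const marker = deployMarker(projectId);
  const body = `${marker}\n${content}`;
  const match = existing.find((c) => c.body.includes(marker));
  return match
    ? { kind: "update", commentId: match.id, body }
    : { kind: "create", body };
}
```

Because the decision depends only on the comments currently on the PR, a retried job that finds its earlier comment plans an update, never a second create.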

Isolated worker pools

Build workers and VRT (visual regression testing) workers run in separate pools. A misbehaving build (out-of-memory, long-running install, infinite loop) cannot starve VRT capacity, and a misbehaving VRT job cannot starve build capacity. Each pool has CPU and memory caps configured via systemd; jobs that exceed their caps are killed and retried.

Health-checked deploys

The deploy pipeline targets multiple servers and compares SHA hashes after pulling. Any version skew between the primary and a worker fails the deploy. If migrations fail, services stay on the previous version.
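The skew check can be sketched as follows. This is an illustration under assumptions, not the deploy pipeline itself: the `NodeVersion` shape, the `role` field, and the `checkVersionSkew` function are hypothetical, and how each server reports its SHA is out of scope.

```typescript
// One server's reported version after pulling (shape assumed for this sketch).
interface NodeVersion {
  host: string;
  role: "primary" | "worker";
  sha: string;
}

type SkewResult =
  | { status: "consistent"; sha: string }
  | { status: "skewed"; expected: string; mismatched: NodeVersion[] };

// Compares every node's SHA against the primary's.
// Any mismatch is reported so the caller can fail the deploy.
function checkVersionSkew(nodes: NodeVersion[]): SkewResult {
  const primary = nodes.find((n) => n.role === "primary");
  if (!primary) {
    throw new Error("no primary node reported a version");
  }
  const expected = primary.sha;
  const mismatched = nodes.filter((n) => n.sha !== expected);
  return mismatched.length === 0
    ? { status: "consistent", sha: expected }
    : { status: "skewed", expected, mismatched };
}
```

Returning the list of mismatched hosts, rather than a bare boolean, gives the failure report enough detail to show which server is behind.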

Deterministic comparisons

Screenshot capture uses a fixed viewport, fixed device pixel ratio, and fixed font rendering. The comparison stack (SSIM + pixelmatch + region detection) is deterministic for a given pair of images, so “rendering jitter” does not cause flaky results.
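To illustrate what "deterministic for a given pair of images" means, here is a deliberately simplified comparison written as a pure function. It is a stand-in, not SSIM or pixelmatch: it only counts pixels whose largest per-channel difference exceeds a threshold. The `RgbaImage` type and `countDifferentPixels` name are assumptions made for this sketch.

```typescript
// Raw RGBA image: 4 bytes per pixel, row-major (shape assumed for this sketch).
interface RgbaImage {
  width: number;
  height: number;
  data: Uint8Array;
}

// Counts pixels whose largest per-channel difference exceeds `threshold`.
// No clock, randomness, or I/O is involved, so equal inputs give equal output.
function countDifferentPixels(
  baseline: RgbaImage,
  preview: RgbaImage,
  threshold: number
): number {
  if (baseline.width !== preview.width || baseline.height !== preview.height) {
    throw new Error("images must have identical dimensions");
  }
  let different = 0;
  for (let i = 0; i < baseline.data.length; i += 4) {
    let maxDelta = 0;
    for (let c = 0; c < 4; c++) {
      const delta = Math.abs(baseline.data[i + c] - preview.data[i + c]);
      maxDelta = Math.max(maxDelta, delta);
    }
    if (maxDelta > threshold) different++;
  }
  return different;
}
```

The result depends only on the two pixel buffers and the threshold, which is the same property the text claims for the real comparison stack.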

When things go wrong

When a job still fails after its retries are exhausted (for example, because of a network error, OOM, or a dependency resolution failure), the system:

  1. Records the failure with full context in the BuildIncident table.
  2. Updates the PR comment to show the failure with a link to logs.
  3. Notifies the project’s owner channel (Slack / email, configurable).
  4. Leaves the previous successful artifact serving production, unaffected.
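The steps above can be sketched as a function that turns a failure into an ordered list of actions. Only the `BuildIncident` table name comes from the text; the `JobFailure` and `FailureAction` types, the `planFailureActions` function, and the example URL used in any call are hypothetical. Executing the actions (database write, comment update, Slack or email delivery) is left to the caller.

```typescript
type FailureCause = "network" | "oom" | "dependency-resolution";

// Context captured when a job fails for good (shape assumed for this sketch).
interface JobFailure {
  jobId: string;
  projectId: string;
  cause: FailureCause;
  logsUrl: string;
}

type FailureAction =
  | { step: "record-incident"; table: "BuildIncident"; failure: JobFailure }
  | { step: "update-pr-comment"; projectId: string; body: string }
  | { step: "notify-owner"; projectId: string; message: string };

// Produces the first three steps in order. Step 4 is the absence of any
// action here that promotes, swaps, or removes the production artifact.
function planFailureActions(failure: JobFailure): FailureAction[] {
  const summary = `Job ${failure.jobId} failed (${failure.cause})`;
  return [
    { step: "record-incident", table: "BuildIncident", failure },
    {
      step: "update-pr-comment",
      projectId: failure.projectId,
      body: `${summary}. Logs: ${failure.logsUrl}`,
    },
    { step: "notify-owner", projectId: failure.projectId, message: summary },
  ];
}
```

Modelling the response as data makes the guarantee easy to check: the action type has no variant that can touch production, so a failure path cannot express one.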

We never silently swallow errors. We never “best-effort” fall back to a half-broken state. The reliability promise is: if a build hasn’t fully succeeded, your production traffic isn’t touched.
