Reliability model
Reliability is the property we are most opinionated about. This page spells out the guarantees and the mechanisms behind them.
SLO targets
| Metric | Target |
|---|---|
| Platform uptime | 99.95% monthly |
| Median build start latency | < 30 seconds from PR push |
| Rollback time (click → live) | < 10 seconds |
| PR comment idempotency | 100% (no duplicate comments, ever) |
| Build determinism | 100% same-commit → same-artifact |
These targets are enforced internally and are part of Enterprise contracts.
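To make the uptime target concrete, the percentage can be converted into a downtime budget. The sketch below assumes a 30-day month, which the table does not specify; the function name is illustrative only.

```typescript
// Convert an availability target into an allowed-downtime budget.
// Assumption: a 30-day measurement window (not stated in the SLO table).
function downtimeBudgetMinutes(availability: number, daysInMonth: number = 30): number {
  const totalMinutes = daysInMonth * 24 * 60;
  return totalMinutes * (1 - availability);
}

const monthlyBudget = downtimeBudgetMinutes(0.9995);
console.log(monthlyBudget.toFixed(1)); // 21.6
```

Under that assumption, 99.95% monthly uptime leaves roughly 21.6 minutes of downtime per month.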
Mechanisms
Atomic job claiming
Every job (build, screenshot, comparison, promote, rollback, cleanup) is claimed by exactly one worker at a time. Claiming relies on BullMQ’s atomic Redis operations rather than application-level locks, so two worker nodes running in parallel cannot both claim the same job.
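The claim invariant can be illustrated with a small in-memory model. This is not BullMQ's implementation: `JobStore` and `tryClaim` are hypothetical names, and a `Map` stands in for Redis. The point is only that the state check and the state change happen as one step, so a second claimant always loses.

```typescript
type JobState = "waiting" | "active" | "completed";

interface Job {
  id: string;
  state: JobState;
  claimedBy?: string;
}

// Hypothetical in-memory store standing in for Redis.
class JobStore {
  private jobs = new Map<string, Job>();

  add(id: string): void {
    this.jobs.set(id, { id, state: "waiting" });
  }

  // Check-and-transition in a single step: the claim succeeds only if the
  // job is still "waiting". Against Redis, the check and the write would
  // have to run as one atomic operation to give the same guarantee.
  tryClaim(id: string, workerId: string): boolean {
    const job = this.jobs.get(id);
    if (job === undefined || job.state !== "waiting") return false;
    job.state = "active";
    job.claimedBy = workerId;
    return true;
  }

  owner(id: string): string | undefined {
    return this.jobs.get(id)?.claimedBy;
  }
}

const store = new JobStore();
store.add("build-42");
const claims = ["worker-a", "worker-b"].map((w) => store.tryClaim("build-42", w));
console.log(claims); // [ true, false ]
```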
Idempotent retries
Every job is designed so that re-running it produces the same outcome:
- Builds are pinned to a specific commit and lockfile. The same inputs produce the same outputs.
- PR comments are matched by an embedded HTML marker (`<!-- yofix-deploy:projectId={id} -->`), so retries always update in place rather than re-posting.
- Comparisons are pure functions of (baseline image, preview image, config). The same inputs produce the same diff.
- Promote/rollback are atomic at the manifest layer; a half-applied swap is impossible.
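The marker-based comment matching described in the list above can be sketched as an upsert. The marker format comes from the text; `PrComment`, `deployMarker`, and `upsertDeployComment` are hypothetical names, and an array stands in for the PR's comment thread rather than a real hosting API.

```typescript
interface PrComment {
  id: number;
  body: string;
}

// Marker format taken from the text above; everything else is hypothetical.
function deployMarker(projectId: string): string {
  return `<!-- yofix-deploy:projectId=${projectId} -->`;
}

// Update the existing marked comment if present, otherwise append a new one.
function upsertDeployComment(comments: PrComment[], projectId: string, text: string): PrComment[] {
  const marker = deployMarker(projectId);
  const body = `${marker}\n${text}`;
  const existing = comments.find((c) => c.body.includes(marker));
  if (existing !== undefined) {
    return comments.map((c) => (c.id === existing.id ? { ...c, body } : c));
  }
  const nextId = comments.reduce((max, c) => Math.max(max, c.id), 0) + 1;
  return [...comments, { id: nextId, body }];
}

let thread: PrComment[] = [{ id: 1, body: "LGTM" }];
thread = upsertDeployComment(thread, "p1", "Build started");
thread = upsertDeployComment(thread, "p1", "Build succeeded"); // simulated retry
console.log(thread.length); // 2
```

Because the lookup keys on the marker rather than on the comment text, a retried job rewrites the same comment no matter how many times it runs.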
Isolated worker pools
Build workers and visual regression testing (VRT) workers run in separate pools. A misbehaving build (out-of-memory, long-running install, infinite loop) cannot starve VRT capacity, and a misbehaving VRT job cannot starve build capacity. Each pool has CPU and memory caps enforced through systemd; jobs that exceed their caps are killed and retried.
Health-checked deploys
The deploy pipeline targets multiple servers and compares SHA hashes after pulling. Any version skew between primary and worker fails the deploy. Services stay on the previous version if migrations fail.
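A minimal sketch of the skew check, assuming each server reports the SHA it ended up on after the pull. The `ReportedShas` shape, the host names, and `checkVersionSkew` are hypothetical; how the real pipeline collects the hashes is not described here.

```typescript
// Hypothetical shape: host name -> SHA reported after the pull.
type ReportedShas = Record<string, string>;

interface SkewResult {
  ok: boolean;
  mismatched: string[]; // hosts whose SHA differs from the expected one
}

// The deploy passes only if every host reports the expected SHA.
function checkVersionSkew(expectedSha: string, reported: ReportedShas): SkewResult {
  const mismatched = Object.entries(reported)
    .filter(([, sha]) => sha !== expectedSha)
    .map(([host]) => host)
    .sort();
  return { ok: mismatched.length === 0, mismatched };
}

const skew = checkVersionSkew("a1b2c3d", { primary: "a1b2c3d", worker: "9f8e7d6" });
console.log(skew); // { ok: false, mismatched: [ 'worker' ] }
```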
Deterministic comparisons
Screenshot capture uses a fixed viewport, fixed device pixel ratio, and fixed font rendering. The comparison stack (SSIM + pixelmatch + region detection) is deterministic for a given pair of images, so “rendering jitter” does not produce flaky results.
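The determinism property can be shown with a deliberately simplified comparison. The function below is a plain per-pixel threshold count on grayscale data, not SSIM and not pixelmatch; `DiffConfig` and `mismatchRatio` are hypothetical names. What it shares with the real stack is that the result depends only on its inputs.

```typescript
// Simplified stand-in for the comparison stack: counts pixels whose absolute
// difference exceeds a threshold. Not SSIM and not pixelmatch.
interface DiffConfig {
  threshold: number; // per-pixel tolerance on a 0-255 grayscale value
}

function mismatchRatio(baseline: Uint8Array, preview: Uint8Array, config: DiffConfig): number {
  if (baseline.length !== preview.length) {
    throw new Error("images must have the same dimensions");
  }
  if (baseline.length === 0) return 0;
  let mismatched = 0;
  for (let i = 0; i < baseline.length; i++) {
    if (Math.abs(baseline[i] - preview[i]) > config.threshold) mismatched++;
  }
  return mismatched / baseline.length;
}

const baselineImg = Uint8Array.from([0, 0, 255, 255]);
const previewImg = Uint8Array.from([0, 10, 255, 0]);
const first = mismatchRatio(baselineImg, previewImg, { threshold: 16 });
const second = mismatchRatio(baselineImg, previewImg, { threshold: 16 });
console.log(first, first === second); // 0.25 true
```

No clock, random source, or external state enters the function, which is why a retry of the same comparison job yields the same diff.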
When things go wrong
When a job ultimately fails (network, OOM, dependency resolution), the system:
- Records the failure with full context in the `BuildIncident` table.
- Updates the PR comment to show the failure with a link to logs.
- Notifies the project’s owner channel (Slack / email, configurable).
- Leaves the previous successful artifact serving production unaffected.
We never silently swallow errors. We never “best-effort” fall back to a half-broken state. The reliability promise is: if a build hasn’t fully succeeded, your production traffic isn’t touched.
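The failure path above can be sketched as a single state transition. Everything here is hypothetical: `DeployState`, `handleBuildFailure`, the message formats, and the placeholder log URL are illustration only, with arrays standing in for the `BuildIncident` table and the notification channel. The property worth noticing is that the live artifact is never written on the failure path.

```typescript
interface DeployState {
  liveArtifact: string;    // artifact currently serving production
  incidents: string[];     // stand-in for the BuildIncident table
  prComment: string;
  notifications: string[]; // stand-in for Slack / email delivery
}

// Hypothetical failure handler: records, reports, and notifies, but never
// assigns liveArtifact, so production keeps serving the last good build.
function handleBuildFailure(
  state: DeployState,
  buildId: string,
  reason: string,
  logsUrl: string,
): DeployState {
  return {
    ...state,
    incidents: [...state.incidents, `${buildId}: ${reason}`],
    prComment: `Build ${buildId} failed: ${reason}. Logs: ${logsUrl}`,
    notifications: [...state.notifications, `Build ${buildId} failed`],
  };
}

const before: DeployState = {
  liveArtifact: "artifact-41",
  incidents: [],
  prComment: "Build b-42 running",
  notifications: [],
};
const after = handleBuildFailure(before, "b-42", "OOM", "https://example.invalid/logs/b-42");
console.log(after.liveArtifact); // artifact-41
```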