Disaster Recovery
This document describes how Itstabyl recovers from infrastructure failure and what customers can expect during an incident. It is a public-facing summary of the internal DR runbook; operational specifics (paging rotations, secret-rotation procedures) live in private operator runbooks.
Recovery objectives
| Objective | Target | Definition |
| --- | --- | --- |
| RTO (Recovery Time Objective) | 4 hours | Maximum tolerable wall-clock time from detected outage to restored service. |
| RPO (Recovery Point Objective) | 1 hour | Maximum tolerable data loss measured backward from incident start. |
These targets cover the production scan pipeline — the web service, the pipeline workers, and the Postgres database. The public marketing pages, this documentation site, and the status page itself are best-effort and explicitly excluded from the DR plan.
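As an illustration of how the two targets are measured, the sketch below checks a single incident against them. The function name, its parameters, and the example timestamps are hypothetical; only the 4-hour and 1-hour targets and their definitions come from the table above.

```python
from datetime import datetime, timedelta

# Targets from the recovery-objectives table.
RTO = timedelta(hours=4)
RPO = timedelta(hours=1)


def meets_objectives(
    detected_at: datetime,
    restored_at: datetime,
    incident_start: datetime,
    last_recoverable_point: datetime,
) -> dict:
    """Compare one incident's timeline with the RTO and RPO targets.

    All four parameters are hypothetical inputs for this sketch.
    """
    # RTO: wall-clock time from detected outage to restored service.
    downtime = restored_at - detected_at
    # RPO: data loss measured backward from incident start.
    data_loss = incident_start - last_recoverable_point
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= RTO,
        "rpo_met": data_loss <= RPO,
    }


# Example: outage starts 10:00, is detected 10:05, service is back 13:30,
# and the restore recovers data up to 09:40.
report = meets_objectives(
    detected_at=datetime(2024, 1, 1, 10, 5),
    restored_at=datetime(2024, 1, 1, 13, 30),
    incident_start=datetime(2024, 1, 1, 10, 0),
    last_recoverable_point=datetime(2024, 1, 1, 9, 40),
)
```

In this example the downtime is 3 hours 25 minutes and the data loss is 20 minutes, so both targets are met.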
Backup posture
Itstabyl runs the production database on managed Postgres with point-in-time recovery (PITR) enabled.
- PITR window: the database can be restored to any point within the preceding 7 days, capped by the managed-provider plan.
- Daily logical snapshots: retained for 30 days in cold storage in a separate availability zone from the primary.
- Weekly off-region backups: retained for 90 days as a defence against region-wide failure.
- Restore drills: performed quarterly. The most recent drill timestamp is reported on the status page (operator follow-up OP-5).
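The retention tiers above determine which backup sources can still serve a restore to a given point in the past. The following is a minimal sketch of that lookup; the tier names and the function are hypothetical, and only the 7, 30, and 90 day windows come from this document.

```python
from datetime import timedelta

# Retention windows from the backup posture list. Tier names are
# illustrative labels, not identifiers used by any real tooling.
RETENTION = [
    ("pitr", timedelta(days=7)),
    ("daily_snapshot", timedelta(days=30)),
    ("weekly_off_region", timedelta(days=90)),
]


def sources_covering(age: timedelta) -> list[str]:
    """Return the tiers whose retention window still covers a restore
    target that lies `age` in the past."""
    if age < timedelta(0):
        raise ValueError("restore target cannot be in the future")
    return [name for name, window in RETENTION if age <= window]
```

Only PITR restores to an arbitrary moment. The snapshot tiers restore to the nearest snapshot at or before the target, so a restore from them can lose up to a day or a week of changes respectively.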
Pipeline outputs are derived artefacts. They are not backed up separately because they can be reproduced by re-running the scan against the same source repository commit SHA. If a derived artefact is lost, customers can re-trigger the scan from the dashboard.
Failure-mode coverage
Itstabyl is built to degrade safely under third-party outage. Vendor circuit breakers cover the critical external dependencies — Clerk (authentication), Stripe (billing), Resend (transactional email), and the queue backend driving the pipeline workers.
A tripped breaker fails fast and returns HTTP 503 with a Retry-After header instead of blocking workers and amplifying the outage. The breaker state itself is observable via /api/v1/health.
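The fail-fast behaviour described above can be sketched as a small circuit breaker. This is an illustration under stated assumptions, not Itstabyl's implementation: the class and function names, the failure threshold, the reset timeout, and the 502 returned for an individual vendor error are all hypothetical.

```python
import math
import time
from typing import Any, Callable


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""

    def __init__(self, retry_after: int):
        super().__init__(f"circuit open, retry after {retry_after}s")
        self.retry_after = retry_after


class CircuitBreaker:
    # Threshold and timeout defaults are illustrative assumptions.
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half_open"  # allow one trial call through
        return "open"

    def call(self, fn: Callable[[], Any]) -> Any:
        state = self.state()
        if state == "open":
            remaining = self.reset_timeout - (self.clock() - self.opened_at)
            raise CircuitOpenError(max(1, math.ceil(remaining)))
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the breaker.
            if state == "half_open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


def handle(breaker: CircuitBreaker, fn: Callable[[], Any]):
    """Map breaker outcomes to (status, headers, body)."""
    try:
        return 200, {}, breaker.call(fn)
    except CircuitOpenError as exc:
        # Fail fast: no vendor call is made while the breaker is open.
        return 503, {"Retry-After": str(exc.retry_after)}, None
    except Exception:
        return 502, {}, None  # assumed status for a single vendor error
```

Injecting the clock keeps the breaker testable without sleeping. A health endpoint could report `breaker.state()` per vendor.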
Per-vendor recovery procedures, including what customers see and the breaker-state behaviour during recovery, are documented in the runbooks index. Specific recovery paths:
- Clerk outage: sign-in is unavailable; existing sessions continue to work until expiry. Sign-up is paused. Already-running scans complete normally.
- Stripe outage: new subscription checkouts fail closed; existing subscriptions continue to be honoured from the cached plan flag.
- Resend outage: transactional email queues for retry; the in-app notification surface remains authoritative.
- Queue/worker outage: the API accepts scans into a durable inbox and drains them when the worker tier recovers. Submitted scans are not lost.
- Database PITR-window event: restore from PITR. If PITR is unusable, fall back to the latest daily snapshot; in this degenerate case the RPO rises to at most 24 hours, and the exception is disclosed on the status page.
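The queue/worker recovery path above relies on accepting scans into durable storage and draining them later. The sketch below shows that pattern with SQLite from the Python standard library. The table layout, function names, and the `workers_healthy` flag are hypothetical and do not describe the production schema.

```python
import sqlite3
from typing import Callable


def open_inbox(path: str = ":memory:") -> sqlite3.Connection:
    """Open the inbox. A real deployment would use persistent storage;
    the in-memory default only keeps this sketch self-contained."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " repo TEXT NOT NULL,"
        " commit_sha TEXT NOT NULL,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def accept_scan(conn: sqlite3.Connection, repo: str, commit_sha: str) -> int:
    """Record a submitted scan before acknowledging it to the caller."""
    cur = conn.execute(
        "INSERT INTO inbox (repo, commit_sha) VALUES (?, ?)", (repo, commit_sha)
    )
    conn.commit()
    return cur.lastrowid


def drain(conn: sqlite3.Connection, run_scan: Callable[[str, str], None],
          workers_healthy: bool) -> int:
    """Process pending scans in submission order once workers are back."""
    if not workers_healthy:
        return 0  # leave everything queued during the outage
    rows = conn.execute(
        "SELECT id, repo, commit_sha FROM inbox WHERE done = 0 ORDER BY id"
    ).fetchall()
    for row_id, repo, sha in rows:
        run_scan(repo, sha)
        # Mark done only after the scan ran, so a crash leaves it pending.
        conn.execute("UPDATE inbox SET done = 1 WHERE id = ?", (row_id,))
        conn.commit()
    return len(rows)
```

Because a row is marked done only after `run_scan` returns, a crash mid-drain can run a scan twice but never drops one. That trade-off is acceptable here because scans are reproducible from the commit SHA.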
Incident communication
In-flight incident updates are posted to the status page (operator follow-up OP-5). For incidents lasting longer than 30 minutes, updates are also emailed to the address on file for each affected customer. A post-incident review is published within 7 days of resolution for any incident that breached the SLA target.
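The two timing rules above reduce to simple date arithmetic, sketched below. The function names and parameters are hypothetical; the 30-minute and 7-day values come from this section.

```python
from datetime import datetime, timedelta
from typing import Optional

EMAIL_THRESHOLD = timedelta(minutes=30)  # email once an incident runs longer
REVIEW_DEADLINE = timedelta(days=7)      # review due this long after resolution


def should_email(started_at: datetime, now: datetime) -> bool:
    """True once the incident has lasted longer than 30 minutes."""
    return now - started_at > EMAIL_THRESHOLD


def review_due_by(resolved_at: datetime, breached_sla: bool) -> Optional[datetime]:
    """Publication deadline for the post-incident review, or None when
    the incident did not breach the SLA target."""
    return resolved_at + REVIEW_DEADLINE if breached_sla else None
```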
Contact
Incidents in progress are best tracked on the status page. To report a production-down condition, contact the support address shown in-app on dashboard error states.