Disaster Recovery

This document describes how Itstabyl recovers from infrastructure failure and what customers can expect during an incident. It is a public-facing summary of the internal DR runbook; operational specifics (paging rotations, secret-rotation procedures) live in private operator runbooks.

Recovery objectives

| Objective | Target | Definition |
| --- | --- | --- |
| RTO (Recovery Time Objective) | 4 hours | Maximum tolerable wall-clock time from detected outage to restored service. |
| RPO (Recovery Point Objective) | 1 hour | Maximum tolerable data loss measured backward from incident start. |

These targets cover the production scan pipeline — the web service, the pipeline workers, and the Postgres database. The public marketing pages, this documentation site, and the status page itself are best-effort and explicitly excluded from the DR plan.
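As an illustration, the two objectives in the table above can be read as checks on four incident timestamps. The sketch below is not Itstabyl's tooling; the `Incident` fields and function names are hypothetical, and only the 4-hour and 1-hour targets come from this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Targets taken from the recovery objectives table.
RTO = timedelta(hours=4)
RPO = timedelta(hours=1)


@dataclass(frozen=True)
class Incident:
    """Hypothetical record of the timestamps needed to evaluate RTO and RPO."""
    started_at: datetime           # incident start (RPO is measured backward from here)
    detected_at: datetime          # outage detected (the RTO clock starts here)
    restored_at: datetime          # service restored
    last_recoverable_at: datetime  # latest point in time the data was restored to


def recovery_time(incident: Incident) -> timedelta:
    """Wall-clock time from detected outage to restored service."""
    return incident.restored_at - incident.detected_at


def data_loss_window(incident: Incident) -> timedelta:
    """Data lost, measured backward from incident start (never negative)."""
    return max(incident.started_at - incident.last_recoverable_at, timedelta(0))


def meets_objectives(incident: Incident) -> dict:
    """Report whether each objective was met for this incident."""
    return {
        "rto": recovery_time(incident) <= RTO,
        "rpo": data_loss_window(incident) <= RPO,
    }


example = Incident(
    started_at=datetime(2024, 1, 1, 9, 0),
    detected_at=datetime(2024, 1, 1, 9, 10),
    restored_at=datetime(2024, 1, 1, 12, 40),
    last_recoverable_at=datetime(2024, 1, 1, 8, 35),
)
print(meets_objectives(example))
```

In this made-up example the recovery time is 3 hours 30 minutes and the data-loss window is 25 minutes, so both objectives are met.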

Backup posture

Itstabyl runs the production database on managed Postgres with point-in-time recovery (PITR) enabled.

Pipeline outputs are derived artefacts. They are not backed up separately because re-running the scan against the same source repository commit SHA reproduces them at a known, deterministic cost. If a derived artefact is lost, customers can re-trigger the scan from the dashboard.

Failure-mode coverage

Itstabyl is built to degrade safely during third-party outages. Vendor circuit breakers cover the critical external dependencies: Clerk (authentication), Stripe (billing), Resend (transactional email), and the queue backend that drives the pipeline workers.

A tripped breaker fails fast and returns HTTP 503 with a Retry-After header instead of blocking workers and amplifying the outage. The breaker state itself is observable via /api/v1/health.
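The fail-fast behaviour described above can be sketched with a minimal circuit breaker. This is an illustrative sketch, not Itstabyl's implementation: the class, the failure threshold, the cooldown, and the `handle` helper are all assumptions. Only the HTTP 503 status and the Retry-After header come from this document.

```python
import math
import time
from typing import Callable, Optional


class BreakerOpen(Exception):
    """Raised when a call is rejected because the breaker is open."""

    def __init__(self, retry_after: int) -> None:
        super().__init__(f"breaker open, retry after {retry_after}s")
        self.retry_after = retry_after


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, then cools down."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half_open"  # cooldown elapsed: allow one trial call
        return "open"

    def call(self, fn: Callable[[], object]) -> object:
        if self.state() == "open":
            # Fail fast instead of tying up a worker on a vendor that is down.
            remaining = self.cooldown_s - (self.clock() - self.opened_at)
            raise BreakerOpen(retry_after=max(1, math.ceil(remaining)))
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed half-open trial, or reaching the threshold, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


def handle(breaker: CircuitBreaker, fn: Callable[[], object]):
    """Map an open breaker to 503 + Retry-After; other vendor errors propagate."""
    try:
        return 200, {}, breaker.call(fn)
    except BreakerOpen as exc:
        return 503, {"Retry-After": str(exc.retry_after)}, "vendor unavailable"
```

Under these assumptions, a client that receives the 503 can wait the number of seconds given in Retry-After before retrying, which keeps load off the vendor while it recovers.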

Per-vendor recovery procedures, including what customers see and how breaker state behaves during recovery, are documented in the runbooks index.

Incident communication

In-flight incident updates are posted to the status page (operator follow-up OP-5) and, for incidents lasting longer than 30 minutes, broadcast to the email address on file for affected customers. A post-incident review is published within 7 days of resolution for any incident that breached the SLA target.
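The two timing rules in this section can be expressed as a small sketch. The function names are hypothetical; the 30-minute email threshold and the 7-day review window are the only values taken from this document.

```python
from datetime import datetime, timedelta
from typing import Optional

EMAIL_THRESHOLD = timedelta(minutes=30)  # email once the incident lasts longer than this
REVIEW_WINDOW = timedelta(days=7)        # review published within this window after resolution


def should_email(started_at: datetime, now: datetime) -> bool:
    """True once the incident has lasted longer than 30 minutes."""
    return now - started_at > EMAIL_THRESHOLD


def review_due_by(resolved_at: datetime, breached_sla: bool) -> Optional[datetime]:
    """Deadline for the post-incident review, or None if the SLA target was not breached."""
    return resolved_at + REVIEW_WINDOW if breached_sla else None
```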

Contact

Incidents in progress are best tracked on the status page. To report a production-down condition, escalate to the support address surfaced in-app on dashboard error states.