Disaster Recovery

This document describes how Itstabyl recovers from infrastructure failure and what customers can expect during an incident. It is a public-facing summary of the internal DR runbook; operational specifics (paging rotations, secret-rotation procedures) live in private operator runbooks.

Recovery objectives

| Objective | Target | Definition |
| --- | --- | --- |
| RTO (Recovery Time Objective) | 4 hours | Maximum tolerable wall-clock time from detected outage to restored service. |
| RPO (Recovery Point Objective) | 1 hour | Maximum tolerable data loss measured backward from incident start. |

These targets cover the production scan pipeline — the web service, the pipeline workers, and the Postgres database. The public marketing pages, this documentation site, and the status page itself are best-effort and explicitly excluded from the DR plan.
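As an illustration of how the two targets are measured, the sketch below compares one incident timeline against the RTO and RPO figures in the table above. The function name, field names, and sample timestamps are hypothetical and are not part of Itstabyl's tooling.

```python
from datetime import datetime, timedelta, timezone

# Targets taken from the recovery objectives table.
RTO = timedelta(hours=4)
RPO = timedelta(hours=1)


def recovery_report(detected_at, restored_at, incident_start, last_recoverable_write):
    """Compare one incident's measured recovery against the RTO and RPO targets."""
    # RTO is measured from detection of the outage to restored service.
    downtime = restored_at - detected_at
    # RPO is measured backward from incident start to the last recoverable write.
    data_loss = incident_start - last_recoverable_write
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= RTO,
        "rpo_met": data_loss <= RPO,
    }


# Hypothetical incident timeline, for illustration only.
start = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
report = recovery_report(
    detected_at=start + timedelta(minutes=5),
    restored_at=start + timedelta(hours=3),
    incident_start=start,
    last_recoverable_write=start - timedelta(minutes=20),
)
```

In this made-up timeline the service is down for 2 hours 55 minutes and loses 20 minutes of writes, so both targets are met.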

Backup posture

Itstabyl runs the production database on managed Postgres with point-in-time recovery (PITR) enabled.

PITR window: the database can be restored to any point within the preceding 7 days, capped by the managed-provider plan.
Daily logical snapshots: retained for 30 days in cold storage in a separate availability zone from the primary.
Weekly off-region backups: retained for 90 days as a defence against region-wide failure.
Restore drills: performed quarterly. The most recent drill timestamp is reported on the status page (operator follow-up OP-5).
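The retention windows above determine which backup tiers can still serve a given restore target. The following is a minimal sketch of that lookup; the tier names and the function are illustrative assumptions. Only PITR restores to an arbitrary point in time. The snapshot tiers restore to the nearest snapshot taken at or before the target.

```python
from datetime import datetime, timedelta, timezone

# Retention windows as stated in this section; the tier names are hypothetical.
RETENTION = [
    ("pitr", timedelta(days=7)),
    ("daily_snapshot", timedelta(days=30)),
    ("weekly_off_region", timedelta(days=90)),
]


def restore_sources(target, now):
    """Return the backup tiers whose retention window still covers `target`."""
    age = now - target
    if age < timedelta(0):
        raise ValueError("restore target is in the future")
    return [name for name, window in RETENTION if age <= window]
```

For example, a restore target ten days in the past is outside the PITR window, so `restore_sources` returns only the daily and weekly tiers.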

Pipeline outputs are derived artefacts. They are not separately backed up because they can be reproduced from the source repository commit SHA at a predictable cost: re-running the scan. Customers can re-trigger a scan from the dashboard if a derived artefact is lost.

Failure-mode coverage

Itstabyl is built to degrade safely under third-party outage. Vendor circuit breakers cover the critical external dependencies — Clerk (authentication), Stripe (billing), Resend (transactional email), and the queue backend driving the pipeline workers.

A tripped breaker fails fast and returns HTTP 503 with a Retry-After header instead of blocking workers and amplifying the outage. The breaker state itself is observable via /api/v1/health.
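The fail-fast behaviour can be sketched as a small circuit breaker. This is a minimal illustration, not Itstabyl's implementation: the class, the failure threshold, the cooldown, and the `(status, headers, body)` return shape are all assumptions. Only the 503 status and the Retry-After header come from the description above.

```python
import math
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a trial call after `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold  # hypothetical default
        self.cooldown = cooldown    # hypothetical default, in seconds
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None

    def retry_after(self):
        """Whole seconds until a trial call is allowed; 0 when closed or cooled down."""
        if self.opened_at is None:
            return 0
        remaining = self.cooldown - (self.clock() - self.opened_at)
        return max(0, math.ceil(remaining))

    def call(self, fn):
        """Run `fn` through the breaker and return (status, headers, body)."""
        wait = self.retry_after()
        if wait > 0:
            # Breaker is open: fail fast instead of tying up a worker on a dead vendor.
            return 503, {"Retry-After": str(wait)}, None
        try:
            body = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                # Trip the breaker, or re-trip it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return 200, {}, body
```

While the breaker is closed, vendor errors propagate to the caller as usual. Once it trips, callers receive the 503 response immediately until the cooldown elapses, at which point a single successful call closes it again.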

Per-vendor recovery procedures, including what customers see and the breaker-state behaviour during recovery, are documented in the runbooks index. Specific recovery paths:

Clerk outage: sign-in is unavailable; existing sessions continue to work until expiry. Sign-up is paused. Already-running scans complete normally.
Stripe outage: new subscription checkouts fail closed; existing subscriptions continue to be honoured from the cached plan flag.
Resend outage: transactional email queues for retry; the in-app notification surface remains authoritative.
Queue/worker outage: the API accepts scans into a durable inbox and drains them when the worker tier recovers. Submitted scans are not lost.
Database failure within the PITR window: restore from PITR. If PITR is unusable, fall back to the latest daily snapshot; in this degenerate case the RPO degrades to at most 24 hours and the data loss is disclosed on the status page.
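The queue/worker recovery path relies on persisting each scan request before acknowledging it. The sketch below shows one way such a durable inbox could work, using SQLite purely for illustration. The table schema and function names are hypothetical and say nothing about the storage Itstabyl actually uses.

```python
import sqlite3


def open_inbox(path=":memory:"):
    """Open (or create) the inbox table. Passing a file path makes it survive restarts."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "commit_sha TEXT NOT NULL, "
        "done INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def accept_scan(db, commit_sha):
    """API side: persist the request before acknowledging it to the customer."""
    with db:  # commits on success, rolls back on error
        cur = db.execute("INSERT INTO inbox (commit_sha) VALUES (?)", (commit_sha,))
    return cur.lastrowid


def drain(db, run_scan):
    """Worker side: process pending scans oldest-first, marking each done on success."""
    rows = db.execute(
        "SELECT id, commit_sha FROM inbox WHERE done = 0 ORDER BY id"
    ).fetchall()
    processed = 0
    for scan_id, sha in rows:
        run_scan(sha)  # if this raises, the row stays pending for the next drain
        with db:
            db.execute("UPDATE inbox SET done = 1 WHERE id = ?", (scan_id,))
        processed += 1
    return processed
```

Because a row is marked done only after `run_scan` returns, a worker crash mid-scan leaves the request pending rather than lost. The trade-off is that a scan may run more than once, which is acceptable here because scans are reproducible from the commit SHA.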

Incident communication

In-flight incident updates are posted to the status page (operator follow-up OP-5) and, for incidents lasting longer than 30 minutes, broadcast to the email address on file for affected customers. A post-incident review is published within 7 days of resolution for any incident that breached the SLA target.
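The two time-based rules above reduce to simple date arithmetic. The sketch below is illustrative only; the function names are hypothetical, and the 30-minute threshold is treated as strict ("longer than"), following the wording above.

```python
from datetime import datetime, timedelta, timezone

EMAIL_BROADCAST_AFTER = timedelta(minutes=30)
REVIEW_DEADLINE = timedelta(days=7)


def should_broadcast_email(started_at, now):
    """True once the incident has lasted longer than 30 minutes."""
    return now - started_at > EMAIL_BROADCAST_AFTER


def review_due_by(resolved_at):
    """Latest publication time for the post-incident review."""
    return resolved_at + REVIEW_DEADLINE
```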

Contact

Incidents in progress are best tracked on the status page. To report a production-down condition, escalate to the support address surfaced in-app on dashboard error states.