Backup Management
Backup visibility, recovery confidence, and the operating model teams use to keep resilience work reviewable instead of assumed.
Scope
Backups only create assurance when teams can explain what is protected, how current it is, and what happens when recovery is required. This guide keeps that public-safe operating view and excludes internals.
Operator guide for the Cadres backup management subsystem. Updated 2026-04-17.
Current repo-truth note (O05/R18, 2026-04-17):
- The control-plane CRUD, dashboard, route-access, scheduler, and restore workflows below match the current repo-backed implementation.
- A fresh DB-backed backend pytest run, a fresh frontend production build, and live deployed-runtime/browser proof are still environment-limited in this workspace and are not claimed here.
Table of Contents
- Getting Started
- Backup Engine (BES) Management
- Storage Target Management
- Third-Party Platform Integration
- Native Backup Policy Management
- Host Assignment
- Application-Consistent Hooks
- Replication Rules (3-2-1 Strategy)
- Monitoring and Dashboard
- Backup Job Management
- Restore Operations
- Storage Health and Capacity
- Protected Objects
- Permissions Reference
Getting Started
Setting up backup management requires these steps in order:
1. Register a BES – Deploy a Backup Engine Service instance in the customer network and register it with the backend
2. Create storage targets – Attach storage (S3, SMB, NFS, etc.) to the BES
3. Create a backup policy – Define what to back up, how often, and retention rules
4. Assign hosts – Assign managed hosts to the policy
5. Monitor – Use the dashboard and protected objects view to track backup health
For third-party platforms (Veeam, Cohesity, etc.), skip steps 1-4 and instead register the platform via the Platforms page. The platform sync will import jobs and protected objects automatically.
Current implementation note (2026-04-17): the native scheduler/background loops
are registered from backend/core/scheduler.py, so policy schedules,
auto-retry, retention, health reconciliation, and platform sync now run through
the current control plane instead of existing only as dormant modules.
Backup Engine (BES) Management
Registering a BES
BES instances self-register with the backend using the organization secret. This is done from the BES instance itself, not from the UI. The BES sends its name, deployment model, public key, hostname, IP address, and port during registration.
After registration, the BES appears on the Engines page with status “pending”. It transitions to “online” after its first heartbeat.
Deployment models:
- on_premise – BES running on customer-owned hardware
- cloud_vm – BES running on a cloud VM (AWS, Azure, GCP)
- cadres_hosted – BES hosted by Cadres
Viewing Engines
Navigate to Backup > Engines. The list shows all registered BES instances with:
- Name, hostname, IP address
- Status (pending, online, offline, degraded)
- Version
- Storage target count
- Total managed bytes
- Last heartbeat timestamp
Filter by organization, status, or search by name/hostname/IP.
Editing an Engine
Click an engine to edit its configuration. Admin-editable fields:
- Name – Display name
- PXE Enabled – Enable PXE boot for bare-metal restore
- PXE Network – Network CIDR for PXE boot
Deleting an Engine
Deleting a BES deregisters it and cascades deletion to all associated storage targets and policies. This is destructive and cannot be undone.
Heartbeat and Health
BES instances send heartbeats approximately every minute. If a heartbeat is not received for more than 3 minutes, the system automatically marks the BES as “offline” and fires a critical alert.
The BES also reports engine-level health metrics (CPU, memory, disk usage, active jobs, uptime) which are logged for diagnostic purposes.
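The offline cutoff described above can be sketched as a small staleness check. This is illustrative only; the real reconciliation runs inside the backend scheduler loops, and the function name is an assumption:

```python
from datetime import datetime, timedelta, timezone

OFFLINE_AFTER = timedelta(minutes=3)  # heartbeats arrive roughly every minute

def classify_engine(last_heartbeat, now=None):
    """Return 'online' or 'offline' based on heartbeat staleness.

    A BES with no heartbeat at all, or one whose last heartbeat is older
    than the 3-minute cutoff, is considered offline (and would fire a
    critical alert in the real system).
    """
    now = now or datetime.now(timezone.utc)
    if last_heartbeat is None or now - last_heartbeat > OFFLINE_AFTER:
        return "offline"
    return "online"
```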
Storage Target Management
Creating a Storage Target
Navigate to Backup > Storage and click “Add Storage Target”.
Required fields:
- Name – Display name for the target
- Target Type – One of: S3, S3 Compatible, SMB, NFS, iSCSI, Local
- BES – Which backup engine owns this storage
- Alert Threshold – Percentage at which a warning fires (default 85%)
- Critical Threshold – Percentage at which policies auto-pause (default 95%)

Connection config (varies by type):
- S3/S3 Compatible: endpoint URL, bucket, region, prefix
- SMB: server, share name, path
- NFS: server, export path, mount options
- iSCSI: target portal, IQN, LUN
- Local: filesystem path

Credentials (optional, stored encrypted in PAM vault):
- Username
- Secret (access key, password, etc.)
- Credential type
Credentialed storage targets require an organization-scoped BES. When a BES is account-scoped, the UI can still create credentialless targets and run storage verification, but credentialed target creation fails closed because the backing vault identities remain organization-owned.
The critical threshold must be higher than the alert threshold.
Editing a Storage Target
Click a target to edit name, thresholds, connection config. Credential rotation is supported – provide new credentials to replace existing ones.
Deleting a Storage Target
A storage target cannot be deleted if any backup policies reference it (FK RESTRICT). Remove or reassign all policies first.
Deletion also removes the linked PAM Identity credential.
Native Backup Policy Management
Creating a Policy
Navigate to Backup > Policies and click “Add Policy”.
Core settings:
- Name – Policy display name
- BES – Which backup engine handles this policy
- Storage Target – Where to store backups (must be on the selected BES)
- Backup Type – file (file-level) or image (full disk image)
Scheduling (cron expressions):
- Full Schedule – When to run full backups (required, e.g., 0 2 * * 0 for Sunday 2 AM)
- Incremental Schedule – When to run incrementals (optional, e.g., 0 2 * * 1-6 for Mon-Sat 2 AM)
- Synthetic Full Schedule – When to synthesize a full from base + incrementals (optional)
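The cron fields used by these schedules can be illustrated with a minimal matcher. This is a sketch only — it supports numbers, `*`, commas, and ranges like `1-6`, but not step values or names; the actual scheduler would use a full cron implementation:

```python
def cron_field_matches(field, value):
    """True if a single cron field (e.g. '0', '*', '1-6') matches value."""
    if field == "*":
        return True
    for part in field.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            if lo <= value <= hi:
                return True
        elif int(part) == value:
            return True
    return False

def cron_matches(expr, dt):
    """Check a 5-field cron expression (min hour dom month dow) against dt.

    Cron's day-of-week uses 0=Sunday, while Python's weekday() uses
    0=Monday, hence the conversion below.
    """
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (dt.weekday() + 1) % 7  # convert to cron's 0=Sunday
    return (cron_field_matches(minute, dt.minute)
            and cron_field_matches(hour, dt.hour)
            and cron_field_matches(dom, dt.day)
            and cron_field_matches(month, dt.month)
            and cron_field_matches(dow, cron_dow))
```

For example, `0 2 * * 0` matches Sunday at 02:00 and `0 2 * * 1-6` matches Monday through Saturday at 02:00, as in the schedules above.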
Path selection:
- Exclude Paths – Directories to skip
- Exclude Patterns – Glob patterns to skip (e.g., *.tmp, *.log)
- Exclusion Template – Pre-built exclusion set name
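How the path-selection settings combine can be sketched with Python's `fnmatch`. This is illustrative — the agent's real matcher may differ (for example, matching glob patterns against full paths rather than file names):

```python
from fnmatch import fnmatch

def is_excluded(path, exclude_paths=(), exclude_patterns=()):
    """Check a file path against a policy's exclusion settings.

    Exclude Paths act as directory prefixes; Exclude Patterns are globs
    matched here against the file name only.
    """
    # Directory exclusions: the path itself, or anything beneath it.
    if any(path == p or path.startswith(p.rstrip("/") + "/") for p in exclude_paths):
        return True
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(name, pat) for pat in exclude_patterns)
```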
Retention (Grandfather-Father-Son):
- Daily – Keep N most recent daily backups (1-365, default 7)
- Weekly – Keep N most recent weekly backups (0-52, default 4)
- Monthly – Keep N most recent monthly backups (0-120, default 12)
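GFS selection can be sketched as follows. This is a simplified model — the newest backup of each day, ISO week, and month fills the corresponding quota — and the backend's actual retention loop may differ in detail:

```python
from datetime import date

def gfs_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates a GFS policy would retain.

    Walks backups newest-first; each backup can count toward the daily,
    weekly, and monthly quotas simultaneously.
    """
    ordered = sorted(backup_dates, reverse=True)  # newest first
    keep = set()
    seen_days, seen_weeks, seen_months = [], [], []
    for d in ordered:
        week, month = d.isocalendar()[:2], (d.year, d.month)
        if d not in seen_days and len(seen_days) < daily:
            seen_days.append(d); keep.add(d)
        if week not in seen_weeks and len(seen_weeks) < weekly:
            seen_weeks.append(week); keep.add(d)
        if month not in seen_months and len(seen_months) < monthly:
            seen_months.append(month); keep.add(d)
    return keep
```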
Data protection:
- Compression – lz4 (fast), zstd (better ratio), or none
- Encryption – AES-256-GCM (default), AES-256-CBC, or none
- Deduplication – Content-addressed dedup (default enabled)

Performance:
- Max Transfer MBps – Bandwidth cap (optional)
- Max IO Priority – normal/low/high
- Throttle Schedule – Cron for when throttling applies

Behavior:
- VSS Enabled – Use VSS snapshots on Windows, LVM freeze on Linux (default true)
- Catch Up On Reconnect – Run missed backups when agent reconnects (default true)
- Verify After Backup – Automatically verify backup integrity (default true)
- Verify Sample Count – Number of files to sample for verification (1-10000, default 50)
- Pre-Patch Backup – Run an incremental before OS patching (default false)
- Archive On Host Deactivation – Archive backups when host is decommissioned (default true)
- Enabled – Policy active state
Editing a Policy
All fields above can be modified after creation except the BES assignment. If the storage target is changed, it must belong to the same BES.
Disabling/Enabling a Policy
Toggle the Enabled state. A disabled policy will not have scheduled backups dispatched.
Note: Policies may be auto-paused by the system when their storage target hits critical capacity. The paused_reason field shows “storage_critical” in this case. These policies auto-resume when storage returns to healthy.
Deleting a Policy
Deleting a policy cascades to host assignments, replication rules, and app hooks.
Host Assignment
Assigning Hosts
From a policy detail view, use the host assignment interface to add hosts. The host must:
- Belong to the same account
- Be accessible by the user’s RBAC org scope
- Not already be assigned to this policy (unless archived)
If a host was previously assigned and then archived, reassigning it reactivates the existing assignment.
Removing Hosts
Remove a host assignment to stop backing up that host under this policy. This does not delete existing restore points.
Host Assignment States
| State | Meaning |
|---|---|
| active | Host is actively backed up by this policy |
| paused | Backups temporarily paused for this host |
| archived | Host decommissioned; backups preserved for retention |
Application-Consistent Hooks
Managing Hooks via the UI
When editing an existing backup policy, expand the App Hooks collapsible section (7th section in the policy form). From there you can:
- View existing hooks – Listed with hook type badge, template name, script reference, timeout, and fail-on-error indicator.
- Add a hook – Click “Add Hook” to open the inline form. Select hook type, optionally choose an app template or “Custom Script”, pick a script from the searchable dropdown, set timeout and sequence, and toggle fail-on-error.
- Edit a hook – Click an existing hook row to edit it inline.
- Delete a hook – Click the trash icon on a hook row (confirmation required).
Available Templates
Pre-built templates with script snippets:
| Template | Application | Method |
|----------|-------------|-------------------|
| mysql | MySQL / MariaDB | mysqldump --single-transaction |
| mongodb | MongoDB | mongodump with replica set support |
| elasticsearch | Elasticsearch | Snapshot to registered repository |
Hook Execution Order
Hooks of the same type execute in sequence order (ascending), then by ID for ties.
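In code, that ordering is a simple two-key sort (the hook dict shape here is illustrative; real hook records carry more fields):

```python
def execution_order(hooks):
    """Order hooks as described above: ascending sequence, then id for ties."""
    return sorted(hooks, key=lambda h: (h["sequence"], h["id"]))
```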
Replication Rules (3-2-1 Strategy)
Replication rules copy backup data from a policy’s primary storage target to secondary targets for redundancy.
The 3-2-1 Rule
- 3 copies of data (primary + 2 replicas)
- 2 different media types (e.g., disk + object storage + tape)
- 1 offsite copy
Creating a Replica Rule
Navigate to Backup > Replicas or from a policy detail view:
- Policy – Which backup policy to replicate
- Storage Target – Secondary storage target for the copy (must belong to same account, can be on a different BES)
- KEK Source – Key encryption key source: account (default) or pam_vault (external key management)
- Replicate After Full – Copy after full backups (default true)
- Replicate After Incremental – Copy after incremental backups (default true)
- Is Offsite – Mark as offsite copy for compliance (default false)
How Replication Works
- After a backup completes, the replication module checks if any replica rules apply
- If so, it dispatches a replication job to the BES
- The BES copies chunks from the primary storage target to the replica target
- Replication status is tracked and reported back via ingest
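Step 1 — deciding which replica rules apply to a completed backup — can be sketched as follows. The rule dicts mirror the form flags above; the shape and function name are illustrative:

```python
def rules_to_dispatch(rules, backup_type):
    """Select the replica rules that should fire after a backup run.

    A rule applies when its replicate-after flag for the run's backup type
    is set (both flags default to true, per the form above).
    """
    flag = ("replicate_after_full" if backup_type == "full"
            else "replicate_after_incremental")
    return [r for r in rules if r.get(flag, True)]
```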
Monitoring and Dashboard
Dashboard Overview
Navigate to Backup > Dashboard for an aggregated health view.
Summary metrics:
- Protected Count – Total objects with backup coverage
- At Risk Count – Objects where the last backup failed
- RPO Breach Count – Objects exceeding their RPO threshold
- Unprotected Host Count – Managed hosts with no backup coverage

Engine status:
- BES Online – Engines reporting healthy heartbeats
- BES Offline – Engines with stale heartbeats (> 3 minutes)

Recent activity (last 24 hours):
- Success Count – Successful backup job runs
- Failed Count – Failed backup job runs
- Warning Count – Backup job runs with warnings
Exception queues (exception-first triage):
- Failed Backups (red) – Failed backup jobs with retry button. Shows policy name, status, failure time. Empty state: “No failed backups”.
- RPO Breaches (orange) – Protected objects exceeding their RPO threshold. Shows object name, host, last backup time, RPO hours, breach duration. Fetched via . Row click deep-links to Backup > Restore with the matching host and restore point context. Empty state: “All objects within RPO”.
- Storage Health Alerts (yellow) – Storage targets in warning or critical health. Shows target name, type, usage %, health status, last check time. Inline health check and integrity verify action buttons. Card click deep-links to Backup > Storage. Empty state: “All storage healthy”.
- Verification Failures (yellow) – Restore points where verification failed. Shows restore point ID, host, policy, backup type, completion time. Fetched via . Row click deep-links to Backup > Restore with the matching host and restore point context. Empty state: “All verifications passing”.
Each section loads independently with its own loading state and error handling. Action buttons are gated by backup.manage permission.
Route access note: the relevant workflow requires backup.view, and deep links to tabs such
as restore or policies fail closed in the UI when the caller lacks the
corresponding backup.restore or backup.manage permission.
Backup Job Management
Viewing Jobs
Navigate to Backup > Jobs for a unified view of all backup jobs, covering both third-party platform synced jobs and native BES jobs.
Filter options:
- Platform or BES source
- Job status (last run)
- Job type (full, incremental, differential, snapshot, log)
- Organization
- Date range
- Search by policy name or source job ID
Job Run History
Click a job to see its execution history. Each run shows:
- Status (success, warning, failed, running, pending)
- Start/end time and duration
- Data transferred and protected size
- Restore point ID (if applicable)
- Error message (if failed)
- Whether the run is retryable
Retrying a Failed Job
Click Retry on a failed job to dispatch a new run. Requirements:
- The latest run must be in failed or warning state
- The run must be marked as retryable (is_retryable=True)
For native BES jobs, retry dispatches agent jobs to all active hosts on the policy. For platform jobs, retry creates a pending run that the next sync picks up.
Automatic Retries
The system automatically retries failed jobs up to 3 times (configurable in auto_retry.py). Auto-retries:
- Only trigger when the latest run is failed and retryable
- Skip jobs that already have a pending or running run
- Use the same build_backup_payload() as manual dispatches
- Set is_retryable=False on retry runs to prevent cascading retries
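Taken together, the eligibility rules can be sketched as follows. The MAX_AUTO_RETRIES constant and field names here are assumptions for illustration, not the actual identifiers in auto_retry.py:

```python
MAX_AUTO_RETRIES = 3  # assumed default; the real limit is configurable

def should_auto_retry(latest_run, retry_count, has_pending_or_running):
    """Decide whether the auto-retry loop would dispatch another run.

    Mirrors the rules above: skip if a run is already queued or in flight,
    skip if the retry budget is exhausted, and only retry when the latest
    run is both failed and marked retryable.
    """
    if has_pending_or_running:
        return False
    if retry_count >= MAX_AUTO_RETRIES:
        return False
    return latest_run["status"] == "failed" and latest_run["is_retryable"]
```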
Restore Operations
Browsing Restore Points
Navigate to Backup > Restore to view all restore points.
Filter options:
- Host
- Policy
- BES
- Chain ID
- Status (complete, running, failed, expired)
- Verification status (passed, failed, partial, pending)
Current implementation note (2026-04-17): restore-point listing, browse, search, file restore, image restore, and BMR now run through the current backend/agent/BES contract. The remaining local evidence gap is environment verification, not a known repo-backed runtime hole.
Browsing Files in a Restore Point
Select a restore point and click Browse to explore its file tree. This dispatches a browse job to the BES and returns results asynchronously (poll the job for results).
Provide a path to browse (default “/”).
Searching Files
Use the Search feature to find specific files within a restore point. Enter a search query (filename, pattern, etc.) and the system dispatches a search job to the BES.
File-Level Restore
From a restore point, click Restore Files:
- Select files – Choose files or directories to restore (up to 10,000 paths)
- Target path – Directory where files will be restored
- Overwrite – Whether to overwrite existing files at the target (default: no)
- Target host – Optionally restore to a different host (defaults to the original host)
The restore job runs with high priority. The agent:
1. Retrieves the manifest from the BES
2. Filters for the requested paths
3. Downloads chunks from the BES
4. Reassembles files with original permissions
Image-Level Restore
For image-type backups, use Restore Image to perform a full disk restore:
- Target host – The host to restore the image to (required)
- Target path – The mount or image path to restore into (required)
- Overwrite – Whether to overwrite existing image data at the target path
- The job runs with critical priority
Bare-Metal Restore (BMR)
For complete system recovery to new or wiped hardware:
- Select a restore point and click BMR
- Target method – pxe (network boot) or iso (boot media) for booting the target hardware
Requirements:
- Restore point must be complete
- BES must be online
- For PXE: the BES must have pxe_enabled=True
The system generates a one-time BMR token (UUID, 4-hour expiry) for replay protection. The BMR job runs with critical priority.
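The token scheme can be sketched as follows. This is illustrative only — persistence and single-use enforcement are backend concerns, and the function names are assumptions:

```python
import uuid
from datetime import datetime, timedelta, timezone

BMR_TOKEN_TTL = timedelta(hours=4)  # 4-hour expiry, per the text above

def issue_bmr_token(now=None):
    """Issue a one-time BMR token: a random UUID plus an expiry timestamp."""
    now = now or datetime.now(timezone.utc)
    return {"token": str(uuid.uuid4()), "expires_at": now + BMR_TOKEN_TTL}

def token_valid(record, presented, used, now=None):
    """A token is accepted only once, only before expiry, and only on match."""
    now = now or datetime.now(timezone.utc)
    return (not used) and presented == record["token"] and now < record["expires_at"]
```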
Storage Health and Capacity
Health Check vs Integrity Verify
Two distinct operations for storage verification:
| Operation | What It Tests | Who Runs It |
|---|---|---|
| Health Check | Reachability probe plus BES capacity/health reporting | Any eligible online relay host in the target org when scoped, otherwise an online host in the same account |
| Integrity Verify | Catalog validation, chain integrity, restore rehearsal | BES (needs backup catalog) |
Both dispatch agent jobs and return results asynchronously.
Capacity Monitoring
Storage targets automatically report capacity data via the BES health ingest. The system tracks:
- Total, used, and free bytes
- Usage percentage
- Growth rate (EWMA smoothing)
- Estimated time-to-full projection
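The growth-rate smoothing and time-to-full projection can be sketched as follows. The alpha value is an assumption for illustration, not the system's actual smoothing factor:

```python
def ewma_update(prev, sample, alpha=0.3):
    """One exponentially weighted moving average step.

    The first sample seeds the average; later samples blend in with
    weight alpha.
    """
    return sample if prev is None else alpha * sample + (1 - alpha) * prev

def hours_to_full(free_bytes, growth_bytes_per_hour):
    """Naive time-to-full projection from the smoothed growth rate."""
    if growth_bytes_per_hour <= 0:
        return None  # flat or shrinking usage: no projected full date
    return free_bytes / growth_bytes_per_hour
```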
Capacity Alerts
| Threshold | Status | Action |
|---|---|---|
| < Alert threshold (default 85%) | healthy | No action |
| >= Alert threshold | warning | Alert fired to notification channels |
| >= Critical threshold (default 95%) | critical | Alert fired + all policies targeting this storage auto-paused |
When storage returns to healthy from critical, auto-paused policies are automatically re-enabled. Manually disabled policies are not affected.
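The threshold classification and the auto-pause/auto-resume behaviour can be sketched as follows (the policy dict shape is illustrative):

```python
def storage_status(used_pct, alert=85, critical=95):
    """Classify storage health against the thresholds in the table above."""
    if used_pct >= critical:
        return "critical"   # alert fired + dependent policies auto-paused
    if used_pct >= alert:
        return "warning"    # alert fired to notification channels
    return "healthy"

def on_health_change(policy, new_status):
    """Apply the auto-pause/auto-resume rules to one policy.

    Only policies paused by the system (paused_reason == 'storage_critical')
    auto-resume; manually disabled policies are left untouched.
    """
    if new_status == "critical" and policy["enabled"]:
        policy.update(enabled=False, paused_reason="storage_critical")
    elif new_status == "healthy" and policy.get("paused_reason") == "storage_critical":
        policy.update(enabled=True, paused_reason=None)
    return policy
```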
Capacity History
View historical capacity trends via Backup > Storage > Capacity History. Shows up to 365 days of time-series data (usage percent, total/used/free bytes) with the current snapshot including growth rate and estimated full date.
Protected Objects
What Are Protected Objects?
Protected objects represent anything covered by backup – VMs, physical servers, filesets, cloud resources, NAS shares. They are created automatically when:
- A third-party platform sync discovers them
- A native backup policy is assigned to a host
Viewing Protected Objects
Navigate to Backup > Protected Objects.
Filter options:
- Organization
- Object type (vm, physical, fileset, cloud_resource, nas)
- RPO breach (true/false)
- Platform or BES source
- Host
- Last backup status (success, failed, warning, or “never” for objects never backed up)
- Search by object name or hostname
RPO Monitoring
Each protected object has an RPO (Recovery Point Objective) in hours. The system flags objects where the time since last successful backup exceeds the RPO as a “breach”.
RPO breach objects appear in:
- The protected objects list (filter by rpo_breach=true)
- The dashboard rpo_breach_count metric
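The breach test itself reduces to a time comparison. This sketch uses illustrative field names; objects with no successful backup at all are treated as breaching:

```python
from datetime import datetime, timedelta, timezone

def is_rpo_breach(last_success, rpo_hours, now=None):
    """True when the time since the last successful backup exceeds the RPO.

    Matches the definition above: an object that has never been backed up
    successfully is also flagged.
    """
    now = now or datetime.now(timezone.utc)
    if last_success is None:
        return True
    return now - last_success > timedelta(hours=rpo_hours)
```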
Permissions Reference
| Permission | Grants Access To |
|---|---|
| backup.view | View engines, platforms, jobs, protected objects, policies, replicas, storage targets, templates, dashboard |
| backup.manage | Create, update, delete engines/platforms/policies/replicas/storage/hooks. Trigger sync, run-now, test connectivity, health checks, retry jobs |
| backup.restore | View restore points, browse/search files, trigger file/image/BMR restores |