IT Service & Operations Manual

Backup Management

Backup visibility, recovery confidence, and the operating model teams use to keep resilience work reviewable instead of assumed.

Audience: Infrastructure, operations, and resilience teams
Focus: Backup posture and recovery readiness
Status: Public manual

Scope

Backups only create assurance when teams can explain what is protected, how current it is, and what happens when recovery is required. This guide keeps that public-safe operating view and excludes internals.

Operator guide for the Cadres backup management subsystem. Updated 2026-04-17.

Current repo-truth note (O05 / R18, 2026-04-17):

  • The control-plane CRUD, dashboard, route-access, scheduler, and restore workflows below match the current repo-backed implementation.
  • Fresh DB-backed backend pytest, a fresh frontend production build, and live deployed-runtime/browser proof are still environment-limited in this workspace and are not claimed here.

Table of Contents

  1. Getting Started
  2. Backup Engine (BES) Management
  3. Storage Target Management
  4. Third-Party Platform Integration
  5. Native Backup Policy Management
  6. Host Assignment
  7. Application-Consistent Hooks
  8. Replication Rules (3-2-1 Strategy)
  9. Monitoring and Dashboard
  10. Backup Job Management
  11. Restore Operations
  12. Storage Health and Capacity
  13. Protected Objects
  14. Permissions Reference

Getting Started

Setting up backup management requires these steps in order:

  1. Register a BES – Deploy a Backup Engine Service instance in the customer network and register it with the backend
  2. Create storage targets – Attach storage (S3, SMB, NFS, etc.) to the BES
  3. Create a backup policy – Define what to back up, how often, and retention rules
  4. Assign hosts – Assign managed hosts to the policy
  5. Monitor – Use the dashboard and protected objects view to track backup health

For third-party platforms (Veeam, Cohesity, etc.), skip steps 1-4 and instead register the platform via the Platforms page. The platform sync will import jobs and protected objects automatically.

Current implementation note (2026-04-17): the native scheduler/background loops are registered from backend/core/scheduler.py, so policy schedules, auto-retry, retention, health reconciliation, and platform sync now run through the current control plane instead of existing only as dormant modules.

Backup Engine (BES) Management

Registering a BES

BES instances self-register with the backend using the organization secret. This is done from the BES instance itself, not from the UI. The BES sends its name, deployment model, public key, hostname, IP address, and port during registration.

After registration, the BES appears on the Engines page with status “pending”. It transitions to “online” after its first heartbeat.

Deployment models:

  • on_premise – BES running on customer-owned hardware
  • cloud_vm – BES running on a cloud VM (AWS, Azure, GCP)
  • cadres_hosted – BES hosted by Cadres

Viewing Engines

Navigate to Backup > Engines. The list shows all registered BES instances with:

  • Name, hostname, IP address
  • Status (pending, online, offline, degraded)
  • Version
  • Storage target count
  • Total managed bytes
  • Last heartbeat timestamp

Filter by organization, status, or search by name/hostname/IP.

Editing an Engine

Click an engine to edit its configuration. Admin-editable fields:

  • Name – Display name
  • PXE Enabled – Enable PXE boot for bare-metal restore
  • PXE Network – Network CIDR for PXE boot

Deleting an Engine

Deleting a BES deregisters it and cascades deletion to all associated storage targets and policies. This is destructive and cannot be undone.

Heartbeat and Health

BES instances send heartbeats approximately every minute. If a heartbeat is not received for more than 3 minutes, the system automatically marks the BES as “offline” and fires a critical alert.

The BES also reports engine-level health metrics (CPU, memory, disk usage, active jobs, uptime) which are logged for diagnostic purposes.
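The offline-detection rule above can be sketched in a few lines. This is an illustrative sketch, not the backend's actual code; the function name and timeout constant are assumptions based on the stated 3-minute window:

```python
from datetime import datetime, timedelta, timezone

# Stated rule: no heartbeat for more than 3 minutes => offline.
HEARTBEAT_TIMEOUT = timedelta(minutes=3)

def engine_status(last_heartbeat, now=None):
    """Classify a BES by heartbeat freshness. A BES that has never sent
    a heartbeat stays 'pending'; a stale heartbeat marks it 'offline'."""
    if last_heartbeat is None:
        return "pending"
    now = now or datetime.now(timezone.utc)
    return "offline" if now - last_heartbeat > HEARTBEAT_TIMEOUT else "online"
```

Passing `now` explicitly keeps the check deterministic and testable; a periodic sweep over all engines would fire the critical alert on each online-to-offline transition.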

Storage Target Management

Creating a Storage Target

Navigate to Backup > Storage and click “Add Storage Target”.

Required fields:

  • Name – Display name for the target
  • Target Type – One of: S3, S3 Compatible, SMB, NFS, iSCSI, Local
  • BES – Which backup engine owns this storage
  • Alert Threshold – Percentage at which a warning fires (default 85%)
  • Critical Threshold – Percentage at which policies auto-pause (default 95%)

Connection config (varies by type):

  • S3/S3 Compatible: endpoint URL, bucket, region, prefix
  • SMB: server, share name, path
  • NFS: server, export path, mount options
  • iSCSI: target portal, IQN, LUN
  • Local: filesystem path

Credentials (optional, stored encrypted in the PAM vault):

  • Username
  • Secret (access key, password, etc.)
  • Credential type

Credentialed storage targets require an organization-scoped BES. When a BES is account-scoped, the UI can still create credentialless targets and run storage verification, but credentialed target creation fails closed because the backing vault identities remain organization-owned.

The critical threshold must be higher than the alert threshold.

Editing a Storage Target

Click a target to edit name, thresholds, connection config. Credential rotation is supported – provide new credentials to replace existing ones.

Deleting a Storage Target

A storage target cannot be deleted if any backup policies reference it (FK RESTRICT). Remove or reassign all policies first.

Deletion also removes the linked PAM Identity credential.

Native Backup Policy Management

Creating a Policy

Navigate to Backup > Policies and click “Add Policy”.

Core settings:

  • Name – Policy display name
  • BES – Which backup engine handles this policy
  • Storage Target – Where to store backups (must be on the selected BES)
  • Backup Type – file (file-level) or image (full disk image)

Scheduling (cron expressions):

  • Full Schedule – When to run full backups (required, e.g., 0 2 * * 0 for Sunday 2 AM)
  • Incremental Schedule – When to run incrementals (optional, e.g., 0 2 * * 1-6 for Mon-Sat 2 AM)
  • Synthetic Full Schedule – When to synthesize a full from base + incrementals (optional)

Path selection:

  • Exclude Paths – Directories to skip
  • Exclude Patterns – Glob patterns to skip (e.g., *.tmp, *.log)
  • Exclusion Template – Pre-built exclusion set name

Retention (Grandfather-Father-Son):

  • Daily – Keep N most recent daily backups (1-365, default 7)
  • Weekly – Keep N most recent weekly backups (0-52, default 4)
  • Monthly – Keep N most recent monthly backups (0-120, default 12)

Data protection:

  • Compression – lz4 (fast), zstd (better ratio), or none
  • Encryption – AES-256-GCM (default), AES-256-CBC, or none
  • Deduplication – Content-addressed dedup (default enabled)

Performance:

  • Max Transfer MBps – Bandwidth cap (optional)
  • Max IO Priority – normal/low/high
  • Throttle Schedule – Cron for when throttling applies

Behavior:

  • VSS Enabled – Use VSS snapshots on Windows, LVM freeze on Linux (default true)
  • Catch Up On Reconnect – Run missed backups when the agent reconnects (default true)
  • Verify After Backup – Automatically verify backup integrity (default true)
  • Verify Sample Count – Number of files to sample for verification (1-10000, default 50)
  • Pre-Patch Backup – Run an incremental before OS patching (default false)
  • Archive On Host Deactivation – Archive backups when a host is decommissioned (default true)
  • Enabled – Policy active state
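The Grandfather-Father-Son retention counts can be sketched as a pruning calculation. This is a minimal illustration under one plausible interpretation — keep the newest backup of each of the last N days, ISO weeks, and calendar months — not the repo's actual retention module:

```python
from datetime import date

def gfs_keep(dates, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates retained under a GFS scheme:
    the newest backup of each of the last `daily` days, `weekly`
    ISO weeks, and `monthly` calendar months. A date survives if
    any tier retains it; everything else is eligible for pruning."""
    keep = set()
    for bucket, limit in (
        (lambda d: d.toordinal(), daily),         # one per day
        (lambda d: d.isocalendar()[:2], weekly),  # one per ISO (year, week)
        (lambda d: (d.year, d.month), monthly),   # one per calendar month
    ):
        seen = []
        for d in sorted(dates, reverse=True):     # newest first
            key = bucket(d)
            if key not in seen:
                seen.append(key)
                if len(seen) <= limit:
                    keep.add(d)
    return keep
```

With the defaults (7/4/12) a daily backup schedule retains roughly the last week day-by-day, a month week-by-week, and a year month-by-month.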

Editing a Policy

All fields above can be modified after creation except the BES assignment. If the storage target is changed, it must belong to the same BES.

Disabling/Enabling a Policy

Toggle the Enabled state. A disabled policy will not have scheduled backups dispatched.

Note: Policies may be auto-paused by the system when their storage target hits critical capacity. The paused_reason field shows “storage_critical” in this case. These policies auto-resume when storage returns to healthy.

Deleting a Policy

Deleting a policy cascades to host assignments, replication rules, and app hooks.

Host Assignment

Assigning Hosts

From a policy detail view, use the host assignment interface to add hosts. The host must:

  • Belong to the same account
  • Be accessible by the user’s RBAC org scope
  • Not already be assigned to this policy (unless archived)

If a host was previously assigned and then archived, reassigning it reactivates the existing assignment.

Removing Hosts

Remove a host assignment to stop backing up that host under this policy. This does not delete existing restore points.

Host Assignment States

| State | Meaning |
|-------|---------|
| active | Host is actively backed up by this policy |
| paused | Backups temporarily paused for this host |
| archived | Host decommissioned; backups preserved for retention |

Application-Consistent Hooks

Managing Hooks via the UI

When editing an existing backup policy, expand the App Hooks collapsible section (7th section in the policy form). From there you can:

  1. View existing hooks – Listed with hook type badge, template name, script reference, timeout, and fail-on-error indicator.
  2. Add a hook – Click “Add Hook” to open the inline form. Select hook type, optionally choose an app template or “Custom Script”, pick a script from the searchable dropdown, set timeout and sequence, and toggle fail-on-error.
  3. Edit a hook – Click an existing hook row to edit it inline.
  4. Delete a hook – Click the trash icon on a hook row (confirmation required).

Available Templates

Pre-built templates with script snippets:

| Template | Application | Script Snippet |
|----------|-------------|----------------|
| mysql | MySQL / MariaDB | mysqldump --single-transaction |
| mongodb | MongoDB | mongodump with replica set support |
| elasticsearch | Elasticsearch | Snapshot to registered repository |

Hook Execution Order

Hooks of the same type execute in sequence order (ascending), then by ID for ties.
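That ordering is a plain two-key sort; a minimal sketch (dicts stand in for hook records, and the function name is hypothetical):

```python
def execution_order(hooks):
    """Order hooks of one type: ascending sequence, ties broken by ID."""
    return sorted(hooks, key=lambda h: (h["sequence"], h["id"]))
```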

Replication Rules (3-2-1 Strategy)

Replication rules copy backup data from a policy’s primary storage target to secondary targets for redundancy.

The 3-2-1 Rule

  • 3 copies of data (primary + 2 replicas)
  • 2 different media types (e.g., disk + object storage + tape)
  • 1 offsite copy

Creating a Replica Rule

Navigate to Backup > Replicas or from a policy detail view:

  • Policy – Which backup policy to replicate
  • Storage Target – Secondary storage target for the copy (must belong to same account, can be on a different BES)
  • KEK Source – Key encryption key source: account (default) or pam_vault (external key management)
  • Replicate After Full – Copy after full backups (default true)
  • Replicate After Incremental – Copy after incremental backups (default true)
  • Is Offsite – Mark as offsite copy for compliance (default false)

How Replication Works

  1. After a backup completes, the replication module checks if any replica rules apply
  2. If so, it dispatches a replication job to the BES
  3. The BES copies chunks from the primary storage target to the replica target
  4. Replication status is tracked and reported back via ingest
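Step 1 of the flow above — deciding which rules apply to a completed run — reduces to checking each rule's per-trigger flags. A hedged sketch, with dicts standing in for replica rule records:

```python
def rules_to_dispatch(backup_type, rules):
    """Select replica rules that should fire after a completed backup.
    `backup_type` is 'full' or 'incremental'; each rule carries the
    Replicate After Full / Replicate After Incremental flags above."""
    flag = {
        "full": "replicate_after_full",
        "incremental": "replicate_after_incremental",
    }[backup_type]
    return [r for r in rules if r[flag]]
```

Each selected rule would then produce one replication job dispatched to the BES (step 2).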

Monitoring and Dashboard

Dashboard Overview

Navigate to Backup > Dashboard for an aggregated health view.

Summary metrics:

  • Protected Count – Total objects with backup coverage
  • At Risk Count – Objects where the last backup failed
  • RPO Breach Count – Objects exceeding their RPO threshold
  • Unprotected Host Count – Managed hosts with no backup coverage

Engine status:

  • BES Online – Engines reporting healthy heartbeats
  • BES Offline – Engines with stale heartbeats (> 3 minutes)

Recent activity (last 24 hours):

  • Success Count – Successful backup job runs
  • Failed Count – Failed backup job runs
  • Warning Count – Backup job runs with warnings

Exception queues (exception-first triage):

  1. Failed Backups (red) – Failed backup jobs with retry button. Shows policy name, status, failure time. Empty state: “No failed backups”.
  2. RPO Breaches (orange) – Protected objects exceeding their RPO threshold. Shows object name, host, last backup time, RPO hours, breach duration. Row click deep-links to Backup > Restore with the matching host and restore point context. Empty state: “All objects within RPO”.
  3. Storage Health Alerts (yellow) – Storage targets in warning or critical health. Shows target name, type, usage %, health status, last check time. Inline health check and integrity verify action buttons. Card click deep-links to Backup > Storage. Empty state: “All storage healthy”.
  4. Verification Failures (yellow) – Restore points where verification failed. Shows restore point ID, host, policy, backup type, completion time. Row click deep-links to Backup > Restore with the matching host and restore point context. Empty state: “All verifications passing”.

Each section loads independently with its own loading state and error handling. Action buttons are gated by backup.manage permission.

Route access note: the relevant workflow requires backup.view, and deep links to tabs such as restore or policies fail closed in the UI when the caller lacks the corresponding backup.restore or backup.manage permission.

Backup Job Management

Viewing Jobs

Navigate to Backup > Jobs for a unified view of all backup jobs, covering both third-party platform synced jobs and native BES jobs.

Filter options:

  • Platform or BES source
  • Job status (last run)
  • Job type (full, incremental, differential, snapshot, log)
  • Organization
  • Date range
  • Search by policy name or source job ID

Job Run History

Click a job to see its execution history. Each run shows:

  • Status (success, warning, failed, running, pending)
  • Start/end time and duration
  • Data transferred and protected size
  • Restore point ID (if applicable)
  • Error message (if failed)
  • Whether the run is retryable

Retrying a Failed Job

Click Retry on a failed job to dispatch a new run. Requirements:

  • The latest run must be in failed or warning state
  • The run must be marked as retryable (is_retryable=True)

For native BES jobs, retry dispatches agent jobs to all active hosts on the policy. For platform jobs, retry creates a pending run that the next sync picks up.

Automatic Retries

The system automatically retries failed jobs up to 3 times (limit configurable in auto_retry.py). Auto-retries:

  • Only trigger when the latest run is failed and retryable
  • Skip jobs that already have a pending or running run
  • Use the same build_backup_payload() as manual dispatches
  • Set is_retryable=False on retry runs to prevent cascading retries
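The eligibility rules above compose into a single guard. This is a sketch of the decision, not the actual auto_retry.py code; the constant and function names are illustrative:

```python
MAX_AUTO_RETRIES = 3  # illustrative; the real limit lives in auto_retry.py

def should_auto_retry(runs, attempts_so_far):
    """Apply the auto-retry rules: the latest run must be failed and
    retryable, no run may be pending or running, and the retry budget
    must not be exhausted. `runs` is ordered newest-first."""
    if attempts_so_far >= MAX_AUTO_RETRIES:
        return False
    if any(r["status"] in ("pending", "running") for r in runs):
        return False  # a dispatch is already in flight
    latest = runs[0] if runs else None
    return bool(latest and latest["status"] == "failed" and latest["is_retryable"])
```

Because retry runs are created with is_retryable=False, a failed retry never satisfies the guard again, which is what prevents cascading retries.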

Restore Operations

Browsing Restore Points

Navigate to Backup > Restore to view all restore points.

Filter options:

  • Host
  • Policy
  • BES
  • Chain ID
  • Status (complete, running, failed, expired)
  • Verification status (passed, failed, partial, pending)

Current implementation note (2026-04-17): restore-point listing, browse, search, file restore, image restore, and BMR now run through the current backend/agent/BES contract. The remaining local evidence gap is environment verification, not a known repo-backed runtime hole.

Browsing Files in a Restore Point

Select a restore point and click Browse to explore its file tree. This dispatches a browse job to the BES and returns results asynchronously (poll the job for results).

Provide a path to browse (default “/”).

Searching Files

Use the Search feature to find specific files within a restore point. Enter a search query (filename, pattern, etc.) and the system dispatches a search job to the BES.

File-Level Restore

From a restore point, click Restore Files:

  1. Select files – Choose files or directories to restore (up to 10,000 paths)
  2. Target path – Directory where files will be restored
  3. Overwrite – Whether to overwrite existing files at the target (default: no)
  4. Target host – Optionally restore to a different host (defaults to the original host)

The restore job runs with high priority. The agent:

  1. Retrieves the manifest from the BES
  2. Filters it for the requested paths
  3. Downloads chunks from the BES
  4. Reassembles files with original permissions
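Step 2 of the agent flow — filtering the manifest for the requested paths — might look like the following sketch, including the 10,000-path cap from the file-restore form. The helper name and manifest shape are hypothetical:

```python
def select_manifest_entries(manifest_paths, requested, limit=10_000):
    """Filter a restore-point manifest down to the requested files and
    directories. Selecting a directory selects everything beneath it."""
    if len(requested) > limit:
        raise ValueError(f"at most {limit} paths per restore request")

    def covered(path):
        # Exact file match, or any requested directory is a prefix of path.
        return any(
            path == req or path.startswith(req.rstrip("/") + "/")
            for req in requested
        )

    return [p for p in manifest_paths if covered(p)]
```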

Image-Level Restore

For image-type backups, use Restore Image to perform a full disk restore:

  1. Target host – The host to restore the image to (required)
  2. Target path – The mount or image path to restore into (required)
  3. Overwrite – Whether to overwrite existing image data at the target path
  4. The job runs with critical priority

Bare-Metal Restore (BMR)

For complete system recovery to new or wiped hardware:

  1. Select a restore point and click BMR
  2. Target method – pxe (network boot) or iso (boot media) for the target hardware

Requirements:

  • Restore point must be complete
  • BES must be online
  • For PXE: the BES must have pxe_enabled=True

The system generates a one-time BMR token (UUID, 4-hour expiry) for replay protection. The BMR job runs with critical priority.
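The one-time token semantics — random UUID, 4-hour expiry, single redemption — can be sketched as follows. This is an illustration of the replay-protection idea, not the backend's token code:

```python
import uuid
from datetime import datetime, timedelta, timezone

TOKEN_TTL = timedelta(hours=4)  # stated BMR token lifetime

def issue_bmr_token(now=None):
    """Mint a one-time BMR token: a random UUID with a 4-hour expiry."""
    now = now or datetime.now(timezone.utc)
    return {"token": str(uuid.uuid4()), "expires_at": now + TOKEN_TTL, "used": False}

def redeem_bmr_token(record, presented, now=None):
    """Valid exactly once, before expiry; redemption burns the token,
    so a replayed token is rejected."""
    now = now or datetime.now(timezone.utc)
    if record["used"] or now >= record["expires_at"] or record["token"] != presented:
        return False
    record["used"] = True
    return True
```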

Storage Health and Capacity

Health Check vs Integrity Verify

Two distinct operations for storage verification:

| Operation | What It Tests | Who Runs It |
|-----------|---------------|-------------|
| Health Check | Reachability probe plus BES capacity/health reporting | Any eligible online relay host in the target org when scoped, otherwise an online host in the same account |
| Integrity Verify | Catalog validation, chain integrity, restore rehearsal | BES (needs backup catalog) |

Both dispatch agent jobs and return results asynchronously.

Capacity Monitoring

Storage targets automatically report capacity data via the BES health ingest. The system tracks:

  • Total, used, and free bytes
  • Usage percentage
  • Growth rate (EWMA smoothing)
  • Estimated time-to-full projection
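The growth rate and time-to-full projection can be sketched as an exponentially weighted moving average over successive used-bytes samples. The smoothing factor and function names here are illustrative, not the repo's actual values:

```python
def ewma_growth(samples, alpha=0.3):
    """Smooth per-interval growth (bytes) with an exponentially weighted
    moving average. `samples` are successive used-bytes readings."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    rate = deltas[0]
    for d in deltas[1:]:
        rate = alpha * d + (1 - alpha) * rate
    return rate

def intervals_to_full(used, total, rate):
    """Project how many reporting intervals remain until the target fills.
    Returns None when usage is flat or shrinking (no meaningful estimate)."""
    if rate <= 0:
        return None
    return (total - used) / rate
```

Multiplying the result by the reporting interval yields an estimated full date; EWMA keeps the projection responsive to recent growth while damping single-sample spikes.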

Capacity Alerts

| Threshold | Status | Action |
|-----------|--------|--------|
| < Alert threshold (default 85%) | healthy | No action |
| >= Alert threshold | warning | Alert fired to notification channels |
| >= Critical threshold (default 95%) | critical | Alert fired + all policies targeting this storage auto-paused |

When storage returns to healthy from critical, auto-paused policies are automatically re-enabled. Manually disabled policies are not affected.
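The threshold ladder above, together with the rule that the critical threshold must exceed the alert threshold, can be sketched as a small classifier (function name illustrative):

```python
def storage_status(usage_pct, alert=85, critical=95):
    """Classify a storage target by usage percentage. The critical
    threshold must stay above the alert threshold, mirroring the
    storage-target form validation."""
    if critical <= alert:
        raise ValueError("critical threshold must exceed alert threshold")
    if usage_pct >= critical:
        return "critical"   # alert fired + targeting policies auto-pause
    if usage_pct >= alert:
        return "warning"    # alert fired to notification channels
    return "healthy"        # no action
```

A transition back to "healthy" is the trigger for auto-resuming policies that were paused with paused_reason "storage_critical".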

Capacity History

View historical capacity trends via Backup > Storage > Capacity History. Shows up to 365 days of time-series data (usage percent, total/used/free bytes) with the current snapshot including growth rate and estimated full date.

Protected Objects

What Are Protected Objects?

Protected objects represent anything covered by backup – VMs, physical servers, filesets, cloud resources, NAS shares. They are created automatically when:

  • A third-party platform sync discovers them
  • A native backup policy is assigned to a host

Viewing Protected Objects

Navigate to Backup > Protected Objects.

Filter options:

  • Organization
  • Object type (vm, physical, fileset, cloud_resource, nas)
  • RPO breach (true/false)
  • Platform or BES source
  • Host
  • Last backup status (success, failed, warning, or “never” for objects never backed up)
  • Search by object name or hostname

RPO Monitoring

Each protected object has an RPO (Recovery Point Objective) in hours. The system flags objects where the time since last successful backup exceeds the RPO as a “breach”.

RPO breach objects appear in:

  • The protected objects list (filter by rpo_breach=true)
  • The dashboard rpo_breach_count metric
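The breach test itself is a single comparison; a hedged sketch (treating never-backed-up objects as breaching is an assumption here, not a documented rule):

```python
from datetime import datetime, timedelta, timezone

def is_rpo_breach(last_success, rpo_hours, now=None):
    """An object breaches its RPO when the time since the last successful
    backup exceeds rpo_hours. Assumption: an object with no successful
    backup at all is counted as breaching."""
    if last_success is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - last_success > timedelta(hours=rpo_hours)
```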

Permissions Reference

| Permission | Grants Access To |
|------------|------------------|
| backup.view | View engines, platforms, jobs, protected objects, policies, replicas, storage targets, templates, dashboard |
| backup.manage | Create, update, delete engines/platforms/policies/replicas/storage/hooks. Trigger sync, run-now, test connectivity, health checks, retry jobs |
| backup.restore | View restore points, browse/search files, trigger file/image/BMR restores |