Cadres Manual | Remediation & Runbooks

Prerequisites

User role with alerts.view permission for viewing alerts and incidents
User role with alerts.manage permission for creating/editing runbooks and approving/cancelling executions
Alert rules must be configured and active for alerts to fire and trigger remediation
Hosts must be online and reporting metrics for alert evaluation to work
Scripts or workflows referenced by runbook actions must already exist in the system

Understanding runbooks

A runbook defines the automated response to a specific alert condition. When an alert fires and an investigation identifies a probable cause, the remediation engine searches for a matching runbook and creates an execution.

Automatic vs manual runbooks

Type	Behavior	When to use
Requires approval (default)	Execution is created with status `pending_approval`. An operator must manually approve before the action runs.	Production systems, destructive actions, high-risk changes.
Auto-approve	Execution is created with status `approved` and runs immediately. This applies when the `auto_approve_if_health_below` threshold is set and host health drops below that value.	Non-critical systems, well-tested remediation patterns.
Suggested (system-generated)	If no runbook matches an alert, the engine creates a suggested runbook with `is_suggested=True`. These require manual configuration before use.	New alert patterns without existing runbooks.
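
The approval behavior in the table above can be sketched as a small decision function. This is an illustrative sketch only, not the product's implementation: the function name `initial_status` and the `requires_approval` and `host_health` parameters are assumptions, while `auto_approve_if_health_below`, `pending_approval`, and `approved` come from the table.

```python
from typing import Optional


def initial_status(
    requires_approval: bool,
    auto_approve_if_health_below: Optional[float],
    host_health: float,
) -> str:
    """Return the status a newly created execution starts in (hypothetical helper)."""
    if not requires_approval:
        # No approval gate configured: the execution is queued straight away.
        return "approved"
    # Health-based bypass: applies only when a threshold is configured
    # and host health has dropped below it.
    if (
        auto_approve_if_health_below is not None
        and host_health < auto_approve_if_health_below
    ):
        return "approved"
    return "pending_approval"


print(initial_status(True, None, 40.0))
print(initial_status(True, 50.0, 40.0))
```

The first call has no health threshold, so the execution waits for an operator. The second call has a threshold of 50 and a host health of 40, so approval is skipped.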

System runbook library

The installation seeds 25 system runbooks that cover common operational scenarios. System runbooks are global (no organization scope) and do not require approval by default. They serve as a starting library that you can supplement with organization-specific runbooks.

#	Runbook	Category
1	High CPU Remediation	Performance
2	Disk Full Cleanup	Storage
3	Service Down Restart	Availability
4	Memory Pressure Remediation	Performance
5	NTP Drift Correction	Configuration
6	SSH Service Recovery	Availability
7	Zombie Process Cleanup	Performance
8	Log Rotation Force	Storage
9	Web Server Restart (nginx)	Availability
10	Swap Usage High	Performance
11	Failed Login Monitor	Security
12	Windows Update Reset	Patching
13	Certificate Expiry Warning	Security
14	DNS Resolution Failure	Network
15	Port Exhaustion Recovery	Network
16	Database Service Recovery	Availability
17	Network Interface Flapping	Network
18	Process Crash Loop Detection	Availability
19	Backup Failure Recovery	Operations
20	High I/O Wait Remediation	Performance
21	Stale Reboot Pending	Patching
22	Filesystem Read-Only Recovery	Storage
23	Agent Heartbeat Loss	Monitoring
24	Windows Service Recovery	Availability
25	Patch Failure Evidence Collection	Patching

Event-aware runbook matching

Runbooks can match on event-driven alerts in addition to metric-based alerts. Two additional trigger fields are available for event sources:

Trigger Event Source: Match on the alert's event source (e.g., process baseline, AD security, vulnerability).

Trigger Event Type: Match on the specific event type within a source (e.g., living-off-the-land detected, account lockout).

When an event alert fires, the remediation engine matches against both the event-specific trigger fields and the standard severity trigger. This allows runbooks to respond to AD security events, vulnerability discoveries, SNMP device failures, and other non-metric alert sources.
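
A minimal sketch of how event-aware matching might combine the event fields with the severity trigger. Everything here is an assumption made for illustration: the class and field names, the identifier strings such as `ad_security` and `account_lockout`, and the severity ordering (taken from the order in which this manual lists the levels).

```python
from dataclasses import dataclass
from typing import Optional

# Assumed ordering, lowest to highest, following the order used in this manual.
SEVERITY_ORDER = ["info", "low", "medium", "warning", "high", "critical"]


@dataclass
class EventTrigger:
    """Hypothetical container for a runbook's event trigger fields."""
    trigger_event_source: Optional[str] = None
    trigger_event_type: Optional[str] = None
    trigger_severity: Optional[str] = None  # minimum severity


def matches_event(trigger: EventTrigger, event_source: str,
                  event_type: str, severity: str) -> bool:
    """Return True when every configured (non-null) trigger field matches."""
    if (trigger.trigger_event_source is not None
            and trigger.trigger_event_source != event_source):
        return False
    if (trigger.trigger_event_type is not None
            and trigger.trigger_event_type != event_type):
        return False
    if trigger.trigger_severity is not None:
        # The severity trigger is a minimum: the alert must be at or above it.
        minimum = SEVERITY_ORDER.index(trigger.trigger_severity)
        if SEVERITY_ORDER.index(severity) < minimum:
            return False
    return True


lockout = EventTrigger("ad_security", "account_lockout", "high")
print(matches_event(lockout, "ad_security", "account_lockout", "critical"))
print(matches_event(lockout, "ad_security", "account_lockout", "medium"))
```

The first call matches because the source and type agree and `critical` is above the `high` minimum. The second fails on severity alone.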

Auto-approve by investigation confidence

Runbooks can set an auto-approve confidence threshold (0.0 to 1.0). When the causal investigation's confidence score meets or exceeds this threshold, the execution is auto-approved without manual intervention. This is separate from the health-based auto-approve.
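
The "meets or exceeds" rule can be expressed in a few lines. This is a sketch under assumptions: the function and parameter names are hypothetical, and treating an unset threshold as "never auto-approve by confidence" is an assumption rather than documented behavior.

```python
from typing import Optional


def auto_approved_by_confidence(threshold: Optional[float],
                                confidence: float) -> bool:
    """Return True when the investigation confidence meets or exceeds the threshold."""
    if threshold is None:
        # Assumption: no threshold configured means no confidence-based bypass.
        return False
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return confidence >= threshold


print(auto_approved_by_confidence(0.8, 0.8))   # True: meets the threshold
print(auto_approved_by_confidence(0.8, 0.79))  # False: just below it
```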

Creating runbook definitions

Navigate to Remediation > Runbooks.
Click Create Runbook.
Configure trigger conditions -- the engine matches ALL non-null conditions:

Trigger Metric Type: Match on the alert's metric type (e.g., CPU usage percentage).

Trigger Cause Pattern: Pattern matched against the investigation's probable cause text.

Trigger Severity: Minimum severity threshold: info, low, medium, warning, high, or critical.

Trigger Heuristic Rule: Match on the investigation's heuristic rule name.
Configure the action:

Action Type: One of Script, Workflow, Service Restart, or Command.

Action Configuration: Action-specific parameters (script, workflow, service name, command text, etc.).
Configure safety constraints:

Requires Approval: Whether manual approval is needed before execution (default: Yes).

Max Executions Per Day: Maximum times this runbook can execute in 24 hours (default: 5).

Cooldown (minutes): Minimum time between executions of this runbook (default: 30).

Auto-Approve Threshold: Optional host health percentage below which approval is skipped.
Configure validation (post-execution health check):

Validation Metric: Metric to check after the action completes.

Validation Operator: Comparison operator (less than, greater than, equals, etc.).

Validation Threshold: Target value the metric must satisfy.

Stabilization Window: Minutes to wait before validating (default: 5).
Optionally configure a rollback action (same structure as the primary action). Rollback executes automatically if validation fails.
Click Save.
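
To make the validation and rollback settings concrete, the post-execution check can be sketched as a comparison between the observed metric and the configured threshold. This is an illustrative sketch, not the engine's code: the operator keys (`lt`, `gt`, and so on), the function names, and the lowercase status strings are assumptions.

```python
import operator
from typing import Callable, Dict, List

# Hypothetical operator keys mapped to Python's comparison functions.
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
    "eq": operator.eq,
}


def validation_passed(observed: float, op_name: str, threshold: float) -> bool:
    """Compare the metric observed after the stabilization window to the threshold."""
    return OPERATORS[op_name](observed, threshold)


def states_after_validation(observed: float, op_name: str, threshold: float,
                            has_rollback: bool) -> List[str]:
    """Return the states an execution passes through once validation runs."""
    if validation_passed(observed, op_name, threshold):
        return ["validated"]
    # A failed validation marks the execution failed; a configured
    # rollback action then runs automatically.
    return ["failed", "rolling_back"] if has_rollback else ["failed"]


# Example: the runbook expects CPU usage to drop below 80 percent.
print(states_after_validation(72.0, "lt", 80.0, has_rollback=True))
print(states_after_validation(91.0, "lt", 80.0, has_rollback=True))
```

In the second call the metric has not recovered, so the sketch reports a failure followed by the automatic rollback.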

Linking runbooks to alert rules

Runbooks are matched to alerts automatically based on trigger conditions -- you do not manually link a runbook to a specific alert rule. The matching occurs at runtime during the alert evaluation cycle:

An alert fires for a host based on an alert rule's threshold and duration.
The causal engine runs an investigation, producing a probable cause, confidence score, and heuristic rule.
The remediation engine queries all active runbooks and matches trigger conditions against the alert's metric type, investigation cause pattern, severity, and heuristic rule.
If a match is found and safety constraints (daily limit, cooldown) pass, a remediation execution is created.
If no match is found, the engine may create a suggested runbook for the operator to review and configure.
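
The safety-constraint step in the sequence above can be sketched as follows. The details are assumptions: the function name is hypothetical, the daily limit is treated as a rolling 24-hour window, and the cooldown is measured from the most recent execution. The defaults (5 per day, 30 minutes) are the ones documented for runbooks.

```python
from datetime import datetime, timedelta
from typing import List


def safety_constraints_pass(previous_runs: List[datetime], now: datetime,
                            max_executions_per_day: int = 5,
                            cooldown_minutes: int = 30) -> bool:
    """Return True when a new execution of this runbook may be created."""
    # Daily limit: count executions in the last 24 hours (assumed rolling window).
    day_ago = now - timedelta(hours=24)
    recent = [t for t in previous_runs if t > day_ago]
    if len(recent) >= max_executions_per_day:
        return False
    # Cooldown: enough time must have passed since the latest execution.
    if previous_runs:
        if now - max(previous_runs) < timedelta(minutes=cooldown_minutes):
            return False
    return True


now = datetime(2024, 1, 1, 12, 0)
print(safety_constraints_pass([now - timedelta(minutes=10)], now))  # False
print(safety_constraints_pass([now - timedelta(minutes=45)], now))  # True
```

The first call is blocked by the cooldown; the second is outside the 30-minute window and under the daily limit.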

Direct triggers on alert rules. Alert rules also support direct workflow and script triggers without the runbook engine. Runbooks add approval gates, validation, and rollback on top.

Remediation execution

Triggered execution (automatic)

When the remediation engine matches a runbook to an alert, the execution proceeds through the lifecycle automatically. If the runbook requires approval, it pauses at "Pending Approval" until an operator approves.

Manual approval workflow

Navigate to Remediation > Executions.
Find executions with status "Pending Approval".
Review the runbook, alert, and investigation details.
Click Approve to allow execution, or Cancel to abort.
Monitor execution progress through the state transitions on the execution detail page.

Remediation lifecycle

Each remediation execution moves through a defined state machine:

Status	Description	Next states
Pending Approval	Waiting for operator approval. Created when the runbook requires approval.	Approved, Cancelled
Approved	Approved and queued for execution. Auto-set when approval is not required.	Executing
Executing	Action job is running on the target host.	Stabilizing, Failed
Stabilizing	Waiting for the stabilization window to elapse before validation.	Validated, Failed
Validated	Post-execution validation check passed.	Completed
Completed	Remediation finished successfully. Terminal state.	—
Failed	Action or validation failed. If rollback is configured, it runs automatically.	Rolling Back, or terminal
Rolling Back	Rollback action is executing.	Rolled Back, Failed
Rolled Back	Rollback completed. Terminal state.	—
Cancelled	Operator cancelled the execution. Terminal state.	—

Validation failure triggers automatic rollback. If validation fails and the runbook has a rollback action configured, rollback runs without further approval. Monitor the Executions page to catch rollback failures.
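
The lifecycle table can be read as a transition map. The sketch below encodes it directly; the lowercase status strings other than `pending_approval` and `approved`, and the helper names, are assumptions made for illustration.

```python
from typing import Dict, Set

# Allowed next states, copied from the lifecycle table.
TRANSITIONS: Dict[str, Set[str]] = {
    "pending_approval": {"approved", "cancelled"},
    "approved": {"executing"},
    "executing": {"stabilizing", "failed"},
    "stabilizing": {"validated", "failed"},
    "validated": {"completed"},
    "completed": set(),            # terminal
    "failed": {"rolling_back"},    # terminal when no rollback is configured
    "rolling_back": {"rolled_back", "failed"},
    "rolled_back": set(),          # terminal
    "cancelled": set(),            # terminal
}


def can_transition(current: str, new: str) -> bool:
    """Return True when the table allows moving from current to new."""
    return new in TRANSITIONS[current]


def advance(current: str, new: str) -> str:
    """Return the new status, or raise if the move is not in the table."""
    if not can_transition(current, new):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


status = "pending_approval"
for step in ["approved", "executing", "stabilizing", "validated", "completed"]:
    status = advance(status, step)
print(status)  # completed
```

Encoding the table as data keeps the happy path and the failure paths in one place, so an illegal move such as `completed` to `executing` is rejected by the same check.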

Remediation history and reporting

Navigate to Remediation > Executions to browse all executions with filters for status, host, and runbook.
Click an execution to see full details: action job results, validation results, rollback results, error messages, and timestamps for each state transition.
Navigate to Remediation > Dashboard for aggregate statistics.

Total Executions: Total remediation executions in the selected period.

Success Rate: Percentage of completed executions with outcome "fixed".

Pending Count: Number of executions currently awaiting approval.

Active Runbooks: Count of active runbook definitions.
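
As a rough illustration of how these figures relate, the sketch below derives them from a list of execution records. The record shape (`status` and `outcome` keys), the function name, and the reading of Success Rate as "fixed" outcomes divided by completed executions are all assumptions.

```python
from typing import Dict, Iterable, Optional


def dashboard_stats(executions: Iterable[Dict[str, Optional[str]]],
                    active_runbooks: int) -> Dict[str, float]:
    """Compute dashboard-style aggregates from execution records (hypothetical shape)."""
    execs = list(executions)
    completed = [e for e in execs if e["status"] == "completed"]
    fixed = [e for e in completed if e.get("outcome") == "fixed"]
    pending = [e for e in execs if e["status"] == "pending_approval"]
    # Avoid dividing by zero when nothing has completed in the period.
    success_rate = 100.0 * len(fixed) / len(completed) if completed else 0.0
    return {
        "total_executions": len(execs),
        "success_rate": success_rate,
        "pending_count": len(pending),
        "active_runbooks": active_runbooks,
    }


sample = [
    {"status": "completed", "outcome": "fixed"},
    {"status": "completed", "outcome": "not_fixed"},
    {"status": "pending_approval", "outcome": None},
    {"status": "failed", "outcome": None},
]
print(dashboard_stats(sample, active_runbooks=3))
```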

Causal analysis and root cause chain

When an alert fires, the causal analysis engine runs an automated investigation to identify the probable root cause. This investigation drives both the remediation matching and operator decision-making.

Investigation output

Probable Cause: Text description of the likely root cause (e.g., "Disk I/O contention from backup process").

Confidence: Score from 0.0 to 1.0 indicating certainty of the heuristic analysis.

Heuristic Rule: Name of the heuristic that identified the cause (used for runbook matching).

Incident timeline

For a unified view of the full causal chain from alert to resolution:

Navigate to an alert incident and click the Timeline tab.
The timeline shows alerts, investigations, and remediations in chronological order.
This provides the complete chain: what triggered, what was investigated, what was attempted, and the final outcome.

Investigation failures are non-blocking. If the causal engine fails to produce an investigation, the alert still fires and notifications still send. The failure is logged but does not prevent alert creation or notification delivery.

Permissions reference

Action	Permission
View/manage runbooks	alerts.manage
Approve/cancel remediation executions	alerts.manage
View remediation stats	alerts.manage
View alerts and incidents	alerts.view
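
The permission table can be expressed as a lookup, which is handy when scripting access checks. The action keys and the `is_allowed` helper are hypothetical; only the permission names `alerts.manage` and `alerts.view` come from the table. The sketch also assumes that `alerts.manage` does not implicitly grant `alerts.view`, which this manual does not state either way.

```python
from typing import Dict, Set

# Hypothetical action keys mapped to the permissions listed in the table.
REQUIRED_PERMISSION: Dict[str, str] = {
    "view_runbooks": "alerts.manage",
    "manage_runbooks": "alerts.manage",
    "approve_execution": "alerts.manage",
    "cancel_execution": "alerts.manage",
    "view_remediation_stats": "alerts.manage",
    "view_alerts": "alerts.view",
    "view_incidents": "alerts.view",
}


def is_allowed(action: str, granted: Set[str]) -> bool:
    """Return True when the role's granted permissions cover the action."""
    return REQUIRED_PERMISSION[action] in granted


viewer = {"alerts.view"}
print(is_allowed("view_alerts", viewer))        # True
print(is_allowed("approve_execution", viewer))  # False
```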

Navigation reference

Feature	Location
Runbooks	Remediation > Runbooks -- create, edit, delete runbook definitions
Executions	Remediation > Executions -- view, approve, cancel remediation executions
Dashboard	Remediation > Dashboard -- aggregate statistics and success rates

Troubleshooting

Symptom	Cause	Fix
Runbook not triggering on alert	Trigger conditions do not match the alert and investigation	Verify the trigger metric type, cause pattern, severity, and heuristic rule match the actual alert and investigation output.
Execution stuck in "Pending Approval"	Approval is required and no one has approved	Manually approve via the Executions page, or set an auto-approve health threshold or confidence threshold on the runbook.
Remediation failed after action completed	Post-execution validation check failed	Check the validation metric, operator, and threshold. The metric may not have recovered within the stabilization window.
Rollback itself failed	Rollback action encountered an error on the host	Check execution detail for error messages. Manual intervention required.
Executions blocked by daily limit	Maximum executions per day reached	Increase the limit if appropriate, or investigate why the alert recurs so frequently.
Execution blocked by cooldown	Cooldown period not yet elapsed	Wait for cooldown to expire, or reduce it if the runbook is safe for rapid re-execution.
Suggested runbook appeared	No existing runbook matched the alert	Review the suggested runbook, configure an action, and activate it.

Close the loop from alert detection to automated fix
