Prerequisites
- User role with alerts.view permission for viewing alerts and incidents
- User role with alerts.manage permission for creating/editing runbooks and approving/cancelling executions
- Alert rules must be configured and active for alerts to fire and trigger remediation
- Hosts must be online and reporting metrics for alert evaluation to work
- Scripts or workflows referenced by runbook actions must already exist in the system
Understanding runbooks
A runbook defines the automated response to a specific alert condition. When an alert fires and an investigation identifies a probable cause, the remediation engine searches for a matching runbook and creates an execution.
Automatic vs manual runbooks
| Type | Behavior | When to use |
|---|---|---|
| Requires approval (default) | Execution is created with status pending_approval. An operator must manually approve before the action runs. | Production systems, destructive actions, high-risk changes. |
| Auto-approve | Execution is created with status approved and runs immediately. Applies when the auto_approve_if_health_below threshold is set and host health drops below that value. | Non-critical systems, well-tested remediation patterns. |
| Suggested (system-generated) | If no runbook matches an alert, the engine creates a suggested runbook with is_suggested=True. These require manual configuration before use. | New alert patterns without existing runbooks. |
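The approval behavior in the table above can be sketched as a small decision function. This is a minimal illustration, not the product's implementation: the function name and the `host_health` parameter are hypothetical, and only the status strings `pending_approval` and `approved` and the `auto_approve_if_health_below` setting come from this page.

```python
from typing import Optional


def initial_status(requires_approval: bool,
                   auto_approve_if_health_below: Optional[float],
                   host_health: float) -> str:
    """Return the status a new execution would start in (illustrative only)."""
    # Health-based auto-approve applies only when a threshold is configured
    # and the host's health is strictly below it.
    if (auto_approve_if_health_below is not None
            and host_health < auto_approve_if_health_below):
        return "approved"
    # Otherwise the runbook's approval setting decides.
    return "pending_approval" if requires_approval else "approved"
```

For example, `initial_status(True, 50.0, 40.0)` returns `"approved"` because the host health of 40 is below the threshold of 50, while the same call with no threshold configured returns `"pending_approval"`.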
System runbook library
The installation seeds 25 system runbooks that cover common operational scenarios. System runbooks are global (no organization scope) and, unlike runbooks you create, do not require approval by default. They serve as a starting library that you can supplement with organization-specific runbooks.
| # | Runbook | Category |
|---|---|---|
| 1 | High CPU Remediation | Performance |
| 2 | Disk Full Cleanup | Storage |
| 3 | Service Down Restart | Availability |
| 4 | Memory Pressure Remediation | Performance |
| 5 | NTP Drift Correction | Configuration |
| 6 | SSH Service Recovery | Availability |
| 7 | Zombie Process Cleanup | Performance |
| 8 | Log Rotation Force | Storage |
| 9 | Web Server Restart (nginx) | Availability |
| 10 | Swap Usage High | Performance |
| 11 | Failed Login Monitor | Security |
| 12 | Windows Update Reset | Patching |
| 13 | Certificate Expiry Warning | Security |
| 14 | DNS Resolution Failure | Network |
| 15 | Port Exhaustion Recovery | Network |
| 16 | Database Service Recovery | Availability |
| 17 | Network Interface Flapping | Network |
| 18 | Process Crash Loop Detection | Availability |
| 19 | Backup Failure Recovery | Operations |
| 20 | High I/O Wait Remediation | Performance |
| 21 | Stale Reboot Pending | Patching |
| 22 | Filesystem Read-Only Recovery | Storage |
| 23 | Agent Heartbeat Loss | Monitoring |
| 24 | Windows Service Recovery | Availability |
| 25 | Patch Failure Evidence Collection | Patching |
Event-aware runbook matching
Runbooks can match on event-driven alerts in addition to metric-based alerts. Two additional trigger fields are available for event sources.
When an event alert fires, the remediation engine matches against both the event-specific trigger fields and the standard severity trigger. This allows runbooks to respond to AD security events, vulnerability discoveries, SNMP device failures, and other non-metric alert sources.
Auto-approve by investigation confidence
Runbooks can set an auto-approve confidence threshold (0.0 to 1.0). When the causal investigation's confidence score meets or exceeds this threshold, the execution is auto-approved without manual intervention. This is separate from the health-based auto-approve.
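The confidence rule can be pictured as a single comparison. The sketch below is an assumption about how such a check might look; the function and parameter names are hypothetical and not taken from the product.

```python
from typing import Optional


def auto_approved_by_confidence(confidence: float,
                                threshold: Optional[float]) -> bool:
    """Return True when the investigation confidence meets the threshold."""
    # No threshold configured: confidence never auto-approves.
    if threshold is None:
        return False
    # The page describes the threshold as a value from 0.0 to 1.0.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    # "Meets or exceeds" means an inclusive comparison.
    return confidence >= threshold
```

Because the comparison is inclusive, a confidence of exactly 0.8 against a threshold of 0.8 auto-approves.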
Creating runbook definitions
- Navigate to Remediation > Runbooks.
- Click Create Runbook.
- Configure trigger conditions -- the engine matches ALL non-null conditions:
  - Trigger Metric Type: Match on the alert's metric type (e.g., CPU usage percentage).
  - Trigger Cause Pattern: Pattern matched against the investigation's probable cause text.
  - Trigger Severity: Minimum severity threshold: info, low, medium, warning, high, or critical.
  - Trigger Heuristic Rule: Match on the investigation's heuristic rule name.
- Configure the action:
  - Action Type: One of Script, Workflow, Service Restart, or Command.
  - Action Configuration: Action-specific parameters (script, workflow, service name, command text, etc.).
- Configure safety constraints:
  - Requires Approval: Whether manual approval is needed before execution (default: Yes).
  - Max Executions Per Day: Maximum times this runbook can execute in 24 hours (default: 5).
  - Cooldown (minutes): Minimum time between executions of this runbook (default: 30).
  - Auto-Approve Threshold: Optional host health percentage below which approval is skipped.
- Configure validation (post-execution health check):
  - Validation Metric: Metric to check after the action completes.
  - Validation Operator: Comparison operator (less than, greater than, equals, etc.).
  - Validation Threshold: Target value the metric must satisfy.
  - Stabilization Window: Minutes to wait before validating (default: 5).
- Optionally configure a rollback action (same structure as the primary action). Rollback executes automatically if validation fails.
- Click Save.
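The validation and rollback steps above amount to comparing one metric against a threshold after the stabilization window, then rolling back on failure. The following is a minimal sketch under stated assumptions: the operator keys (`lt`, `gt`, `eq`, and so on), the function names, and the snake_case state names are hypothetical and may not match the values the product stores.

```python
import operator
from typing import Callable, Dict, List

# Hypothetical operator keys mapped to Python comparison functions.
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
    "eq": operator.eq,
}


def validation_passes(metric_value: float, op: str, threshold: float) -> bool:
    """Check the post-execution metric against the configured threshold."""
    return OPERATORS[op](metric_value, threshold)


def states_after_stabilization(metric_value: float, op: str,
                               threshold: float,
                               has_rollback: bool) -> List[str]:
    """Return the states an execution would pass through after validation."""
    if validation_passes(metric_value, op, threshold):
        return ["validated", "completed"]
    # Validation failed: rollback runs only if one is configured.
    return ["failed", "rolling_back"] if has_rollback else ["failed"]
```

For a runbook that validates "CPU less than 80", a reading of 95 after the stabilization window fails validation and, with a rollback configured, leads to the rollback action.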
Linking runbooks to alert rules
Runbooks are matched to alerts automatically based on trigger conditions -- you do not manually link a runbook to a specific alert rule. The matching occurs at runtime during the alert evaluation cycle:
- An alert fires for a host based on an alert rule's threshold and duration.
- The causal engine runs an investigation, producing a probable cause, confidence score, and heuristic rule.
- The remediation engine queries all active runbooks and matches trigger conditions against the alert's metric type, investigation cause pattern, severity, and heuristic rule.
- If a match is found and safety constraints (daily limit, cooldown) pass, a remediation execution is created.
- If no match is found, the engine may create a suggested runbook for the operator to review and configure.
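The matching step can be sketched as "every configured trigger must agree, and unset triggers are ignored". This is an illustration only. The class and field names are hypothetical, the cause pattern is assumed to be a regular expression, and the severity ranking is assumed to follow the order in which the severities are listed on this page.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed ranking, lowest to highest, following the order listed in this guide.
SEVERITY_ORDER = ["info", "low", "medium", "warning", "high", "critical"]


@dataclass
class Runbook:
    name: str
    trigger_metric_type: Optional[str] = None
    trigger_cause_pattern: Optional[str] = None
    trigger_severity: Optional[str] = None
    trigger_heuristic_rule: Optional[str] = None


@dataclass
class AlertContext:
    metric_type: str
    probable_cause: str
    severity: str
    heuristic_rule: str


def matches(rb: Runbook, ctx: AlertContext) -> bool:
    """Return True when every non-null trigger condition matches."""
    if rb.trigger_metric_type is not None and rb.trigger_metric_type != ctx.metric_type:
        return False
    # Assumption: the cause pattern is treated as a case-insensitive regex.
    if rb.trigger_cause_pattern is not None and not re.search(
            rb.trigger_cause_pattern, ctx.probable_cause, re.IGNORECASE):
        return False
    # Severity is a minimum: the alert must rank at or above the trigger.
    if rb.trigger_severity is not None and (
            SEVERITY_ORDER.index(ctx.severity)
            < SEVERITY_ORDER.index(rb.trigger_severity)):
        return False
    if rb.trigger_heuristic_rule is not None and rb.trigger_heuristic_rule != ctx.heuristic_rule:
        return False
    return True


def find_runbook(runbooks: List[Runbook], ctx: AlertContext) -> Optional[Runbook]:
    """Return the first matching runbook, or None if nothing matches."""
    return next((rb for rb in runbooks if matches(rb, ctx)), None)
```

A `None` result corresponds to the last step above, where the engine may create a suggested runbook instead. How the real engine chooses among several matching runbooks is not described here; the sketch simply takes the first.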
Remediation execution
Triggered execution (automatic)
When the remediation engine matches a runbook to an alert, the execution proceeds through the lifecycle automatically. If the runbook requires approval, it pauses at "Pending Approval" until an operator approves.
Manual approval workflow
- Navigate to Remediation > Executions.
- Find executions with status "Pending Approval".
- Review the runbook, alert, and investigation details.
- Click Approve to allow execution, or Cancel to abort.
- Monitor execution progress through the state transitions on the execution detail page.
Remediation lifecycle
Each remediation execution moves through a defined state machine:
| Status | Description | Next states |
|---|---|---|
| Pending Approval | Waiting for operator approval. Created when the runbook requires approval. | Approved, Cancelled |
| Approved | Approved and queued for execution. Auto-set when approval is not required. | Executing |
| Executing | Action job is running on the target host. | Stabilizing, Failed |
| Stabilizing | Waiting for the stabilization window to elapse before validation. | Validated, Failed |
| Validated | Post-execution validation check passed. | Completed |
| Completed | Remediation finished successfully. Terminal state. | — |
| Failed | Action or validation failed. If rollback is configured, it runs automatically. | Rolling Back (if a rollback is configured); otherwise terminal |
| Rolling Back | Rollback action is executing. | Rolled Back, Failed |
| Rolled Back | Rollback completed. Terminal state. | — |
| Cancelled | Operator cancelled the execution. Terminal state. | — |
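The state table above can be expressed as a transition map. This is a sketch for reasoning about the lifecycle, not the product's code: apart from `pending_approval` and `approved`, the snake_case status names are assumed spellings of the labels in the table, and the function names are hypothetical.

```python
from typing import Dict, Set

# Allowed next states for each status, transcribed from the lifecycle table.
TRANSITIONS: Dict[str, Set[str]] = {
    "pending_approval": {"approved", "cancelled"},
    "approved": {"executing"},
    "executing": {"stabilizing", "failed"},
    "stabilizing": {"validated", "failed"},
    "validated": {"completed"},
    "completed": set(),
    "failed": {"rolling_back"},
    "rolling_back": {"rolled_back", "failed"},
    "rolled_back": set(),
    "cancelled": set(),
}


def advance(current: str, new: str) -> str:
    """Move to a new status, rejecting transitions the table does not list."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


def is_terminal(status: str, has_rollback: bool = False) -> bool:
    """Failed is terminal only when no rollback is configured."""
    if status == "failed":
        return not has_rollback
    return not TRANSITIONS[status]
```

Encoding the table this way makes it easy to spot impossible histories, for example an execution that appears to go from Completed back to Executing.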
Remediation history and reporting
- Navigate to Remediation > Executions to browse all executions with filters for status, host, and runbook.
- Click an execution to see full details: action job results, validation results, rollback results, error messages, and timestamps for each state transition.
- Navigate to Remediation > Dashboard for aggregate statistics.
Causal analysis and root cause chain
When an alert fires, the causal analysis engine runs an automated investigation to identify the probable root cause. This investigation drives both the remediation matching and operator decision-making.
Investigation output
Incident timeline
For a unified view of the full causal chain from alert to resolution:
- Navigate to an alert incident and click the Timeline tab.
- The timeline shows alerts, investigations, and remediations in chronological order.
- This provides the complete chain: what triggered, what was investigated, what was attempted, and the final outcome.
Permissions reference
| Action | Permission |
|---|---|
| View/manage runbooks | alerts.manage |
| Approve/cancel remediation executions | alerts.manage |
| View remediation stats | alerts.manage |
| View alerts and incidents | alerts.view |
Navigation reference
| Feature | Location |
|---|---|
| Runbooks | Remediation > Runbooks -- create, edit, delete runbook definitions |
| Executions | Remediation > Executions -- view, approve, cancel remediation executions |
| Dashboard | Remediation > Dashboard -- aggregate statistics and success rates |
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Runbook not triggering on alert | Trigger conditions do not match the alert and investigation | Verify the trigger metric type, cause pattern, severity, and heuristic rule match the actual alert and investigation output. |
| Execution stuck in "Pending Approval" | Approval is required and no one has approved | Manually approve via the Executions page, or set an auto-approve health or confidence threshold on the runbook. |
| Remediation failed after action completed | Post-execution validation check failed | Check the validation metric, operator, and threshold. The metric may not have recovered within the stabilization window. |
| Rollback itself failed | Rollback action encountered an error on the host | Check execution detail for error messages. Manual intervention required. |
| Executions blocked by daily limit | Maximum executions per day reached | Increase the limit if appropriate, or investigate why the alert recurs so frequently. |
| Execution blocked by cooldown | Cooldown period not yet elapsed | Wait for cooldown to expire, or reduce it if the runbook is safe for rapid re-execution. |
| Suggested runbook appeared | No existing runbook matched the alert | Review the suggested runbook, configure an action, and activate it. |
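The daily-limit and cooldown rows in the table above can be reasoned about with a small gate function. This is a minimal sketch under assumptions: the function name and return values are hypothetical, the defaults mirror the documented ones (5 per day, 30 minutes), and "per day" is assumed to mean a rolling 24-hour window rather than a calendar day.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def block_reason(past_runs: List[datetime], now: datetime,
                 max_per_day: int = 5,
                 cooldown_minutes: int = 30) -> Optional[str]:
    """Return why a new execution would be blocked, or None if allowed."""
    # Count executions inside the assumed rolling 24-hour window.
    day_ago = now - timedelta(hours=24)
    recent = [t for t in past_runs if t > day_ago]
    if len(recent) >= max_per_day:
        return "daily_limit"
    # Cooldown is measured from the most recent execution.
    if past_runs and now - max(past_runs) < timedelta(minutes=cooldown_minutes):
        return "cooldown"
    return None
```

With the defaults, a runbook that last ran 10 minutes ago is blocked by cooldown, and one that has already run five times in the past 24 hours is blocked by the daily limit regardless of cooldown.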