Remediation & Runbooks

Close the loop from alert detection to automated fix

Define runbooks with trigger conditions, safety constraints, and validation criteria. The remediation engine matches alerts to runbooks, executes fixes with approval gates, and validates the outcome before closing.

Technical Manual
Status: Available

Prerequisites

  • User role with alerts.view permission for viewing alerts and incidents
  • User role with alerts.manage permission for creating/editing runbooks and approving/cancelling executions
  • Alert rules must be configured and active for alerts to fire and trigger remediation
  • Hosts must be online and reporting metrics for alert evaluation to work
  • Scripts or workflows referenced by runbook actions must already exist in the system

Understanding runbooks

A runbook defines the automated response to a specific alert condition. When an alert fires and an investigation identifies a probable cause, the remediation engine searches for a matching runbook and, if one matches and its safety constraints pass, creates an execution.

Automatic vs manual runbooks

Type | Behavior | When to use
Requires approval (default) | Execution is created with status pending_approval. An operator must manually approve before the action runs. | Production systems, destructive actions, high-risk changes.
Auto-approve | Execution is created with status approved and runs immediately. Applies when the auto_approve_if_health_below threshold is set and host health is below that value. | Non-critical systems, well-tested remediation patterns.
Suggested (system-generated) | If no runbook matches an alert, the engine creates a suggested runbook with is_suggested=True. Suggested runbooks require manual configuration before use. | New alert patterns without existing runbooks.

System runbook library

The installation seeds 25 system runbooks that cover common operational scenarios. System runbooks are global (no organization scope) and, unlike the default for runbooks you create, do not require approval by default. They serve as a starting library that you can supplement with organization-specific runbooks.

# | Runbook | Category
1 | High CPU Remediation | Performance
2 | Disk Full Cleanup | Storage
3 | Service Down Restart | Availability
4 | Memory Pressure Remediation | Performance
5 | NTP Drift Correction | Configuration
6 | SSH Service Recovery | Availability
7 | Zombie Process Cleanup | Performance
8 | Log Rotation Force | Storage
9 | Web Server Restart (nginx) | Availability
10 | Swap Usage High | Performance
11 | Failed Login Monitor | Security
12 | Windows Update Reset | Patching
13 | Certificate Expiry Warning | Security
14 | DNS Resolution Failure | Network
15 | Port Exhaustion Recovery | Network
16 | Database Service Recovery | Availability
17 | Network Interface Flapping | Network
18 | Process Crash Loop Detection | Availability
19 | Backup Failure Recovery | Operations
20 | High I/O Wait Remediation | Performance
21 | Stale Reboot Pending | Patching
22 | Filesystem Read-Only Recovery | Storage
23 | Agent Heartbeat Loss | Monitoring
24 | Windows Service Recovery | Availability
25 | Patch Failure Evidence Collection | Patching

Event-aware runbook matching

Runbooks can match on event-driven alerts in addition to metric-based alerts. Two additional trigger fields are available for event sources:

Trigger Event Source: Match on the alert's event source (e.g., process baseline, AD security, vulnerability).
Trigger Event Type: Match on the specific event type within a source (e.g., living-off-the-land detected, account lockout).

When an event alert fires, the remediation engine matches against both the event-specific trigger fields and the standard severity trigger. This allows runbooks to respond to AD security events, vulnerability discoveries, SNMP device failures, and other non-metric alert sources.
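As a rough illustration of the event-aware matching described above, the sketch below checks an event alert against the two event trigger fields and a minimum severity. The class name, field names, and identifier strings such as "ad_security" are hypothetical, and the severity ordering is assumed to follow the order in which this manual lists the levels. It is not the engine's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed ordering, lowest to highest, taken from the order the manual lists them.
SEVERITY_ORDER = ["info", "low", "medium", "warning", "high", "critical"]


@dataclass
class EventTrigger:
    # Hypothetical field names mirroring the trigger fields in this section.
    event_source: Optional[str] = None
    event_type: Optional[str] = None
    min_severity: Optional[str] = None


def event_matches(trigger: EventTrigger, source: str, event_type: str, severity: str) -> bool:
    """Return True when every trigger field that is set matches the alert."""
    if trigger.event_source is not None and trigger.event_source != source:
        return False
    if trigger.event_type is not None and trigger.event_type != event_type:
        return False
    if trigger.min_severity is not None:
        # The alert must be at least as severe as the configured minimum.
        if SEVERITY_ORDER.index(severity) < SEVERITY_ORDER.index(trigger.min_severity):
            return False
    return True


lockout = EventTrigger(event_source="ad_security", event_type="account_lockout", min_severity="high")
print(event_matches(lockout, "ad_security", "account_lockout", "critical"))
```

In this sketch a field left unset matches any alert, which mirrors the "ALL non-null conditions" rule used for the standard trigger fields.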

Auto-approve by investigation confidence

Runbooks can set an auto-approve confidence threshold (0.0 to 1.0). When the causal investigation's confidence score meets or exceeds this threshold, the execution is auto-approved without manual intervention. This is separate from the health-based auto-approve.
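The two auto-approve paths can be pictured with a small decision function. This is a sketch under stated assumptions: the parameter name auto_approve_confidence is hypothetical, host health is treated as a percentage, and the sketch assumes that either condition on its own is enough to skip approval. The manual does not specify how the engine combines them.

```python
from typing import Optional


def initial_status(
    requires_approval: bool,
    host_health: float,
    confidence: float,
    auto_approve_if_health_below: Optional[float] = None,
    auto_approve_confidence: Optional[float] = None,  # hypothetical name
) -> str:
    """Pick the status a new execution starts in."""
    if not requires_approval:
        return "approved"
    # Health-based auto-approve: host health has dropped below the threshold.
    if auto_approve_if_health_below is not None and host_health < auto_approve_if_health_below:
        return "approved"
    # Confidence-based auto-approve: investigation confidence meets or exceeds the threshold.
    if auto_approve_confidence is not None and confidence >= auto_approve_confidence:
        return "approved"
    return "pending_approval"


print(initial_status(True, host_health=80.0, confidence=0.92, auto_approve_confidence=0.9))
```

Leaving both thresholds unset keeps the runbook on the manual approval path.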

Creating runbook definitions

  1. Navigate to Remediation > Runbooks.
  2. Click Create Runbook.
  3. Configure trigger conditions -- the engine matches ALL non-null conditions:
    Trigger Metric Type: Match on the alert's metric type (e.g., CPU usage percentage).
    Trigger Cause Pattern: Pattern matched against the investigation's probable cause text.
    Trigger Severity: Minimum severity threshold: info, low, medium, warning, high, or critical.
    Trigger Heuristic Rule: Match on the investigation's heuristic rule name.
  4. Configure the action:
    Action Type: One of Script, Workflow, Service Restart, or Command.
    Action Configuration: Action-specific parameters (script, workflow, service name, command text, etc.).
  5. Configure safety constraints:
    Requires Approval: Whether manual approval is needed before execution (default: Yes).
    Max Executions Per Day: Maximum times this runbook can execute in 24 hours (default: 5).
    Cooldown (minutes): Minimum time between executions of this runbook (default: 30).
    Auto-Approve Threshold: Optional host health percentage below which approval is skipped.
  6. Configure validation (post-execution health check):
    Validation Metric: Metric to check after the action completes.
    Validation Operator: Comparison operator (less than, greater than, equals, etc.).
    Validation Threshold: Target value the metric must satisfy.
    Stabilization Window: Minutes to wait before validating (default: 5).
  7. Optionally configure a rollback action (same structure as the primary action). Rollback executes automatically if validation fails.
  8. Click Save.
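To make the fields in the steps above concrete, here is a sketch that models a runbook definition as a Python dataclass and evaluates the post-execution validation check. The attribute names, the operator keys ("lt", "gt", and so on), the metric name, and the script name are all hypothetical; only the defaults (approval required, 5 executions per day, 30-minute cooldown, 5-minute stabilization window) come from this manual.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical operator keys; the manual names the operators only in prose.
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
    "eq": operator.eq,
}


@dataclass
class Runbook:
    name: str
    # Trigger conditions: None means "do not filter on this field".
    trigger_metric_type: Optional[str] = None
    trigger_cause_pattern: Optional[str] = None
    trigger_severity: Optional[str] = None
    trigger_heuristic_rule: Optional[str] = None
    # Action.
    action_type: str = "script"
    action_config: Optional[dict] = None
    # Safety constraints, with the defaults stated in the steps above.
    requires_approval: bool = True
    max_executions_per_day: int = 5
    cooldown_minutes: int = 30
    auto_approve_if_health_below: Optional[float] = None
    # Validation.
    validation_metric: Optional[str] = None
    validation_operator: Optional[str] = None
    validation_threshold: Optional[float] = None
    stabilization_minutes: int = 5


def validation_passes(runbook: Runbook, observed: float) -> bool:
    """Compare the observed metric against the runbook's validation threshold."""
    if runbook.validation_metric is None:
        return True  # Assumption: a runbook without validation always passes.
    compare = OPERATORS[runbook.validation_operator]
    return compare(observed, runbook.validation_threshold)


disk_cleanup = Runbook(
    name="Disk cleanup",
    trigger_metric_type="disk_usage_percent",  # hypothetical metric name
    trigger_severity="high",
    action_config={"script": "cleanup_tmp"},   # hypothetical script name
    validation_metric="disk_usage_percent",
    validation_operator="lt",
    validation_threshold=85.0,
)
print(validation_passes(disk_cleanup, observed=72.0))
```

In this example the remediation counts as validated only if disk usage is below 85 percent once the stabilization window has elapsed.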

Linking runbooks to alert rules

Runbooks are matched to alerts automatically based on trigger conditions -- you do not manually link a runbook to a specific alert rule. The matching occurs at runtime during the alert evaluation cycle:

  1. An alert fires for a host based on an alert rule's threshold and duration.
  2. The causal engine runs an investigation, producing a probable cause, confidence score, and heuristic rule.
  3. The remediation engine queries all active runbooks and matches trigger conditions against the alert's metric type, investigation cause pattern, severity, and heuristic rule.
  4. If a match is found and safety constraints (daily limit, cooldown) pass, a remediation execution is created.
  5. If no match is found, the engine may create a suggested runbook for the operator to review and configure.

Direct triggers on alert rules. Alert rules also support direct workflow and script triggers without the runbook engine. Runbooks add approval gates, validation, and rollback on top.
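The runtime matching steps above can be sketched as two checks: one that applies every non-null trigger condition, and one that applies the daily limit and cooldown. All names here are hypothetical. The sketch also assumes that the cause pattern is a case-insensitive regular expression and that severities rank in the order the manual lists them; the manual specifies neither.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed ordering, lowest to highest.
SEVERITY_ORDER = ["info", "low", "medium", "warning", "high", "critical"]


@dataclass
class Runbook:
    trigger_metric_type: Optional[str] = None
    trigger_cause_pattern: Optional[str] = None
    trigger_severity: Optional[str] = None
    trigger_heuristic_rule: Optional[str] = None
    max_executions_per_day: int = 5
    cooldown_minutes: int = 30
    recent_runs: List[datetime] = field(default_factory=list)  # past execution times


def triggers_match(rb: Runbook, metric_type: str, severity: str, cause: str, rule: str) -> bool:
    """All non-null trigger conditions must match."""
    if rb.trigger_metric_type is not None and rb.trigger_metric_type != metric_type:
        return False
    if rb.trigger_cause_pattern is not None and not re.search(
        rb.trigger_cause_pattern, cause, re.IGNORECASE
    ):
        return False
    if rb.trigger_severity is not None and (
        SEVERITY_ORDER.index(severity) < SEVERITY_ORDER.index(rb.trigger_severity)
    ):
        return False
    if rb.trigger_heuristic_rule is not None and rb.trigger_heuristic_rule != rule:
        return False
    return True


def safety_allows(rb: Runbook, now: datetime) -> bool:
    """Apply the daily execution limit and the cooldown."""
    runs_last_24h = [t for t in rb.recent_runs if t > now - timedelta(hours=24)]
    if len(runs_last_24h) >= rb.max_executions_per_day:
        return False
    if rb.recent_runs and now - max(rb.recent_runs) < timedelta(minutes=rb.cooldown_minutes):
        return False
    return True


rb = Runbook(trigger_metric_type="cpu_percent", trigger_cause_pattern="backup", trigger_severity="warning")
matched = triggers_match(rb, "cpu_percent", "high", "Disk I/O contention from backup process", "io_contention")
print(matched and safety_allows(rb, datetime(2024, 1, 1, 12, 0)))
```

An execution would be created only when both checks pass; a runbook that matches but is inside its cooldown is skipped.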

Remediation execution

Triggered execution (automatic)

When the remediation engine matches a runbook to an alert, the execution proceeds through the lifecycle automatically. If the runbook requires approval, it pauses at "Pending Approval" until an operator approves.

Manual approval workflow

  1. Navigate to Remediation > Executions.
  2. Find executions with status "Pending Approval".
  3. Review the runbook, alert, and investigation details.
  4. Click Approve to allow execution, or Cancel to abort.
  5. Monitor execution progress through the state transitions on the execution detail page.

Remediation lifecycle

Each remediation execution moves through a defined state machine:

Status | Description | Next states
Pending Approval | Waiting for operator approval. Created when the runbook requires approval. | Approved, Cancelled
Approved | Approved and queued for execution. Auto-set when approval is not required. | Executing
Executing | Action job is running on the target host. | Stabilizing, Failed
Stabilizing | Waiting for the stabilization window to elapse before validation. | Validated, Failed
Validated | Post-execution validation check passed. | Completed
Completed | Remediation finished successfully. Terminal state. | None
Failed | Action or validation failed. If rollback is configured, it runs automatically. | Rolling Back, or terminal
Rolling Back | Rollback action is executing. | Rolled Back, Failed
Rolled Back | Rollback completed. Terminal state. | None
Cancelled | Operator cancelled the execution. Terminal state. | None

Validation failure triggers automatic rollback. If validation fails and the runbook has a rollback action configured, rollback runs without further approval. Monitor the Executions page to catch rollback failures.
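The lifecycle above is a small state machine, and one way to picture it is a transition table with a guard function. The status identifiers pending_approval and approved appear in this manual; the other snake_case identifiers and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Set

# Allowed transitions, transcribed from the lifecycle table.
# "failed" moves to "rolling_back" only when a rollback action is configured;
# otherwise it is terminal.
TRANSITIONS: Dict[str, Set[str]] = {
    "pending_approval": {"approved", "cancelled"},
    "approved": {"executing"},
    "executing": {"stabilizing", "failed"},
    "stabilizing": {"validated", "failed"},
    "validated": {"completed"},
    "completed": set(),
    "failed": {"rolling_back"},
    "rolling_back": {"rolled_back", "failed"},
    "rolled_back": set(),
    "cancelled": set(),
}


def advance(current: str, target: str) -> str:
    """Return the target status, or raise if the transition is not allowed."""
    if current not in TRANSITIONS:
        raise ValueError(f"unknown status: {current}")
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


def run_path(path: List[str]) -> str:
    """Walk a sequence of statuses, validating each step."""
    state = path[0]
    for nxt in path[1:]:
        state = advance(state, nxt)
    return state


happy_path = ["pending_approval", "approved", "executing", "stabilizing", "validated", "completed"]
print(run_path(happy_path))
```

A failed validation followed by a successful rollback would walk stabilizing, failed, rolling_back, rolled_back.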

Remediation history and reporting

  1. Navigate to Remediation > Executions to browse all executions with filters for status, host, and runbook.
  2. Click an execution to see full details: action job results, validation results, rollback results, error messages, and timestamps for each state transition.
  3. Navigate to Remediation > Dashboard for aggregate statistics:
    Total Executions: Total remediation executions in the selected period.
    Success Rate: Percentage of completed executions with outcome "fixed".
    Pending Count: Number of executions currently awaiting approval.
    Active Runbooks: Count of active runbook definitions.
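The dashboard figures can be approximated from a list of executions. The sketch below reads "Success Rate" as the share of completed executions whose outcome is "fixed"; that reading, the Execution class, and the non-"fixed" outcome value are assumptions, and the product may compute the figure differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Execution:
    status: str                    # e.g. "completed", "pending_approval"
    outcome: Optional[str] = None  # "fixed" when the remediation resolved the alert


def dashboard_stats(executions: List[Execution], active_runbooks: int) -> Dict[str, float]:
    """Compute the four dashboard figures for the given executions."""
    completed = [e for e in executions if e.status == "completed"]
    fixed = [e for e in completed if e.outcome == "fixed"]
    # Avoid division by zero when nothing has completed in the period.
    success_rate = 100.0 * len(fixed) / len(completed) if completed else 0.0
    return {
        "total_executions": len(executions),
        "success_rate": success_rate,
        "pending_count": sum(1 for e in executions if e.status == "pending_approval"),
        "active_runbooks": active_runbooks,
    }


sample = [
    Execution("completed", "fixed"),
    Execution("completed", "fixed"),
    Execution("completed", "fixed"),
    Execution("completed", "not_fixed"),  # hypothetical outcome value
    Execution("pending_approval"),
]
print(dashboard_stats(sample, active_runbooks=7))
```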

Causal analysis and root cause chain

When an alert fires, the causal analysis engine runs an automated investigation to identify the probable root cause. This investigation drives both the remediation matching and operator decision-making.

Investigation output

Probable Cause: Text description of the likely root cause (e.g., "Disk I/O contention from backup process").
Confidence: Score from 0.0 to 1.0 indicating the certainty of the heuristic analysis.
Heuristic Rule: Name of the heuristic that identified the cause (used for runbook matching).

Incident timeline

For a unified view of the full causal chain from alert to resolution:

  1. Navigate to an alert incident and click the Timeline tab.
  2. The timeline shows alerts, investigations, and remediations in chronological order.
  3. This provides the complete chain: what triggered, what was investigated, what was attempted, and the final outcome.

Investigation failures are non-blocking. If the causal engine fails to produce an investigation, the alert still fires and notifications still send. The failure is logged but does not prevent alert creation or notification delivery.

Permissions reference

Action | Permission
View/manage runbooks | alerts.manage
Approve/cancel remediation executions | alerts.manage
View remediation stats | alerts.manage
View alerts and incidents | alerts.view

Navigation reference

Feature | Location
Runbooks | Remediation > Runbooks -- create, edit, delete runbook definitions
Executions | Remediation > Executions -- view, approve, cancel remediation executions
Dashboard | Remediation > Dashboard -- aggregate statistics and success rates

Troubleshooting

Symptom | Cause | Fix
Runbook not triggering on alert | Trigger conditions do not match the alert and investigation | Verify the trigger metric type, cause pattern, severity, and heuristic rule match the actual alert and investigation output.
Execution stuck in "Pending Approval" | Approval is required and no one has approved | Manually approve via the Executions page, or set an auto-approve health threshold on the runbook.
Remediation failed after action completed | Post-execution validation check failed | Check the validation metric, operator, and threshold. The metric may not have recovered within the stabilization window.
Rollback itself failed | Rollback action encountered an error on the host | Check execution detail for error messages. Manual intervention required.
Executions blocked by daily limit | Maximum executions per day reached | Increase the limit if appropriate, or investigate why the alert recurs so frequently.
Execution blocked by cooldown | Cooldown period not yet elapsed | Wait for cooldown to expire, or reduce it if the runbook is safe for rapid re-execution.
Suggested runbook appeared | No existing runbook matched the alert | Review the suggested runbook, configure an action, and activate it.