Prerequisites
- Role with alerts.view for viewing alerts, rules, and incidents
- Role with alerts.manage for creating/editing rules, acknowledging, resolving, and remediation
- Role with monitoring.view / monitoring.configure for monitoring configuration
- Hosts must be online and reporting metrics for alert evaluation to work
Setting up notification channels
Before creating alert rules, configure at least one notification channel so alerts can be delivered to your team.
- Navigate to Alerts > Notification Channels.
- Click Add Channel.
- Select the channel type: Email, Webhook, Slack, or Microsoft Teams.
- Configure channel-specific settings (SMTP configuration for email, webhook URL, Slack webhook URL, Teams incoming webhook URL).
- Click Test to verify delivery works.
- Click Save to create the channel.
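Before saving, it can help to check that a channel's configuration is complete. The sketch below shows the idea; the required keys per channel type are assumptions mirroring the settings listed above, not the product's actual schema.

```python
# Hypothetical required settings per channel type, based on the
# channel-specific settings listed above (illustrative names only).
REQUIRED = {
    "email":   {"smtp_host", "smtp_port", "sender", "recipients"},
    "webhook": {"url"},
    "slack":   {"webhook_url"},
    "teams":   {"webhook_url"},
}

def validate_channel(kind: str, config: dict) -> list:
    """Return the sorted list of missing keys (empty means complete)."""
    return sorted(REQUIRED[kind] - config.keys())

validate_channel("slack", {"webhook_url": "https://hooks.slack.com/services/T000/B000/XXXX"})
# -> [] (complete)
```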
Creating an alert rule
Alert rules define what to monitor, what threshold triggers an alert, and how to notify your team.
- Navigate to Alerts > Alert Rules.
- Click Create Rule.
- Enter a rule name and description.
- Set scope: Account-wide, Organization, Location, or specific Host. Optionally filter by host groups.
- Select the metric to monitor from the MetricDefinition catalog (e.g. CPU usage, memory usage, disk space).
- Set the operator and threshold (e.g. `> 90` for CPU above 90%).
- Set the duration in minutes (1-1440). The threshold must be sustained for this entire window before an alert fires. This prevents transient spikes from triggering false alerts.
- Set severity: info, low, medium, warning, high, or critical.
- Attach one or more notification channels for external delivery.
- Optionally configure advanced settings (see below).
- Click Save.
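Conceptually, a saved rule bundles the fields from the steps above. The following sketch models that shape; the field names and validation are illustrative assumptions, not the product's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class AlertRule:
    """Hypothetical alert rule shape mirroring the creation steps above."""
    name: str
    metric: str              # key from the MetricDefinition catalog
    operator: str            # ">", "<", ">=", "<="
    threshold: float
    duration_minutes: int    # 1-1440; must be sustained for the whole window
    severity: str            # info | low | medium | warning | high | critical
    scope: str = "account"   # account | organization | location | host
    channels: list = field(default_factory=list)

    def __post_init__(self):
        if not 1 <= self.duration_minutes <= 1440:
            raise ValueError("duration must be 1-1440 minutes")

rule = AlertRule(
    name="High CPU",
    metric="cpu_usage",
    operator=">",
    threshold=90.0,
    duration_minutes=5,
    severity="high",
    channels=["ops-slack"],
)
```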
Advanced rule settings
- Re-notification interval (minutes): for example, set to 60 to send hourly reminders until someone acknowledges the alert or it resolves. Prevents critical alerts from being silently ignored during shift changes.
- Escalation threshold and channel: when the alert's occurrence count reaches the threshold, a one-time escalation notification is sent to the designated channel. See Escalation policies below.
Alert evaluation cycle
The alert engine runs a background evaluation loop every 30 seconds. Understanding this cycle helps explain why alerts fire or don't fire.
- The alert engine evaluates all active rules every 30 seconds.
- All active alert rules are loaded in batch.
- For each rule, target hosts are resolved based on scope (account, org, location, host, host groups).
- Metric values are queried in batch -- one query per rule for all target hosts.
- For each host, the engine checks if ALL values within the duration window breach the threshold (sustained check).
- If breached and no existing alert: a new alert is created, correlated to an incident, notifications are queued, and an investigation is triggered.
- If breached and existing alert: the occurrence count is incremented. Re-notification and escalation checks run.
- If NOT breached and an existing alert exists: the alert is auto-resolved.
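The sustained check in step 5 can be sketched as follows: every sample in the duration window must breach the threshold, so a single non-breaching sample resets the condition. This is a minimal illustration of the logic described above, not the engine's actual code.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def breaches_sustained(values, op, threshold):
    """True only if ALL samples in the duration window breach the
    threshold -- one dip back under resets the sustained condition."""
    return bool(values) and all(OPS[op](v, threshold) for v in values)

breaches_sustained([95, 40, 96], ">", 90)  # False: transient spike, no alert
breaches_sustained([92, 95, 91], ">", 90)  # True: sustained breach, alert fires
```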
Alert triage
Work through active alerts using acknowledge, investigate, and resolve actions.
- Navigate to the Alerts page.
- Filter by severity, status (active / acknowledged / resolved), or host.
- Click an alert to see details: metric value, threshold, occurrence count, investigation results.
- Acknowledge: Click "Acknowledge" to indicate you are looking at the issue. This stops re-notification for this alert.
- Resolve: Click "Resolve" to mark as fixed. Add an optional comment explaining the resolution.
- Bulk actions: Select multiple alerts via checkboxes, then choose "Acknowledge" or "Resolve" for batch processing (up to 500 alerts).
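If you drive bulk actions through scripting rather than the UI, selections larger than the 500-alert limit need to be split into batches. A simple chunking sketch (the limit is from the UI behavior above; the function itself is illustrative):

```python
def chunk_bulk_action(alert_ids, limit=500):
    """Split a selection into batches no larger than the bulk-action
    limit of 500 alerts per request."""
    return [alert_ids[i:i + limit] for i in range(0, len(alert_ids), limit)]

batches = chunk_bulk_action(list(range(1200)))
# -> three batches of 500, 500, and 200 alerts
```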
Alert incidents
Incidents group related alerts by host within a 5-minute time window. This reduces noise when multiple alert rules fire for the same host simultaneously.
- Navigate to the Alert Incidents page.
- Incidents display the highest-severity alert in the group and the count of associated alerts.
- Open an incident to see all associated alerts and a unified timeline (alerts, investigations, and remediations in chronological order).
- Acknowledge the incident to indicate triage is underway.
- Resolve the incident to close it. Resolving an incident auto-resolves all remaining active alerts within it.
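The 5-minute correlation window can be sketched as: a new alert joins an open incident for the same host if that incident's latest alert is recent enough, otherwise a new incident is created. The data shapes below are illustrative assumptions.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def correlate(host, alert_time, open_incidents):
    """Return the host key of an open incident to join, or None to signal
    that a new incident should be created. `open_incidents` maps each
    host to the timestamp of its incident's most recent alert."""
    last = open_incidents.get(host)
    if last is not None and alert_time - last <= WINDOW:
        return host  # grouped into the existing incident
    return None      # outside the 5-minute window: new incident

t0 = datetime(2024, 1, 1, 12, 0)
correlate("web-01", t0 + timedelta(minutes=3), {"web-01": t0})  # "web-01"
correlate("web-01", t0 + timedelta(minutes=9), {"web-01": t0})  # None
```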
Investigations
When a new alert fires, the system automatically triggers a causal investigation. The causal engine analyzes metrics and context to produce a probable cause, confidence score, and heuristic rule name.
- Investigations are created automatically -- no manual action required to start one.
- Results appear in the alert detail view: probable cause, confidence, and heuristic rule used.
- Investigation results feed into the remediation engine for automatic runbook matching.
- If investigation fails (e.g. insufficient data), the alert still fires normally -- investigation failures are logged but don't block alerting.
Notification channels
Alerts are delivered through multiple channels simultaneously.
| Channel | Delivery Method | Configuration |
|---|---|---|
| In-app | WebSocket push + database persistence | Automatic for all alerts. Click the bell icon to view. Real-time delivery via WebSocket. |
| Email | SMTP | SMTP server, port, credentials, sender address, recipient list |
| Webhook | HTTP POST | Target URL. Payload includes alert details, host info, and metric values. |
| Slack | Incoming webhook | Slack webhook URL. Messages formatted with alert severity, host, and metric context. |
| Microsoft Teams | Incoming webhook | Teams webhook URL. Adaptive card format with alert details. |
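For the webhook channel, the table above promises alert details, host info, and metric values in the POST body. The exact field names below are illustrative assumptions about what such a payload might look like, not the product's documented schema.

```python
import json

# Hypothetical webhook payload shape -- field names are illustrative.
payload = {
    "alert": {"id": 42, "severity": "high", "rule": "High CPU"},
    "host": {"id": "web-01", "name": "web-01.example.com"},
    "metric": {"name": "cpu_usage", "value": 97.2, "threshold": 90.0},
}
body = json.dumps(payload)
# POST `body` with Content-Type: application/json to the configured target URL.
```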
In-app notifications
- Click the notification bell icon in the header. Unread count is shown as a badge.
- Click a notification to navigate to the relevant resource (alert, incident, etc.).
- Use Mark All Read to clear the unread badge.
Escalation policies
Escalation is configured per alert rule, not as a global policy. Two mechanisms work together to ensure critical alerts get attention.
Re-notification
Set a re-notification interval (in minutes) on the alert rule. The system re-sends notifications for active, unacknowledged alerts at this interval. Example: setting it to 60 sends hourly reminders until the alert is acknowledged or resolved. This prevents critical alerts from being buried during busy periods or shift changes.
Escalation
Set an escalation threshold (occurrence count) and escalation channel on the alert rule. When the alert's occurrence count reaches the threshold, a one-time escalation notification is sent to the designated channel. This is useful for routing persistent problems to a senior on-call team or management channel.
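The two mechanisms can be sketched side by side: re-notification repeats on a timer while the alert stays unacknowledged, whereas escalation fires exactly once when the occurrence count crosses its threshold. The function signatures are illustrative.

```python
from datetime import datetime, timedelta

def should_renotify(last_sent, now, interval_min, acknowledged):
    """Re-send for an active, unacknowledged alert once the configured
    interval has elapsed since the last notification."""
    if acknowledged or interval_min is None:
        return False
    return now - last_sent >= timedelta(minutes=interval_min)

def should_escalate(occurrences, threshold, already_escalated):
    """One-time escalation when the occurrence count reaches the threshold."""
    return (not already_escalated
            and threshold is not None
            and occurrences >= threshold)
```

Acknowledging an alert stops re-notification (consistent with the triage section above), while escalation depends only on how long the condition has persisted.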
Alert rule templates
System-generated alert rules are created automatically for common operational scenarios. They cannot be modified or deleted -- if you need different thresholds, create a custom rule instead.
- System rules appear in the Alert Rules list with a "system" badge.
- Attempting to edit or delete a system rule returns HTTP 403: "System-generated rules cannot be modified".
- Custom rules do not replace system rules: a custom rule with the same metric and scope fires independently alongside the system rule, so duplicate notifications are possible when the two overlap.
Event-driven alerting
In addition to metric-based threshold alerts, Cadres supports event-driven alerts from non-metric sources. These fire through the same alert pipeline (incident correlation, deduplication, investigation, notification, webhook dispatch) but are triggered by discrete events rather than sustained metric values.
Event sources
| Source | Triggers On | Example Events |
|---|---|---|
| Active Directory Security | Security-relevant AD events | Account lockout (high), group membership changes (medium) |
| AD Replication | Replication failures | Failed replication partners detected during monitoring |
| Vulnerability | New critical CVEs | CVSS >= 7.0 or CISA KEV match on a host |
| SNMP Status | Device status transitions | SNMP device becomes unreachable; auto-resolves on recovery |
| Discovery | Asset staleness | Discovered host not seen for > 24 hours; resolves when host reappears |
| Hardware Forecast | Predicted hardware failures | Disk, CPU, or memory forecasted to reach capacity within threshold |
| Process Baseline | Unauthorized process execution | LOTL (living off the land) binary detected, unknown process in enforcement mode |
| Exfiltration | Network anomalies | Large outbound transfers, connections to unknown destinations, tunneling port usage |
| Fingerprint Drift | Configuration drift detected | Service, package, or configuration changed from established baseline |
| Host Group Health | Health score degradation | Host group health drops below threshold; auto-resolves on recovery |
Event alert rules
Event alert rules control which event types generate alerts. System rules are seeded at installation for common scenarios. You can create custom event rules to adjust severity, notification channels, and scope.
- Navigate to Alerts > Alert Rules > Event Rules tab.
- Click Create Event Rule.
- Select the event source (e.g. `ad_security`, `vulnerability`, `snmp_status`).
- Optionally filter by event type within that source.
- Set the severity for alerts generated by this rule.
- Attach notification channels and configure advanced settings (re-notification, escalation).
- Click Save.
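Matching behaves as described above: a rule applies to an event when the source matches, and the optional event-type filter narrows it further. A minimal sketch, with assumed dictionary shapes:

```python
def matches(rule, event):
    """A rule matches an event on source; if the rule filters by event
    type, the type must match too (no filter means all types match)."""
    if rule["source"] != event["source"]:
        return False
    return rule.get("event_type") in (None, event["type"])

rule = {"source": "ad_security", "event_type": "account_lockout"}
matches(rule, {"source": "ad_security", "type": "account_lockout"})  # True
matches(rule, {"source": "ad_security", "type": "group_change"})     # False
```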
Causal investigation for events
The causal engine runs event-specific heuristics when investigating event-sourced alerts. Each event source has specialized analysis patterns:
- Process baseline LOTL — Confidence 0.9 for living-off-the-land binary detection
- Exfiltration + network spike — Confidence 0.85 when exfiltration alert correlates with network anomalies
- AD brute force — Confidence 0.85 for account lockout events
- Vulnerability — Confidence 0.7 for new critical CVE detection
- SNMP device unreachable — Confidence 0.7-0.8 depending on device type
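Heuristic selection can be pictured as picking the highest-confidence pattern among those that matched. The (name, confidence) pairs mirror the list above; the selection logic itself is an assumed sketch, not the causal engine's implementation.

```python
# Confidence values from the heuristics listed above (illustrative names).
HEURISTICS = [
    ("process_baseline_lotl", 0.90),
    ("exfiltration_network_spike", 0.85),
    ("ad_brute_force", 0.85),
    ("vulnerability_new_cve", 0.70),
]

def best_heuristic(matched_names):
    """Return the (name, confidence) pair with the highest confidence
    among the heuristics that matched, or None if nothing matched."""
    candidates = [(n, c) for n, c in HEURISTICS if n in matched_names]
    return max(candidates, key=lambda nc: nc[1], default=None)

best_heuristic({"ad_brute_force", "vulnerability_new_cve"})
# -> ("ad_brute_force", 0.85)
```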
Cascading runbook detection
When a remediation runbook triggers another alert which triggers another runbook, a cascading loop can occur. The remediation engine includes a circuit breaker that detects and stops cascading runbook chains.
- The engine tracks the depth of remediation chains per incident.
- If the chain depth exceeds the configured threshold, the circuit breaker trips and prevents further automated remediation.
- A notification is sent when the circuit breaker activates, requiring manual intervention to resolve the underlying issue.
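The breaker's core idea is a per-incident depth counter that refuses further automated remediation once the chain grows too deep. The sketch below is illustrative; the real threshold is configurable, and 2 here is just an example value.

```python
class CircuitBreaker:
    """Track remediation chain depth per incident and trip once the
    configured maximum depth is exceeded (illustrative sketch)."""

    def __init__(self, max_depth=2):
        self.max_depth = max_depth
        self.depth = {}  # incident_id -> current chain depth

    def allow(self, incident_id):
        """Record one more remediation in the chain; False means the
        breaker has tripped and manual intervention is required."""
        d = self.depth.get(incident_id, 0) + 1
        self.depth[incident_id] = d
        return d <= self.max_depth

cb = CircuitBreaker(max_depth=2)
[cb.allow("inc-1") for _ in range(3)]  # [True, True, False]
```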
Automation pause (kill switch)
In an emergency, you can pause all automated remediation and alert-driven actions for an entire organization. This is a per-org safety mechanism that stops the remediation engine from dispatching any new remediation executions while investigation is underway.
- Navigate to Settings > Automation, select the organization, and click Pause Automation. Enter an optional reason (e.g., "Investigating cascading remediation failure").
- All automation for the organization is immediately suspended. The pause is recorded with the user, timestamp, and reason.
- To resume: return to Settings > Automation and click Resume Automation.
- The current pause/resume status is shown on the same page.
Permissions reference
| Action | Permission |
|---|---|
| View alerts, rules, and incidents | alerts.view |
| Create, edit, delete alert rules | alerts.manage |
| Acknowledge and resolve alerts | alerts.manage |
| Bulk alert actions (up to 500) | alerts.manage |
| View and manage runbooks | alerts.manage |
| Approve or cancel remediation executions | alerts.manage |
| View monitoring configuration | monitoring.view |
| Edit monitoring configuration | monitoring.configure |
| Pause/resume automation (kill switch) | settings.manage |
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Alert rule not firing | Duration window not sustained, or host offline | Verify metric values are consistently above threshold for the full duration window. Offline hosts are excluded from evaluation. |
| No notification on new alert | Alert correlated to existing incident (noise suppression) | Check the Alert Incidents page -- the alert may have been added to an existing incident instead of creating a new one. |
| "System-generated rules cannot be modified" | Attempting to edit a system rule | System rules are immutable. Create a custom rule with your desired settings instead. |
| Alerts auto-resolving unexpectedly | Host went offline for > 4 hours | The system auto-resolves alerts for hosts offline more than 4 hours. This is expected behavior. |
| External notifications not delivered | Channel misconfigured or inactive | Test the notification channel. Verify the channel is enabled and the configuration is valid. |
| No real-time notifications | WebSocket disconnected | Refresh the page. Check the browser console for WebSocket connection errors. |
| Remediation stuck in pending_approval | Approval required is enabled on the runbook | Manually approve via the Remediation Executions page, or configure the auto-approve health threshold on the runbook. |
| Runbook not triggering for an alert | Trigger conditions don't match | Verify the runbook's trigger conditions (metric type, cause pattern, severity, and heuristic rule) match the alert and its investigation result. |