Core Operations

Monitoring & Alerts

Threshold-based alert rules with sustained-duration checks, incident correlation, multi-channel notifications, escalation policies, and automated remediation triggers.

Technical Manual
Status: Available

Prerequisites

  • Role with alerts.view for viewing alerts, rules, and incidents
  • Role with alerts.manage for creating/editing rules, acknowledging/resolving alerts, and triggering remediation
  • Role with monitoring.view / monitoring.configure for monitoring configuration
  • Hosts must be online and reporting metrics for alert evaluation to work

Setting up notification channels

Before creating alert rules, configure at least one notification channel so alerts can be delivered to your team.

  1. Navigate to Alerts > Notification Channels.
  2. Click Add Channel.
  3. Select the channel type: Email, Webhook, Slack, or Microsoft Teams.
  4. Configure channel-specific settings (SMTP configuration for email, webhook URL, Slack webhook URL, Teams incoming webhook URL).
  5. Click Test to verify delivery works.
  6. Click Save to create the channel.
Channel testing Always test a channel before attaching it to alert rules. If the channel is misconfigured or disabled, external notifications will silently fail.
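
The per-type checks a channel test might perform can be sketched as follows. This is an illustrative Python sketch only -- the field names (smtp_host, url, etc.) are assumptions, not the product's actual channel schema:

```python
# Hypothetical per-type channel validation, mirroring the "Test" action in
# step 5. Field names are illustrative assumptions, not the real schema.
def validate_channel(channel: dict) -> list[str]:
    errors = []
    ctype = channel.get("type")
    if ctype == "email":
        # Email needs a full SMTP configuration before delivery can work.
        for field in ("smtp_host", "smtp_port", "sender", "recipients"):
            if not channel.get(field):
                errors.append(f"missing {field}")
    elif ctype in ("webhook", "slack", "teams"):
        # Slack/Teams incoming webhooks and generic webhooks are all URLs.
        url = channel.get("url", "")
        if not url.startswith("https://"):
            errors.append("url must be an https:// endpoint")
    else:
        errors.append(f"unknown channel type: {ctype!r}")
    return errors
```

An empty error list means the channel is safe to attach to rules; anything else should block the Save in step 6.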

Creating an alert rule

Alert rules define what to monitor, what threshold triggers an alert, and how to notify your team.

  1. Navigate to Alerts > Alert Rules.
  2. Click Create Rule.
  3. Enter a rule name and description.
  4. Set scope: Account-wide, Organization, Location, or specific Host. Optionally filter by host groups.
  5. Select the metric to monitor from the MetricDefinition catalog (e.g. CPU usage, memory usage, disk space).
  6. Set the operator and threshold (e.g. > 90 for CPU above 90%).
  7. Set the duration in minutes (1-1440). The threshold must be sustained for this entire window before an alert fires. This prevents transient spikes from triggering false alerts.
  8. Set severity: info, low, medium, warning, high, or critical.
  9. Attach one or more notification channels for external delivery.
  10. Optionally configure advanced settings (see below).
  11. Click Save.
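
Taken together, steps 3-10 amount to a rule definition like the one below. This is a hedged sketch -- the key names are illustrative, not the product's actual API schema:

```python
# Illustrative alert-rule payload assembled from the steps above.
# Key names and values are assumptions for demonstration only.
rule = {
    "name": "High CPU",
    "description": "CPU sustained above 90% for 10 minutes",
    "scope": {"type": "location", "id": "dc-east", "host_groups": ["web"]},
    "metric": "cpu_usage",          # from the MetricDefinition catalog
    "operator": ">",
    "threshold": 90,
    "duration_minutes": 10,         # must stay within 1-1440
    "severity": "high",
    "channels": ["ops-email", "ops-slack"],
}
assert 1 <= rule["duration_minutes"] <= 1440
```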

Advanced rule settings

Re-notification interval Minutes between repeated notifications for an active, unacknowledged alert. Example: set to 60 to send hourly reminders until someone acknowledges the alert or it resolves. Prevents critical alerts from being silently ignored during shift changes.
Escalation threshold The number of occurrences an alert must reach before escalation fires.
Escalation channel The notification channel that receives the escalation notification. Useful for routing persistent problems to a senior on-call team. The escalation fires once per alert -- it does not repeat.
Workflow trigger Automatically trigger a workflow when the alert fires.
Script trigger Automatically execute a script on the affected host when the alert fires.
Estimated fix time Advisory field for triage -- how long this type of issue typically takes to resolve.

Alert evaluation cycle

The alert engine runs a background evaluation loop every 30 seconds. Understanding this cycle helps explain why alerts fire or don't fire.

  1. Every 30 seconds, the engine begins an evaluation pass.
  2. All active alert rules are loaded in batch.
  3. For each rule, target hosts are resolved based on scope (account, org, location, host, host groups).
  4. Metric values are queried in batch -- one query per rule for all target hosts.
  5. For each host, the engine checks if ALL values within the duration window breach the threshold (sustained check).
  6. If breached and no existing alert: a new alert is created, correlated to an incident, notifications are queued, and an investigation is triggered.
  7. If breached and existing alert: the occurrence count is incremented. Re-notification and escalation checks run.
  8. If NOT breached and an existing alert exists: the alert is auto-resolved.
Offline hosts are excluded Hosts that are offline are not evaluated. If a host goes offline for more than 4 hours, the system auto-resolves its active alerts. This is expected behavior, not a bug.
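
The sustained check in step 5 can be sketched in a few lines. This is an illustrative Python sketch, not the engine's actual code -- note that a single non-breaching sample within the window is enough to keep the alert from firing:

```python
# Sustained-threshold check: an alert fires only if EVERY sample in the
# duration window breaches the threshold (step 5 above).
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def sustained_breach(samples, op: str, threshold: float) -> bool:
    """samples: metric values covering the full duration window."""
    # An empty window (e.g. no data reported) never fires.
    return bool(samples) and all(OPS[op](v, threshold) for v in samples)

# A single dip below the threshold prevents the alert:
assert sustained_breach([92, 95, 91], ">", 90) is True
assert sustained_breach([92, 88, 95], ">", 90) is False
```

This is why transient spikes do not trigger alerts: the breach must hold for the entire configured duration.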

Alert triage

Work through active alerts using acknowledge, investigate, and resolve actions.

  1. Navigate to the Alerts page.
  2. Filter by severity, status (active / acknowledged / resolved), or host.
  3. Click an alert to see details: metric value, threshold, occurrence count, investigation results.
  4. Acknowledge: Click "Acknowledge" to indicate you are looking at the issue. This stops re-notification for this alert.
  5. Resolve: Click "Resolve" to mark as fixed. Add an optional comment explaining the resolution.
  6. Bulk actions: Select multiple alerts via checkboxes, then choose "Acknowledge" or "Resolve" for batch processing (up to 500 alerts).

Alert incidents

Incidents group related alerts by host within a 5-minute time window. This reduces noise when multiple alert rules fire for the same host simultaneously.

  1. Navigate to the Alert Incidents page.
  2. Incidents display the highest-severity alert in the group and the count of associated alerts.
  3. Open an incident to see all associated alerts and a unified timeline (alerts, investigations, and remediations in chronological order).
  4. Acknowledge the incident to indicate triage is underway.
  5. Resolve the incident to close it. Resolving an incident auto-resolves all remaining active alerts within it.
Noise suppression When a new alert fires and an open incident exists for the same host (within a 5-minute window), the alert is added to the existing incident instead of creating a new one. The incident's severity is escalated if the new alert has a higher severity. The initial notification for the correlated alert is suppressed to avoid duplicate noise.
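
The correlation and suppression behavior described above can be sketched as follows. Data shapes and severity ranks are illustrative assumptions:

```python
# Sketch of 5-minute incident correlation: a new alert joins an open
# incident for the same host if one exists within the window; otherwise
# a new incident is created. Shapes are illustrative, not the real model.
from datetime import timedelta

WINDOW = timedelta(minutes=5)
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2,
                 "warning": 3, "high": 4, "critical": 5}

def correlate(alert, open_incidents):
    """Returns (incident, notify) -- notify is False when suppressed."""
    for inc in open_incidents:
        if inc["host"] == alert["host"] and alert["time"] - inc["time"] <= WINDOW:
            inc["alerts"].append(alert)
            # Escalate the incident severity if the new alert is higher.
            if SEVERITY_RANK[alert["severity"]] > SEVERITY_RANK[inc["severity"]]:
                inc["severity"] = alert["severity"]
            return inc, False   # correlated: initial notification suppressed
    inc = {"host": alert["host"], "time": alert["time"],
           "severity": alert["severity"], "alerts": [alert]}
    open_incidents.append(inc)
    return inc, True            # new incident: notification sent
```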

Investigations

When a new alert fires, the system automatically triggers a causal investigation. The causal engine analyzes metrics and context to produce a probable cause, confidence score, and heuristic rule name.

  • Investigations are created automatically -- no manual action required to start one.
  • Results appear in the alert detail view: probable cause, confidence, and heuristic rule used.
  • Investigation results feed into the remediation engine for automatic runbook matching.
  • If investigation fails (e.g. insufficient data), the alert still fires normally -- investigation failures are logged but don't block alerting.

Notification channels

Alerts are delivered through multiple channels simultaneously.

Channel Delivery Method Configuration
In-app WebSocket push + database persistence Automatic for all alerts. Click the bell icon to view.
Email SMTP SMTP server, port, credentials, sender address, recipient list
Webhook HTTP POST Target URL. Payload includes alert details, host info, and metric values.
Slack Incoming webhook Slack webhook URL. Messages formatted with alert severity, host, and metric context.
Microsoft Teams Incoming webhook Teams webhook URL. Adaptive card format with alert details.
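
The manual says the webhook payload "includes alert details, host info, and metric values"; a payload along those lines might look like the following. The exact field names are assumptions, not the documented payload schema:

```python
# Illustrative webhook payload -- field names are assumptions, not the
# product's documented schema.
import json

payload = {
    "alert": {"id": "a-123", "rule": "High CPU", "severity": "high",
              "metric": "cpu_usage", "value": 97.2, "threshold": 90,
              "occurrences": 3},
    "host": {"id": "h-42", "name": "web-01", "location": "dc-east"},
    "fired_at": "2024-01-01T00:00:00Z",
}
body = json.dumps(payload).encode()
# The dispatcher would then POST `body` to the configured target URL
# with Content-Type: application/json.
```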

In-app notifications

  1. Click the notification bell icon in the header. Unread count is shown as a badge.
  2. Click a notification to navigate to the relevant resource (alert, incident, etc.).
  3. Use Mark All Read to clear the unread badge.
Multi-instance delivery In multi-instance deployments, notifications are broadcast across all server instances. This ensures notifications reach users regardless of which server instance they are connected to.

Escalation policies

Escalation is configured per alert rule, not as a global policy. Two mechanisms work together to ensure critical alerts get attention.

Re-notification

Set a re-notification interval (in minutes) on the alert rule. The system re-sends notifications for active, unacknowledged alerts at this interval. Example: setting it to 60 sends hourly reminders until the alert is acknowledged or resolved. This prevents critical alerts from being buried during busy periods or shift changes.

Escalation

Set an escalation threshold (occurrence count) and escalation channel on the alert rule. When the alert's occurrence count reaches the threshold, a one-time escalation notification is sent to the designated channel. This is useful for routing persistent problems to a senior on-call team or management channel.

Escalation fires once The escalation notification is sent exactly once per alert. It does not repeat even if the occurrence count continues to climb.
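
The interplay of the two mechanisms, checked on each evaluation tick for an active alert, can be sketched like this. Field names are illustrative assumptions:

```python
# Sketch of the per-rule re-notification and escalation checks described
# above. Field names are assumptions; times are in epoch seconds.
def notifications_due(alert, rule, now):
    """Return which notifications to send this tick."""
    actions = []
    # Re-notification: repeats until the alert is acknowledged or resolved.
    interval = rule.get("renotify_minutes")
    if interval and not alert["acknowledged"]:
        if (now - alert["last_notified"]) >= interval * 60:
            actions.append("renotify")
    # Escalation: fires exactly once, when occurrences reach the threshold.
    threshold = rule.get("escalation_threshold")
    if threshold and not alert["escalated"] and alert["occurrences"] >= threshold:
        actions.append("escalate")
    return actions
```

Acknowledging an alert stops the re-notification branch; the escalated flag ensures the escalation branch never repeats.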

Alert rule templates

System-generated alert rules are created automatically for common operational scenarios. They cannot be modified or deleted -- if you need different thresholds, create a custom rule instead.

  • System rules appear in the Alert Rules list with a "system" badge.
  • Attempting to edit or delete a system rule returns HTTP 403: "System-generated rules cannot be modified".
  • Custom rules with the same metric and scope do not replace system rules -- both fire independently, so duplicate notifications are possible. If you create a custom replacement, consider detaching the system rule's notification channels.

Event-driven alerting

In addition to metric-based threshold alerts, Cadres supports event-driven alerts from non-metric sources. These fire through the same alert pipeline (incident correlation, deduplication, investigation, notification, webhook dispatch) but are triggered by discrete events rather than sustained metric values.

Event sources

Source Triggers On Example Events
Active Directory Security Security-relevant AD events Account lockout (high), group membership changes (medium)
AD Replication Replication failures Failed replication partners detected during monitoring
Vulnerability New critical CVEs CVSS >= 7.0 or CISA KEV match on a host
SNMP Status Device status transitions SNMP device becomes unreachable; auto-resolves on recovery
Discovery Asset staleness Discovered host not seen for > 24 hours; resolves when host reappears
Hardware Forecast Predicted hardware failures Disk, CPU, or memory forecasted to reach capacity within threshold
Process Baseline Unauthorized process execution LOTL (living off the land) binary detected, unknown process in enforcement mode
Exfiltration Network anomalies Large outbound transfers, connections to unknown destinations, tunneling port usage
Fingerprint Drift Configuration drift detected Service, package, or configuration changed from established baseline
Host Group Health Health score degradation Host group health drops below threshold; auto-resolves on recovery

Event alert rules

Event alert rules control which event types generate alerts. System rules are seeded at installation for common scenarios. You can create custom event rules to adjust severity, notification channels, and scope.

  1. Navigate to Alerts > Alert Rules > Event Rules tab.
  2. Click Create Event Rule.
  3. Select the event source (e.g. ad_security, vulnerability, snmp_status).
  4. Optionally filter by event type within that source.
  5. Set the severity for alerts generated by this rule.
  6. Attach notification channels and configure advanced settings (re-notification, escalation).
  7. Click Save.
Unified pipeline Event alerts flow through the same pipeline as metric alerts: incident correlation (grouping by host within a 5-minute window), causal investigation (with event-specific heuristics), remediation matching (via event source on runbooks), and webhook dispatch. The only difference is the trigger mechanism.

Causal investigation for events

The causal engine runs event-specific heuristics when investigating event-sourced alerts. Each event source has specialized analysis patterns:

  • Process baseline LOTL — Confidence 0.9 for living-off-the-land binary detection
  • Exfiltration + network spike — Confidence 0.85 when exfiltration alert correlates with network anomalies
  • AD brute force — Confidence 0.85 for account lockout events
  • Vulnerability — Confidence 0.7 for new critical CVE detection
  • SNMP device unreachable — Confidence 0.7-0.8 depending on device type

Cascading runbook detection

When a remediation runbook triggers another alert which triggers another runbook, a cascading loop can occur. The remediation engine includes a circuit breaker that detects and stops cascading runbook chains.

  • The engine tracks the depth of remediation chains per incident.
  • If the chain depth exceeds the configured threshold, the circuit breaker trips and prevents further automated remediation.
  • A notification is sent when the circuit breaker activates, requiring manual intervention to resolve the underlying issue.
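
The chain-depth tracking above can be sketched as a small guard object. The depth limit and structure are assumptions, not the engine's actual configuration:

```python
# Sketch of the cascading-runbook circuit breaker: remediation chain depth
# is tracked per incident, and dispatch is blocked past a limit. The limit
# value and class shape are illustrative assumptions.
MAX_CHAIN_DEPTH = 3

class RemediationGuard:
    def __init__(self, max_depth=MAX_CHAIN_DEPTH):
        self.max_depth = max_depth
        self.depth = {}        # incident_id -> current chain depth
        self.tripped = set()   # incidents requiring manual intervention

    def allow(self, incident_id: str) -> bool:
        """Call before dispatching a runbook triggered by this incident."""
        d = self.depth.get(incident_id, 0) + 1
        self.depth[incident_id] = d
        if d > self.max_depth:
            self.tripped.add(incident_id)  # a notification would fire here
            return False
        return True
```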

Automation pause (kill switch)

In an emergency, you can pause all automated remediation and alert-driven actions for an entire organization. This is a per-org safety mechanism that stops the remediation engine from dispatching any new remediation executions while investigation is underway.

  1. Navigate to Settings > Automation, select the organization, and click Pause Automation. Enter an optional reason (e.g., "Investigating cascading remediation failure").
  2. All automation for the organization is immediately suspended. The pause is recorded with the user, timestamp, and reason.
  3. To resume: return to Settings > Automation and click Resume Automation.
  4. The current pause/resume status is shown on the same page.
Alerts still fire while paused. The automation pause only stops automated remediation dispatch. Alert evaluation, incident creation, and notification delivery continue normally. This is intentional -- you still need visibility into what is happening, you just want to stop automated responses from executing.
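
Conceptually, the kill switch is a per-org flag that only the remediation dispatcher consults -- alert evaluation never checks it. A minimal sketch, with illustrative shapes:

```python
# Sketch of the per-org automation pause. Alert evaluation continues
# regardless; only remediation dispatch consults this state. The class
# shape and fields are assumptions for illustration.
class AutomationState:
    def __init__(self):
        self.paused = {}   # org_id -> {"by": user, "reason": text}

    def pause(self, org_id, user, reason=""):
        # Recorded with user and reason, as described in step 2.
        self.paused[org_id] = {"by": user, "reason": reason}

    def resume(self, org_id):
        self.paused.pop(org_id, None)

    def can_dispatch(self, org_id) -> bool:
        """Checked by the remediation engine before any new execution."""
        return org_id not in self.paused
```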

Permissions reference

Action Permission
View alerts, rules, and incidents alerts.view
Create, edit, delete alert rules alerts.manage
Acknowledge and resolve alerts alerts.manage
Bulk alert actions (up to 500) alerts.manage
View and manage runbooks alerts.manage
Approve or cancel remediation executions alerts.manage
View monitoring configuration monitoring.view
Edit monitoring configuration monitoring.configure
Pause/resume automation (kill switch) settings.manage

Troubleshooting

Symptom Cause Fix
Alert rule not firing Duration window not sustained, or host offline Verify metric values are consistently above threshold for the full duration window. Offline hosts are excluded from evaluation.
No notification on new alert Alert correlated to existing incident (noise suppression) Check the Alert Incidents page -- the alert may have been added to an existing incident instead of creating a new one.
"System-generated rules cannot be modified" Attempting to edit a system rule System rules are immutable. Create a custom rule with your desired settings instead.
Alerts auto-resolving unexpectedly Host went offline for > 4 hours The system auto-resolves alerts for hosts offline more than 4 hours. This is expected behavior.
External notifications not delivered Channel misconfigured or inactive Test the notification channel. Verify the channel is enabled and the configuration is valid.
No real-time notifications WebSocket disconnected Refresh the page. Check the browser console for WebSocket connection errors.
Remediation stuck in pending_approval Approval required is enabled on the runbook Manually approve via the Remediation Executions page, or configure the auto-approve health threshold on the runbook.
Runbook not triggering for an alert Trigger conditions don't match Verify the runbook's trigger conditions (metric type, cause pattern, severity, and heuristic rule) match the alert and its investigation result.