Alerts & Monitoring
Alert quality, triage workflow, escalation posture, and the signal model teams use to keep monitoring actionable.
Scope
Monitoring only helps when the signals that reach humans are worth acting on. This public page keeps the operating guidance while omitting internal plumbing and API detail.
Step-by-step guides for alert rule creation, incident management, notifications, and templates.
Related documents:
- Architecture: docs/architecture/alerts-monitoring.md
- Functional: docs/functional/alerts-monitoring.md
- RBAC setup: docs/manual/roles-permissions.md
Creating Metric Alert Rules
- severity: one of critical, high, medium, low, info
- Optional notification settings:
  - notification_channels: array of channel IDs for alert delivery
  - re_notification_interval_minutes: resend if the alert is still active
  - escalation_threshold: occurrence count that triggers escalation
  - escalation_to: channel ID for escalation
- Optional trigger actions
System rules: Default rules are created per org during onboarding (CPU 85%/95%, Memory 90%/97%, Disk 80%/95%, Swap 80%, Process Count 500). These cannot be modified or deleted.
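The fields above can be sketched as a rule payload with a minimal sanity check. The field names mirror the list above, but the shape of the create call itself is not shown on this page, so treat the structure as an illustrative assumption:

```python
# Illustrative metric alert rule payload; field names follow the list above,
# but the overall payload shape is an assumption, not the documented API.
ALLOWED_SEVERITIES = {"critical", "high", "medium", "low", "info"}

rule = {
    "severity": "high",
    "notification_channels": [101, 102],     # channel IDs for delivery
    "re_notification_interval_minutes": 30,  # resend while still active
    "escalation_threshold": 3,               # occurrences before escalating
    "escalation_to": 200,                    # escalation channel ID
}

def validate(rule: dict) -> bool:
    """Minimal sanity check for a rule's notification settings."""
    if rule["severity"] not in ALLOWED_SEVERITIES:
        return False
    if rule.get("escalation_threshold") is not None and rule.get("escalation_to") is None:
        return False  # escalation needs a target channel
    return True
```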
Creating Event Alert Rules
- severity, plus an optional severity_filter (drop events below this level)
- Optional mute_until (datetime to silence alerts until)
- Event sources are validated against the event_source_registry table.
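A minimal sketch of how severity_filter and mute_until interact when an event arrives; the severity ranking and the function name are assumptions for illustration:

```python
from datetime import datetime, timezone

SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}

# Illustrative event alert rule; severity_filter and mute_until follow the
# fields described above, the rest of the shape is an assumption.
rule = {
    "severity": "high",
    "severity_filter": "medium",  # drop events below this level
    "mute_until": datetime(2025, 1, 1, tzinfo=timezone.utc),
}

def should_fire(rule: dict, event_severity: str, now: datetime) -> bool:
    """Return True if an event passes the mute window and severity filter."""
    if rule.get("mute_until") and now < rule["mute_until"]:
        return False  # silenced until mute_until passes
    return SEVERITY_RANK[event_severity] >= SEVERITY_RANK[rule["severity_filter"]]
```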
Managing Alert Snooze
To temporarily suppress alerts for a host:
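The snooze workflow itself is not shown on this public page. A minimal sketch, assuming snoozes are tracked per host with an expiry timestamp (the store and function names are hypothetical):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-host snooze store; the real workflow is not shown here.
snoozes: dict[str, datetime] = {}

def snooze_host(host_id: str, minutes: int, now: datetime) -> None:
    """Suppress alerts for a host until now + minutes."""
    snoozes[host_id] = now + timedelta(minutes=minutes)

def is_snoozed(host_id: str, now: datetime) -> bool:
    """True while the host's snooze window is still open."""
    until = snoozes.get(host_id)
    return until is not None and now < until
```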
Working with Incidents
View incidents:
Manage incidents:
Correlation behavior:
- Per-host: alerts on the same host within 5 minutes group into one incident
- Cross-host (opt-in): enable cross_host_correlation_enabled on the organization. When 3+ hosts trigger the same rule within 5 minutes, a single cross-host incident is created.
- Storm: if 20+ alerts fire per minute for the same rule, they aggregate into a [STORM] incident
Auto-resolution: incidents auto-resolve when all member alerts resolve.
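The per-host grouping rule above can be sketched as a simple window check. This is a simplified illustration, not the engine's actual implementation; cross-host and storm aggregation are omitted:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def group_per_host(alerts):
    """Group (host, fired_at) alerts into incidents: alerts on the same host
    within 5 minutes of an incident's first alert join that incident."""
    incidents = []  # each: {"host": ..., "start": ..., "alerts": [...]}
    for host, fired_at in sorted(alerts, key=lambda a: a[1]):
        for inc in incidents:
            if inc["host"] == host and fired_at - inc["start"] <= WINDOW:
                inc["alerts"].append(fired_at)
                break
        else:
            incidents.append({"host": host, "start": fired_at, "alerts": [fired_at]})
    return incidents
```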
Understanding Dependency-Aware Suppression
When configured:
- If an upstream host (per service map dependencies) has an active critical/high alert, downstream non-critical alerts are suppressed
- This prevents secondary alert noise during cascading failures
- Critical alerts are NEVER suppressed (always fire)
- To configure: create HostGroupDependency records in the service map
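The suppression rule above reduces to a small check. A sketch, assuming the service map is available as a mapping from downstream host to its upstream hosts; the names are illustrative and do not reflect the HostGroupDependency schema:

```python
# Sketch of dependency-aware suppression: a downstream, non-critical alert is
# suppressed while an upstream host has an active critical/high alert.
# `dependencies` maps downstream host -> upstream hosts (from the service map).

def is_suppressed(alert_host, alert_severity, dependencies, active_alerts):
    """active_alerts maps host -> severity of its currently active alert."""
    if alert_severity == "critical":
        return False  # critical alerts always fire, never suppressed
    for upstream in dependencies.get(alert_host, []):
        if active_alerts.get(upstream) in {"critical", "high"}:
            return True
    return False
```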
Managing Notification Channels
To configure notification channels, go to Alerting > Channels.
- Reference channels in alert rules via the notification_channels array
Pausing All Automation (Kill Switch)
In an emergency, pause all automated actions for an organization:
Notification Routing Rules
To configure routing rules, go to Alerting > Routing Rules.
Route alerts to different channels based on conditions. Rules are evaluated in priority order; first match wins.
Creating a Routing Rule
- name: rule name (unique per account)
- priority: higher number = evaluated first (descending order)
- severity: ["critical", "high"] matches alerts with these severities (OR logic)
- alert_type: ["metric", "event"] matches metric-based or event-based alerts
- source_type: ["snmp", "agent", "active_directory"] matches by source
- time_window: {"start": "18:00", "end": "06:00", "timezone": "America/New_York"} matches during specific hours (supports overnight windows)
- suppress: set true to silence matching alerts (no notification is sent)
All conditions use AND logic (all must match). Within each condition, values use OR logic (any can match). Empty conditions = match everything.
Permission required: alerts.manage
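The condition logic above (AND across conditions, OR within each) can be sketched as a first-match evaluator; the time_window condition is omitted for brevity and the function names are illustrative:

```python
def matches(rule, alert):
    """AND across conditions, OR within a condition's list.
    Empty or absent conditions match everything."""
    for field in ("severity", "alert_type", "source_type"):
        allowed = rule.get(field)
        if allowed and alert.get(field) not in allowed:
            return False
    return True

def route(rules, alert):
    """Evaluate in descending priority order; first match wins."""
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if matches(rule, alert):
            return rule
    return None  # fallback: default alert behavior applies
```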
Testing a Rule
Before activating, test your rules against a simulated alert:
- The test returns which rule matched, whether it suppressed the alert, and the target channel IDs
- If fallback: true, no rules matched and default alert behavior applies
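A sketch of interpreting a simulated test result; the result keys (matched_rule, suppressed, channel_ids, fallback) mirror the fields described above, but the exact response shape is an assumption:

```python
def interpret(result: dict) -> str:
    """Summarize a simulated routing-rule test result (assumed key names)."""
    if result.get("fallback"):
        return "no rules matched; default alert behavior applies"
    if result.get("suppressed"):
        return f"matched {result['matched_rule']}; alert suppressed"
    return f"matched {result['matched_rule']}; notify channels {result['channel_ids']}"
```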
Managing Rule Priority
Control evaluation order:
- Higher priority values are evaluated first
- The first matching rule stops evaluation
- Reorder updates are scope-enforced: only rules in your effective organization scope (plus account-wide rules) are updated; out-of-scope IDs are ignored
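The scope-enforced reorder described above can be sketched as follows; the rule fields (`id`, `org`, `priority`) and the convention that `org=None` means account-wide are assumptions for illustration:

```python
def reorder(rules, ordered_ids, org_scope):
    """Apply a priority reorder. Out-of-scope rule IDs are silently ignored;
    rules with org=None are account-wide and always in scope (sketch only)."""
    by_id = {r["id"]: r for r in rules}
    in_scope = [i for i in ordered_ids
                if i in by_id and (by_id[i]["org"] is None or by_id[i]["org"] in org_scope)]
    # Highest priority goes to the first ID in the requested order.
    for pos, rule_id in enumerate(in_scope):
        by_id[rule_id]["priority"] = len(in_scope) - pos
    return rules
```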
How Routing Integrates with Existing Alerts
- When an alert fires, routing rules are evaluated before the alert rule’s own channels
- If a routing rule matches with suppress: true, no notification is sent
- If no routing rules match, the alert rule's own channels are used (fallback behavior)
- Routing works for both metric alerts (core/alert_engine.py) and event alerts (core/event_alert_adapter.py)
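The precedence described above (routing first, suppress short-circuits, fall back to the rule's own channels) can be sketched in a few lines; only the severity condition is modeled here and the function name is illustrative:

```python
def deliver(alert, routing_rules, own_channels):
    """Routing rules are consulted before the alert rule's own channels."""
    for rule in sorted(routing_rules, key=lambda r: r["priority"], reverse=True):
        allowed = rule.get("severity")
        if allowed and alert["severity"] not in allowed:
            continue                              # condition not met, next rule
        if rule.get("suppress"):
            return []                             # suppressed: no notification
        return rule["notification_channels"]      # routed channels win
    return own_channels                           # fallback: rule's own channels
```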
Example: After-Hours Escalation
Route critical alerts to PagerDuty during off-hours and to email during business hours. The after-hours rule evaluates first (higher priority); during business hours its time window won't match, so the business-hours rule catches the alert.
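The scenario above can be sketched with two rules and an overnight-aware window check. Timezone handling is omitted for brevity, and the channel names and rule fields are illustrative assumptions:

```python
from datetime import time

def in_window(t, start, end):
    """True if t falls in [start, end), handling overnight windows."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end  # overnight window, e.g. 18:00 to 06:00

# Illustrative rules for the after-hours escalation scenario above.
after_hours = {"priority": 20, "severity": ["critical"],
               "window": (time(18, 0), time(6, 0)), "channel": "pagerduty"}
business = {"priority": 10, "severity": ["critical"],
            "window": (time(6, 0), time(18, 0)), "channel": "email"}

def route(alert_severity, now_t):
    """Descending-priority, first-match routing over the two rules."""
    for rule in sorted([after_hours, business],
                       key=lambda r: r["priority"], reverse=True):
        if alert_severity in rule["severity"] and in_window(now_t, *rule["window"]):
            return rule["channel"]
    return None  # no match: default alert behavior
```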