IT Service & Operations Manual

Alerts & Monitoring

Alert quality, triage workflow, escalation posture, and the signal model teams use to keep monitoring actionable.

Audience: Operations and NOC-style teams
Focus: Monitoring signal and triage
Status: Public manual

Scope

Monitoring only helps when the signals arriving in front of humans are worth acting on. This public page keeps the operating guidance and strips internal plumbing and API detail.

This page provides step-by-step guides for alert rule creation, incident management, notifications, and templates.

Related documents:

  • Architecture: docs/architecture/alerts-monitoring.md
  • Functional: docs/functional/alerts-monitoring.md
  • RBAC setup: docs/manual/roles-permissions.md

Creating Metric Alert Rules

Key fields when defining a metric alert rule:

  • severity (critical, high, medium, low, info)
  • Optional notification settings:
  • notification_channels: array of channel IDs for alert delivery
  • re_notification_interval_minutes: resend if still active
  • escalation_threshold: occurrence count to trigger escalation
  • escalation_to: channel ID for escalation
  • Optional trigger actions:

System rules: Default rules are created per org during onboarding (CPU 85%/95%, Memory 90%/97%, Disk 80%/95%, Swap 80%, Process Count 500). These cannot be modified or deleted.
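Taken together, the fields above might look like the following rule definition. Field names follow this manual, but the payload shape itself is illustrative; this public page omits the actual API, and the `name`, `metric`, and `threshold` fields here are assumptions.

```python
# Hypothetical metric alert rule, sketched from the fields described above.
# "name", "metric", and "threshold" are assumed field names; the rest mirror
# this manual's documented options.
cpu_rule = {
    "name": "cpu-high",
    "metric": "cpu_percent",                  # assumed metric identifier
    "threshold": 90,
    "severity": "high",                       # one of: critical/high/medium/low/info
    "notification_channels": ["chan-ops-email"],
    "re_notification_interval_minutes": 30,   # resend while still active
    "escalation_threshold": 3,                # occurrences before escalation
    "escalation_to": "chan-oncall-pager",     # escalation channel ID
}
```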

Creating Event Alert Rules

  • severity, optional severity_filter (drop events below this level)
  • Optional mute_until (datetime to silence until)
  • Event sources are validated against the event_source_registry table.
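An event rule sketch under the same caveat: field names follow this manual, the overall shape is assumed, and the event source shown must exist in the event_source_registry table.

```python
# Hypothetical event alert rule using the fields described above.
# The event_source value is illustrative and would be validated against
# the event_source_registry table.
event_rule = {
    "name": "ad-auth-failures",               # assumed field name
    "event_source": "active_directory",
    "severity": "medium",
    "severity_filter": "low",                 # drop events below this level
    "mute_until": "2025-07-01T06:00:00Z",     # optional: silence until this time
}
```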

Managing Alert Snooze

To temporarily suppress alerts for a host:
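The actual snooze workflow is stripped from this public page; conceptually it amounts to recording a per-host expiry and checking it before notifying. A minimal sketch, with `snooze_host` and `is_snoozed` as hypothetical helper names:

```python
from datetime import datetime, timezone

# Illustrative snooze bookkeeping (not the product's actual code):
# a snooze maps a host ID to the time at which alerts resume.
snoozes = {}  # host_id -> snooze-until (timezone-aware datetime)

def snooze_host(host_id, until):
    snoozes[host_id] = until

def is_snoozed(host_id, now=None):
    now = now or datetime.now(timezone.utc)
    until = snoozes.get(host_id)
    return until is not None and now < until
```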

Working with Incidents

View incidents:

Manage incidents:

Correlation behavior:

  • Per-host: alerts on the same host within 5 minutes group into one incident
  • Cross-host (opt-in): enable cross_host_correlation_enabled on the organization. When 3+ hosts trigger the same rule within 5 minutes, a single cross-host incident is created.
  • Storm: if 20+ alerts fire per minute for the same rule, they aggregate into a [STORM] incident

Auto-resolution: incidents auto-resolve when all member alerts resolve.
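The per-host grouping above can be sketched as follows. This is a simplified model, not the engine's actual code, and it interprets "within 5 minutes" as a rolling gap from the incident's most recent alert; the real correlation window may differ.

```python
from datetime import datetime, timedelta

CORRELATION_WINDOW = timedelta(minutes=5)

def group_per_host(alerts):
    """Sketch of per-host correlation: alerts on the same host within the
    window collapse into one incident. `alerts` is a list of
    (host, fired_at) tuples; returns a list of incident dicts."""
    incidents = []   # each incident: {"host": ..., "times": [...]}
    latest = {}      # host -> that host's most recent open incident
    for host, fired_at in sorted(alerts, key=lambda a: a[1]):
        inc = latest.get(host)
        if inc and fired_at - inc["times"][-1] <= CORRELATION_WINDOW:
            inc["times"].append(fired_at)        # joins the open incident
        else:
            inc = {"host": host, "times": [fired_at]}  # new incident
            incidents.append(inc)
            latest[host] = inc
    return incidents
```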

Understanding Dependency-Aware Suppression

When configured:

  • If an upstream host (per service map dependencies) has an active critical/high alert, downstream non-critical alerts are suppressed
  • This prevents secondary alert noise during cascading failures
  • Critical alerts are NEVER suppressed (always fire)
  • To configure: create HostGroupDependency records in the service map
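The suppression decision reduces to a small predicate. A sketch, with `should_suppress` as a hypothetical helper name and alert dicts as an assumed shape:

```python
def should_suppress(alert, upstream_alerts):
    """Dependency-aware suppression sketch: suppress a downstream
    non-critical alert when any upstream host (per the service map)
    has an active critical/high alert. Critical alerts never suppress."""
    if alert["severity"] == "critical":
        return False  # critical alerts always fire
    return any(a["severity"] in ("critical", "high") for a in upstream_alerts)
```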

Managing Notification Channels

To configure notification channels, go to Alerting > Channels.

  1. Reference channels in alert rules via the notification_channels array

Pausing All Automation (Kill Switch)

In an emergency, pause all automated actions for an organization:

Notification Routing Rules

To configure routing rules, go to Alerting > Routing Rules.

Route alerts to different channels based on conditions. Rules are evaluated in priority order; first match wins.

Creating a Routing Rule

  • name: Rule name (unique per account)
  • priority: Higher number = evaluated first (descending order)
  • severity: ["critical", "high"] — match alerts with these severities (OR logic)
  • alert_type: ["metric", "event"] — match metric-based or event-based alerts
  • source_type: ["snmp", "agent", "active_directory"] — match by source
  • time_window: {"start": "18:00", "end": "06:00", "timezone": "America/New_York"} — match during specific hours (supports overnight windows)
  • suppress: Set true to silence matching alerts (no notification sent)

All conditions use AND logic (all must match). Within each condition, values use OR logic (any can match). Empty conditions = match everything.
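The matching semantics above (AND across conditions, OR within a condition, empty conditions match everything, descending priority, first match wins) can be sketched as follows. Field names mirror this manual, but the helpers themselves are illustrative, not the product's actual code.

```python
def rule_matches(rule, alert):
    """AND across conditions; OR within each condition's value list;
    an absent/empty condition matches everything."""
    for field in ("severity", "alert_type", "source_type"):
        wanted = rule.get(field)
        if wanted and alert.get(field) not in wanted:
            return False
    return True

def route(rules, alert):
    """Evaluate rules in descending priority; first match wins.
    Returns None when no rule matches (fallback behavior applies)."""
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if rule_matches(rule, alert):
            return rule
    return None
```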

Permission required: alerts.manage

Testing a Rule

Before activating, test your rules against a simulated alert:

  2. Returns which rule matched, whether it suppressed, and the target channel IDs
  3. If fallback: true, no rules matched; default alert behavior applies

Managing Rule Priority

Control evaluation order:

  2. Higher priority values are evaluated first
  3. First matching rule stops evaluation
  4. Reorder updates are scope-enforced: only rules in your effective organization scope (plus account-wide rules) are updated; out-of-scope IDs are ignored

How Routing Integrates with Existing Alerts

  • When an alert fires, routing rules are evaluated before the alert rule’s own channels
  • If a routing rule matches with suppress: true, no notification is sent
  • If no routing rules match, the alert rule’s own channels are used (fallback behavior)
  • Routing works for both metric alerts (core/alert_engine.py) and event alerts (core/event_alert_adapter.py)

Example: After-Hours Escalation

Route critical alerts to PagerDuty during off-hours, email during business hours:

  3. The after-hours rule evaluates first (higher priority); during business hours it won't match, so the business-hours rule catches it
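The overnight time_window matching this example relies on can be sketched as a wall-clock check. Timezone conversion is omitted here; `in_window` is a hypothetical helper, not the product's implementation.

```python
from datetime import time

def in_window(now, start, end):
    """True if local wall-clock `now` falls in [start, end).
    When start > end the window wraps past midnight, e.g. the
    18:00-06:00 after-hours window from the example above."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # overnight window
```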