Prerequisites
- Role with alerts.view for viewing SLA definitions, breach events, problems, and incident categories
- Role with alerts.manage for creating/editing SLA definitions, managing problems, and linking incidents
- Role with settings.manage for creating and managing change freeze windows
- Active monitoring with alert rules configured so incidents are created for SLA tracking
SLA definitions
SLA definitions set response and resolution time targets for each severity level. They can be scoped to an entire organization or a specific host group, with optional business hours configuration.
- Navigate to SLA Definitions from the sidebar.
- Click Create SLA Definition.
- Enter a name and optional description.
- Select the scope type: organization (applies to all incidents in the org) or host_group (applies only to incidents on hosts in that group).
- Set severity targets for each level: both a response time (minutes until someone acknowledges) and a resolution time (minutes until the incident is resolved).
- Optionally configure business hours with timezone, work start/end times, and working days. Leave blank for a 24/7 SLA.
- Add escalation rules to trigger notifications when breach thresholds are crossed (e.g., notify on-call after 30 minutes, notify manager after 120 minutes).
- Click Save.
Severity targets
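Each severity level requires two values: a response time target (minutes until someone acknowledges the incident) and a resolution time target (minutes until the incident is resolved).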
Scope resolution
When a new incident is created, the SLA engine assigns the most specific matching SLA definition. Host group SLAs take priority over organization-level defaults. If no SLA matches, the incident has no SLA tracking.
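The matching rule above can be sketched in a few lines. This is only an illustration, not the product's implementation: the `SlaDefinition` class, its field names, and the `resolve_sla` function are hypothetical, and the sketch assumes a host can belong to several host groups.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class SlaDefinition:
    # Hypothetical shape; the real schema is not documented here.
    name: str
    scope_type: str                      # "organization" or "host_group"
    org_id: str
    host_group_id: Optional[str] = None  # only set for host_group scope


def resolve_sla(definitions: Iterable[SlaDefinition], org_id: str,
                host_group_ids: Set[str]) -> Optional[SlaDefinition]:
    """Return the most specific matching definition, or None if nothing matches."""
    org_default = None
    for d in definitions:
        if d.org_id != org_id:
            continue
        if d.scope_type == "host_group" and d.host_group_id in host_group_ids:
            return d  # a host group SLA takes priority over the org default
        if d.scope_type == "organization" and org_default is None:
            org_default = d
    return org_default  # may be None: the incident then has no SLA tracking
```

A `None` result corresponds to the "no SLA tracking" case described above.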
SLA breach detection
The SLA engine runs every 30 seconds as part of the alert evaluation loop. It checks all open incidents with SLA definitions for response and resolution breaches.
- For each open incident with an assigned SLA, the engine calculates elapsed business minutes since incident creation.
- If the incident has not been acknowledged and elapsed time exceeds the response target, a response breach event is created.
- If the incident has not been resolved and elapsed time exceeds the resolution target, a resolution breach event is created.
- Breach events are recorded once per incident per type — the system does not create duplicate breach records.
- When a breach is detected, escalation notifications fire according to the SLA definition's escalation rules.
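One evaluation pass over a single incident might look like the sketch below. It is a simplified illustration under stated assumptions: the `Incident` class and `check_breaches` function are hypothetical, and elapsed time is measured in wall-clock minutes (a 24/7 SLA) rather than business minutes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Set


@dataclass
class Incident:
    # Hypothetical fields for illustration only.
    created_at: datetime
    response_target_min: int
    resolution_target_min: int
    acknowledged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    breaches: Set[str] = field(default_factory=set)  # breach types already recorded


def check_breaches(incident: Incident, now: datetime) -> List[str]:
    """Return breach types newly detected in this pass and record them."""
    # Assumption: 24/7 SLA, so elapsed time is plain wall-clock minutes.
    elapsed = (now - incident.created_at).total_seconds() / 60
    detected = []
    if incident.acknowledged_at is None and elapsed > incident.response_target_min:
        detected.append("response")
    if incident.resolved_at is None and elapsed > incident.resolution_target_min:
        detected.append("resolution")
    # Record each breach type at most once per incident.
    new = [b for b in detected if b not in incident.breaches]
    incident.breaches.update(new)
    return new
```

Because recorded breach types are kept on the incident, running the check again 30 seconds later returns nothing new for the same breach.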
Escalation notifications
Each SLA definition can have multiple escalation rules with different time thresholds and notification targets. When a breach is detected, the system evaluates all escalation rules and fires notifications for every rule where the breach duration exceeds the configured time threshold.
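As a rough sketch of that evaluation, the snippet below filters rules by threshold. The `EscalationRule` class, its fields, and `rules_to_fire` are hypothetical names chosen for illustration; the thresholds reuse the 30-minute and 120-minute example from the SLA definition steps.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationRule:
    after_minutes: int  # time threshold for this rule
    target: str         # who gets notified (hypothetical field)


def rules_to_fire(rules: List[EscalationRule], breach_minutes: float) -> List[EscalationRule]:
    """Return every rule whose threshold the breach duration has exceeded."""
    return [r for r in rules if breach_minutes > r.after_minutes]


rules = [EscalationRule(30, "on-call"), EscalationRule(120, "manager")]
print([r.target for r in rules_to_fire(rules, 45)])
```

The sketch does not track which rules have already fired; whether the product suppresses repeat notifications for the same rule is not stated here.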
Incident enrichment
When a new incident is created (from either metric alerts or event alerts), the SLA engine automatically enriches it with ITIL metadata.
Auto-categorization
Incidents are automatically categorized based on their source. The engine maps event sources (AD security, vulnerability, SNMP, discovery, etc.) and metric types (CPU, memory, disk, network) to ITIL incident categories and subcategories. Categories follow a hierarchical taxonomy seeded at installation.
Priority matrix
Incident priority is calculated using an ITIL 4x4 impact/urgency grid:
| | Urgency: Critical | Urgency: High | Urgency: Medium | Urgency: Low |
|---|---|---|---|---|
| Impact: Critical | P1 | P1 | P2 | P2 |
| Impact: High | P1 | P2 | P2 | P3 |
| Impact: Medium | P2 | P2 | P3 | P3 |
| Impact: Low | P2 | P3 | P3 | P4 |
Impact is derived from alert severity, the number of affected hosts, and the criticality of the host group. Urgency maps directly from severity.
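The grid can be read as a simple lookup, with impact selecting the row and urgency selecting the column. The sketch below encodes the table values as given; the `priority` function and the lowercase level names are illustrative assumptions, and the derivation of impact from severity, host count, and host group criticality is not modeled.

```python
LEVELS = ("critical", "high", "medium", "low")  # column order: urgency

# Rows are impact; each tuple follows the urgency order in LEVELS.
PRIORITY_MATRIX = {
    "critical": ("P1", "P1", "P2", "P2"),
    "high":     ("P1", "P2", "P2", "P3"),
    "medium":   ("P2", "P2", "P3", "P3"),
    "low":      ("P2", "P3", "P3", "P4"),
}


def priority(impact: str, urgency: str) -> str:
    """Look up the priority for an impact/urgency pair."""
    return PRIORITY_MATRIX[impact][LEVELS.index(urgency)]


print(priority("high", "low"))  # P3
```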
SLA assignment
After categorization and priority calculation, the system assigns the most specific matching SLA definition and sets the response due and resolution due timestamps on the incident.
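For a 24/7 SLA, the due timestamps reduce to adding the targets to the creation time, as in this minimal sketch. The `due_timestamps` function is a hypothetical name, and an SLA with business hours would need calendar-aware arithmetic that is not shown.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def due_timestamps(created_at: datetime, response_min: int,
                   resolution_min: int) -> Tuple[datetime, datetime]:
    """Return (response_due, resolution_due) for a 24/7 SLA."""
    # Assumption: no business hours configured, so minutes are wall-clock.
    response_due = created_at + timedelta(minutes=response_min)
    resolution_due = created_at + timedelta(minutes=resolution_min)
    return response_due, resolution_due
```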
Problem management (KEDB)
The Problems page serves as a Known Error Database (KEDB). Problems represent recurring issues that cause multiple incidents. Linking incidents to problems provides root cause context and helps track permanent fixes.
- Navigate to Problems from the sidebar.
- Click Create Problem.
- Enter a title, description, root cause (if known), workaround, and permanent fix details.
- Set the status: open, investigating, known_error, resolved, or closed.
- Link related incidents by selecting them from the incident list.
- Click Save.
Problem statuses
Incident linking
Link incidents to a problem to track which incidents are caused by the same underlying issue. Unlinking removes the association without deleting either record. The problem's incident count and affected host count update automatically.
Automated problem detection
The problem detection engine runs daily as a scheduled job. It scans closed incidents to find recurring patterns that indicate a systemic issue.
Detection tiers
| Tier | Pattern | Criteria |
|---|---|---|
| Tier 1: Cross-host | Same alert rule firing across multiple hosts in an organization | 3+ incidents from the same alert rule, affecting 2+ distinct hosts |
| Tier 2: Per-host recurrence | Same alert rule firing repeatedly on a single host | 3+ incidents from the same alert rule on the same host |
When a pattern is detected, the engine automatically creates a Problem record with detection metadata (alert rule, affected hosts, incident count, time range). These auto-detected problems appear in the KEDB alongside manually created ones.
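The two tiers can be approximated with a grouping pass over closed incidents, as sketched below. The `detect_patterns` function, the tuple layout, and the tier labels are hypothetical. The sketch also assumes the input is already limited to one organization and one scan window, and that a single alert rule may match both tiers at once; the product's actual behavior in that case is not documented here.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def detect_patterns(incidents: Iterable[Tuple[str, str]],
                    min_incidents: int = 3,
                    min_hosts: int = 2) -> List[Tuple[str, str, List[str]]]:
    """incidents: (alert_rule_id, host_id) pairs for closed incidents in one org."""
    hosts_by_rule = defaultdict(list)
    for rule_id, host_id in incidents:
        hosts_by_rule[rule_id].append(host_id)

    findings = []
    for rule_id, hosts in hosts_by_rule.items():
        distinct = sorted(set(hosts))
        # Tier 1: 3+ incidents from the same rule across 2+ distinct hosts.
        if len(hosts) >= min_incidents and len(distinct) >= min_hosts:
            findings.append(("tier1_cross_host", rule_id, distinct))
        # Tier 2: 3+ incidents from the same rule on a single host.
        for host in distinct:
            if hosts.count(host) >= min_incidents:
                findings.append(("tier2_per_host", rule_id, [host]))
    return findings
```

Each finding carries the alert rule and affected hosts, which is roughly the detection metadata a Problem record would need.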
Incident templates
Five system incident templates are seeded at installation for common incident types. They provide structured response procedures and pre-defined severity levels.
| Template | Default Severity | Description |
|---|---|---|
| Service Outage | Critical | Complete or partial service unavailability affecting users or business operations. |
| Security Breach | Critical | Confirmed or suspected unauthorized access, data exfiltration, or policy violation. |
| Performance Degradation | High | Service available but operating below acceptable performance thresholds. |
| Network Connectivity | High | Network path failures, high latency, or intermittent connectivity issues. |
| Compliance Violation | Medium | Detected deviation from compliance policies or regulatory requirements. |
System templates cannot be modified or deleted. To customize one, clone it and adjust the response procedures to match your operational requirements.
Change freeze windows
Change freeze windows block automated changes during sensitive periods such as quarter-end, audit windows, or deployment freezes. When a freeze is active, the patch engine and remediation engine respect the freeze and hold all changes.
- Navigate to Settings > Change Freeze.
- Click Create Freeze Window.
- Select the target organization.
- Set the start and end dates/times (UTC).
- Select the scope to control what is blocked.
- Enter a reason for the freeze.
- Click Save.
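An engine that respects freezes needs a check along these lines before applying a change. This is a sketch under assumptions: the `FreezeWindow` class, its fields, and `is_change_blocked` are hypothetical, all timestamps are timezone-aware UTC values, and scope filtering is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable


@dataclass
class FreezeWindow:
    # Hypothetical shape for illustration only.
    org_id: str
    starts_at: datetime  # UTC
    ends_at: datetime    # UTC
    reason: str
    active: bool = True  # False once deactivated, e.g. by an emergency override


def is_change_blocked(windows: Iterable[FreezeWindow], org_id: str, now: datetime) -> bool:
    """True if any active freeze window for the org covers the given time."""
    return any(
        w.active and w.org_id == org_id and w.starts_at <= now < w.ends_at
        for w in windows
    )
```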
Freeze scopes
Emergency override
If an urgent change is needed during a freeze, users with settings.manage can issue an emergency override. The override requires a written justification, records who approved it and when, and creates a full audit trail. The freeze is deactivated but the override record is preserved.
Major incident auto-escalation
P1 (priority 1) incidents trigger automatic escalation notifications to ensure immediate attention from senior staff. When an incident is classified as P1 through the ITIL priority matrix, the system automatically:
- Sends critical-severity notifications to all configured notification channels
- Fires SLA breach escalation rules if applicable
- Records the escalation in the incident timeline
This ensures that major incidents receive immediate visibility without waiting for manual triage.
Incident categories
Cadres ships with a hierarchical incident category taxonomy based on ITIL best practices. Categories are seeded at installation with six top-level categories, each containing subcategories.
Categories are used for incident classification, reporting, and trend analysis. The auto-categorization engine assigns categories automatically based on alert source, but operators can override the category on any incident.
SLA dashboard
The SLA Dashboard provides a 30-day breach trend visualization, summary statistics, and SLA definition status at a glance.
- Breach trend chart — 30-day view of response and resolution breaches by day.
- Stats cards — Total active SLAs, total breaches (response + resolution), and current compliance percentage.
- SLA definition list — All active SLA definitions with their scope, target values, and current breach count.
- At-risk indicator — Incidents show an "At Risk" warning badge (yellow) when less than 20% of SLA time remains, before the actual breach occurs.
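The at-risk threshold can be illustrated with a small classification function. The `sla_status` name and its return labels are hypothetical, and the sketch assumes the 20% is measured against the full target duration.

```python
def sla_status(elapsed_min: float, target_min: float,
               at_risk_fraction: float = 0.20) -> str:
    """Classify an open incident against one SLA target."""
    remaining = target_min - elapsed_min
    if remaining < 0:
        return "breached"  # elapsed time exceeds the target
    if remaining < at_risk_fraction * target_min:
        return "at_risk"   # less than 20% of the SLA time remains
    return "ok"


print(sla_status(50, 60))  # at_risk: 10 minutes left of a 60-minute target
```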
Permissions reference
| Action | Permission |
|---|---|
| View SLA definitions, breaches, problems, categories | alerts.view |
| Create/edit/delete SLA definitions and problems | alerts.manage |
| Link/unlink incidents to problems | alerts.manage |
| Create/delete change freeze windows | settings.manage |
| Emergency override of change freeze | settings.manage |
Navigation reference
| Feature | Location |
|---|---|
| SLA Definitions | SLA > SLA Definitions — create, edit, deactivate, and view SLA definitions |
| SLA Breaches | SLA > Breaches — view breach events and details |
| Incident Categories | SLA > Incident Categories — view the hierarchical category taxonomy |
| Problems (KEDB) | Problems — create, edit, delete problems; link and unlink incidents |
| Change Freeze Windows | Settings > Change Freeze — create, deactivate, and override freeze windows |