Service Management

SLA & ITIL Operations

Define SLA targets per severity, track response and resolution times against business hours, detect SLA breaches automatically, manage problems as a Known Error Database, and enforce change freeze windows during sensitive periods.

Technical Manual
Status: Available

Prerequisites

  • Role with alerts.view for viewing SLA definitions, breach events, problems, and incident categories
  • Role with alerts.manage for creating/editing SLA definitions, managing problems, and linking incidents
  • Role with settings.manage for creating and managing change freeze windows
  • Active monitoring with alert rules configured so incidents are created for SLA tracking

SLA definitions

SLA definitions set response and resolution time targets for each severity level. They can be scoped to an entire organization or a specific host group, with optional business hours configuration.

  1. Navigate to SLA Definitions from the sidebar.
  2. Click Create SLA Definition.
  3. Enter a name and optional description.
  4. Select the scope type: organization (applies to all incidents in the org) or host_group (applies only to incidents on hosts in that group).
  5. Set severity targets for each level — both a response time (minutes until someone acknowledges) and a resolution time (minutes until the incident is resolved).
  6. Optionally configure business hours with timezone, work start/end times, and working days. Leave blank for a 24/7 SLA.
  7. Add escalation rules to trigger notifications when breach thresholds are crossed (e.g., notify on-call after 30 minutes, notify manager after 120 minutes).
  8. Click Save.

Severity targets

Each severity level requires two values:

  • Response Time (minutes): Maximum time (in minutes, minimum 1) before the incident must be acknowledged.
  • Resolution Time (minutes): Maximum time (in minutes, minimum 1) before the incident must be resolved.
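The two values per severity can be modeled as a small validated record. The Python sketch below is illustrative only: the class name SeverityTarget, the field names, the severity keys, and the minute values are assumptions rather than the product's schema. The only rule taken from this manual is the minimum of 1 minute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityTarget:
    """Hypothetical container for one severity level's SLA targets."""
    response_minutes: int    # time allowed before acknowledgement
    resolution_minutes: int  # time allowed before resolution

    def __post_init__(self) -> None:
        # The manual states a minimum of 1 minute for both values.
        for field_name in ("response_minutes", "resolution_minutes"):
            value = getattr(self, field_name)
            if not isinstance(value, int) or value < 1:
                raise ValueError(f"{field_name} must be an integer >= 1, got {value!r}")


# Example values only; choose targets that match your own commitments.
example_targets = {
    "critical": SeverityTarget(response_minutes=15, resolution_minutes=240),
    "high": SeverityTarget(response_minutes=30, resolution_minutes=480),
    "medium": SeverityTarget(response_minutes=120, resolution_minutes=1440),
    "low": SeverityTarget(response_minutes=240, resolution_minutes=2880),
}
```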

Scope resolution

When a new incident is created, the SLA engine assigns the most specific matching SLA definition. Host group SLAs take priority over organization-level defaults. If no SLA matches, the incident has no SLA tracking.
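The precedence rule can be sketched as a two-pass lookup. This is a minimal illustration, not the engine's implementation: SlaDefinition, resolve_sla, and the identifiers used are hypothetical, and the sketch assumes the first matching host group definition wins when several match.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class SlaDefinition:
    name: str
    scope_type: str  # "organization" or "host_group"
    scope_id: str    # hypothetical ID of the organization or host group


def resolve_sla(definitions: Iterable[SlaDefinition],
                org_id: str,
                host_group_ids: Set[str]) -> Optional[SlaDefinition]:
    """Return the most specific matching definition, or None if nothing matches."""
    definitions = list(definitions)
    # Pass 1: host group SLAs take priority.
    for definition in definitions:
        if definition.scope_type == "host_group" and definition.scope_id in host_group_ids:
            return definition
    # Pass 2: fall back to the organization-level default.
    for definition in definitions:
        if definition.scope_type == "organization" and definition.scope_id == org_id:
            return definition
    # No match: the incident has no SLA tracking.
    return None


definitions = [
    SlaDefinition("org-default", "organization", "org-1"),
    SlaDefinition("db-hosts", "host_group", "grp-db"),
]
```

With these definitions, an incident on a host in grp-db resolves to db-hosts, an incident on any other host in org-1 resolves to org-default, and an incident in another organization gets no SLA.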

Business hours calculation: SLA timers only count business hours when business hours are configured. A 240-minute resolution SLA with 9am-5pm business hours means 240 minutes of business time, not wall clock time. For 24/7 SLAs, leave business hours empty.
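To make the difference between business time and wall clock time concrete, the sketch below counts elapsed business minutes between two timestamps. It is an assumption-laden illustration: the function name and parameters are hypothetical, timestamps are taken to be naive datetimes already expressed in the SLA's timezone (timezone conversion is omitted), and working days default to Monday through Friday.

```python
from datetime import datetime, time, timedelta


def business_minutes(start: datetime,
                     end: datetime,
                     work_start: time = time(9, 0),
                     work_end: time = time(17, 0),
                     working_days: frozenset = frozenset({0, 1, 2, 3, 4})) -> int:
    """Count whole minutes between start and end that fall inside business hours."""
    if end <= start:
        return 0
    total_seconds = 0.0
    day = start.date()
    while day <= end.date():
        if day.weekday() in working_days:  # Monday is 0
            window_open = datetime.combine(day, work_start)
            window_close = datetime.combine(day, work_end)
            # Overlap between [start, end] and this day's business window.
            overlap_start = max(start, window_open)
            overlap_end = min(end, window_close)
            if overlap_end > overlap_start:
                total_seconds += (overlap_end - overlap_start).total_seconds()
        day += timedelta(days=1)
    return int(total_seconds // 60)


# Friday 15:00 to Monday 11:00 spans 68 wall clock hours
# but only 2 + 2 business hours.
elapsed = business_minutes(datetime(2024, 3, 1, 15, 0), datetime(2024, 3, 4, 11, 0))
```

In this example elapsed is 240, so a 240-minute resolution target for an incident opened on Friday at 15:00 is reached on Monday at 11:00.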

SLA breach detection

The SLA engine runs every 30 seconds as part of the alert evaluation loop. It checks all open incidents with SLA definitions for response and resolution breaches.

  1. For each open incident with an assigned SLA, the engine calculates elapsed business minutes since incident creation.
  2. If the incident has not been acknowledged and elapsed time exceeds the response target, a response breach event is created.
  3. If the incident has not been resolved and elapsed time exceeds the resolution target, a resolution breach event is created.
  4. Breach events are recorded once per incident per type — the system does not create duplicate breach records.
  5. When a breach is detected, escalation notifications fire according to the SLA definition's escalation rules.
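The steps above can be sketched as a single pass over open incidents. This is a simplified model, not the engine's code: OpenIncident, detect_breaches, and the in-memory set of recorded breaches are hypothetical, and elapsed business minutes are assumed to be computed beforehand.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

BreachKey = Tuple[str, str]  # (incident ID, breach type)


@dataclass
class OpenIncident:
    incident_id: str
    elapsed_business_minutes: int
    response_target: int      # minutes
    resolution_target: int    # minutes
    acknowledged: bool = False


def detect_breaches(incidents: Iterable[OpenIncident],
                    recorded: Set[BreachKey]) -> List[BreachKey]:
    """Return newly detected breaches and remember them in `recorded`."""
    new_events: List[BreachKey] = []
    for incident in incidents:
        checks = (
            # A response breach only applies while the incident is unacknowledged.
            ("response", not incident.acknowledged, incident.response_target),
            # Every incident passed in is open, so resolution is always pending.
            ("resolution", True, incident.resolution_target),
        )
        for breach_type, still_pending, target in checks:
            key = (incident.incident_id, breach_type)
            exceeded = incident.elapsed_business_minutes > target
            # One breach record per incident per type: skip known keys.
            if still_pending and exceeded and key not in recorded:
                recorded.add(key)
                new_events.append(key)
    return new_events


incidents = [
    OpenIncident("INC-1", elapsed_business_minutes=50, response_target=30, resolution_target=240),
    OpenIncident("INC-2", elapsed_business_minutes=300, response_target=30,
                 resolution_target=240, acknowledged=True),
]
recorded: Set[BreachKey] = set()
first_pass = detect_breaches(incidents, recorded)
second_pass = detect_breaches(incidents, recorded)
```

The first pass reports a response breach for INC-1 and a resolution breach for INC-2. The second pass reports nothing, which mirrors the no-duplicates rule in step 4.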

Escalation notifications

Each SLA definition can have multiple escalation rules with different time thresholds and notification targets. When a breach is detected, the system evaluates all escalation rules and fires notifications for every rule where the breach duration exceeds the configured time threshold.

Escalation timing: All escalation levels whose threshold is met fire together when the breach is first detected. Time-based re-escalation (e.g., "escalate to level 2 after 120 more minutes beyond initial breach") is not currently supported.
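The rule evaluation can be sketched as a filter over the definition's escalation rules. The names EscalationRule and rules_to_fire are hypothetical, the thresholds reuse the on-call and manager example from the setup steps, and the breach duration is passed in as a plain number of minutes because this manual does not specify how the engine measures it.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class EscalationRule:
    after_minutes: int  # time threshold for this rule
    target: str         # who gets notified


def rules_to_fire(rules: Iterable[EscalationRule], breach_minutes: int) -> List[str]:
    """Return the targets of every rule whose threshold the breach duration exceeds."""
    ordered = sorted(rules, key=lambda rule: rule.after_minutes)
    # All qualifying levels fire together; there is no later re-escalation pass.
    return [rule.target for rule in ordered if breach_minutes > rule.after_minutes]


rules = [
    EscalationRule(after_minutes=120, target="manager"),
    EscalationRule(after_minutes=30, target="on-call"),
]
```

With these rules, a breach duration of 45 minutes notifies only on-call, while 150 minutes notifies both on-call and manager in the same evaluation.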

Incident enrichment

When a new incident is created (from either metric alerts or event alerts), the SLA engine automatically enriches it with ITIL metadata.

Auto-categorization

Incidents are automatically categorized based on their source. The engine maps event sources (AD security, vulnerability, SNMP, discovery, etc.) and metric types (CPU, memory, disk, network) to ITIL incident categories and subcategories. Categories follow a hierarchical taxonomy seeded at installation.

Priority matrix

Incident priority is calculated using an ITIL 4x4 impact/urgency grid:

                 | Urgency: Critical | Urgency: High | Urgency: Medium | Urgency: Low
Impact: Critical | P1                | P1            | P2              | P2
Impact: High     | P1                | P2            | P2              | P3
Impact: Medium   | P2                | P2            | P3              | P3
Impact: Low      | P2                | P3            | P3              | P4

Impact is derived from alert severity, the number of affected hosts, and the criticality of the host group. Urgency maps directly from severity.
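The grid translates directly into a lookup table. The sketch below encodes the matrix exactly as documented; the lowercase level names and the function name priority are assumptions made for illustration, and the derivation of impact from severity, host count, and host group criticality is not modeled.

```python
LEVELS = ("critical", "high", "medium", "low")

# Rows are impact, columns are urgency in the order given by LEVELS.
PRIORITY_MATRIX = {
    "critical": ("P1", "P1", "P2", "P2"),
    "high":     ("P1", "P2", "P2", "P3"),
    "medium":   ("P2", "P2", "P3", "P3"),
    "low":      ("P2", "P3", "P3", "P4"),
}


def priority(impact: str, urgency: str) -> str:
    """Look up the incident priority for an impact/urgency pair."""
    if impact not in PRIORITY_MATRIX or urgency not in LEVELS:
        raise ValueError(f"unknown impact/urgency: {impact!r}/{urgency!r}")
    return PRIORITY_MATRIX[impact][LEVELS.index(urgency)]
```

For example, priority("high", "critical") returns "P1" and priority("low", "low") returns "P4".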

SLA assignment

After categorization and priority calculation, the system assigns the most specific matching SLA definition and sets the response due and resolution due timestamps on the incident.
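A due timestamp is the creation time plus the target, counted in business time when business hours are configured. The sketch below shows one way to compute it. It is not the product's algorithm: add_business_minutes and its parameters are hypothetical, and timestamps are assumed to be naive datetimes in the SLA's timezone.

```python
from datetime import datetime, time, timedelta


def add_business_minutes(start: datetime,
                         minutes: int,
                         work_start: time = time(9, 0),
                         work_end: time = time(17, 0),
                         working_days: frozenset = frozenset({0, 1, 2, 3, 4})) -> datetime:
    """Return the moment at which `minutes` of business time have elapsed since start."""
    if not working_days or work_start >= work_end:
        raise ValueError("business hours must cover at least one non-empty window")
    remaining = timedelta(minutes=minutes)
    cursor = start
    while True:
        day = cursor.date()
        if day.weekday() in working_days:
            window_open = datetime.combine(day, work_start)
            window_close = datetime.combine(day, work_end)
            cursor = max(cursor, window_open)  # skip time before opening
            if cursor < window_close:
                available = window_close - cursor
                if remaining <= available:
                    return cursor + remaining
                remaining -= available  # consume the rest of today
        # Move to midnight of the next calendar day and try again.
        cursor = datetime.combine(day + timedelta(days=1), time(0, 0))


created = datetime(2024, 3, 1, 15, 0)  # a Friday
response_due = add_business_minutes(created, 15)
resolution_due = add_business_minutes(created, 240)
```

Here response_due falls on Friday at 15:15 and resolution_due on Monday at 11:00. For a 24/7 SLA the calculation reduces to created + timedelta(minutes=target).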

Problem management (KEDB)

The Problems page serves as a Known Error Database (KEDB). Problems represent recurring issues that cause multiple incidents. Linking incidents to problems provides root cause context and helps track permanent fixes.

  1. Navigate to Problems from the sidebar.
  2. Click Create Problem.
  3. Enter a title, description, root cause (if known), workaround, and permanent fix details.
  4. Set the status: open, investigating, known_error, resolved, or closed.
  5. Link related incidents by selecting them from the incident list.
  6. Click Save.

Problem statuses

  • open: New problem under initial review.
  • investigating: Root cause analysis in progress.
  • known_error: Root cause identified and documented with a workaround. The Known Error flag is set automatically with this status.
  • resolved: Permanent fix applied. The resolved timestamp is set automatically.
  • closed: Problem fully reviewed and closed.
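The automatic side effects of a status change can be sketched as follows. Problem, set_status, and the field names are hypothetical. The sketch also assumes that the Known Error flag stays set after later transitions and that the resolved timestamp is written only once, neither of which this manual confirms.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

STATUSES = ("open", "investigating", "known_error", "resolved", "closed")


@dataclass
class Problem:
    title: str
    status: str = "open"
    known_error: bool = False
    resolved_at: Optional[datetime] = None


def set_status(problem: Problem, status: str, now: Optional[datetime] = None) -> Problem:
    """Apply a status change together with its automatic side effects."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    problem.status = status
    if status == "known_error":
        problem.known_error = True  # flag is set automatically
    if status == "resolved" and problem.resolved_at is None:
        # Timestamp is set automatically; UTC is an assumption.
        problem.resolved_at = now or datetime.now(timezone.utc)
    return problem
```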

Incident linking

Link incidents to a problem to track which incidents are caused by the same underlying issue. Unlinking removes the association without deleting either record. The problem's incident count and affected host count update automatically.

Automated problem detection

The problem detection engine runs daily as a scheduled job. It scans closed incidents to find recurring patterns that indicate a systemic issue.

Detection tiers

Tier | Pattern | Criteria
Tier 1: Cross-host | Same alert rule firing across multiple hosts in an organization | 3+ incidents from the same alert rule, affecting 2+ distinct hosts
Tier 2: Per-host recurrence | Same alert rule firing repeatedly on a single host | 3+ incidents from the same alert rule on the same host

When a pattern is detected, the engine automatically creates a Problem record with detection metadata (alert rule, affected hosts, incident count, time range). These auto-detected problems appear in the KEDB alongside manually created ones.
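The two tiers can be sketched as a grouping pass over closed incidents. This is an illustration under stated assumptions: detect_problems is a hypothetical name, incidents are reduced to (alert rule, host) pairs that already belong to one organization and one time range, and a rule that satisfies both tiers is reported under both, which the manual does not specify.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Finding = Tuple[str, str, List[str]]  # (tier, alert rule, affected hosts)


def detect_problems(closed_incidents: Iterable[Tuple[str, str]],
                    min_incidents: int = 3,
                    min_hosts: int = 2) -> List[Finding]:
    """Find recurring patterns in (alert_rule, host) pairs from closed incidents."""
    hosts_by_rule: Dict[str, List[str]] = defaultdict(list)
    for alert_rule, host in closed_incidents:
        hosts_by_rule[alert_rule].append(host)

    findings: List[Finding] = []
    for alert_rule, hosts in hosts_by_rule.items():
        distinct_hosts = sorted(set(hosts))
        # Tier 1: 3+ incidents from one rule across 2+ distinct hosts.
        if len(hosts) >= min_incidents and len(distinct_hosts) >= min_hosts:
            findings.append(("tier1", alert_rule, distinct_hosts))
        # Tier 2: 3+ incidents from one rule on the same host.
        for host in distinct_hosts:
            if hosts.count(host) >= min_incidents:
                findings.append(("tier2", alert_rule, [host]))
    return findings


sample = [
    ("cpu_high", "web-1"), ("cpu_high", "web-2"), ("cpu_high", "web-1"),
    ("disk_full", "db-1"), ("disk_full", "db-1"), ("disk_full", "db-1"),
    ("mem_high", "app-1"),
]
findings = detect_problems(sample)
```

In this sample, cpu_high produces a Tier 1 finding for web-1 and web-2, disk_full produces a Tier 2 finding for db-1, and the single mem_high incident produces nothing.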

Incident templates

Five system incident templates are seeded at installation for common incident types. They provide structured response procedures and pre-defined severity levels.

Template | Default Severity | Description
Service Outage | Critical | Complete or partial service unavailability affecting users or business operations.
Security Breach | Critical | Confirmed or suspected unauthorized access, data exfiltration, or policy violation.
Performance Degradation | High | Service available but operating below acceptable performance thresholds.
Network Connectivity | High | Network path failures, high latency, or intermittent connectivity issues.
Compliance Violation | Medium | Detected deviation from compliance policies or regulatory requirements.

System templates cannot be modified or deleted. To create a custom template, clone a template and adjust the response procedures to match your operational requirements.

Change freeze windows

Change freeze windows block automated changes during sensitive periods such as quarter-end, audit windows, or deployment freezes. When a freeze is active, the patch engine and remediation engine respect it and hold the changes covered by the freeze scope.

  1. Navigate to Settings > Change Freeze.
  2. Click Create Freeze Window.
  3. Select the target organization.
  4. Set the start and end dates/times (UTC).
  5. Select the scope to control what is blocked.
  6. Enter a reason for the freeze.
  7. Click Save.

Freeze scopes

  • all: Blocks patches, deployments, and auto-remediation.
  • patches: Blocks patch deployments only.
  • deployments: Blocks all deployments (patches and software installs).
  • remediation: Blocks auto-remediation only.

Emergency override

If an urgent change is needed during a freeze, users with settings.manage can issue an emergency override. The override requires a written justification, records who approved it and when, and creates a full audit trail. The freeze is deactivated but the override record is preserved.

Freeze and maintenance window interaction: Change freezes override maintenance windows. If a freeze is active, patches will not deploy even if a maintenance window is currently open.
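The scope semantics and the effect of an emergency override can be sketched as a simple check. The scope names come from this manual; everything else is hypothetical, including FreezeWindow, is_blocked, the change kind names, and the choice to treat the end time as exclusive. Maintenance windows are not modeled because a freeze overrides them.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable

# Which kinds of change each freeze scope blocks.
BLOCKED_BY_SCOPE = {
    "all": {"patch", "software_install", "remediation"},
    "patches": {"patch"},
    "deployments": {"patch", "software_install"},
    "remediation": {"remediation"},
}


@dataclass(frozen=True)
class FreezeWindow:
    start: datetime      # UTC
    end: datetime        # UTC
    scope: str           # one of the keys in BLOCKED_BY_SCOPE
    active: bool = True  # set to False by an emergency override


def is_blocked(change_kind: str, now: datetime, freezes: Iterable[FreezeWindow]) -> bool:
    """Return True if any active freeze window blocks this kind of change right now."""
    return any(
        freeze.active
        and freeze.start <= now < freeze.end
        and change_kind in BLOCKED_BY_SCOPE[freeze.scope]
        for freeze in freezes
    )


year_end = FreezeWindow(
    start=datetime(2024, 12, 20, tzinfo=timezone.utc),
    end=datetime(2025, 1, 2, tzinfo=timezone.utc),
    scope="deployments",
)
now = datetime(2024, 12, 24, 12, 0, tzinfo=timezone.utc)
```

During this window, is_blocked("patch", now, [year_end]) is true while is_blocked("remediation", now, [year_end]) is false, because the deployments scope leaves auto-remediation running.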

Major incident auto-escalation

P1 (priority 1) incidents trigger automatic escalation notifications to ensure immediate attention from senior staff. When an incident is classified as P1 through the ITIL priority matrix, the system automatically:

  • Sends critical-severity notifications to all configured notification channels
  • Fires SLA breach escalation rules if applicable
  • Records the escalation in the incident timeline

This ensures that major incidents receive immediate visibility without waiting for manual triage.

Incident categories

Cadres ships with a hierarchical incident category taxonomy based on ITIL best practices. Categories are seeded at installation with six top-level categories, each containing subcategories.

Categories are used for incident classification, reporting, and trend analysis. The auto-categorization engine assigns categories automatically based on alert source, but operators can override the category on any incident.

Read-only taxonomy: The category list is available to any authenticated user from SLA > Incident Categories. Inactive categories are excluded. Custom categories are not currently supported; the system uses the seeded taxonomy.

SLA dashboard

The SLA Dashboard provides a 30-day breach trend visualization, summary statistics, and SLA definition status at a glance.

  • Breach trend chart — 30-day view of response and resolution breaches by day.
  • Stats cards — Total active SLAs, total breaches (response + resolution), and current compliance percentage.
  • SLA definition list — All active SLA definitions with their scope, target values, and current breach count.
  • At-risk indicator — Incidents show an "At Risk" warning badge (yellow) when less than 20% of SLA time remains, before the actual breach occurs.
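The at-risk threshold can be expressed as a short classification function. The sketch below is an assumption-based illustration: sla_status and its return labels are hypothetical, and integer arithmetic is used so that the 20% boundary is exact.

```python
def sla_status(elapsed_minutes: int, target_minutes: int, at_risk_percent: int = 20) -> str:
    """Classify an incident as 'ok', 'at_risk', or 'breached' for one SLA target."""
    if target_minutes < 1:
        raise ValueError("target_minutes must be at least 1")
    if elapsed_minutes > target_minutes:
        return "breached"  # elapsed time exceeds the target
    remaining = target_minutes - elapsed_minutes
    # At risk when strictly less than 20% of the SLA time remains.
    if remaining * 100 < at_risk_percent * target_minutes:
        return "at_risk"
    return "ok"
```

For a 240-minute target, the badge would appear once more than 192 minutes have elapsed: sla_status(192, 240) returns "ok" and sla_status(200, 240) returns "at_risk".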

Permissions reference

Action | Permission
View SLA definitions, breaches, problems, categories | alerts.view
Create/edit/delete SLA definitions and problems | alerts.manage
Link/unlink incidents to problems | alerts.manage
Create/delete change freeze windows | settings.manage
Emergency override of change freeze | settings.manage

Navigation reference

Feature | Location
SLA Definitions | SLA > SLA Definitions: create, edit, deactivate, and view SLA definitions
SLA Breaches | SLA > Breaches: view breach events and details
Incident Categories | SLA > Incident Categories: view the hierarchical category taxonomy
Problems (KEDB) | Problems: create, edit, delete problems; link and unlink incidents
Change Freeze Windows | Settings > Change Freeze: create, deactivate, and override freeze windows