IT Service & Operations Manual

Patching

Patch planning, rollout control, exception handling, and the workflow used to keep update pressure from turning into operational risk.

Audience: Endpoint and operations teams
Focus: Patch rollout and exception control
Status: Public manual

Scope

Patching is where operational hygiene, user impact, and risk management collide. This public guide keeps the rollout and control model while removing private implementation detail.

SSOT Document – Single Source of Truth for patch management operational procedures, setup guides, and troubleshooting.

Related SSOT documents:

- Architecture: patching.md – System design and data models
- Functional: patching.md – API endpoints, state machines, KPIs
- Manual: vulnerability-management.md – Vulnerability operations
- Manual: compliance.md – Compliance operations
- Architecture: alerts-monitoring.md – Alert and incident management

Setting Up Patch Policies

A patch policy controls how patches are approved, deployed, and rolled back for an organization.

Create a policy:

1. Navigate to Patching > Setup
2. Click “Create Policy”
3. Configure:
   - Name: Descriptive name (unique per org)
   - Auto-approve toggles: Enable for patch classifications you want auto-approved (security, critical, feature, hotfix, driver)
   - Auto-approve delay: Days after vendor release before auto-approval (0 = immediate)
   - Reboot policy: How to handle reboots (maintenance_window recommended for production)
   - Disk space thresholds: Minimum free space before patching starts
   - Circuit breaker threshold: Number of failures before auto-rejecting a KB across all orgs (default 3)
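The interaction between the auto-approve toggles and the auto-approve delay can be sketched as a small eligibility check. This is only an illustration under stated assumptions: the function name `is_auto_approvable`, its parameters, and the dictionary shape used for the toggles are hypothetical and do not describe the product's actual API.

```python
from datetime import date, timedelta


def is_auto_approvable(classification: str, released_on: date, today: date,
                       auto_approve: dict[str, bool], delay_days: int = 0) -> bool:
    """Return True when a patch qualifies for auto-approval (hypothetical helper)."""
    # The classification must have its auto-approve toggle enabled.
    if not auto_approve.get(classification, False):
        return False
    # A delay of 0 means the patch qualifies on its vendor release day.
    return today >= released_on + timedelta(days=delay_days)


# Example toggles: security and critical are auto-approved, feature is not.
toggles = {"security": True, "critical": True, "feature": False}
released = date(2024, 3, 12)
eligible_now = is_auto_approvable("security", released, date(2024, 3, 15), toggles, delay_days=3)
```

A classification missing from the toggles is treated as disabled here, which is an assumption chosen to keep the default conservative.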

Configure rollback behavior:

- Failure action: What happens when a host fails:
  - auto_rollback (default): Automatically roll back failed hosts
  - auto_pause: Pause the deployment for human review
  - escalate: Pause AND fire a high-severity alert
- Rollback triggers: Enable/disable rollback for specific failure classes (install failure, synthetic validation failure, fingerprint drift)

Add overrides:

- Create overrides for specific host groups or locations
- Overrides are sparse – only the fields you set override the base policy
- Higher priority overrides win when a host matches multiple overrides
- The policy override editor in Setup > Policies is structured: choose a location or host group, then toggle only the fields you want to override.
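Sparse overrides with priority ordering can be pictured as a layered dictionary merge. The sketch below is a minimal illustration, not the product's data model: the `Override` class, the `resolve_policy` function, the target strings, and the convention that a larger number means higher priority are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Override:
    """Hypothetical sparse override: `fields` holds only the values that were set."""
    target: str      # host group or location the override applies to
    priority: int    # assumption: a larger number means higher priority
    fields: dict[str, Any] = field(default_factory=dict)


def resolve_policy(base: dict[str, Any], overrides: list[Override],
                   host_targets: set[str]) -> dict[str, Any]:
    """Return the effective policy for a host that belongs to `host_targets`."""
    effective = dict(base)  # copy so the base policy is never mutated
    matching = [o for o in overrides if o.target in host_targets]
    # Apply lowest priority first so higher-priority values overwrite them.
    for override in sorted(matching, key=lambda o: o.priority):
        effective.update(override.fields)
    return effective


base_policy = {"reboot_policy": "maintenance_window", "auto_approve_delay_days": 3}
overrides = [
    Override("group:kiosks", priority=10, fields={"reboot_policy": "immediate"}),
    Override("location:hq", priority=20, fields={"reboot_policy": "manual"}),
]
effective = resolve_policy(base_policy, overrides, {"group:kiosks", "location:hq"})
```

Because each override carries only the fields that were set, `auto_approve_delay_days` falls through from the base policy while `reboot_policy` comes from the higher-priority match.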

Creating and Managing Ring Sets

Ring sets define the staged rollout strategy for patches.

Create a ring set:

1. Navigate to Patching > Setup > Ring Sets
2. Click “Create Ring Set”
3. Configure:
   - Name: e.g., “Windows Servers” or “Linux Production”
   - Auto-deploy: Enable if approved patches should auto-create deployments
   - Classification filter: Limit to specific patch types (e.g., security only)
   - Prior-ring proof: Enable to require patches proven in ring N-1 before ring N
   - Change approval: Enable to require change record approval before deployment

Add rings:

1. Click “Add Ring” on the ring set
2. Configure per-ring settings:
   - Ring order: 0 = first deployed
   - Success gate: Minimum % success before advancing (default 95%)
   - Cooloff hours: Wait time after ring completion (default 24h)
   - Canary count: Number of canary hosts (default 1)
   - Canary wait: Hours to wait after canary completion (default 4h)
   - Install schedule: When to install patches:
     - immediate: Start right away
     - delay_from_approval: Wait N days from deployment creation
     - delay_from_prior_ring: Wait N days from prior ring completion
     - maintenance_window: Wait for host’s maintenance window
   - Reboot policy: Ring-level override (immediate, maintenance_window, scheduled_delay, manual, suppress)
   - Observation window: Hours to watch for critical alerts after cooloff before advancing
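To make the advancement gates concrete, the sketch below combines the success gate, cooloff, and observation window into one check. It is a simplified model under stated assumptions: `RingSettings`, `can_advance`, and the returned reason strings are hypothetical, and the observation window default of 0 is chosen only for the example because the manual does not state one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RingSettings:
    success_gate_pct: float = 95.0  # default stated in the manual
    cooloff_hours: int = 24         # default stated in the manual
    observation_hours: int = 0      # assumption: no default is documented


def can_advance(settings: RingSettings, succeeded: int, total: int,
                completed_at: datetime, now: datetime,
                critical_alerts: int = 0) -> tuple[bool, str]:
    """Decide whether the next ring may start, with a short reason."""
    if total == 0:
        return False, "ring has no hosts"
    if 100.0 * succeeded / total < settings.success_gate_pct:
        return False, "success gate not met"
    if critical_alerts > 0:
        return False, "critical alert during observation"
    # The observation window is modeled as starting when the cooloff ends.
    wait = timedelta(hours=settings.cooloff_hours + settings.observation_hours)
    if now < completed_at + wait:
        return False, "cooloff or observation window still running"
    return True, "ready to advance"


ring_done = datetime(2024, 1, 1, 0, 0)
decision = can_advance(RingSettings(observation_hours=2), succeeded=19, total=20,
                       completed_at=ring_done, now=ring_done + timedelta(hours=26))
```

With 19 of 20 hosts succeeding the ring sits exactly on the 95% gate, and 26 hours covers the 24-hour cooloff plus the 2-hour observation window, so the example ring is allowed to advance.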

Add ring members:

- Individual hosts (server-side search – type hostname to search across all hosts)
- Host groups (all members included)
- Service groups (all tiers expanded, tier ordering preserved)

Handling Rollbacks

When a host enters the rolled_back or surgical_rollback state:

- Check triage_data on the deployment host record for the probable culprit KB
- If the rollback identified a culprit KB, the circuit breaker may auto-reject that KB if it fails on enough hosts
- For manual_intervention_required hosts: investigate the triage data, then choose one of:
  - Retry the host (resets to pending and restarts the pipeline)
  - Skip the host (marks as skipped, allows ring to advance)
  - Manually roll back the host

Reboot Compliance

Setting SLA hours:

- On the patch policy, set reboot_sla_hours (default 24)
- The strictest (lowest) SLA from active org policies applies

Monitoring breaches:

- The reboot_sla_sweep_loop runs every 15 minutes
- When a host’s reboot deadline passes (reboot_required_at + sla_hours), it is flagged with reboot_sla_breached = True
- A high-severity alert is fired for each breach
- View breached hosts in the deployment detail or via the KPI endpoint
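The deadline arithmetic behind a breach can be sketched in a few lines. This is an illustration only: `effective_sla_hours` and `is_reboot_sla_breached` are hypothetical helper names, and falling back to 24 hours when no policy supplies a value is an assumption based on the documented default.

```python
from datetime import datetime, timedelta

DEFAULT_REBOOT_SLA_HOURS = 24  # documented default for reboot_sla_hours


def effective_sla_hours(policy_sla_hours: list[int]) -> int:
    """Pick the strictest (lowest) SLA among the active org policies."""
    # Assumption: with no active policies, fall back to the documented default.
    return min(policy_sla_hours, default=DEFAULT_REBOOT_SLA_HOURS)


def is_reboot_sla_breached(reboot_required_at: datetime, sla_hours: int,
                           now: datetime) -> bool:
    """A host is in breach once `now` is past reboot_required_at + sla_hours."""
    deadline = reboot_required_at + timedelta(hours=sla_hours)
    return now > deadline


sla = effective_sla_hours([48, 24, 72])
required_at = datetime(2024, 1, 1, 8, 0)
breached = is_reboot_sla_breached(required_at, sla, datetime(2024, 1, 2, 9, 0))
```

Since the sweep runs every 15 minutes, a real breach flag could appear up to 15 minutes after the deadline computed here.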

Reboot modes:

| Mode | Behavior |
|------|----------|
| immediate | Reboot right after patch install |
| maintenance_window | Reboot during next maintenance window |
| scheduled_delay | Reboot after N days (configured per ring) |
| manual | Human must approve reboot |
| suppress | No reboot (explicit exception only) |

Viewing Patch KPIs

Seven KPIs are available:

- Patch compliance: % of hosts with no approved patches pending
- Reboot compliance: % of hosts that rebooted within SLA
- MTTR: Mean hours from first failure to recovery (for retried hosts)
- Rollback rate: % of hosts that ended in rollback
- False positive rate: % of rollbacks without an identified culprit
- Circuit breaker precision: True triggers vs false triggers
- Exception rate: % of hosts needing human intervention
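As a rough illustration of how three of these percentages could be derived from per-host records, consider the sketch below. The record shape, the `pending_approved` field, the `completed` state, and the choice to count `manual_intervention_required` hosts as exceptions are all assumptions; the authoritative definitions live behind the KPI endpoint.

```python
ROLLBACK_STATES = {"rolled_back", "surgical_rollback"}  # states named in this manual


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal; 0.0 when there are no hosts."""
    return 0.0 if whole == 0 else round(100.0 * part / whole, 1)


# Hypothetical host records for one deployment.
hosts = [
    {"pending_approved": 0, "state": "completed"},
    {"pending_approved": 2, "state": "rolled_back"},
    {"pending_approved": 0, "state": "completed"},
    {"pending_approved": 1, "state": "manual_intervention_required"},
]

kpis = {
    # Hosts with no approved patches still pending.
    "patch_compliance": pct(sum(h["pending_approved"] == 0 for h in hosts), len(hosts)),
    # Hosts that ended in either rollback state.
    "rollback_rate": pct(sum(h["state"] in ROLLBACK_STATES for h in hosts), len(hosts)),
    # Assumption: hosts awaiting manual intervention count as exceptions.
    "exception_rate": pct(
        sum(h["state"] == "manual_intervention_required" for h in hosts), len(hosts)
    ),
}
```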

Export compliance for operator/client handoff:

- In Patching > KPIs & Compliance, use Export Compliance CSV to download the scoped report shown in the UI.
- Optional scoping params:

Common Patching Scenarios

Scenario: Emergency out-of-band patch

1. Create deployment with is_out_of_band = True
2. This bypasses maintenance window scheduling but NOT safety checks
3. Pre-flight, snapshot, and rollback still apply

Scenario: Paused deployment

1. Deployment pauses automatically on:
   - Canary failure
   - Success gate not met
   - Observation window alert detected
   - Variance gate (untested patches)
2. Investigate the cause
3. If a specific patch should be excluded before continuing, use “Remove Patches” on the deployment detail page while the deployment is paused
4. Resume (deployment continues), cancel, or use the kill switch
5. If the deployment stays paused too long (default 24h), it is auto-escalated and auto-cancelled

Scenario: Emergency kill switch

Use the kill switch when patches are actively causing damage and you need to stop everything immediately:

1. Navigate to the deployment detail page (the kill switch is also available from the deployments list via the “Kill” button on each deployment row)
2. Click “Kill Switch” (requires patches.killswitch permission)
3. Confirm in the modal and provide a reason (required – the activation button is disabled until a reason is entered)
4. The kill switch will:
   - Set deployment status to killed
   - Force-skip ALL non-terminal hosts immediately (they won’t finish their current phase)
   - Cancel all in-flight agent jobs (queued and running)
   - Record full audit trail with kill switch metadata
5. Unlike cancel (which lets in-progress hosts finish), the kill switch is an immediate forced stop

Kill switch vs Cancel:

- Cancel: Graceful. In-progress hosts finish their current phase, then stop. Status: cancelled.
- Kill switch: Emergency. All non-terminal hosts are force-skipped immediately and agent jobs are cancelled. Status: killed.
- Both cancelled and killed deployments can be redeployed (reset to draft).
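The kill switch effects and the later reset to draft can be modeled as two small state transitions. The sketch below is a toy model, not the service's implementation: the dictionary layout, the `TERMINAL_HOST_STATES` set, the intermediate host state names, and both function names are assumptions, and the agent job cancellation step is left out for brevity.

```python
from datetime import datetime, timezone

# Assumption: which host states count as terminal is not listed in this manual.
TERMINAL_HOST_STATES = {"succeeded", "skipped", "rolled_back"}


def kill_deployment(deployment: dict, reason: str, actor: str) -> dict:
    """Force-stop a deployment: a reason is mandatory, as in the UI modal."""
    if not reason.strip():
        raise ValueError("a reason is required to activate the kill switch")
    deployment["status"] = "killed"
    for host in deployment["hosts"]:
        # Non-terminal hosts are skipped immediately, mid-phase or not.
        if host["state"] not in TERMINAL_HOST_STATES:
            host["state"] = "skipped"
    deployment["audit"] = {
        "action": "kill_switch",
        "actor": actor,
        "reason": reason,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    return deployment


def redeploy(deployment: dict) -> dict:
    """Reset a cancelled or killed deployment to draft."""
    if deployment["status"] not in {"cancelled", "killed"}:
        raise ValueError("only cancelled or killed deployments can be redeployed")
    deployment["status"] = "draft"
    deployment["manifest_frozen_at"] = None
    return deployment


deployment = {
    "status": "running",
    "manifest_frozen_at": "2024-01-01T00:00:00+00:00",
    "hosts": [{"state": "succeeded"}, {"state": "installing"}, {"state": "pending"}],
}
kill_deployment(deployment, reason="patch causing boot loops", actor="oncall")
```

A graceful cancel would differ by leaving in-progress hosts alone until their current phase finishes, which is why it is not modeled as a loop over host states here.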

Scenario: Circuit breaker triggered

1. A KB fails on multiple hosts across orgs (>= threshold)
2. The KB is auto-rejected across all orgs with a rejection reason
3. To re-enable: manually approve the KB after investigating the root cause
4. The re-approved KB resets the circuit breaker for that KB
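The trip-and-reset cycle above can be sketched as a small counter keyed by KB. This is a minimal model under stated assumptions: the `CircuitBreaker` class and its method names are hypothetical, the KB identifier is made up for the example, and counting distinct failing hosts (rather than raw failure events) is one possible reading of "number of failures".

```python
from collections import defaultdict


class CircuitBreaker:
    """Hypothetical cross-org breaker that auto-rejects a KB at a failure threshold."""

    def __init__(self, threshold: int = 3):  # default threshold from the manual
        self.threshold = threshold
        self.failed_hosts: dict[str, set[str]] = defaultdict(set)
        self.rejected: set[str] = set()

    def record_failure(self, kb: str, host_id: str) -> bool:
        """Record a failure; return True if the KB is now auto-rejected."""
        # Assumption: repeated failures on the same host count once.
        self.failed_hosts[kb].add(host_id)
        if len(self.failed_hosts[kb]) >= self.threshold:
            self.rejected.add(kb)
        return kb in self.rejected

    def reapprove(self, kb: str) -> None:
        """Manual re-approval clears the rejection and resets the count."""
        self.rejected.discard(kb)
        self.failed_hosts.pop(kb, None)


breaker = CircuitBreaker()
results = [breaker.record_failure("KB0000001", host) for host in ("a", "b", "c")]
```

After `reapprove`, the KB must reach the threshold again from zero before it is rejected a second time.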

Scenario: Redeploy a cancelled/killed deployment

1. Click “Redeploy” on a cancelled or killed deployment
2. This resets status to draft and clears manifest_frozen_at
3. You can modify the patch list
4. Start when ready

Scenario: Using grouped patches view

1. Navigate to Patching > Available tab
2. Switch to “Grouped by KB” view
3. Patches are grouped by KB identifier, sorted by host count (most impactful first)
4. Use pagination controls to browse through KB groups (50 per page by default)
5. Filter by severity, classification, or search by KB/title
6. Use “Approve All” on a group to approve that KB across all displayed hosts (confirmation dialog shows host count)

Scenario: Searching patches in flat mode

1. Navigate to Patching > Available tab
2. Switch to flat list view
3. Enter a search term in the search box – the search is sent to the server after a 300ms debounce and resets pagination to page 1
4. Results include patches matching KB, title, or hostname server-side (no 50-per-page limitation on search scope)

Scenario: Browsing deployments with pagination

1. Navigate to Patching > Deployments tab
2. Deployments are paginated server-side at 25 per page with Previous/Next controls
3. Use the search box to find deployments by name (server-side, debounced at 300ms)
4. Pagination resets to page 1 when filters or search change
5. The X-Total-Count response header drives accurate “Showing X-Y of Z” display
6. If the deployment index or ring timeline cannot be loaded, the page shows an inline warning with a Retry action instead of falling back to an empty deployment view.
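The “Showing X-Y of Z” text follows directly from the page number, the page size, and the total taken from the X-Total-Count header. The helper below is a hypothetical sketch of that arithmetic, not the UI's actual code, and its handling of an empty result set is an assumption.

```python
def showing_label(page: int, total: int, page_size: int = 25) -> str:
    """Build the range label for a 1-based page; `total` comes from X-Total-Count."""
    if total == 0:
        return "Showing 0 of 0"  # assumption: wording for an empty result set
    start = (page - 1) * page_size + 1
    end = min(page * page_size, total)  # the last page may be partially filled
    return f"Showing {start}-{end} of {total}"


label = showing_label(page=3, total=60)
```

Here the third page of 60 deployments holds only ten rows, so the label reads "Showing 51-60 of 60".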

Scenario: Searching approved patches for manual deployment

1. In the Deployments tab, click “Manual Deployment”
2. Select an organization – approved patches load (up to 500 initial)
3. Type in the patch search box to find patches by KB or title – the search triggers a server-side query after 300ms, allowing discovery of patches beyond the initial 500
4. Client-side filtering provides instant results while the server search is in flight
5. If approved-patch lookup fails, the modal shows an inline warning and Retry action instead of silently clearing the selector.