IT Service & Operations Manual

Automation

Workflow automation, approvals, exception handling, and the guardrails needed to turn repeatable work into reliable execution.

Audience: Automation and operations teams
Focus: Repeatable execution and guardrails
Status: Public manual

Scope

Automation should remove repetitive work without creating opaque or dangerous operations. This guide keeps the operator-facing automation model while excluding private execution interfaces.

Domain: Workflow Engine & Automation Control Plane

See also: Jobs & Scripts Manual | ITSM Manual

Creating a Workflow Definition

Navigate to Automation > Workflows.
Click “New Workflow”.
Fill in the name, description, and organization scope.

The Workflows landing page also provides shortcuts for operators who want a different starting point: “Open visual designer” opens the editor in visual mode, “Create template” starts a reusable draft, and the reusable-templates panel plus template library surface existing template definitions for cloning or review. The search box on this page searches definitions only; the instances and approvals views do not offer search.

Each step requires:
- id: Unique string identifier within the workflow.
- type: Step type (see below).
- config: Type-specific configuration.
- on_success: Next step ID on success. Omit or set null for a terminal successful completion.
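
The field rules above lend themselves to a quick consistency check before saving a draft. The following is a minimal sketch, not the engine's own validation: the step type names ("script", "notification") are placeholders, and the dictionary shape is assumed from the field list. It checks that required keys are present, that ids are unique, and that on_success and on_failure point at steps that exist.

```python
from typing import Any

# Keys every step must carry, per the field list above.
REQUIRED_KEYS = ("id", "type", "config")


def validate_steps(steps: list[dict[str, Any]]) -> list[str]:
    """Return human-readable problems; an empty list means the steps look consistent."""
    problems: list[str] = []
    seen: set[str] = set()
    for index, step in enumerate(steps):
        for key in REQUIRED_KEYS:
            if key not in step:
                problems.append(f"step {index}: missing '{key}'")
        step_id = step.get("id")
        if step_id in seen:
            problems.append(f"step {index}: duplicate id '{step_id}'")
        if step_id is not None:
            seen.add(step_id)

    ids = {step.get("id") for step in steps}
    for step in steps:
        # A missing or null route means the workflow ends at this step.
        for route in ("on_success", "on_failure"):
            target = step.get(route)
            if target is not None and target not in ids:
                problems.append(
                    f"step '{step.get('id')}': {route} points to unknown step '{target}'"
                )
    return problems


# Hypothetical two-step definition; the type names are illustrative only.
example_steps = [
    {"id": "check", "type": "script", "config": {}, "on_success": "notify", "on_failure": "notify"},
    {"id": "notify", "type": "notification", "config": {}, "on_success": None},
]
```

Calling validate_steps(example_steps) returns an empty list; a step whose on_success names a missing id produces one problem entry.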

Building Workflow Steps

Script step:

Command step:

Approval step:

Conditional step:

Delay step:

PAM checkout step:

PAM checkin step:

Parallel step:

Notification step:

Publishing a Workflow

Test the workflow in draft mode by running instances.
When satisfied, click Publish in the WorkflowEditor toolbar (requires workflows.edit permission). A confirmation modal explains the lock.
You can still toggle is_active and is_published.
To make changes, clone the definition (for example, with the Clone action in the definitions table) and edit the clone.
In the editor Settings menu, set the default workflow timeout, optional canary rollout fields, and the template-library flag before saving the draft. Existing unpublished workflows can also be promoted into the template library by toggling that flag and saving.

If you open the editor from the Workflows page starter cards, the URL can carry ?mode=visual to land on the canvas or ?template=1 to preselect the template flag for a new draft. Template drafts are snapshot-based, so cloning captures the current definition as a new org-specific workflow and later edits do not automatically change older clones. The template library is fetched separately from the Workflows landing page definition list, so it stays accurate even when the main definitions table is filtered or paged.

Running Workflows

Manual start:
- Groups are expanded to host IDs at start time.
- The workflow definition is frozen (snapshot) at start time.

Deferred start (WP-5):

Scheduled start: Set up a recurring schedule that targets a workflow.

Triggered start: Workflows can be triggered by:
- Runbook actions (via monitor rules)
- Manual API calls

Monitoring Workflow Instances

Current step and status
Step results for each completed step
Variables (with sensitive values redacted)
Associated jobs and their statuses
Admin/owner users see all approvals and can respond regardless of approver config.
If approvers is empty, any user with workflows.execute permission can respond (legacy behavior).
The approvals list is paginated with skip and limit, so large queues can be browsed without loading every record at once. The Workflows page pagination controls navigate real server-side pages.
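
As an illustration of skip/limit paging, the sketch below walks a paginated list one page at a time. It is an assumption-laden example: fetch_page is a hypothetical callable standing in for whatever client retrieves a page, and the stop condition (a page shorter than limit) is a common convention rather than documented behavior.

```python
from typing import Callable, Iterator

Page = list[dict]


def iter_pages(fetch_page: Callable[[int, int], Page], limit: int = 50) -> Iterator[dict]:
    """Yield every record by requesting successive (skip, limit) pages."""
    skip = 0
    while True:
        page = fetch_page(skip, limit)
        yield from page
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < limit:
            return
        skip += limit


# Stand-in data source used only for this example.
_FAKE_APPROVALS = [{"id": i} for i in range(120)]


def fake_fetch(skip: int, limit: int) -> Page:
    return _FAKE_APPROVALS[skip:skip + limit]
```

With the stand-in source, iter_pages(fake_fetch, limit=50) requests three pages (50, 50, and 20 records) and yields all 120 records in order.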

Simulation / Dry-Run

Click Simulate in the WorkflowEditor toolbar (requires workflows.execute permission). Select target hosts in the dialog and click Run Simulation.
The engine walks all steps, predicting outcomes without any side effects.
Script/command steps record what WOULD be dispatched (host IDs, job counts) – no actual Jobs created.
Approval steps record would_pause_for_approval – no approval request created, no pause.
Conditional steps evaluate expressions against current variables and route accordingly.
Delay steps record would_delay – no pause, simulation advances immediately.
PAM steps record what WOULD happen – no sessions opened, no credentials decrypted.
Notification steps record would_notify – no notification sent.

Use simulation mode to validate new workflows before production deployment.
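
To make the per-step simulation behavior concrete, here is a simplified sketch of a dry-run walker. It is not the engine's implementation: the step type strings, the assumption that the first listed step is the entry point, and the one-job-per-host count are all illustrative. Conditional, parallel, and PAM steps are left out to keep it short.

```python
def simulate(steps: list[dict], host_ids: list[int]) -> list[dict]:
    """Walk the on_success path and record predicted outcomes without side effects."""
    by_id = {step["id"]: step for step in steps}
    report: list[dict] = []
    current = steps[0]["id"] if steps else None
    visited: set[str] = set()
    # The visited set guards against definitions that loop back on themselves.
    while current is not None and current not in visited:
        visited.add(current)
        step = by_id[current]
        kind = step["type"]
        if kind in ("script", "command"):
            # Record what WOULD be dispatched; no jobs are created.
            outcome = {"would_dispatch_to": list(host_ids), "job_count": len(host_ids)}
        elif kind == "approval":
            outcome = {"would_pause_for_approval": True}
        elif kind == "delay":
            outcome = {"would_delay": True}
        elif kind == "notification":
            outcome = {"would_notify": True}
        else:
            outcome = {"note": f"no prediction for type {kind!r}"}
        report.append({"step_id": current, "type": kind, "outcome": outcome})
        current = step.get("on_success")
    return report


# Hypothetical three-step workflow used only for this example.
sample_steps = [
    {"id": "run", "type": "script", "config": {}, "on_success": "gate"},
    {"id": "gate", "type": "approval", "config": {}, "on_success": "done"},
    {"id": "done", "type": "notification", "config": {}, "on_success": None},
]
```

Running simulate(sample_steps, [1, 2]) produces three report entries and advances immediately through the approval step instead of pausing.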

Retrying Failed Workflows

Only instances in failed status can be retried.
The failed step execution is reset and the engine reprocesses from that step.
If no failed step is found, the engine restarts from the beginning.
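
The retry rules can be sketched as a small selection function. This is an illustrative model only: the record shape and the "pending" reset status are assumptions, not the engine's actual schema.

```python
def pick_restart_step(instance_status: str, step_executions: list[dict], first_step_id: str) -> str:
    """Return the step id a retry would resume from, resetting the failed execution."""
    if instance_status != "failed":
        raise ValueError("only instances in failed status can be retried")
    for execution in step_executions:
        if execution["status"] == "failed":
            # Reset the failed execution so the engine can reprocess it.
            execution["status"] = "pending"  # assumed reset value
            return execution["step_id"]
    # No failed step recorded: restart from the beginning.
    return first_step_id
```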

Canary Rollout

Set canary_percentage on the workflow definition (e.g., 10 for 10%).
Set canary_success_threshold (default 80%).
On execution, the engine dispatches to canary hosts first.
If success rate meets threshold, remaining hosts are dispatched.
If below threshold, the step fails and routes to on_failure.
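
The canary arithmetic can be illustrated with two helpers. This is a sketch under stated assumptions: rounding the canary group up and always including at least one host are choices made for the example, and the engine may select canary hosts differently.

```python
import math


def split_canary(host_ids: list[int], canary_percentage: float) -> tuple[list[int], list[int]]:
    """Split hosts into (canary, remaining) based on canary_percentage."""
    if not host_ids:
        return [], []
    # Assumption: round up and always send to at least one canary host.
    count = max(1, math.ceil(len(host_ids) * canary_percentage / 100))
    count = min(count, len(host_ids))
    return host_ids[:count], host_ids[count:]


def canary_passed(succeeded: int, total: int, canary_success_threshold: float = 80) -> bool:
    """True when the canary success rate meets the threshold (percent)."""
    if total == 0:
        return False
    # Cross-multiplied to avoid floating-point division.
    return succeeded * 100 >= canary_success_threshold * total
```

For 20 hosts with canary_percentage = 10, two hosts form the canary group and 18 wait. If 4 of 5 canary jobs succeed, the rate is exactly 80% and the default threshold is met; 3 of 5 would route the step to on_failure.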

Automation Kill Switch

Pausing Automation

When you need to immediately stop all automated remediation and patch deployment for an organization:

Effect is immediate on the next scheduler cycle.
Affects: auto-remediation, auto-deploy. Does NOT affect: manual job creation, manual workflow execution.
When the kill switch is activated, any currently running workflow instances will be paused (with reason ‘Kill switch active’) at the next step boundary. They are not terminated — resume them after clearing the kill switch.
Permission required: settings.manage.

Resuming Automation

All automated processes resume on the next scheduler cycle.
Permission required: settings.manage.

Checking Status

In the Workflows page, if initial data loads fail (definitions, instances, approvals, org list, or automation status), an error banner is shown with a Retry button so operators can recover without refreshing the browser.

Auto-Retry on Transient Failures (A-01)

Steps can automatically retry on transient errors (timeouts, network issues, 503/504 responses) without operator intervention. Configure per-step in the step config:

retry_count: Maximum number of automatic retry attempts (default 0 — disabled).
retry_delay_seconds: Seconds to wait between retries (default 30).
Retries only fire for transient errors. Configuration errors, permission errors, and “not found” errors are never retried.
The instance shows pause_reason = "retry_delay" during the wait between retries.
After exhausting all retries, the step routes to on_failure as normal.
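
The retry settings above can be modeled as a small loop. This is a hedged sketch, not the engine's code: TransientError and PermanentError are hypothetical exception classes standing in for the engine's error classification, and the returned route labels are illustrative.

```python
import time
from typing import Any, Callable


class TransientError(Exception):
    """Stand-in for timeouts, network issues, and 503/504 responses."""


class PermanentError(Exception):
    """Stand-in for configuration, permission, and not-found errors."""


def run_with_retry(
    action: Callable[[], Any],
    retry_count: int = 0,
    retry_delay_seconds: float = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, Any]:
    """Run action, retrying only transient failures up to retry_count times."""
    attempts = 0
    while True:
        try:
            return "success", action()
        except TransientError as exc:
            if attempts >= retry_count:
                return "on_failure", exc  # retries exhausted
            attempts += 1
            sleep(retry_delay_seconds)  # the instance would show pause_reason = "retry_delay"
        except PermanentError as exc:
            return "on_failure", exc  # never retried


def make_flaky(failures: int) -> Callable[[], str]:
    """Build an action that raises TransientError `failures` times, then returns 'ok'."""
    state = {"left": failures}

    def action() -> str:
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientError("simulated timeout")
        return "ok"

    return action
```

With retry_count = 2, an action that fails twice and then succeeds completes normally; with the default retry_count = 0, the first transient failure routes straight to on_failure.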

Workflow-Level Timeout (A-03)

Set default_timeout_seconds on a workflow definition to cap total execution time. If the workflow runs longer than this threshold, it is terminated with status timeout (terminal). All pending step executions and PAM sessions are cleaned up.

Instances with no default_timeout_seconds (or null) run without a time limit.
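
A timeout check along these lines could look like the sketch below. The function name and the way the start time is passed in are assumptions for illustration; only default_timeout_seconds and its null behavior come from the text above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def has_timed_out(
    started_at: datetime,
    default_timeout_seconds: Optional[int],
    now: Optional[datetime] = None,
) -> bool:
    """True when the instance has run longer than its workflow-level timeout."""
    if default_timeout_seconds is None:
        return False  # no limit configured
    now = now or datetime.now(timezone.utc)
    return now - started_at > timedelta(seconds=default_timeout_seconds)
```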

Permission Reference

Action	Required Permission
Create workflow definition	`workflows.create`
Import workflow definition	`workflows.create`
View definitions, instances, approvals	`workflows.view`
Export workflow definition	`workflows.view`
Edit/publish definitions	`workflows.edit`
Delete definitions	`workflows.delete`
Start, cancel, pause, resume, retry instances	`workflows.execute`
Respond to approvals	`workflows.execute`
Run simulation	`workflows.execute`
View confidence decisions	`automation.view`
Approve, reject, replay, or leave feedback on confidence decisions	`automation.manage`
Pause/resume automation kill switch	`settings.manage`
Check automation status	`settings.view`
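
For scripting or review purposes, the table can be expressed as a lookup. The action keys below are hypothetical labels invented for this example; only the permission strings come from the table.

```python
# Action labels are illustrative; permission names are taken from the table above.
REQUIRED_PERMISSION = {
    "create_definition": "workflows.create",
    "import_definition": "workflows.create",
    "view": "workflows.view",
    "export_definition": "workflows.view",
    "edit_or_publish_definition": "workflows.edit",
    "delete_definition": "workflows.delete",
    "control_instance": "workflows.execute",
    "respond_to_approval": "workflows.execute",
    "run_simulation": "workflows.execute",
    "view_confidence_decisions": "automation.view",
    "manage_confidence_decisions": "automation.manage",
    "toggle_kill_switch": "settings.manage",
    "check_automation_status": "settings.view",
}


def can_perform(action: str, granted: set[str]) -> bool:
    """True when the granted permission set covers the action."""
    return REQUIRED_PERMISSION[action] in granted
```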

UI Features (WP-6, updated 2026-03-18)

Publishing a Workflow

Open a saved workflow in the editor.
Click Publish in the toolbar (requires workflows.edit permission). The button is hidden on new (unsaved) workflows and on already-published workflows.
Confirm in the modal — this makes the definition immutable.
Published workflows show a “Published” badge in the editor header. Save and all step-editing controls are disabled.

Version History

Click Versions in the toolbar of a saved workflow (requires workflows.view permission).
A modal shows all version snapshots with version number, timestamp, and step count.
Versions are created automatically each time the workflow is updated.

Cloning a Workflow

From the workflow definitions table, click the Clone action for any definition.
A new definition is created with the same steps and variables, name appended with “ (Copy)”, and is_published = false.
Clones are created as local drafts in the caller’s org, with is_template = false.

Simulation / Dry-Run

Click Simulate in the toolbar of a saved workflow (requires workflows.execute permission).
Select target hosts and click Run Simulation.
The simulation report shows predicted actions per step without executing anything.

Running a Workflow

From the workflows list, click the Play button, or from the editor click Run.
This opens the Run Workflow page for that workflow.
Select target hosts using the searchable multi-select (server-side search by hostname or IP) and/or host groups.
Optionally override workflow variables.
Optionally enable Schedule for later and pick a date/time for deferred start.
Click Run Workflow (or Schedule Workflow if deferred).

Instance Detail / Step Timeline

From the Instances tab:
- Click the workflow name or any instance row to open the detail panel for that execution.
- Use the explicit Open workflow definition action when you want the workflow editor/definition instead of the execution record.
- Failed rows surface an inline error preview directly in the table so operators can triage without leaving the list.

The detail panel shows:
- Overall status, start time, elapsed time
- Workflow-level error_message when the instance failed or timed out
- Step-by-step execution timeline with per-step status, duration, and links to associated jobs
- Variables (sensitive values redacted)

Kill Switch (Feature Flags page)

Navigate to Support Portal > Management > Feature Flags or RMM Portal > Settings > Feature Flags.
Each organization shows its automation pause status
Click Pause to halt all automation (reason required)
Click Resume to re-enable automation

Exporting and Importing Workflows

Exporting a workflow

Navigate to Automation > Workflows (Definitions tab).
Click the Export (download) icon next to the workflow you want to export.

Export includes: name, description, steps, variables, timeout, canary settings, and trigger type. It does NOT include execution history.

Importing a workflow

Navigate to Automation > Workflows (Definitions tab).
Click the Import button in the page header.
Optionally enter a name override to rename the workflow on import.
Click Import.

The imported workflow is always created as a draft (not published). Review and publish it after verifying the steps are correct.
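
The export and import rules can be sketched as two functions over a definition dictionary. This is an illustrative model, not the product's file format: the key name trigger_type and the overall dictionary shape are assumptions, while the other field names appear elsewhere in this manual.

```python
import copy
from typing import Any, Optional

# Fields carried by an export; execution history is deliberately absent.
EXPORT_FIELDS = (
    "name",
    "description",
    "steps",
    "variables",
    "default_timeout_seconds",
    "canary_percentage",
    "canary_success_threshold",
    "trigger_type",  # assumed key name
)


def export_definition(definition: dict[str, Any]) -> dict[str, Any]:
    """Copy only the exportable fields from a definition."""
    return {key: copy.deepcopy(definition[key]) for key in EXPORT_FIELDS if key in definition}


def import_definition(payload: dict[str, Any], name_override: Optional[str] = None) -> dict[str, Any]:
    """Create a draft from an exported payload, optionally renaming it."""
    draft = copy.deepcopy(payload)
    if name_override:
        draft["name"] = name_override
    draft["is_published"] = False  # imports always start as drafts
    return draft
```

Round-tripping a published definition through export_definition and import_definition yields an unpublished draft with no instance history attached.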

Trigger Types

Every workflow definition has a trigger type that describes how it is intended to be invoked:

Trigger Type	Description
`manual`	Invoked manually by an operator (default)
`alert_rule`	Linked to an alert rule for automated response
`monitor_action`	Triggered by a monitoring action
`service_request`	Fulfills an ITSM service request
`runbook`	Used as a runbook for documented procedures
`schedule`	Scheduled for periodic execution
`automation_decision`	Launched from a confidence-engine decision-backed remediation/workflow path

Set the trigger type in the workflow editor under Settings > Trigger Type. This field is informational — it does not change execution behavior.
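
When handling trigger types in scripts or integrations, an enumeration keeps the allowed values in one place. This is a sketch: the class and helper names are invented for the example, and only the string values and the manual default come from the table above.

```python
from enum import Enum
from typing import Optional


class TriggerType(str, Enum):
    MANUAL = "manual"
    ALERT_RULE = "alert_rule"
    MONITOR_ACTION = "monitor_action"
    SERVICE_REQUEST = "service_request"
    RUNBOOK = "runbook"
    SCHEDULE = "schedule"
    AUTOMATION_DECISION = "automation_decision"


def parse_trigger_type(value: Optional[str]) -> TriggerType:
    """Map a raw string to a TriggerType, defaulting to manual when unset."""
    return TriggerType(value) if value else TriggerType.MANUAL
```

An unrecognized string raises ValueError, which surfaces typos early. Because the field is informational, nothing here should be used to decide how a workflow executes.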

Confidence Engine Workspace

Operators with automation.view can review the current backend confidence decision record, governance summary, breaker state, scoped policies, and pattern validation for the alert-backed automation slice. Operators with automation.manage can also take review/admin actions where the backend already supports them.

Opening the workspace

Navigate to Automation > Confidence Review.
The queue is only visible when your session has automation.view.
The route now has five tabs:
Decision Queue
Governance
Policies
Patterns
LLM Bridge
Use the queue filters for:
organization ID
gate result
review status
execution status
domain
host ID
alert ID
Each queue row shows:
decision ID and domain
gate/review/execution status
effective, diagnosis, and action confidence
selected action summary
organization, review metadata, and timestamps

Inspecting decision detail

Click any queue row to open the right-side detail drawer.
The drawer shows:
decision confidence scores
reason codes
matched pattern name/key/pack provenance when the decision came from an authored pattern
selected action preview
candidate actions
policy snapshot
evidence snapshot
latest linked remediation execution state
feedback history
replay lineage
If the decision is pattern-backed, use the matched-pattern card in the selected-action preview to confirm the winning pattern key before approving or replaying the action.

Approving or rejecting pending-review decisions

Click Approve or Reject in the drawer. Each action opens a confirmation modal with an optional notes field.
Approval and rejection continue to use the existing remediation approval substrate:
if a linked remediation execution is already waiting for approval, the decision action updates that same execution
if no execution exists yet, approval creates the execution through the existing remediation engine
Reject does not rewrite history into a replay. The rejected decision remains as a terminal historical record.
After either action, the UI reloads both the queue row and the selected decision detail from the backend instead of trusting local optimistic state.

Replaying and leaving feedback

Replay is only shown for terminal decisions. The replay confirmation modal requires a reason.
Leave feedback opens a structured form with:
feedback type
optional notes
Review, reject, replay, and feedback actions are audited with actor identity, target decision, notes or reason, and resulting outcome metadata.

Governance, breakers, and policy/pattern maintenance

The Governance tab shows the current decision-quality summary, domain trends, recommendation cards, open breakers, and the current LLM status message from backend truth.
Use the breaker reset controls only after reviewing why the breaker opened. Reset persists an audit-linked reset record and reopens the corresponding live gating window.
The Policies tab lists account-wide and organization-scoped policies. Use it to tune thresholds, hourly caps, risk ceilings, and change-window guards without bypassing the existing remediation substrate.
The Patterns tab lists system/account patterns together with validation warnings or errors returned by the backend. Validation reflects current domain catalog, authored anchor-pack expectations, and candidate runbook visibility.
The LLM Bridge tab shows current bridge status, scoped provider connections, budgets, and recent usage. Manage-capable operators can create or update connections, run a live connection test, and create or update warning or hard-stop budgets from the same route.
Current limits still apply:
the bridge currently backs bounded causal investigation analysis plus connection, budget, and usage visibility
generated-action support is intentionally bounded to blueprint-backed generated workflow actions that validate, materialize, and promote through the existing remediation substrate

Managing LLM bridge connections and budgets

Open the LLM Bridge tab.
Review the bridge status banner first. If no active connection exists, the live alert path will stay heuristic-only.
Use Add connection to create a provider connection:
choose a clear display name
enter the provider endpoint when the adapter requires one or when you need an endpoint override
enter the default model or deployment name that the selected provider should use by default
set an allowlist of approved models
for azure_openai, also set api_version (required)
enter the provider secret, which is stored through PAM-backed credential storage instead of frontend-local or .env state
Click Test on a saved connection before treating it as usable. The saved row records the latest test status and any returned error message.
Create at least one budget if you want warnings or hard stops:
choose daily or monthly period
optionally scope by organization, connection, feature, or workflow
set any hard caps you want enforced for prompt, completion, total tokens, or cost
Use the recent usage table and summary cards to verify:
whether calls succeeded, failed, timed out, were schema-rejected, or were budget-blocked
total tokens and estimated/provider cost
which active budgets are approaching or exceeding their thresholds

Runtime-Proof Checklist

Trigger or seed an alert-backed remediation candidate so backend/core/alert_engine.py enters backend/core/confidence_engine/runtime.py.
Confirm a new AutomationDecision row exists before any linked RemediationExecution is created.
Open Automation > Confidence Review and verify the new decision appears in the Decision Queue with visible diagnosis, action, and effective confidence.
Open the drawer and confirm reason codes, matched pattern provenance when present, selected action preview, policy snapshot, evidence snapshot, feedback history, and replay lineage are readable from persisted backend state.
For a pending_review decision, approve or reject it from the drawer and confirm the linked remediation execution reuses the normal remediation approval/cancel path instead of a second execution path.
Use Governance to confirm the decision contributes to decision-based summary metrics and any recommendation/open-breaker state.
If an org or host breaker is open, reset it from the UI and confirm the breaker closes and the live remediation safety shell re-evaluates from the reset timestamp.
Create or update a policy in Policies and a pattern in Patterns; confirm the UI refreshes from backend truth and validation feedback is surfaced without manual page reload.
If an authored pattern exists for the alert domain, confirm the selected action preview names that pattern and that the chosen runbook remains remediation-backed rather than launching through a second executor.
In LLM Bridge, create or reuse an active connection, run Test, and create a budget. Confirm the tab shows the saved connection state, budget state, and usage summary from backend truth.
Trigger a low-confidence metric alert in a scope with an active LLM connection. Confirm the resulting investigation uses analysis_method = llm, the decision detail evidence snapshot includes llm_bridge metadata, and the LLM Bridge usage table shows a metered event.
Confirm policy gating still applies to that LLM-assisted path:
with no allow_llm_assist opt-in, the decision is capped to suggestion-only
with allow_llm_assist enabled and a runbook marked require_human_review_when_llm_used, the decision is capped to pending review instead of auto-executing
Trigger or seed a blueprint-backed generated workflow candidate in one of the shipped anchor domains and confirm the selected action is validated before execution:
the initial decision detail should show generated-action provenance in selected_action_preview
invalid generated payloads should discard cleanly with generated_action_validation_failed
Let the validated generated workflow execute successfully and confirm promotion:
the materialized org-scoped runbook becomes active
a later matching evaluation resolves the promoted runbook instead of materializing the blueprint again

Host Context Variables

When a workflow starts, the engine automatically populates per-host context variables from the Host model. These are available for variable substitution in script and command steps without the operator needing to manually define them. Access via {{hosts.hostname}}, {{hosts.primary_ip}}, etc.
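
To illustrate how {{hosts.*}} placeholders might resolve for a single host, here is a minimal substitution sketch. It is not the engine's template renderer: leaving unknown placeholders untouched is an assumption made for the example, and the host dictionary stands in for the Host model.

```python
import re

# Matches placeholders such as {{hosts.hostname}} or {{ hosts.primary_ip }}.
_PLACEHOLDER = re.compile(r"\{\{\s*hosts\.(\w+)\s*\}\}")


def render_for_host(template: str, host: dict) -> str:
    """Replace {{hosts.<field>}} placeholders with values from one host record."""

    def replace(match: re.Match) -> str:
        key = match.group(1)
        # Assumption: unknown fields are left as-is rather than blanked.
        return str(host[key]) if key in host else match.group(0)

    return _PLACEHOLDER.sub(replace, template)
```

For a host with hostname web-01 and primary_ip 10.0.0.5, the template "ping {{hosts.primary_ip}} # {{hosts.hostname}}" renders as "ping 10.0.0.5 # web-01".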

Stale Job Handling

The background workflow processor automatically detects and cleans up stale jobs:

Running jobs: If an agent-executed job has been running for more than twice its configured timeout, it is marked as timeout. The agent is assumed unreachable.
Queued jobs: If a job has remained in queued status for more than 30 minutes past its scheduled time, it is marked as failed. The agent likely never received the job.

Both cases fire workflow callbacks so the parent workflow can advance (typically to its on_failure path).
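
The two stale-job rules can be expressed as a single classification function. This is a sketch with an assumed job record shape (status, started_at, scheduled_at, timeout_seconds); the background processor's real data model may differ.

```python
from datetime import datetime, timedelta
from typing import Optional

QUEUED_GRACE = timedelta(minutes=30)


def stale_status(job: dict, now: datetime) -> Optional[str]:
    """Return the status a stale job should be moved to, or None if it is not stale."""
    if job["status"] == "running":
        # Running for more than twice the configured timeout.
        limit = timedelta(seconds=2 * job["timeout_seconds"])
        if now - job["started_at"] > limit:
            return "timeout"
    elif job["status"] == "queued":
        # Still queued more than 30 minutes past the scheduled time.
        if now - job["scheduled_at"] > QUEUED_GRACE:
            return "failed"
    return None
```

A job with a 20-minute timeout that has been running for an hour is classified as timeout, while a job queued 15 minutes past its scheduled time is left alone.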

Cross-References

Related Domain	Manual
Jobs & Scripts	jobs-scripts.md
ITSM	itsm.md