Automation
Workflow automation, approvals, exception handling, and the guardrails needed to turn repeatable work into reliable execution.
Scope
Automation should remove repetitive work without creating opaque or dangerous operations. This guide keeps the operator-facing automation model while excluding private execution interfaces.
Domain: Workflow Engine & Automation Control Plane
See also: Jobs & Scripts Manual | ITSM Manual
Creating a Workflow Definition
- Navigate to Automation > Workflows.
- Click “New Workflow”.
- Fill in the name, description, and organization scope.
The Workflows landing page also provides shortcuts for operators who want a different starting point: “Open visual designer” opens the editor in visual mode, “Create template” starts a reusable draft, and the reusable-templates panel plus template library surface existing template definitions for cloning or review. The search box on this page covers definitions only; the instances and approvals lists do not offer search.
Each step requires:
- id: Unique string identifier within the workflow.
- type: Step type (see below).
- config: Type-specific configuration.
- on_success: Next step ID on success. Omit or set null for a terminal successful completion.
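The step requirements above can be sketched as a small validation routine. This is a minimal illustration, assuming steps are held as a list of dictionaries; the function name validate_steps and the sample step type strings are hypothetical and not taken from the product.

```python
from typing import Any


def validate_steps(steps: list[dict[str, Any]]) -> list[str]:
    """Return a list of problems found in a step list (empty means valid)."""
    problems: list[str] = []
    seen: set[str] = set()
    for index, step in enumerate(steps):
        # Every step needs an id, a type, and a type-specific config.
        for field in ("id", "type", "config"):
            if field not in step:
                problems.append(f"step {index}: missing '{field}'")
        step_id = step.get("id")
        if step_id in seen:
            problems.append(f"step {index}: duplicate id '{step_id}'")
        if step_id is not None:
            seen.add(step_id)
    # on_success may be omitted or None (terminal success); otherwise it
    # must name another step in the same workflow.
    for step in steps:
        target = step.get("on_success")
        if target is not None and target not in seen:
            problems.append(
                f"step '{step.get('id')}': unknown on_success '{target}'"
            )
    return problems


# Hypothetical two-step workflow: the second step is terminal.
example_steps = [
    {"id": "check", "type": "script", "config": {}, "on_success": "notify"},
    {"id": "notify", "type": "notification", "config": {}},
]
problems = validate_steps(example_steps)
```

Running a check like this before saving a draft would catch dangling on_success references early, although the product may apply different or stricter rules.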
Building Workflow Steps
Script step:
Command step:
Approval step:
Conditional step:
Delay step:
PAM checkout step:
PAM checkin step:
Parallel step:
Notification step:
Publishing a Workflow
- Test the workflow in draft mode by running instances.
- When satisfied, click Publish in the WorkflowEditor toolbar (requires workflows.edit permission). A confirmation modal explains the lock.
- You can still toggle is_active and is_published.
- To make changes, clone the definition (or use the Clone action in the definitions table) and edit the clone.
- In the editor Settings menu, set the default workflow timeout, optional canary rollout fields, and the template-library flag before saving the draft. Existing unpublished workflows can also be promoted into the template library by toggling that flag and saving.
If you open the editor from the Workflows page starter cards, the URL can carry ?mode=visual to land on the canvas or ?template=1 to preselect the template flag for a new draft. Template drafts are snapshot-based, so cloning captures the current definition as a new org-specific workflow and later edits do not automatically change older clones. The template library is fetched separately from the Workflows landing page definition list, so it stays accurate even when the main definitions table is filtered or paged.
Running Workflows
Manual start:
- Groups are expanded to host IDs at start time.
- The workflow definition is frozen (snapshot) at start time.
Deferred start (WP-5):
Scheduled start: Set up a recurring schedule that targets a workflow.
Triggered start: Workflows can be triggered by:
- Runbook actions (via monitor rules)
- Manual API calls
Monitoring Workflow Instances
- Current step and status
- Step results for each completed step
- Variables (with sensitive values redacted)
- Associated jobs and their statuses
- Admin/owner users see all approvals and can respond regardless of approver config.
- If approvers is empty, any user with workflows.execute permission can respond (legacy behavior).
- The approvals list is paginated with skip and limit, so large queues can be browsed without loading every record at once. The Workflows page pagination controls navigate real server-side pages.
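The approval rules and the skip/limit paging described above can be sketched as follows. This is an illustration only: the user dictionary shape, the role names, and the rule that a non-empty approvers list is matched by user ID are assumptions, and page_params is a hypothetical helper.

```python
from typing import Any


def can_respond(user: dict[str, Any], approvers: list[str]) -> bool:
    """Decide whether a user may respond to an approval request."""
    # Admin/owner users can respond regardless of approver config.
    if user["role"] in {"admin", "owner"}:
        return True
    # Legacy behavior: an empty approvers list admits anyone who holds
    # the workflows.execute permission.
    if not approvers:
        return "workflows.execute" in user["permissions"]
    # Assumption: a non-empty list is matched against the user's ID.
    return user["id"] in approvers


def page_params(page: int, page_size: int) -> dict[str, int]:
    """Translate a 1-based page number into skip/limit query values."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"skip": (page - 1) * page_size, "limit": page_size}
```

For example, page 3 with 25 rows per page maps to skip 50 and limit 25.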
Simulation / Dry-Run
- Click Simulate in the WorkflowEditor toolbar (requires workflows.execute permission). Select target hosts in the dialog and click Run Simulation.
- The engine walks all steps, predicting outcomes without any side effects.
- Script/command steps record what WOULD be dispatched (host IDs, job counts); no actual Jobs are created.
- Approval steps record would_pause_for_approval; no approval request is created and the simulation does not pause.
- Conditional steps evaluate expressions against current variables and route accordingly.
- Delay steps record would_delay; there is no pause and the simulation advances immediately.
- PAM steps record what WOULD happen; no sessions are opened and no credentials are decrypted.
- Notification steps record would_notify; no notification is sent.
Use simulation mode to validate new workflows before production deployment.
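To make the dry-run behavior concrete, the walker below records a prediction per step instead of executing anything. It is a simplified sketch, not the engine's implementation: the step type strings, the conditional config shape (variable/then/else), the one-job-per-host count, and the choice of the first listed step as the entry point are all assumptions, and PAM and parallel steps are left out.

```python
from typing import Any, Optional


def simulate(
    steps: list[dict[str, Any]],
    host_ids: list[int],
    variables: dict[str, Any],
) -> list[dict[str, Any]]:
    """Walk steps from the first one and record predicted actions only."""
    by_id = {step["id"]: step for step in steps}
    report: list[dict[str, Any]] = []
    visited: set[str] = set()
    current: Optional[str] = steps[0]["id"] if steps else None
    while current is not None and current not in visited:
        visited.add(current)  # guard against routing loops
        step = by_id[current]
        kind = step["type"]
        entry: dict[str, Any] = {"step": current, "type": kind}
        next_id = step.get("on_success")
        if kind in ("script", "command"):
            # Record what would be dispatched; no jobs are created.
            entry["would_dispatch"] = {
                "host_ids": list(host_ids),
                "job_count": len(host_ids),  # assumed: one job per host
            }
        elif kind == "approval":
            entry["would_pause_for_approval"] = True
        elif kind == "delay":
            entry["would_delay"] = True
        elif kind == "notification":
            entry["would_notify"] = True
        elif kind == "conditional":
            # Hypothetical config: branch on the truthiness of a variable.
            config = step["config"]
            matched = bool(variables.get(config["variable"]))
            entry["condition_result"] = matched
            next_id = config["then"] if matched else config.get("else")
        report.append(entry)
        current = next_id
    return report
```

Because each branch only appends to the report, the function has no side effects beyond its return value, which is the property the simulation mode depends on.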
Retrying Failed Workflows
- Only instances in failed status can be retried.
- The failed step execution is reset and the engine reprocesses from that step.
- If no failed step is found, the engine restarts from the beginning.
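The retry rules above amount to a small decision about where to resume. The sketch below assumes a hypothetical instance dictionary with status, first_step_id, and step_executions fields, and assumes that a reset execution returns to a pending status; none of these names come from the product.

```python
from typing import Any


def retry_start_step(instance: dict[str, Any]) -> str:
    """Return the step ID a retry should resume from, resetting it first."""
    if instance["status"] != "failed":
        raise ValueError("only failed instances can be retried")
    for execution in instance["step_executions"]:
        if execution["status"] == "failed":
            # Reset the failed execution so the engine reprocesses it.
            execution["status"] = "pending"  # assumed reset status
            return execution["step_id"]
    # No failed step recorded: restart from the beginning.
    return instance["first_step_id"]
```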
Canary Rollout
- Set canary_percentage on the workflow definition (e.g., 10 for 10%).
- Set canary_success_threshold (default 80%).
- On execution, the engine dispatches to canary hosts first.
- If success rate meets threshold, remaining hosts are dispatched.
- If below threshold, the step fails and routes to on_failure.
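The canary arithmetic can be illustrated with a short sketch. How the engine actually selects canary hosts and rounds the count is not stated in this guide, so taking the first hosts in order, rounding up, and always using at least one canary host are assumptions here.

```python
import math


def split_canary(
    host_ids: list[int], canary_percentage: float
) -> tuple[list[int], list[int]]:
    """Split hosts into a canary batch and the remaining batch."""
    if not host_ids:
        return [], []
    # Assumed: round up and always use at least one canary host.
    count = max(1, math.ceil(len(host_ids) * canary_percentage / 100))
    return host_ids[:count], host_ids[count:]


def canary_passed(succeeded: int, total: int, threshold: float = 80.0) -> bool:
    """Check whether the canary success rate meets the threshold percent."""
    if total == 0:
        return False
    # Compare without dividing to avoid floating-point rounding surprises.
    return succeeded * 100 >= threshold * total


canary, remaining = split_canary(list(range(20)), 10)
```

With 20 hosts and canary_percentage set to 10, this yields two canary hosts and 18 remaining hosts. If four of five canary hosts succeed, the rate is exactly 80% and the default threshold is met.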
Automation Kill Switch
Pausing Automation
When you need to immediately stop all automated remediation and patch deployment for an organization:
- Effect is immediate on the next scheduler cycle.
- Affects: auto-remediation, auto-deploy. Does NOT affect: manual job creation, manual workflow execution.
- When the kill switch is activated, any currently running workflow instances will be paused (with reason ‘Kill switch active’) at the next step boundary. They are not terminated — resume them after clearing the kill switch.
- Permission required: settings.manage.
Resuming Automation
- All automated processes resume on the next scheduler cycle.
- Permission required: settings.manage.
Checking Status
In the Workflows page, if initial data loads fail (definitions, instances, approvals, org list, or automation status), an error banner is shown with a Retry button so operators can recover without refreshing the browser.
Auto-Retry on Transient Failures (A-01)
Steps can automatically retry on transient errors (timeouts, network issues, 503/504 responses) without operator intervention. Configure per-step in the step config:
- retry_count: Maximum number of automatic retry attempts (default 0, which disables retries).
- retry_delay_seconds: Seconds to wait between retries (default 30).
- Retries only fire for transient errors. Configuration errors, permission errors, and “not found” errors are never retried.
- The instance shows pause_reason = "retry_delay" during the wait between retries.
- After exhausting all retries, the step routes to on_failure as normal.
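The retry semantics above can be sketched as a small loop. This is an illustration under assumptions: TransientError is a hypothetical marker for timeouts, network issues, and 503/504 responses, and the injectable sleep parameter exists only so the wait can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for timeouts, network issues, 503/504 responses."""


def run_with_retry(
    action: Callable[[], T],
    retry_count: int = 0,
    retry_delay_seconds: float = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run action, retrying only on TransientError up to retry_count times."""
    attempts = 0
    while True:
        try:
            return action()
        except TransientError:
            if attempts >= retry_count:
                # Retries exhausted: the caller would route to on_failure.
                raise
            attempts += 1
            # The instance would show pause_reason = "retry_delay" here.
            sleep(retry_delay_seconds)
```

Any other exception type, such as a permission or configuration error, propagates immediately without a retry, matching the rule that non-transient errors are never retried.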
a. Workflow-Level Timeout (A-03)
Set default_timeout_seconds on a workflow definition to cap total execution time. If the workflow runs longer than this threshold, it is terminated with status timeout (terminal). All pending step executions and PAM sessions are cleaned up.
Instances with no default_timeout_seconds (or null) run without a time limit.
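The timeout rule can be expressed as a single comparison. The sketch below is illustrative; has_timed_out is a hypothetical helper, and whether the engine treats the exact boundary as a timeout is not stated in this guide (the sketch uses a strict "longer than" test).

```python
from datetime import datetime, timedelta
from typing import Optional


def has_timed_out(
    started_at: datetime,
    default_timeout_seconds: Optional[int],
    now: datetime,
) -> bool:
    """Return True when an instance has run longer than its timeout."""
    # No timeout configured (or null): the instance runs without a limit.
    if default_timeout_seconds is None:
        return False
    return now - started_at > timedelta(seconds=default_timeout_seconds)
```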
Permission Reference
| Action | Required Permission |
|---|---|
| Create workflow definition | workflows.create |
| Import workflow definition | workflows.create |
| View definitions, instances, approvals | workflows.view |
| Export workflow definition | workflows.view |
| Edit/publish definitions | workflows.edit |
| Delete definitions | workflows.delete |
| Start, cancel, pause, resume, retry instances | workflows.execute |
| Respond to approvals | workflows.execute |
| Run simulation | workflows.execute |
| View confidence decisions | automation.view |
| Approve, reject, replay, or leave feedback on confidence decisions | automation.manage |
| Pause/resume automation kill switch | settings.manage |
| Check automation status | settings.view |
UI Features (WP-6, updated 2026-03-18)
Publishing a Workflow
- Open a saved workflow in the editor.
- Click Publish in the toolbar (requires workflows.edit permission). The button is hidden on new (unsaved) workflows and on already-published workflows.
- Confirm in the modal; this makes the definition immutable.
- Published workflows show a “Published” badge in the editor header. Save and all step-editing controls are disabled.
Version History
- Click Versions in the toolbar of a saved workflow (requires workflows.view permission).
- A modal shows all version snapshots with version number, timestamp, and step count.
- Versions are created automatically each time the workflow is updated.
Cloning a Workflow
- From the workflow definitions table, click the Clone action for any definition.
- A new definition is created with the same steps and variables, the name appended with “ (Copy)”, and is_published = false.
- Clones are created as local drafts in the caller’s org, with is_template = false.
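The cloning rules above can be sketched as a pure function. The dictionary layout and the org_id field are assumptions for illustration; only the name suffix and the two flags come from this guide.

```python
import copy
from typing import Any


def clone_definition(
    definition: dict[str, Any], caller_org_id: str
) -> dict[str, Any]:
    """Build a local draft copy of a workflow definition."""
    return {
        "name": f"{definition['name']} (Copy)",
        "org_id": caller_org_id,  # hypothetical field for the caller's org
        # Deep copies keep later edits to the clone from touching the source.
        "steps": copy.deepcopy(definition["steps"]),
        "variables": copy.deepcopy(definition["variables"]),
        "is_published": False,
        "is_template": False,
    }
```

Copying the steps and variables rather than sharing them reflects the snapshot behavior: later edits to either definition do not change the other.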
Simulation / Dry-Run
- Click Simulate in the toolbar of a saved workflow (requires workflows.execute permission).
- Select target hosts and click Run Simulation.
- The simulation report shows predicted actions per step without executing anything.
Running a Workflow
- From the workflows list, click the Play button, or from the editor click Run.
- Navigate to the Run Workflow page (the relevant workflow).
- Select target hosts using the searchable multi-select (server-side search by hostname or IP) and/or host groups.
- Optionally override workflow variables.
- Optionally enable Schedule for later and pick a date/time for deferred start.
- Click Run Workflow (or Schedule Workflow if deferred).
Instance Detail / Step Timeline
From the Instances tab:
- Click the workflow name or any instance row to open the detail panel for that execution.
- Use the explicit Open workflow definition action when you want the workflow editor/definition instead of the execution record.
- Failed rows surface an inline error preview directly in the table so operators can triage without leaving the list.
The detail panel shows:
- Overall status, start time, elapsed time
- Workflow-level error_message when the instance failed or timed out
- Step-by-step execution timeline with per-step status, duration, and links to associated jobs
- Variables (sensitive values redacted)
Kill Switch (Feature Flags page)
- Navigate to Support Portal > Management > Feature Flags (the relevant workflow) or RMM Portal > Settings > Feature Flags (the relevant workflow)
- Each organization shows its automation pause status
- Click Pause to halt all automation (reason required)
- Click Resume to re-enable automation
a. Exporting and Importing Workflows
Exporting a workflow
- Navigate to Automation > Workflows (Definitions tab).
- Click the Export (download) icon next to the workflow you want to export.
Export includes: name, description, steps, variables, timeout, canary settings, and trigger type. It does NOT include execution history.
Importing a workflow
- Navigate to Automation > Workflows (Definitions tab).
- Click the Import button in the page header.
- Optionally enter a name override to rename the workflow on import.
- Click Import.
The imported workflow is always created as a draft (not published). Review and publish it after verifying the steps are correct.
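The export and import rules above can be sketched as a round trip. The export file format is not described in this guide, so JSON is an assumption here, as are the trigger_type key and the mapping of "timeout" and "canary settings" to default_timeout_seconds, canary_percentage, and canary_success_threshold.

```python
import json
from typing import Any, Optional

# Fields carried by an export; execution history is deliberately absent.
EXPORTED_FIELDS = (
    "name",
    "description",
    "steps",
    "variables",
    "default_timeout_seconds",
    "canary_percentage",
    "canary_success_threshold",
    "trigger_type",  # hypothetical key name
)


def export_definition(definition: dict[str, Any]) -> str:
    """Serialize only the exportable fields of a definition."""
    payload = {key: definition.get(key) for key in EXPORTED_FIELDS}
    return json.dumps(payload, indent=2)


def import_definition(
    document: str, name_override: Optional[str] = None
) -> dict[str, Any]:
    """Create a draft definition from an exported document."""
    data = json.loads(document)
    imported = {key: data.get(key) for key in EXPORTED_FIELDS}
    if name_override:
        imported["name"] = name_override
    # Imports always land as drafts, whatever the source's state was.
    imported["is_published"] = False
    return imported
```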
b. Trigger Types
Every workflow definition has a trigger type that describes how it is intended to be invoked:
| Trigger Type | Description |
|---|---|
| manual | Invoked manually by an operator (default) |
| alert_rule | Linked to an alert rule for automated response |
| monitor_action | Triggered by a monitoring action |
| service_request | Fulfills an ITSM service request |
| runbook | Used as a runbook for documented procedures |
| schedule | Scheduled for periodic execution |
| automation_decision | Launched from a confidence-engine decision-backed remediation/workflow path |
Set the trigger type in the workflow editor under Settings > Trigger Type. This field is informational — it does not change execution behavior.
c. Confidence Engine Workspace
Operators with automation.view can review the current backend confidence decision record, governance summary, breaker state, scoped policies, and pattern validation for the alert-backed automation slice. Operators with automation.manage can also take review/admin actions where the backend already supports them.
Opening the workspace
- Navigate to Automation > Confidence Review (the relevant workflow).
- The queue is only visible when your session has automation.view.
- The route now has five tabs:
- Decision Queue
- Governance
- Policies
- Patterns
- LLM Bridge
- Use the queue filters for:
- organization ID
- gate result
- review status
- execution status
- domain
- host ID
- alert ID
- Each queue row shows:
- decision ID and domain
- gate/review/execution status
- effective, diagnosis, and action confidence
- selected action summary
- organization, review metadata, and timestamps
Inspecting decision detail
- Click any queue row to open the right-side detail drawer.
- The drawer shows:
- decision confidence scores
- reason codes
- matched pattern name/key/pack provenance when the decision came from an authored pattern
- selected action preview
- candidate actions
- policy snapshot
- evidence snapshot
- latest linked remediation execution state
- feedback history
- replay lineage
- If the decision is pattern-backed, use the matched-pattern card in the selected-action preview to confirm the winning pattern key before approving or replaying the action.
Approving or rejecting pending-review decisions
- Click Approve or Reject in the drawer. Each action opens a confirmation modal with an optional notes field.
- Approval and rejection continue to use the existing remediation approval substrate:
- if a linked remediation execution is already waiting for approval, the decision action updates that same execution
- if no execution exists yet, approval creates the execution through the existing remediation engine
- Reject does not rewrite history into a replay. The rejected decision remains as a terminal historical record.
- After either action, the UI reloads both the queue row and the selected decision detail from the backend instead of trusting local optimistic state.
Replaying and leaving feedback
- Replay is only shown for terminal decisions. The replay confirmation modal requires a reason.
- Leave feedback opens a structured form with:
- feedback type
- optional notes
- Review, reject, replay, and feedback actions are audited with actor identity, target decision, notes or reason, and resulting outcome metadata.
Governance, breakers, and policy/pattern maintenance
- The Governance tab shows the current decision-quality summary, domain trends, recommendation cards, open breakers, and the current LLM status message from backend truth.
- Use the breaker reset controls only after reviewing why the breaker opened. Reset persists an audit-linked reset record and reopens the corresponding live gating window.
- The Policies tab lists account-wide and organization-scoped policies. Use it to tune thresholds, hourly caps, risk ceilings, and change-window guards without bypassing the existing remediation substrate.
- The Patterns tab lists system/account patterns together with validation warnings or errors returned by the backend. Validation reflects current domain catalog, authored anchor-pack expectations, and candidate runbook visibility.
- The LLM Bridge tab shows current bridge status, scoped provider connections, budgets, and recent usage. Manage-capable operators can create or update connections, run a live connection test, and create or update warning or hard-stop budgets from the same route.
- Current truth limits still matter:
- the bridge currently backs bounded causal investigation analysis plus connection, budget, and usage visibility
- generated-action support is intentionally bounded to blueprint-backed generated workflow actions that validate, materialize, and promote through the existing remediation substrate
Managing LLM bridge connections and budgets
- Open the LLM Bridge tab.
- Review the bridge status banner first. If no active connection exists, the live alert path will stay heuristic-only.
- Use Add connection to create a provider connection:
- choose a clear display name
- enter the provider endpoint when the adapter requires one or when you need an endpoint override
- enter the default model or deployment name that the selected provider should use by default
- set an allowlist of approved models
- azure_openai requires api_version
- enter the provider secret, which is stored through PAM-backed credential storage instead of frontend-local or .env state
- Click Test on a saved connection before treating it as usable. The saved row records the latest test status and any returned error message.
- Create at least one budget if you want warnings or hard stops:
- choose daily or monthly period
- optionally scope by organization, connection, feature, or workflow
- set any hard caps you want enforced for prompt, completion, total tokens, or cost
- Use the recent usage table and summary cards to verify:
- whether calls succeeded, failed, timed out, were schema-rejected, or were budget-blocked
- total tokens and estimated/provider cost
- which active budgets are approaching or exceeding their thresholds
Runtime-Proof Checklist
- Trigger or seed an alert-backed remediation candidate so backend/core/alert_engine.py enters backend/core/confidence_engine/runtime.py.
- Confirm a new AutomationDecision row exists before any linked RemediationExecution is created.
- Open the relevant workflow and verify the new decision appears in the Decision Queue with visible diagnosis, action, and effective confidence.
- Open the drawer and confirm reason codes, matched pattern provenance when present, selected action preview, policy snapshot, evidence snapshot, feedback history, and replay lineage are readable from persisted backend state.
- For a pending_review decision, approve or reject it from the drawer and confirm the linked remediation execution reuses the normal remediation approval/cancel path instead of a second execution path.
- Use Governance to confirm the decision contributes to decision-based summary metrics and any recommendation/open-breaker state.
- If an org or host breaker is open, reset it from the UI and confirm the breaker closes and the live remediation safety shell re-evaluates from the reset timestamp.
- Create or update a policy in Policies and a pattern in Patterns; confirm the UI refreshes from backend truth and validation feedback is surfaced without manual page reload.
- If an authored pattern exists for the alert domain, confirm the selected action preview names that pattern and that the chosen runbook remains remediation-backed rather than launching through a second executor.
- In LLM Bridge, create or reuse an active connection, run Test, and create a budget. Confirm the tab shows the saved connection state, budget state, and usage summary from backend truth.
- Trigger a low-confidence metric alert in a scope with an active LLM connection. Confirm the resulting investigation uses analysis_method = llm, the decision detail evidence snapshot includes llm_bridge metadata, and the LLM Bridge usage table shows a metered event.
- Confirm policy gating still applies to that LLM-assisted path:
- with no allow_llm_assist opt-in, the decision is capped to suggestion-only
- with allow_llm_assist enabled and a runbook marked require_human_review_when_llm_used, the decision is capped to pending review instead of auto-executing
- Trigger or seed a blueprint-backed generated workflow candidate in one of the shipped anchor domains and confirm the selected action is validated before execution:
- the initial decision detail should show generated-action provenance in selected_action_preview
- invalid generated payloads should discard cleanly with generated_action_validation_failed
- Let the validated generated workflow execute successfully and confirm promotion:
- the materialized org-scoped runbook becomes active
- a later matching evaluation resolves the promoted runbook instead of materializing the blueprint again
Host Context Variables
When a workflow starts, the engine automatically populates per-host context variables from the Host model. These are available for variable substitution in script and command steps without the operator needing to manually define them. Access via {{hosts.hostname}}, {{hosts.primary_ip}}, etc.
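As an illustration of how this kind of substitution might work, the sketch below fills {{hosts.<field>}} placeholders from a per-host dictionary. The render_for_host function, the regular expression, and the choice to leave unknown placeholders untouched are assumptions; the engine's actual templating rules may differ.

```python
import re
from typing import Any

# Matches placeholders such as {{hosts.hostname}} or {{ hosts.primary_ip }}.
PLACEHOLDER = re.compile(r"\{\{\s*hosts\.(\w+)\s*\}\}")


def render_for_host(template: str, host: dict[str, Any]) -> str:
    """Substitute host context variables into a script or command string."""

    def replace(match: re.Match) -> str:
        key = match.group(1)
        if key not in host:
            # Assumed behavior: leave unknown placeholders as written.
            return match.group(0)
        return str(host[key])

    return PLACEHOLDER.sub(replace, template)


host = {"hostname": "web-01", "primary_ip": "10.0.0.5"}
command = render_for_host("ping -c 1 {{hosts.primary_ip}}", host)
```

Rendering once per host means the same step definition can be dispatched to every target with that host's own values filled in.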
Stale Job Handling
The background workflow processor automatically detects and cleans up stale jobs:
- Running jobs: If an agent-executed job has been running for >2x its configured timeout, it is marked as timeout. The agent is assumed unreachable.
- Queued jobs: If a job has been in queued status for >30 minutes past its scheduled time, it is marked as failed. The agent likely never received the job.
Both cases fire workflow callbacks so the parent workflow can advance (typically to its on_failure path).
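The two stale-job thresholds can be sketched as a classifier. The job dictionary fields (started_at, scheduled_at, timeout_seconds) are hypothetical names chosen for the example; only the 2x timeout and 30-minute rules come from this guide.

```python
from datetime import datetime, timedelta
from typing import Any, Optional


def stale_status(job: dict[str, Any], now: datetime) -> Optional[str]:
    """Return the status a stale job should be marked with, or None."""
    if job["status"] == "running":
        # Running for more than twice the configured timeout.
        limit = timedelta(seconds=2 * job["timeout_seconds"])
        if now - job["started_at"] > limit:
            return "timeout"
    elif job["status"] == "queued":
        # Queued for more than 30 minutes past the scheduled time.
        if now - job["scheduled_at"] > timedelta(minutes=30):
            return "failed"
    return None
```

A background processor could call this for each open job and, when a status is returned, update the job and fire the workflow callback so the parent workflow can advance.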
Cross-References
| Related Domain | Manual |
|---|---|
| Jobs & Scripts | jobs-scripts.md |
| ITSM | itsm.md |