All posts

The Core

In the last post, I closed by saying that managed credentials really pay off when they are paired with automation. That is the other half of the picture. This is the post about the engine that makes the platform work.

A Core Tenet From Day One

The workflow engine was a core tenet of the platform before almost anything else got built. The idea was simple. If you have a flexible workflow and job delivery system, you can extend the agent as far as you need without recompiling it. The agent stays small. Production hosts get fewer updates. New capability is added by writing a workflow, not by rolling out a new agent build to ten thousand servers.

That decision shaped everything that came after. Patching is a workflow. Remediation is a workflow. Running a privileged maintenance task on a fleet of hosts is a workflow. The platform is not a pile of features that happen to share a database. The features are workflows that happen to share an engine.

What the Engine Actually Does

A workflow is a graph of steps. Each step has a type, a config, an on_success, and an on_failure. The engine walks the graph, dispatches work, waits for results, substitutes variables, handles failures, and decides where to go next. There is nothing magical about it. There is just a lot of plumbing that has to be right.
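The graph walk above can be sketched in a few lines. This is a minimal illustration, not the platform's actual implementation: the field names (type, config, on_success, on_failure) come from the description above, while the handler table and step bodies are invented for the example.

```python
# Minimal sketch of the step-graph walk. Field names (type, config,
# on_success, on_failure) follow the post; everything else is illustrative.

def run_workflow(steps, start, handlers):
    """Walk the graph from `start`, following on_success/on_failure."""
    trace = []
    current = start
    while current is not None:
        step = steps[current]
        ok = handlers[step["type"]](step.get("config", {}))
        trace.append((current, ok))
        current = step["on_success"] if ok else step["on_failure"]
    return trace

steps = {
    "check": {"type": "command", "config": {"cmd": "df -h"},
              "on_success": "clean", "on_failure": "notify"},
    "clean": {"type": "script", "config": {"script": "cleanup.sh"},
              "on_success": None, "on_failure": "notify"},
    "notify": {"type": "notification", "config": {"msg": "cleanup failed"},
               "on_success": None, "on_failure": None},
}

# Stub handlers that pretend every dispatch succeeds.
handlers = {
    "command": lambda cfg: True,
    "script": lambda cfg: True,
    "notification": lambda cfg: True,
}

print(run_workflow(steps, "check", handlers))
# [('check', True), ('clean', True)]
```

The engine's real job is making that loop survive restarts, retries, and partial failures; the shape of the walk is this simple.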

Some of the step types dispatch work down to the agent on a target host:

  • Script — execute a saved script on a host, a list of hosts, or a host group, with parameter substitution and an optional PAM credential to run as.
  • Command — same as script but for a single inline command, for the cases where you do not want to manage a script asset.

Others run inside the platform itself. They never touch an agent:

  • PAM Checkout — reach into the vault and bind a credential to a workflow variable so later steps can use it. Optionally require a human approval before the checkout completes.
  • PAM Checkin — release the credential and trigger automatic rotation on break-glass identities.
  • Approval — pause the workflow until a human says yes. Single approver, one-of-many, two-of-many, majority, or all-must-approve. Configurable timeout with auto-reject.
  • Conditional — branch on an expression evaluated against workflow variables. Real on_true and on_false paths, not a hack.
  • Parallel — fan out a group of sub-steps to run concurrently and wait for all of them before moving on.
  • Delay — wait for a configurable duration before continuing. Useful for stabilization windows after a change.
  • Webhook — call an external HTTP endpoint with templated headers and body.
  • Notification — send a notification through the platform’s messaging stack.
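To make the Conditional step concrete, here is an illustrative sketch of branching on workflow variables. The expression format and helper names are assumptions for the example, not the platform's actual syntax; only on_true and on_false come from the list above.

```python
import operator

# Hypothetical Conditional step evaluation. The config shape (var, op,
# value) is assumed; on_true/on_false are the real paths described above.

OPS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}

def eval_conditional(step, variables):
    """Return the next step id by evaluating the expression."""
    cfg = step["config"]
    result = OPS[cfg["op"]](variables.get(cfg["var"]), cfg["value"])
    return step["on_true"] if result else step["on_false"]

step = {
    "type": "conditional",
    "config": {"var": "disk_free_pct", "op": "lt", "value": 10},
    "on_true": "run_cleanup",
    "on_false": "done",
}

print(eval_conditional(step, {"disk_free_pct": 4}))   # run_cleanup
print(eval_conditional(step, {"disk_free_pct": 42}))  # done
```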

Variables flow through the whole graph. A script’s output can feed the next script’s parameters. A PAM checkout populates a credential variable that a downstream script step references by name to run as that user. The engine also exposes host context, so any step can read {{hosts.<id>.hostname}} and other safe fields of the target host without you having to plumb them manually.
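The substitution itself can be pictured as a dotted-path lookup into a nested context. This sketch mirrors the {{hosts.<id>.hostname}} syntax quoted above; the resolution rules and context layout are assumptions for illustration.

```python
import re

# Hypothetical {{dotted.path}} substitution against a nested context.
# The template syntax mirrors the post; the context shape is assumed.

def substitute(template, context):
    """Replace {{a.b.c}} tokens with values resolved from a nested dict."""
    def resolve(match):
        value = context
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)
    return re.sub(r"\{\{([\w.]+)\}\}", resolve, template)

context = {
    "hosts": {"h1": {"hostname": "web-01.example.net"}},
    "steps": {"checkout": {"credential": "svc_patching"}},
}

print(substitute("ssh {{steps.checkout.credential}}@{{hosts.h1.hostname}}", context))
# ssh svc_patching@web-01.example.net
```

Here a PAM checkout's output and the host context both feed a later step through the same mechanism, which is the point: one templating layer, every step type.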

Scalable SOPs

Workflows are scalable SOPs. Need to run the same procedure across ten servers, or a hundred, or a host group that grows on its own? The engine handles the fan-out, tracks every job individually, and rolls the results back up to the parent step. If three hosts succeed and one fails, you know exactly which one and why.
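The roll-up described above is easy to sketch. The per-host job result shape here is an assumption; the behavior — individual tracking, parent-step summary, failed hosts called out by name — is what the post describes.

```python
from collections import Counter

# Illustrative fan-out roll-up: per-host results summarized into a
# parent-step status. The result record shape is assumed.

def roll_up(job_results):
    """Summarize per-host job results for the parent step."""
    counts = Counter(r["status"] for r in job_results.values())
    failed = [h for h, r in job_results.items() if r["status"] == "failed"]
    return {
        "succeeded": counts.get("succeeded", 0),
        "failed": counts.get("failed", 0),
        "failed_hosts": failed,
    }

results = {
    "web-01": {"status": "succeeded"},
    "web-02": {"status": "succeeded"},
    "web-03": {"status": "succeeded"},
    "db-01": {"status": "failed", "error": "disk full"},
}

print(roll_up(results))
# {'succeeded': 3, 'failed': 1, 'failed_hosts': ['db-01']}
```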

For changes that are too risky to push everywhere at once, workflows support canary rollout. Dispatch to a configurable percentage of hosts first. If the success rate clears the threshold, the rest of the fleet goes. If it does not, the workflow stops and tells you why. Same engine, same definition, no separate “canary mode” tool to learn.
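The canary gate reduces to two small decisions: who goes first, and whether the rest follow. This sketch assumes a percentage split and a success-rate threshold, which matches the description above; the specific numbers are invented defaults.

```python
# Illustrative canary rollout: dispatch to a percentage of hosts first,
# then gate the remainder on a success-rate threshold. The percentages
# and threshold here are assumed example values.

def canary_plan(hosts, canary_pct):
    """Split the host list into a canary wave and the remainder."""
    n = max(1, round(len(hosts) * canary_pct / 100))
    return hosts[:n], hosts[n:]

def should_continue(canary_results, success_threshold_pct):
    """Decide whether the rest of the fleet gets the change."""
    ok = sum(1 for r in canary_results if r == "succeeded")
    return 100 * ok / len(canary_results) >= success_threshold_pct

hosts = [f"web-{i:02d}" for i in range(1, 11)]
canary, rest = canary_plan(hosts, canary_pct=20)
print(canary)                                           # ['web-01', 'web-02']
print(should_continue(["succeeded", "succeeded"], 90))  # True
print(should_continue(["succeeded", "failed"], 90))     # False
```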

Before you ever run a workflow against production, you can run it in simulation mode. The engine walks every step and predicts what would happen. Which jobs would dispatch. Which conditions would evaluate which way. Which approvals would block. All without touching a single host. It is the dry run that vendors usually promise and rarely deliver.
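Simulation is the same graph traversal with prediction in place of dispatch. This sketch reuses the step shapes from earlier in the post; the prediction record format is an assumption.

```python
# Illustrative dry-run walk: traverse the graph, record what would
# happen, touch nothing. The prediction strings are assumed.

def simulate(steps, start, variables):
    """Walk the graph without dispatching; return the predicted plan."""
    plan, current = [], start
    while current is not None:
        step = steps[current]
        if step["type"] == "conditional":
            taken = variables.get(step["config"]["var"]) == step["config"]["value"]
            plan.append((current, f"branch -> {'on_true' if taken else 'on_false'}"))
            current = step["on_true"] if taken else step["on_false"]
        else:
            plan.append((current, f"would dispatch {step['type']}"))
            current = step["on_success"]
    return plan

steps = {
    "patch": {"type": "script", "on_success": "verify", "on_failure": None},
    "verify": {"type": "conditional",
               "config": {"var": "env", "value": "prod"},
               "on_true": "approve", "on_false": None},
    "approve": {"type": "approval", "on_success": None, "on_failure": None},
}

for line in simulate(steps, "patch", {"env": "prod"}):
    print(line)
```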

Runbooks Are the Trigger Layer

Workflows describe what to do. Runbooks describe when to do it.

A runbook is the entry point that connects an event in the platform to a workflow. Disk space crosses a threshold. A heuristic rule fires. A monitoring alert reaches a specific severity. An investigation flags a probable cause. The runbook says “when this happens, run that.”

Simple cases stay simple. Low disk on a Linux host triggers a cleanup workflow. A stuck Windows service triggers a restart. The harder cases get harder workflows. Application cluster stops responding — restart the SQL database, drain the load balancer, recycle the IIS application pools, run a synthetic health check, page the on-call only if the synthetic still fails. Same engine, just more steps.
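The "when this happens, run that" mapping can be pictured as a table of trigger conditions. The trigger names, fields, and matching rules below are illustrative assumptions, not the platform's actual schema.

```python
# Hypothetical runbook trigger table: events in, workflow names out.
# Event types, fields, and matching rules are assumed for illustration.

RUNBOOKS = [
    {"event": "disk.low", "os": "linux", "workflow": "disk-cleanup"},
    {"event": "service.stuck", "os": "windows", "workflow": "service-restart"},
    {"event": "alert.severity", "min_severity": 2, "workflow": "page-oncall"},
]

def match_runbooks(event):
    """Return the workflows whose trigger conditions match this event."""
    matched = []
    for rb in RUNBOOKS:
        if rb["event"] != event["type"]:
            continue
        if "os" in rb and rb["os"] != event.get("os"):
            continue
        if "min_severity" in rb and event.get("severity", 0) < rb["min_severity"]:
            continue
        matched.append(rb["workflow"])
    return matched

print(match_runbooks({"type": "disk.low", "os": "linux"}))        # ['disk-cleanup']
print(match_runbooks({"type": "alert.severity", "severity": 3}))  # ['page-oncall']
```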

Runbooks come with the safety rails this kind of automation needs. Rate limits so a flapping alert cannot stampede the platform. Cooldown windows. Per-runbook approval requirements with auto-approval rules tied to investigation confidence or service health. Stabilization windows that watch a validation metric after the action runs. If the metric does not recover, the runbook escalates or rolls back automatically. Automated remediation is only useful if it cannot run away from you.
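The rate-limit and cooldown rails are worth sketching, because they are what keeps a flapping alert from becoming a stampede. This is a minimal in-memory illustration; window sizes are invented, and a real implementation would persist this state.

```python
# Illustrative rate-limit + cooldown guard for a runbook. All numbers
# are assumed example values; state would be persisted in practice.

class RunbookGuard:
    def __init__(self, max_runs, window_s, cooldown_s):
        self.max_runs, self.window_s, self.cooldown_s = max_runs, window_s, cooldown_s
        self.runs = []        # timestamps of recent executions
        self.last_run = None

    def allow(self, now):
        """Return True if the runbook may fire at time `now` (seconds)."""
        if self.last_run is not None and now - self.last_run < self.cooldown_s:
            return False      # still inside the cooldown window
        self.runs = [t for t in self.runs if now - t < self.window_s]
        if len(self.runs) >= self.max_runs:
            return False      # rate limit hit for this window
        self.runs.append(now)
        self.last_run = now
        return True

guard = RunbookGuard(max_runs=3, window_s=3600, cooldown_s=300)
print(guard.allow(now=0))    # True
print(guard.allow(now=60))   # False: within the 300 s cooldown
print(guard.allow(now=400))  # True
```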

Three Ways to Build a Workflow

The workflow editor has three views, and you can switch between them at any time.

The wizard is the step-by-step editor. It shows the steps in order, lets you pick a type, fills in the config form for whichever type you chose, and validates as you go. If you do not know what a step type does, this is where you start.

The visual view is a flow chart. You see the branches, the parallel groups, the success and failure paths, and you click a node to edit it. Useful for anything more complicated than a straight line, and very useful for showing a workflow to someone who is not the person who built it.

The JSON view is the raw definition. Read-only by design — the wizard and the visual editor are the source of truth — but it is there when you want to see exactly what is going to be saved, copy a definition out for source control, or paste one into a ticket.

Three views over the same definition. Pick the one that matches how you think.

Versioning, Audit, and the Things You Need at 2 AM

Every workflow is versioned. When you publish a version it becomes immutable. You can keep editing the definition, but the version that ran last night is the version that ran last night, and you can pull it up in the history tab and see exactly what executed. Running instances snapshot their definition at start time, so an edit mid-execution cannot mutate something already in flight.
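The snapshot-at-start behavior amounts to freezing a deep copy of the definition when the instance launches. This sketch uses invented structures to show the effect: an edit after launch does not reach the running instance.

```python
import copy

# Illustrative snapshot-at-start: the instance deep-copies the published
# definition, so later edits cannot mutate a run already in flight.

published = {"version": 12,
             "steps": {"a": {"type": "script", "on_success": None}}}

def start_instance(definition):
    """Freeze the definition for this run."""
    return {"definition": copy.deepcopy(definition), "status": "running"}

instance = start_instance(published)
published["steps"]["a"]["type"] = "command"   # edit after launch

print(instance["definition"]["steps"]["a"]["type"])  # script: unaffected
```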

Every step execution is recorded. Every approval, every PAM checkout, every job dispatched, every result. When you need to answer “what did this workflow do, on which hosts, with whose credentials, and who approved it,” the answer is one query away. Not “let me cross-reference four log streams.” One query.
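"One query away" follows from keeping every event in one flat record stream rather than scattered logs. The field names below are assumptions; the shape of the answer — step, host, credential, approver, result in a single filter — is the point.

```python
# Illustrative flat audit log: one record per event, one filter per
# question. Field names are assumed for the example.

AUDIT = [
    {"run": "r-100", "step": "checkout", "host": None,
     "credential": "svc_patch", "approved_by": "alice", "result": "succeeded"},
    {"run": "r-100", "step": "patch", "host": "web-01",
     "credential": "svc_patch", "approved_by": None, "result": "succeeded"},
    {"run": "r-100", "step": "patch", "host": "db-01",
     "credential": "svc_patch", "approved_by": None, "result": "failed"},
]

def audit_run(run_id):
    """Everything a run did, in order: the 2 AM query."""
    return [e for e in AUDIT if e["run"] == run_id]

for event in audit_run("r-100"):
    print(event["step"], event["host"], event["result"])
```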

The Community Marketplace

We are all running the same operations against the same operating systems. We are all writing the same disk-cleanup script and the same patch validation workflow. We are all reinventing the same wheel.

The marketplace is where that stops. Publish a workflow, a script, or a runbook. Tag it. Version it. Other Cadres customers can browse, install, and rate it. Trust levels — unverified, community, verified, official — make it clear what you are pulling in. Channels (stable, candidate, pinned) let you control whether you want the latest release or a specific version. Every install reads from an immutable snapshot of the artifact at publish time, so a publisher cannot alter a workflow under your feet after the fact.
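Channel resolution over immutable snapshots can be sketched as a simple lookup. The version records and resolution rule below are assumptions for illustration; the channel names (stable, candidate, pinned) are the ones described above.

```python
# Illustrative channel-based resolution over immutable published
# snapshots. Version records and the "latest in channel" rule are assumed.

VERSIONS = [
    {"version": "1.0.0", "channel": "stable"},
    {"version": "1.1.0", "channel": "stable"},
    {"version": "1.2.0-rc1", "channel": "candidate"},
]

def resolve(channel, pinned=None):
    """Pick which immutable snapshot an install should read."""
    if channel == "pinned":
        return next(v for v in VERSIONS if v["version"] == pinned)
    matching = [v for v in VERSIONS if v["channel"] == channel]
    return matching[-1]   # latest published in that channel

print(resolve("stable")["version"])                  # 1.1.0
print(resolve("candidate")["version"])               # 1.2.0-rc1
print(resolve("pinned", pinned="1.0.0")["version"])  # 1.0.0
```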

Have a workflow that handles a gnarly edge case nobody else has solved? Publish it. Need to solve a problem you have never tackled? Look for someone who already has. We can keep doing the same work in isolation, or we can level each other up.

The workflow engine is where the human-by-exception model actually lives. Any time something in the platform needs more than a single job — multiple steps, multiple hosts, an approval, a credential checkout, a branch, a wait — it runs through the engine. That is the difference between a tool that automates a task and a platform that automates an operation.

If running ten servers' worth of manual SOPs sounds like a familiar problem, start a trial.