AAP Automation Orchestrator: Automated Remediation Execution Explained

By Luca Berton · Published 2024-01-01 · Category: installation

How AAP Automation Orchestrator executes Step 5, deterministic remediation at scale, with rolling batches, health checks, and full auditability.

Why Step 5 Is the Point of the Whole Pipeline

Everything in Red Hat's upcoming Automation Orchestrator (alert ingestion, Event-Driven Ansible rulebooks, an LLM proposing a fix) exists to set up one moment: the machine actually doing the work. That moment is Step 5, automated remediation execution via AAP, and it is where the platform's core design principle gets tested in production: AI isn't improvising against production infrastructure; it's acting through AAP.

This distinction matters more than it sounds. An AI agent can reason about a vulnerability, correlate it against your CMDB, and draft a remediation plan, but none of that reasoning ever touches a live host directly. Everything downstream of the human approval gate runs as a standard, deterministic Ansible Automation Platform job: the same execution engine, the same credentials model, the same audit trail you already trust for every other job template in your organization. Automation Orchestrator was announced at Red Hat Tech Day Netherlands 2026 in Bunnik and is built on the upstream Temporal durable-execution engine. It is expected in Q3 2026, specifically to make the hand-off from AI recommendation to governed execution a first-class, single-canvas experience instead of a pile of glue scripts between a chatbot and Ansible.

The Five Steps, Briefly

Automated remediation execution doesn't exist in isolation. It's the payoff of a five-stage pipeline:

1. Alerts from multiple sources: agents, events, and playbooks orchestrated on a single canvas.
2. Events trigger a deterministic automation rulebook: Event-Driven Ansible picks up the alert.
3. AI analyzes and recommends: an LLM plus MCP tools investigate and propose remediation.
4. Humans approve: a governance gate before anything touches production.
5. Automated remediation at scale: deterministic, auditable execution via AAP.
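
The second stage can be sketched as an Event-Driven Ansible rulebook. This is a minimal sketch under stated assumptions: the payload field event.payload.cve, the listener port, and the job template name are all hypothetical, and the article does not describe what the rulebook hands off to inside Automation Orchestrator. Only the ansible.eda.webhook source and the run_job_template action are standard rulebook features.

```yaml
---
- name: Pick up vulnerability alerts from monitoring and ITSM webhooks
  hosts: all

  sources:
    # Webhook listener; host and port are placeholder values.
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 5000

  rules:
    - name: Start triage for the OpenSSH regreSSHion CVE
      # Assumes the sender includes a "cve" field in its JSON body.
      condition: event.payload.cve == "CVE-2024-6387"
      action:
        run_job_template:
          name: "Triage vulnerability alert"   # hypothetical template name
          organization: "Default"
          job_args:
            extra_vars:
              cve_id: "{{ event.payload.cve }}"
```

The rule is deterministic in the sense the pipeline relies on: the same event payload always matches the same condition and triggers the same action.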

Step 5 is where the plan approved in Step 4 becomes a real AAP job launch: inventory-scoped, credential-scoped, and logged like any other automation run.

Anatomy of Step 5: What Actually Executes

In the Tech Day demo, Step 5 remediated CVE-2024-6387 ("regreSSHion"), the critical signal-handler race condition in OpenSSH's sshd. By the time execution began, the earlier steps had already done the investigative work. An IBM Instana webhook and a ServiceNow webhook had both posted to the EDA webhook endpoint, each with an auto-generated API key. A Red Hat AI/Nomotron 120b agent, equipped with Splunk Query, Splunk Alert Search, Splunk Saved Search, and ServiceNow CMDB Lookup as MCP tools, had then queried the AAP inventory, correlated the affected hosts to the correct host group, matched an existing remediation job template, and submitted a plan for approval that included a rollback strategy.

Step 5 simply launches that matched job template against that host group, exactly as approved — nothing improvised, nothing re-negotiated at execution time. In the demo this meant:

12 hosts patched across prod, staging, and dev
A rolling update in 3 batches of 4 hosts, avoiding a big-bang deployment
Zero downtime, with health checks passed at every batch
The originating ServiceNow ticket INC0038291 resolved and closed automatically once the job completed
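
The article does not show how Automation Orchestrator issues the launch internally. As a rough equivalent, the same kind of scoped launch can be sketched from a playbook with the awx.awx.job_launch module (ansible.controller.job_launch in the certified collection). The template name and host group below are hypothetical, and limit and extra_vars only take effect if the job template is set to prompt for them on launch.

```yaml
---
- name: Launch the approved remediation job template (illustrative)
  hosts: localhost
  connection: local
  gather_facts: false

  tasks:
    - name: Launch the matched job template against the approved host group
      awx.awx.job_launch:
        job_template: "Remediate CVE-2024-6387"   # hypothetical template name
        limit: "sshd_affected"                    # hypothetical host group
        extra_vars:
          remediation_target_group: "sshd_affected"
          incident_ticket_id: "INC0038291"
        wait: true   # block until the job reaches a final state
      register: remediation_job

    - name: Show the final job status
      ansible.builtin.debug:
        msg: "Job {{ remediation_job.id }} finished with status {{ remediation_job.status }}"
```

Controller connection settings are expected to come from module parameters or environment variables such as CONTROLLER_HOST. Nothing about the target scope is decided at launch time; the inventory limit and variables are exactly the values carried over from the approved plan.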

A representative playbook behind a job template for this kind of rolling OpenSSH remediation might look like this:

---
# Illustrative playbook for RPM-based hosts; both variables are supplied at launch.
- name: Remediate CVE-2024-6387 (regreSSHion) on affected sshd hosts
  hosts: "{{ remediation_target_group }}"
  become: true
  serial: 4                 # patch four hosts per batch
  max_fail_percentage: 0    # abort the play if any host in a batch fails

  vars:
    # Kept for traceability in the job record; not used by the tasks below.
    servicenow_ticket: "{{ incident_ticket_id }}"

  tasks:
    - name: Update openssh-server to patched version
      ansible.builtin.package:
        name: openssh-server
        state: latest

    - name: Restart sshd to apply the patched binary
      ansible.builtin.systemd:
        name: sshd
        state: restarted

    # Health check: a timeout here fails the host and, with it, the rollout.
    - name: Wait for SSH to come back healthy before next batch
      ansible.builtin.wait_for:
        port: 22
        host: "{{ inventory_hostname }}"
        delay: 5
        timeout: 60

    - name: Confirm patched version is installed
      ansible.builtin.command: rpm -q openssh-server
      register: openssh_version
      changed_when: false   # read-only query, never reports a change

    # debug only writes to the job output, which the platform can read back.
    - name: Report batch health back to Automation Orchestrator
      ansible.builtin.debug:
        msg: "Host {{ inventory_hostname }} healthy on {{ openssh_version.stdout }}"

The serial: 4 directive is what produces the "3 batches of 4" rollout across the 12 hosts. max_fail_percentage: 0 enforces zero tolerance for failed hosts mid-rollout: if any host in a batch fails a task, including the wait_for health check, the play aborts rather than pushing the change to the next batch. That deterministic stop condition is part of what makes execution auditable after the fact.
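
To see the batching without touching real systems, the following sketch builds twelve placeholder hosts in memory and prints which hosts share each batch. The host names and the remediation_demo group are made up for illustration; ansible_play_batch is the standard magic variable that lists the hosts in the current serial batch.

```yaml
---
- name: Build a throwaway 12-host inventory in memory
  hosts: localhost
  gather_facts: false

  tasks:
    - name: Register twelve placeholder hosts
      ansible.builtin.add_host:
        name: "demo-host-{{ '%02d' | format(item) }}"
        groups: remediation_demo
        ansible_connection: local   # no SSH needed for this dry run
      loop: "{{ range(1, 13) | list }}"

- name: Show how serial splits the group into batches
  hosts: remediation_demo
  gather_facts: false
  serial: 4
  max_fail_percentage: 0

  tasks:
    - name: Print the hosts sharing this batch
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} runs alongside {{ ansible_play_batch | join(', ') }}"
```

Running it should show the second play executing three times, once per group of four hosts, which is the same shape as the demo rollout.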

Why the Approval Gate in Step 4 Shapes What Step 5 Is Allowed to Do

Step 5's determinism only means anything because Step 4 constrains it. The Human Review gate exposes four parameters:

Gate parameter	Behavior
Notified usernames	Configurable list of approvers
Approval message	Custom text, e.g. "Please approve this deployment to production"
Timeout	1 day default
On-timeout action	"Fail the workflow" — explicitly no silent auto-approval

That last row is the important one for anyone evaluating this for a regulated environment: if nobody approves in time, the workflow fails closed. Step 5 never fires on a technicality. Execution only ever runs against a plan a human explicitly signed off on, which is what lets Red Hat describe the whole pipeline as governed rather than merely "AI-assisted."
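
The same fail-closed rule can be mirrored inside a playbook as a defence-in-depth guard. This is a sketch only: the gate_result variable and its values are hypothetical, and the article does not describe how, or whether, Automation Orchestrator passes the gate outcome to the job.

```yaml
---
- name: Guard remediation behind an explicit approval (illustrative)
  hosts: localhost
  gather_facts: false

  vars:
    # Hypothetical hand-off from the approval gate; a missing value is
    # treated as a timeout so that the default outcome is the safe one.
    approval_status: "{{ gate_result | default('timed_out') }}"

  tasks:
    - name: Fail closed unless a human explicitly approved the plan
      ansible.builtin.assert:
        that:
          - approval_status == 'approved'
        fail_msg: "No explicit approval recorded (status: {{ approval_status }}); refusing to remediate."
        success_msg: "Plan approved; remediation may proceed."
```

Anything other than an explicit approval, including a missing value, ends the run before any host is touched.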

The Execution Timeline Tells the Real Story

The demo's measured timings make the point better than any slide could:

Phase	Duration
Alert ingestion	0s
ITSM ticket creation	1.2s
Vulnerability analysis	4.8s
Human review	38.4s (manual)
Remediation execution	0.9s
Ticket close	2.1s

Strip out the human review wait, which is inherently variable and, by design, not something the platform should rush, and the automated path from alert to closed ticket adds up to 9.0 seconds (1.2 + 4.8 + 0.9 + 2.1). Remediation execution itself, the part actually touching 12 hosts, took 0.9 seconds to kick off. The bottleneck in this architecture is deliberately the human, not the machine, and that's the correct place for the bottleneck to sit.

Key Takeaways

Step 5 never improvises: it launches a pre-matched, pre-approved AAP job template — the AI's role ends at Step 3's recommendation.
Rolling batches (serial) and fail-percentage thresholds are what turn "automated remediation" into "automated remediation at scale" without risking a fleet-wide outage.
The Step 4 approval gate fails closed on timeout by default, so Step 5 can never execute on an unattended, unapproved plan.
In the CVE-2024-6387 demo, 12 hosts across prod/staging/dev were patched with zero downtime, and the linked ServiceNow ticket INC0038291 closed automatically.
Automated execution time (0.9s) is negligible next to human review time (38.4s) — the platform optimizes the parts that should be fast and preserves human judgment where it matters.
Automation Orchestrator, built on Temporal and expected in Q3 2026, packages this whole flow — alerting, EDA, AI recommendation, approval, and AAP execution — into a single governed canvas rather than separate disconnected tools.
