AAP Automation Orchestrator: Building a Human Review Approval Gate

By Luca Berton · Published 2024-01-01 · Category: troubleshooting

How AAP Automation Orchestrator's Human Review gate stops AI remediation before production, with timeout, escalation, and audit trail design.

Red Hat's Automation Orchestrator, due in Q3 2026, makes one design principle non-negotiable: AI isn't improvising against production infrastructure; it's acting through AAP. That principle was demonstrated live at Red Hat Tech Day Netherlands 2026 in Bunnik, and nowhere is it more visible than in Step 4 of the platform's five-step pipeline: the Human Review governance gate. This is the checkpoint where an AI-generated remediation plan stops being a suggestion and either becomes an authorized production action or gets rejected outright, with no silent auto-approval and no unattended blast radius.

Where Human Review Sits in the Pipeline

Automation Orchestrator is built on the upstream Temporal durable-execution engine, and it unifies task-based and event-based automation on a single governed canvas. The full pipeline runs in five stages:

1. Alerts from multiple sources: agents, events, and playbooks all land on the same canvas
2. Events trigger a deterministic rulebook: Event-Driven Ansible (EDA) picks up the alert
3. AI analyzes and recommends: an LLM plus MCP tools investigate and propose remediation
4. Humans approve: a governance gate before anything touches production
5. Automated remediation at scale: deterministic, auditable execution via AAP

Step 4 is the hinge point of the whole architecture. Everything before it — webhook ingestion, correlation, AI reasoning — is fast and automated. Everything after it is fast and automated too. Step 4 is the one deliberately human, deliberately slow step in the chain, and that's by design.

What the Human Review Gate Actually Configures

In the live demo, the presenters walked through remediating CVE-2024-6387 ("regreSSHion"), a critical race condition in OpenSSH's sshd. An AI agent running on Red Hat AI/Nemotron 120b, equipped with MCP tools for Splunk Query, Splunk Alert Search, Splunk Saved Search, and ServiceNow CMDB Lookup, queried AAP inventory, correlated affected hosts to the right host group, matched an existing remediation job template, and assembled a plan, including a rollback strategy, for approval.

That plan doesn't execute itself. It lands in a Human Review node with four configurable elements:

Setting	Purpose	Demo value
Usernames to notify	Who is authorized to approve	Named on-call operators
Custom message	Context shown to the approver	"Please approve this deployment to production"
Timeout	How long the gate waits for a decision	1 day (default)
On-timeout action	What happens if nobody responds	Fail the workflow

That last row is the one worth dwelling on. Red Hat explicitly chose to make the default on-timeout behavior a hard failure, not a fallback approval. If a workflow times out because nobody reviewed it, the safe outcome is that nothing happens to production — the incident stays open, the ticket stays unresolved, and a human has to look at it. An automation platform that quietly approves itself after a timeout isn't a governance gate; it's a governance gate with a snooze button that eventually disables itself. Automation Orchestrator refuses to build that failure mode in.
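
The source does not show how Automation Orchestrator stores these settings, so the following is only a rough analogue: a sketch of the same gate modeled with the workflow approval node that AAP's automation controller already offers, created through the awx.awx collection. It assumes the collection is installed and that controller credentials are supplied through the environment. The node identifiers and the remediation template name are hypothetical. There is no per-node "usernames to notify" field in this sketch; it assumes approvers are granted through controller RBAC and notified through notification templates.

```yaml
---
# Sketch only: models the gate with today's AAP workflow approval node.
# Automation Orchestrator's own node format is not shown in the source
# and may differ from this.
- name: Define a workflow with an approval gate before remediation
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Create workflow with a human approval node
      awx.awx.workflow_job_template:
        name: regresshion-remediation-workflow
        organization: Platform Ops
        state: present
        workflow_nodes:
          - identifier: human-review            # hypothetical identifier
            unified_job_template:
              type: workflow_approval
              name: Human Review
              # Custom message shown to the approver
              description: Please approve this deployment to production
              # Timeout of 1 day, expressed in seconds
              timeout: 86400
            related:
              # Only an explicit approval leads to the remediation node
              success_nodes:
                - identifier: remediate
              # No failure path: a denial or timeout runs nothing further
              failure_nodes: []
              always_nodes: []
          - identifier: remediate               # hypothetical identifier
            unified_job_template:
              type: job_template
              name: Patch sshd for CVE-2024-6387   # hypothetical template name
              organization:
                name: Platform Ops
```

The design choice that matters is the empty failure path. Remediation is reachable only through the success branch, so a gate that nobody answers leaves production untouched, which is the fail-closed behavior described above.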

Modeling the Gate in an EDA-Driven Workflow

Even though Automation Orchestrator's canvas is graphical, the underlying execution is still AAP job templates and EDA rulebooks doing the actual work. A simplified rulebook fragment showing how an EDA rule hands off to a workflow with an approval node might look like this:

---
- name: CVE remediation triggered by Instana/ServiceNow webhook
  hosts: all
  sources:
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 5001

  rules:
    - name: Critical sshd CVE detected
      condition: event.payload.cve_id == "CVE-2024-6387"
      action:
        run_workflow_template:
          name: "regresshion-remediation-workflow"
          organization: "Platform Ops"
          job_args:
            extra_vars:
              affected_cve: "{{ event.payload.cve_id }}"
              source_system: "{{ event.payload.source }}"
              ticket_number: "{{ event.payload.incident_id }}"
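
To make the condition concrete, here is a sketch of a webhook body that would match the rule above. The field names come from the rulebook; the source value is hypothetical, and the ticket number is the one reported for the demo run. The ansible.eda.webhook source receives JSON and exposes the parsed body as event.payload; it is shown as YAML here only for readability.

```yaml
# Hypothetical webhook body (sent as JSON, shown as YAML for readability).
# ansible.eda.webhook exposes these keys as event.payload.<key>.
cve_id: CVE-2024-6387        # matches the rule's condition
source: servicenow           # hypothetical value for source_system
incident_id: INC0038291      # becomes ticket_number in extra_vars
```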

The workflow template referenced above is where the Human Review node lives, positioned between the AI recommendation node and the remediation job template. The playbook behind that remediation job template, the piece that only runs once approval is granted, stays a completely ordinary Ansible play:

---
- name: Patch sshd for CVE-2024-6387 in rolling batches
  hosts: "{{ target_host_group }}"
  serial: 4
  become: true
  tasks:
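    # approval_username and approval_timestamp are assumed to be passed in
    # as extra vars by the workflow; this play does not define them itself.
    - name: Ensure approval reference is recorded for audit
      ansible.builtin.debug:
        msg: "Approved by {{ approval_username }} at {{ approval_timestamp }} for ticket {{ ticket_number }}"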

    - name: Update openssh-server to patched version
      ansible.builtin.package:
        name: openssh-server
        state: latest

    - name: Restart sshd service
      ansible.builtin.systemd:
        name: sshd
        state: restarted

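    # sshd speaks the SSH protocol, not HTTP, so probe the TCP port and
    # wait for the OpenSSH banner instead of issuing an HTTP GET.
    - name: Run post-patch health check
      ansible.builtin.wait_for:
        host: "{{ ansible_host | default(inventory_hostname) }}"
        port: 22
        search_regex: OpenSSH
        timeout: 15
      delegate_to: localhost
      become: false
      register: health_check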

Note the serial: 4 directive — it mirrors the demo's real execution pattern: 12 hosts across prod, staging, and dev, patched in three batches of four, with zero downtime and every health check passing before the next batch proceeded.
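
The batching behavior can be seen in isolation with a small play. This is a minimal sketch, not the demo's playbook: it assumes the same target_host_group variable and adds max_fail_percentage: 0, which is my addition, so that any failed host in a batch aborts the run before the next batch starts.

```yaml
---
# Sketch: shows how serial splits the target group into batches.
# With 12 hosts and serial: 4, the play runs three times, once per batch.
- name: Show rolling batches of four
  hosts: "{{ target_host_group }}"
  serial: 4
  # Assumption: abort the whole run if any host in a batch fails,
  # so a failed health check blocks the next batch.
  max_fail_percentage: 0
  gather_facts: false
  tasks:
    - name: Report the hosts in the current batch
      ansible.builtin.debug:
        msg: "Batch members: {{ ansible_play_batch | join(', ') }}"
      run_once: true
```

Because run_once applies per batch when serial is set, the debug task prints one line for each batch rather than one line for the whole run.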

Why the Timing Numbers Matter

The demo's execution timeline makes the case for a human gate better than any policy document could. Alert ingestion took 0 seconds, ITSM ticket creation 1.2 seconds, vulnerability analysis 4.8 seconds, remediation execution 0.9 seconds, and ticket close 2.1 seconds, which adds up to 9.0 seconds of automated processing. The human review step took 38.4 seconds: by far the longest phase of the run, more than four times all the automated phases combined, and the only one that wasn't automated.

That asymmetry is the point. Automation Orchestrator can investigate, correlate, and prepare a fix in single-digit seconds; it deliberately will not act on production infrastructure in that same window. The 38.4-second pause is the cost of keeping a human accountable for the final decision, and it's a cost the architecture treats as a feature rather than a bottleneck.

Key Takeaways

Step 4, Human Review, is a mandatory governance gate between AI-generated recommendations and AAP-executed remediation — AI proposes, humans dispose.
Four settings define the gate: notified usernames, a custom approval message, a timeout (1 day by default), and an on-timeout action that defaults to failing the workflow, not auto-approving it.
In the CVE-2024-6387 demo, human review took 38.4 seconds against under 10 seconds of combined automated processing — proof the gate is where accountability, not speed, is optimized.
The underlying execution is still ordinary AAP job templates and EDA rulebooks, so approval nodes can sit inside workflow templates you already understand.
Result of the demo run: 12 hosts patched in 3 rolling batches of 4, zero downtime, all health checks passed, and ServiceNow ticket INC0038291 closed — with a human decision as the only non-deterministic step in the chain.
