ServiceNow, OPA, and Ansible: Policy-Gated Agentic Remediation Explained

By Luca Berton · Published 2024-01-01 · Category: troubleshooting

How OPA policy gates and ServiceNow ITSM keep agentic CVE remediation with Ansible Automation Platform safe, auditable, and enterprise-ready.

Autonomous remediation sounds risky until you look at what actually gates it. At Red Hat Tech Day Netherlands 2026 in Bunnik, Fred van Zwieten and Ismail Dhaoui demonstrated exactly that during their "Ansible Automation Platform 2.7 and Beyond" session: an AI agent detecting a critical CVE, opening a ServiceNow change record, clearing an Open Policy Agent (OPA) review, and patching four production servers — without a human touching a keyboard mid-flight. The headline isn't the AI. It's the governance stack that never lost authority.

This article breaks down the two components that made the demo enterprise-credible rather than a stunt: the OPA policy engine acting as a hard gate, and ServiceNow acting as the system of record. Together they answer the question every platform team asks about agentic AI: who's actually in control?

The scenario: a critical CVE, four servers, zero manual steps

The demo used an agent called OpenClaw, running as a pod inside OpenShift, orchestrating work through Ansible Automation Platform (AAP). The pipeline had five stages:

1. Detect and validate: the agent identifies CVE-2026-31337 (CVSS 9.8, critical) affecting four production servers and cross-checks it against the CVE database.
2. Open the change: a ServiceNow Change Request, CHG0012847, is created with critical priority and a 03:00–05:00 EST maintenance window. PagerDuty notifies app owners, and the four affected servers are attached to the change record.
3. Policy review: OPA evaluates the change against a set of hard conditions.
4. Execute: a rolling patch-and-reboot runs host by host.
5. Close and report: the ticket closes, stakeholders are notified, the CMDB is updated, and a compliance report is filed.

The point of walking through all five stages is that stage 3 is not decorative. Nothing in stage 4 executes unless stage 3 passes.
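
The ordering constraint can be sketched as a single top-level playbook. This is a minimal illustration, not the workflow shown in the session: open_change.yml and close_change.yml are hypothetical file names, and a real AAP deployment would more likely model the stages as workflow nodes with success-only links. It relies on Ansible's default behavior that a play in which every targeted host fails ends the run, so a denial in the policy play means the later plays never start.

```yaml
# remediation-pipeline.yml: hypothetical wrapper, stages 2 to 5 in order
# (stage 1, detection, happens in the agent before any playbook runs)
- name: Stage 2, open the change request
  ansible.builtin.import_playbook: open_change.yml        # hypothetical file

- name: Stage 3, policy review (fails the run on a denial)
  ansible.builtin.import_playbook: opa-gate.yml

- name: Stage 4, rolling patch and reboot
  ansible.builtin.import_playbook: patch_and_reboot.yml

- name: Stage 5, close and report
  ansible.builtin.import_playbook: close_change.yml       # hypothetical file
```

One limitation of this sketch is that a failure in stage 4 also skips stage 5, so a production version would need a separate path for recording failed changes.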

Why OPA is the gate, not a suggestion

Open Policy Agent evaluates structured input against declarative Rego policy and returns an allow/deny decision. In this pipeline, before AAP is permitted to launch patch_and_reboot.yml against a single host, OPA checks:

The ITSM ticket (CHG0012847) exists and is validated.
The maintenance window is confirmed and current.
The playbook being invoked is pre-approved for this specific CVE class.
A rollback plan is present and attached to the change.

If any one of those conditions fails, the pipeline stops. There is no "proceed anyway" path, no agent override, and no way for a plausible-sounding justification from the AI to substitute for a passing policy decision. That's the core design principle worth internalizing: policy as code sits between intent and execution, and it is evaluated by a deterministic engine, not by the same model that proposed the action. Red Hat was explicit on this point during the session — the policy engine, the ITSM system, the CMDB, and the audit trail are the guardrails an enterprise already relies on today. The agent doesn't replace them or get to negotiate with them. It just moves faster inside the lane they define.
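
The session did not show the Rego source, so the following is only a sketch of what a policy enforcing those four conditions might look like, packaged as a ConfigMap because the stack runs on OpenShift. The ConfigMap name, namespace, and file name are assumptions, as is the choice to trust boolean fields supplied in the input. The field names match the example payload in this article, and the package path matches the aap/remediation/allow endpoint used there.

```yaml
# opa-policy-configmap.yml: hypothetical manifest, not taken from the demo
apiVersion: v1
kind: ConfigMap
metadata:
  name: aap-remediation-policy    # assumed name
  namespace: opa                  # assumed namespace
data:
  remediation.rego: |
    package aap.remediation

    import rego.v1

    # Deny unless every condition in the allow rule holds.
    default allow := false

    allow if {
        input.change_request != ""
        input.itsm_ticket_validated == true
        input.playbook_preapproved_for_cve_class == true
        input.rollback_plan_attached == true
        window_is_current
    }

    # The maintenance window must contain the moment of evaluation.
    window_is_current if {
        now := time.now_ns()
        now >= time.parse_rfc3339_ns(input.maintenance_window.start)
        now <= time.parse_rfc3339_ns(input.maintenance_window.end)
    }
```

Because allow defaults to false, a missing field leaves the rule body undefined and the decision stays a denial. A stricter design would have OPA look up the ticket and rollback plan itself rather than accept flags from the caller. How the policy reaches the OPA server (mounted volume, sidecar, or bundle) depends on the deployment.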

The Rego itself is not reproduced here. A simplified, illustrative version of the input AAP would submit to that gate, and of the workflow step that calls it, looks like this:

# policy-input.yml — example payload AAP would submit to OPA before execution
change_request: "CHG0012847"
cve_id: "CVE-2026-31337"
cvss_score: 9.8
maintenance_window:
  start: "2026-06-03T03:00:00-05:00"
  end: "2026-06-03T05:00:00-05:00"
playbook: "patch_and_reboot.yml"
playbook_preapproved_for_cve_class: true
rollback_plan_attached: true
itsm_ticket_validated: true

# opa-gate.yml: illustrative AAP workflow step calling the policy check
- name: Evaluate OPA policy before remediation
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Query OPA decision endpoint
      ansible.builtin.uri:
        url: "https://opa.internal.example.com/v1/data/aap/remediation/allow"
        method: POST
        body_format: json
        # OPA's Data API reads the document to evaluate from the "input" key.
        body:
          input: "{{ lookup('ansible.builtin.file', 'policy-input.yml') | from_yaml }}"
        return_content: true
      register: opa_decision

    # A missing or false "result" counts as a denial, so the gate fails closed.
    - name: Fail the workflow if policy denies remediation
      ansible.builtin.fail:
        msg: "OPA policy check failed: remediation blocked pending manual review."
      when: not (opa_decision.json.result | default(false))

That second task is the entire point of the architecture: a failed policy check is a hard stop, surfaced as a workflow failure in AAP, not a warning the agent can reason its way past.
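
A hard stop is only trustworthy if it is exercised. The sketch below is one hedged way to do that: it submits a deliberately invalid input (the example payload with the rollback plan removed) and asserts that the decision is a denial. It assumes the same endpoint and policy-input.yml file as the example above, and that the deployed policy actually reads the rollback_plan_attached field.

```yaml
# opa-gate-negative-test.yml: hypothetical check that the gate fails closed
- name: Verify the gate denies a change with no rollback plan
  hosts: localhost
  gather_facts: false
  vars:
    good_input: "{{ lookup('ansible.builtin.file', 'policy-input.yml') | from_yaml }}"
    # Same payload, with one required condition switched off.
    bad_input: "{{ good_input | combine({'rollback_plan_attached': false}) }}"
  tasks:
    - name: Submit the invalid input to OPA
      ansible.builtin.uri:
        url: "https://opa.internal.example.com/v1/data/aap/remediation/allow"
        method: POST
        body_format: json
        body:
          input: "{{ bad_input }}"
        return_content: true
      register: negative_check

    - name: The decision must be a denial
      ansible.builtin.assert:
        that:
          - not (negative_check.json.result | default(false))
        fail_msg: "Gate allowed a change without a rollback plan."
        success_msg: "Gate denied the invalid change as expected."
```

Running a check like this on a schedule, or whenever the policy changes, gives early warning if an edit to the Rego accidentally loosens the gate.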

Why ServiceNow is more than a notification target

It's tempting to read the ServiceNow step as "send an alert." In the demo it did four structural jobs:

ServiceNow function	What it enforces
Change Request (CHG0012847)	Creates an auditable record before any host is touched
Attached servers	Scopes exactly which four production hosts are in play — no silent scope creep
Maintenance window (03:00–05:00 EST)	Gives OPA a concrete time boundary to validate against
PagerDuty notification to app owners	Ensures humans are aware before automated execution begins

By the time the pipeline reached closure, the change that had been opened before remediation was the one being closed. The final change record shown, CHG9679226, documents the kernel move from 5.14.0-427.el9 to 5.14.0-503.el9, with the previous kernel retained as a GRUB boot entry for rollback. The CMDB was updated and a compliance report was filed automatically. Nothing about "what happened" lives only in agent memory or chat logs; it lives in ServiceNow, where auditors already know how to look for it.
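
The record-keeping side can be sketched with the servicenow.itsm collection. This is an assumption-laden illustration rather than the integration used in the demo: the change type, the environment variables holding credentials, the window values (derived from the example payload and converted to UTC), and the use of the start_date and end_date fields are all choices made for the example.

```yaml
# change-record-sketch.yml: illustrative, requires the servicenow.itsm collection
- name: Open the change before remediation and close it afterwards
  hosts: localhost
  gather_facts: false
  vars:
    sn_instance:                     # credentials read from the environment (assumed)
      host: "{{ lookup('ansible.builtin.env', 'SN_HOST') }}"
      username: "{{ lookup('ansible.builtin.env', 'SN_USERNAME') }}"
      password: "{{ lookup('ansible.builtin.env', 'SN_PASSWORD') }}"
    window_start: "2026-06-03 08:00:00"   # 03:00 at UTC-5, assuming the instance expects UTC
    window_end: "2026-06-03 10:00:00"
  tasks:
    - name: Create the change request
      servicenow.itsm.change_request:
        instance: "{{ sn_instance }}"
        type: normal                 # assumed; the session did not state the type
        state: new
        priority: critical
        short_description: "Remediate CVE-2026-31337 on four production servers"
        description: "Rolling kernel patch via patch_and_reboot.yml; previous kernel kept for rollback."
        other:
          start_date: "{{ window_start }}"
          end_date: "{{ window_end }}"
      register: change

    # The policy gate and the remediation itself would run between these tasks.

    - name: Close the record that was opened above
      servicenow.itsm.change_request:
        instance: "{{ sn_instance }}"
        number: "{{ change.record.number }}"
        state: closed
        close_code: successful
        close_notes: "Kernel updated on all four hosts; health checks passed."
```

Many instances enforce a change state model that rejects a direct move from new to closed, so a real playbook would likely step through the intermediate states, and attaching the affected servers to the change would need additional calls not shown here.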

The execution stage: rolling, drained, reversible

Once OPA approved and the change record was live, AAP ran the rolling patch across prod-web-01, prod-web-02, prod-api-01, and finally prod-db-01, one host at a time, with the database node drained before its reboot. Monitoring was silenced for the maintenance window and restored afterward, and each host had to pass its health checks before the rollout moved to the next. The result was a zero-downtime kernel upgrade, with a rollback path to the previous kernel preserved via the retained GRUB entry.
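
The contents of patch_and_reboot.yml were not shown, so the following is only a sketch of how a one-host-at-a-time rollout of this shape is commonly written in Ansible. The inventory group, the drain script, and the health endpoint are hypothetical, and the monitoring silence and restore steps are omitted.

```yaml
# rolling-patch-sketch.yml: illustrative only, not the demo's patch_and_reboot.yml
- name: Rolling kernel patch, one host at a time
  hosts: prod_patch_targets        # hypothetical group listing the four hosts, database last
  serial: 1                        # finish one host completely before starting the next
  order: inventory                 # keep inventory order so prod-db-01 is patched last
  any_errors_fatal: true           # a failure on any host stops the whole rollout
  become: true
  tasks:
    - name: Drain the database node before touching it
      ansible.builtin.command: /usr/local/bin/drain-db-node   # hypothetical script
      when: inventory_hostname == 'prod-db-01'
      changed_when: true

    - name: Install the updated kernel alongside the existing one
      ansible.builtin.dnf:
        name: kernel
        state: latest

    - name: Reboot into the new kernel
      ansible.builtin.reboot:
        reboot_timeout: 900

    - name: Wait for the host to report healthy before moving on
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/healthz"   # hypothetical endpoint
        status_code: 200
      delegate_to: localhost
      become: false
      register: health
      retries: 10
      delay: 15
      until: health.status == 200
```

The combination of serial: 1 and any_errors_fatal: true is what keeps a bad patch from spreading: if the first web host never becomes healthy, the remaining three are never touched.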

What the agent actually did — and didn't do

This is the detail worth repeating to any skeptical stakeholder: the agent, named sre-sally and running on Claude Sonnet 4.6 via litellm, did not write the remediation. Its visible tool calls in the OpenClaw control panel — memory_search, runtime_search, runtime_exec — show it searching context, checking prior runs, and executing tools. The actual fix was a pre-approved playbook exposed through the AAP MCP server, selected by the agent from an existing, human-reviewed catalog. The agent's job was orchestration and speed: detect, file, wait on policy, execute the known-good playbook, report. Every decision point that mattered — is this ticket valid, is this the approved fix, is there a way back out — was answered by systems that predate the agent entirely.
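
The difference between selecting and authoring is visible in what a caller is allowed to send. The session routed this through the AAP MCP server, whose interface is not reproduced here; the sketch below instead uses the ansible.controller collection's job_launch module as a stand-in, to show that launching a catalog entry means naming an existing job template and supplying parameters, not supplying tasks. The template name is hypothetical, and connection settings are assumed to come from the environment or an attached credential.

```yaml
# launch-approved-template.yml: stand-in illustration, not the MCP call from the demo
- name: Launch a pre-approved remediation from the catalog
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Start the existing job template and wait for the result
      ansible.controller.job_launch:
        job_template: "patch_and_reboot"      # hypothetical template name
        limit: "prod-web-01,prod-web-02,prod-api-01,prod-db-01"
        extra_vars:
          change_request: "CHG0012847"
          cve_id: "CVE-2026-31337"
        wait: true
      register: remediation_job
```

The limit and extra_vars values are only honored if the template is configured to prompt for them on launch, which gives the platform team a further control over how much the caller can vary.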

Key Takeaways

OPA is a hard gate, not a lint check. Execution is blocked, not warned, when ITSM validation, maintenance window, playbook approval, or rollback plan is missing.
ServiceNow anchors the audit trail. The Change Request created before remediation is the same record closed after it, giving auditors a single source of truth.
The agent selects, it doesn't author. Remediation ran through a pre-approved playbook (patch_and_reboot.yml) via the AAP MCP server — the AI didn't improvise a fix.
Rollback is designed in, not bolted on. Retaining the prior kernel as a GRUB entry meant the "undo" path existed before the "do" path ran.
Speed doesn't require lowering guardrails. CVE-2026-31337 went from detection to closed, compliant change record without skipping ITSM, policy, or CMDB steps — automation made the existing process faster, not thinner.
