About Luca Berton

Luca Berton is an Ansible automation expert, author of 8 Ansible books published by Apress and Leanpub including "Ansible for VMware by Examples" and "Ansible for Kubernetes by Example", and creator of the Ansible Pilot YouTube channel. He shares practical automation knowledge through tutorials, books, and video courses to help IT professionals and DevOps engineers master infrastructure automation.

AAP 2.6 Monitoring and Logging: Prometheus, Grafana, and Log Aggregation

By Luca Berton · Published 2024-01-01 · Category: troubleshooting

Monitor AAP 2.6 with Prometheus metrics, Grafana dashboards, and centralized logging. Track job performance, mesh health, capacity planning, and alerting.

Why Monitor AAP?

A healthy automation platform requires observability. Without monitoring, you discover problems only when jobs fail or users complain. AAP 2.6 exposes metrics and logs that let you:

• Track job success rates: identify failing automation before users notice
• Monitor capacity: know when to add execution nodes
• Detect performance degradation: catch slow jobs and bottlenecks
• Audit automation activity: who ran what, when, and on which hosts
• Plan capacity: forecast growth based on usage trends

See also: Ansible Monitoring and Observability: Prometheus, Grafana, and ELK Stack Integration

AAP Metrics Endpoint

Automation Controller exposes a Prometheus-compatible metrics endpoint:

GET /api/controller/v2/metrics/

Enable Metrics

Metrics are enabled by default in AAP 2.6. Ensure the endpoint is accessible:

# Test metrics endpoint
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/metrics/" | head -20

Key Metrics

| Metric | Type | Description |
|--------|------|-------------|
| awx_running_jobs_total | Gauge | Currently running jobs |
| awx_pending_jobs_total | Gauge | Jobs waiting in queue |
| awx_status_total | Counter | Jobs by status (successful, failed, error, canceled) |
| awx_instance_capacity | Gauge | Total capacity per instance |
| awx_instance_consumed_capacity | Gauge | Used capacity per instance |
| awx_instance_remaining_capacity | Gauge | Available capacity per instance |
| awx_instance_info | Info | Instance metadata (hostname, type, version) |
| awx_database_connections_total | Gauge | Database connection count |
| awx_system_memory_total | Gauge | Total system memory |
| awx_system_memory_used | Gauge | Used system memory |
| awx_license_instance_total | Gauge | Licensed instance count |
| awx_license_instance_free | Gauge | Available licensed instances |
| awx_organizations_total | Gauge | Number of organizations |
| awx_users_total | Gauge | Number of users |
| awx_inventories_total | Gauge | Number of inventories |
| awx_projects_total | Gauge | Number of projects |
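
The endpoint returns these metrics in the Prometheus text exposition format. As a rough illustration of what a scraper sees, the sketch below parses a small, hypothetical sample of that output into name, labels, and value. The sample values are invented, and the parser is deliberately simple (it does not handle escaped quotes inside label values).

```python
import re

# Matches sample lines such as: awx_status_total{status="failed"} 3.0
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = SAMPLE_LINE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(LABEL.findall(raw_labels or ""))
            samples.append((name, labels, float(value)))
    return samples


# Hypothetical endpoint output, for illustration only
SAMPLE = """
# HELP awx_pending_jobs_total Number of pending jobs
# TYPE awx_pending_jobs_total gauge
awx_pending_jobs_total 4.0
awx_status_total{status="failed"} 3.0
awx_status_total{status="successful"} 57.0
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

In practice Prometheus does this parsing for you; the sketch is only meant to make the format of the endpoint's response concrete.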

Prometheus Configuration

Scrape Configuration

# prometheus.yml
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'aap-controller'
    scheme: https
    tls_config:
      insecure_skip_verify: true  # Or provide CA cert
    bearer_token: '<controller-api-token>'
    static_configs:
      - targets:
          - 'gateway.example.org'
    metrics_path: '/api/controller/v2/metrics/'
    scrape_interval: 60s

  - job_name: 'aap-nodes'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token: '<controller-api-token>'
    static_configs:
      - targets:
          - 'controller1.example.org'
          - 'controller2.example.org'
    metrics_path: '/api/controller/v2/metrics/'

Service Discovery for OpenShift

scrape_configs:
  - job_name: 'aap-openshift'
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
            - ansible-automation-platform
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: automation-controller
        action: keep

See also: Ansible Monitoring: Integrate with Prometheus, Grafana & Alerting (Complete Guide)

Grafana Dashboards

Dashboard: AAP Job Overview

{
  "panels": [
    {
      "title": "Running Jobs",
      "type": "stat",
      "targets": [
        {
          "expr": "awx_running_jobs_total",
          "legendFormat": "Running"
        }
      ]
    },
    {
      "title": "Pending Jobs",
      "type": "stat",
      "targets": [
        {
          "expr": "awx_pending_jobs_total",
          "legendFormat": "Pending"
        }
      ],
      "thresholds": {
        "steps": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 5},
          {"color": "red", "value": 20}
        ]
      }
    },
    {
      "title": "Job Success Rate (24h)",
      "type": "gauge",
      "targets": [
        {
          "expr": "sum(increase(awx_status_total{status='successful'}[24h])) / sum(increase(awx_status_total[24h])) * 100"
        }
      ],
      "thresholds": {
        "steps": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 90},
          {"color": "green", "value": 95}
        ]
      }
    },
    {
      "title": "Jobs by Status (24h)",
      "type": "piechart",
      "targets": [
        {
          "expr": "increase(awx_status_total[24h])",
          "legendFormat": "{{status}}"
        }
      ]
    }
  ]
}
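
To make the threshold steps above concrete: Grafana colors a panel with the highest step whose value is less than or equal to the current reading. The sketch below mirrors that rule and the success-rate expression in plain Python. The function names and the job counts are hypothetical and exist only for illustration.

```python
def threshold_color(steps, value):
    """Return the color of the highest step whose value is <= the reading."""
    color = None
    for step in sorted(steps, key=lambda s: s["value"]):
        if value >= step["value"]:
            color = step["color"]
    return color


def success_rate(status_increase):
    """Percentage of successful jobs, given per-status job counts for a window.

    Returns None when there were no jobs (the PromQL expression yields NaN).
    """
    total = sum(status_increase.values())
    if total == 0:
        return None
    return 100 * status_increase.get("successful", 0) / total


# Steps copied from the "Job Success Rate (24h)" panel
SUCCESS_RATE_STEPS = [
    {"color": "red", "value": 0},
    {"color": "yellow", "value": 90},
    {"color": "green", "value": 95},
]

# Hypothetical 24h job counts per status
counts = {"successful": 46, "failed": 3, "canceled": 1}
rate = success_rate(counts)
print(rate, threshold_color(SUCCESS_RATE_STEPS, rate))
```

With these counts, 46 of 50 jobs succeeded, which lands in the yellow band between 90 and 95.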

Dashboard: Capacity Planning

# PromQL queries for capacity dashboard

# Capacity utilization per node
awx_instance_consumed_capacity / awx_instance_capacity * 100

# Total platform capacity
sum(awx_instance_capacity{node_type="execution"})

# Capacity trend (predict when you'll run out)
predict_linear(awx_instance_consumed_capacity[7d], 86400 * 30)

# Queue depth trend (deriv, because awx_pending_jobs_total is a gauge)
deriv(awx_pending_jobs_total[1h])
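
The capacity trend query relies on predict_linear, which fits a straight line through the samples in the range and extrapolates it forward. The following is a minimal sketch of that idea using ordinary least squares. It is a simplification: it extrapolates from the last sample rather than from the query evaluation time, and the daily readings are hypothetical.

```python
def utilization_percent(consumed, capacity):
    """Capacity utilization as a percentage, as in the first query."""
    return consumed / capacity * 100


def predict_linear(samples, seconds_ahead):
    """Least-squares extrapolation over (timestamp_seconds, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    target = samples[-1][0] + seconds_ahead
    return slope * target + intercept


# Hypothetical: one reading per day for 7 days, consumed capacity growing by 2 per day
DAY = 86400
week = [(day * DAY, 40 + 2 * day) for day in range(7)]

forecast = predict_linear(week, 30 * DAY)
print(round(forecast, 1), round(utilization_percent(forecast, 100), 1))
```

If the forecast exceeds the node's total capacity, as it does here for a capacity of 100, the trend suggests adding execution nodes within the forecast window.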

Dashboard: Mesh Health

# Node availability
up{job="aap-nodes"}

# Instance errors
awx_instance_info{errors!=""}

# Capacity per node
awx_instance_remaining_capacity

# Database connections
awx_database_connections_total

Alerting Rules

Prometheus Alert Rules

# aap-alerts.yml
groups:
  - name: aap-critical
    rules:
      - alert: AAPJobQueueBacklog
        expr: awx_pending_jobs_total > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "AAP job queue backlog: {{ $value }} pending jobs"
          description: "More than 20 jobs pending for 10+ minutes. Consider adding execution nodes."

      - alert: AAPHighFailureRate
        expr: |
          sum(rate(awx_status_total{status="failed"}[1h]))
          /
          sum(rate(awx_status_total[1h])) > 0.1
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "AAP job failure rate above 10%"
          description: "{{ $value | humanizePercentage }} of jobs failing in the last hour."

      - alert: AAPNodeDown
        expr: up{job="aap-nodes"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "AAP node {{ $labels.instance }} is down"

      - alert: AAPCapacityLow
        expr: |
          sum(awx_instance_remaining_capacity{node_type="execution"})
          /
          sum(awx_instance_capacity{node_type="execution"}) < 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "AAP execution capacity below 20%"
          description: "Only {{ $value | humanizePercentage }} capacity remaining."

      - alert: AAPDatabaseConnectionsHigh
        expr: awx_database_connections_total > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "AAP database connections high: {{ $value }}"

      - alert: AAPNoJobsRunning
        # Pending count is on the left of "and" so that $value reports pending jobs
        expr: awx_pending_jobs_total > 0 and awx_running_jobs_total == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "AAP has pending jobs but nothing is running"
          description: "{{ $value }} pending jobs with 0 running. Check execution nodes."
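
Each rule above uses a for: clause, which means the alert fires only after its expression has been true on every evaluation for at least that long; a single false evaluation resets the timer. The sketch below simulates that behavior for a sequence of evaluation results. The function name and the short intervals are illustrative assumptions, not Prometheus code.

```python
def alert_states(condition_results, eval_interval_s, for_s):
    """Return 'inactive', 'pending', or 'firing' for each rule evaluation."""
    states = []
    active_since = None
    for index, condition_true in enumerate(condition_results):
        now = index * eval_interval_s
        if not condition_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Hypothetical run: evaluate every 60s with "for: 2m"
results = [True, True, False, True, True, True]
for state in alert_states(results, eval_interval_s=60, for_s=120):
    print(state)
```

With the 30s evaluation interval from the scrape configuration, a for: 10m rule therefore needs the condition to hold across 21 consecutive evaluations before it fires.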

See also: Integrate Automation Controller, Prometheus, and Grafana to IT Monitor Realtime

Centralized Logging

Controller Activity Stream

Automation Controller records user and system changes in the Activity Stream, which you can query through the API:

# Recent activity
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/activity_stream/?order_by=-timestamp&page_size=10" | \
  jq '.results[] | {timestamp: .timestamp, operation: .operation, actor: .summary_fields.actor.username, object: .object1}'

Log Forwarding to External Systems

Configure Controller to forward logs to a log aggregator:

# Via Automation Controller settings API
- name: Configure external logging
  ansible.builtin.uri:
    url: "https://gateway.example.org/api/controller/v2/settings/logging/"
    method: PATCH
    headers:
      Authorization: "Bearer {{ token }}"
      Content-Type: "application/json"
    body_format: json
    body:
      LOG_AGGREGATOR_HOST: "splunk.example.org"
      LOG_AGGREGATOR_PORT: 8088
      LOG_AGGREGATOR_TYPE: "splunk"
      LOG_AGGREGATOR_PROTOCOL: "https"
      LOG_AGGREGATOR_PASSWORD: "{{ splunk_hec_token }}"  # The password/token field carries the Splunk HEC token
      LOG_AGGREGATOR_ENABLED: true
      LOG_AGGREGATOR_INDIVIDUAL_FACTS: false
    validate_certs: false

Supported Log Aggregators

| Aggregator | Type Value | Protocol |
|-----------|-----------|----------|
| Splunk | splunk | HTTPS (HEC) |
| Elastic/Logstash | logstash | TCP/UDP |
| Loggly | loggly | HTTPS |
| Sumo Logic | sumologic | HTTPS |
| Other | other | TCP/UDP syslog |
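
Because a typo in the type or protocol value is easy to make, it can help to validate the settings before sending the PATCH request. The helper below is a hypothetical sketch: the type values come from the table above, while the lowercase protocol strings (https, tcp, udp) are an assumption based on the HTTPS and TCP/UDP entries.

```python
# Type values from the table above
SUPPORTED_TYPES = {"splunk", "logstash", "loggly", "sumologic", "other"}
# Assumed protocol strings, derived from the table's Protocol column
SUPPORTED_PROTOCOLS = {"https", "tcp", "udp"}


def build_logging_settings(host, port, aggregator_type, protocol):
    """Return a settings body for the logging endpoint, or raise ValueError."""
    if aggregator_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported aggregator type: {aggregator_type!r}")
    if protocol not in SUPPORTED_PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return {
        "LOG_AGGREGATOR_HOST": host,
        "LOG_AGGREGATOR_PORT": port,
        "LOG_AGGREGATOR_TYPE": aggregator_type,
        "LOG_AGGREGATOR_PROTOCOL": protocol,
        "LOG_AGGREGATOR_ENABLED": True,
    }


settings = build_logging_settings("logstash.example.org", 5044, "logstash", "tcp")
print(settings)
```

Credentials are left out on purpose; in a playbook they would come from a vault or credential store rather than from code.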

What Gets Logged

| Log Category | Contents |
|-------------|----------|
| Job events | Task start/finish, stdout, playbook output |
| Activity stream | User actions, CRUD operations |
| System tracking | Fact gathering, inventory changes |
| Performance | Job duration, fork count, host count |

ELK Stack Configuration

# Logstash input for AAP
input {
  http {
    port => 5044
    codec => json
  }
}

filter {
  if [cluster_host_id] {
    mutate {
      add_field => { "source" => "aap-controller" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["https://elasticsearch.example.org:9200"]
    index => "aap-logs-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "${ES_PASSWORD}"
  }
}

API-Based Monitoring

Job Health Monitoring Script

#!/usr/bin/env python3
"""Monitor AAP job health via API"""
import json
from datetime import datetime, timedelta, timezone

import requests

GATEWAY_URL = "https://gateway.example.org"
TOKEN = "your-api-token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Get jobs that finished in the last hour (first page only, up to 100 jobs)
one_hour_ago = (datetime.now(timezone.utc) - timedelta(hours=1)).strftime("%Y-%m-%dT%H:%M:%S")
response = requests.get(
    f"{GATEWAY_URL}/api/controller/v2/jobs/",
    headers=HEADERS,
    params={"finished__gte": one_hour_ago, "page_size": 100},
    verify=False,  # Skips TLS verification; pass a CA bundle path in production
    timeout=30,
)
response.raise_for_status()  # Stop on HTTP errors such as 401 instead of parsing an error body

# Count jobs per status
jobs = response.json()["results"]
status_counts = {}
for job in jobs:
    status = job["status"]
    status_counts[status] = status_counts.get(status, 0) + 1

# Only the "failed" status counts against the success rate; "error" and "canceled" do not
total = len(jobs)
failed = status_counts.get("failed", 0)
success_rate = ((total - failed) / total * 100) if total > 0 else 100

print(f"Jobs (last 1h): {total}")
print(f"Status: {json.dumps(status_counts, indent=2)}")
print(f"Success rate: {success_rate:.1f}%")

if success_rate < 90:
    print("WARNING: Success rate below 90%!")
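
The script above requests a single page of up to 100 jobs, so busier platforms would be undercounted. Controller list responses include a next field pointing at the following page, which makes it possible to walk all pages. The sketch below shows that loop with a fake in-memory API so it runs without a server; iter_results, FAKE_PAGES, and the page contents are hypothetical.

```python
def iter_results(fetch_page, first_url):
    """Yield every item across pages.

    fetch_page(url) must return the decoded JSON body of a list response.
    """
    url = first_url
    while url:
        page = fetch_page(url)
        yield from page["results"]
        url = page.get("next")  # None on the last page


# Fake two-page API response, for illustration only
FAKE_PAGES = {
    "/api/controller/v2/jobs/?page=1": {
        "results": [{"id": 1, "status": "successful"}, {"id": 2, "status": "failed"}],
        "next": "/api/controller/v2/jobs/?page=2",
    },
    "/api/controller/v2/jobs/?page=2": {
        "results": [{"id": 3, "status": "successful"}],
        "next": None,
    },
}

jobs = list(iter_results(FAKE_PAGES.__getitem__, "/api/controller/v2/jobs/?page=1"))
print(len(jobs), [job["status"] for job in jobs])
```

Against a real platform, fetch_page would wrap requests.get and prefix the gateway URL, since the next value is a path rather than a full URL.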

Best Practices

• Set up alerting before you need it: don't wait for the first outage
• Monitor the queue: pending jobs > 0 for extended periods means capacity issues
• Track job duration trends: gradually increasing duration indicates performance problems
• Monitor database size: configure retention policies to prevent unbounded growth
• Use Grafana annotations: mark deployments, upgrades, and incidents on dashboards
• Test alerts regularly: ensure notification channels are working

FAQ

Does AAP support OpenTelemetry?

AAP 2.6 primarily exposes Prometheus metrics. OpenTelemetry integration is not built-in, but you can use the OpenTelemetry Collector as a bridge between AAP's Prometheus endpoint and OpenTelemetry-compatible backends.

Can I monitor individual playbook task performance?

Yes. The job events API (/api/controller/v2/jobs/{id}/job_events/) provides task-level detail, including duration. Timing callback plugins such as ansible.posix.profile_tasks can also add per-task timing to the job output.
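
As a rough sketch of how job events could be turned into a "slowest tasks" report, the function below sums durations per task name. It assumes each event carries a task field and an event_data.duration value in seconds; check the events returned by your own platform before relying on those field names. The sample events are invented.

```python
from collections import defaultdict


def slowest_tasks(events, top=3):
    """Return (task_name, total_seconds) pairs, slowest first."""
    totals = defaultdict(float)
    for event in events:
        task = event.get("task")
        duration = event.get("event_data", {}).get("duration")
        if task and duration is not None:  # skip events without timing data
            totals[task] += duration
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Hypothetical events: one entry per host that ran the task
SAMPLE_EVENTS = [
    {"task": "Install packages", "event_data": {"duration": 12.0}},
    {"task": "Install packages", "event_data": {"duration": 10.5}},
    {"task": "Gathering Facts", "event_data": {"duration": 2.5}},
    {"task": "Restart service", "event_data": {"duration": 1.5}},
    {"task": "", "event_data": {}},  # e.g. a playbook-level event
]

for task, seconds in slowest_tasks(SAMPLE_EVENTS, top=2):
    print(f"{task}: {seconds:.1f}s")
```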

How much storage does log aggregation need?

It depends on job volume and verbosity. A deployment running 1,000 jobs/day at normal verbosity generates roughly 1-5 GB of logs daily. Higher verbosity (level 3+) increases this 5-10x.
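
To turn those figures into a sizing estimate, multiply the daily volume by the number of days you keep logs. The sketch below does that arithmetic. The 90-day retention and the function name are assumptions for illustration; the per-job volume and verbosity multiplier are the rough ranges quoted above.

```python
def estimate_log_storage_gb(jobs_per_day, gb_per_1000_jobs, retention_days,
                            verbosity_factor=1.0):
    """Rough storage estimate in GB for the given retention window."""
    daily_gb = jobs_per_day / 1000 * gb_per_1000_jobs * verbosity_factor
    return daily_gb * retention_days


# 1,000 jobs/day, 90 days retained (assumed), normal verbosity
low = estimate_log_storage_gb(1000, 1, 90)
high = estimate_log_storage_gb(1000, 5, 90)
print(f"Normal verbosity: {low:.0f}-{high:.0f} GB")

# Same workload at high verbosity, using the upper 10x multiplier
print(f"High verbosity (worst case): {estimate_log_storage_gb(1000, 5, 90, 10):.0f} GB")
```

The estimate ignores index overhead and compression in the log backend, so treat it as a starting point for capacity discussions rather than a precise figure.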

Can I monitor EDA Controller separately?

EDA metrics are available through the Platform Gateway. EDA also logs event processing, rulebook activation status, and action execution to the standard log aggregation pipeline.

What retention should I set for job history?

Default is to keep all job history. For high-volume environments, configure cleanup jobs via Administration → Management Jobs → Cleanup Activity Stream and Cleanup Job Details to retain 90-180 days of history.

Conclusion

Monitoring and logging transform AAP from a black box into an observable, auditable platform. Start with Prometheus metrics and Grafana dashboards for real-time visibility, add alerting for proactive issue detection, and configure log aggregation for compliance and troubleshooting. The investment in observability pays back every time you catch an issue before your users do.

Related Articles

• AAP 2.6 Architecture and Components: Complete Guide
• AAP 2.6 Automation Mesh: Distributed Execution Across Sites and Networks
• AAP 2.6 Backup, Restore, and Disaster Recovery Guide
• AAP 2.6 Automation Dashboard Guide
• AAP 2.6 Security Best Practices
