AAP 2.6 Monitoring and Logging: Prometheus, Grafana, and Log Aggregation

By Luca Berton · Published 2024-01-01 · Category: troubleshooting

Monitor AAP 2.6 with Prometheus metrics, Grafana dashboards, and centralized logging. Track job performance, mesh health, capacity planning, and alerting.

Why Monitor AAP?

A healthy automation platform requires observability. Without monitoring, you discover problems only when jobs fail or users complain. AAP 2.6 exposes metrics and logs that let you:

Track job success rates — identify failing automation before users notice
Monitor capacity — know when to add execution nodes
Detect performance degradation — catch slow jobs and bottlenecks
Audit automation activity — who ran what, when, and on which hosts
Plan capacity — forecast growth based on usage trends

AAP Metrics Endpoint

Automation Controller exposes a Prometheus-compatible metrics endpoint:

GET /api/controller/v2/metrics/

Enable Metrics

Metrics are enabled by default in AAP 2.6. Ensure the endpoint is accessible:

# Test metrics endpoint
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/metrics/" | head -20

Key Metrics

Metric	Type	Description
`awx_running_jobs_total`	Gauge	Currently running jobs
`awx_pending_jobs_total`	Gauge	Jobs waiting in queue
`awx_status_total`	Counter	Jobs by status (successful, failed, error, canceled)
`awx_instance_capacity`	Gauge	Total capacity per instance
`awx_instance_consumed_capacity`	Gauge	Used capacity per instance
`awx_instance_remaining_capacity`	Gauge	Available capacity per instance
`awx_instance_info`	Info	Instance metadata (hostname, type, version)
`awx_database_connections_total`	Gauge	Database connection count
`awx_system_memory_total`	Gauge	Total system memory
`awx_system_memory_used`	Gauge	Used system memory
`awx_license_instance_total`	Gauge	Licensed instance count
`awx_license_instance_free`	Gauge	Available licensed instances
`awx_organizations_total`	Gauge	Number of organizations
`awx_users_total`	Gauge	Number of users
`awx_inventories_total`	Gauge	Number of inventories
`awx_projects_total`	Gauge	Number of projects
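
The endpoint returns the Prometheus text exposition format, so the metrics in the table can be read with a few lines of code. The following is a minimal sketch rather than a full parser: `parse_metrics` is a hypothetical helper, the sample values and the `hostname` label are illustrative, and it assumes samples carry no trailing timestamp.

```python
from typing import Dict

# Illustrative sample of the text a metrics endpoint might return
SAMPLE = """\
# HELP awx_running_jobs_total Number of running jobs
# TYPE awx_running_jobs_total gauge
awx_running_jobs_total 3.0
awx_pending_jobs_total 7.0
awx_instance_capacity{hostname="exec1.example.org"} 57.0
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each series (metric name plus label set) to its value.

    Comment and blank lines are skipped. Assumes the value is the last
    whitespace-separated token on the line (no timestamps).
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series.strip()] = float(value)
    return samples


metrics = parse_metrics(SAMPLE)
print(metrics["awx_pending_jobs_total"])
```

For production use, a maintained parser such as the one in the `prometheus_client` package is a safer choice than hand-rolled string handling.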

Prometheus Configuration

Scrape Configuration

# prometheus.yml
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'aap-controller'
    scheme: https
    tls_config:
      insecure_skip_verify: true  # Or provide CA cert
    bearer_token: '<controller-api-token>'
    static_configs:
      - targets:
          - 'gateway.example.org'
    metrics_path: '/api/controller/v2/metrics/'
    scrape_interval: 60s

  - job_name: 'aap-nodes'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token: '<controller-api-token>'
    static_configs:
      - targets:
          - 'controller1.example.org'
          - 'controller2.example.org'
    metrics_path: '/api/controller/v2/metrics/'

Service Discovery for OpenShift

scrape_configs:
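  - job_name: 'aap-openshift'
    # Discovery only: add scheme, metrics_path, and credentials as in the
    # static jobs, otherwise Prometheus defaults to http and /metrics.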
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
            - ansible-automation-platform
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: automation-controller
        action: keep

Grafana Dashboards

Dashboard: AAP Job Overview

{
  "panels": [
    {
      "title": "Running Jobs",
      "type": "stat",
      "targets": [
        {
          "expr": "awx_running_jobs_total",
          "legendFormat": "Running"
        }
      ]
    },
    {
      "title": "Pending Jobs",
      "type": "stat",
      "targets": [
        {
          "expr": "awx_pending_jobs_total",
          "legendFormat": "Pending"
        }
      ],
      "thresholds": {
        "steps": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 5},
          {"color": "red", "value": 20}
        ]
      }
    },
    {
      "title": "Job Success Rate (24h)",
      "type": "gauge",
      "targets": [
        {
          "expr": "sum(increase(awx_status_total{status='successful'}[24h])) / sum(increase(awx_status_total[24h])) * 100"
        }
      ],
      "thresholds": {
        "steps": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 90},
          {"color": "green", "value": 95}
        ]
      }
    },
    {
      "title": "Jobs by Status (24h)",
      "type": "piechart",
      "targets": [
        {
          "expr": "increase(awx_status_total[24h])",
          "legendFormat": "{{status}}"
        }
      ]
    }
  ]
}
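
The "Job Success Rate" gauge divides the increase in successful jobs by the increase across all statuses. A rough sketch of that arithmetic, using two snapshots of the per-status counters, looks like this. `success_rate` is a hypothetical helper, and unlike PromQL's `increase()` it does no extrapolation and simply clamps counter resets to zero.

```python
from typing import Dict, Optional


def success_rate(start: Dict[str, float], end: Dict[str, float]) -> Optional[float]:
    """Percentage of successful jobs between two counter snapshots.

    Returns None when no jobs finished in the window, which avoids the
    division by zero that shows up as "No data" or NaN on a dashboard.
    """
    increases = {
        status: max(end[status] - start.get(status, 0.0), 0.0) for status in end
    }
    total = sum(increases.values())
    if total == 0:
        return None
    return 100 * increases.get("successful", 0.0) / total


# Illustrative counter values 24 hours apart
before = {"successful": 100, "failed": 10, "error": 2, "canceled": 1}
after = {"successful": 190, "failed": 18, "error": 3, "canceled": 2}
print(success_rate(before, after))
```

With these sample values, 90 of the 100 jobs finished in the window were successful, which would put the gauge in the yellow band defined by the thresholds.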

Dashboard: Capacity Planning

# PromQL queries for capacity dashboard

# Capacity utilization per node
awx_instance_consumed_capacity / awx_instance_capacity * 100

# Total platform capacity
sum(awx_instance_capacity{node_type="execution"})

# Capacity trend (predict when you'll run out)
predict_linear(awx_instance_consumed_capacity[7d], 86400 * 30)

# Queue depth trend (deriv, not rate: pending jobs is a gauge)
deriv(awx_pending_jobs_total[1h])
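
To see what the capacity trend query does, it helps to reproduce it by hand: `predict_linear` fits a straight line through the samples in the range and extrapolates it. The sketch below applies ordinary least squares to `(timestamp, value)` pairs. The function name mirrors the PromQL one, but this is an illustrative reimplementation that extrapolates from the last sample, not Prometheus code.

```python
from typing import Sequence, Tuple


def predict_linear(samples: Sequence[Tuple[float, float]], horizon_s: float) -> float:
    """Least-squares line through (timestamp, value) samples, evaluated
    horizon_s seconds after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


DAY = 86400
# Illustrative data: consumed capacity grows by 2 units per day for a week
week = [(day * DAY, 10 + 2 * day) for day in range(7)]
print(predict_linear(week, 30 * DAY))
```

Comparing the predicted value against `awx_instance_capacity` gives a rough idea of when a node would run out of headroom, provided the recent trend holds.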

Dashboard: Mesh Health

# Node availability
up{job="aap-nodes"}

# Instance errors
awx_instance_info{errors!=""}

# Capacity per node
awx_instance_remaining_capacity

# Database connections
awx_database_connections_total

Alerting Rules

Prometheus Alert Rules

# aap-alerts.yml
groups:
  - name: aap-critical
    rules:
      - alert: AAPJobQueueBacklog
        expr: awx_pending_jobs_total > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "AAP job queue backlog: {{ $value }} pending jobs"
          description: "More than 20 jobs pending for 10+ minutes. Consider adding execution nodes."

      - alert: AAPHighFailureRate
        expr: |
          sum(rate(awx_status_total{status="failed"}[1h])) /
          sum(rate(awx_status_total[1h])) > 0.1
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "AAP job failure rate above 10%"
          description: "{{ $value | humanizePercentage }} of jobs failing in the last hour."

      - alert: AAPNodeDown
        expr: up{job="aap-nodes"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "AAP node {{ $labels.instance }} is down"

      - alert: AAPCapacityLow
        expr: |
          sum(awx_instance_remaining_capacity{node_type="execution"}) /
          sum(awx_instance_capacity{node_type="execution"}) < 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "AAP execution capacity below 20%"
          description: "Only {{ $value | humanizePercentage }} capacity remaining."

      - alert: AAPDatabaseConnectionsHigh
        expr: awx_database_connections_total > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "AAP database connections high: {{ $value }}"

      - alert: AAPNoJobsRunning
        # Pending count goes first so that $value in the description is the pending count
        expr: awx_pending_jobs_total > 0 and awx_running_jobs_total == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "AAP has pending jobs but nothing is running"
          description: "{{ $value }} pending jobs with 0 running. Check execution nodes."
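
The `for:` clause in these rules means the expression must stay true across consecutive evaluations for the whole duration before the alert fires, so a single dip below the threshold resets the clock. A simplified simulation of that behavior, using a hypothetical `firing_times` helper and made-up samples, might look like this:

```python
from typing import List, Sequence, Tuple


def firing_times(
    samples: Sequence[Tuple[float, float]], threshold: float, for_seconds: float
) -> List[float]:
    """Return the evaluation timestamps at which the alert would be firing.

    samples are (timestamp, value) pairs in time order. The alert is pending
    while value > threshold and fires once that has held for for_seconds.
    """
    pending_since = None
    firing: List[float] = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            if ts - pending_since >= for_seconds:
                firing.append(ts)
        else:
            pending_since = None  # any false evaluation resets the timer
    return firing


# Pending-job counts sampled every 5 minutes (illustrative)
queue = [(0, 25), (300, 30), (600, 22), (900, 10), (1200, 25), (1500, 26)]
print(firing_times(queue, threshold=20, for_seconds=600))
```

In this sample the backlog alert fires only at the 600-second mark; the second burst has not yet lasted ten minutes by the final sample.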

Centralized Logging

Controller Activity Stream

Automation Controller records user actions and object changes in the Activity Stream, which you can query through the API:

# Recent activity
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/activity_stream/?order_by=-timestamp&page_size=10" | \
  jq '.results[] | {timestamp: .timestamp, operation: .operation, actor: .summary_fields.actor.username, object: .object1}'
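
The same projection can be done in Python when the output feeds a report or another tool. This sketch works on an already-decoded response body. `summarize_activity` is a hypothetical helper, the sample payload is invented, and it assumes some entries may lack an actor.

```python
from typing import Any, Dict, List


def summarize_activity(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten Activity Stream results into who/what/when rows."""
    rows = []
    for entry in payload.get("results", []):
        actor = entry.get("summary_fields", {}).get("actor", {})
        rows.append(
            {
                "timestamp": entry.get("timestamp"),
                "operation": entry.get("operation"),
                "actor": actor.get("username"),  # None when no actor is recorded
                "object": entry.get("object1"),
            }
        )
    return rows


# Invented payload with the fields used in the curl example
sample = {
    "results": [
        {
            "timestamp": "2024-01-01T10:00:00Z",
            "operation": "update",
            "object1": "job_template",
            "summary_fields": {"actor": {"username": "admin"}},
        },
        {
            "timestamp": "2024-01-01T10:05:00Z",
            "operation": "create",
            "object1": "inventory",
            "summary_fields": {},
        },
    ]
}
for row in summarize_activity(sample):
    print(row)
```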

Log Forwarding to External Systems

Configure Controller to forward logs to a log aggregator:

# Via Automation Controller settings API
- name: Configure external logging
  ansible.builtin.uri:
    url: "https://gateway.example.org/api/controller/v2/settings/logging/"
    method: PATCH
    headers:
      Authorization: "Bearer {{ token }}"
      Content-Type: "application/json"
    body_format: json
    body:
      LOG_AGGREGATOR_HOST: "splunk.example.org"
      LOG_AGGREGATOR_PORT: 8088
      LOG_AGGREGATOR_TYPE: "splunk"
      LOG_AGGREGATOR_PROTOCOL: "https"
      LOG_AGGREGATOR_TOKEN: "{{ splunk_hec_token }}"
      LOG_AGGREGATOR_ENABLED: true
      LOG_AGGREGATOR_INDIVIDUAL_FACTS: false
    validate_certs: false

Supported Log Aggregators

Aggregator	Type Value	Protocol
Splunk	`splunk`	HTTPS (HEC)
Elastic/Logstash	`logstash`	TCP/UDP
Loggly	`loggly`	HTTPS
Sumo Logic	`sumologic`	HTTPS
Other	`other`	TCP/UDP syslog
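
When the logging settings are generated by a script, a small check against the type values in the table catches typos before they reach the settings API. This is a sketch under stated assumptions: `build_logging_settings` is a hypothetical helper, and the only validation it performs is on the aggregator type and port.

```python
from typing import Any, Dict, Optional

# Type values taken from the table above
SUPPORTED_TYPES = {"splunk", "logstash", "loggly", "sumologic", "other"}


def build_logging_settings(
    host: str,
    port: int,
    aggregator_type: str,
    protocol: str,
    token: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the body for a PATCH to the logging settings endpoint."""
    if aggregator_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported aggregator type: {aggregator_type!r}")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    settings: Dict[str, Any] = {
        "LOG_AGGREGATOR_HOST": host,
        "LOG_AGGREGATOR_PORT": port,
        "LOG_AGGREGATOR_TYPE": aggregator_type,
        "LOG_AGGREGATOR_PROTOCOL": protocol,
        "LOG_AGGREGATOR_ENABLED": True,
    }
    if token:
        settings["LOG_AGGREGATOR_TOKEN"] = token
    return settings


print(build_logging_settings("splunk.example.org", 8088, "splunk", "https"))
```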

What Gets Logged

Log Category	Contents
Job events	Task start/finish, stdout, playbook output
Activity stream	User actions, CRUD operations
System tracking	Fact gathering, inventory changes
Performance	Job duration, fork count, host count

ELK Stack Configuration

# Logstash input for AAP
input {
  http {
    port => 5044
    codec => json
  }
}

filter {
  if [cluster_host_id] {
    mutate {
      add_field => { "source" => "aap-controller" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["https://elasticsearch.example.org:9200"]
    index => "aap-logs-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "${ES_PASSWORD}"
  }
}
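
To make the pipeline's behavior concrete, here is a rough Python equivalent of the filter and the index naming. It is only an illustration of the logic: `enrich` and `index_name` are hypothetical names, and the sketch assumes the date in the index pattern is taken from the event timestamp in UTC.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def enrich(event: Dict[str, Any]) -> Dict[str, Any]:
    """Tag events that carry cluster_host_id, as the filter block does."""
    out = dict(event)
    if out.get("cluster_host_id") not in (None, False):
        out["source"] = "aap-controller"
    return out


def index_name(ts: datetime) -> str:
    """Daily index name matching aap-logs-%{+YYYY.MM.dd}.

    ts must be timezone-aware; it is converted to UTC first.
    """
    return "aap-logs-" + ts.astimezone(timezone.utc).strftime("%Y.%m.%d")


event = {"cluster_host_id": "controller1.example.org", "message": "job started"}
print(enrich(event)["source"])
print(index_name(datetime(2024, 1, 1, 23, 30, tzinfo=timezone.utc)))
```

Daily indices keep retention simple: dropping old data becomes a matter of deleting whole indices rather than individual documents.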

API-Based Monitoring

Job Health Monitoring Script

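#!/usr/bin/env python3
"""Monitor AAP job health via API"""
import json
import os
from datetime import datetime, timedelta, timezone

import requests

GATEWAY_URL = "https://gateway.example.org"
# Read the token from the environment (AAP_TOKEN is an example variable name)
TOKEN = os.environ.get("AAP_TOKEN", "your-api-token")
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Get jobs finished in the last hour (UTC)
one_hour_ago = (datetime.now(timezone.utc) - timedelta(hours=1)).strftime("%Y-%m-%dT%H:%M:%S")

# Follow pagination so hours with more than 100 jobs are counted in full
jobs = []
page = 1
while True:
    response = requests.get(
        f"{GATEWAY_URL}/api/controller/v2/jobs/",
        headers=HEADERS,
        params={"finished__gte": one_hour_ago, "page_size": 100, "page": page},
        verify=False,  # Lab only: point verify at your CA bundle in production
        timeout=30,
    )
    response.raise_for_status()  # Fail loudly on 401/403/5xx instead of a KeyError below
    data = response.json()
    jobs.extend(data["results"])
    if not data.get("next"):
        break
    page += 1

# Count jobs per status
status_counts = {}
for job in jobs:
    status = job["status"]
    status_counts[status] = status_counts.get(status, 0) + 1

total = len(jobs)
# Treat both "failed" and "error" as unsuccessful runs
failed = status_counts.get("failed", 0) + status_counts.get("error", 0)
success_rate = ((total - failed) / total * 100) if total > 0 else 100

print(f"Jobs (last 1h): {total}")
print(f"Status: {json.dumps(status_counts, indent=2)}")
print(f"Success rate: {success_rate:.1f}%")

if success_rate < 90:
    print("WARNING: Success rate below 90%!")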

Best Practices

Set up alerting before you need it — don't wait for the first outage
Monitor the queue — pending jobs > 0 for extended periods means capacity issues
Track job duration trends — gradually increasing duration indicates performance problems
Monitor database size — configure retention policies to prevent unbounded growth
Use Grafana annotations — mark deployments, upgrades, and incidents on dashboards
Test alerts regularly — ensure notification channels are working

FAQ

Does AAP support OpenTelemetry?

AAP 2.6 primarily exposes Prometheus metrics. OpenTelemetry integration is not built-in, but you can use the OpenTelemetry Collector as a bridge between AAP's Prometheus endpoint and OpenTelemetry-compatible backends.

Can I monitor individual playbook task performance?

Yes. The job events API (/api/controller/v2/jobs/{id}/job_events/) provides task-level detail, including duration. A timing callback plugin can add per-task timing to the job output as well. The Activity Stream records user actions, not task timing, so it is not the right source for this.
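
Once job events are downloaded, finding the slowest tasks is a small aggregation. The sketch below assumes each event is a dictionary with a `task` name and a `duration` in seconds inside `event_data`; verify those field names against your own API responses, because they are assumptions here. `slowest_tasks` is a hypothetical helper and the sample events are invented.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple


def slowest_tasks(events: Iterable[Dict[str, Any]], top: int = 5) -> List[Tuple[str, float]]:
    """Sum per-host durations by task name and return the slowest tasks."""
    totals: Dict[str, float] = defaultdict(float)
    for event in events:
        task = event.get("task")
        # Assumed field: event_data.duration, in seconds
        duration = (event.get("event_data") or {}).get("duration")
        if task and duration is not None:
            totals[task] += float(duration)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


# Invented events: one task ran on two hosts, another on one
sample_events = [
    {"task": "Install packages", "event_data": {"duration": 12.5}},
    {"task": "Install packages", "event_data": {"duration": 7.5}},
    {"task": "Gather facts", "event_data": {"duration": 3.0}},
    {"task": "", "event_data": {}},  # playbook-level event without a task
]
print(slowest_tasks(sample_events))
```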

How much storage does log aggregation need?

Depends on job volume and verbosity. A deployment running 1,000 jobs/day at normal verbosity generates roughly 1-5 GB of logs daily. Higher verbosity (level 3+) increases this 5-10x.
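
As a worked example of that estimate, the following sketch multiplies daily log volume by the retention period. It assumes volume scales linearly with job count and ignores compression, replication, and index overhead, so treat the result as an order-of-magnitude figure. `estimate_storage_gb` and its parameters are hypothetical.

```python
def estimate_storage_gb(
    jobs_per_day: float,
    gb_per_1000_jobs: float,
    retention_days: int,
    verbosity_multiplier: float = 1.0,
) -> float:
    """Raw log storage needed for the retention window, in GB."""
    daily_gb = jobs_per_day / 1000 * gb_per_1000_jobs * verbosity_multiplier
    return daily_gb * retention_days


# 1,000 jobs/day kept for 90 days, at the low and high ends of the range
low = estimate_storage_gb(1000, 1, 90)
high = estimate_storage_gb(1000, 5, 90)
print(f"{low:.0f}-{high:.0f} GB")
```

At normal verbosity that range works out to 90-450 GB for 90 days; applying the 5-10x multiplier for higher verbosity moves the upper bound into the terabyte range.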

Can I monitor EDA Controller separately?

EDA metrics are available through the Platform Gateway. EDA also logs event processing, rulebook activation status, and action execution to the standard log aggregation pipeline.

What retention should I set for job history?

Default is to keep all job history. For high-volume environments, configure cleanup jobs via Administration → Management Jobs → Cleanup Activity Stream and Cleanup Job Details to retain 90-180 days of history.

Conclusion

Monitoring and logging transform AAP from a black box into an observable, auditable platform. Start with Prometheus metrics and Grafana dashboards for real-time visibility, add alerting for proactive issue detection, and configure log aggregation for compliance and troubleshooting. The investment in observability pays back every time you catch an issue before your users do.
