AAP 2.6 Monitoring and Logging: Prometheus, Grafana, and Log Aggregation
By Luca Berton · Published 2024-01-01 · Category: troubleshooting
Monitor AAP 2.6 with Prometheus metrics, Grafana dashboards, and centralized logging. Track job performance, mesh health, capacity planning, and alerting.
Why Monitor AAP?
A healthy automation platform requires observability. Without monitoring, you discover problems only when jobs fail or users complain. AAP 2.6 exposes metrics and logs that let you:
• Track job success rates: identify failing automation before users notice
• Monitor capacity: know when to add execution nodes
• Detect performance degradation: catch slow jobs and bottlenecks
• Audit automation activity: who ran what, when, and on which hosts
• Plan capacity: forecast growth based on usage trends
AAP Metrics Endpoint
Automation Controller exposes a Prometheus-compatible metrics endpoint:
GET /api/controller/v2/metrics/
Enable Metrics
Metrics are enabled by default in AAP 2.6. Ensure the endpoint is accessible:
# Test metrics endpoint
curl -s -k -H "Authorization: Bearer $TOKEN" \
"https://gateway.example.org/api/controller/v2/metrics/" | head -20
Key Metrics
| Metric | Type | Description |
|--------|------|-------------|
| awx_running_jobs_total | Gauge | Currently running jobs |
| awx_pending_jobs_total | Gauge | Jobs waiting in queue |
| awx_status_total | Counter | Jobs by status (successful, failed, error, canceled) |
| awx_instance_capacity | Gauge | Total capacity per instance |
| awx_instance_consumed_capacity | Gauge | Used capacity per instance |
| awx_instance_remaining_capacity | Gauge | Available capacity per instance |
| awx_instance_info | Info | Instance metadata (hostname, type, version) |
| awx_database_connections_total | Gauge | Database connection count |
| awx_system_memory_total | Gauge | Total system memory |
| awx_system_memory_used | Gauge | Used system memory |
| awx_license_instance_total | Gauge | Licensed instance count |
| awx_license_instance_free | Gauge | Available licensed instances |
| awx_organizations_total | Gauge | Number of organizations |
| awx_users_total | Gauge | Number of users |
| awx_inventories_total | Gauge | Number of inventories |
| awx_projects_total | Gauge | Number of projects |
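To read these metrics outside Prometheus, the text returned by the endpoint can be parsed with a few lines of Python. The sketch below is a minimal parser for the Prometheus text exposition format, not a full implementation. The sample payload and its values are invented for illustration, and `parse_metrics` is a hypothetical helper name.

```python
import re

# Hypothetical sample of the endpoint's output; the values are invented.
SAMPLE = """\
# TYPE awx_running_jobs_total gauge
awx_running_jobs_total 3.0
awx_pending_jobs_total 7.0
awx_status_total{status="successful"} 120.0
awx_status_total{status="failed"} 5.0
"""

# metric_name, optional {labels}, then the sample value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse_metrics(SAMPLE)
pending = next(value for name, _, value in samples if name == "awx_pending_jobs_total")
print(f"Pending jobs: {pending}")
```

For anything beyond a quick check, a maintained parser such as the one shipped with the `prometheus_client` package is probably a better choice than hand-written regular expressions.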
Prometheus Configuration
Scrape Configuration
# prometheus.yml
global:
  scrape_interval: 30s
  evaluation_interval: 30s
scrape_configs:
  - job_name: 'aap-controller'
    scheme: https
    tls_config:
      insecure_skip_verify: true # Or provide CA cert
    bearer_token: '<controller-api-token>'
    static_configs:
      - targets:
          - 'gateway.example.org'
    metrics_path: '/api/controller/v2/metrics/'
    scrape_interval: 60s
  - job_name: 'aap-nodes'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token: '<controller-api-token>'
    static_configs:
      - targets:
          - 'controller1.example.org'
          - 'controller2.example.org'
    metrics_path: '/api/controller/v2/metrics/'
Service Discovery for OpenShift
scrape_configs:
  - job_name: 'aap-openshift'
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
            - ansible-automation-platform
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: automation-controller
        action: keep
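The `keep` rule above drops every discovered service whose `app` label does not match the regex. Prometheus anchors relabel regexes at both ends, so `automation-controller` does not match a longer label value. The following Python sketch mimics that behavior under simplified assumptions: the service names are invented, `keep` is a hypothetical helper, and Python's `re` stands in for the RE2 engine Prometheus uses.

```python
import re


def keep(targets, source_labels, regex, separator=";"):
    """Mimic a Prometheus `action: keep` relabel rule on a list of label dicts."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # Missing labels count as empty strings; values are joined by the separator.
        value = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(value):  # full match = regex anchored at both ends
            kept.append(labels)
    return kept


# Invented discovery results for illustration only.
discovered = [
    {"__meta_kubernetes_service_name": "controller", "__meta_kubernetes_service_label_app": "automation-controller"},
    {"__meta_kubernetes_service_name": "postgres", "__meta_kubernetes_service_label_app": "automation-controller-postgres"},
    {"__meta_kubernetes_service_name": "unlabeled"},
]

kept = keep(discovered, ["__meta_kubernetes_service_label_app"], "automation-controller")
print([t["__meta_kubernetes_service_name"] for t in kept])
```

Only the service whose label equals `automation-controller` exactly survives; the `-postgres` variant and the unlabeled service are dropped.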
Grafana Dashboards
Dashboard: AAP Job Overview
{
  "panels": [
    {
      "title": "Running Jobs",
      "type": "stat",
      "targets": [
        {
          "expr": "awx_running_jobs_total",
          "legendFormat": "Running"
        }
      ]
    },
    {
      "title": "Pending Jobs",
      "type": "stat",
      "targets": [
        {
          "expr": "awx_pending_jobs_total",
          "legendFormat": "Pending"
        }
      ],
      "thresholds": {
        "steps": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 5},
          {"color": "red", "value": 20}
        ]
      }
    },
    {
      "title": "Job Success Rate (24h)",
      "type": "gauge",
      "targets": [
        {
          "expr": "sum(increase(awx_status_total{status='successful'}[24h])) / sum(increase(awx_status_total[24h])) * 100"
        }
      ],
      "thresholds": {
        "steps": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 90},
          {"color": "green", "value": 95}
        ]
      }
    },
    {
      "title": "Jobs by Status (24h)",
      "type": "piechart",
      "targets": [
        {
          "expr": "increase(awx_status_total[24h])",
          "legendFormat": "{{status}}"
        }
      ]
    }
  ]
}
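The success-rate panel divides the increase of the `successful` counter by the increase across all statuses. As a rough illustration of that arithmetic, the sketch below computes the same percentage from two counter snapshots. The snapshot values are invented, `success_rate` is a hypothetical helper, and the calculation ignores the counter-reset handling and extrapolation that `increase()` performs.

```python
def success_rate(start, end):
    """Percentage of successful jobs between two awx_status_total snapshots."""
    # Per-status increase over the window (assumes no counter resets).
    deltas = {status: end[status] - start.get(status, 0) for status in end}
    total = sum(deltas.values())
    if total == 0:
        return None  # no jobs finished in the window
    return 100 * deltas.get("successful", 0) / total


# Invented counter values 24 hours apart.
yesterday = {"successful": 1000, "failed": 40, "error": 5, "canceled": 10}
today = {"successful": 1190, "failed": 48, "error": 5, "canceled": 12}

rate = success_rate(yesterday, today)
print(f"Success rate (24h): {rate:.1f}%")
```

With these numbers 190 of 200 finished jobs succeeded, which lands exactly on the panel's green threshold of 95.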
Dashboard: Capacity Planning
# PromQL queries for capacity dashboard
# Capacity utilization per node
awx_instance_consumed_capacity / awx_instance_capacity * 100
# Total platform capacity
sum(awx_instance_capacity{node_type="execution"})
# Capacity trend (predict when you'll run out)
predict_linear(awx_instance_consumed_capacity[7d], 86400 * 30)
# Queue depth trend (per-second slope; deriv() because this metric is a gauge, not a counter)
deriv(awx_pending_jobs_total[1h])
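To make the capacity forecast less of a black box: `predict_linear` fits a least-squares line through the samples in the range and extends it into the future. The sketch below reproduces that idea in plain Python. It is a simplification that treats the last sample as "now" (Prometheus uses the evaluation time), and the sample series is invented.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares forecast. samples is a list of (timestamp_seconds, value)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    # Simplification: forecast relative to the newest sample's timestamp.
    return slope * (samples[-1][0] + seconds_ahead) + intercept


DAY = 86400
# Invented series: consumed capacity grows by 2 units per day over 7 days.
history = [(day * DAY, 40 + 2 * day) for day in range(7)]

forecast = predict_linear(history, 30 * DAY)
print(f"Forecast consumed capacity in 30 days: {forecast:.1f}")
```

Comparing the forecast against `awx_instance_capacity` shows whether the node is likely to saturate within the forecast window.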
Dashboard: Mesh Health
# Node availability
up{job="aap-nodes"}
# Instance errors
awx_instance_info{errors!=""}
# Capacity per node
awx_instance_remaining_capacity
# Database connections
awx_database_connections_total
Alerting Rules
Prometheus Alert Rules
# aap-alerts.yml
groups:
  - name: aap-critical
    rules:
      - alert: AAPJobQueueBacklog
        expr: awx_pending_jobs_total > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "AAP job queue backlog: {{ $value }} pending jobs"
          description: "More than 20 jobs pending for 10+ minutes. Consider adding execution nodes."
      - alert: AAPHighFailureRate
        expr: |
          sum(rate(awx_status_total{status="failed"}[1h])) /
          sum(rate(awx_status_total[1h])) > 0.1
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "AAP job failure rate above 10%"
          description: "{{ $value | humanizePercentage }} of jobs failing in the last hour."
      - alert: AAPNodeDown
        expr: up{job="aap-nodes"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "AAP node {{ $labels.instance }} is down"
      - alert: AAPCapacityLow
        expr: |
          sum(awx_instance_remaining_capacity{node_type="execution"}) /
          sum(awx_instance_capacity{node_type="execution"}) < 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "AAP execution capacity below 20%"
          description: "Only {{ $value | humanizePercentage }} capacity remaining."
      - alert: AAPDatabaseConnectionsHigh
        expr: awx_database_connections_total > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "AAP database connections high: {{ $value }}"
      - alert: AAPNoJobsRunning
        # Pending comes first so that $value carries the pending count, not the 0 running jobs
        expr: awx_pending_jobs_total > 0 and awx_running_jobs_total == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "AAP has pending jobs but nothing is running"
          description: "{{ $value }} pending jobs with 0 running. Check execution nodes."
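Each rule above combines a threshold with a `for:` clause, so a brief spike does not page anyone. The following Python sketch illustrates how that hold period plays out across successive evaluations. It is a simplified model rather than Prometheus's implementation, and `alert_states` and the sample values are hypothetical.

```python
def alert_states(values, threshold, interval_s, for_s):
    """Return inactive/pending/firing for each evaluation of `value > threshold`."""
    states = []
    active_since = None
    for i, value in enumerate(values):
        now = i * interval_s
        if value > threshold:
            if active_since is None:
                active_since = now  # condition just became true
            held = now - active_since
            states.append("firing" if held >= for_s else "pending")
        else:
            active_since = None  # any false evaluation resets the hold timer
            states.append("inactive")
    return states


# Invented pending-job counts, evaluated every 5 minutes against
# the AAPJobQueueBacklog rule (threshold 20, for: 10m).
pending_jobs = [25, 25, 25, 10, 30]
states = alert_states(pending_jobs, threshold=20, interval_s=300, for_s=600)
print(states)
```

The alert only fires on the third evaluation, once the backlog has persisted for the full 10 minutes, and the dip to 10 pending jobs resets the timer.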
Centralized Logging
Controller Activity Stream
Automation Controller records who changed what, and when, in the Activity Stream, which you can query through the API:
# Recent activity
curl -s -k -H "Authorization: Bearer $TOKEN" \
"https://gateway.example.org/api/controller/v2/activity_stream/?order_by=-timestamp&page_size=10" | \
jq '.results[] | {timestamp: .timestamp, operation: .operation, actor: .summary_fields.actor.username, object: .object1}'
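For audit reports it is often more useful to aggregate the stream than to read it entry by entry. The sketch below counts operations per user from an already-fetched `results` list. The sample entries are invented, `actions_per_user` is a hypothetical helper, and labeling entries without an actor as `system` is an assumption made for this example.

```python
from collections import Counter


def actions_per_user(results):
    """Count (username, operation) pairs in Activity Stream results."""
    counts = Counter()
    for entry in results:
        actor = entry.get("summary_fields", {}).get("actor", {})
        # Assumption: entries with no actor are attributed to "system".
        username = actor.get("username", "system")
        counts[(username, entry["operation"])] += 1
    return counts


# Invented entries shaped like the fields used in the jq filter above.
SAMPLE_RESULTS = [
    {"operation": "update", "object1": "job_template", "summary_fields": {"actor": {"username": "alice"}}},
    {"operation": "create", "object1": "inventory", "summary_fields": {"actor": {"username": "alice"}}},
    {"operation": "update", "object1": "credential", "summary_fields": {"actor": {"username": "alice"}}},
    {"operation": "delete", "object1": "host", "summary_fields": {}},
]

counts = actions_per_user(SAMPLE_RESULTS)
for (user, operation), count in sorted(counts.items()):
    print(f"{user:10} {operation:8} {count}")
```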
Log Forwarding to External Systems
Configure Controller to forward logs to a log aggregator:
# Via Automation Controller settings API
- name: Configure external logging
  ansible.builtin.uri:
    url: "https://gateway.example.org/api/controller/v2/settings/logging/"
    method: PATCH
    headers:
      Authorization: "Bearer {{ token }}"
      Content-Type: "application/json"
    body_format: json
    body:
      LOG_AGGREGATOR_HOST: "splunk.example.org"
      LOG_AGGREGATOR_PORT: 8088
      LOG_AGGREGATOR_TYPE: "splunk"
      LOG_AGGREGATOR_PROTOCOL: "https"
      LOG_AGGREGATOR_TOKEN: "{{ splunk_hec_token }}"
      LOG_AGGREGATOR_ENABLED: true
      LOG_AGGREGATOR_INDIVIDUAL_FACTS: false
    validate_certs: false
Supported Log Aggregators
| Aggregator | Type Value | Protocol |
|-----------|-----------|----------|
| Splunk | splunk | HTTPS (HEC) |
| Elastic/Logstash | logstash | TCP/UDP |
| Loggly | loggly | HTTPS |
| Sumo Logic | sumologic | HTTPS |
| Other | other | TCP/UDP syslog |
What Gets Logged
| Log Category | Contents |
|-------------|----------|
| Job events | Task start/finish, stdout, playbook output |
| Activity stream | User actions, CRUD operations |
| System tracking | Fact gathering, inventory changes |
| Performance | Job duration, fork count, host count |
ELK Stack Configuration
# Logstash input for AAP
input {
  http {
    port => 5044
    codec => json
  }
}
filter {
  if [cluster_host_id] {
    mutate {
      add_field => { "source" => "aap-controller" }
    }
  }
}
output {
  elasticsearch {
    hosts => ["https://elasticsearch.example.org:9200"]
    index => "aap-logs-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "${ES_PASSWORD}"
  }
}
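The `index` pattern above writes one index per day, based on the event timestamp in UTC. When searching for a specific event it helps to know which index it landed in. The sketch below derives the index name from an ISO 8601 timestamp; `index_for` is a hypothetical helper and the timestamp is an invented example.

```python
from datetime import datetime, timezone


def index_for(timestamp):
    """Return the daily index name for an ISO 8601 timestamp with a UTC offset."""
    moment = datetime.fromisoformat(timestamp).astimezone(timezone.utc)
    return moment.strftime("aap-logs-%Y.%m.%d")


# A job event logged late in the evening, US Eastern time.
index_name = index_for("2024-01-01T23:30:00-05:00")
print(index_name)
```

Because the date is taken in UTC, an event logged late on January 1 in a negative-offset timezone ends up in the January 2 index.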
API-Based Monitoring
Job Health Monitoring Script
#!/usr/bin/env python3
"""Monitor AAP job health via API"""
import json
from datetime import datetime, timedelta, timezone

import requests

GATEWAY_URL = "https://gateway.example.org"
TOKEN = "your-api-token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Get jobs that finished in the last hour (UTC)
one_hour_ago = (datetime.now(timezone.utc) - timedelta(hours=1)).strftime("%Y-%m-%dT%H:%M:%S")
response = requests.get(
    f"{GATEWAY_URL}/api/controller/v2/jobs/",
    headers=HEADERS,
    params={"finished__gte": one_hour_ago, "page_size": 100},
    verify=False,  # Skips TLS verification; point this at a CA bundle in production
    timeout=30,
)
response.raise_for_status()  # Fail on auth or server errors instead of parsing an error body
jobs = response.json()["results"]  # First page only (up to 100 jobs)

# Count jobs per status
status_counts = {}
for job in jobs:
    status = job["status"]
    status_counts[status] = status_counts.get(status, 0) + 1

# Only "successful" counts as success; failed, error, and canceled jobs all lower the rate
total = len(jobs)
successful = status_counts.get("successful", 0)
success_rate = (successful / total * 100) if total > 0 else 100

print(f"Jobs (last 1h): {total}")
print(f"Status: {json.dumps(status_counts, indent=2)}")
print(f"Success rate: {success_rate:.1f}%")
if success_rate < 90:
    print("WARNING: Success rate below 90%!")
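The script above reads a single page of up to 100 jobs, so a busy hour can be undercounted. Controller list responses include a `next` field that points at the following page. The sketch below follows those links until they run out. `fetch_all` is a hypothetical helper, and the fake pages stand in for real HTTP responses so the example runs without a controller.

```python
def fetch_all(get_page, first_url):
    """Collect results from every page by following each page's `next` link."""
    results = []
    url = first_url
    while url:
        page = get_page(url)
        results.extend(page["results"])
        url = page.get("next")  # None on the last page ends the loop
    return results


# Invented pages standing in for requests.get(...).json() responses.
FAKE_PAGES = {
    "/api/controller/v2/jobs/?page=1": {
        "results": [{"id": 1, "status": "successful"}, {"id": 2, "status": "failed"}],
        "next": "/api/controller/v2/jobs/?page=2",
    },
    "/api/controller/v2/jobs/?page=2": {
        "results": [{"id": 3, "status": "successful"}],
        "next": None,
    },
}

all_jobs = fetch_all(FAKE_PAGES.__getitem__, "/api/controller/v2/jobs/?page=1")
print(f"Fetched {len(all_jobs)} jobs")
```

Against a live system, `get_page` would wrap `requests.get` and prepend the gateway hostname, assuming the `next` value is returned as a relative path.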
Best Practices
• Set up alerting before you need it: don't wait for the first outage
• Monitor the queue: pending jobs > 0 for extended periods means capacity issues
• Track job duration trends: gradually increasing duration indicates performance problems
• Monitor database size: configure retention policies to prevent unbounded growth
• Use Grafana annotations: mark deployments, upgrades, and incidents on dashboards
• Test alerts regularly: ensure notification channels are working
FAQ
Does AAP support OpenTelemetry?
AAP 2.6 primarily exposes Prometheus metrics. OpenTelemetry integration is not built-in, but you can use the OpenTelemetry Collector as a bridge between AAP's Prometheus endpoint and OpenTelemetry-compatible backends.
Can I monitor individual playbook task performance?
Yes. The job events API (/api/controller/v2/jobs/{id}/job_events/) provides task-level detail, including duration. Callback plugins such as ansible.posix.profile_tasks can also add per-task timing to the job output.
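As a rough illustration, the sketch below ranks tasks by total runtime from a list of job events. The field names (`event`, `task`, `event_data.duration`) are assumptions about the event payload and should be checked against a real response; the sample events are invented and `slowest_tasks` is a hypothetical helper.

```python
from collections import defaultdict


def slowest_tasks(events, top=3):
    """Sum per-host durations by task name and return the slowest tasks first."""
    totals = defaultdict(float)
    for event in events:
        # Assumption: per-host results are runner_on_* events with a duration field.
        duration = event.get("event_data", {}).get("duration")
        if event.get("event", "").startswith("runner_on_") and duration is not None:
            totals[event["task"]] += duration
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


# Invented events for illustration only.
SAMPLE_EVENTS = [
    {"event": "playbook_on_task_start", "task": "Install packages", "event_data": {}},
    {"event": "runner_on_ok", "task": "Install packages", "event_data": {"duration": 12.5}},
    {"event": "runner_on_ok", "task": "Install packages", "event_data": {"duration": 11.5}},
    {"event": "runner_on_ok", "task": "Template config", "event_data": {"duration": 3.0}},
    {"event": "runner_on_failed", "task": "Restart service", "event_data": {"duration": 0.5}},
]

ranking = slowest_tasks(SAMPLE_EVENTS, top=2)
for task, seconds in ranking:
    print(f"{task}: {seconds:.1f}s")
```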
How much storage does log aggregation need?
Depends on job volume and verbosity. A deployment running 1,000 jobs/day at normal verbosity generates roughly 1-5 GB of logs daily. Higher verbosity (level 3+) increases this 5-10x.
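A back-of-the-envelope estimate follows directly from those figures. The sketch below scales the 1-5 GB per 1,000 jobs range linearly with job volume and retention. Linear scaling is an assumption, and the result excludes index overhead, replication, and compression in the log backend. `storage_range_gb` is a hypothetical helper.

```python
def storage_range_gb(jobs_per_day, retention_days, gb_per_1000_jobs=(1, 5), verbosity_factor=1):
    """Return a (low, high) estimate in GB for raw log volume over the retention window."""
    low, high = gb_per_1000_jobs
    # Assumption: volume grows linearly with job count, days kept, and verbosity.
    scale = jobs_per_day / 1000 * retention_days * verbosity_factor
    return low * scale, high * scale


low, high = storage_range_gb(jobs_per_day=1000, retention_days=90)
print(f"90 days at normal verbosity: {low:.0f}-{high:.0f} GB")

# Same deployment at high verbosity, using the lower bound of the 5-10x multiplier.
verbose_low, verbose_high = storage_range_gb(1000, 90, verbosity_factor=5)
print(f"90 days at verbosity 3+: {verbose_low:.0f}-{verbose_high:.0f} GB or more")
```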
Can I monitor EDA Controller separately?
EDA metrics are available through the Platform Gateway. EDA also logs event processing, rulebook activation status, and action execution to the standard log aggregation pipeline.
What retention should I set for job history?
Default is to keep all job history. For high-volume environments, configure cleanup jobs via Administration → Management Jobs → Cleanup Activity Stream and Cleanup Job Details to retain 90-180 days of history.
Conclusion
Monitoring and logging transform AAP from a black box into an observable, auditable platform. Start with Prometheus metrics and Grafana dashboards for real-time visibility, add alerting for proactive issue detection, and configure log aggregation for compliance and troubleshooting. The investment in observability pays back every time you catch an issue before your users do.