AAP 2.6 Monitoring and Logging: Prometheus, Grafana, and Log Aggregation
By Luca Berton · Published 2024-01-01 · Category: troubleshooting
Monitor AAP 2.6 with Prometheus metrics, Grafana dashboards, and centralized logging. Track job performance, mesh health, capacity planning, and alerting for Automation Controller, Hub, EDA, and Platform Gateway.
Why Monitor AAP?
A healthy automation platform requires observability. Without monitoring, you discover problems only when jobs fail or users complain. AAP 2.6 exposes metrics and logs that let you:
• Track job success rates: identify failing automation before users notice
• Monitor capacity: know when to add execution nodes
• Detect performance degradation: catch slow jobs and bottlenecks
• Audit automation activity: who ran what, when, and on which hosts
• Plan capacity: forecast growth based on usage trends
AAP Metrics Endpoint
Automation Controller exposes a Prometheus-compatible metrics endpoint, reachable through the Platform Gateway at /api/controller/v2/metrics/.
Enable Metrics
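Metrics are enabled by default in AAP 2.6. Make sure the endpoint is reachable from your Prometheus server and that the account or token used for scraping is allowed to read it.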
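One way to check accessibility is to request the endpoint with an API token. The following is a minimal sketch using only the Python standard library. The path /api/controller/v2/metrics/ is an assumption based on the /api/controller/v2/ prefix used elsewhere in this article, and the host name and token in the usage note are placeholders.

```python
import urllib.request

# Assumed path: follows the /api/controller/v2/ prefix used in this article.
METRICS_PATH = "/api/controller/v2/metrics/"


def build_metrics_request(base_url: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for the metrics endpoint."""
    url = base_url.rstrip("/") + METRICS_PATH
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_metrics(base_url: str, token: str, timeout: float = 10.0) -> str:
    """Return the raw metrics text; urlopen raises HTTPError on 4xx/5xx responses."""
    request = build_metrics_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_metrics("https://aap.example.com", "<token>")` should return plain text if the endpoint is reachable. An `HTTPError` with status 401 or 403 points to a credential or permission problem rather than a network one.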
Key Metrics
| Metric | Type | Description |
|--------|------|-------------|
| awx_running_jobs_total | Gauge | Currently running jobs |
| awx_pending_jobs_total | Gauge | Jobs waiting in queue |
| awx_status_total | Counter | Jobs by status (successful, failed, error, canceled) |
| awx_instance_capacity | Gauge | Total capacity per instance |
| awx_instance_consumed_capacity | Gauge | Used capacity per instance |
| awx_instance_remaining_capacity | Gauge | Available capacity per instance |
| awx_instance_info | Info | Instance metadata (hostname, type, version) |
| awx_database_connections_total | Gauge | Database connection count |
| awx_system_memory_total | Gauge | Total system memory |
| awx_system_memory_used | Gauge | Used system memory |
| awx_license_instance_total | Gauge | Licensed instance count |
| awx_license_instance_free | Gauge | Available licensed instances |
| awx_organizations_total | Gauge | Number of organizations |
| awx_users_total | Gauge | Number of users |
| awx_inventories_total | Gauge | Number of inventories |
| awx_projects_total | Gauge | Number of projects |
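To work with these metrics outside Prometheus, the text exposition format can be parsed directly. The sketch below is a deliberately simple parser, not a full implementation of the format: it assumes samples have no trailing timestamps and that label values contain no commas or escaped quotes. The label names in the usage note (such as `hostname`) are illustrative.

```python
from collections import defaultdict


def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition into {metric_name: [(labels, value), ...]}.

    Handles the `name value` and `name{label="v",...} value` forms only.
    """
    samples = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        head, _, value = line.rpartition(" ")
        name, labels = head, {}
        if "{" in head:
            name, _, raw = head.partition("{")
            for pair in raw.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples[name].append((labels, float(value)))
    return dict(samples)
```

For example, `parse_metrics(text)["awx_pending_jobs_total"]` yields a list with one `(labels, value)` tuple per sample, which is enough to feed simple scripts or sanity checks.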
Prometheus Configuration
Scrape Configuration
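A scrape job for the Controller needs the metrics path, HTTPS, and a bearer token. The sketch below renders such a job as YAML text from Python so the values can be templated. The job name, target, token file location, and metrics path are assumptions to adapt to your environment. The YAML keys themselves (`metrics_path`, `authorization`, `credentials_file`, `static_configs`) are standard Prometheus scrape settings.

```python
def render_scrape_config(target: str, token_file: str, interval: str = "30s") -> str:
    """Render a Prometheus scrape job (YAML text) for the Controller metrics endpoint."""
    return "\n".join([
        "scrape_configs:",
        "  - job_name: aap-controller",      # hypothetical job name
        f"    scrape_interval: {interval}",
        "    scheme: https",
        "    metrics_path: /api/controller/v2/metrics/",  # assumed path
        "    authorization:",
        "      type: Bearer",
        f"      credentials_file: {token_file}",
        "    static_configs:",
        f"      - targets: ['{target}']",
        "",
    ])
```

`print(render_scrape_config("aap.example.com:443", "/etc/prometheus/aap_token"))` produces a fragment you can merge into prometheus.yml. Keeping the token in a file rather than inline avoids exposing it in the configuration.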
Service Discovery for OpenShift
Grafana Dashboards
Dashboard: AAP Job Overview
Dashboard: Capacity Planning
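The core calculation behind a capacity panel is utilization per instance: consumed capacity divided by total capacity, which corresponds to awx_instance_consumed_capacity and awx_instance_capacity. A minimal sketch of that arithmetic follows. The 80% threshold and the host names in the usage note are arbitrary illustrations, not recommended values.

```python
def capacity_report(capacity: dict, consumed: dict, threshold: float = 0.8) -> list:
    """Return (hostname, utilization, over_threshold) tuples, busiest first.

    capacity / consumed map hostname -> value of awx_instance_capacity /
    awx_instance_consumed_capacity for that instance.
    """
    report = []
    for host, total in capacity.items():
        if total <= 0:  # skip instances that report no capacity
            continue
        utilization = consumed.get(host, 0.0) / total
        report.append((host, round(utilization, 3), utilization >= threshold))
    return sorted(report, key=lambda row: row[1], reverse=True)
```

With `capacity={"exec1": 100, "exec2": 50}` and `consumed={"exec1": 85, "exec2": 10}`, only exec1 is flagged. Instances that stay above the threshold over days or weeks are the signal to add execution nodes.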
Dashboard: Mesh Health
Alerting Rules
Prometheus Alert Rules
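Alert rules usually combine a condition with a duration, so that a brief spike does not page anyone. The sketch below imitates that idea in plain Python for a queue-depth check such as awx_pending_jobs_total > 0. It only approximates how a duration clause behaves, and the sample data in the usage note is made up.

```python
def condition_held_for(samples, threshold, duration_s):
    """Return True if value > threshold held continuously for >= duration_s seconds,
    measured up to the most recent sample.

    samples: list of (unix_timestamp, value) tuples, oldest first.
    """
    if not samples:
        return False
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # condition just became true
        else:
            breach_start = None    # condition cleared; reset the timer
    latest_ts = samples[-1][0]
    return breach_start is not None and latest_ts - breach_start >= duration_s
```

For samples taken every 60 seconds, `condition_held_for(samples, 0, 300)` fires only when the queue has been non-empty for five minutes without interruption. A single scrape with an empty queue resets the timer.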
Centralized Logging
Controller Activity Stream
AAP records user and system activity in the Activity Stream, which you can query through the API.
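Activity Stream queries are ordinary filtered list requests. The following sketch builds such a URL. The endpoint path assumes the same /api/controller/v2/ prefix used elsewhere in this article, and the filter names (`timestamp__gt`, `operation`, `order_by`, `page_size`) follow the Controller API's general field-lookup conventions. Verify them against your API browser before relying on them.

```python
from urllib.parse import urlencode


def activity_stream_url(base_url, since_iso=None, operation=None, page_size=50):
    """Build a query URL for recent Activity Stream entries, newest first."""
    params = {"order_by": "-timestamp", "page_size": page_size}
    if since_iso:
        params["timestamp__gt"] = since_iso   # only entries after this time
    if operation:
        params["operation"] = operation       # e.g. create, update, delete
    return f"{base_url.rstrip('/')}/api/controller/v2/activity_stream/?{urlencode(params)}"
```

Requesting the resulting URL with a bearer token returns a paginated list. Each entry records the operation, the affected objects, and the actor, which covers the "who ran what, and when" audit question.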
Log Forwarding to External Systems
Controller can forward its logs to an external log aggregator. Configure the aggregator host, port, type, and protocol in the Controller logging settings.
Supported Log Aggregators
| Aggregator | Type Value | Protocol |
|-----------|-----------|----------|
| Splunk | splunk | HTTPS (HEC) |
| Elastic/Logstash | logstash | TCP/UDP |
| Loggly | loggly | HTTPS |
| Sumo Logic | sumologic | HTTPS |
| Other | other | TCP/UDP syslog |
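The logging settings can also be managed through the API instead of the UI. The sketch below builds a settings payload and validates the type against the table above. The `LOG_AGGREGATOR_*` setting names are the ones used by the upstream AWX project, and the assumption that the payload is sent with PATCH to /api/controller/v2/settings/logging/ should be checked against your installation. The host and port in the usage note are placeholders.

```python
# Type values taken from the table above.
SUPPORTED_TYPES = {"splunk", "logstash", "loggly", "sumologic", "other"}


def build_logging_settings(host, port, aggregator_type, protocol="https"):
    """Return a settings payload for external log forwarding.

    Raises ValueError for an aggregator type that is not in the table.
    """
    if aggregator_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported aggregator type: {aggregator_type!r}")
    return {
        "LOG_AGGREGATOR_ENABLED": True,
        "LOG_AGGREGATOR_HOST": host,
        "LOG_AGGREGATOR_PORT": port,
        "LOG_AGGREGATOR_TYPE": aggregator_type,
        "LOG_AGGREGATOR_PROTOCOL": protocol,  # https, tcp, or udp
        # Logger names as used by upstream AWX (assumed to carry over to AAP 2.6).
        "LOG_AGGREGATOR_LOGGERS": ["awx", "activity_stream", "job_events", "system_tracking"],
    }
```

For a Splunk HTTP Event Collector, `build_logging_settings("splunk.example.com", 8088, "splunk")` gives a payload you can serialize as JSON. Validating the type locally catches typos before the API rejects the request.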
What Gets Logged
| Log Category | Contents |
|-------------|----------|
| Job events | Task start/finish, stdout, playbook output |
| Activity stream | User actions, CRUD operations |
| System tracking | Fact gathering, inventory changes |
| Performance | Job duration, fork count, host count |
ELK Stack Configuration
API-Based Monitoring
Job Health Monitoring Script
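A job health check can be as simple as counting recent jobs by status and comparing the failure rate against a limit. The sketch below works on a list of job records such as the `results` array returned by the jobs list endpoint (assumed here to be /api/controller/v2/jobs/). Fetching the data is left out, and the 10% limit is an arbitrary example.

```python
from collections import Counter

FINISHED_STATES = ("successful", "failed", "error", "canceled")


def summarize_jobs(jobs, max_failure_rate=0.1):
    """Summarize job records by status and flag an unhealthy failure rate.

    jobs: iterable of dicts that each carry a "status" key.
    """
    counts = Counter(job.get("status", "unknown") for job in jobs)
    finished = sum(counts[state] for state in FINISHED_STATES)
    bad = counts["failed"] + counts["error"]
    failure_rate = bad / finished if finished else 0.0
    return {
        "counts": dict(counts),
        "failure_rate": round(failure_rate, 3),
        "healthy": failure_rate <= max_failure_rate,
    }
```

Running this on the last hour of jobs from a cron job or a scheduled playbook, and exiting non-zero when `healthy` is false, gives a basic check that works even without Prometheus.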
Best Practices
• Set up alerting before you need it: don't wait for the first outage
• Monitor the queue: pending jobs > 0 for extended periods means capacity issues
• Track job duration trends: gradually increasing duration indicates performance problems
• Monitor database size: configure retention policies to prevent unbounded growth
• Use Grafana annotations: mark deployments, upgrades, and incidents on dashboards
• Test alerts regularly: ensure notification channels are working
FAQ
Does AAP support OpenTelemetry?
AAP 2.6 primarily exposes Prometheus metrics. OpenTelemetry integration is not built-in, but you can use the OpenTelemetry Collector as a bridge between AAP's Prometheus endpoint and OpenTelemetry-compatible backends.
Can I monitor individual playbook task performance?
Yes. The job events API (/api/controller/v2/jobs/{id}/job_events/) provides task-level detail, including duration, for every task and host in a job. For timing directly in the job output, you can also enable a timing callback plugin such as profile_tasks.
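Given the events of a job, finding the slowest tasks is a small aggregation. The sketch below assumes each event carries an `event` type, a `task` name, and an `event_data` object with a `duration` in seconds. Those field names are an assumption about the job events payload, so confirm them against a real response first.

```python
def slowest_tasks(events, top=5):
    """Return the `top` task names by total duration across hosts, slowest first.

    events: list of job event dicts; events without a duration are ignored.
    """
    totals = {}
    for event in events:
        if not event.get("event", "").startswith("runner_on_"):
            continue  # keep only per-host task result events
        duration = (event.get("event_data") or {}).get("duration")
        if duration is None:
            continue
        task = event.get("task") or "<unnamed>"
        totals[task] = totals.get(task, 0.0) + float(duration)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]
```

Summing across hosts highlights tasks that are expensive for the whole job. Grouping by `(task, host)` instead would expose individual slow hosts.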
How much storage does log aggregation need?
It depends on job volume and verbosity. A deployment running 1,000 jobs/day at normal verbosity generates roughly 1-5 GB of logs daily. Higher verbosity (level 3 and above) increases this by a factor of 5-10.
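These figures can be turned into a rough sizing estimate. The sketch below scales the 1-5 GB per 1,000 jobs range linearly with job volume, retention, and a verbosity multiplier. Linear scaling is a simplifying assumption, and real volume depends heavily on playbook output.

```python
def estimate_log_storage_gb(jobs_per_day, retention_days,
                            verbosity_multiplier=1.0,
                            gb_per_1000_jobs=(1.0, 5.0)):
    """Return a (low, high) storage estimate in GB for the retention window.

    gb_per_1000_jobs is the daily range quoted above for normal verbosity;
    use a verbosity_multiplier of 5-10 for verbosity level 3 and above.
    """
    scale = jobs_per_day / 1000 * verbosity_multiplier * retention_days
    low, high = gb_per_1000_jobs
    return (low * scale, high * scale)
```

For 1,000 jobs/day kept for 30 days at normal verbosity, this gives a range of 30 to 150 GB. Index overhead and replication in the log backend come on top of that.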
Can I monitor EDA Controller separately?
EDA metrics are available through the Platform Gateway. EDA also logs event processing, rulebook activation status, and action execution to the standard log aggregation pipeline.
What retention should I set for job history?
Default is to keep all job history. For high-volume environments, configure cleanup jobs via Administration → Management Jobs → Cleanup Activity Stream and Cleanup Job Details to retain 90-180 days of history.
Conclusion
Monitoring and logging transform AAP from a black box into an observable, auditable platform. Start with Prometheus metrics and Grafana dashboards for real-time visibility, add alerting for proactive issue detection, and configure log aggregation for compliance and troubleshooting. The investment in observability pays back every time you catch an issue before your users do.
Related Articles
• AAP 2.6 Architecture and Components: Complete Guide
• AAP 2.6 Automation Mesh: Distributed Execution Across Sites and Networks
• AAP 2.6 Backup, Restore, and Disaster Recovery Guide
• AAP 2.6 Automation Dashboard Guide
• AAP 2.6 Security Best Practices