AAP 2.6 Monitoring and Logging: Prometheus, Grafana, and Log Aggregation

By Luca Berton · Published 2024-01-01 · Category: troubleshooting

Monitor AAP 2.6 with Prometheus metrics, Grafana dashboards, and centralized logging. Track job performance, mesh health, capacity planning, and alerting for Automation Controller, Hub, EDA, and Platform Gateway.

Why Monitor AAP?

A healthy automation platform requires observability. Without monitoring, you discover problems only when jobs fail or users complain. AAP 2.6 exposes metrics and logs that let you:

• Track job success rates: identify failing automation before users notice
• Monitor capacity: know when to add execution nodes
• Detect performance degradation: catch slow jobs and bottlenecks
• Audit automation activity: who ran what, when, and on which hosts
• Plan capacity: forecast growth based on usage trends

AAP Metrics Endpoint

Automation Controller exposes a Prometheus-compatible metrics endpoint.

Enable Metrics

Metrics are enabled by default in AAP 2.6. Ensure the endpoint is accessible:
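
One way to check accessibility is to request the endpoint with an API token and parse the response. The sketch below is illustrative only: the path `/api/controller/v2/metrics/` is an assumption based on the API path style used in the FAQ of this article, and `base_url` and `token` are placeholders. The parser is deliberately simple and does not handle label values that contain commas.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        # The sample value is the last space-separated token on the line.
        series, _, value = line.rpartition(" ")
        name, _, label_part = series.partition("{")
        labels = {}
        for pair in label_part.rstrip("}").split(","):
            if "=" in pair:
                key, _, val = pair.partition("=")
                labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(base_url, token):
    """Fetch the raw metrics text. The URL path is an assumption, not confirmed here."""
    request = urllib.request.Request(
        f"{base_url}/api/controller/v2/metrics/",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling `parse_metrics(fetch_metrics("https://aap.example.com", token))` should return a non-empty list if the endpoint is reachable and the token has read access.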

Key Metrics

| Metric | Type | Description |
|--------|------|-------------|
| awx_running_jobs_total | Gauge | Currently running jobs |
| awx_pending_jobs_total | Gauge | Jobs waiting in queue |
| awx_status_total | Counter | Jobs by status (successful, failed, error, canceled) |
| awx_instance_capacity | Gauge | Total capacity per instance |
| awx_instance_consumed_capacity | Gauge | Used capacity per instance |
| awx_instance_remaining_capacity | Gauge | Available capacity per instance |
| awx_instance_info | Info | Instance metadata (hostname, type, version) |
| awx_database_connections_total | Gauge | Database connection count |
| awx_system_memory_total | Gauge | Total system memory |
| awx_system_memory_used | Gauge | Used system memory |
| awx_license_instance_total | Gauge | Licensed instance count |
| awx_license_instance_free | Gauge | Available licensed instances |
| awx_organizations_total | Gauge | Number of organizations |
| awx_users_total | Gauge | Number of users |
| awx_inventories_total | Gauge | Number of inventories |
| awx_projects_total | Gauge | Number of projects |
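
As a small worked example of how the capacity gauges relate, the sketch below derives a utilization percentage per instance from `awx_instance_capacity` and `awx_instance_consumed_capacity`. The input dictionaries, keyed by hostname, are a hypothetical shape chosen for illustration.

```python
def capacity_utilization(capacity, consumed):
    """Return the percentage of capacity in use per instance.

    Both arguments map hostname -> gauge value, for example values read from
    awx_instance_capacity and awx_instance_consumed_capacity.
    """
    utilization = {}
    for host, total in capacity.items():
        if total <= 0:
            continue  # instances reporting zero capacity cannot be rated
        used = consumed.get(host, 0.0)
        utilization[host] = round(100.0 * used / total, 1)
    return utilization


if __name__ == "__main__":
    print(capacity_utilization({"exec1": 100.0, "exec2": 50.0}, {"exec1": 25.0, "exec2": 50.0}))
```

An instance that stays near 100% for long periods is a candidate for adding execution nodes.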

Prometheus Configuration

Scrape Configuration
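
A minimal sketch of a scrape job, generated from Python so it can be templated per environment. The keys (`job_name`, `scheme`, `metrics_path`, `scrape_interval`, `authorization`, `static_configs`) are standard Prometheus scrape settings; the target host, token file path, job name, interval, and metrics path are assumptions to replace with your own values. The output is JSON, which YAML parsers generally accept, but converting it to regular YAML before adding it to `prometheus.yml` is the safer choice.

```python
import json


def build_scrape_config(target, token_file, metrics_path="/api/controller/v2/metrics/"):
    """Return one Prometheus scrape_configs entry as a plain dict."""
    return {
        "job_name": "aap-controller",  # hypothetical job name
        "scheme": "https",
        "metrics_path": metrics_path,  # assumed path, verify against your install
        "scrape_interval": "30s",  # illustrative interval
        "authorization": {"type": "Bearer", "credentials_file": token_file},
        "static_configs": [{"targets": [target]}],
    }


if __name__ == "__main__":
    entry = build_scrape_config("aap.example.com:443", "/etc/prometheus/aap_token")
    print(json.dumps({"scrape_configs": [entry]}, indent=2))
```

Reading the token from a file keeps the credential out of the Prometheus configuration itself.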

Service Discovery for OpenShift

Grafana Dashboards

Dashboard: AAP Job Overview

Dashboard: Capacity Planning

Dashboard: Mesh Health

Alerting Rules

Prometheus Alert Rules
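
The sketch below builds a small rule group from the metrics listed under Key Metrics. The rule fields (`alert`, `expr`, `for`, `labels`, `annotations`) follow the Prometheus alerting rule format; the alert names, thresholds, and durations are illustrative assumptions, not recommended values.

```python
import json


def alert_rule(name, expr, duration, severity, summary):
    """Return one alerting rule in the Prometheus rule file structure."""
    return {
        "alert": name,
        "expr": expr,
        "for": duration,
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }


def build_rule_group():
    """Return a rule group covering queue backlog and capacity exhaustion."""
    return {
        "groups": [
            {
                "name": "aap-controller",  # hypothetical group name
                "rules": [
                    alert_rule(
                        "AAPJobsStuckPending",
                        "awx_pending_jobs_total > 0",
                        "15m",
                        "warning",
                        "Jobs have been waiting in the queue for 15 minutes",
                    ),
                    alert_rule(
                        "AAPCapacityNearlyExhausted",
                        "awx_instance_consumed_capacity / awx_instance_capacity > 0.9",
                        "10m",
                        "critical",
                        "An instance has used more than 90% of its capacity",
                    ),
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_rule_group(), indent=2))
```

The `for` clause matters here: a non-empty queue is normal for short bursts, so the alert only fires when the condition persists.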

Centralized Logging

Controller Activity Stream

AAP logs all activity to the Activity Stream API:
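
A hedged sketch of querying the Activity Stream and reducing each entry to who did what. The path `/api/controller/v2/activity_stream/` is assumed from the API path style used elsewhere in this article, and the field names (`timestamp`, `operation`, `object1`, `summary_fields.actor.username`) are assumptions about the response shape to verify against your instance.

```python
from urllib.parse import urlencode


def activity_stream_url(base_url, since_iso, page_size=50):
    """Build a URL listing activity newer than since_iso, newest first."""
    query = urlencode(
        {
            "timestamp__gt": since_iso,
            "order_by": "-timestamp",
            "page_size": page_size,
        }
    )
    return f"{base_url}/api/controller/v2/activity_stream/?{query}"


def summarize_activity(entries):
    """Reduce entries to (timestamp, user, operation, object) tuples."""
    rows = []
    for entry in entries:
        actor = entry.get("summary_fields", {}).get("actor", {})
        rows.append(
            (
                entry.get("timestamp"),
                actor.get("username", "system"),  # entries without an actor are labeled "system"
                entry.get("operation"),
                entry.get("object1"),
            )
        )
    return rows
```

The `entries` argument is the `results` list from the JSON response, fetched with any HTTP client and a token that has audit access.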

Log Forwarding to External Systems

Configure Controller to forward logs to a log aggregator:
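
The same settings can be applied through the API. This is a sketch under assumptions: the `LOG_AGGREGATOR_*` names are the logging settings used by upstream AWX, the settings path `/api/controller/v2/settings/logging/` is inferred from the path style in this article, and the host, port, and logger list are placeholders.

```python
import json
import urllib.request

# Type values from the Supported Log Aggregators table in this article.
SUPPORTED_TYPES = {"splunk", "logstash", "loggly", "sumologic", "other"}


def build_logging_settings(host, port, aggregator_type, protocol="https"):
    """Return a settings payload for external log forwarding."""
    if aggregator_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported aggregator type: {aggregator_type}")
    return {
        "LOG_AGGREGATOR_ENABLED": True,
        "LOG_AGGREGATOR_HOST": host,
        "LOG_AGGREGATOR_PORT": port,
        "LOG_AGGREGATOR_TYPE": aggregator_type,
        "LOG_AGGREGATOR_PROTOCOL": protocol,  # https, tcp, or udp
        "LOG_AGGREGATOR_LOGGERS": ["awx", "activity_stream", "job_events", "system_tracking"],
    }


def apply_logging_settings(base_url, token, settings):
    """PATCH the logging settings. The URL path is an assumption."""
    request = urllib.request.Request(
        f"{base_url}/api/controller/v2/settings/logging/",
        data=json.dumps(settings).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Validating the type before sending avoids a round trip for a value the Controller would reject anyway.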

Supported Log Aggregators

| Aggregator | Type Value | Protocol |
|-----------|-----------|----------|
| Splunk | splunk | HTTPS (HEC) |
| Elastic/Logstash | logstash | TCP/UDP |
| Loggly | loggly | HTTPS |
| Sumo Logic | sumologic | HTTPS |
| Other | other | TCP/UDP syslog |

What Gets Logged

| Log Category | Contents |
|-------------|----------|
| Job events | Task start/finish, stdout, playbook output |
| Activity stream | User actions, CRUD operations |
| System tracking | Fact gathering, inventory changes |
| Performance | Job duration, fork count, host count |

ELK Stack Configuration

API-Based Monitoring

Job Health Monitoring Script
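
A minimal sketch of the health calculation such a script might perform on job records. It assumes each record carries `name`, `status`, and `elapsed` (seconds), as the jobs list API is expected to return; the 30-minute slow-job threshold is an arbitrary example value.

```python
from collections import Counter


def job_health(jobs, slow_seconds=1800):
    """Summarize job records into status counts, failure rate, and slow jobs."""
    statuses = Counter(job["status"] for job in jobs)
    # Only finished jobs count toward the failure rate.
    finished = sum(statuses[s] for s in ("successful", "failed", "error", "canceled"))
    unhealthy = statuses["failed"] + statuses["error"]
    slow = [job["name"] for job in jobs if job.get("elapsed", 0) > slow_seconds]
    return {
        "by_status": dict(statuses),
        "failure_rate": round(unhealthy / finished, 3) if finished else 0.0,
        "slow_jobs": slow,
    }


if __name__ == "__main__":
    sample = [
        {"name": "patch", "status": "successful", "elapsed": 120.0},
        {"name": "deploy", "status": "failed", "elapsed": 45.5},
    ]
    print(job_health(sample))
```

Feeding it the `results` list from a jobs query on a schedule, and alerting when `failure_rate` crosses a threshold, gives a simple API-based health check that does not depend on Prometheus.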

Best Practices

• Set up alerting before you need it: don't wait for the first outage
• Monitor the queue: pending jobs > 0 for extended periods means capacity issues
• Track job duration trends: gradually increasing duration indicates performance problems
• Monitor database size: configure retention policies to prevent unbounded growth
• Use Grafana annotations: mark deployments, upgrades, and incidents on dashboards
• Test alerts regularly: ensure notification channels are working

FAQ

Does AAP support OpenTelemetry?

AAP 2.6 primarily exposes Prometheus metrics. OpenTelemetry integration is not built-in, but you can use the OpenTelemetry Collector as a bridge between AAP's Prometheus endpoint and OpenTelemetry-compatible backends.

Can I monitor individual playbook task performance?

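Yes. The job events API (/api/controller/v2/jobs/{id}/job_events/) provides task-level detail, including duration. Timing callback plugins can also add per-task timing to the job output. The Activity Stream API records user actions rather than task timing, so it is not the right source for this.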
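
As an illustration, the sketch below totals task time from job event records. It assumes each event has `event`, `task`, and an `event_data.duration` value in seconds on `runner_on_ok` and `runner_on_failed` events; verify those field names against the events your instance returns.

```python
from collections import defaultdict


def task_durations(events):
    """Return (task name, total seconds across hosts) pairs, slowest first."""
    totals = defaultdict(float)
    for event in events:
        if event.get("event") not in ("runner_on_ok", "runner_on_failed"):
            continue  # only per-host task results carry a duration
        duration = event.get("event_data", {}).get("duration")
        if duration is not None:
            totals[event.get("task", "")] += duration
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Sorting by total time puts the tasks most worth optimizing at the top of the list.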

How much storage does log aggregation need?

Depends on job volume and verbosity. A deployment running 1,000 jobs/day at normal verbosity generates roughly 1-5 GB of logs daily. Higher verbosity (level 3+) increases this 5-10x.
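
The arithmetic behind that estimate can be sketched as follows, using only the figures above: 1-5 GB per 1,000 jobs at normal verbosity and a 5-10x multiplier at high verbosity. The retention period is an input you choose; the function name is hypothetical.

```python
LOW_GB_PER_1000_JOBS = 1.0
HIGH_GB_PER_1000_JOBS = 5.0


def estimate_log_storage_gb(jobs_per_day, retention_days, high_verbosity=False):
    """Return a (low, high) estimate in GB for logs kept over the retention period."""
    thousands_of_jobs = jobs_per_day * retention_days / 1000
    low = thousands_of_jobs * LOW_GB_PER_1000_JOBS
    high = thousands_of_jobs * HIGH_GB_PER_1000_JOBS
    if high_verbosity:
        # Verbosity level 3+ multiplies volume by roughly 5x to 10x.
        low, high = low * 5, high * 10
    return round(low, 1), round(high, 1)


if __name__ == "__main__":
    print(estimate_log_storage_gb(1000, 90))
```

For 1,000 jobs/day kept for 90 days, this gives a range of 90 to 450 GB at normal verbosity, before any compression or replication in the log backend.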

Can I monitor EDA Controller separately?

EDA metrics are available through the Platform Gateway. EDA also logs event processing, rulebook activation status, and action execution to the standard log aggregation pipeline.

What retention should I set for job history?

By default, all job history is kept. For high-volume environments, schedule the Cleanup Activity Stream and Cleanup Job Details jobs under Administration → Management Jobs so that only 90-180 days of history are retained.

Conclusion

Monitoring and logging transform AAP from a black box into an observable, auditable platform. Start with Prometheus metrics and Grafana dashboards for real-time visibility, add alerting for proactive issue detection, and configure log aggregation for compliance and troubleshooting. The investment in observability pays back every time you catch an issue before your users do.
