AAP 2.6 Troubleshooting Guide: Common Issues and Solutions
By Luca Berton · Published 2024-01-01 · Category: installation
Troubleshoot common AAP 2.6 issues: job failures, connectivity problems, performance bottlenecks, database issues, mesh errors, EE problems, and upgrade failures. Diagnostic commands and solutions for every component.
Troubleshooting Approach
When something goes wrong in AAP, work through this diagnostic hierarchy in order:
1. Check the job output: most issues are visible in stdout.
2. Check service health: are all components running?
3. Check logs: system logs reveal infrastructure issues.
4. Check capacity: is the platform overloaded?
5. Check connectivity: can components reach each other?
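The ordering matters because the cheap checks come first. As a rough sketch of the idea, the hierarchy can be modeled as an ordered list of probes that stops at the first failure. The probe functions here are hypothetical placeholders, not AAP APIs; in practice each would wrap whatever check suits your deployment.

```python
from typing import Callable, List, Optional, Tuple

# Each check is a (label, probe) pair; the probe returns True when healthy.
Check = Tuple[str, Callable[[], bool]]


def first_failing_check(checks: List[Check]) -> Optional[str]:
    """Run the checks in order and return the label of the first failure."""
    for label, probe in checks:
        if not probe():
            return label
    return None  # every check passed


# Hypothetical probes for illustration only: replace the lambdas with
# real checks (service status, log scan, capacity query, and so on).
checks: List[Check] = [
    ("job output", lambda: True),
    ("service health", lambda: True),
    ("logs", lambda: False),  # pretend the logs show an error
    ("capacity", lambda: True),
    ("connectivity", lambda: True),
]

print(first_failing_check(checks))  # logs
```

Stopping at the first failure keeps attention on the earliest layer that is broken, which is usually the root cause of anything failing further down the list.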
Service Health Checks
Quick Health Check
Containerized Deployment
RPM Deployment
Operator Deployment (OpenShift)
Job Failures
"Error creating pod" / EE Image Pull Failure
Causes and fixes:
• EE image doesn't exist in registry → Push image to Hub
• Registry authentication failed → Update Container Registry credential
• Network issue → Check execution node can reach Hub
• Image tag wrong → Verify exact image name and tag
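When verifying the exact image name and tag, it can help to split the reference into its parts and compare each one against what the registry actually holds. The following is a minimal sketch, not the parser used by podman or AAP. The function name and the hostnames are illustrative assumptions, and digest references (name@sha256:...) are deliberately not handled.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    registry: Optional[str]
    repository: str
    tag: Optional[str]


def parse_image_ref(ref: str) -> ImageRef:
    """Split 'registry/repo:tag' into parts (simplified; no digest support)."""
    name, tag = ref, None
    last_slash = ref.rfind("/")
    last_colon = ref.rfind(":")
    # A colon marks a tag only when it comes after the last slash;
    # otherwise it belongs to a registry port such as host:5000.
    if last_colon > last_slash:
        name, tag = ref[:last_colon], ref[last_colon + 1:]
    registry = None
    first, sep, rest = name.partition("/")
    # Treat the first component as a registry host when it looks like one.
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, name = first, rest
    return ImageRef(registry, name, tag)


# hub.example.com is a placeholder hostname, not a real registry.
print(parse_image_ref("hub.example.com/ee-custom:1.2"))
```

A reference that parses with no tag is worth a second look, since container runtimes then fall back to the latest tag, which may not exist in a private registry.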
"Host key verification failed"
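Fix: Add host_key_checking = False under [defaults] in the project's ansible.cfg, or set the environment variable ANSIBLE_HOST_KEY_CHECKING=False. Both options turn off SSH host key verification for the job.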
"Permission denied (publickey,password)"
Causes:
• Wrong credential assigned to job template
• SSH key doesn't match authorized_keys on target
• User doesn't exist on target host
• Password expired
"No hosts matched"
Causes:
• Inventory doesn't contain the host group
• Dynamic inventory sync failed
• Host limit pattern wrong
• Inventory not updated since last host change
Fix: Check the inventory sync status and verify that the host limit pattern matches hosts in the inventory.
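One way to confirm that a group really contains hosts is to inspect the JSON that ansible-inventory --list prints, where each group maps to its "hosts" and "children". The sketch below walks that structure for a single group. The sample data and hostnames are made up for illustration, and real limit patterns (wildcards, unions, exclusions) are not evaluated here.

```python
import json
from typing import Dict, Set


def hosts_in_group(inventory: Dict, group: str) -> Set[str]:
    """Collect the hosts of a group, following child groups recursively."""
    seen_groups: Set[str] = set()
    hosts: Set[str] = set()

    def walk(name: str) -> None:
        if name in seen_groups:  # guard against revisiting a group
            return
        seen_groups.add(name)
        entry = inventory.get(name, {})
        hosts.update(entry.get("hosts", []))
        for child in entry.get("children", []):
            walk(child)

    walk(group)
    return hosts


# Made-up sample shaped like `ansible-inventory --list` output.
sample = json.loads("""
{
  "_meta": {"hostvars": {}},
  "all": {"children": ["ungrouped", "web", "db"]},
  "web": {"hosts": ["web1.example.com", "web2.example.com"]},
  "db": {"children": ["db_primary"]},
  "db_primary": {"hosts": ["db1.example.com"]}
}
""")

print(sorted(hosts_in_group(sample, "db")))     # ['db1.example.com']
print(sorted(hosts_in_group(sample, "cache")))  # []
```

An empty result for the group named in the job template's limit points to an inventory or sync problem rather than a playbook problem.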
Timeout Errors
Causes:
• become_password not set or wrong
• Slow network to target host
• Target host under heavy load
• Become method incompatible
"Module not found"
Cause: The module's collection is not in the Execution Environment. Fix, either:
• Build a custom EE with the required collection
• Add the collection to the project's collections/requirements.yml
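When the failing task uses a fully qualified collection name (FQCN), the first two dot-separated parts identify the collection that has to be present. A small sketch of that lookup follows. The helper name is hypothetical, and short module names cannot be resolved this way.

```python
from typing import Optional


def collection_of(module_name: str) -> Optional[str]:
    """Return 'namespace.collection' for an FQCN, or None for a short name."""
    parts = module_name.split(".")
    if len(parts) < 3:
        return None  # short name such as 'ufw': the collection is unknown
    # ansible.builtin ships with ansible-core, so it never needs installing.
    return ".".join(parts[:2])


print(collection_of("community.general.ufw"))  # community.general
print(collection_of("ufw"))                    # None
```

The returned value is the name to look for in the EE definition or in collections/requirements.yml.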
Database Issues
Connection Errors
Check:
Database Full / Out of Space
Fix: Configure cleanup jobs:
• Administration → Management Jobs → Cleanup Job Details: purge old job records
• Cleanup Activity Stream: purge old activity log entries
• Set retention to 90-180 days for production
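To see what a given retention setting means in practice, it helps to compute the cutoff date before which records would be purged. This is plain date arithmetic, not a call into AAP, and the function name is an illustrative assumption.

```python
from datetime import datetime, timedelta, timezone


def retention_cutoff(days: int, now: datetime) -> datetime:
    """Return the timestamp before which records fall outside retention."""
    if days < 0:
        raise ValueError("retention days must not be negative")
    return now - timedelta(days=days)


# A fixed reference date keeps the example reproducible.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
print(retention_cutoff(90, now).date())   # 2024-04-01
print(retention_cutoff(180, now).date())  # 2024-01-02
```

Checking the cutoff against audit or compliance requirements before enabling the cleanup avoids purging records that still need to be kept.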
Slow Queries
Automation Mesh Issues
Execution Node Unreachable
Diagnostic steps:
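A basic first step is confirming that the node's receptor listener accepts TCP connections from the control plane. The following is a minimal sketch using only the Python standard library. It assumes the mesh listens on port 27199, the usual default, so substitute the port from your own mesh configuration. It shows only that the port is open, not that receptor itself is healthy.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, tcp_reachable("exec-node.example.com", 27199) run from a control node (the hostname is a placeholder) returns False when a firewall blocks the port or the service is down.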
Node Shows Zero Capacity
Common causes:
• Execution node out of memory
• Receptor not connected
• Node disabled in settings
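These causes can be narrowed down from the node's instance record. The sketch below works on a plain dictionary and assumes it carries capacity, enabled, and errors keys, similar to what the Controller instances API returns. Treat those field names as assumptions and check them against your own API output.

```python
from typing import Dict, List


def zero_capacity_hints(instance: Dict) -> List[str]:
    """Suggest likely reasons for a node reporting zero capacity."""
    hints: List[str] = []
    if instance.get("capacity", 0) > 0:
        return hints  # the node has capacity, nothing to explain
    if not instance.get("enabled", True):
        hints.append("node is disabled in settings")
    if instance.get("errors"):
        hints.append("health check reported: " + str(instance["errors"]))
    if not hints:
        hints.append("no cause in the record; check receptor and node memory")
    return hints


# Hypothetical record for illustration only.
print(zero_capacity_hints({"capacity": 0, "enabled": False, "errors": ""}))
```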
Jobs Stuck in "Waiting"
Performance Issues
Slow Job Execution
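Diagnostic checklist:
• Check the forks setting: increase it for more parallelism
• Check execution node capacity: add more nodes if it is exhausted
• Check network latency to managed hosts
• Check whether gather_facts is needed (disable it if not)
• Check for slow modules: enable a timing callback such as callbacks_enabled = ansible.posix.profile_tasks for per-task timing (callback_whitelist is the older, deprecated name of this setting, and the timer callback reports only total playbook run time)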
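The effect of the forks setting is easy to estimate: with the default linear strategy, a task runs on at most forks hosts at a time, so each task needs roughly the host count divided by forks, rounded up, in rounds. This is a back-of-the-envelope sketch; the function name is illustrative and the estimate ignores per-host variation.

```python
import math


def rounds_per_task(host_count: int, forks: int) -> int:
    """Estimate how many rounds one task needs to cover all hosts."""
    if forks < 1:
        raise ValueError("forks must be at least 1")
    return math.ceil(host_count / forks)


print(rounds_per_task(500, 5))   # 100 (5 is the Ansible default for forks)
print(rounds_per_task(500, 50))  # 10
```

Raising forks only helps while the execution node has the CPU and memory to run that many workers, which is why the capacity check comes next in the list.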
Platform Gateway Slow
High Memory Usage
Upgrade Issues
Pre-Upgrade Checklist
"Migration failed" During Upgrade
Services Won't Start After Upgrade
Useful CLI Commands
awx-manage (Controller)
receptorctl (Mesh)
FAQ
How do I enable debug logging?
• For Controller: Set AWX_TASK_LOG_LEVEL=DEBUG in settings.
• For Gateway: Set GATEWAY_LOG_LEVEL=DEBUG.
• For Receptor: Set --log-level debug in receptor configuration.
Remember to revert after debugging: debug logging is verbose and impacts performance.
How do I recover from a corrupted database?
Restore from the most recent backup. If no backup exists, check for PostgreSQL WAL files that might allow point-in-time recovery. As a last resort, the installer can rebuild the database (losing all data).
Why do jobs succeed in CLI but fail in Controller?
Common causes: different user context (Controller runs as awx), different Python environment (jobs run in EEs), different working directory, missing environment variables, or credential injection differences.
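For the environment-variable case, comparing the two environments side by side often exposes the difference quickly. The sketch below assumes both environments have already been captured as dictionaries by whatever means suits your setup. The variable names in the sample are made up.

```python
from typing import Dict, List, Tuple


def env_diff(cli_env: Dict[str, str],
             job_env: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (variables missing from the job, variables whose values differ)."""
    missing = sorted(set(cli_env) - set(job_env))
    changed = sorted(
        key for key in set(cli_env) & set(job_env)
        if cli_env[key] != job_env[key]
    )
    return missing, changed


# Made-up sample values for illustration only.
cli = {"HOME": "/home/alice", "HTTPS_PROXY": "http://proxy:3128", "LANG": "C"}
job = {"HOME": "/runner", "LANG": "C"}
print(env_diff(cli, job))  # (['HTTPS_PROXY'], ['HOME'])
```

Variables that exist only on the CLI side, such as proxy settings, are common culprits when a job can reach a target from a shell but not from Controller.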
How do I reset the admin password?
Where are the log files?
| Component | Containerized | RPM |
|-----------|---------------|-----|
| Controller | podman logs automation-controller- | /var/log/tower/ |
| Gateway | podman logs automation-gateway | /var/log/automation-gateway/ |
| Hub | podman logs automation-hub- | /var/log/pulp/ |
| EDA | podman logs automation-eda-* | /var/log/automation-eda/ |
| Receptor | podman logs receptor | /var/log/receptor/ |
Conclusion
Effective troubleshooting in AAP 2.6 requires understanding the architecture — knowing which component owns which functionality and where to look for errors. Start with the job output, work through service health and logs, and use the diagnostic commands in this guide to pinpoint issues quickly.
Related Articles
• AAP 2.6 Architecture and Components: Complete Guide
• AAP 2.6 Monitoring and Logging: Prometheus, Grafana, and Log Aggregation
• AAP 2.6 Automation Mesh: Distributed Execution Across Sites and Networks
• AAP 2.6 Backup, Restore, and Disaster Recovery Guide
• AAP 2.6 Execution Environments: Build, Manage, and Deploy Custom EEs