AAP 2.6 Troubleshooting Guide: Common Issues and Solutions
By Luca Berton · Published 2024-01-01 · Category: installation
Troubleshoot common AAP 2.6 issues: job failures, connectivity problems, performance bottlenecks, database issues, mesh errors, EE problems, and upgrade failures.
Troubleshooting Approach
When something goes wrong in AAP, follow this diagnostic hierarchy:
1. Check the job output: most issues are visible in stdout.
2. Check service health: are all components running?
3. Check logs: system logs reveal infrastructure issues.
4. Check capacity: is the platform overloaded?
5. Check connectivity: can components reach each other?
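Everything after the job output in this hierarchy can be scripted. The sketch below is one possible way to run checks in a fixed order and stop at the first failure. The function name `run_checks` and the `label::command` argument format are illustrative assumptions, not part of AAP.

```shell
# Run diagnostic checks in order and stop at the first one that fails.
# Each argument has the form "label::command" (a made-up convention for this sketch).
run_checks() {
    local entry label cmd
    for entry in "$@"; do
        label=${entry%%::*}   # text before the first "::"
        cmd=${entry#*::}      # text after the first "::"
        if bash -c "$cmd" >/dev/null 2>&1; then
            printf 'OK    %s\n' "$label"
        else
            printf 'FAIL  %s\n' "$label"
            return 1
        fi
    done
}
```

A call might look like `run_checks "controller ping::curl -sfk https://gateway.example.org/api/controller/v2/ping/" "receptor::systemctl is-active --quiet receptor"`, with the commands swapped for whatever matches your deployment type.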
Service Health Checks
Quick Health Check
# API health (no auth required)
curl -s -k "https://gateway.example.org/api/controller/v2/ping/"
# Expected: {"ha": true, "version": "4.6.x", "active_node": "controller1"}
# Check all instances
curl -s -k -H "Authorization: Bearer $TOKEN" \
"https://gateway.example.org/api/controller/v2/instances/" | \
jq '.results[] | {hostname: .hostname, node_type: .node_type, capacity: .capacity, errors: .errors, version: .version}'
Containerized Deployment
# Check all containers
podman ps --format "{{.Names}} {{.Status}}" | sort
# Check specific component
podman logs automation-controller-web -f --tail 50
podman logs automation-controller-task -f --tail 50
podman logs automation-gateway -f --tail 50
podman logs automation-hub-web -f --tail 50
podman logs automation-eda-api -f --tail 50
# Restart a component
podman restart automation-controller-web
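With many containers it is easy to miss the one that is down. The following is a small sketch that filters the name and status listing for anything not reporting `Up`. The function name `not_running` is hypothetical. Note that `podman ps` lists only running containers unless `-a` is passed, so stopped containers appear only with `-a`.

```shell
# Print the names of containers whose status does not start with "Up".
# Reads "<name> <status...>" lines on stdin, e.g. from:
#   podman ps -a --format "{{.Names}} {{.Status}}"
not_running() {
    awk '$2 != "Up" { print $1 }'
}
```

Usage: `podman ps -a --format "{{.Names}} {{.Status}}" | not_running`. A container that is up but unhealthy is not flagged by this filter, so the logs remain the authoritative source.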
RPM Deployment
# Check service status
systemctl status automation-controller-service
systemctl status automation-hub-service
systemctl status automation-eda-service
systemctl status automation-gateway-service
# Check logs
journalctl -u automation-controller-service -f --no-pager -n 100
journalctl -u automation-gateway-service -f --no-pager -n 100
Operator Deployment (OpenShift)
# Check pod status
oc get pods -n ansible-automation-platform
oc describe pod <pod-name> -n ansible-automation-platform
oc logs <pod-name> -n ansible-automation-platform -f
# Check operator
oc get csv -n ansible-automation-platform
oc logs deployment/aap-operator-controller-manager -n ansible-automation-platform
Job Failures
"Error creating pod" / EE Image Pull Failure
ERROR: ImagePullBackOff: Back-off pulling image "hub.example.org/ee-custom:1.0"
Causes and fixes:
• EE image doesn't exist in registry → Push image to Hub
• Registry authentication failed → Update Container Registry credential
• Network issue → Check execution node can reach Hub
• Image tag wrong → Verify exact image name and tag
# Test image pull on execution node
podman pull hub.example.org/ee-custom:1.0
# Check credential
podman login hub.example.org
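When verifying the exact image name and tag, it can help to split the reference into its parts and compare each against what the registry holds. This is a minimal sketch, assuming references of the form `registry/repository[:tag]` without a digest. `split_image_ref` is a hypothetical helper name.

```shell
# Split an image reference into registry, repository, and tag.
# Assumes "registry/repository[:tag]" with no digest (@sha256:...).
split_image_ref() {
    local ref=$1 registry rest repo tag
    registry=${ref%%/*}      # everything before the first "/"
    rest=${ref#*/}           # everything after the first "/"
    case ${rest##*/} in
        *:*) repo=${rest%:*}; tag=${rest##*:} ;;
        *)   repo=$rest;      tag=latest ;;   # an omitted tag resolves to "latest"
    esac
    printf 'registry=%s\nrepository=%s\ntag=%s\n' "$registry" "$repo" "$tag"
}
```

For example, `split_image_ref hub.example.org/ee-custom:1.0` separates the registry host from the repository and tag, which makes a mistyped tag or a missing namespace easier to spot.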
"Host key verification failed"
UNREACHABLE! => {"msg": "Failed to connect: Host key verification failed."}
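Fix: Add host_key_checking = False under [defaults] in the project's ansible.cfg, set the ANSIBLE_HOST_KEY_CHECKING=False environment variable, or use one of the following variables. All of these disable SSH host key verification, so apply them only to hosts you trust: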
# In job template extra variables
ansible_host_key_checking: false
# Or in inventory group variables
ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
"Permission denied (publickey,password)"
Causes:
• Wrong credential assigned to job template
• SSH key doesn't match authorized_keys on target
• User doesn't exist on target host
• Password expired
# Test from execution node
ssh -i /path/to/key ansible@target-host
"No hosts matched"
[WARNING]: Could not match supplied host pattern, ignoring: webservers
Causes:
• Inventory doesn't contain the host group
• Dynamic inventory sync failed
• Host limit pattern wrong
• Inventory not updated since last host change
Fix: Check inventory sync status, verify host patterns.
Timeout Errors
TASK [Deploy application] *****
fatal: [web01]: FAILED! => {"msg": "Timeout waiting for privilege escalation prompt"}
Causes:
• become_password not set or wrong
• Slow network to target host
• Target host under heavy load
• Become method incompatible
"Module not found"
ERROR! couldn't resolve module/action 'community.general.ufw'
Fix: The module's collection is not in the Execution Environment. Either:
• Build a custom EE with the required collection
• Add the collection to the project's collections/requirements.yml
Database Issues
Connection Errors
OperationalError: could not connect to server: Connection refused
Check:
# Test database connectivity
psql -h db.example.org -U postgres -c "SELECT 1;"
# Check PostgreSQL is running
systemctl status postgresql
# Check connection count
psql -h db.example.org -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
# Check max connections
psql -h db.example.org -U postgres -c "SHOW max_connections;"
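The last two queries are most useful when read together: connection errors often appear once the active count approaches max_connections. The sketch below compares the two numbers. The function name `conn_usage` and the 80% default warning threshold are assumptions chosen for illustration.

```shell
# Report connection usage as a percentage of max_connections.
# Arguments: current connection count, max_connections, warning threshold in % (default 80).
conn_usage() {
    local current=$1 max=$2 warn=${3:-80} pct
    pct=$(( current * 100 / max ))   # integer percentage, rounded down
    if [ "$pct" -ge "$warn" ]; then
        printf 'WARN %d%% (%d/%d)\n' "$pct" "$current" "$max"
    else
        printf 'OK %d%% (%d/%d)\n' "$pct" "$current" "$max"
    fi
}
```

The inputs can come from the same queries with unaligned, tuples-only output, for example `psql -h db.example.org -U postgres -Atc "SELECT count(*) FROM pg_stat_activity;"`.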
Database Full / Out of Space
# Check database sizes
psql -h db.example.org -U postgres -c "
SELECT datname, pg_size_pretty(pg_database_size(datname))
FROM pg_database ORDER BY pg_database_size(datname) DESC;"
# Find large tables
psql -h db.example.org -U postgres -d controller -c "
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10;"
Fix: Configure cleanup jobs:
• Administration → Management Jobs → Cleanup Job Details: purge old job records
• Cleanup Activity Stream: purge old activity log entries
• Set retention to 90-180 days for production
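Before purging records, it is worth confirming which filesystem is actually running out of space on the database host. A minimal sketch that filters `df -P` output follows. `full_filesystems` is a hypothetical name, and the 80% default is an arbitrary illustrative threshold.

```shell
# Read `df -P` output on stdin and print mount points at or above a usage threshold.
# Argument: threshold in percent (default 80).
full_filesystems() {
    awk -v limit="${1:-80}" '
        NR > 1 {                    # skip the header line
            use = $5
            sub(/%/, "", use)       # "91%" -> "91"
            if (use + 0 >= limit + 0) print $6, $5
        }
    '
}
```

Usage: `df -P | full_filesystems 85`. Mount points containing spaces are not handled by this sketch.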
Slow Queries
# Find long-running queries
psql -h db.example.org -U postgres -d controller -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"
Automation Mesh Issues
Execution Node Unreachable
ERROR: receptor connection to exec1.example.org:27199 failed: dial tcp: connection refused
Diagnostic steps:
# 1. Check Receptor is running on execution node
systemctl status receptor
# 2. Check port is open
ss -tlnp | grep 27199
# 3. Test connectivity from controller
nc -zv exec1.example.org 27199
# 4. Check firewall
firewall-cmd --list-all
# 5. Check Receptor logs
journalctl -u receptor -f --no-pager -n 50
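# 6. Check mesh status (the output includes known nodes and the route table)
receptorctl --socket /var/run/receptor/receptor.sock status
# 7. Trace the route to the execution node
receptorctl --socket /var/run/receptor/receptor.sock traceroute exec1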
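If `nc` is not installed on the controller, the connectivity test can be approximated with bash's built-in `/dev/tcp` redirection. This is a sketch under that assumption. `port_open` is a hypothetical helper and the 3-second timeout is an arbitrary choice.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# /dev/tcp is a bash feature, so the probe is run through bash explicitly.
port_open() {
    timeout 3 bash -c ': </dev/tcp/$0/$1' "$1" "$2" 2>/dev/null
}
```

Usage: `port_open exec1.example.org 27199 && echo reachable || echo unreachable`. A refused connection usually means Receptor is not listening, while a timeout points more towards a firewall or routing problem.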
Node Shows Zero Capacity
# Check node in API
curl -s -k -H "Authorization: Bearer $TOKEN" \
"https://gateway.example.org/api/controller/v2/instances/?hostname=exec1.example.org" | \
jq '.results[0] | {hostname: .hostname, capacity: .capacity, errors: .errors, cpu: .cpu, memory: .memory}'
Common causes:
• Execution node out of memory
• Receptor not connected
• Node disabled in settings
Jobs Stuck in "Waiting"
# Check for capacity
curl -s -k -H "Authorization: Bearer $TOKEN" \
"https://gateway.example.org/api/controller/v2/instances/" | \
jq '.results[] | {hostname: .hostname, remaining_capacity: .remaining_capacity}'
# Check instance group assignment
curl -s -k -H "Authorization: Bearer $TOKEN" \
"https://gateway.example.org/api/controller/v2/instance_groups/" | \
jq '.results[] | {name: .name, capacity: .capacity, consumed_capacity: .consumed_capacity}'
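Jobs typically wait when no eligible node has capacity left. The sketch below lists nodes with no remaining capacity and signals through its exit status when every node is exhausted. It expects one `hostname remaining_capacity` pair per line, and `exhausted_nodes` is a hypothetical name.

```shell
# Read "<hostname> <remaining_capacity>" lines and print nodes with nothing left.
# Exit status is 1 when every listed node is exhausted, 0 otherwise.
exhausted_nodes() {
    awk '
        { total++ }
        $2 <= 0 { print $1; empty++ }
        END { exit ((total > 0 && empty == total) ? 1 : 0) }
    '
}
```

The instances query can produce that input by changing the jq filter to `jq -r '.results[] | "\(.hostname) \(.remaining_capacity)"'` and piping the result into `exhausted_nodes`.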
Performance Issues
Slow Job Execution
Diagnostic checklist:
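• Check the forks setting: increase it for more parallelism
• Check execution node capacity: add more nodes if it is exhausted
• Check network latency to managed hosts
• Check whether gather_facts is needed (disable it if not)
• Check for slow tasks (set callbacks_enabled = ansible.posix.profile_tasks for per-task timing; callbacks_enabled replaced callback_whitelist in ansible-core 2.11, and the timer callback reports only the total playbook run time)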
# Speed optimizations in ansible.cfg
[defaults]
forks = 50
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/facts_cache
fact_caching_timeout = 3600
[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
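To see why forks matters, consider that Ansible runs a task on at most `forks` hosts at once, so a task over a large inventory proceeds in roughly ceil(hosts / forks) waves. The sketch below does that arithmetic. `task_batches` is a hypothetical name, and the result is a rough estimate because workers pick up the next host as soon as they finish.

```shell
# Approximate number of waves needed to run one task on all hosts: ceil(hosts / forks).
task_batches() {
    local hosts=$1 forks=$2
    echo $(( (hosts + forks - 1) / forks ))
}
```

For 400 hosts, `task_batches 400 5` gives 80 with Ansible's default of 5 forks, while `task_batches 400 50` gives 8 with the setting shown above. Higher forks values also consume more memory and capacity on the execution node.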
Platform Gateway Slow
# Check Gateway logs
podman logs automation-gateway -f --tail 50
# Check Redis
podman exec -it redis redis-cli ping
podman exec -it redis redis-cli info memory
# Check database connection pool
podman logs automation-gateway 2>&1 | grep -i "connection\|pool\|timeout"
High Memory Usage
# Check per-component memory
podman stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}"
# For RPM installs
ps aux --sort=-%mem | head -20
free -h
Upgrade Issues
Pre-Upgrade Checklist
# 1. Backup everything
./setup.sh -b
# 2. Check current version
curl -s -k "https://gateway.example.org/api/controller/v2/ping/" | jq '.version'
# 3. Check database migrations
awx-manage showmigrations | grep "\[ \]"
# 4. Check disk space
df -h
# 5. Check for running jobs (drain first)
curl -s -k -H "Authorization: Bearer $TOKEN" \
"https://gateway.example.org/api/controller/v2/jobs/?status=running" | jq '.count'
"Migration failed" During Upgrade
# Check migration status
awx-manage showmigrations --list
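# Preview pending migrations (--plan lists them without applying anything)
awx-manage migrate --plan
# Apply pending migrations manually
awx-manage migrate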
# If stuck, check for locks
psql -h db.example.org -U postgres -d controller -c "
SELECT * FROM pg_locks WHERE NOT granted;"
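When reading showmigrations output, a count of unapplied entries is often enough to tell whether an upgrade stopped partway. A minimal sketch, relying on Django's convention of marking applied migrations with `[X]` and pending ones with `[ ]`. `pending_migrations` is a hypothetical name.

```shell
# Count unapplied migrations in `awx-manage showmigrations` output read from stdin.
# grep -c exits non-zero when the count is 0, so the status is normalized to success.
pending_migrations() {
    grep -c '\[ \]' || true
}
```

Usage: `awx-manage showmigrations | pending_migrations`. A non-zero count after an upgrade suggests migrations still need to be applied.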
Services Won't Start After Upgrade
# Check for configuration changes
diff /etc/ansible-automation-platform/controller.conf /etc/ansible-automation-platform/controller.conf.bak
# Verify database connectivity
awx-manage check_db
# Check for incompatible settings
awx-manage validate_settings
Useful CLI Commands
awx-manage (Controller)
# Check database connectivity
awx-manage check_db
# Run pending migrations
awx-manage migrate
# Change the admin password
awx-manage changepassword admin
# Export/import assets
awx-manage export_assets --all > assets.json
awx-manage import_assets < assets.json
# Purge old job records (preview with --dry-run first); this does not cancel stuck jobs
awx-manage cleanup_jobs --dry-run
awx-manage cleanup_jobs
# Show settings
awx-manage print_settings
receptorctl (Mesh)
# Node status
receptorctl --socket /var/run/receptor/receptor.sock status
# Trace the route to a mesh node (the route table itself is part of the status output)
receptorctl --socket /var/run/receptor/receptor.sock traceroute exec1
# Ping a mesh node
receptorctl --socket /var/run/receptor/receptor.sock ping exec1
# List active work units
receptorctl --socket /var/run/receptor/receptor.sock work list
FAQ
How do I enable debug logging?
For Controller: Set AWX_TASK_LOG_LEVEL=DEBUG in settings. For Gateway: Set GATEWAY_LOG_LEVEL=DEBUG. For Receptor: Set --log-level debug in receptor configuration. Remember to revert after debugging — debug logging is verbose and impacts performance.
How do I recover from a corrupted database?
Restore from the most recent backup. If no backup exists, check for PostgreSQL WAL files that might allow point-in-time recovery. As a last resort, the installer can rebuild the database (losing all data).
Why do jobs succeed in CLI but fail in Controller?
Common causes: different user context (Controller runs as awx), different Python environment (jobs run in EEs), different working directory, missing environment variables, or credential injection differences.
How do I reset the admin password?
# Containerized
podman exec -it automation-controller-task awx-manage changepassword admin
# RPM
awx-manage changepassword admin
Where are the log files?
| Component | Containerized | RPM |
|-----------|---------------|-----|
| Controller | podman logs automation-controller- | /var/log/tower/ |
| Gateway | podman logs automation-gateway | /var/log/automation-gateway/ |
| Hub | podman logs automation-hub- | /var/log/pulp/ |
| EDA | podman logs automation-eda-* | /var/log/automation-eda/ |
| Receptor | podman logs receptor | /var/log/receptor/ |
Conclusion
Effective troubleshooting in AAP 2.6 requires understanding the architecture — knowing which component owns which functionality and where to look for errors. Start with the job output, work through service health and logs, and use the diagnostic commands in this guide to pinpoint issues quickly.
Related Articles
• AAP 2.6 Architecture and Components: Complete Guide
• AAP 2.6 Monitoring and Logging: Prometheus, Grafana, and Log Aggregation
• AAP 2.6 Automation Mesh: Distributed Execution Across Sites and Networks
• AAP 2.6 Backup, Restore, and Disaster Recovery Guide
• AAP 2.6 Execution Environments: Build, Manage, and Deploy Custom EEs