AAP 2.6 Troubleshooting Guide: Common Issues and Solutions

By Luca Berton · Published 2024-01-01 · Category: installation

Troubleshoot common AAP 2.6 issues: job failures, connectivity problems, performance bottlenecks, database issues, mesh errors, EE problems, and upgrade failures.

Troubleshooting Approach

When something goes wrong in AAP, follow this diagnostic hierarchy:

Check the job output — most issues are visible in stdout
Check service health — are all components running?
Check logs — system logs reveal infrastructure issues
Check capacity — is the platform overloaded?
Check connectivity — can components reach each other?
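
The automatable steps of this hierarchy can be chained so that triage stops at the first failing layer. The following is a minimal sketch, not part of AAP: `run_triage` and the check names passed to it are hypothetical, and each check is assumed to be a shell function or command that returns 0 when healthy.

```shell
# Run the given checks in order and stop at the first one that fails.
# Each argument is the name of a function or command that exits 0 when healthy.
run_triage() {
  for check in "$@"; do
    if "$check" >/dev/null 2>&1; then
      echo "ok: $check"
    else
      echo "FAIL: $check"
      return 1
    fi
  done
  return 0
}

# Usage (the check_* functions are placeholders you would define yourself,
# wrapping the commands shown later in this guide):
#   run_triage check_services check_logs check_capacity check_connectivity
```

Stopping at the first failure keeps attention on the lowest broken layer instead of the symptoms it causes further up.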

Service Health Checks

Quick Health Check

# API health (no auth required)
curl -s -k "https://gateway.example.org/api/controller/v2/ping/"
# Expected (abridged): {"ha": true, "version": "<controller version>", "active_node": "controller1", ...}

# Check all instances
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/instances/" | \
  jq '.results[] | {hostname: .hostname, node_type: .node_type, capacity: .capacity, errors: .errors, version: .version}'
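
For monitoring or scripted checks, the HTTP status code is often easier to test than the JSON body. A minimal sketch, assuming the same ping endpoint as above; the function names `ping_status` and `aap_ping_ok` are hypothetical helpers, and the 10-second timeout is an arbitrary choice.

```shell
# Print the HTTP status code returned by the controller ping endpoint.
# $1 is the gateway base URL, for example https://gateway.example.org
ping_status() {
  curl -s -k -o /dev/null -w '%{http_code}' --max-time 10 "$1/api/controller/v2/ping/"
}

# Succeed only when the endpoint answers with HTTP 200.
aap_ping_ok() {
  [ "$(ping_status "$1")" = "200" ]
}

# Usage:
#   aap_ping_ok https://gateway.example.org || echo "controller API is not healthy"
```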

Containerized Deployment

# Check all containers
podman ps --format "{{.Names}} {{.Status}}" | sort

# Check specific component
podman logs automation-controller-web -f --tail 50
podman logs automation-controller-task -f --tail 50
podman logs automation-gateway -f --tail 50
podman logs automation-hub-web -f --tail 50
podman logs automation-eda-api -f --tail 50

# Restart a component
podman restart automation-controller-web
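
Note that `podman ps` without `-a` lists only running containers, so a crashed component simply disappears from the output. A small sketch that filters the same format string for containers that are not up; `not_running` is a hypothetical helper name.

```shell
# Read "name status..." lines on stdin and print the names of containers
# whose status does not start with "Up" (for example "Exited" or "Created").
not_running() {
  awk '$2 != "Up" { print $1 }'
}

# Usage (-a includes stopped containers):
#   podman ps -a --format "{{.Names}} {{.Status}}" | not_running
```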

RPM Deployment

# Check service status
systemctl status automation-controller-service
systemctl status automation-hub-service
systemctl status automation-eda-service
systemctl status automation-gateway-service

# Check logs
journalctl -u automation-controller-service -f --no-pager -n 100
journalctl -u automation-gateway-service -f --no-pager -n 100

Operator Deployment (OpenShift)

# Check pod status
oc get pods -n ansible-automation-platform
oc describe pod <pod-name> -n ansible-automation-platform
oc logs <pod-name> -n ansible-automation-platform -f

# Check operator
oc get csv -n ansible-automation-platform
oc logs deployment/aap-operator-controller-manager -n ansible-automation-platform

Job Failures

"Error creating pod" / EE Image Pull Failure

ERROR: ImagePullBackOff: Back-off pulling image "hub.example.org/ee-custom:1.0"

Causes and fixes:

EE image doesn't exist in registry → Push image to Hub
Registry authentication failed → Update Container Registry credential
Network issue → Check execution node can reach Hub
Image tag wrong → Verify exact image name and tag

# Test image pull on execution node
podman pull hub.example.org/ee-custom:1.0

# Check credential
podman login hub.example.org

"Host key verification failed"

UNREACHABLE! => {"msg": "Failed to connect: Host key verification failed."}

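Fix: Set host_key_checking = False under [defaults] in the project's ansible.cfg (or export ANSIBLE_HOST_KEY_CHECKING=False), or set one of the variables below. Disabling the check removes protection against spoofed hosts, so limit it to environments where that risk is acceptable: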

# In job template extra variables
ansible_host_key_checking: false

# Or in inventory group variables
ansible_ssh_common_args: '-o StrictHostKeyChecking=no'

"Permission denied (publickey,password)"

Causes:

Wrong credential assigned to job template
SSH key doesn't match authorized_keys on target
User doesn't exist on target host
Password expired

# Test from execution node
ssh -i /path/to/key ansible@target-host

"No hosts matched"

[WARNING]: Could not match supplied host pattern, ignoring: webservers

Causes:

Inventory doesn't contain the host group
Dynamic inventory sync failed
Host limit pattern wrong
Inventory not updated since last host change

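Fix: Confirm that the last inventory sync succeeded, then verify that the group name and the job template's limit pattern match what the inventory actually contains.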
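
A host pattern can also be tested outside a job with `ansible <pattern> --list-hosts`, which prints a `hosts (N):` summary line. The sketch below extracts that count; `matched_hosts` is a hypothetical helper, and `inventory.yml` in the usage line stands in for a local copy of the inventory.

```shell
# Read `ansible <pattern> --list-hosts` output on stdin and print the number
# of matched hosts, taken from the "hosts (N):" summary line.
matched_hosts() {
  sed -n 's/^ *hosts (\([0-9][0-9]*\)):.*/\1/p' | head -n 1
}

# Usage:
#   ansible webservers -i inventory.yml --list-hosts 2>/dev/null | matched_hosts
```

A result of 0 means the pattern matched nothing in that inventory, which points to the group name or limit rather than to connectivity.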

Timeout Errors

TASK [Deploy application] *****
fatal: [web01]: FAILED! => {"msg": "Timeout waiting for privilege escalation prompt"}

Causes:

become_password not set or wrong
Slow network to target host
Target host under heavy load
Become method incompatible

"Module not found"

ERROR! couldn't resolve module/action 'community.general.ufw'

Fix: Module is not in the Execution Environment. Either:

Build a custom EE with the required collection
Add the collection to the project's collections/requirements.yml
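
For the second option, the file only needs to name the collection. A minimal sketch that writes it for the `community.general` example above; `write_requirements` is a hypothetical helper, and no version is pinned here, which you may want to add.

```shell
# Write a minimal collections/requirements.yml under the project directory $1.
write_requirements() {
  mkdir -p "$1/collections" || return 1
  cat > "$1/collections/requirements.yml" <<'EOF'
---
collections:
  - name: community.general
EOF
}

# Usage (run from a checkout of the project repository, then commit and sync):
#   write_requirements .
```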

Database Issues

Connection Errors

OperationalError: could not connect to server: Connection refused

Check:

# Test database connectivity
psql -h db.example.org -U postgres -c "SELECT 1;"

# Check PostgreSQL is running
systemctl status postgresql

# Check connection count
psql -h db.example.org -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Check max connections
psql -h db.example.org -U postgres -c "SHOW max_connections;"
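
The connection count and `max_connections` are most useful when compared. A minimal sketch of that comparison; `conn_usage_pct` is a hypothetical helper, and the 80 percent threshold in the usage lines is an assumed value, not an AAP recommendation.

```shell
# Print the integer percentage of max_connections in use.
# $1 = current connection count, $2 = max_connections
conn_usage_pct() {
  [ "$2" -gt 0 ] || return 1
  echo $(( $1 * 100 / $2 ))
}

# Usage (-A unaligned, -t tuples only, so psql prints just the number):
#   used=$(psql -h db.example.org -U postgres -Atc "SELECT count(*) FROM pg_stat_activity;")
#   max=$(psql -h db.example.org -U postgres -Atc "SHOW max_connections;")
#   [ "$(conn_usage_pct "$used" "$max")" -ge 80 ] && echo "connections nearly exhausted"
```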

Database Full / Out of Space

# Check database sizes
psql -h db.example.org -U postgres -c "
  SELECT datname, pg_size_pretty(pg_database_size(datname))
  FROM pg_database ORDER BY pg_database_size(datname) DESC;"

# Find large tables
psql -h db.example.org -U postgres -d controller -c "
  SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
  FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10;"

Fix: Configure cleanup jobs:

Administration → Management Jobs → Cleanup Job Details — purge old job records
Cleanup Activity Stream — purge old activity log entries
Set retention to 90-180 days for production

Slow Queries

# Find long-running queries
psql -h db.example.org -U postgres -d controller -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"
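
Once a runaway query has been identified by its pid, PostgreSQL can be asked to cancel it with `pg_cancel_backend()`, which stops the current query but keeps the session (`pg_terminate_backend()` closes the session as well). A cautious sketch that only builds the statement; `cancel_query_sql` is a hypothetical helper and the pid in the usage line is made up.

```shell
# Print the SQL statement that cancels the query running in backend pid $1.
# Rejects anything that is not a plain number.
cancel_query_sql() {
  case "$1" in
    ''|*[!0-9]*) echo "pid must be numeric" >&2; return 1 ;;
  esac
  echo "SELECT pg_cancel_backend($1);"
}

# Usage:
#   psql -h db.example.org -U postgres -d controller -c "$(cancel_query_sql 12345)"
```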

Automation Mesh Issues

Execution Node Unreachable

ERROR: receptor connection to exec1.example.org:27199 failed: dial tcp: connection refused

Diagnostic steps:

# 1. Check Receptor is running on execution node
systemctl status receptor

# 2. Check port is open
ss -tlnp | grep 27199

# 3. Test connectivity from controller
nc -zv exec1.example.org 27199

# 4. Check firewall
firewall-cmd --list-all

# 5. Check Receptor logs
journalctl -u receptor -f --no-pager -n 50

# 6. Check mesh status
receptorctl --socket /var/run/receptor/receptor.sock status
receptorctl --socket /var/run/receptor/receptor.sock routes
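
Step 2 can be scripted by parsing `ss -tln` output rather than grepping for the port number, which avoids false matches on other columns. A minimal sketch; `is_listening` is a hypothetical helper and it assumes the standard `ss -tln` column layout (state first, local address fourth).

```shell
# Read `ss -tln` output on stdin and succeed if a socket listens on port $1.
# The local address is the 4th column, for example 0.0.0.0:27199 or [::]:27199.
is_listening() {
  awk -v port="$1" '
    $1 == "LISTEN" { n = split($4, a, ":"); if (a[n] == port) found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage (on the execution node):
#   ss -tln | is_listening 27199 || echo "Receptor is not listening on 27199"
```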

Node Shows Zero Capacity

# Check node in API
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/instances/?hostname=exec1.example.org" | \
  jq '.results[0] | {hostname: .hostname, capacity: .capacity, errors: .errors, cpu: .cpu, memory: .memory}'

Common causes:

Execution node out of memory
Receptor not connected
Node disabled in settings

Jobs Stuck in "Waiting"

# Check for capacity
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/instances/" | \
  jq '.results[] | {hostname: .hostname, remaining_capacity: .remaining_capacity}'

# Check instance group assignment
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/instance_groups/" | \
  jq '.results[] | {name: .name, capacity: .capacity, consumed_capacity: .consumed_capacity}'
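
To spot the instance groups that have no capacity left, the same API fields can be flattened to tab-separated values and filtered. A minimal sketch; `exhausted_groups` is a hypothetical helper, and it treats a group as exhausted when capacity minus consumed capacity is zero or less.

```shell
# Read tab-separated "name<TAB>capacity<TAB>consumed_capacity" lines on stdin
# and print the names of groups with no remaining capacity.
exhausted_groups() {
  awk -F '\t' '($2 - $3) <= 0 { print $1 }'
}

# Usage:
#   curl -s -k -H "Authorization: Bearer $TOKEN" \
#     "https://gateway.example.org/api/controller/v2/instance_groups/" | \
#     jq -r '.results[] | [.name, .capacity, .consumed_capacity] | @tsv' | exhausted_groups
```

A job template pinned to one of the printed groups will stay in "Waiting" until capacity frees up or the group gains instances.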

Performance Issues

Slow Job Execution

Diagnostic checklist:

Check forks setting — increase for more parallelism
Check execution node capacity — add more nodes
Check network latency to managed hosts
Check if gather_facts is needed (disable if not)
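Check for slow modules (enable a timing callback such as callbacks_enabled = ansible.posix.profile_tasks for per-task timing; callbacks_enabled replaces the older callback_whitelist option)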

# Speed optimizations in ansible.cfg
[defaults]
forks = 50
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/facts_cache
fact_caching_timeout = 3600

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s

Platform Gateway Slow

# Check Gateway logs
podman logs automation-gateway -f --tail 50

# Check Redis
podman exec -it redis redis-cli ping
podman exec -it redis redis-cli info memory

# Check database connection pool
podman logs automation-gateway 2>&1 | grep -i "connection\|pool\|timeout"

High Memory Usage

# Check per-component memory
podman stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}"

# For RPM installs
ps aux --sort=-%mem | head -20
free -h

Upgrade Issues

Pre-Upgrade Checklist

# 1. Backup everything
./setup.sh -b

# 2. Check current version
curl -s -k "https://gateway.example.org/api/controller/v2/ping/" | jq '.version'

# 3. Check database migrations
awx-manage showmigrations | grep "\[ \]"

# 4. Check disk space
df -h

# 5. Check for running jobs (drain first)
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/jobs/?status=running" | jq '.count'

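Step 5 can be turned into a wait loop so the upgrade only starts once no jobs are running. A minimal sketch under stated assumptions: `wait_for_drain` is a hypothetical helper, and the command passed to it must print the current number of running jobs (for example the curl and jq pipeline above wrapped in a function).

```shell
# Poll until no jobs are running.
# $1 = maximum number of polls, $2 = seconds between polls,
# remaining arguments = a command that prints the running job count.
# Returns 0 when drained, 1 on timeout, 2 if the count command fails.
wait_for_drain() {
  tries=$1
  delay=$2
  shift 2
  while [ "$tries" -gt 0 ]; do
    running=$("$@") || return 2
    if [ "$running" -eq 0 ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}

# Usage (running_jobs is a placeholder wrapping the API call from step 5):
#   running_jobs() {
#     curl -s -k -H "Authorization: Bearer $TOKEN" \
#       "https://gateway.example.org/api/controller/v2/jobs/?status=running" | jq '.count'
#   }
#   wait_for_drain 30 10 running_jobs || echo "jobs still running, postpone the upgrade"
```
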
"Migration failed" During Upgrade

# Check migration status
awx-manage showmigrations --list

# Preview pending migrations (--plan only lists them, it applies nothing)
awx-manage migrate --plan

# Apply pending migrations
awx-manage migrate

# If stuck, check for locks
psql -h db.example.org -U postgres -d controller -c "
  SELECT * FROM pg_locks WHERE NOT granted;"

Services Won't Start After Upgrade

# Check for configuration changes
diff /etc/ansible-automation-platform/controller.conf /etc/ansible-automation-platform/controller.conf.bak

# Verify database connectivity
awx-manage check_db

# Check for incompatible settings
awx-manage validate_settings

Useful CLI Commands

awx-manage (Controller)

# Check database connectivity
awx-manage check_db

# Run pending migrations
awx-manage migrate

# Change the admin password
awx-manage changepassword admin

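# Export/import assets
# (if your awx-manage build lacks these subcommands, the awx CLI from awxkit provides "awx export" and "awx import")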
awx-manage export_assets --all > assets.json
awx-manage import_assets < assets.json

# Purge old job records (preview with --dry-run first; this does not cancel stuck jobs)
awx-manage cleanup_jobs --dry-run
awx-manage cleanup_jobs

# Show settings
awx-manage print_settings

receptorctl (Mesh)

# Node status
receptorctl --socket /var/run/receptor/receptor.sock status

# View routing table
receptorctl --socket /var/run/receptor/receptor.sock routes

# Ping a mesh node
receptorctl --socket /var/run/receptor/receptor.sock ping exec1

# List active work units
receptorctl --socket /var/run/receptor/receptor.sock work list

FAQ

How do I enable debug logging?

For Controller: Set AWX_TASK_LOG_LEVEL=DEBUG in settings. For Gateway: Set GATEWAY_LOG_LEVEL=DEBUG. For Receptor: Set --log-level debug in receptor configuration. Remember to revert after debugging — debug logging is verbose and impacts performance.

How do I recover from a corrupted database?

Restore from the most recent backup. If no installer backup exists, check whether a PostgreSQL base backup and archived WAL files are available, since together they allow point-in-time recovery. As a last resort, the installer can rebuild the database, which loses all data.

Why do jobs succeed in CLI but fail in Controller?

Common causes: different user context (Controller runs as awx), different Python environment (jobs run in EEs), different working directory, missing environment variables, or credential injection differences.

How do I reset the admin password?

# Containerized
podman exec -it automation-controller-task awx-manage changepassword admin

# RPM
awx-manage changepassword admin

Where are the log files?

Component	Containerized	RPM
Controller	`podman logs automation-controller-`	`/var/log/tower/`
Gateway	`podman logs automation-gateway`	`/var/log/automation-gateway/`
Hub	`podman logs automation-hub-`	`/var/log/pulp/`
EDA	`podman logs automation-eda-*`	`/var/log/automation-eda/`
Receptor	`podman logs receptor`	`/var/log/receptor/`

Conclusion

Effective troubleshooting in AAP 2.6 requires understanding the architecture — knowing which component owns which functionality and where to look for errors. Start with the job output, work through service health and logs, and use the diagnostic commands in this guide to pinpoint issues quickly.
