AAP 2.6 Troubleshooting Guide: Common Issues and Solutions

By Luca Berton · Published 2024-01-01 · Category: installation

Troubleshoot common AAP 2.6 issues: job failures, connectivity problems, performance bottlenecks, database issues, mesh errors, EE problems, and upgrade failures.

Troubleshooting Approach

When something goes wrong in AAP, follow this diagnostic hierarchy:

1. Check the job output: most issues are visible in stdout.
2. Check service health: are all components running?
3. Check logs: system logs reveal infrastructure issues.
4. Check capacity: is the platform overloaded?
5. Check connectivity: can components reach each other?
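The ordering above can be sketched as a small shell helper that runs checks in sequence and stops at the first failure, so the earliest broken layer is reported first. The function name run_checks and the "label::command" convention are hypothetical, not part of AAP; substitute the real checks from the sections below.

```shell
# run_checks: run each "label::command" pair in order, stop at the first failure.
# Hypothetical helper for illustration only.
run_checks() {
  local entry label cmd
  for entry in "$@"; do
    label="${entry%%::*}"   # text before the first "::"
    cmd="${entry#*::}"      # text after the first "::"
    if sh -c "$cmd" >/dev/null 2>&1; then
      echo "OK   $label"
    else
      echo "FAIL $label"
      return 1
    fi
  done
}
```

For example, run_checks "service health::systemctl is-active receptor" "connectivity::nc -z exec1.example.org 27199" would report the first check that fails and skip the rest.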

See also: Ansible Private Automation Hub: Host & Manage Collections (Guide)

Service Health Checks

Quick Health Check

# API health (no auth required)
curl -s -k "https://gateway.example.org/api/controller/v2/ping/"
# Expected: {"ha": true, "version": "4.6.x", "active_node": "controller1"}

# Check all instances
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/instances/" | \
  jq '.results[] | {hostname: .hostname, node_type: .node_type, capacity: .capacity, errors: .errors, version: .version}'

Containerized Deployment

# Check all containers
podman ps --format "{{.Names}} {{.Status}}" | sort

# Check specific component
podman logs automation-controller-web -f --tail 50
podman logs automation-controller-task -f --tail 50
podman logs automation-gateway -f --tail 50
podman logs automation-hub-web -f --tail 50
podman logs automation-eda-api -f --tail 50

# Restart a component
podman restart automation-controller-web

RPM Deployment

# Check service status
systemctl status automation-controller-service
systemctl status automation-hub-service
systemctl status automation-eda-service
systemctl status automation-gateway-service

# Check logs
journalctl -u automation-controller-service -f --no-pager -n 100
journalctl -u automation-gateway-service -f --no-pager -n 100

Operator Deployment (OpenShift)

# Check pod status
oc get pods -n ansible-automation-platform
oc describe pod <pod-name> -n ansible-automation-platform
oc logs <pod-name> -n ansible-automation-platform -f

# Check operator
oc get csv -n ansible-automation-platform
oc logs deployment/aap-operator-controller-manager -n ansible-automation-platform

Job Failures

"Error creating pod" / EE Image Pull Failure

ERROR: ImagePullBackOff: Back-off pulling image "hub.example.org/ee-custom:1.0"

Causes and fixes:
• EE image doesn't exist in registry → Push image to Hub
• Registry authentication failed → Update Container Registry credential
• Network issue → Check execution node can reach Hub
• Image tag wrong → Verify exact image name and tag

# Test image pull on execution node
podman pull hub.example.org/ee-custom:1.0

# Check credential
podman login hub.example.org
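Because a wrong name or tag is one of the listed causes, it can help to split the image reference from the error message into its parts before comparing it with what the registry actually holds. The following is a minimal sketch; split_image_ref is a hypothetical helper, it assumes the reference starts with a registry host, and it does not handle digest references (@sha256:...).

```shell
# split_image_ref: print registry, repository and tag of an image reference.
# Assumes the form registry[:port]/repository[:tag]; hypothetical helper.
split_image_ref() {
  local ref registry rest repo tag
  ref="$1"
  registry="${ref%%/*}"   # text before the first slash
  rest="${ref#*/}"        # everything after the first slash
  case "$rest" in
    *:*) repo="${rest%:*}"; tag="${rest##*:}" ;;
    *)   repo="$rest";      tag="latest" ;;   # no tag given, so "latest" is implied
  esac
  printf 'registry=%s repo=%s tag=%s\n' "$registry" "$repo" "$tag"
}
```

Running split_image_ref hub.example.org/ee-custom:1.0 separates the three parts so each can be checked against the Hub contents and the job template's EE definition.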

"Host key verification failed"

UNREACHABLE! => {"msg": "Failed to connect: Host key verification failed."}

Fix: Add host_key_checking = False to the [defaults] section of the project's ansible.cfg, or set one of the following variables:

# In job template extra variables
ansible_host_key_checking: false

# Or in inventory group variables
ansible_ssh_common_args: '-o StrictHostKeyChecking=no'

"Permission denied (publickey,password)"

Causes:
• Wrong credential assigned to job template
• SSH key doesn't match authorized_keys on target
• User doesn't exist on target host
• Password expired

# Test from execution node
ssh -i /path/to/key ansible@target-host

"No hosts matched"

[WARNING]: Could not match supplied host pattern, ignoring: webservers

Causes:
• Inventory doesn't contain the host group
• Dynamic inventory sync failed
• Host limit pattern wrong
• Inventory not updated since last host change

Fix: Check inventory sync status, verify host patterns.

Timeout Errors

TASK [Deploy application] *****
fatal: [web01]: FAILED! => {"msg": "Timeout waiting for privilege escalation prompt"}

Causes:
• become_password not set or wrong
• Slow network to target host
• Target host under heavy load
• Become method incompatible

"Module not found"

ERROR! couldn't resolve module/action 'community.general.ufw'

Fix: Module is not in the Execution Environment. Either:
• Build a custom EE with the required collection
• Add the collection to the project's collections/requirements.yml
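To find out which collection to add, the collection name can be derived from the module name in the error message. This is a rough sketch that assumes a three-part fully qualified name (namespace.collection.module); collection_of and requirements_for are hypothetical helper names.

```shell
# collection_of: derive namespace.collection from a module FQCN by dropping the last part.
collection_of() {
  printf '%s\n' "${1%.*}"
}

# requirements_for: print a minimal collections/requirements.yml for one module FQCN.
requirements_for() {
  printf '%s\n' '---' 'collections:' "  - name: $(collection_of "$1")"
}
```

For instance, requirements_for community.general.ufw > collections/requirements.yml would write a file listing community.general; an existing requirements file should be edited by hand instead of overwritten.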

See also: Ansible Automation Platform 2.6 Architecture and Components: Complete Guide

Database Issues

Connection Errors

OperationalError: could not connect to server: Connection refused

Check:

# Test database connectivity
psql -h db.example.org -U postgres -c "SELECT 1;"

# Check PostgreSQL is running
systemctl status postgresql

# Check connection count
psql -h db.example.org -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Check max connections
psql -h db.example.org -U postgres -c "SHOW max_connections;"
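The connection count only means something relative to max_connections. A small sketch that turns the two numbers into a percentage and a warning; the helper names and the 80% default threshold are assumptions, not AAP or PostgreSQL defaults.

```shell
# conn_usage_pct: integer percentage of max_connections in use (hypothetical helper).
conn_usage_pct() {
  local current="$1" max="$2"
  echo $(( $current * 100 / $max ))
}

# conn_status: print WARN when usage reaches the threshold (third argument, default 80).
conn_status() {
  local pct threshold="${3:-80}"
  pct="$(conn_usage_pct "$1" "$2")"
  if [ "$pct" -ge "$threshold" ]; then
    echo "WARN ${pct}% of max_connections in use"
  else
    echo "OK ${pct}% of max_connections in use"
  fi
}
```

The two inputs can be captured from the queries above by adding psql's -At flags (unaligned, tuples only), for example current=$(psql -h db.example.org -U postgres -At -c "SELECT count(*) FROM pg_stat_activity;"), and then calling conn_status "$current" "$max".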

Database Full / Out of Space

# Check database sizes
psql -h db.example.org -U postgres -c "
  SELECT datname, pg_size_pretty(pg_database_size(datname))
  FROM pg_database ORDER BY pg_database_size(datname) DESC;"

# Find large tables
psql -h db.example.org -U postgres -d controller -c "
  SELECT schemaname, tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
  FROM pg_tables
  ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
  LIMIT 10;"

Fix: Configure cleanup jobs:
• Administration → Management Jobs → Cleanup Job Details: purge old job records
• Cleanup Activity Stream: purge old activity log entries
• Set retention to 90-180 days for production
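On the database host itself, a filling data volume is often the first visible symptom. The sketch below filters df -P output for mount points at or above a usage threshold; full_mounts is a hypothetical helper, the 80% default is an assumed value, and mount points containing spaces are not handled.

```shell
# full_mounts: read `df -P` output on stdin and print mount points whose
# usage is at or above the threshold (first argument, default 80).
full_mounts() {
  local threshold="${1:-80}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)              # strip the percent sign from the Capacity column
    if (use + 0 >= t + 0) print $6 " " use "%"
  }'
}
```

Typical use would be df -P | full_mounts 85, run on the host that holds the PostgreSQL data directory.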

Slow Queries

# Find long-running queries
psql -h db.example.org -U postgres -d controller -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

Automation Mesh Issues

Execution Node Unreachable

ERROR: receptor connection to exec1.example.org:27199 failed: dial tcp: connection refused

Diagnostic steps:

# 1. Check Receptor is running on execution node
systemctl status receptor

# 2. Check port is open
ss -tlnp | grep 27199

# 3. Test connectivity from controller
nc -zv exec1.example.org 27199

# 4. Check firewall
firewall-cmd --list-all

# 5. Check Receptor logs
journalctl -u receptor -f --no-pager -n 50

# 6. Check mesh status
receptorctl --socket /var/run/receptor/receptor.sock status
receptorctl --socket /var/run/receptor/receptor.sock routes

Node Shows Zero Capacity

# Check node in API
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/instances/?hostname=exec1.example.org" | \
  jq '.results[0] | {hostname: .hostname, capacity: .capacity, errors: .errors, cpu: .cpu, memory: .memory}'

Common causes:
• Execution node out of memory
• Receptor not connected
• Node disabled in settings

Jobs Stuck in "Waiting"

# Check for capacity
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/instances/" | \
  jq '.results[] | {hostname: .hostname, remaining_capacity: .remaining_capacity}'

# Check instance group assignment
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/instance_groups/" | \
  jq '.results[] | {name: .name, capacity: .capacity, consumed_capacity: .consumed_capacity}'

See also: AAP 2.6 Backup, Restore, and Disaster Recovery Guide

Performance Issues

Slow Job Execution

Diagnostic checklist:
• Check the forks setting: increase it for more parallelism
• Check execution node capacity: add more nodes if it is exhausted
• Check network latency to managed hosts
• Check whether gather_facts is needed (disable it if not)
• Check for slow modules (enable the ansible.posix.profile_tasks callback via callbacks_enabled for per-task timing; callback_whitelist is the older name of this setting)

# Speed optimizations in ansible.cfg
[defaults]
forks = 50
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/facts_cache
fact_caching_timeout = 3600

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s

Platform Gateway Slow

# Check Gateway logs
podman logs automation-gateway -f --tail 50

# Check Redis
podman exec -it redis redis-cli ping
podman exec -it redis redis-cli info memory

# Check database connection pool
podman logs automation-gateway 2>&1 | grep -i "connection\|pool\|timeout"

High Memory Usage

# Check per-component memory
podman stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}"

# For RPM installs
ps aux --sort=-%mem | head -20
free -h

Upgrade Issues

Pre-Upgrade Checklist

# 1. Backup everything
./setup.sh -b

# 2. Check current version
curl -s -k "https://gateway.example.org/api/controller/v2/ping/" | jq '.version'

# 3. Check database migrations
awx-manage showmigrations | grep "\[ \]"

# 4. Check disk space
df -h

# 5. Check for running jobs (drain first)
curl -s -k -H "Authorization: Bearer $TOKEN" \
  "https://gateway.example.org/api/controller/v2/jobs/?status=running" | jq '.count'
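Draining before an upgrade means waiting until the running-job count reaches zero. A possible sketch of that wait loop follows; wait_for_zero is a hypothetical helper, and the default of 30 attempts with a 10-second delay is an arbitrary assumption.

```shell
# wait_for_zero: poll a command that prints a number until it prints 0
# or the attempts run out. Arguments: command, attempts (30), delay in seconds (10).
wait_for_zero() {
  local cmd="$1" attempts="${2:-30}" delay="${3:-10}" n count
  n=0
  while [ "$n" -lt "$attempts" ]; do
    count="$(sh -c "$cmd")"
    if [ "$count" = "0" ]; then
      echo "drained"
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  echo "still running: $count"
  return 1
}
```

The command argument would be the running-jobs query from step 5, passed as a single quoted string; the function returns non-zero if jobs are still running when the attempts are used up, so it can gate the next upgrade step.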

"Migration failed" During Upgrade

# Check migration status
awx-manage showmigrations --list

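# Preview the migration plan (--plan lists the operations without applying them)
awx-manage migrate --plan

# Apply migrations manually
awx-manage migrate

# If stuck, check for locks
psql -h db.example.org -U postgres -d controller -c "
  SELECT * FROM pg_locks WHERE NOT granted;"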

Services Won't Start After Upgrade

# Check for configuration changes
diff /etc/ansible-automation-platform/controller.conf /etc/ansible-automation-platform/controller.conf.bak

# Verify database connectivity
awx-manage check_db

# Check for incompatible settings
awx-manage validate_settings

Useful CLI Commands

awx-manage (Controller)

# Check database connectivity
awx-manage check_db

# Run pending migrations
awx-manage migrate

# Change the admin password
awx-manage changepassword admin

# Export/import assets
awx-manage export_assets --all > assets.json
awx-manage import_assets < assets.json

# Purge old job records (preview with --dry-run first)
awx-manage cleanup_jobs --dry-run
awx-manage cleanup_jobs

# Show settings
awx-manage print_settings

receptorctl (Mesh)

# Node status
receptorctl --socket /var/run/receptor/receptor.sock status

# View routing table
receptorctl --socket /var/run/receptor/receptor.sock routes

# Ping a mesh node
receptorctl --socket /var/run/receptor/receptor.sock ping exec1

# List active work units
receptorctl --socket /var/run/receptor/receptor.sock work list

FAQ

How do I enable debug logging?

• Controller: set AWX_TASK_LOG_LEVEL=DEBUG in settings.
• Gateway: set GATEWAY_LOG_LEVEL=DEBUG.
• Receptor: set --log-level debug in the receptor configuration.

Remember to revert after debugging: debug logging is verbose and impacts performance.

How do I recover from a corrupted database?

Restore from the most recent backup. If no backup exists, check for PostgreSQL WAL files that might allow point-in-time recovery. As a last resort, the installer can rebuild the database (losing all data).

Why do jobs succeed in CLI but fail in Controller?

Common causes: different user context (Controller runs as awx), different Python environment (jobs run in EEs), different working directory, missing environment variables, or credential injection differences.
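For the missing-environment-variable case, one approach is to dump the environment in both places (env > cli.env on the command line, and the same from a debug task inside the job) and compare the variable names. A minimal sketch, assuming one NAME=value pair per line; missing_vars is a hypothetical helper and multi-line values are not handled.

```shell
# missing_vars: print variable names present in the first env dump but not the second.
# Each file is expected to contain `env` output (NAME=value per line).
missing_vars() {
  local a b
  a="$(mktemp)"
  b="$(mktemp)"
  cut -d= -f1 "$1" | sort -u > "$a"   # names only, sorted for comm
  cut -d= -f1 "$2" | sort -u > "$b"
  comm -23 "$a" "$b"                  # lines unique to the first file
  rm -f "$a" "$b"
}
```

Calling missing_vars cli.env job.env lists the names that exist only on the command line, such as proxy settings or custom paths, which are then candidates to define in the job template or the EE.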

How do I reset the admin password?

# Containerized
podman exec -it automation-controller-task awx-manage changepassword admin

# RPM
awx-manage changepassword admin

Where are the log files?

| Component | Containerized | RPM |
|-----------|---------------|-----|
| Controller | podman logs automation-controller- | /var/log/tower/ |
| Gateway | podman logs automation-gateway | /var/log/automation-gateway/ |
| Hub | podman logs automation-hub- | /var/log/pulp/ |
| EDA | podman logs automation-eda-* | /var/log/automation-eda/ |
| Receptor | podman logs receptor | /var/log/receptor/ |

Conclusion

Effective troubleshooting in AAP 2.6 requires understanding the architecture — knowing which component owns which functionality and where to look for errors. Start with the job output, work through service health and logs, and use the diagnostic commands in this guide to pinpoint issues quickly.

Related Articles

• AAP 2.6 Architecture and Components: Complete Guide
• AAP 2.6 Monitoring and Logging: Prometheus, Grafana, and Log Aggregation
• AAP 2.6 Automation Mesh: Distributed Execution Across Sites and Networks
• AAP 2.6 Backup, Restore, and Disaster Recovery Guide
• AAP 2.6 Execution Environments: Build, Manage, and Deploy Custom EEs