AAP 2.6 Job Scheduling and Capacity Planning Guide

By Luca Berton · Published 2024-01-01 · Category: troubleshooting

Plan and optimize AAP 2.6 capacity for enterprise workloads. This guide covers job scheduling strategies, instance group sizing, fork tuning, concurrent job limits, and database sizing.

Capacity Planning for AAP 2.6

Proper capacity planning ensures jobs run on time without queue delays or resource exhaustion. AAP capacity depends on four factors: execution node count, memory per node, fork count per job, and concurrent job slots.

Understanding AAP Capacity

Capacity Formula

Each execution node's capacity is calculated as:

capacity = min(memory_capacity, cpu_capacity)

memory_capacity = (total_memory_MB - 2048) / fork_memory_MB
cpu_capacity = cpu_count * forks_per_cpu

Default values:

fork_memory_MB = 100 MB per fork
forks_per_cpu = 4

Example: A node with 16 GB RAM and 4 CPUs:

memory_capacity = (16384 - 2048) / 100 = 143
cpu_capacity = 4 × 4 = 16
capacity = min(143, 16) = 16

This node supports 16 concurrent forks across all jobs.
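
The formula above can be sketched as a small Python function. This is a minimal illustration that assumes the default values listed earlier (100 MB per fork, 4 forks per CPU); the function name `node_capacity` is hypothetical and not part of any AAP API.

```python
def node_capacity(total_memory_mb: int, cpu_count: int,
                  fork_memory_mb: int = 100, forks_per_cpu: int = 4) -> int:
    """Return the fork capacity of one execution node using the formula above."""
    # Reserve 2048 MB for the system, then divide the rest by memory per fork.
    memory_capacity = max(0, (total_memory_mb - 2048) // fork_memory_mb)
    # CPU-bound capacity scales linearly with the number of cores.
    cpu_capacity = cpu_count * forks_per_cpu
    # The smaller of the two values is the limiting factor.
    return min(memory_capacity, cpu_capacity)


print(node_capacity(16384, 4))   # the 16 GB / 4 CPU example node
print(node_capacity(32768, 8))   # a "Medium" node from the sizing reference
```

For the example node the CPU term is the bottleneck, so adding memory alone would not raise its capacity.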

Sizing Reference

Environment	Managed Hosts	Execution Nodes	Node Spec	Controller	Database
Small	50-500	1-2	4 CPU, 16 GB	4 CPU, 16 GB	4 CPU, 16 GB
Medium	500-2,000	3-5	8 CPU, 32 GB	8 CPU, 32 GB	8 CPU, 32 GB
Large	2,000-10,000	5-15	16 CPU, 64 GB	16 CPU, 64 GB	16 CPU, 64 GB, SSD
Enterprise	10,000+	15-50+	16 CPU, 64 GB	16 CPU, 64 GB (HA)	32 CPU, 128 GB, NVMe

Job Scheduling

Schedule Types

# Schedules are controller resources, so they are managed with the
# ansible.controller collection.
# INTERVAL=1 is explicit because the controller's rrule validation may reject rules without it.

# Daily at 2 AM UTC
- name: Daily maintenance
  ansible.controller.schedule:
    controller_host: "{{ gateway_url }}"
    controller_username: "{{ controller_user }}"
    controller_password: "{{ controller_pass }}"
    name: "Daily Patching Window"
    unified_job_template: "OS Patching"
    rrule: "DTSTART:20260101T020000Z RRULE:FREQ=DAILY;INTERVAL=1"
    state: present

# The remaining tasks omit the connection parameters for brevity;
# repeat them as above or supply them through CONTROLLER_* environment variables.

# Weekly on Mondays at 6 AM UTC
- name: Weekly compliance scan
  ansible.controller.schedule:
    name: "Weekly Compliance"
    unified_job_template: "Full Compliance Scan"
    rrule: "DTSTART:20260105T060000Z RRULE:FREQ=WEEKLY;INTERVAL=1;BYDAY=MO"
    state: present

# Monthly on the 1st at midnight UTC
- name: Monthly certificate rotation
  ansible.controller.schedule:
    name: "Certificate Rotation"
    unified_job_template: "Rotate Certificates"
    rrule: "DTSTART:20260101T000000Z RRULE:FREQ=MONTHLY;INTERVAL=1;BYMONTHDAY=1"
    state: present

# Every 4 hours
- name: Frequent health check
  ansible.controller.schedule:
    name: "Health Check"
    unified_job_template: "Infrastructure Health"
    rrule: "DTSTART:20260101T000000Z RRULE:FREQ=HOURLY;INTERVAL=4"
    state: present
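
The rrule strings above follow one pattern: a UTC DTSTART stamp followed by an RRULE with a frequency and optional BY* parts. A minimal Python sketch for generating them consistently is shown here. The helper name `build_rrule` is hypothetical, and writing INTERVAL on every rule is a choice made for this sketch (it is valid iCalendar syntax), not something the schedules above require.

```python
from datetime import datetime, timezone


def build_rrule(start: datetime, freq: str, interval: int = 1, **parts: str) -> str:
    """Format a DTSTART/RRULE string in the style used by the schedules above."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    # Normalise to UTC and render the compact iCalendar timestamp.
    stamp = start.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    fields = [f"FREQ={freq.upper()}", f"INTERVAL={interval}"]
    # Extra keyword arguments become BY* parts, e.g. byday="MO" -> BYDAY=MO.
    fields += [f"{key.upper()}={value}" for key, value in parts.items()]
    return f"DTSTART:{stamp} RRULE:{';'.join(fields)}"


weekly = build_rrule(datetime(2026, 1, 5, 6, 0, tzinfo=timezone.utc), "weekly", byday="MO")
every_4h = build_rrule(datetime(2026, 1, 1, tzinfo=timezone.utc), "hourly", interval=4)
print(weekly)
print(every_4h)
```

Generating the strings from one helper keeps the timestamp format and field order uniform across many schedules.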

Maintenance Windows with Blackout Periods

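# Schedule with exception dates
# Check that your controller version accepts EXDATE; some releases reject it
# and expect exclusions to be expressed as EXRULE entries instead.
- name: Daily deploy (skip holidays)
  ansible.controller.schedule:
    name: "Daily Deploy"
    unified_job_template: "Application Deploy"
    rrule: >-
      DTSTART:20260101T020000Z
      RRULE:FREQ=DAILY;INTERVAL=1
      EXDATE:20261225T020000Z,20261226T020000Z,20270101T020000Z
    state: present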
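
To see what the exception dates do, the sketch below expands a daily schedule in plain Python and drops the excluded occurrences. It only illustrates the semantics and is not how the controller evaluates rules; the function name `daily_runs` is hypothetical. Note that an exception only removes an occurrence when its timestamp matches the run time exactly (02:00 UTC here).

```python
from datetime import datetime, timedelta, timezone


def daily_runs(start, days, excluded):
    """Yield one run per day from start, skipping timestamps listed in excluded."""
    skip = set(excluded)
    for offset in range(days):
        run = start + timedelta(days=offset)
        if run not in skip:
            yield run


utc = timezone.utc
start = datetime(2026, 12, 24, 2, 0, tzinfo=utc)
holidays = [
    datetime(2026, 12, 25, 2, 0, tzinfo=utc),
    datetime(2026, 12, 26, 2, 0, tzinfo=utc),
]
runs = list(daily_runs(start, 4, holidays))
for run in runs:
    print(run.isoformat())
```

Over the four days starting 24 December, only the runs on the 24th and the 27th remain.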

Instance Groups and Job Routing

Define Instance Groups

# Instance groups are controller resources (ansible.controller collection)
- name: Create production instance group
  ansible.controller.instance_group:
    controller_host: "{{ gateway_url }}"
    controller_username: "{{ controller_user }}"
    controller_password: "{{ controller_pass }}"
    name: "production"
    policy_instance_minimum: 2
    policy_instance_percentage: 50
    max_concurrent_jobs: 20
    max_forks: 200
    state: present

- name: Create network instance group
  ansible.controller.instance_group:
    name: "network"
    policy_instance_minimum: 1
    policy_instance_percentage: 25
    max_concurrent_jobs: 10
    max_forks: 50
    state: present

- name: Create DMZ instance group
  ansible.controller.instance_group:
    name: "dmz"
    policy_instance_minimum: 1
    max_concurrent_jobs: 5
    state: present
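
The effect of max_concurrent_jobs and max_forks can be reasoned about with a simplified admission check. The sketch below is an assumption-laden model, not the controller's scheduler: it counts plain fork values, treats 0 as "no limit", and the names `GroupLimits` and `can_start` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GroupLimits:
    max_concurrent_jobs: int  # 0 is treated as "no limit" in this sketch
    max_forks: int            # 0 is treated as "no limit" in this sketch


def can_start(limits: GroupLimits, running_jobs: int, forks_in_use: int, job_forks: int) -> bool:
    """Return True if one more job fits within the group's limits."""
    if limits.max_concurrent_jobs and running_jobs + 1 > limits.max_concurrent_jobs:
        return False
    if limits.max_forks and forks_in_use + job_forks > limits.max_forks:
        return False
    return True


production = GroupLimits(max_concurrent_jobs=20, max_forks=200)
print(can_start(production, running_jobs=3, forks_in_use=150, job_forks=50))  # fits exactly
print(can_start(production, running_jobs=3, forks_in_use=160, job_forks=50))  # exceeds max_forks
```

A job that does not fit stays pending until running jobs finish, which is why tight limits show up as queue time rather than failures.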

Route Jobs to Instance Groups

# Job templates are controller resources (ansible.controller collection)
- name: Route production jobs
  ansible.controller.job_template:
    controller_host: "{{ gateway_url }}"
    controller_username: "{{ controller_user }}"
    controller_password: "{{ controller_pass }}"
    name: "Production Deployment"
    instance_groups:
      - "production"
    forks: 50
    state: present

- name: Route network jobs to dedicated group
  ansible.controller.job_template:
    name: "Network Backup"
    instance_groups:
      - "network"
    forks: 10  # Lower forks for network devices
    state: present

Performance Tuning

Fork Optimization

Workload	Recommended Forks	Why
Configuration management	50-100	Standard SSH tasks
Network devices	5-20	Devices have limited concurrent sessions
Cloud provisioning	10-30	API rate limits
Windows (WinRM)	20-50	WinRM is heavier than SSH
Patching/updates	10-25	Downloads and reboots are sequential anyway
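
A quick way to judge a fork setting is to count how many batches each task needs to cover the inventory. The sketch below assumes the default linear strategy, where a task runs on at most `forks` hosts at a time; the function name `batches_per_task` is hypothetical.

```python
import math


def batches_per_task(host_count: int, forks: int) -> int:
    """Return how many batches one task needs to reach every host."""
    if forks < 1:
        raise ValueError("forks must be at least 1")
    return math.ceil(host_count / forks)


for workload, hosts, forks in [
    ("Configuration management", 1000, 50),
    ("Network devices", 200, 10),
]:
    print(f"{workload}: {batches_per_task(hosts, forks)} batches per task")
```

Doubling forks halves the batch count, but only helps while the execution node has capacity and the targets can accept the extra connections.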

ansible.cfg Performance Settings

[defaults]
forks = 50
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/facts
fact_caching_timeout = 7200
stdout_callback = default
internal_poll_interval = 0.001
host_key_checking = false

[ssh_connection]
pipelining = true
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o PreferredAuthentications=publickey

Job Slicing

Split large inventories across multiple execution nodes:

- name: Configure job slicing
  ansible.controller.job_template:
    controller_host: "{{ gateway_url }}"
    controller_username: "{{ controller_user }}"
    controller_password: "{{ controller_pass }}"
    name: "Patch All Servers"
    job_slice_count: 5  # Split into 5 parallel slices
    forks: 50  # Applies to each slice
    state: present

With 1,000 hosts and job_slice_count: 5, each slice covers about 200 hosts, and the slices can run simultaneously on different execution nodes when capacity is available.
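
The partitioning can be pictured with a short Python sketch. It uses a simple striding scheme as an assumption for illustration; the controller's actual host-to-slice assignment may differ, and the function name `slice_hosts` is hypothetical.

```python
def slice_hosts(hosts: list, slice_count: int) -> list:
    """Split hosts into slice_count groups of near-equal size by striding."""
    if slice_count < 1:
        raise ValueError("slice_count must be at least 1")
    # Slice 0 takes hosts 0, n, 2n, ...; slice 1 takes hosts 1, n+1, ...
    return [hosts[offset::slice_count] for offset in range(slice_count)]


inventory = [f"server{i:04d}" for i in range(1000)]
slices = slice_hosts(inventory, 5)
print([len(s) for s in slices])
```

Every host lands in exactly one slice, so the slices can run independently without touching the same machine twice.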

Concurrent Job Limits

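# Controller-wide job settings
# SCHEDULE_MAX_JOBS caps how many pending jobs a schedule may queue for the
# same template; per-group concurrency is set with max_concurrent_jobs.
- name: Set schedule queue limit and default timeouts
  ansible.builtin.uri:
    url: "https://gateway.example.org/api/controller/v2/settings/jobs/"
    method: PATCH
    headers:
      Authorization: "Bearer {{ token }}"
    body_format: json
    body:
      SCHEDULE_MAX_JOBS: 50
      DEFAULT_JOB_TIMEOUT: 3600  # 1 hour max
      DEFAULT_INVENTORY_UPDATE_TIMEOUT: 300  # 5 minutes
      DEFAULT_PROJECT_UPDATE_TIMEOUT: 300  # 5 minutes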
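
The same PATCH call can be prepared outside a playbook, for example from a small script. The sketch below only builds the request with the Python standard library and does not send it; the helper name `build_settings_request` and the placeholder token are assumptions, and the URL path is the one used in the task above.

```python
import json
import urllib.request


def build_settings_request(base_url: str, token: str, settings: dict) -> urllib.request.Request:
    """Build (but do not send) a PATCH request for the job settings endpoint."""
    url = base_url.rstrip("/") + "/api/controller/v2/settings/jobs/"
    return urllib.request.Request(
        url,
        data=json.dumps(settings).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_settings_request(
    "https://gateway.example.org",
    "REPLACE_WITH_TOKEN",  # placeholder, not a real credential
    {"DEFAULT_JOB_TIMEOUT": 3600},
)
print(request.get_method(), request.full_url)
# To apply the change, pass the request to urllib.request.urlopen(request).
```

PATCH updates only the keys in the body, so other job settings keep their current values.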

Database Sizing

PostgreSQL Requirements

Metric	Small	Medium	Large	Enterprise
Job history (90 days)	5 GB	20 GB	100 GB	500 GB+
IOPS	1,000	3,000	5,000	10,000+
Connections	200	400	800	1,500+
Backup frequency	Daily	Daily	6-hourly	Continuous (WAL)

Database Maintenance

# Cleanup schedules attach to the built-in system job templates
- name: Schedule database cleanup
  ansible.controller.schedule:
    controller_host: "{{ gateway_url }}"
    controller_username: "{{ controller_user }}"
    controller_password: "{{ controller_pass }}"
    name: "Database Cleanup"
    unified_job_template: "Cleanup Job Details"
    rrule: "DTSTART:20260101T030000Z RRULE:FREQ=WEEKLY;INTERVAL=1;BYDAY=SU"
    extra_data:
      days: 90  # Keep 90 days of history
    state: present

- name: Schedule activity cleanup
  ansible.controller.schedule:
    name: "Activity Stream Cleanup"
    unified_job_template: "Cleanup Activity Stream"
    rrule: "DTSTART:20260101T040000Z RRULE:FREQ=WEEKLY;INTERVAL=1;BYDAY=SU"
    extra_data:
      days: 180  # Keep 180 days of activity stream entries
    state: present

Monitoring Capacity

Key Metrics to Track

# Prometheus alerting rules for capacity monitoring
groups:
  - name: aap_capacity
    rules:
      - alert: HighJobQueueTime
        expr: awx_pending_jobs_total > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Job queue building up — {{ $value }} pending jobs"

      - alert: ExecutionNodeOverloaded
        expr: awx_instance_remaining_capacity < 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Execution node {{ $labels.hostname }} nearly out of capacity"

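      - alert: DatabaseConnectionsHigh
        # Aggregate per instance so both sides of the comparison share the same labels
        expr: sum by (instance) (pg_stat_activity_count) > on (instance) (pg_settings_max_connections * 0.8)
        for: 5m
        labels:
          severity: warning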
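
For a quick check without a Prometheus server, the remaining-capacity metric can also be read straight from metrics text. The sketch below parses a hand-written sample in the Prometheus text format; the sample values, hostnames, and the function name `low_capacity_nodes` are assumptions for illustration, and real output carries more labels.

```python
import re

# Hypothetical sample in the Prometheus text exposition format.
SAMPLE = """\
# TYPE awx_instance_remaining_capacity gauge
awx_instance_remaining_capacity{hostname="exec1.example.org"} 1
awx_instance_remaining_capacity{hostname="exec2.example.org"} 12
"""

LINE = re.compile(
    r'^awx_instance_remaining_capacity\{.*?hostname="([^"]+)".*?\}\s+([0-9.]+)$'
)


def low_capacity_nodes(metrics_text: str, threshold: float = 2) -> list:
    """Return hostnames whose remaining capacity is below threshold."""
    nodes = []
    for line in metrics_text.splitlines():
        match = LINE.match(line.strip())
        if match and float(match.group(2)) < threshold:
            nodes.append(match.group(1))
    return nodes


print(low_capacity_nodes(SAMPLE))
```

The threshold of 2 mirrors the ExecutionNodeOverloaded rule above.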

FAQ

How many execution nodes do I need?

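Divide your total peak concurrent forks by per-node capacity. Example: a job that patches 500 hosts with forks=50 occupies about 50 fork slots (the controller counts forks plus one for the job itself) and works through the hosts in 10 batches of 50. Add overhead for other jobs that run at the same time. Start with peak_concurrent_forks / per_node_capacity × 1.3 as a baseline.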
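
The baseline peak_concurrent_forks / per_node_capacity × 1.3 translates directly into a helper. This is a minimal sketch: the function name `nodes_needed` is hypothetical, the result is rounded up to whole nodes, and the peak of 200 forks is an assumed example figure.

```python
import math


def nodes_needed(peak_concurrent_forks: int, per_node_capacity: int, headroom: float = 1.3) -> int:
    """Estimate the execution node count, rounded up to whole nodes."""
    if per_node_capacity < 1:
        raise ValueError("per_node_capacity must be at least 1")
    return math.ceil(peak_concurrent_forks / per_node_capacity * headroom)


# Assumed peak of 200 concurrent forks on nodes with a capacity of 16 each.
print(nodes_needed(200, 16))
```

With those figures the raw ratio is 12.5 nodes, and the 30% headroom brings the estimate to 17.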

Should I use job slicing or serial execution?

Use job slicing for independent hosts (patching, config management). Use serial execution (rolling update patterns with serial:) when hosts depend on each other (load balancer drain → update → restore).

What's the maximum number of managed hosts?

No hard limit. Largest known deployments manage 50,000+ nodes. The constraint is execution capacity and database I/O, not AAP software limits.

How do I prevent schedule collisions?

Stagger schedules by 15-30 minutes. Use instance groups to isolate workloads. Set max_concurrent_jobs on instance groups to prevent resource exhaustion. Use workflow templates to chain dependent jobs.
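
Staggered start times are easy to generate rather than maintain by hand. The sketch below spaces a list of schedules 15 minutes apart from a common starting point and prints a DTSTART stamp for each; the function name `staggered_starts` is hypothetical, and the schedule names are taken from the examples in this guide.

```python
from datetime import datetime, timedelta, timezone


def staggered_starts(first: datetime, names: list, gap_minutes: int = 15) -> dict:
    """Assign each schedule a start time gap_minutes after the previous one."""
    return {
        name: first + timedelta(minutes=gap_minutes * index)
        for index, name in enumerate(names)
    }


window = datetime(2026, 1, 1, 2, 0, tzinfo=timezone.utc)
starts = staggered_starts(window, ["Daily Patching Window", "Daily Deploy", "Health Check"])
for name, start in starts.items():
    print(f"{name}: DTSTART:{start.strftime('%Y%m%dT%H%M%SZ')}")
```

Three schedules that would otherwise all fire at 02:00 now start at 02:00, 02:15, and 02:30.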

When should I scale horizontally vs vertically?

Scale execution nodes horizontally (add more nodes). Scale the database vertically (bigger instance). Controller nodes can be added for API HA but one handles most workloads.

Conclusion

Capacity planning for AAP 2.6 requires understanding the relationship between forks, memory, concurrent jobs, and execution nodes. Start with the sizing reference for your host count, tune fork settings per workload type, and use instance groups to isolate different automation domains. Monitor queue times and capacity metrics to scale proactively.
