Ansible for Site Reliability Engineering: SRE Practices with Automation

By Luca Berton · Published 2024-01-01 · Category: installation

Implement SRE practices with Ansible. Automate incident response, capacity planning, SLO enforcement, chaos engineering, and toil reduction playbooks.

Introduction

Site Reliability Engineering (SRE) is about running production systems reliably at scale. A core SRE principle is eliminating toil through automation, and Ansible is well suited to codifying operational runbooks, automating incident response, and enforcing service level objectives (SLOs). This guide covers SRE-specific patterns for using Ansible in production operations.

Toil Reduction

Toil is manual, repetitive, automatable work that scales linearly with service growth. Ansible eliminates it:

| Toil | Manual Effort | Ansible Automation |
|------|---------------|--------------------|
| Password resets | SSH → passwd | Playbook + Vault |
| Certificate renewal | Download, install, verify | Automated rotation |
| Log cleanup | SSH → find/rm | Scheduled playbook |
| Scaling | Provision → configure → deploy | One playbook |
| Incident response | Read runbook → SSH → fix | Auto-remediation |

Automate Common Operations

---
- name: SRE operational tasks
  hosts: all
  become: true
  tasks:
    - name: Clean old logs (weekly toil → automated)
      ansible.builtin.find:
        paths:
          - /var/log
          - /opt/app/logs
        patterns: "*.gz,*.old,*.1,*.2,*.3"
        age: 7d
      register: old_logs

    - name: Remove old logs
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_logs.files }}"

    # ansible_facts.packages is only populated after package_facts has run
    - name: Gather installed packages
      ansible.builtin.package_facts:
        manager: auto

    - name: Clean Docker artifacts
      ansible.builtin.command: docker system prune -af --volumes
      when: "'docker' in ansible_facts.packages"
      changed_when: true

    - name: Verify disk space after cleanup
      ansible.builtin.shell: df -h / | tail -1 | awk '{print $5}' | tr -d '%'
      register: disk_usage
      changed_when: false

    - name: Alert if still critical
      ansible.builtin.debug:
        msg: "ALERT: {{ inventory_hostname }} disk at {{ disk_usage.stdout }}% after cleanup"
      when: disk_usage.stdout | int > 85
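
The toil table lists log cleanup as a scheduled playbook. One possible way to schedule it is a cron entry on the control node, managed with `ansible.builtin.cron`. This is a minimal sketch: the playbook path, inventory path, log file, and schedule are assumptions for illustration and do not come from the article.

```yaml
---
- name: Schedule the weekly cleanup playbook (sketch)
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    # Hypothetical locations; adjust to your control node layout
    cleanup_playbook: /opt/ansible/playbooks/sre-cleanup.yml
    cleanup_inventory: /opt/ansible/inventory/production
  tasks:
    - name: Run log cleanup every Sunday at 03:00
      ansible.builtin.cron:
        name: sre-weekly-log-cleanup
        weekday: "0"
        hour: "3"
        minute: "0"
        job: "ansible-playbook -i {{ cleanup_inventory }} {{ cleanup_playbook }} >> /var/log/sre-cleanup.log 2>&1"
```

Cron runs with a minimal environment, so the job may need the full path to `ansible-playbook`. A scheduler in Ansible Automation Platform would serve the same purpose where one is available.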

Automated Incident Response

Runbook as Code

---
# runbooks/high-cpu.yml
- name: "RUNBOOK: High CPU Usage"
  hosts: "{{ target_host }}"
  become: true
  # incident_id is optional. The default is applied where it is used,
  # because a play var that references itself triggers a recursive template error.
  tasks:
    - name: "Step 1: Identify top CPU consumers"
      ansible.builtin.shell: ps aux --sort=-%cpu | head -10
      register: top_procs
      changed_when: false

    - name: "Step 2: Check for known problematic processes"
      ansible.builtin.shell: |
        ps aux | awk '$3 > 80 {print $11}' | head -5
      register: high_cpu_procs
      changed_when: false

    - name: "Step 3: Check if OOM killer was triggered"
      ansible.builtin.shell: dmesg | grep -i "oom\|killed process" | tail -5
      register: oom_events
      changed_when: false

    - name: "Step 4: Auto-remediate known issues"
      block:
        - name: Restart runaway Java process
          ansible.builtin.systemd:
            name: myapp
            state: restarted
          when: "'java' in high_cpu_procs.stdout"

        - name: Clear application cache
          ansible.builtin.file:
            path: /opt/app/cache
            state: absent
          when: "'cache' in high_cpu_procs.stdout"
      rescue:
        - name: Remediation failed, escalate
          ansible.builtin.debug:
            msg: "Auto-remediation failed, escalating to on-call"

    - name: "Step 5: Verify resolution"
      ansible.builtin.shell: |
        sleep 30
        cat /proc/loadavg | awk '{print $1}'
      register: load_after
      changed_when: false

    - name: "Step 6: Update incident"
      ansible.builtin.uri:
        url: "{{ incident_api }}/{{ incident_id | default('manual') }}/notes"
        method: POST
        body_format: json
        body:
          note: |
            Auto-remediation executed:
            Top processes: {{ top_procs.stdout_lines[:5] | join(', ') }}
            Load after fix: {{ load_after.stdout }}
            OOM events: {{ oom_events.stdout_lines | length }}
      delegate_to: localhost
      become: false  # no privilege escalation needed on the controller
      when: incident_api is defined
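
The runbook takes its inputs as variables (`target_host`, `incident_id`, `incident_api`). One way to supply them is an extra-vars file. In this sketch the file name and every value are placeholders, not values from the article.

```yaml
# incident-vars.yml (hypothetical file name, placeholder values)
target_host: web01.example.com
incident_id: INC-1234
incident_api: https://incidents.example.com/api/incidents
```

The runbook could then be started with `ansible-playbook runbooks/high-cpu.yml -e @incident-vars.yml`. Leaving `incident_api` out of the file skips Step 6, because that task only runs when the variable is defined.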

PagerDuty Integration

- name: Trigger PagerDuty incident
  ansible.builtin.uri:
    url: https://events.pagerduty.com/v2/enqueue
    method: POST
    body_format: json
    status_code: 202  # the Events API answers 202 Accepted; uri expects 200 by default
    body:
      routing_key: "{{ pagerduty_integration_key }}"
      event_action: trigger
      dedup_key: "{{ incident_dedup_key }}"
      payload:
        summary: "{{ alert_summary }}"
        severity: "{{ alert_severity }}"  # one of: critical, error, warning, info
        source: "{{ inventory_hostname }}"
        component: "{{ service_name }}"
        custom_details:
          runbook_url: "https://wiki.example.com/runbooks/{{ runbook_id }}"
          auto_remediation: "{{ remediation_attempted }}"
  delegate_to: localhost
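
The same endpoint can close the alert once remediation succeeds: the PagerDuty Events API v2 accepts a `resolve` action that is matched to the open alert through its `dedup_key`. The task below is a sketch. The `remediation_succeeded` variable is a hypothetical flag that the calling playbook would have to set.

```yaml
- name: Resolve PagerDuty incident after successful remediation
  ansible.builtin.uri:
    url: https://events.pagerduty.com/v2/enqueue
    method: POST
    body_format: json
    status_code: 202
    body:
      routing_key: "{{ pagerduty_integration_key }}"
      event_action: resolve
      # Must match the dedup_key sent with the trigger event
      dedup_key: "{{ incident_dedup_key }}"
  delegate_to: localhost
  # remediation_succeeded is a hypothetical flag, not defined in the article
  when: remediation_succeeded | default(false) | bool
```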

SLO Enforcement

SLO Monitoring

- name: Check SLO compliance
  hosts: localhost
  tasks:
    - name: Query error rate from Prometheus
      ansible.builtin.uri:
        url: "{{ prometheus_url }}/api/v1/query"
        method: POST  # a form-encoded body needs POST; uri defaults to GET
        body_format: form-urlencoded
        body:
          query: >
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
      register: error_rate

    - name: Check against SLO target
      ansible.builtin.set_fact:
        current_error_rate: "{{ error_rate.json.data.result[0].value[1] | float }}"
        slo_target: 0.001  # 99.9% availability = 0.1% error budget

    - name: Alert on SLO breach
      ansible.builtin.uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: |
            🚨 *SLO BREACH*: {{ service_name }}
            Current error rate: {{ (current_error_rate | float * 100) | round(3) }}%
            SLO target: {{ (slo_target | float * 100) | round(3) }}%
            Error budget remaining: {{ ((slo_target | float - current_error_rate | float) / (slo_target | float) * 100) | round(1) }}%
      when: current_error_rate | float > slo_target | float

    - name: Freeze deployments on SLO breach
      ansible.builtin.uri:
        url: "{{ aap_url }}/api/v2/job_templates/{{ deploy_template_id }}/"
        method: PATCH
        headers:
          Authorization: "Bearer {{ aap_token }}"
        body_format: json
        body:
          enabled: false
      when: current_error_rate | float > slo_target | float * 2
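
The error budget arithmetic in the Slack message can be checked in isolation. The sketch below uses a sample error rate of 0.0004 (an illustrative value, not real data) against the 0.001 target, which leaves about 60% of the budget. The `budget_remaining_pct` variable name is introduced here for illustration only.

```yaml
---
- name: Error budget arithmetic (worked example)
  hosts: localhost
  gather_facts: false
  vars:
    slo_target: 0.001           # allowed error rate for a 99.9% SLO
    current_error_rate: 0.0004  # sample value for illustration
    # (target - current) / target, expressed as a percentage
    budget_remaining_pct: "{{ ((slo_target | float - current_error_rate | float) / (slo_target | float) * 100) | round(1) }}"
  tasks:
    - name: Show remaining error budget
      ansible.builtin.debug:
        msg: "Error budget remaining: {{ budget_remaining_pct }}%"

    - name: Fail the run when the budget is exhausted
      ansible.builtin.assert:
        that:
          - budget_remaining_pct | float > 0
        fail_msg: "Error budget exhausted, hold deployments"
        success_msg: "Error budget available, deployments may proceed"
```

A failing assert gives a pipeline a simple exit code to gate on, which is one way to let the error budget drive deployment decisions.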

Chaos Engineering

Controlled Failure Injection

---
- name: "CHAOS: Network latency injection"
  hosts: "{{ chaos_target }}"
  become: true
  vars:
    chaos_duration: 300  # 5 minutes
    latency_ms: 200
  tasks:
    - name: Inject network latency
      ansible.builtin.command: >
        tc qdisc add dev eth0 root netem delay {{ latency_ms }}ms 50ms
      register: inject_result

    - name: Wait for chaos duration
      ansible.builtin.wait_for:
        timeout: "{{ chaos_duration }}"

    - name: Remove latency injection
      ansible.builtin.command: >
        tc qdisc del dev eth0 root
      when: inject_result.changed

- name: "CHAOS: Service failure"
  hosts: "{{ chaos_target }}"
  become: true
  tasks:
    - name: Stop random service
      ansible.builtin.systemd:
        name: "{{ chaos_service }}"
        state: stopped

    - name: Verify system recovers
      ansible.builtin.pause:
        seconds: 120
        prompt: "Observing system behavior..."

    - name: Check auto-recovery
      ansible.builtin.systemd:
        name: "{{ chaos_service }}"
      register: service_status

    - name: Report if service didn't auto-recover
      ansible.builtin.debug:
        msg: "FINDING: {{ chaos_service }} did not auto-recover on {{ inventory_hostname }}"
      when: service_status.status.ActiveState != 'active'

    - name: Restore service
      ansible.builtin.systemd:
        name: "{{ chaos_service }}"
        state: started
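
In the latency play above, the cleanup task is skipped if the wait task fails, which would leave the injected delay in place. A `block` with an `always` section is one way to make the rollback run even when an intermediate task fails. This is a sketch of that variant, not the article's method. The `chaos_interface` variable is an assumption added so the interface name is not hard-coded.

```yaml
---
- name: "CHAOS: Latency injection with guaranteed cleanup (sketch)"
  hosts: "{{ chaos_target }}"
  become: true
  vars:
    chaos_duration: 300
    latency_ms: 200
    chaos_interface: eth0  # assumption: set to the host's real interface
  tasks:
    - name: Inject latency and always roll back
      block:
        - name: Inject network latency
          ansible.builtin.command: >
            tc qdisc add dev {{ chaos_interface }} root netem
            delay {{ latency_ms }}ms 50ms
          register: inject_result
          changed_when: true

        - name: Hold the fault for the chaos window
          ansible.builtin.pause:
            seconds: "{{ chaos_duration }}"
      always:
        # Runs whether the block succeeded or failed; skipped only if
        # the injection itself never took effect
        - name: Remove latency injection
          ansible.builtin.command: tc qdisc del dev {{ chaos_interface }} root
          when: inject_result is defined and inject_result is succeeded
          changed_when: true
```

The `always` section covers task failures, but not a killed controller process or a lost connection, so an independent safety net on the target is still worth considering.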

Capacity Planning

- name: Capacity planning data collection
  hosts: all
  tasks:
    - name: Collect resource metrics
      ansible.builtin.set_fact:
        capacity_data:
          hostname: "{{ inventory_hostname }}"
          cpu_count: "{{ ansible_processor_vcpus }}"
          # Approximation: 1-minute load average per vCPU. Assumes the
          # loadavg fact is available (Linux, recent ansible-core).
          cpu_usage_pct: "{{ (ansible_loadavg['1m'] / ansible_processor_vcpus * 100) | round(1) }}"
          memory_total_gb: "{{ (ansible_memtotal_mb / 1024) | round(1) }}"
          memory_used_pct: "{{ ((1 - ansible_memfree_mb / ansible_memtotal_mb) * 100) | round(1) }}"
          disk_total_gb: "{{ (ansible_mounts[0].size_total / (1024**3)) | round(1) }}"
          disk_used_pct: "{{ ((1 - ansible_mounts[0].size_available / ansible_mounts[0].size_total) * 100) | round(1) }}"

- name: Generate capacity report
  hosts: localhost
  tasks:
    - name: Compile capacity data
      ansible.builtin.template:
        src: capacity-report.j2
        dest: "/reports/capacity-{{ ansible_date_time.date }}.html"
      vars:
        all_capacity: "{{ groups['all'] | map('extract', hostvars, 'capacity_data') | select('defined') | list }}"
        # Cast to float before comparing: set_fact values may be stored as strings
        hosts_above_80_cpu: "{{ all_capacity | map(attribute='cpu_usage_pct') | map('float') | select('gt', 80) | list | length }}"
        hosts_above_80_disk: "{{ all_capacity | map(attribute='disk_used_pct') | map('float') | select('gt', 80) | list | length }}"
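
The collected `capacity_data` fact can also drive an early warning per host. The play below is a sketch that would run in the same playbook, after the collection play, so the fact is still in memory. The `capacity_warn_pct` variable is a hypothetical name, set to 80 here.

```yaml
- name: Flag hosts approaching capacity (sketch)
  hosts: all
  gather_facts: false
  vars:
    capacity_warn_pct: 80  # hypothetical threshold variable
  tasks:
    - name: Warn when disk usage crosses the threshold
      ansible.builtin.debug:
        msg: "CAPACITY WARNING: {{ inventory_hostname }} disk at {{ capacity_data.disk_used_pct }}%"
      when:
        - capacity_data is defined
        # Cast explicitly; the stored value may be a string
        - capacity_data.disk_used_pct | float >= capacity_warn_pct
```

Replacing the `debug` task with a notification call would turn the warning into an alert.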

On-Call Handoff

- name: Generate on-call handoff report
  hosts: localhost
  tasks:
    - name: Collect recent incidents
      ansible.builtin.uri:
        url: "{{ pagerduty_api }}/incidents?since={{ handoff_start }}&until={{ handoff_end }}"
        headers:
          Authorization: "Token token={{ pagerduty_token }}"
      register: incidents

    - name: Collect pending changes
      ansible.builtin.uri:
        url: "{{ aap_url }}/api/v2/jobs/?status=pending"
        headers:
          Authorization: "Bearer {{ aap_token }}"
      register: pending_jobs

    - name: Generate handoff document
      ansible.builtin.template:
        src: oncall-handoff.j2
        dest: "/reports/handoff-{{ ansible_date_time.date }}.md"

    - name: Post to Slack
      ansible.builtin.uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: |
            📋 *On-Call Handoff - {{ ansible_date_time.date }}*
            Incidents this shift: {{ incidents.json.incidents | length }}
            Pending changes: {{ pending_jobs.json.count }}
            Full report: {{ report_url }}

Best Practices

- Runbooks as playbooks: every operational procedure should be an executable Ansible playbook.
- Auto-remediate known issues: if the fix is deterministic, automate it.
- Error budgets drive deployments: freeze deploys when the SLO is breached.
- Chaos engineering quarterly: regular failure injection validates resilience.
- Measure toil: track manual operational hours; target a 50% reduction per quarter.
- Post-incident automation: after every incident, ask "can we automate the fix?"
- Capacity alerts before crisis: alert at 80% utilization, not 95%.
- On-call handoff automation: generate shift reports automatically.

FAQ

Can Ansible replace a full SRE platform?

Ansible handles automation and orchestration. Combine with Prometheus (monitoring), PagerDuty (alerting), and Grafana (dashboards) for a complete SRE stack.

How do you handle incidents that need human judgment?

Use workflow approval nodes in Ansible Automation Platform (AAP): the automation runs up to the decision point, pauses for human input, then continues.
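
Outside AAP, a similar gate can be approximated on the command line with `ansible.builtin.pause`, which can prompt the operator and register the answer. This is a sketch of that alternative, not a replacement for approval nodes. It needs an interactive terminal, and the `myapp` service name is reused from the runbook example.

```yaml
---
- name: Remediation with a manual approval gate (sketch)
  hosts: "{{ target_host }}"
  become: true
  tasks:
    - name: Collect diagnostics before asking for a decision
      ansible.builtin.command: uptime
      register: diag
      changed_when: false

    - name: Ask the operator to approve the restart
      ansible.builtin.pause:
        prompt: "Load on {{ inventory_hostname }}: {{ diag.stdout }}. Type 'yes' to restart myapp"
      register: approval

    - name: Restart the service only after approval
      ansible.builtin.systemd:
        name: myapp
        state: restarted
      # Any answer other than 'yes' skips the restart
      when: approval.user_input | default('') | lower == 'yes'
```

The prompt is shown once per play rather than once per host, so this pattern fits best when `target_host` resolves to a single machine.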

Should you run chaos engineering in production?

Start in staging. Graduate to production with strict blast radius controls (single host, short duration, automatic rollback). Always have monitoring active during chaos tests.
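
Blast radius limits can be enforced by the playbook itself before any fault is injected. The sketch below assumes two hypothetical limit variables, `chaos_max_hosts` and `chaos_max_duration`, and aborts the run when the targeted hosts or the requested duration exceed them.

```yaml
---
- name: "CHAOS: Pre-flight blast radius checks (sketch)"
  hosts: "{{ chaos_target }}"
  gather_facts: false
  any_errors_fatal: true  # one failed check stops the run for every host
  vars:
    chaos_max_hosts: 1       # hypothetical limit: single host
    chaos_max_duration: 300  # hypothetical limit: seconds
  tasks:
    - name: Refuse to run outside the agreed blast radius
      ansible.builtin.assert:
        that:
          - ansible_play_hosts_all | length <= chaos_max_hosts
          - chaos_duration | default(300) | int <= chaos_max_duration
        fail_msg: "Chaos run exceeds blast radius limits, aborting"
```

Placing this play first in the chaos playbook means a mistyped host pattern fails fast instead of degrading a whole group.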

Conclusion

Ansible is the SRE automation backbone — turning manual runbooks into executable code, automating incident response, enforcing SLOs, and systematically eliminating toil. By codifying operational knowledge as playbooks, you make reliability a repeatable, scalable practice rather than heroic individual effort.
