Ansible for Site Reliability Engineering: SRE Practices with Automation

By Luca Berton · Published 2024-01-01 · Category: installation

Implement SRE practices with Ansible. Automate incident response, capacity planning, SLO enforcement, chaos engineering, and toil reduction playbooks.

Introduction

Site Reliability Engineering (SRE) is about running production systems reliably at scale. A core SRE principle is eliminating toil through automation, and Ansible is well suited to codifying operational runbooks, automating incident response, and enforcing service level objectives (SLOs). This guide covers SRE-specific patterns for using Ansible in production operations.

Toil Reduction

Toil is manual, repetitive, automatable work that scales linearly with service growth. Ansible eliminates it:

Toil	Manual Effort	Ansible Automation
Password resets	SSH → passwd	Playbook + Vault
Certificate renewal	Download, install, verify	Automated rotation
Log cleanup	SSH → find/rm	Scheduled playbook
Scaling	Provision → configure → deploy	One playbook
Incident response	Read runbook → SSH → fix	Auto-remediation
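
As one illustration of the certificate renewal row, a playbook can flag certificates that are close to expiry so that rotation is driven by data rather than by a calendar reminder. This is a minimal sketch, assuming the community.crypto collection is installed; the webservers group and the certificate path are hypothetical.

```yaml
---
- name: Flag certificates nearing expiry
  hosts: webservers  # hypothetical inventory group
  become: true
  tasks:
    - name: Read certificate validity
      community.crypto.x509_certificate_info:
        path: /etc/ssl/certs/app.pem  # hypothetical certificate path
        valid_at:
          in_30_days: "+30d"
      register: cert_info

    - name: Report certificates that expire within 30 days
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }}: certificate expires {{ cert_info.not_after }}, renewal needed"
      when: not cert_info.valid_at.in_30_days
```

The debug task could be swapped for whatever renewal step applies in your environment.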

Automate Common Operations

---
- name: SRE operational tasks
  hosts: all
  become: true
  tasks:
    - name: Clean old logs (weekly toil → automated)
      ansible.builtin.find:
        paths:
          - /var/log
          - /opt/app/logs
        patterns: "*.gz,*.old,*.1,*.2,*.3"
        age: 7d
      register: old_logs

    - name: Remove old logs
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_logs.files }}"

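    - name: Gather package facts (populates ansible_facts.packages)
      ansible.builtin.package_facts:
        manager: auto

    # Caution: this prunes ALL unused images, containers, networks, and volumes.
    # The package name varies by distribution (docker, docker-ce, docker.io).
    - name: Clean Docker artifacts
      ansible.builtin.command: docker system prune -af --volumes
      when: ansible_facts.packages.keys() | list | intersect(['docker', 'docker-ce', 'docker.io']) | length > 0
      changed_when: true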

    - name: Verify disk space after cleanup
      ansible.builtin.shell: df -h / | tail -1 | awk '{print $5}' | tr -d '%'
      register: disk_usage
      changed_when: false

    - name: Alert if still critical
      ansible.builtin.debug:
        msg: "ALERT: {{ inventory_hostname }} disk at {{ disk_usage.stdout }}% after cleanup"
      when: disk_usage.stdout | int > 85
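
The toil table lists log cleanup as a scheduled playbook. Outside of a scheduler such as AAP, one possible way to schedule it is a cron entry managed by Ansible itself. The sketch below assumes a control host group, a service account, and file paths that are all hypothetical.

```yaml
---
- name: Schedule the weekly cleanup playbook
  hosts: ansible_controller  # hypothetical control host group
  become: true
  tasks:
    - name: Run log cleanup every Sunday at 03:00
      ansible.builtin.cron:
        name: weekly-log-cleanup
        weekday: "0"
        hour: "3"
        minute: "0"
        user: ansible  # hypothetical service account
        # Hypothetical inventory, playbook, and log paths
        job: "ansible-playbook -i /etc/ansible/hosts /opt/playbooks/sre-cleanup.yml >> /var/log/sre-cleanup.log 2>&1"
```

Because the cron module identifies the entry by its name, rerunning the play updates the existing entry instead of adding a duplicate.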

Automated Incident Response

Runbook as Code

---
# runbooks/high-cpu.yml
- name: "RUNBOOK: High CPU Usage"
  hosts: "{{ target_host }}"
  become: true
  vars:
    # A distinct name is required here: defining incident_id in terms of
    # itself triggers a recursive templating error when no value is passed in.
    incident_ref: "{{ incident_id | default('manual') }}"
  tasks:
    - name: "Step 1: Identify top CPU consumers"
      ansible.builtin.shell: ps aux --sort=-%cpu | head -10
      register: top_procs
      changed_when: false

    - name: "Step 2: Check for known problematic processes"
      ansible.builtin.shell: |
        ps aux | awk '$3 > 80 {print $11}' | head -5
      register: high_cpu_procs
      changed_when: false

    - name: "Step 3: Check if OOM killer was triggered"
      ansible.builtin.shell: dmesg | grep -i "oom\|killed process" | tail -5
      register: oom_events
      changed_when: false

    - name: "Step 4: Auto-remediate known issues"
      block:
        - name: Restart runaway Java process
          ansible.builtin.systemd:
            name: myapp
            state: restarted
          when: "'java' in high_cpu_procs.stdout"

        - name: Clear application cache
          ansible.builtin.file:
            path: /opt/app/cache
            state: absent
          when: "'cache' in high_cpu_procs.stdout"
      rescue:
        - name: Remediation failed, escalate
          ansible.builtin.debug:
            msg: "Auto-remediation failed, escalating to on-call"

    - name: "Step 5: Verify resolution"
      ansible.builtin.shell: |
        sleep 30
        awk '{print $1}' /proc/loadavg
      register: load_after
      changed_when: false

    - name: "Step 6: Update incident"
      ansible.builtin.uri:
        url: "{{ incident_api }}/{{ incident_ref }}/notes"
        method: POST
        body_format: json
        body:
          note: |
            Auto-remediation executed:
            Top processes: {{ top_procs.stdout_lines[:5] | join(', ') }}
            Load after fix: {{ load_after.stdout }}
            OOM events: {{ oom_events.stdout_lines | length }}
      delegate_to: localhost
      when: incident_api is defined
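
Because the runbook targets whatever target_host contains, it can help to validate that input before any remediation runs. The following is a rough sketch of a guard play, assuming it is placed as the first play in the same file and that target_host is a single inventory hostname rather than a group or pattern.

```yaml
- name: Validate runbook inputs
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Require an explicit, known target
      ansible.builtin.assert:
        that:
          - target_host is defined
          - target_host in groups['all']
        fail_msg: "Pass -e target_host=<inventory hostname> for a host that exists in the inventory"
        success_msg: "Runbook will run against {{ target_host }}"
```

A failed assertion stops the run with a readable message before the remediation play starts.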

PagerDuty Integration

- name: Trigger PagerDuty incident
  ansible.builtin.uri:
    url: https://events.pagerduty.com/v2/enqueue
    method: POST
    body_format: json
    status_code: 202  # Events API v2 answers 202 Accepted; uri expects 200 by default
    body:
      routing_key: "{{ pagerduty_integration_key }}"
      event_action: trigger
      dedup_key: "{{ incident_dedup_key }}"
      payload:
        summary: "{{ alert_summary }}"
        severity: "{{ alert_severity }}"
        source: "{{ inventory_hostname }}"
        component: "{{ service_name }}"
        custom_details:
          runbook_url: "https://wiki.example.com/runbooks/{{ runbook_id }}"
          auto_remediation: "{{ remediation_attempted }}"
  delegate_to: localhost
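
The same endpoint can also close the incident once remediation has worked. As a sketch, a resolve event reuses the routing key and dedup_key from the trigger; the remediation_succeeded flag is a hypothetical variable that your runbook would need to set.

```yaml
- name: Resolve PagerDuty incident after successful remediation
  ansible.builtin.uri:
    url: https://events.pagerduty.com/v2/enqueue
    method: POST
    body_format: json
    status_code: 202
    body:
      routing_key: "{{ pagerduty_integration_key }}"
      event_action: resolve
      dedup_key: "{{ incident_dedup_key }}"
  delegate_to: localhost
  when: remediation_succeeded | default(false) | bool  # hypothetical flag
```

Matching on dedup_key is what ties the resolve event to the incident opened by the trigger event.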

SLO Enforcement

SLO Monitoring

- name: Check SLO compliance
  hosts: localhost
  tasks:
    - name: Query error rate from Prometheus
      ansible.builtin.uri:
        url: "{{ prometheus_url }}/api/v1/query"
        method: POST  # uri defaults to GET; POST sends the query as a form body
        body_format: form-urlencoded
        body:
          query: >
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
      register: error_rate

    - name: Check against SLO target
      ansible.builtin.set_fact:
        current_error_rate: "{{ error_rate.json.data.result[0].value[1] | float }}"
        slo_target: 0.001  # 99.9% availability = 0.1% error budget

    - name: Alert on SLO breach
      ansible.builtin.uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: |
            🚨 *SLO BREACH* — {{ service_name }}
            Current error rate: {{ (current_error_rate | float * 100) | round(3) }}%
            SLO target: {{ (slo_target | float * 100) | round(3) }}%
            Error budget remaining: {{ ((slo_target | float - current_error_rate | float) / slo_target | float * 100) | round(1) }}%
      when: current_error_rate | float > slo_target | float

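    # Job templates do not appear to expose an `enabled` field, whereas
    # schedules do. This version assumes deployments are launched by a schedule;
    # deploy_schedule_id is an assumed variable holding that schedule's ID.
    - name: Freeze deployments on SLO breach
      ansible.builtin.uri:
        url: "{{ aap_url }}/api/v2/schedules/{{ deploy_schedule_id }}/"
        method: PATCH
        headers:
          Authorization: "Bearer {{ aap_token }}"
        body_format: json
        body:
          enabled: false
      when: current_error_rate | float > slo_target | float * 2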
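
To make the error budget concrete, the SLO can be converted into time. The sketch below assumes a 30-day rolling window, which the original does not specify; with a 99.9% target it works out to roughly 43 minutes of full outage per window.

```yaml
- name: Error budget arithmetic
  hosts: localhost
  gather_facts: false
  vars:
    slo_availability: 0.999  # 99.9% target, matching slo_target of 0.001
    window_days: 30          # assumed rolling window
  tasks:
    - name: Convert the SLO into minutes of allowed unavailability
      ansible.builtin.debug:
        msg: >-
          Error budget: {{ ((1 - slo_availability) * 100) | round(3) }}% of requests,
          or about {{ ((1 - slo_availability) * window_days * 24 * 60) | round(1) }}
          minutes of full outage per {{ window_days }} days
```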

Chaos Engineering

Controlled Failure Injection

---
- name: "CHAOS: Network latency injection"
  hosts: "{{ chaos_target }}"
  become: true
  vars:
    chaos_duration: 300  # 5 minutes
    latency_ms: 200
  tasks:
    - name: Inject latency, then always clean up
      block:
        - name: Inject network latency
          ansible.builtin.command: >
            tc qdisc add dev eth0 root netem delay {{ latency_ms }}ms 50ms
          register: inject_result

        - name: Wait for chaos duration
          ansible.builtin.wait_for:
            timeout: "{{ chaos_duration }}"
      always:
        # Runs even if a task in the block fails, so latency is not left in place.
        - name: Remove latency injection
          ansible.builtin.command: >
            tc qdisc del dev eth0 root
          when: inject_result is succeeded

- name: "CHAOS: Service failure"
  hosts: "{{ chaos_target }}"
  become: true
  tasks:
    - name: Stop target service
      ansible.builtin.systemd:
        name: "{{ chaos_service }}"
        state: stopped

    - name: Verify system recovers
      ansible.builtin.pause:
        seconds: 120
        prompt: "Observing system behavior..."

    - name: Check auto-recovery
      ansible.builtin.systemd:
        name: "{{ chaos_service }}"
      register: service_status

    - name: Report if service didn't auto-recover
      ansible.builtin.debug:
        msg: "FINDING: {{ chaos_service }} did not auto-recover on {{ inventory_hostname }}"
      when: service_status.status.ActiveState != 'active'

    - name: Restore service
      ansible.builtin.systemd:
        name: "{{ chaos_service }}"
        state: started
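
A fixed two-minute pause reveals whether the service came back but not how quickly. As an alternative sketch, the pause and status check could be replaced by polling, which ends as soon as the service reports active. The retry count and delay are assumed values.

```yaml
    - name: Poll for auto-recovery
      ansible.builtin.command: systemctl is-active {{ chaos_service }}
      register: recovery_check
      changed_when: false
      until: recovery_check.stdout == 'active'
      retries: 12  # assumed: about two minutes in total with the delay below
      delay: 10
      ignore_errors: true  # continue to the report and restore steps on failure

    - name: Report if service did not recover within the polling window
      ansible.builtin.debug:
        msg: "FINDING: {{ chaos_service }} did not return to active on {{ inventory_hostname }}"
      when: recovery_check is failed
```

The ignore_errors setting matters here: without it, an exhausted retry loop would fail the host and skip the restore task.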

Capacity Planning

- name: Capacity planning data collection
  hosts: all
  tasks:
    - name: Collect resource metrics
      ansible.builtin.set_fact:
        capacity_data:
          hostname: "{{ inventory_hostname }}"
          cpu_count: "{{ ansible_processor_vcpus }}"
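          # Facts carry no CPU utilization figure; approximate it from the 1-minute
          # load average per vCPU (assumes the loadavg fact is available on your ansible-core version).
          cpu_usage_pct: "{{ (ansible_loadavg['1m'] / ansible_processor_vcpus * 100) | round(1) }}"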
          memory_total_gb: "{{ (ansible_memtotal_mb / 1024) | round(1) }}"
          memory_used_pct: "{{ ((1 - ansible_memfree_mb / ansible_memtotal_mb) * 100) | round(1) }}"
          disk_total_gb: "{{ (ansible_mounts[0].size_total / (1024**3)) | round(1) }}"
          disk_used_pct: "{{ ((1 - ansible_mounts[0].size_available / ansible_mounts[0].size_total) * 100) | round(1) }}"

- name: Generate capacity report
  hosts: localhost
  tasks:
    - name: Compile capacity data
      ansible.builtin.template:
        src: capacity-report.j2
        dest: "/reports/capacity-{{ ansible_date_time.date }}.html"
      vars:
        all_capacity: "{{ groups['all'] | map('extract', hostvars, 'capacity_data') | select('defined') | list }}"
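        # Values stored by set_fact come back as strings, so cast to float before comparing.
        hosts_above_80_cpu: "{{ all_capacity | map(attribute='cpu_usage_pct') | map('float') | select('gt', 80) | list | length }}"
        hosts_above_80_disk: "{{ all_capacity | map(attribute='disk_used_pct') | map('float') | select('gt', 80) | list | length }}"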
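
The collection play reads ansible_mounts[0], which is not guaranteed to be the root filesystem. One way to be explicit is to select the mount by path and alert at the 80% threshold this guide recommends. This is a sketch that assumes every host has a mount at /, which may not hold in some containers.

```yaml
- name: Flag hosts nearing disk capacity
  hosts: all
  vars:
    disk_alert_pct: 80  # threshold from the 80% guidance in this guide
  tasks:
    - name: Select the root filesystem from gathered facts
      ansible.builtin.set_fact:
        root_mount: "{{ ansible_mounts | selectattr('mount', 'equalto', '/') | first }}"

    - name: Warn when root filesystem usage reaches the threshold
      ansible.builtin.debug:
        msg: "CAPACITY: {{ inventory_hostname }} root filesystem at {{ root_used_pct }}%"
      vars:
        root_used_pct: "{{ ((1 - root_mount.size_available / root_mount.size_total) * 100) | round(1) }}"
      when: root_used_pct | float >= disk_alert_pct
```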

On-Call Handoff

- name: Generate on-call handoff report
  hosts: localhost
  tasks:
    - name: Collect recent incidents
      ansible.builtin.uri:
        url: "{{ pagerduty_api }}/incidents?since={{ handoff_start }}&until={{ handoff_end }}"
        headers:
          Authorization: "Token token={{ pagerduty_token }}"
      register: incidents

    - name: Collect pending changes
      ansible.builtin.uri:
        url: "{{ aap_url }}/api/v2/jobs/?status=pending"
        headers:
          Authorization: "Bearer {{ aap_token }}"
      register: pending_jobs

    - name: Generate handoff document
      ansible.builtin.template:
        src: oncall-handoff.j2
        dest: "/reports/handoff-{{ ansible_date_time.date }}.md"

    - name: Post to Slack
      ansible.builtin.uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: |
            📋 *On-Call Handoff — {{ ansible_date_time.date }}*
            Incidents this shift: {{ incidents.json.incidents | length }}
            Pending changes: {{ pending_jobs.json.count }}
            Full report: {{ report_url }}

Best Practices

Runbooks as playbooks — Every operational procedure should be an executable Ansible playbook
Auto-remediate known issues — If the fix is deterministic, automate it
Error budgets drive deployments — Freeze deploys when SLO is breached
Chaos engineering quarterly — Regular failure injection validates resilience
Measure toil — Track manual operational hours; target 50% reduction per quarter
Post-incident automation — After every incident, ask "can we automate the fix?"
Capacity alerts before crisis — Alert at 80% utilization, not 95%
On-call handoff automation — Generate shift reports automatically

FAQ

Can Ansible replace a full SRE platform?

Ansible handles automation and orchestration. Combine it with Prometheus (monitoring), PagerDuty (alerting), and Grafana (dashboards) for a complete SRE stack.

How do you handle incidents that need human judgment?

Use AAP workflow approval nodes — automation runs up to the decision point, pauses for human input, then continues.
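
Approval nodes are configured in AAP rather than in playbook YAML. For interactive command-line runs without AAP, a rough substitute (a different technique, not the approval node itself) is the pause module, which needs a terminal to collect input. The service_name variable is reused from earlier examples and the wording of the prompt is illustrative.

```yaml
    - name: Wait for a human decision before the risky step
      ansible.builtin.pause:
        prompt: "Restart {{ service_name }}? Type 'yes' to continue"
      register: approval

    - name: Stop if the operator did not approve
      ansible.builtin.fail:
        msg: "Remediation not approved; escalate to on-call"
      when: approval.user_input | default('') | lower != 'yes'
```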

Should you run chaos engineering in production?

Start in staging. Graduate to production with strict blast radius controls (single host, short duration, automatic rollback). Always have monitoring active during chaos tests.
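
The blast radius controls described above can be enforced in code before any failure is injected. The sketch below would run as a pre-flight play ahead of the chaos plays. The limits, the production group name, and the allow_production_chaos flag are all assumptions made for illustration.

```yaml
- name: Chaos pre-flight checks
  hosts: "{{ chaos_target }}"
  gather_facts: false
  vars:
    max_chaos_hosts: 1       # assumed limit: one host per experiment
    max_chaos_duration: 600  # assumed ceiling, in seconds
  tasks:
    - name: Enforce blast radius limits before injecting any failure
      ansible.builtin.assert:
        that:
          - ansible_play_hosts_all | length <= max_chaos_hosts
          - chaos_duration | default(300) | int <= max_chaos_duration
          # 'production' is a hypothetical inventory group name
          - "'production' not in group_names or allow_production_chaos | default(false) | bool"
        fail_msg: "Chaos experiment exceeds the allowed blast radius"
      run_once: true
```

Hosts that fail this play are dropped from later plays in the same run, so the injection plays should not reach them.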

Conclusion

Ansible is the SRE automation backbone — turning manual runbooks into executable code, automating incident response, enforcing SLOs, and systematically eliminating toil. By codifying operational knowledge as playbooks, you make reliability a repeatable, scalable practice rather than heroic individual effort.
