Ansible for Site Reliability Engineering: SRE Practices with Automation
By Luca Berton · Published 2024-01-01 · Category: installation
Implement SRE practices with Ansible. Automate incident response, capacity planning, SLO enforcement, chaos engineering, and toil reduction playbooks.
Introduction
Site Reliability Engineering (SRE) is about running production systems reliably at scale. A core SRE principle is eliminating toil through automation, and Ansible is well suited to codifying operational runbooks, automating incident response, and enforcing service level objectives. This guide covers SRE-specific patterns for using Ansible in production operations.
Toil Reduction
Toil is manual, repetitive, automatable work that scales linearly with service growth. Ansible can take over much of it:
| Toil | Manual Effort | Ansible Automation |
|------|---------------|--------------------|
| Password resets | SSH → passwd | Playbook + Vault |
| Certificate renewal | Download, install, verify | Automated rotation |
| Log cleanup | SSH → find/rm | Scheduled playbook |
| Scaling | Provision → configure → deploy | One playbook |
| Incident response | Read runbook → SSH → fix | Auto-remediation |
Automate Common Operations
---
- name: SRE operational tasks
  hosts: all
  become: true
  tasks:
    # Populates ansible_facts.packages, which the Docker check below relies on
    - name: Gather installed package facts
      ansible.builtin.package_facts:
        manager: auto

    - name: Clean old logs (weekly toil → automated)
      ansible.builtin.find:
        paths:
          - /var/log
          - /opt/app/logs
        patterns: "*.gz,*.old,*.1,*.2,*.3"
        age: 7d
      register: old_logs

    - name: Remove old logs
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_logs.files }}"

    # Destructive: removes all unused images, containers, and volumes
    - name: Clean Docker artifacts
      ansible.builtin.command: docker system prune -af --volumes
      when: "'docker' in ansible_facts.packages"
      changed_when: true

    - name: Verify disk space after cleanup
      ansible.builtin.shell: df -h / | tail -1 | awk '{print $5}' | tr -d '%'
      register: disk_usage
      changed_when: false

    - name: Alert if still critical
      ansible.builtin.debug:
        msg: "ALERT: {{ inventory_hostname }} disk at {{ disk_usage.stdout }}% after cleanup"
      when: disk_usage.stdout | int > 85
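The "scheduled playbook" idea from the toil table can be sketched with a cron entry on the control node. This is a minimal sketch, not a definitive setup: the playbook path, inventory path, `ansible` user, and log file are all hypothetical and would need to match your environment.

```yaml
---
# Sketch: schedule the cleanup playbook weekly from the control node.
# The playbook path, inventory path, user, and log file are assumptions.
- name: Schedule weekly SRE cleanup
  hosts: localhost
  connection: local
  become: true
  tasks:
    - name: Run log and disk cleanup every Sunday at 03:00
      ansible.builtin.cron:
        name: sre-weekly-cleanup
        user: ansible
        weekday: "0"
        hour: "3"
        minute: "0"
        job: >-
          ansible-playbook -i /opt/ansible/inventory
          /opt/ansible/playbooks/sre-cleanup.yml
          >> /var/log/sre-cleanup.log 2>&1
```

If you already run AAP or AWX, a schedule attached to a job template is probably a better fit than cron, since it keeps run history in one place.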
Automated Incident Response
Runbook as Code
---
# runbooks/high-cpu.yml
- name: "RUNBOOK: High CPU Usage"
  hosts: "{{ target_host }}"
  become: true
  # incident_id is passed with -e; a self-referencing default under vars
  # would cause a recursive template error, so the default is applied at use.
  tasks:
    - name: "Step 1: Identify top CPU consumers"
      ansible.builtin.shell: ps aux --sort=-%cpu | head -10
      register: top_procs
      changed_when: false

    - name: "Step 2: Check for known problematic processes"
      ansible.builtin.shell: |
        ps aux | awk '$3 > 80 {print $11}' | head -5
      register: high_cpu_procs
      changed_when: false

    - name: "Step 3: Check if OOM killer was triggered"
      ansible.builtin.shell: dmesg | grep -i "oom\|killed process" | tail -5
      register: oom_events
      changed_when: false

    - name: "Step 4: Auto-remediate known issues"
      block:
        - name: Restart runaway Java process
          ansible.builtin.systemd:
            name: myapp
            state: restarted
          when: "'java' in high_cpu_procs.stdout"

        - name: Clear application cache
          ansible.builtin.file:
            path: /opt/app/cache
            state: absent
          when: "'cache' in high_cpu_procs.stdout"
      rescue:
        - name: Remediation failed, escalate
          ansible.builtin.debug:
            msg: "Auto-remediation failed, escalating to on-call"

    - name: "Step 5: Verify resolution"
      ansible.builtin.shell: |
        sleep 30
        cat /proc/loadavg | awk '{print $1}'
      register: load_after
      changed_when: false

    - name: "Step 6: Update incident"
      ansible.builtin.uri:
        url: "{{ incident_api }}/{{ incident_id | default('manual') }}/notes"
        method: POST
        body_format: json
        body:
          note: |
            Auto-remediation executed:
            Top processes: {{ top_procs.stdout_lines[:5] | join(', ') }}
            Load after fix: {{ load_after.stdout }}
            OOM events: {{ oom_events.stdout_lines | length }}
      delegate_to: localhost
      become: false  # the API call runs on the control node without privilege escalation
      when: incident_api is defined
PagerDuty Integration
- name: Trigger PagerDuty incident
  ansible.builtin.uri:
    url: https://events.pagerduty.com/v2/enqueue
    method: POST
    body_format: json
    status_code: 202  # the Events API v2 answers accepted events with 202, not 200
    body:
      routing_key: "{{ pagerduty_integration_key }}"
      event_action: trigger
      dedup_key: "{{ incident_dedup_key }}"
      payload:
        summary: "{{ alert_summary }}"
        severity: "{{ alert_severity }}"
        source: "{{ inventory_hostname }}"
        component: "{{ service_name }}"
        custom_details:
          runbook_url: "https://wiki.example.com/runbooks/{{ runbook_id }}"
          auto_remediation: "{{ remediation_attempted }}"
  delegate_to: localhost
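Once remediation succeeds, the same Events API endpoint can close the alert by sending a `resolve` event with the `dedup_key` used for the trigger. The task below is a minimal sketch; `remediation_succeeded` is a hypothetical variable that your runbook would need to set, and the other variables are the ones used in the trigger task.

```yaml
# Sketch: resolve the alert that the trigger task opened.
# remediation_succeeded is a hypothetical flag set elsewhere in the runbook.
- name: Resolve PagerDuty incident after successful remediation
  ansible.builtin.uri:
    url: https://events.pagerduty.com/v2/enqueue
    method: POST
    body_format: json
    status_code: 202
    body:
      routing_key: "{{ pagerduty_integration_key }}"
      event_action: resolve
      dedup_key: "{{ incident_dedup_key }}"
  delegate_to: localhost
  when: remediation_succeeded | default(false) | bool
```

Reusing the same `dedup_key` is what ties the resolve event to the earlier trigger, so it is worth deriving the key from something stable such as the host and alert name.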
SLO Enforcement
SLO Monitoring
- name: Check SLO compliance
  hosts: localhost
  tasks:
    - name: Query error rate from Prometheus
      ansible.builtin.uri:
        url: "{{ prometheus_url }}/api/v1/query"
        method: POST  # send the query as a form-encoded POST body
        body_format: form-urlencoded
        body:
          query: >
            1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
            )
      register: error_rate

    - name: Check against SLO target
      ansible.builtin.set_fact:
        current_error_rate: "{{ error_rate.json.data.result[0].value[1] | float }}"
        slo_target: 0.001  # 99.9% availability = 0.1% error budget

    - name: Alert on SLO breach
      ansible.builtin.uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: |
            🚨 *SLO BREACH* - {{ service_name }}
            Current error rate: {{ (current_error_rate | float * 100) | round(3) }}%
            SLO target: {{ (slo_target | float * 100) | round(3) }}%
            Error budget remaining: {{ ((slo_target | float - current_error_rate | float) / slo_target | float * 100) | round(1) }}%
      when: current_error_rate | float > slo_target | float

    # Verify that your controller version exposes a writable `enabled` field on this endpoint
    - name: Freeze deployments on SLO breach
      ansible.builtin.uri:
        url: "{{ aap_url }}/api/v2/job_templates/{{ deploy_template_id }}/"
        method: PATCH
        headers:
          Authorization: "Bearer {{ aap_token }}"
        body_format: json
        body:
          enabled: false
      when: current_error_rate | float > slo_target | float * 2
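The error budget arithmetic in the Slack message is easier to check with concrete numbers. The play below is a small worked sketch using made-up values, not real measurements: with an allowed error rate of 0.001 and a measured rate of 0.0004, the service has consumed 40% of its budget and has 60% left.

```yaml
---
# Sketch: error budget arithmetic with example numbers (not real measurements).
- name: Error budget worked example
  hosts: localhost
  gather_facts: false
  vars:
    slo_target: 0.001           # allowed error rate for a 99.9% SLO
    current_error_rate: 0.0004  # hypothetical measured error rate
  tasks:
    - name: Compute budget consumed and remaining
      ansible.builtin.set_fact:
        budget_consumed_pct: "{{ (current_error_rate / slo_target * 100) | round(1) }}"
        budget_remaining_pct: "{{ ((slo_target - current_error_rate) / slo_target * 100) | round(1) }}"

    - name: Show the result
      ansible.builtin.debug:
        msg: "Consumed {{ budget_consumed_pct }}%, remaining {{ budget_remaining_pct }}%"
```

Note that the remaining budget goes negative once the error rate exceeds the target, which is exactly the condition under which the breach alert fires.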
Chaos Engineering
Controlled Failure Injection
---
- name: "CHAOS: Network latency injection"
  hosts: "{{ chaos_target }}"
  become: true
  vars:
    chaos_duration: 300  # 5 minutes
    latency_ms: 200
  tasks:
    # block/always ensures the latency rule is removed even if a step fails
    - name: Inject latency and always clean up
      block:
        - name: Inject network latency
          ansible.builtin.command: >
            tc qdisc add dev eth0 root netem delay {{ latency_ms }}ms 50ms
          register: inject_result

        - name: Wait for chaos duration
          ansible.builtin.wait_for:
            timeout: "{{ chaos_duration }}"
      always:
        - name: Remove latency injection
          ansible.builtin.command: >
            tc qdisc del dev eth0 root
          when: inject_result is defined and inject_result is succeeded

- name: "CHAOS: Service failure"
  hosts: "{{ chaos_target }}"
  become: true
  tasks:
    - name: Stop target service
      ansible.builtin.systemd:
        name: "{{ chaos_service }}"
        state: stopped

    - name: Verify system recovers
      ansible.builtin.pause:
        seconds: 120
        prompt: "Observing system behavior..."

    - name: Check auto-recovery
      ansible.builtin.systemd:
        name: "{{ chaos_service }}"
      register: service_status

    - name: Report if service didn't auto-recover
      ansible.builtin.debug:
        msg: "FINDING: {{ chaos_service }} did not auto-recover on {{ inventory_hostname }}"
      when: service_status.status.ActiveState != 'active'

    - name: Restore service
      ansible.builtin.systemd:
        name: "{{ chaos_service }}"
        state: started
Capacity Planning
- name: Capacity planning data collection
  hosts: all
  tasks:
    - name: Collect resource metrics
      ansible.builtin.set_fact:
        capacity_data:
          hostname: "{{ inventory_hostname }}"
          cpu_count: "{{ ansible_processor_vcpus }}"
          # Approximation: 1-minute load average relative to vCPU count (needs the loadavg fact)
          cpu_usage_pct: "{{ (ansible_loadavg['1m'] / ansible_processor_vcpus * 100) | round(1) }}"
          memory_total_gb: "{{ (ansible_memtotal_mb / 1024) | round(1) }}"
          memory_used_pct: "{{ ((1 - ansible_memfree_mb / ansible_memtotal_mb) * 100) | round(1) }}"
          disk_total_gb: "{{ (ansible_mounts[0].size_total / (1024**3)) | round(1) }}"
          disk_used_pct: "{{ ((1 - ansible_mounts[0].size_available / ansible_mounts[0].size_total) * 100) | round(1) }}"

- name: Generate capacity report
  hosts: localhost
  tasks:
    - name: Compile capacity data
      ansible.builtin.template:
        src: capacity-report.j2
        dest: "/reports/capacity-{{ ansible_date_time.date }}.html"
      vars:
        all_capacity: "{{ groups['all'] | map('extract', hostvars, 'capacity_data') | select('defined') | list }}"
        # Cast to float before comparing so string-typed facts compare numerically
        hosts_above_80_cpu: "{{ all_capacity | map(attribute='cpu_usage_pct') | map('float') | select('gt', 80) | list | length }}"
        hosts_above_80_disk: "{{ all_capacity | map(attribute='disk_used_pct') | map('float') | select('gt', 80) | list | length }}"
On-Call Handoff
- name: Generate on-call handoff report
  hosts: localhost
  tasks:
    - name: Collect recent incidents
      ansible.builtin.uri:
        url: "{{ pagerduty_api }}/incidents?since={{ handoff_start }}&until={{ handoff_end }}"
        headers:
          Authorization: "Token token={{ pagerduty_token }}"
      register: incidents

    - name: Collect pending changes
      ansible.builtin.uri:
        url: "{{ aap_url }}/api/v2/jobs/?status=pending"
        headers:
          Authorization: "Bearer {{ aap_token }}"
      register: pending_jobs

    - name: Generate handoff document
      ansible.builtin.template:
        src: oncall-handoff.j2
        dest: "/reports/handoff-{{ ansible_date_time.date }}.md"

    - name: Post to Slack
      ansible.builtin.uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: |
            📋 *On-Call Handoff - {{ ansible_date_time.date }}*
            Incidents this shift: {{ incidents.json.incidents | length }}
            Pending changes: {{ pending_jobs.json.count }}
            Full report: {{ report_url }}
Best Practices
• Runbooks as playbooks: Every operational procedure should be an executable Ansible playbook
• Auto-remediate known issues: If the fix is deterministic, automate it
• Error budgets drive deployments: Freeze deploys when the SLO is breached
• Chaos engineering quarterly: Regular failure injection validates resilience
• Measure toil: Track manual operational hours; target a 50% reduction per quarter
• Post-incident automation: After every incident, ask "can we automate the fix?"
• Capacity alerts before crisis: Alert at 80% utilization, not 95%
• On-call handoff automation: Generate shift reports automatically
FAQ
Can Ansible replace a full SRE platform?
No. Ansible handles automation and orchestration. Combine it with Prometheus (monitoring), PagerDuty (alerting), and Grafana (dashboards) for a complete SRE stack.
How do you handle incidents that need human judgment?
Use AAP workflow approval nodes: automation runs up to the decision point, pauses for human input, then continues.
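For command-line runs without AAP, the same "run, pause, continue" shape can be approximated with `ansible.builtin.pause`. This is a stand-in sketch for the idea, not an AAP workflow approval node; the `myapp` service and `target_host` variable are carried over from the runbook example, and the `uptime` diagnostic is only a placeholder.

```yaml
---
# Sketch: a manual approval gate for interactive command-line runs.
# This approximates the idea; it is not an AAP workflow approval node.
- name: Remediation with a human decision point
  hosts: "{{ target_host }}"
  become: true
  tasks:
    - name: Collect diagnostics before the decision
      ansible.builtin.command: uptime
      register: diag
      changed_when: false

    - name: Show diagnostics to the operator
      ansible.builtin.debug:
        var: diag.stdout

    # pause runs once for the whole batch, so the prompt lists every host
    - name: Pause for operator approval
      ansible.builtin.pause:
        prompt: "Restart myapp on {{ ansible_play_hosts | join(', ') }}? Type 'yes' to continue"
      register: approval

    - name: Restart the service only when approved
      ansible.builtin.systemd:
        name: myapp
        state: restarted
      when: approval.user_input | default('') | lower == 'yes'
```

Because the prompt needs an interactive terminal, this pattern suits an operator running the playbook by hand rather than unattended automation.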
Should you run chaos engineering in production?
Start in staging. Graduate to production with strict blast radius controls (single host, short duration, automatic rollback). Always have monitoring active during chaos tests.
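The blast radius controls mentioned above (single host, short duration, automatic rollback) could be encoded directly in the play. The following is a sketch under stated assumptions: the `staging` group name, the 300-second ceiling, and the 60-second default are hypothetical, while `chaos_target`, `chaos_service`, and `chaos_duration` follow the earlier chaos examples.

```yaml
---
# Sketch: guard rails for a chaos play. Group name and limits are assumptions.
- name: "CHAOS: guarded service stop"
  hosts: "{{ chaos_target }}"
  become: true
  serial: 1  # one host at a time
  vars:
    chaos_allowed_group: staging  # hypothetical inventory group
    max_chaos_seconds: 300
  tasks:
    - name: Refuse to run outside the allowed group or beyond the time limit
      ansible.builtin.assert:
        that:
          - inventory_hostname in groups[chaos_allowed_group] | default([])
          - chaos_duration | default(60) | int <= max_chaos_seconds
        fail_msg: "Chaos run rejected: host or duration outside the blast radius"

    - name: Inject failure with guaranteed rollback
      block:
        - name: Stop the target service
          ansible.builtin.systemd:
            name: "{{ chaos_service }}"
            state: stopped

        - name: Observe for the chaos duration
          ansible.builtin.wait_for:
            timeout: "{{ chaos_duration | default(60) | int }}"
      always:
        - name: Restore the service
          ansible.builtin.systemd:
            name: "{{ chaos_service }}"
            state: started
```

Putting the guard in an `assert` task at the top means a mistyped target fails fast, before anything is stopped.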
Conclusion
Ansible can serve as the automation backbone for SRE work: it turns manual runbooks into executable code, automates incident response, enforces SLOs, and systematically eliminates toil. By codifying operational knowledge as playbooks, you make reliability a repeatable, scalable practice rather than a heroic individual effort.
Related Articles
• Ansible Monitoring and Observability
• Ansible Disaster Recovery
• Ansible Performance Optimization