Ansible Disaster Recovery Automation: Backup, Failover, and Recovery Playbooks
By Luca Berton · Published 2024-01-01 · Category: troubleshooting
Automate disaster recovery with Ansible. Build backup workflows, automated failover, infrastructure rebuild, and recovery testing playbooks for enterprise.
Introduction
When disaster strikes — datacenter outage, ransomware attack, hardware failure — recovery speed determines business impact. Manual runbooks fail under pressure. Ansible automates disaster recovery end-to-end: scheduled backups, automated failover, infrastructure rebuild from code, and regular DR testing to ensure your recovery actually works.
See also: Ansible Automation Platform High Availability and Disaster Recovery: Single Topology Architecture
DR Automation Strategy
 Prevention     Detection        Response        Recovery
┌───────────┐  ┌──────────┐  ┌───────────────┐  ┌─────────┐
│ Scheduled │  │ Health   │  │ Automated     │  │ Rebuild │
│ Backups   │  │ Checks   │  │ Failover      │  │ from    │
│ Config    │  │ Alerts   │  │ DNS Switch    │  │ Code    │
│ Snapshots │  │ Monitors │  │ Traffic Shift │  │ Restore │
└───────────┘  └──────────┘  └───────────────┘  └─────────┘
Automated Backups
Database Backups
---
- name: Automated database backup
  hosts: database_servers
  become: true
  vars:
    backup_dir: /backup/databases
    retention_days: 30
    s3_bucket: myorg-db-backups
    # Fact gathered once per run, so every task sees the same timestamp
    backup_stamp: "{{ ansible_date_time.iso8601_basic_short }}"

  tasks:
    - name: Create backup directory
      ansible.builtin.file:
        path: "{{ backup_dir }}"
        state: directory
        # pg_dump below runs as postgres and must be able to write here
        owner: "{{ 'postgres' if db_type == 'postgresql' else 'root' }}"
        mode: '0750'

    - name: Backup PostgreSQL
      ansible.builtin.shell: |
        pg_dump -Fc {{ db_name }} > {{ backup_dir }}/{{ db_name }}-{{ backup_stamp }}.dump
      become_user: postgres
      when: db_type == 'postgresql'

    - name: Backup MySQL
      community.mysql.mysql_db:
        state: dump
        name: "{{ db_name }}"
        target: "{{ backup_dir }}/{{ db_name }}-{{ backup_stamp }}.sql.gz"
        login_user: backup_user
        login_password: "{{ vault_mysql_backup_pass }}"
      when: db_type == 'mysql'

    # fileglob only sees the control node, so locate the new files on the DB host
    - name: Find backups from this run
      ansible.builtin.find:
        paths: "{{ backup_dir }}"
        patterns: "{{ db_name }}-{{ backup_stamp }}.*"
      register: new_backups

    # Runs on the database host, which therefore needs boto3 and AWS credentials
    - name: Upload to S3
      amazon.aws.s3_object:
        bucket: "{{ s3_bucket }}"
        object: "{{ inventory_hostname }}/{{ item.path | basename }}"
        src: "{{ item.path }}"
        mode: put
        encryption_mode: aws:kms
      loop: "{{ new_backups.files }}"
      loop_control:
        label: "{{ item.path }}"

    - name: Clean old local backups
      ansible.builtin.find:
        paths: "{{ backup_dir }}"
        age: "{{ retention_days }}d"
      register: old_backups

    - name: Remove old backups
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_backups.files }}"
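A backup is only useful if it can be read back. As a minimal sketch, assuming the PostgreSQL custom-format dumps written to `backup_dir` by the play above, the following play lists each recent dump's table of contents with `pg_restore --list`. This is a cheap sanity check that the archive header and table of contents are readable; it does not replace a full test restore. The one-day window is an illustrative assumption.

```yaml
---
- name: Verify recent PostgreSQL dumps are readable
  hosts: database_servers
  become: true
  vars:
    backup_dir: /backup/databases
  tasks:
    - name: Find dumps written in the last day
      ansible.builtin.find:
        paths: "{{ backup_dir }}"
        patterns: "*.dump"
        age: "-1d"  # negative age selects files newer than one day
      register: recent_dumps

    - name: Fail if no recent dump exists
      ansible.builtin.assert:
        that: recent_dumps.files | length > 0
        fail_msg: "No dump newer than one day in {{ backup_dir }}"

    - name: Read each dump's table of contents
      ansible.builtin.command:
        argv:
          - pg_restore
          - --list
          - "{{ item.path }}"
      loop: "{{ recent_dumps.files }}"
      loop_control:
        label: "{{ item.path }}"
      changed_when: false
```

Running this right after the backup play turns a silently missing or unreadable dump into a failed run that monitoring can pick up.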
Configuration Backup
- name: Backup all server configurations
  hosts: all
  become: true
  tasks:
    - name: Archive critical configs
      community.general.archive:
        path:
          - /etc/nginx/
          - /etc/ssh/sshd_config
          - /etc/systemd/system/
          - /etc/crontab
          - /etc/fstab
        dest: "/tmp/{{ inventory_hostname }}-config-{{ ansible_date_time.date }}.tar.gz"
        format: gz

    - name: Fetch to backup server
      ansible.builtin.fetch:
        src: "/tmp/{{ inventory_hostname }}-config-{{ ansible_date_time.date }}.tar.gz"
        dest: "backups/configs/"
        flat: false
Health Monitoring
- name: Infrastructure health check
  hosts: all
  tasks:
    - name: Check disk space
      ansible.builtin.shell: df -h / | tail -1 | awk '{print $5}' | tr -d '%'
      register: disk_usage
      changed_when: false

    - name: Check critical services
      ansible.builtin.systemd:
        name: "{{ item }}"
      register: service_status
      loop: "{{ critical_services }}"
      failed_when: false

    - name: Check database replication
      ansible.builtin.shell: |
        psql -t -c "SELECT status FROM pg_stat_wal_receiver;"
      register: repl_status
      when: "'database' in group_names"
      # become_user has no effect unless become is enabled
      become: true
      become_user: postgres
      changed_when: false

    - name: Alert on issues
      ansible.builtin.uri:
        url: "{{ pagerduty_webhook }}"
        method: POST
        body_format: json
        # The PagerDuty Events API v2 answers 202 Accepted on success
        status_code: [200, 202]
        body:
          routing_key: "{{ pagerduty_key }}"
          event_action: trigger
          payload:
            summary: "DR Alert: {{ alert_message }}"
            severity: critical
            source: "{{ inventory_hostname }}"
      delegate_to: localhost
      when: >-
        disk_usage.stdout | int > 90 or
        service_status.results | selectattr('status.ActiveState', 'ne', 'active') | list | length > 0
Automated Failover
DNS Failover
- name: Failover to DR site
  hosts: localhost
  connection: local
  vars:
    primary_ip: 203.0.113.10
    dr_ip: 198.51.100.10
    domain: app.example.com
  tasks:
    - name: Check primary site health
      ansible.builtin.uri:
        url: "https://{{ primary_ip }}/health"
        validate_certs: false
        timeout: 10
      register: primary_health
      failed_when: false

    - name: Failover DNS to DR site
      amazon.aws.route53:
        zone: example.com
        record: "{{ domain }}"
        type: A
        value: "{{ dr_ip }}"
        ttl: 60
        overwrite: true
        state: present
      when: primary_health.status is not defined or primary_health.status != 200

    - name: Notify team of failover
      ansible.builtin.uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: "⚠️ FAILOVER ACTIVATED: {{ domain }} now pointing to DR site ({{ dr_ip }})"
      when: primary_health.status is not defined or primary_health.status != 200
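Automated failover benefits from a manual override, so operators can block a DNS switch during planned maintenance. One possible guard, sketched here under the assumption that operators signal a hold by creating a file on the control node (the path `/etc/ansible/dr/failover.hold` is hypothetical), ends the play before any change is made:

```yaml
---
- name: Failover guard with manual override
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    # Hypothetical path: operators create this file to block automated failover
    failover_hold_file: /etc/ansible/dr/failover.hold
  tasks:
    - name: Check for an operator hold file
      ansible.builtin.stat:
        path: "{{ failover_hold_file }}"
      register: hold_file

    - name: Report that failover is on hold
      ansible.builtin.debug:
        msg: "Hold file {{ failover_hold_file }} present; skipping automated failover"
      when: hold_file.stat.exists

    - name: Stop before any DNS change when the hold file exists
      ansible.builtin.meta: end_play
      when: hold_file.stat.exists

    # The health check and Route 53 tasks from the failover play would follow here.
```

A file is only one way to express the override; an extra variable passed with `-e` or a flag in a key-value store would work along the same lines.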
Database Failover
- name: Promote database replica
  hosts: db_replica
  become: true
  tasks:
    - name: Check if primary is down
      ansible.builtin.wait_for:
        host: "{{ primary_db_host }}"
        port: 5432
        timeout: 30
      register: primary_check
      failed_when: false

    # On Debian/Ubuntu, pg_ctl usually lives in /usr/lib/postgresql/16/bin
    - name: Promote replica to primary
      ansible.builtin.command: pg_ctl promote -D /var/lib/postgresql/16/main
      become_user: postgres
      when: primary_check.elapsed >= 30

    - name: Update connection strings
      ansible.builtin.template:
        src: db-config.j2
        dest: /etc/myapp/database.conf
      delegate_to: "{{ item }}"
      loop: "{{ groups['app_servers'] }}"
      vars:
        db_host: "{{ inventory_hostname }}"
      when: primary_check.elapsed >= 30
      notify: restart application

  handlers:
    # Service name "myapp" is an assumption inferred from /etc/myapp
    - name: restart application
      ansible.builtin.service:
        name: myapp
        state: restarted
      delegate_to: "{{ item }}"
      loop: "{{ groups['app_servers'] }}"
Infrastructure Rebuild
- name: Rebuild infrastructure from code
  hosts: localhost
  connection: local
  tasks:
    - name: Provision infrastructure with Terraform
      cloud.terraform.terraform:
        project_path: ./terraform/dr-site/
        state: present
        variables:
          environment: dr-recovery
          region: us-west-2
      register: tf_output

    - name: Wait for instances
      ansible.builtin.wait_for:
        host: "{{ item }}"
        port: 22
        timeout: 300
      loop: "{{ tf_output.outputs.instance_ips.value }}"

    - name: Add new hosts to inventory
      ansible.builtin.add_host:
        name: "{{ item }}"
        groups: recovered_servers
      loop: "{{ tf_output.outputs.instance_ips.value }}"

- name: Configure recovered servers
  hosts: recovered_servers
  become: true
  roles:
    - common
    - security_baseline
    - monitoring

- name: Restore data
  hosts: recovered_servers
  become: true
  tasks:
    # Assumes the newest dump is also published under the latest/ prefix
    - name: Download latest backup from S3
      amazon.aws.s3_object:
        bucket: myorg-db-backups
        object: "latest/{{ db_name }}.dump"
        dest: /tmp/restore.dump
        mode: get

    # pg_restore -d expects the target database to exist already
    - name: Restore database
      ansible.builtin.command: pg_restore -d {{ db_name }} /tmp/restore.dump
      become_user: postgres
DR Testing
- name: Quarterly DR test
  hosts: localhost
  tasks:
    # ansible_date_time is frozen at fact gathering, so read the clock explicitly
    - name: Record DR test start time
      ansible.builtin.set_fact:
        dr_start_time: "{{ lookup('pipe', 'date +%s') }}"

    - name: Create DR test ticket
      servicenow.itsm.change_request:
        state: new
        type: normal
        short_description: "Quarterly DR Test - {{ ansible_date_time.date }}"
        description: "Automated DR test execution"
      register: dr_ticket

    - name: Provision DR test environment
      ansible.builtin.include_role:
        name: dr_rebuild
      vars:
        # 'environment' is a reserved Ansible keyword, so the role is assumed
        # to read dr_environment instead
        dr_environment: dr-test
        isolated: true

    - name: Run application smoke tests
      ansible.builtin.uri:
        url: "https://{{ dr_test_endpoint }}/health"
        validate_certs: false
      register: smoke_test
      retries: 5
      delay: 30
      until: smoke_test.status == 200

    - name: Measure RTO
      ansible.builtin.set_fact:
        rto_minutes: "{{ (((lookup('pipe', 'date +%s') | int) - (dr_start_time | int)) / 60) | round(1) }}"

    # backup_age_minutes must be supplied by an earlier task or as an extra var
    - name: Generate DR test report
      ansible.builtin.template:
        src: dr-report.j2
        dest: "./reports/dr-test-{{ ansible_date_time.date }}.html"
      vars:
        rto_actual: "{{ rto_minutes }} minutes"
        rto_target: "60 minutes"
        rpo_actual: "{{ backup_age_minutes }} minutes"
        rpo_target: "15 minutes"
        test_result: "{{ 'PASS' if rto_minutes | float < 60 else 'FAIL' }}"

    - name: Tear down test environment
      ansible.builtin.include_role:
        name: dr_rebuild
      vars:
        state: absent
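If the smoke test fails, a linear task list stops before teardown and leaves the test environment running. A possible way to guarantee cleanup, sketched here with Ansible's `block`/`rescue`/`always` and the same `dr_rebuild` role, is shown below; the `dr_test_failed` flag is a hypothetical name introduced for illustration.

```yaml
---
- name: Quarterly DR test with guaranteed teardown
  hosts: localhost
  connection: local
  tasks:
    - name: Run the DR test and always clean up
      block:
        - name: Provision DR test environment
          ansible.builtin.include_role:
            name: dr_rebuild

        - name: Run application smoke tests
          ansible.builtin.uri:
            url: "https://{{ dr_test_endpoint }}/health"
          register: smoke_test
          retries: 5
          delay: 30
          until: smoke_test.status == 200
      rescue:
        # Hypothetical flag; remembered so the run can still fail after cleanup
        - name: Record the failure
          ansible.builtin.set_fact:
            dr_test_failed: true
      always:
        - name: Tear down test environment
          ansible.builtin.include_role:
            name: dr_rebuild
          vars:
            state: absent

    - name: Fail the run if the test did not pass
      ansible.builtin.fail:
        msg: "DR test failed; see task output above"
      when: dr_test_failed | default(false)
```

Because `rescue` marks the block as recovered, the final `fail` task is what makes the overall run report the failed test.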
Best Practices
• Automate everything: if it's not in a playbook, it won't work under pressure.
• Test quarterly at minimum: untested DR plans are fiction.
• Measure RTO/RPO: track actual recovery time and data loss against targets.
• Follow the 3-2-1 backup rule: 3 copies, 2 media types, 1 offsite.
• Encrypt backups: always encrypt data at rest and in transit.
• Document dependencies: recovery order matters, so restore databases before app servers.
• Runbook as code: DR playbooks are the runbook, with no separate documents to maintain.
• Include ITSM: auto-create ServiceNow tickets during DR events.
FAQ
How often should DR playbooks run?
Backups: daily minimum. Health checks: every 5-15 minutes. Full DR test: quarterly. Failover should be automated with manual override.
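As one way to put these intervals into practice, the sketch below uses `ansible.builtin.cron` to schedule the playbooks on a control node. The directory, inventory path, and playbook file names are hypothetical, and secrets handling (for example a vault password file) is left out.

```yaml
---
- name: Schedule DR playbooks on the control node
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    # Hypothetical locations; adjust to your repository layout
    dr_playbook_dir: /opt/ansible/dr
    dr_inventory: /opt/ansible/inventory/production.ini
  tasks:
    - name: Daily database backup at 02:00
      ansible.builtin.cron:
        name: dr-daily-backup
        minute: "0"
        hour: "2"
        job: "ansible-playbook -i {{ dr_inventory }} {{ dr_playbook_dir }}/backup.yml"

    - name: Health check every 10 minutes
      ansible.builtin.cron:
        name: dr-health-check
        minute: "*/10"
        job: "ansible-playbook -i {{ dr_inventory }} {{ dr_playbook_dir }}/health.yml"
```

Teams running Ansible Automation Platform would more likely use its built-in job schedules than cron.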
Can Ansible handle split-brain scenarios?
Ansible can detect split-brain by comparing replication status across nodes, and it can help resolve it by promoting the most up-to-date replica. Because playbooks run on demand rather than continuously, use fencing (STONITH) for critical systems.
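A minimal detection sketch for PostgreSQL, assuming a hypothetical inventory group `database_cluster` that holds every node of one cluster: each node reports whether it is in recovery mode, and the play fails unless exactly one node claims to be primary.

```yaml
---
- name: Detect PostgreSQL split-brain
  hosts: database_cluster  # hypothetical group with every node of one cluster
  become: true
  gather_facts: false
  tasks:
    - name: Ask each node whether it is in recovery (replica) mode
      ansible.builtin.command:
        argv:
          - psql
          - -tAc
          - "SELECT pg_is_in_recovery();"
      become_user: postgres
      register: recovery_state
      changed_when: false

    # pg_is_in_recovery() prints "f" on a primary and "t" on a replica
    - name: Count nodes that believe they are primary
      ansible.builtin.set_fact:
        primary_nodes: >-
          {{ ansible_play_hosts
             | map('extract', hostvars, 'recovery_state')
             | selectattr('stdout', 'equalto', 'f')
             | list | length }}
      run_once: true

    - name: Fail unless exactly one primary exists
      ansible.builtin.assert:
        that: primary_nodes | int == 1
        fail_msg: "Expected 1 primary, found {{ primary_nodes }}; investigate before promoting anything"
      run_once: true
```

Unreachable nodes drop out of `ansible_play_hosts`, so this only counts nodes the control node can actually reach; it complements fencing rather than replacing it.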
What's the minimum RPO achievable?
With streaming replication plus automated backups, RPO can approach zero. With daily backups alone, RPO is up to 24 hours. Ansible orchestrates the recovery regardless of backup method.
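To check how close a streaming replica is to that near-zero figure, one option is to measure the time since the last replayed transaction and compare it with the RPO target. The sketch below assumes a PostgreSQL replica in the `db_replica` group; the 900-second threshold mirrors the 15-minute RPO target used in the DR test report.

```yaml
---
- name: Check replication lag against the RPO target
  hosts: db_replica
  become: true
  gather_facts: false
  vars:
    rpo_target_seconds: 900  # 15 minutes
  tasks:
    - name: Measure seconds since the last replayed transaction
      ansible.builtin.command:
        argv:
          - psql
          - -tAc
          - "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)::int;"
      become_user: postgres
      register: replay_lag
      changed_when: false

    - name: Fail when lag exceeds the RPO target
      ansible.builtin.assert:
        that: replay_lag.stdout | int <= rpo_target_seconds
        fail_msg: "Replication lag {{ replay_lag.stdout }}s exceeds RPO target {{ rpo_target_seconds }}s"
```

This value also grows while the primary is idle, so a high reading on a quiet system does not necessarily mean data is at risk.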
Conclusion
Disaster recovery automation with Ansible transforms DR from a dusty binder into a tested, executable, and reliable system. By codifying backup, failover, rebuild, and testing as playbooks, you ensure recovery works when it matters most — and can prove it through regular automated testing.