About Luca Berton

Luca Berton is an Ansible automation expert, author of 8 Ansible books published by Apress and Leanpub including "Ansible for VMware by Examples" and "Ansible for Kubernetes by Example", and creator of the Ansible Pilot YouTube channel. He shares practical automation knowledge through tutorials, books, and video courses to help IT professionals and DevOps engineers master infrastructure automation.

Ansible Disaster Recovery Automation: Backup, Failover, and Recovery Playbooks

By Luca Berton · Published 2024-01-01 · Category: troubleshooting

Automate disaster recovery with Ansible. Build playbooks for backup workflows, automated failover, infrastructure rebuild, and recovery testing in enterprise environments.

Introduction

When disaster strikes — datacenter outage, ransomware attack, hardware failure — recovery speed determines business impact. Manual runbooks fail under pressure. Ansible automates disaster recovery end-to-end: scheduled backups, automated failover, infrastructure rebuild from code, and regular DR testing to ensure your recovery actually works.

See also: Ansible Automation Platform High Availability and Disaster Recovery: Single Topology Architecture

DR Automation Strategy

Prevention          Detection          Response           Recovery
┌──────────┐    ┌──────────┐    ┌──────────────┐    ┌──────────┐
│ Scheduled │    │ Health   │    │ Automated    │    │ Rebuild  │
│ Backups   │    │ Checks   │    │ Failover     │    │ from     │
│ Config    │    │ Alerts   │    │ DNS Switch   │    │ Code     │
│ Snapshots │    │ Monitors │    │ Traffic Shift│    │ Restore  │
└──────────┘    └──────────┘    └──────────────┘    └──────────┘

Automated Backups

Database Backups

---
- name: Automated database backup
  hosts: database_servers
  become: true
  vars:
    backup_dir: /backup/databases
    retention_days: 30
    s3_bucket: myorg-db-backups
  tasks:
    - name: Create backup directory
      ansible.builtin.file:
        path: "{{ backup_dir }}"
        state: directory
        mode: '0750'

    # The postgres user needs write access to backup_dir for this redirect to work.
    - name: Backup PostgreSQL
      ansible.builtin.shell: |
        pg_dump -Fc {{ db_name }} > {{ backup_dir }}/{{ db_name }}-{{ ansible_date_time.iso8601_basic_short }}.dump
      become_user: postgres
      when: db_type == 'postgresql'

    - name: Backup MySQL
      community.mysql.mysql_db:
        state: dump
        name: "{{ db_name }}"
        target: "{{ backup_dir }}/{{ db_name }}-{{ ansible_date_time.iso8601_basic_short }}.sql.gz"
        login_user: backup_user
        login_password: "{{ vault_mysql_backup_pass }}"
      when: db_type == 'mysql'

    # File names carry a compact timestamp (YYYYMMDDTHHMMSS), so match on the
    # date without dashes. find runs on the database server, where the files are.
    - name: Find today's backups
      ansible.builtin.find:
        paths: "{{ backup_dir }}"
        patterns: "*-{{ ansible_date_time.date | replace('-', '') }}*"
      register: todays_backups

    # Runs on the database server, so boto3 and AWS credentials must be available there.
    - name: Upload to S3
      amazon.aws.s3_object:
        bucket: "{{ s3_bucket }}"
        object: "{{ inventory_hostname }}/{{ item.path | basename }}"
        src: "{{ item.path }}"
        mode: put
        encryption_mode: aws:kms
      loop: "{{ todays_backups.files }}"

    - name: Clean old local backups
      ansible.builtin.find:
        paths: "{{ backup_dir }}"
        age: "{{ retention_days }}d"
      register: old_backups

    - name: Remove old backups
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_backups.files }}"
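
A backup is only useful if it can be read back. The following is a minimal sketch of a readability check for the newest PostgreSQL dump, assuming the dumps live in /backup/databases as in the playbook above and that pg_restore is on the PATH of the database server. It only lists the archive's table of contents, so it catches truncated or unreadable files but does not replace a full restore test.

```yaml
---
- name: Verify the newest PostgreSQL dump is readable
  hosts: database_servers
  become: true
  gather_facts: false
  tasks:
    - name: Find PostgreSQL dump files
      ansible.builtin.find:
        paths: /backup/databases
        patterns: "*.dump"
      register: dumps

    # pg_restore --list reads the archive header and table of contents
    # without touching any database; a non-zero exit fails the task.
    - name: List the newest dump's table of contents
      ansible.builtin.command:
        argv:
          - pg_restore
          - --list
          - "{{ (dumps.files | sort(attribute='mtime') | last).path }}"
      changed_when: false
      when: dumps.files | length > 0
```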

Configuration Backup

- name: Backup all server configurations
  hosts: all
  become: true
  tasks:
    - name: Archive critical configs
      community.general.archive:
        path:
          - /etc/nginx/
          - /etc/ssh/sshd_config
          - /etc/systemd/system/
          - /etc/crontab
          - /etc/fstab
        dest: "/tmp/{{ inventory_hostname }}-config-{{ ansible_date_time.date }}.tar.gz"
        format: gz

    - name: Fetch to backup server
      ansible.builtin.fetch:
        src: "/tmp/{{ inventory_hostname }}-config-{{ ansible_date_time.date }}.tar.gz"
        dest: "backups/configs/"
        flat: false

Health Monitoring

- name: Infrastructure health check
  hosts: all
  tasks:
    - name: Check disk space
      ansible.builtin.shell: df -h / | tail -1 | awk '{print $5}' | tr -d '%'
      register: disk_usage
      changed_when: false

    - name: Check critical services
      ansible.builtin.systemd:
        name: "{{ item }}"
      register: service_status
      loop: "{{ critical_services }}"
      failed_when: false

    # This play does not set become, so it must be enabled here for become_user to apply.
    - name: Check database replication
      ansible.builtin.shell: |
        psql -t -c "SELECT status FROM pg_stat_wal_receiver;"
      register: repl_status
      when: "'database' in group_names"
      become: true
      become_user: postgres
      changed_when: false

    - name: Alert on issues
      ansible.builtin.uri:
        url: "{{ pagerduty_webhook }}"
        method: POST
        body_format: json
        body:
          routing_key: "{{ pagerduty_key }}"
          event_action: trigger
          payload:
            summary: "DR Alert: {{ alert_message | default('disk usage or service check failed') }}"
            severity: critical
            source: "{{ inventory_hostname }}"
        # The PagerDuty Events API v2 answers 202 Accepted rather than 200.
        status_code: [200, 202]
      delegate_to: localhost
      when: >-
        (disk_usage.stdout | int > 90) or
        (service_status.results | selectattr('status.ActiveState', 'ne', 'active') | list | length > 0)

Automated Failover

DNS Failover

- name: Failover to DR site
  hosts: localhost
  connection: local
  vars:
    primary_ip: 203.0.113.10
    dr_ip: 198.51.100.10
    domain: app.example.com
  tasks:
    - name: Check primary site health
      ansible.builtin.uri:
        url: "https://{{ primary_ip }}/health"
        validate_certs: false
        timeout: 10
      register: primary_health
      failed_when: false

    - name: Failover DNS to DR site
      amazon.aws.route53:
        zone: example.com
        record: "{{ domain }}"
        type: A
        value: "{{ dr_ip }}"
        ttl: 60
        overwrite: true
        state: present
      when: primary_health.status is not defined or primary_health.status != 200

    - name: Notify team of failover
      ansible.builtin.uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: "⚠️ FAILOVER ACTIVATED: {{ domain }} now pointing to DR site ({{ dr_ip }})"
      when: primary_health.status is not defined or primary_health.status != 200

Database Failover

- name: Promote database replica
  hosts: db_replica
  become: true
  tasks:
    - name: Check if primary is down
      ansible.builtin.wait_for:
        host: "{{ primary_db_host }}"
        port: 5432
        timeout: 30
      register: primary_check
      failed_when: false

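    - name: Promote replica to primary
      ansible.builtin.command: pg_ctl promote -D /var/lib/postgresql/16/main
      become_user: postgres
      when: primary_check.elapsed >= 30

    - name: Update connection strings
      ansible.builtin.template:
        src: db-config.j2
        dest: /etc/myapp/database.conf
      delegate_to: "{{ item }}"
      loop: "{{ groups['app_servers'] }}"
      vars:
        db_host: "{{ inventory_hostname }}"
      when: primary_check.elapsed >= 30
      register: db_config

    # Handlers do not follow delegate_to, so restart explicitly on each app
    # server whose config changed. The service name myapp is an assumption.
    - name: Restart application
      ansible.builtin.systemd:
        name: myapp
        state: restarted
      delegate_to: "{{ item.item }}"
      loop: "{{ db_config.results | selectattr('changed') | list }}"
      loop_control:
        label: "{{ item.item }}"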

See also: Ansible for Cloud Migration: Lift-and-Shift, Re-Platform, and Re-Factor Strategies

Infrastructure Rebuild

- name: Rebuild infrastructure from code
  hosts: localhost
  connection: local
  tasks:
    - name: Provision infrastructure with Terraform
      cloud.terraform.terraform:
        project_path: ./terraform/dr-site/
        state: present
        variables:
          environment: dr-recovery
          region: us-west-2
      register: tf_output

    - name: Wait for instances
      ansible.builtin.wait_for:
        host: "{{ item }}"
        port: 22
        timeout: 300
      loop: "{{ tf_output.outputs.instance_ips.value }}"

    - name: Add new hosts to inventory
      ansible.builtin.add_host:
        name: "{{ item }}"
        groups: recovered_servers
      loop: "{{ tf_output.outputs.instance_ips.value }}"

- name: Configure recovered servers
  hosts: recovered_servers
  become: true
  roles:
    - common
    - security_baseline
    - monitoring

- name: Restore data
  hosts: recovered_servers
  become: true
  tasks:
    - name: Download latest backup from S3
      amazon.aws.s3_object:
        bucket: myorg-db-backups
        object: "latest/{{ db_name }}.dump"
        dest: /tmp/restore.dump
        mode: get

    - name: Restore database
      ansible.builtin.shell: |
        pg_restore -d {{ db_name }} /tmp/restore.dump
      become_user: postgres
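
A restore that exits cleanly can still leave an empty database. As a minimal sketch, assuming the same recovered_servers group and db_name variable used in the restore play, a follow-up play could count user tables and fail when none exist. The check is deliberately shallow; a real verification would query application-specific tables.

```yaml
---
- name: Verify the restored database
  hosts: recovered_servers
  become: true
  become_user: postgres
  gather_facts: false
  tasks:
    # -t and -A strip headers and alignment so stdout is just the number.
    - name: Count user tables in the restored database
      ansible.builtin.command: >-
        psql -d {{ db_name }} -tAc
        "SELECT count(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema');"
      register: table_count
      changed_when: false

    - name: Fail if the restore produced an empty database
      ansible.builtin.assert:
        that:
          - table_count.stdout | int > 0
        fail_msg: "No user tables found in {{ db_name }} after restore"
```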

DR Testing

- name: Quarterly DR test
  hosts: localhost
  tasks:
    - name: Record DR test start time
      ansible.builtin.set_fact:
        dr_start_time: "{{ now().timestamp() | int }}"

    - name: Create DR test ticket
      servicenow.itsm.change_request:
        state: new
        type: normal
        short_description: "Quarterly DR Test - {{ ansible_date_time.date }}"
        description: "Automated DR test execution"
      register: dr_ticket

    - name: Provision DR test environment
      ansible.builtin.include_role:
        name: dr_rebuild
      vars:
        environment: dr-test
        isolated: true

    - name: Run application smoke tests
      ansible.builtin.uri:
        url: "https://{{ dr_test_endpoint }}/health"
        validate_certs: false
      register: smoke_test
      retries: 5
      delay: 30
      until: smoke_test.status == 200

    # ansible_date_time is frozen at fact gathering, so use now() for the end time.
    - name: Measure RTO
      ansible.builtin.set_fact:
        rto_minutes: "{{ (((now().timestamp() | int) - (dr_start_time | int)) / 60) | round(1) }}"

    # backup_age_minutes must be supplied by the caller, for example from backup metadata.
    - name: Generate DR test report
      ansible.builtin.template:
        src: dr-report.j2
        dest: "./reports/dr-test-{{ ansible_date_time.date }}.html"
      vars:
        rto_actual: "{{ rto_minutes }} minutes"
        rto_target: "60 minutes"
        rpo_actual: "{{ backup_age_minutes }} minutes"
        rpo_target: "15 minutes"
        test_result: "{{ 'PASS' if rto_minutes | float < 60 else 'FAIL' }}"

    - name: Tear down test environment
      ansible.builtin.include_role:
        name: dr_rebuild
      vars:
        state: absent

Best Practices

Automate everything: If it's not in a playbook, it won't work under pressure.
Test quarterly at minimum: Untested DR plans are fiction.
Measure RTO/RPO: Track actual recovery time and data loss against targets.
Follow the 3-2-1 backup rule: 3 copies, 2 media types, 1 offsite.
Encrypt backups: Always encrypt data at rest and in transit.
Document dependencies: Recovery order matters; restore the database before the app servers.
Runbook as code: DR playbooks ARE the runbook; there are no separate documents to maintain.
Include ITSM: Auto-create ServiceNow tickets during DR events.
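
Recovery order can be encoded directly in play order, because Ansible runs plays top to bottom. The following is a rough sketch only: the role names db_restore and app_deploy are hypothetical placeholders, and the group names are borrowed from the earlier examples.

```yaml
---
# Plays run in order, so the database tier is fully restored
# before any application server is touched.
- name: Recover database tier first
  hosts: database_servers
  become: true
  roles:
    - db_restore          # hypothetical role

- name: Recover application tier second
  hosts: app_servers
  become: true
  serial: "25%"           # bring app servers back in batches of a quarter
  roles:
    - app_deploy          # hypothetical role
```

If the database play fails on every host, the run stops there and the application play never starts, which is the behavior a dependency-ordered recovery needs.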

FAQ

How often should DR playbooks run?

Backups: daily minimum. Health checks: every 5-15 minutes. Full DR test: quarterly. Failover should be automated with manual override.
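
One way to get the daily backup cadence is a cron entry that Ansible itself manages on the control node. This is a minimal sketch under stated assumptions: the ansible_control group, the ansible service account, and the paths under /opt/ansible are all hypothetical and should be replaced with your own.

```yaml
---
- name: Schedule the daily database backup playbook
  hosts: ansible_control        # hypothetical group containing the control node
  become: true
  tasks:
    - name: Run the backup playbook every day at 02:00
      ansible.builtin.cron:
        name: "daily database backup"
        minute: "0"
        hour: "2"
        user: ansible           # assumed service account on the control node
        # Paths below are assumptions; point them at your own project.
        job: >-
          ansible-playbook -i /opt/ansible/inventory
          /opt/ansible/playbooks/db-backup.yml
          >> $HOME/db-backup.log 2>&1
```

Teams running Ansible Automation Platform or AWX would typically use a job template schedule instead of cron.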

Can Ansible handle split-brain scenarios?

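Ansible can help detect split-brain by comparing replication status across nodes, and it can promote the most up-to-date replica once you have decided which node should win. It is not a cluster manager, so use fencing (STONITH) for critical systems to guarantee the old primary is really offline.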
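
Detection can be as simple as asking every PostgreSQL node whether it is in recovery and failing when more or fewer than one node claims to be primary. This is a minimal sketch, assuming all database nodes are in the database_servers group and psql is usable by the postgres user; it only detects the condition and does not resolve it.

```yaml
---
- name: Detect PostgreSQL split-brain
  hosts: database_servers
  become: true
  become_user: postgres
  gather_facts: false
  tasks:
    # pg_is_in_recovery() prints "t" on a replica and "f" on a primary.
    - name: Ask each node whether it is in recovery mode
      ansible.builtin.command: psql -tAc "SELECT pg_is_in_recovery();"
      register: recovery_state
      changed_when: false

    - name: Fail unless exactly one node reports itself as primary
      ansible.builtin.assert:
        that:
          - primaries | length == 1
        fail_msg: "Expected exactly one primary, found: {{ primaries }}"
      vars:
        # Collect the hosts whose answer was "f", meaning they accept writes.
        primaries: >-
          {{ ansible_play_hosts
             | map('extract', hostvars)
             | selectattr('recovery_state.stdout', 'equalto', 'f')
             | map(attribute='inventory_hostname')
             | list }}
      run_once: true
```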

What's the minimum RPO achievable?

With streaming replication + automated backup: near-zero RPO. With daily backups: 24-hour RPO. Ansible orchestrates the recovery regardless of backup method.
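
With streaming replication, the RPO you actually achieve depends on how far the replica lags behind the primary. As a minimal sketch, assuming the db_replica host from the failover example and a 15-minute target, a play can measure the time since the last replayed transaction and fail when it exceeds the target. Note that on an idle primary this value keeps growing even though no data is at risk, so treat it as an upper bound.

```yaml
---
- name: Check replica lag against the RPO target
  hosts: db_replica
  become: true
  become_user: postgres
  gather_facts: false
  vars:
    rpo_target_seconds: 900   # 15 minutes; an assumed target
  tasks:
    # COALESCE covers the case where nothing has been replayed yet (NULL).
    - name: Measure seconds since the last replayed transaction
      ansible.builtin.command: >-
        psql -tAc
        "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0);"
      register: replay_lag
      changed_when: false

    - name: Fail when lag exceeds the RPO target
      ansible.builtin.assert:
        that:
          - replay_lag.stdout | float <= rpo_target_seconds
        fail_msg: "Replica is {{ replay_lag.stdout }}s behind; RPO target is {{ rpo_target_seconds }}s"
```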

Conclusion

Disaster recovery automation with Ansible transforms DR from a dusty binder into a tested, executable, and reliable system. By codifying backup, failover, rebuild, and testing as playbooks, you ensure recovery works when it matters most — and can prove it through regular automated testing.

Related Articles

HashiCorp Vault Integration with AAP · Ansible ServiceNow Integration · Ansible AWS Complete Guide
