Ansible Disaster Recovery Automation: Backup, Failover, and Recovery Playbooks

By Luca Berton · Published 2024-01-01 · Category: troubleshooting

Automate disaster recovery with Ansible. Build backup workflows, automated failover, infrastructure rebuild, and recovery testing playbooks for enterprise environments.

Introduction

When disaster strikes — datacenter outage, ransomware attack, hardware failure — recovery speed determines business impact. Manual runbooks fail under pressure. Ansible automates disaster recovery end-to-end: scheduled backups, automated failover, infrastructure rebuild from code, and regular DR testing to ensure your recovery actually works.

DR Automation Strategy

Prevention          Detection          Response           Recovery
┌──────────┐    ┌──────────┐    ┌──────────────┐    ┌──────────┐
│ Scheduled │    │ Health   │    │ Automated    │    │ Rebuild  │
│ Backups   │    │ Checks   │    │ Failover     │    │ from     │
│ Config    │    │ Alerts   │    │ DNS Switch   │    │ Code     │
│ Snapshots │    │ Monitors │    │ Traffic Shift│    │ Restore  │
└──────────┘    └──────────┘    └──────────────┘    └──────────┘

Automated Backups

Database Backups

---
- name: Automated database backup
  hosts: database_servers
  become: true
  vars:
    backup_dir: /backup/databases
    retention_days: 30
    s3_bucket: myorg-db-backups
  tasks:
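    - name: Create backup directory
      ansible.builtin.file:
        path: "{{ backup_dir }}"
        state: directory
        # pg_dump runs as postgres, so that user must be able to write here.
        owner: "{{ 'postgres' if db_type == 'postgresql' else 'root' }}"
        mode: '0750'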

    - name: Backup PostgreSQL
      ansible.builtin.shell: |
        pg_dump -Fc {{ db_name }} > {{ backup_dir }}/{{ db_name }}-{{ ansible_date_time.iso8601_basic_short }}.dump
      become_user: postgres
      when: db_type == 'postgresql'

    - name: Backup MySQL
      community.mysql.mysql_db:
        state: dump
        name: "{{ db_name }}"
        target: "{{ backup_dir }}/{{ db_name }}-{{ ansible_date_time.iso8601_basic_short }}.sql.gz"
        login_user: backup_user
        login_password: "{{ vault_mysql_backup_pass }}"
      when: db_type == 'mysql'

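    - name: Find today's backups on the database host
      ansible.builtin.find:
        paths: "{{ backup_dir }}"
        patterns: "*-{{ ansible_date_time.iso8601_basic_short[:8] }}T*"
      register: new_backups

    # Runs on the database host, where the dump files live (needs boto3 there).
    - name: Upload to S3
      amazon.aws.s3_object:
        bucket: "{{ s3_bucket }}"
        object: "{{ inventory_hostname }}/{{ item.path | basename }}"
        src: "{{ item.path }}"
        mode: put
        encryption_mode: aws:kms
      loop: "{{ new_backups.files }}"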

    - name: Clean old local backups
      ansible.builtin.find:
        paths: "{{ backup_dir }}"
        age: "{{ retention_days }}d"
      register: old_backups

    - name: Remove old backups
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_backups.files }}"
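
The introduction calls for scheduled backups. One way to schedule this play is a cron entry on the control node. The sketch below rests on assumptions that are not in the original: the playbook is saved as /opt/ansible/db-backup.yml, the inventory lives at /opt/ansible/inventory, and an `ansible` user exists on the control node.

```yaml
---
- name: Schedule the nightly database backup (sketch)
  hosts: localhost
  connection: local
  become: true                        # needed to edit another user's crontab
  tasks:
    - name: Install a cron entry that runs the backup playbook at 02:00
      ansible.builtin.cron:
        name: nightly-db-backup
        user: ansible                 # assumed control-node account
        minute: "0"
        hour: "2"
        # Paths are hypothetical; point them at your own project layout.
        job: "ansible-playbook -i /opt/ansible/inventory /opt/ansible/db-backup.yml"
```

Because the backup play reads a vaulted password, a real cron job would also need a way to supply the vault secret, for example through `--vault-password-file`.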

Configuration Backup

- name: Backup all server configurations
  hosts: all
  become: true
  tasks:
    - name: Archive critical configs
      community.general.archive:
        path:
          - /etc/nginx/
          - /etc/ssh/sshd_config
          - /etc/systemd/system/
          - /etc/crontab
          - /etc/fstab
        dest: "/tmp/{{ inventory_hostname }}-config-{{ ansible_date_time.date }}.tar.gz"
        format: gz

    - name: Fetch archive to the control node
      ansible.builtin.fetch:
        src: "/tmp/{{ inventory_hostname }}-config-{{ ansible_date_time.date }}.tar.gz"
        dest: "backups/configs/"
        flat: false

Health Monitoring

- name: Infrastructure health check
  hosts: all
  tasks:
    - name: Check disk space
      ansible.builtin.shell: df -h / | tail -1 | awk '{print $5}' | tr -d '%'
      register: disk_usage
      changed_when: false

    - name: Check critical services
      ansible.builtin.systemd:
        name: "{{ item }}"
      register: service_status
      loop: "{{ critical_services }}"
      failed_when: false

    - name: Check database replication
      ansible.builtin.shell: |
        psql -t -c "SELECT status FROM pg_stat_wal_receiver;"
      register: repl_status
      when: "'database_servers' in group_names"
      become: true
      become_user: postgres
      changed_when: false

    - name: Alert on issues
      ansible.builtin.uri:
        url: "{{ pagerduty_webhook }}"
        method: POST
        body_format: json
        body:
          routing_key: "{{ pagerduty_key }}"
          event_action: trigger
          payload:
            summary: "DR Alert: {{ alert_message | default('disk usage above 90% or a critical service is not active') }}"
            severity: critical
            source: "{{ inventory_hostname }}"
      delegate_to: localhost
      when: disk_usage.stdout | int > 90 or
            service_status.results | selectattr('status.ActiveState', 'ne', 'active') | list | length > 0
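
The replication check above reports only the receiver status. A lag measurement is often more useful, since it maps directly onto RPO. The following is a minimal sketch that assumes PostgreSQL standbys in the `database_servers` group and a hypothetical `max_lag_seconds` threshold.

```yaml
- name: Check replication lag on standbys (sketch)
  hosts: database_servers
  become: true
  vars:
    max_lag_seconds: 300   # assumed threshold; align it with your RPO target
  tasks:
    - name: Measure seconds since the last replayed transaction
      ansible.builtin.command:
        # Returns 0 on a primary, where the replay timestamp is NULL.
        cmd: >-
          psql -tA -c "SELECT COALESCE(EXTRACT(EPOCH FROM
          now() - pg_last_xact_replay_timestamp()), 0)::int;"
      become_user: postgres
      register: repl_lag
      changed_when: false

    - name: Fail when the standby is too far behind
      ansible.builtin.assert:
        that:
          - repl_lag.stdout | int <= max_lag_seconds
        fail_msg: "Replication lag is {{ repl_lag.stdout }}s on {{ inventory_hostname }}"
```

Note that this value also grows when the primary is simply idle, so treat it as an upper bound rather than an exact measure of lost data.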

Automated Failover

DNS Failover

- name: Failover to DR site
  hosts: localhost
  connection: local
  vars:
    primary_ip: 203.0.113.10
    dr_ip: 198.51.100.10
    domain: app.example.com
  tasks:
    - name: Check primary site health
      ansible.builtin.uri:
        url: "https://{{ primary_ip }}/health"
        validate_certs: false
        timeout: 10
      register: primary_health
      failed_when: false

    - name: Failover DNS to DR site
      amazon.aws.route53:
        zone: example.com
        record: "{{ domain }}"
        type: A
        value: "{{ dr_ip }}"
        ttl: 60
        overwrite: true
        state: present
      when: primary_health.status is not defined or primary_health.status != 200

    - name: Notify team of failover
      ansible.builtin.uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: "⚠️ FAILOVER ACTIVATED: {{ domain }} now pointing to DR site ({{ dr_ip }})"
      when: primary_health.status is not defined or primary_health.status != 200
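
A single failed probe can trigger an unnecessary failover. The sketch below shows one way to require several consecutive failures first, using `retries` and `until`. The retry count and delay are illustrative assumptions, not values from the original.

```yaml
- name: Confirm the outage before failing over (sketch)
  hosts: localhost
  connection: local
  vars:
    primary_ip: 203.0.113.10
  tasks:
    - name: Probe the primary three times, 20 seconds apart
      ansible.builtin.uri:
        url: "https://{{ primary_ip }}/health"
        validate_certs: false
        timeout: 10
      register: primary_health
      retries: 3
      delay: 20
      until: primary_health.status == 200
      ignore_errors: true   # keep the play running so later tasks can react

    - name: Report the decision
      ansible.builtin.debug:
        msg: "Primary unhealthy after {{ primary_health.attempts }} attempts; failover is justified."
      when: primary_health is failed
```

The DNS change and the notification can then be gated on `primary_health is failed`.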

Database Failover

- name: Promote database replica
  hosts: db_replica
  become: true
  tasks:
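    - name: Check if primary is down
      ansible.builtin.wait_for:
        host: "{{ primary_db_host }}"
        port: 5432
        timeout: 30
      register: primary_check
      ignore_errors: true

    # Debian/Ubuntu packages keep pg_ctl outside PATH, so call it by full path.
    - name: Promote replica to primary
      ansible.builtin.command: /usr/lib/postgresql/16/bin/pg_ctl promote -D /var/lib/postgresql/16/main
      become_user: postgres
      when: primary_check is failed

    - name: Update connection strings
      ansible.builtin.template:
        src: db-config.j2
        dest: /etc/myapp/database.conf
      delegate_to: "{{ item }}"
      loop: "{{ groups['app_servers'] }}"
      vars:
        db_host: "{{ inventory_hostname }}"
      when: primary_check is failed

    # A notified handler would run against the replica, not the delegated
    # app servers, so the restart is an explicit delegated task.
    # The service name "myapp" is an assumption based on /etc/myapp.
    - name: Restart application
      ansible.builtin.systemd:
        name: myapp
        state: restarted
      delegate_to: "{{ item }}"
      loop: "{{ groups['app_servers'] }}"
      when: primary_check is failed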
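
Promotion is not idempotent, so it helps to confirm that the node is still a standby before promoting it. The sketch below is one possible guard. It substitutes the SQL function `pg_promote()` (available from PostgreSQL 12) for the `pg_ctl promote` call used above, and it assumes `psql` can connect locally as the `postgres` user.

```yaml
- name: Promote only a node that is still a standby (sketch)
  hosts: db_replica
  become: true
  tasks:
    - name: Ask PostgreSQL whether this node is in recovery
      ansible.builtin.command:
        cmd: psql -tA -c "SELECT pg_is_in_recovery();"
      become_user: postgres
      register: in_recovery
      changed_when: false

    - name: Promote the standby
      ansible.builtin.command:
        cmd: psql -tA -c "SELECT pg_promote();"
      become_user: postgres
      # psql prints "t" for a standby and "f" for a primary.
      when: in_recovery.stdout | trim == 't'
```

With this guard, re-running the play against an already promoted node does nothing.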

Infrastructure Rebuild

- name: Rebuild infrastructure from code
  hosts: localhost
  connection: local
  tasks:
    - name: Provision infrastructure with Terraform
      cloud.terraform.terraform:
        project_path: ./terraform/dr-site/
        state: present
        variables:
          environment: dr-recovery
          region: us-west-2
      register: tf_output

    - name: Wait for instances
      ansible.builtin.wait_for:
        host: "{{ item }}"
        port: 22
        timeout: 300
      loop: "{{ tf_output.outputs.instance_ips.value }}"

    - name: Add new hosts to inventory
      ansible.builtin.add_host:
        name: "{{ item }}"
        groups: recovered_servers
      loop: "{{ tf_output.outputs.instance_ips.value }}"

- name: Configure recovered servers
  hosts: recovered_servers
  become: true
  roles:
    - common
    - security_baseline
    - monitoring

- name: Restore data
  hosts: recovered_servers
  become: true
  tasks:
    - name: Download latest backup from S3
      amazon.aws.s3_object:
        bucket: myorg-db-backups
        object: "latest/{{ db_name }}.dump"
        dest: /tmp/restore.dump
        mode: get

    - name: Restore database
      ansible.builtin.shell: |
        pg_restore -d {{ db_name }} /tmp/restore.dump
      become_user: postgres
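
Restoring from a corrupt or empty archive wastes recovery time. A cheap pre-check is to read the archive's table of contents before running the restore. This is a minimal sketch that reuses the /tmp/restore.dump path from the play above; the assertion on `TABLE DATA` is an illustrative heuristic, not a full integrity check.

```yaml
- name: Validate the dump before restoring it (sketch)
  hosts: recovered_servers
  become: true
  tasks:
    - name: List the archive's table of contents
      ansible.builtin.command:
        cmd: pg_restore --list /tmp/restore.dump
      become_user: postgres
      register: dump_toc
      changed_when: false

    - name: Stop if the archive contains no table data
      ansible.builtin.assert:
        that:
          - dump_toc.stdout is search('TABLE DATA')
        fail_msg: "Backup archive looks empty or unreadable; do not restore from it."
```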

DR Testing

- name: Quarterly DR test
  hosts: localhost
  tasks:
    - name: Create DR test ticket
      servicenow.itsm.change_request:
        state: new
        type: normal
        short_description: "Quarterly DR Test - {{ ansible_date_time.date }}"
        description: "Automated DR test execution"
      register: dr_ticket

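    - name: Record DR test start time
      ansible.builtin.set_fact:
        dr_start_time: "{{ lookup('pipe', 'date +%s') }}"

    - name: Provision DR test environment
      ansible.builtin.include_role:
        name: dr_rebuild
      vars:
        # "environment" is a reserved Ansible keyword, so use a distinct name.
        dr_environment: dr-test
        isolated: true

    - name: Run application smoke tests
      ansible.builtin.uri:
        url: "https://{{ dr_test_endpoint }}/health"
        validate_certs: false
      register: smoke_test
      retries: 5
      delay: 30
      until: smoke_test.status == 200

    # ansible_date_time is frozen at fact gathering, so read the clock again.
    - name: Measure RTO
      ansible.builtin.set_fact:
        rto_minutes: "{{ (((lookup('pipe', 'date +%s') | int) - (dr_start_time | int)) / 60) | round(1) }}"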

    - name: Generate DR test report
      ansible.builtin.template:
        src: dr-report.j2
        dest: "./reports/dr-test-{{ ansible_date_time.date }}.html"
      vars:
        rto_actual: "{{ rto_minutes }} minutes"
        rto_target: "60 minutes"
        rpo_actual: "{{ backup_age_minutes }} minutes"
        rpo_target: "15 minutes"
        test_result: "{{ 'PASS' if rto_minutes | int < 60 else 'FAIL' }}"

    - name: Tear down test environment
      ansible.builtin.include_role:
        name: dr_rebuild
      vars:
        state: absent
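
In the play above, a failed smoke test stops execution before the teardown task, which can leave the test environment running. One way to guarantee cleanup is a `block` with an `always` section. The sketch below reuses the `dr_rebuild` role and `dr_test_endpoint` variable from the play above; the `test_result` fact set in `rescue` is an assumption about how the report would consume the outcome.

```yaml
- name: DR test with guaranteed teardown (sketch)
  hosts: localhost
  connection: local
  tasks:
    - name: Run the test inside a block
      block:
        - name: Provision DR test environment
          ansible.builtin.include_role:
            name: dr_rebuild

        - name: Run application smoke test
          ansible.builtin.uri:
            url: "https://{{ dr_test_endpoint }}/health"
            validate_certs: false
          register: smoke_test
          retries: 5
          delay: 30
          until: smoke_test.status == 200
      rescue:
        - name: Record the failure for the report
          ansible.builtin.set_fact:
            test_result: FAIL
      always:
        # Runs whether the block succeeded or failed.
        - name: Tear down test environment
          ansible.builtin.include_role:
            name: dr_rebuild
          vars:
            state: absent
```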

Best Practices

Automate everything: if it's not in a playbook, it won't work under pressure
Test quarterly at minimum: untested DR plans are fiction
Measure RTO/RPO: track actual recovery time and data loss against targets
Follow the 3-2-1 backup rule: 3 copies, 2 media types, 1 offsite
Encrypt backups: always encrypt data at rest and in transit
Document dependencies: recovery order matters, so restore the database before the app servers
Runbook as code: DR playbooks are the runbook, with no separate documents to maintain
Include ITSM: auto-create ServiceNow tickets during DR events
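
The recovery-order point can be encoded directly in a top-level playbook, so the order is enforced rather than merely documented. The sketch below assumes the plays in this article are saved under the hypothetical file names shown.

```yaml
---
# site-recovery.yml: hypothetical top-level runbook; file names are assumptions.
- name: Rebuild infrastructure
  ansible.builtin.import_playbook: rebuild-infrastructure.yml

- name: Restore and promote databases first
  ansible.builtin.import_playbook: restore-database.yml

- name: Bring up application servers once the database is back
  ansible.builtin.import_playbook: deploy-application.yml

- name: Switch traffic last
  ansible.builtin.import_playbook: dns-failover.yml
```

Imported playbooks run strictly in the order listed, and a failure in one stops the ones after it.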

FAQ

How often should DR playbooks run?

Backups: daily minimum. Health checks: every 5-15 minutes. Full DR test: quarterly. Failover should be automated with manual override.

Can Ansible handle split-brain scenarios?

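Only partly. Ansible can detect split-brain by querying replication status on every node, and it can orchestrate promotion of the most up-to-date replica, but it runs on demand and is not a cluster manager. For critical systems, rely on fencing (STONITH) so the old primary cannot keep accepting writes.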
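
As a rough illustration of the detection side, the sketch below asks every PostgreSQL node whether it is in recovery and fails unless exactly one node reports itself as primary. It assumes all database nodes are in the `database_servers` group and reachable; a node that Ansible cannot reach is not counted, so this does not replace fencing.

```yaml
- name: Detect more than one writable PostgreSQL node (sketch)
  hosts: database_servers
  become: true
  tasks:
    - name: Ask each node whether it is in recovery
      ansible.builtin.command:
        cmd: psql -tA -c "SELECT pg_is_in_recovery();"
      become_user: postgres
      register: in_recovery
      changed_when: false

    - name: Fail when the number of primaries is not exactly one
      ansible.builtin.assert:
        that:
          - primaries | length == 1
        fail_msg: "Expected one primary, found: {{ primaries | join(', ') }}"
      vars:
        # Hosts whose psql output was "f" are not in recovery, i.e. primaries.
        primaries: >-
          {{ ansible_play_hosts
             | map('extract', hostvars)
             | selectattr('in_recovery.stdout', 'eq', 'f')
             | map(attribute='inventory_hostname')
             | list }}
      run_once: true
```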

What's the minimum RPO achievable?

With streaming replication + automated backup: near-zero RPO. With daily backups: 24-hour RPO. Ansible orchestrates the recovery regardless of backup method.
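
For backup-based recovery, the achieved RPO is bounded by the age of the newest backup. The sketch below shows one way to derive a `backup_age_minutes` value like the one the DR test report expects. It assumes the /backup/databases directory from the backup play and a hypothetical `rpo_target_minutes` variable.

```yaml
- name: Estimate RPO from the newest local backup (sketch)
  hosts: database_servers
  become: true
  vars:
    backup_dir: /backup/databases
    rpo_target_minutes: 1440   # assumed target for a daily backup schedule
  tasks:
    - name: List backup files
      ansible.builtin.find:
        paths: "{{ backup_dir }}"
      register: backups

    - name: Compute the age of the newest backup in minutes
      ansible.builtin.set_fact:
        # ansible_date_time.epoch is the host clock at fact gathering.
        backup_age_minutes: >-
          {{ ((ansible_date_time.epoch | int)
              - (backups.files | map(attribute='mtime') | max | int)) // 60 }}
      when: backups.files | length > 0

    - name: Fail when the newest backup is missing or older than the target
      ansible.builtin.assert:
        that:
          - backup_age_minutes is defined
          - backup_age_minutes | int <= rpo_target_minutes
```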

Conclusion

Disaster recovery automation with Ansible transforms DR from a dusty binder into a tested, executable, and reliable system. By codifying backup, failover, rebuild, and testing as playbooks, you ensure recovery works when it matters most — and can prove it through regular automated testing.
