Ansible Monitoring and Observability: Prometheus, Grafana, and ELK Stack Integration

By Luca Berton · Published 2024-01-01 · Category: installation

Deploy and configure monitoring infrastructure with Ansible. Automate Prometheus, Grafana, ELK Stack, and alerting for full observability.

Introduction

Observability — metrics, logs, and traces — is essential for running reliable infrastructure. But deploying and configuring monitoring tools manually across hundreds of servers doesn't scale. Ansible automates the entire observability stack: deploying Prometheus and Grafana for metrics, ELK for log aggregation, and configuring alerting rules — all as reproducible, version-controlled code.

Deploy Prometheus

Prometheus Server

---
- name: Deploy Prometheus
  hosts: monitoring_server
  become: true
  vars:
    prometheus_version: "2.52.0"
    prometheus_user: prometheus
    prometheus_dir: /opt/prometheus
    prometheus_data: /var/lib/prometheus
  tasks:
    - name: Create prometheus user
      ansible.builtin.user:
        name: "{{ prometheus_user }}"
        system: true
        shell: /sbin/nologin
        create_home: false

    - name: Download Prometheus
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
        dest: /opt/
        remote_src: true
        owner: "{{ prometheus_user }}"

    - name: Create symlink
      ansible.builtin.file:
        src: "/opt/prometheus-{{ prometheus_version }}.linux-amd64"
        dest: "{{ prometheus_dir }}"
        state: link

    - name: Create data directory
      ansible.builtin.file:
        path: "{{ prometheus_data }}"
        state: directory
        owner: "{{ prometheus_user }}"
        mode: '0755'

    - name: Deploy Prometheus config
      ansible.builtin.template:
        src: prometheus.yml.j2
        dest: "{{ prometheus_dir }}/prometheus.yml"
        owner: "{{ prometheus_user }}"
      notify: restart prometheus

    - name: Deploy alert rules
      ansible.builtin.template:
        src: alert-rules.yml.j2
        dest: "{{ prometheus_dir }}/alert-rules.yml"
        owner: "{{ prometheus_user }}"
      notify: reload prometheus

    - name: Create systemd service
      ansible.builtin.template:
        src: prometheus.service.j2
        dest: /etc/systemd/system/prometheus.service
      notify:
        - daemon reload
        - restart prometheus

    - name: Enable and start Prometheus
      ansible.builtin.systemd:
        name: prometheus
        state: started
        enabled: true

  handlers:
    - name: daemon reload
      ansible.builtin.systemd:
        daemon_reload: true

    - name: restart prometheus
      ansible.builtin.systemd:
        name: prometheus
        state: restarted

    - name: reload prometheus
      ansible.builtin.uri:
        url: http://localhost:9090/-/reload
        method: POST

Prometheus Configuration Template

# templates/prometheus.yml.j2
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['{{ alertmanager_host }}:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets:
{% for host in groups['all'] %}
          - '{{ hostvars[host].ansible_host | default(host) }}:9100'
{% endfor %}

  - job_name: 'application'
    metrics_path: /metrics
    static_configs:
      - targets:
{% for host in groups['webservers'] %}
          - '{{ hostvars[host].ansible_host | default(host) }}:8080'
{% endfor %}
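The template above reads its scrape targets from the inventory: `groups['all']`, `groups['webservers']`, and the `alertmanager_host` variable. A minimal inventory sketch that would satisfy it might look like the following. Every host name and address here is hypothetical, and placing Alertmanager on the monitoring server is an assumption, not something the article states.

```yaml
# inventory.yml: hypothetical hosts and addresses, for illustration only
all:
  vars:
    # Assumption: Alertmanager runs on the monitoring server
    alertmanager_host: mon01.example.com
  children:
    monitoring_server:
      hosts:
        mon01.example.com:
    webservers:
      hosts:
        web01.example.com:
          ansible_host: 192.0.2.11   # rendered as '192.0.2.11:8080'
        web02.example.com:           # no ansible_host, so the inventory name is used
    elasticsearch:
      hosts:
        es01.example.com:
    logstash:
      hosts:
        log01.example.com:
```

With this inventory, the `default(host)` filter in the template falls back to `web02.example.com` for the host that has no `ansible_host` set.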

Node Exporter on All Hosts

- name: Deploy Node Exporter
  hosts: all
  become: true
  vars:
    node_exporter_version: "1.8.1"
  tasks:
    - name: Download Node Exporter
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/node_exporter/releases/download/v{{ node_exporter_version }}/node_exporter-{{ node_exporter_version }}.linux-amd64.tar.gz"
        dest: /opt/
        remote_src: true

    - name: Create symlink
      ansible.builtin.file:
        src: "/opt/node_exporter-{{ node_exporter_version }}.linux-amd64/node_exporter"
        dest: /usr/local/bin/node_exporter
        state: link

    - name: Create systemd service
      ansible.builtin.copy:
        content: |
          [Unit]
          Description=Node Exporter
          After=network.target

          [Service]
          User=nobody
          ExecStart=/usr/local/bin/node_exporter
          Restart=always

          [Install]
          WantedBy=multi-user.target
        dest: /etc/systemd/system/node_exporter.service

    - name: Start Node Exporter
      ansible.builtin.systemd:
        name: node_exporter
        state: started
        enabled: true
        daemon_reload: true
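A started service does not guarantee that metrics are actually being served. As a rough sketch, a follow-up play could probe the exporter on each host and fail early if the endpoint is silent. The play name, retry counts, and the registered variable `node_metrics` are illustrative choices, and the port assumes Node Exporter's default of 9100 used in the scrape configuration.

```yaml
- name: Verify Node Exporter
  hosts: all
  gather_facts: false
  tasks:
    - name: Wait for the metrics endpoint to answer
      ansible.builtin.uri:
        url: http://localhost:9100/metrics   # runs on the managed host itself
        return_content: true
      register: node_metrics
      retries: 5
      delay: 3
      until: node_metrics.status == 200

    - name: Confirm that CPU metrics are exposed
      ansible.builtin.assert:
        that:
          - "'node_cpu_seconds_total' in node_metrics.content"
```

Checking for `node_cpu_seconds_total` ties the verification to the same metric the HighCPU alert rule depends on.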

Deploy Grafana

- name: Deploy Grafana
  hosts: monitoring_server
  become: true
  tasks:
    - name: Add Grafana repository
      ansible.builtin.yum_repository:
        name: grafana
        description: Grafana Repository
        baseurl: https://rpm.grafana.com
        gpgkey: https://rpm.grafana.com/gpg.key
        gpgcheck: true

    - name: Install Grafana
      ansible.builtin.package:
        name: grafana
        state: present

    - name: Configure Grafana
      ansible.builtin.template:
        src: grafana.ini.j2
        dest: /etc/grafana/grafana.ini
      notify: restart grafana

    - name: Start Grafana
      ansible.builtin.systemd:
        name: grafana-server
        state: started
        enabled: true

    - name: Wait for Grafana to start
      ansible.builtin.uri:
        url: http://localhost:3000/api/health
      register: grafana_health
      retries: 10
      delay: 5
      until: grafana_health.status == 200

    - name: Add Prometheus datasource
      ansible.builtin.uri:
        url: http://localhost:3000/api/datasources
        method: POST
        user: admin
        password: "{{ grafana_admin_password }}"
        force_basic_auth: true
        body_format: json
        body:
          name: Prometheus
          type: prometheus
          url: http://localhost:9090
          access: proxy
          isDefault: true
        status_code: [200, 409]

    - name: Import dashboards
      ansible.builtin.uri:
        url: http://localhost:3000/api/dashboards/import
        method: POST
        user: admin
        password: "{{ grafana_admin_password }}"
        force_basic_auth: true
        body_format: json
        body:
          dashboard:
            id: "{{ item }}"
          overwrite: true
          inputs:
            - name: DS_PROMETHEUS
              type: datasource
              pluginId: prometheus
              value: Prometheus
      loop:
        - 1860   # Node Exporter Full
        - 3662   # Prometheus 2.0 Stats
        - 11074  # Node Exporter for Prometheus

  handlers:
    - name: restart grafana
      ansible.builtin.systemd:
        name: grafana-server
        state: restarted

Deploy ELK Stack

Elasticsearch

- name: Deploy Elasticsearch
  hosts: elasticsearch
  become: true
  tasks:
    - name: Install Elasticsearch
      ansible.builtin.package:
        name: elasticsearch
        state: present

    - name: Configure Elasticsearch
      ansible.builtin.template:
        src: elasticsearch.yml.j2
        dest: /etc/elasticsearch/elasticsearch.yml
      notify: restart elasticsearch

    - name: Set JVM heap
      ansible.builtin.template:
        src: jvm.options.j2
        dest: /etc/elasticsearch/jvm.options.d/heap.options

    - name: Start Elasticsearch
      ansible.builtin.systemd:
        name: elasticsearch
        state: started
        enabled: true

  # Handler required by the "notify: restart elasticsearch" above
  handlers:
    - name: restart elasticsearch
      ansible.builtin.systemd:
        name: elasticsearch
        state: restarted
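The install task above only succeeds if the host already has a package repository that provides `elasticsearch`. A possible task to place before it, assuming an RPM-based host and the Elastic 8.x package series (both assumptions, since the article does not name a distribution or version), mirrors the Grafana repository task:

```yaml
- name: Add Elastic repository
  ansible.builtin.yum_repository:
    name: elastic-8.x                       # repository id chosen for illustration
    description: Elastic repository for 8.x packages
    baseurl: https://artifacts.elastic.co/packages/8.x/yum
    gpgkey: https://artifacts.elastic.co/GPG-KEY-elasticsearch
    gpgcheck: true
```

The same repository also carries the `logstash` and `filebeat` packages, so the task would be needed on every host group that installs one of them.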

Logstash

- name: Deploy Logstash
  hosts: logstash
  become: true
  tasks:
    - name: Install Logstash
      ansible.builtin.package:
        name: logstash
        state: present

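    - name: Deploy pipeline config
      ansible.builtin.template:
        src: logstash-pipeline.conf.j2
        dest: /etc/logstash/conf.d/main.conf
      notify: restart logstash

    - name: Start Logstash
      ansible.builtin.systemd:
        name: logstash
        state: started
        enabled: true

  # Handler required by the "notify: restart logstash" above
  handlers:
    - name: restart logstash
      ansible.builtin.systemd:
        name: logstash
        state: restarted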

Filebeat on All Hosts

- name: Deploy Filebeat
  hosts: all
  become: true
  tasks:
    - name: Install Filebeat
      ansible.builtin.package:
        name: filebeat
        state: present

    - name: Configure Filebeat
      ansible.builtin.template:
        src: filebeat.yml.j2
        dest: /etc/filebeat/filebeat.yml
      notify: restart filebeat

    - name: Enable system module
      ansible.builtin.command: filebeat modules enable system
      changed_when: false

    - name: Start Filebeat
      ansible.builtin.systemd:
        name: filebeat
        state: started
        enabled: true

  handlers:
    - name: restart filebeat
      ansible.builtin.systemd:
        name: filebeat
        state: restarted
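The playbook references `filebeat.yml.j2` without showing it. One minimal shape it could take is sketched below. It assumes that Filebeat ships to the first host in the `logstash` inventory group and that the Logstash pipeline listens with a Beats input on port 5044; neither detail comes from the article.

```yaml
# templates/filebeat.yml.j2 (hypothetical contents)

# Load module configs enabled via "filebeat modules enable ..."
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml

# Assumption: Logstash accepts Beats traffic on port 5044
output.logstash:
  hosts: ["{{ groups['logstash'][0] }}:5044"]
```

Sending to Logstash rather than directly to Elasticsearch keeps parsing and enrichment in one place, at the cost of one more hop.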

Alert Rules

# templates/alert-rules.yml.j2
groups:
  - name: infrastructure
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ '{{' }} $labels.instance {{ '}}' }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space < 10% on {{ '{{' }} $labels.instance {{ '}}' }}"

      - alert: HostDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ '{{' }} $labels.instance {{ '}}' }} is down"

      - alert: HighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
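The `{{ '{{' }}` and `{{ '}}' }}` sequences in the annotations stop Jinja2 from evaluating Prometheus's own `{{ $labels.instance }}` placeholders, which makes the rendered file easy to get subtly wrong. As a sketch, the deployment task could ask `promtool` to check the rendered rules before they replace the live file. This assumes `promtool` sits next to the Prometheus binary under `prometheus_dir`, which is the case for the official release tarball.

```yaml
- name: Deploy alert rules (validated before they replace the live file)
  ansible.builtin.template:
    src: alert-rules.yml.j2
    dest: "{{ prometheus_dir }}/alert-rules.yml"
    owner: "{{ prometheus_user }}"
    # %s is replaced by the path of the rendered temporary file
    validate: "{{ prometheus_dir }}/promtool check rules %s"
  notify: reload prometheus
```

If the check fails, Ansible leaves the existing rules file untouched and the reload handler is never notified.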

Alertmanager

- name: Deploy Alertmanager
  hosts: monitoring_server
  become: true
  tasks:
    - name: Configure Alertmanager
      ansible.builtin.template:
        src: alertmanager.yml.j2
        dest: /opt/alertmanager/alertmanager.yml
      notify: restart alertmanager

  # Handler required by the "notify: restart alertmanager" above
  handlers:
    - name: restart alertmanager
      ansible.builtin.systemd:
        name: alertmanager  # assumes a systemd unit with this name exists
        state: restarted

# templates/alertmanager.yml.j2
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '{{ pagerduty_key }}'

  - name: 'slack'
    slack_configs:
      - api_url: '{{ slack_webhook }}'
        channel: '#alerts'

Auto-Remediation

- name: Auto-remediate monitoring alerts
  hosts: "{{ alert_host }}"
  become: true
  tasks:
    - name: Clear disk space
      block:
        - name: Clean package cache
          ansible.builtin.command: apt-get clean
        - name: Find old logs
          ansible.builtin.find:
            paths: /var/log
            patterns: "*.gz,*.old,*.1"
          register: old_logs
        - name: Delete old logs
          ansible.builtin.file:
            path: "{{ item.path }}"
            state: absent
          loop: "{{ old_logs.files }}"
      when: alert_type == 'DiskSpaceLow'

    - name: Restart high-CPU service
      ansible.builtin.systemd:
        name: "{{ offending_service }}"
        state: restarted
      when: alert_type == 'HighCPU' and offending_service is defined
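The remediation play depends entirely on `alert_host`, `alert_type`, and optionally `offending_service` being passed in, typically as extra vars from whatever receives the alert. A guard play placed first in the same playbook is one way to reject malformed requests before anything touches a server. This is only a sketch: the play name and failure message are illustrative, and the list of allowed values simply mirrors the two alert types handled above.

```yaml
- name: Validate remediation input
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Refuse unknown or missing alert parameters
      ansible.builtin.assert:
        that:
          - alert_host is defined
          - alert_type is defined
          - alert_type in ['DiskSpaceLow', 'HighCPU']
        fail_msg: "Pass alert_host and a supported alert_type as extra vars"
```

Because a play in which every host fails ends the run, a failed assertion here stops the playbook before the remediation play starts.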

Best Practices

- Monitor the monitoring: Prometheus should scrape itself; Grafana should have its own dashboard.
- Retention planning: set appropriate data retention for Prometheus and Elasticsearch.
- Template everything: all configs should be Ansible templates for environment-specific values.
- Alert on symptoms, not causes: prefer "High error rate" over "High CPU".
- Runbook links in alerts: include documentation URLs in alert annotations.
- Test alerts regularly: use Prometheus alerting rules tests.
- Separate metrics and logs: use different storage backends for different data types.
- Role-based access in Grafana: provide different dashboards for different teams.
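To illustrate the advice on testing alerts, Prometheus rule unit tests are themselves written in YAML and run with `promtool test rules <file>`. The sketch below exercises the HostDown rule from the Alert Rules section. It assumes the test file (the name `alert-rules-test.yml` is hypothetical) sits next to the rendered `alert-rules.yml`, not the `.j2` source, and the instance label `web01:9100` is made up for the test.

```yaml
# alert-rules-test.yml (hypothetical file name)
rule_files:
  - alert-rules.yml          # the rendered rules, not the Jinja2 template

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic target that reports "down" for ten minutes
      - series: 'up{job="node-exporter", instance="web01:9100"}'
        values: '0+0x10'
    alert_rule_test:
      # HostDown has "for: 2m", so it should be firing by minute 5
      - eval_time: 5m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node-exporter
              instance: web01:9100
            exp_annotations:
              summary: "Host web01:9100 is down"
```

Running such tests in CI, before the playbook deploys the rules, catches both expression mistakes and annotation templating errors.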

FAQ

Prometheus vs ELK for metrics?

Prometheus is purpose-built for metrics (pull-based, efficient time-series storage). ELK is for logs. Use both — Prometheus for metrics/alerting, ELK for log aggregation/search.

How to scale Prometheus?

For large environments, Thanos or Cortex add long-term storage and multi-cluster federation on top of Prometheus. Ansible can deploy these too.

Grafana dashboards as code?

Yes — export dashboards as JSON, store in Git, deploy with Ansible's uri module to the Grafana API.
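A rough sketch of that workflow follows. It assumes the exported JSON files live in a `files/dashboards/` directory next to the playbook (a hypothetical layout) and reuses the `grafana_admin_password` variable from the Grafana playbook. The numeric `id` is cleared because it is specific to the Grafana instance a dashboard was exported from.

```yaml
- name: Deploy dashboards from Git
  hosts: monitoring_server
  tasks:
    - name: Upload each dashboard JSON file to Grafana
      ansible.builtin.uri:
        url: http://localhost:3000/api/dashboards/db
        method: POST
        user: admin
        password: "{{ grafana_admin_password }}"
        force_basic_auth: true
        body_format: json
        body:
          # Read the file on the control node and drop the instance-specific id
          dashboard: "{{ lookup('ansible.builtin.file', item) | from_json | combine({'id': none}) }}"
          overwrite: true
      loop: "{{ query('ansible.builtin.fileglob', playbook_dir ~ '/files/dashboards/*.json') }}"
```

With `overwrite: true`, rerunning the play updates existing dashboards that share the same `uid` instead of failing on a conflict.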

Conclusion

Ansible brings infrastructure-as-code discipline to your observability stack. By automating Prometheus, Grafana, and ELK deployment, you ensure monitoring is consistent, reproducible, and scales with your infrastructure.
