Ansible Monitoring and Observability: Prometheus, Grafana, and ELK Stack Integration

By Luca Berton · Published 2024-01-01 · Category: installation

Deploy and configure monitoring infrastructure with Ansible. Automate Prometheus, Grafana, ELK Stack, and alerting for full observability.

Introduction

Observability — metrics, logs, and traces — is essential for running reliable infrastructure. But deploying and configuring monitoring tools manually across hundreds of servers doesn't scale. Ansible automates the entire observability stack: deploying Prometheus and Grafana for metrics, ELK for log aggregation, and configuring alerting rules — all as reproducible, version-controlled code.

Deploy Prometheus

Prometheus Server

---
- name: Deploy Prometheus
  hosts: monitoring_server
  become: true
  vars:
    prometheus_version: "2.52.0"
    prometheus_user: prometheus
    prometheus_dir: /opt/prometheus
    prometheus_data: /var/lib/prometheus
  tasks:
    - name: Create prometheus user
      ansible.builtin.user:
        name: "{{ prometheus_user }}"
        system: true
        shell: /sbin/nologin
        create_home: false

    - name: Download Prometheus
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
        dest: /opt/
        remote_src: true
        owner: "{{ prometheus_user }}"
        # Skip the download when this version is already unpacked
        creates: "/opt/prometheus-{{ prometheus_version }}.linux-amd64"

    - name: Create symlink
      ansible.builtin.file:
        src: "/opt/prometheus-{{ prometheus_version }}.linux-amd64"
        dest: "{{ prometheus_dir }}"
        state: link

    - name: Create data directory
      ansible.builtin.file:
        path: "{{ prometheus_data }}"
        state: directory
        owner: "{{ prometheus_user }}"
        mode: '0755'

    - name: Deploy Prometheus config
      ansible.builtin.template:
        src: prometheus.yml.j2
        dest: "{{ prometheus_dir }}/prometheus.yml"
        owner: "{{ prometheus_user }}"
      notify: restart prometheus

    - name: Deploy alert rules
      ansible.builtin.template:
        src: alert-rules.yml.j2
        dest: "{{ prometheus_dir }}/alert-rules.yml"
        owner: "{{ prometheus_user }}"
      notify: reload prometheus

    - name: Create systemd service
      ansible.builtin.template:
        src: prometheus.service.j2
        dest: /etc/systemd/system/prometheus.service
      notify:
        - daemon reload
        - restart prometheus

    - name: Enable and start Prometheus
      ansible.builtin.systemd:
        name: prometheus
        state: started
        enabled: true
        # Make sure systemd has read the unit file written by the previous task
        daemon_reload: true

  handlers:
    - name: daemon reload
      ansible.builtin.systemd:
        daemon_reload: true
    - name: restart prometheus
      ansible.builtin.systemd:
        name: prometheus
        state: restarted
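    # The /-/reload endpoint only works when Prometheus is started
    # with --web.enable-lifecycle; otherwise send SIGHUP or restart.
    - name: reload prometheus
      ansible.builtin.uri:
        url: http://localhost:9090/-/reload
        method: POST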
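
The playbook references prometheus.service.j2 without showing it. A minimal sketch of that unit, written inline with copy the same way the Node Exporter play does, might look like the following. The task, the 30d retention, and the restart policy are assumptions for illustration. The --web.enable-lifecycle flag is what enables the /-/reload endpoint used by the reload handler.

```yaml
# Hypothetical inline equivalent of templates/prometheus.service.j2
- name: Create systemd service
  ansible.builtin.copy:
    dest: /etc/systemd/system/prometheus.service
    mode: '0644'
    content: |
      [Unit]
      Description=Prometheus
      After=network-online.target
      Wants=network-online.target

      [Service]
      User={{ prometheus_user }}
      ExecStart={{ prometheus_dir }}/prometheus \
        --config.file={{ prometheus_dir }}/prometheus.yml \
        --storage.tsdb.path={{ prometheus_data }} \
        --storage.tsdb.retention.time=30d \
        --web.enable-lifecycle
      Restart=on-failure

      [Install]
      WantedBy=multi-user.target
  notify:
    - daemon reload
    - restart prometheus
```

Setting --storage.tsdb.retention.time explicitly keeps the retention decision in version control instead of relying on the built-in default.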

Prometheus Configuration Template

# templates/prometheus.yml.j2
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['{{ alertmanager_host }}:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets:
{% for host in groups['all'] %}
          - '{{ hostvars[host].ansible_host | default(host) }}:9100'
{% endfor %}

  - job_name: 'application'
    metrics_path: /metrics
    static_configs:
      - targets:
{% for host in groups['webservers'] %}
          - '{{ hostvars[host].ansible_host | default(host) }}:8080'
{% endfor %}
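
The template above depends on inventory groups and on an alertmanager_host variable that the article does not define. A minimal inventory sketch that would satisfy it is shown here. All host names and addresses are hypothetical.

```yaml
# inventory.yml (hypothetical hosts and addresses)
all:
  vars:
    alertmanager_host: 10.0.0.5   # consumed by the alerting section of the template
  children:
    monitoring_server:
      hosts:
        mon1:
          ansible_host: 10.0.0.5
    webservers:
      hosts:
        web1:
          ansible_host: 10.0.0.11
        web2:
          ansible_host: 10.0.0.12
```

With this inventory the node-exporter job would list all three hosts on port 9100, and the application job would list only web1 and web2 on port 8080. Because ansible_host comes from the inventory, the template does not need gathered facts from the other hosts.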

Node Exporter on All Hosts

- name: Deploy Node Exporter
  hosts: all
  become: true
  vars:
    node_exporter_version: "1.8.1"
  tasks:
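    - name: Download Node Exporter
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/node_exporter/releases/download/v{{ node_exporter_version }}/node_exporter-{{ node_exporter_version }}.linux-amd64.tar.gz"
        dest: /opt/
        remote_src: true
        # Skip the download when this version is already unpacked
        creates: "/opt/node_exporter-{{ node_exporter_version }}.linux-amd64"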

    - name: Create symlink
      ansible.builtin.file:
        src: "/opt/node_exporter-{{ node_exporter_version }}.linux-amd64/node_exporter"
        dest: /usr/local/bin/node_exporter
        state: link

    - name: Create systemd service
      ansible.builtin.copy:
        content: |
          [Unit]
          Description=Node Exporter
          After=network.target

          [Service]
          User=nobody
          ExecStart=/usr/local/bin/node_exporter
          Restart=always

          [Install]
          WantedBy=multi-user.target
        dest: /etc/systemd/system/node_exporter.service

    - name: Start Node Exporter
      ansible.builtin.systemd:
        name: node_exporter
        state: started
        enabled: true
        daemon_reload: true
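
A started service is not proof that metrics are being served. As a sketch, a follow-up play could check the endpoint on each host before Prometheus begins scraping it. The play name, the registered variable, and the 30 second timeout are illustrative choices.

```yaml
- name: Verify Node Exporter
  hosts: all
  gather_facts: false
  tasks:
    - name: Wait for the metrics port to open
      ansible.builtin.wait_for:
        port: 9100
        timeout: 30

    - name: Check that CPU metrics are exposed
      ansible.builtin.uri:
        url: http://localhost:9100/metrics
        return_content: true
      register: node_exporter_metrics
      # Fail on a non-200 response or when the expected metric is missing
      failed_when: >-
        node_exporter_metrics.status != 200 or
        'node_cpu_seconds_total' not in (node_exporter_metrics.content | default(''))
```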

Deploy Grafana

- name: Deploy Grafana
  hosts: monitoring_server
  become: true
  tasks:
    - name: Add Grafana repository
      ansible.builtin.yum_repository:
        name: grafana
        description: Grafana Repository
        baseurl: https://rpm.grafana.com
        gpgkey: https://rpm.grafana.com/gpg.key
        gpgcheck: true

    - name: Install Grafana
      ansible.builtin.package:
        name: grafana
        state: present

    - name: Configure Grafana
      ansible.builtin.template:
        src: grafana.ini.j2
        dest: /etc/grafana/grafana.ini
      notify: restart grafana

    - name: Start Grafana
      ansible.builtin.systemd:
        name: grafana-server
        state: started
        enabled: true

    - name: Wait for Grafana to start
      ansible.builtin.uri:
        url: http://localhost:3000/api/health
      register: grafana_health
      retries: 10
      delay: 5
      until: grafana_health.status == 200

    - name: Add Prometheus datasource
      ansible.builtin.uri:
        url: http://localhost:3000/api/datasources
        method: POST
        user: admin
        password: "{{ grafana_admin_password }}"
        force_basic_auth: true
        body_format: json
        body:
          name: Prometheus
          type: prometheus
          url: http://localhost:9090
          access: proxy
          isDefault: true
        status_code: [200, 409]

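    # The import API expects the full dashboard JSON, not a grafana.com ID,
    # so each dashboard is downloaded from grafana.com first.
    - name: Download dashboards from grafana.com
      ansible.builtin.uri:
        url: "https://grafana.com/api/dashboards/{{ item }}/revisions/latest/download"
        return_content: true
      register: dashboard_downloads
      loop:
        - 1860   # Node Exporter Full
        - 3662   # Prometheus 2.0 Stats
        - 11074  # Node Exporter for Prometheus

    - name: Import dashboards
      ansible.builtin.uri:
        url: http://localhost:3000/api/dashboards/import
        method: POST
        user: admin
        password: "{{ grafana_admin_password }}"
        force_basic_auth: true
        body_format: json
        body:
          dashboard: "{{ item.content | from_json }}"
          overwrite: true
          inputs:
            - name: DS_PROMETHEUS
              type: datasource
              pluginId: prometheus
              value: Prometheus
      loop: "{{ dashboard_downloads.results }}"
      loop_control:
        label: "{{ item.item }}"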

  handlers:
    - name: restart grafana
      ansible.builtin.systemd:
        name: grafana-server
        state: restarted
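
As an alternative to calling the datasource API, Grafana can read datasources from provisioning files at startup, which keeps the task idempotent without status-code handling. The task below is a sketch that assumes the package created the /etc/grafana/provisioning/datasources directory and the grafana group. It would sit in the same play so that the existing restart grafana handler applies.

```yaml
- name: Provision Prometheus datasource from a file
  ansible.builtin.copy:
    dest: /etc/grafana/provisioning/datasources/prometheus.yml
    owner: root
    group: grafana
    mode: '0640'
    content: |
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          access: proxy
          url: http://localhost:9090
          isDefault: true
  notify: restart grafana
```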

Deploy ELK Stack

Elasticsearch

- name: Deploy Elasticsearch
  hosts: elasticsearch
  become: true
  tasks:
    - name: Install Elasticsearch
      ansible.builtin.package:
        name: elasticsearch
        state: present

    - name: Configure Elasticsearch
      ansible.builtin.template:
        src: elasticsearch.yml.j2
        dest: /etc/elasticsearch/elasticsearch.yml
      notify: restart elasticsearch

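    - name: Set JVM heap
      ansible.builtin.template:
        src: jvm.options.j2
        dest: /etc/elasticsearch/jvm.options.d/heap.options
      notify: restart elasticsearch

    - name: Start Elasticsearch
      ansible.builtin.systemd:
        name: elasticsearch
        state: started
        enabled: true

  handlers:
    - name: restart elasticsearch
      ansible.builtin.systemd:
        name: elasticsearch
        state: restarted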
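
The elasticsearch.yml.j2 template is referenced but not shown. A minimal sketch for a small cluster built from the elasticsearch inventory group could look like this. The es_cluster_name variable and the paths are assumptions, and recent Elasticsearch releases enable security by default, so TLS and authentication settings would still need to be added for a real deployment.

```yaml
# templates/elasticsearch.yml.j2 (hypothetical sketch)
cluster.name: {{ es_cluster_name | default('logging') }}
node.name: {{ inventory_hostname }}
network.host: {{ ansible_host | default(inventory_hostname) }}
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch

# Every member of the elasticsearch group is a discovery seed
discovery.seed_hosts:
{% for host in groups['elasticsearch'] %}
  - {{ hostvars[host].ansible_host | default(host) }}
{% endfor %}

# Only consulted when the cluster bootstraps for the first time;
# entries must match node.name on each node
cluster.initial_master_nodes:
{% for host in groups['elasticsearch'] %}
  - {{ host }}
{% endfor %}
```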

Logstash

- name: Deploy Logstash
  hosts: logstash
  become: true
  tasks:
    - name: Install Logstash
      ansible.builtin.package:
        name: logstash
        state: present

    - name: Deploy pipeline config
      ansible.builtin.template:
        src: logstash-pipeline.conf.j2
        dest: /etc/logstash/conf.d/main.conf
      notify: restart logstash

    - name: Start Logstash
      ansible.builtin.systemd:
        name: logstash
        state: started
        enabled: true

  handlers:
    - name: restart logstash
      ansible.builtin.systemd:
        name: logstash
        state: restarted
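
The pipeline template is not shown either. As a rough sketch, the simplest pipeline that accepts Filebeat traffic and forwards it to Elasticsearch could be written inline as below. Port 5044, the plain-HTTP connection to the first host of the elasticsearch group, and the index name are assumptions. A secured cluster would also need credentials and TLS options in the output.

```yaml
# Hypothetical inline equivalent of templates/logstash-pipeline.conf.j2
- name: Deploy pipeline config
  ansible.builtin.copy:
    dest: /etc/logstash/conf.d/main.conf
    mode: '0644'
    content: |
      input {
        beats {
          port => 5044
        }
      }
      output {
        elasticsearch {
          hosts => ["http://{{ groups['elasticsearch'] | first }}:9200"]
          index => "filebeat-%{+YYYY.MM.dd}"
        }
      }
  notify: restart logstash
```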

Filebeat on All Hosts

- name: Deploy Filebeat
  hosts: all
  become: true
  tasks:
    - name: Install Filebeat
      ansible.builtin.package:
        name: filebeat
        state: present

    - name: Configure Filebeat
      ansible.builtin.template:
        src: filebeat.yml.j2
        dest: /etc/filebeat/filebeat.yml
      notify: restart filebeat

    - name: Enable system module
      ansible.builtin.command: filebeat modules enable system
      changed_when: false

    - name: Start Filebeat
      ansible.builtin.systemd:
        name: filebeat
        state: started
        enabled: true

  handlers:
    - name: restart filebeat
      ansible.builtin.systemd:
        name: filebeat
        state: restarted
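
For completeness, here is a sketch of what filebeat.yml.j2 might contain when logs are shipped to the Logstash hosts rather than directly to Elasticsearch. The logstash group comes from the earlier play. Port 5044 is an assumption and must match the beats input in the Logstash pipeline.

```yaml
# templates/filebeat.yml.j2 (hypothetical sketch)
filebeat.config.modules:
  # Load the modules enabled with "filebeat modules enable"
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

output.logstash:
  hosts:
{% for host in groups['logstash'] %}
    - "{{ hostvars[host].ansible_host | default(host) }}:5044"
{% endfor %}
```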

Alert Rules

# templates/alert-rules.yml.j2
groups:
  - name: infrastructure
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ '{{' }} $labels.instance {{ '}}' }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space < 10% on {{ '{{' }} $labels.instance {{ '}}' }}"

      - alert: HostDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ '{{' }} $labels.instance {{ '}}' }} is down"

      - alert: HighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ '{{' }} $labels.instance {{ '}}' }}"
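
A syntax error in a rules file makes the next Prometheus reload fail. One way to guard against that, sketched here, is to have Ansible run promtool against the rendered file before it replaces the live one. This assumes promtool sits next to the prometheus binary, as it does in the release tarball unpacked by the earlier play.

```yaml
- name: Deploy alert rules
  ansible.builtin.template:
    src: alert-rules.yml.j2
    dest: "{{ prometheus_dir }}/alert-rules.yml"
    owner: "{{ prometheus_user }}"
    # %s is replaced with the path of the rendered temporary file
    validate: "{{ prometheus_dir }}/promtool check rules %s"
  notify: reload prometheus
```

If validation fails, the task fails and the existing rules file is left untouched.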

Alertmanager

- name: Deploy Alertmanager
  hosts: monitoring_server
  become: true
  tasks:
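    # Assumes Alertmanager is already installed under /opt/alertmanager
    # and runs as a systemd unit named alertmanager.
    - name: Configure Alertmanager
      ansible.builtin.template:
        src: alertmanager.yml.j2
        dest: /opt/alertmanager/alertmanager.yml
      notify: restart alertmanager

  handlers:
    - name: restart alertmanager
      ansible.builtin.systemd:
        name: alertmanager
        state: restarted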

# templates/alertmanager.yml.j2
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty'
    - matchers:
        - severity="warning"
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '{{ pagerduty_key }}'

  - name: 'slack'
    slack_configs:
      - api_url: '{{ slack_webhook }}'
        channel: '#alerts'
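
Because this template carries the PagerDuty key and the Slack webhook, it is worth validating the rendered file and keeping it out of task output. The variant below is a sketch that assumes amtool was unpacked alongside Alertmanager in /opt/alertmanager. The pagerduty_key and slack_webhook variables would typically come from an Ansible Vault encrypted vars file.

```yaml
- name: Configure Alertmanager
  ansible.builtin.template:
    src: alertmanager.yml.j2
    dest: /opt/alertmanager/alertmanager.yml
    # %s is replaced with the path of the rendered temporary file
    validate: /opt/alertmanager/amtool check-config %s
  no_log: true   # avoid printing rendered secrets in diffs or error output
  notify: restart alertmanager
```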

Auto-Remediation

- name: Auto-remediate monitoring alerts
  hosts: "{{ alert_host }}"
  become: true
  tasks:
    - name: Clear disk space
      block:
        # apt-get is Debian/Ubuntu only; on RHEL-family hosts use "dnf clean all" instead
        - name: Clean package cache
          ansible.builtin.command: apt-get clean
        - name: Find rotated logs
          ansible.builtin.find:
            paths: /var/log
            patterns: "*.gz,*.old,*.1"
          register: old_logs
        - name: Delete old logs
          ansible.builtin.file:
            path: "{{ item.path }}"
            state: absent
          loop: "{{ old_logs.files }}"
      when: alert_type == 'DiskSpaceLow'

    - name: Restart high-CPU service
      ansible.builtin.systemd:
        name: "{{ offending_service }}"
        state: restarted
      when: alert_type == 'HighCPU' and offending_service is defined
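
The play above expects alert_host, alert_type, and optionally offending_service as extra variables, for example from a webhook receiver that launches the playbook. Since it deletes files and restarts services, a guard that rejects unexpected input is a reasonable addition. The fragment below is a sketch that would go at play level, next to tasks. The list of allowed types simply mirrors the two conditions used in the play.

```yaml
  pre_tasks:
    - name: Refuse unknown alert types
      ansible.builtin.assert:
        that:
          - alert_type is defined
          - alert_type in ['DiskSpaceLow', 'HighCPU']
        fail_msg: "Unsupported or missing alert_type; nothing was changed"
```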

Best Practices

Monitor the monitoring: Prometheus should scrape itself, and Grafana should have its own dashboard
Retention planning: set explicit data retention for Prometheus and Elasticsearch rather than relying on defaults
Template everything: keep all configs as Ansible templates so environment-specific values stay in variables
Alert on symptoms, not causes: prefer "High error rate" over "High CPU"
Runbook links in alerts: include documentation URLs in alert annotations
Test alerts regularly: unit-test alerting rules with promtool test rules
Separate metrics and logs: use different storage backends for different data types
Role-based access in Grafana: give different teams their own dashboards and permissions
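
To make the alert-testing advice concrete, here is a sketch of a promtool unit test for the HostDown rule defined earlier. It assumes the rendered alert-rules.yml is in the same directory as the test file. The test file name, the series labels, and the instance name are hypothetical.

```yaml
# alert-rules.test.yml, run with: promtool test rules alert-rules.test.yml
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The target reports down for five consecutive samples (0m to 4m)
      - series: 'up{job="node-exporter", instance="web1:9100"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      # HostDown has "for: 2m", so it should be firing by the 3 minute mark
      - eval_time: 3m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node-exporter
              instance: web1:9100
            exp_annotations:
              summary: "Host web1:9100 is down"
```

Running this in CI before the playbook deploys new rules catches threshold and label mistakes that a syntax check alone would miss.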

FAQ

Prometheus vs ELK for metrics?

Prometheus is purpose-built for metrics (pull-based, efficient time-series storage). ELK is for logs. Use both — Prometheus for metrics/alerting, ELK for log aggregation/search.

How to scale Prometheus?

For large environments, add Thanos or Cortex for long-term storage and multi-cluster federation. Ansible can deploy these too.
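
As an illustration of the Cortex route, Prometheus forwards samples through its remote_write setting, which fits naturally into the templated prometheus.yml. The fragment below is a sketch. The cortex_push_url variable and the example URL are hypothetical. Thanos is more commonly attached as a sidecar next to Prometheus and is not shown here.

```yaml
# Addition to templates/prometheus.yml.j2 (hypothetical variable)
remote_write:
  # Example value: http://cortex.example.com/api/v1/push
  - url: "{{ cortex_push_url }}"
```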

Grafana dashboards as code?

Yes — export dashboards as JSON, store in Git, deploy with Ansible's uri module to the Grafana API.
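
A minimal sketch of that workflow follows. It assumes exported dashboards are stored as JSON under files/dashboards/ next to the playbook and reuses the grafana_admin_password variable from the Grafana play. The directory layout and play name are illustrative.

```yaml
- name: Deploy dashboards from Git-tracked JSON files
  hosts: monitoring_server
  tasks:
    - name: Push each dashboard to the Grafana API
      ansible.builtin.uri:
        url: http://localhost:3000/api/dashboards/db
        method: POST
        user: admin
        password: "{{ grafana_admin_password }}"
        force_basic_auth: true
        body_format: json
        body:
          # Null the numeric id so Grafana matches on uid instead of an
          # id that belongs to the instance the dashboard was exported from
          dashboard: "{{ lookup('ansible.builtin.file', item) | from_json | combine({'id': none}) }}"
          overwrite: true
      loop: "{{ query('ansible.builtin.fileglob', 'files/dashboards/*.json') }}"
      loop_control:
        label: "{{ item | basename }}"
```

With overwrite set to true, rerunning the play updates existing dashboards in place, so Git remains the source of truth.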

Conclusion

Ansible brings infrastructure-as-code discipline to your observability stack. By automating Prometheus, Grafana, and ELK deployment, you ensure monitoring is consistent, reproducible, and scales with your infrastructure.
