Ansible Monitoring and Observability: Prometheus, Grafana, and ELK Stack Integration
By Luca Berton · Published 2024-01-01 · Category: installation
Deploy and configure monitoring infrastructure with Ansible. Automate Prometheus, Grafana, ELK Stack, and alerting for full observability across your infrastructure.
Introduction
Observability — metrics, logs, and traces — is essential for running reliable infrastructure. But deploying and configuring monitoring tools manually across hundreds of servers doesn't scale. Ansible automates the entire observability stack: deploying Prometheus and Grafana for metrics, ELK for log aggregation, and configuring alerting rules — all as reproducible, version-controlled code.
Deploy Prometheus
Prometheus Server
---
- name: Deploy Prometheus
  hosts: monitoring_server
  become: true
  vars:
    prometheus_version: "2.52.0"
    prometheus_user: prometheus
    prometheus_dir: /opt/prometheus
    prometheus_data: /var/lib/prometheus

  tasks:
    - name: Create prometheus user
      ansible.builtin.user:
        name: "{{ prometheus_user }}"
        system: true
        shell: /sbin/nologin
        create_home: false

    - name: Download Prometheus
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
        dest: /opt/
        remote_src: true
        owner: "{{ prometheus_user }}"

    - name: Create symlink
      ansible.builtin.file:
        src: "/opt/prometheus-{{ prometheus_version }}.linux-amd64"
        dest: "{{ prometheus_dir }}"
        state: link

    - name: Create data directory
      ansible.builtin.file:
        path: "{{ prometheus_data }}"
        state: directory
        owner: "{{ prometheus_user }}"
        mode: '0755'

    - name: Deploy Prometheus config
      ansible.builtin.template:
        src: prometheus.yml.j2
        dest: "{{ prometheus_dir }}/prometheus.yml"
        owner: "{{ prometheus_user }}"
      notify: restart prometheus

    - name: Deploy alert rules
      ansible.builtin.template:
        src: alert-rules.yml.j2
        dest: "{{ prometheus_dir }}/alert-rules.yml"
        owner: "{{ prometheus_user }}"
      notify: reload prometheus

    - name: Create systemd service
      ansible.builtin.template:
        src: prometheus.service.j2
        dest: /etc/systemd/system/prometheus.service
      notify:
        - daemon reload
        - restart prometheus

    - name: Enable and start Prometheus
      ansible.builtin.systemd:
        name: prometheus
        state: started
        enabled: true

  handlers:
    - name: daemon reload
      ansible.builtin.systemd:
        daemon_reload: true

    - name: restart prometheus
      ansible.builtin.systemd:
        name: prometheus
        state: restarted

    # The reload endpoint only works when Prometheus is started
    # with the --web.enable-lifecycle flag.
    - name: reload prometheus
      ansible.builtin.uri:
        url: http://localhost:9090/-/reload
        method: POST
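The playbook references prometheus.service.j2 without showing it. As a minimal sketch, the same unit can be written inline with the copy module, reusing the variables defined in the play. The 15-day retention value is an assumption chosen for illustration; --web.enable-lifecycle is included because the reload handler depends on it.

```yaml
# Hypothetical inline alternative to the prometheus.service.j2 template.
- name: Create systemd service
  ansible.builtin.copy:
    dest: /etc/systemd/system/prometheus.service
    mode: '0644'
    content: |
      [Unit]
      Description=Prometheus
      Wants=network-online.target
      After=network-online.target

      [Service]
      User={{ prometheus_user }}
      ExecStart={{ prometheus_dir }}/prometheus \
        --config.file={{ prometheus_dir }}/prometheus.yml \
        --storage.tsdb.path={{ prometheus_data }} \
        --storage.tsdb.retention.time=15d \
        --web.enable-lifecycle
      Restart=on-failure

      [Install]
      WantedBy=multi-user.target
  notify:
    - daemon reload
    - restart prometheus
```

Keeping the unit in a separate template file is usually easier to review; the inline form is shown here only so the flags are visible next to the tasks that rely on them.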
Prometheus Configuration Template
# templates/prometheus.yml.j2
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['{{ alertmanager_host }}:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets:
{% for host in groups['all'] %}
          - '{{ hostvars[host].ansible_host | default(host) }}:9100'
{% endfor %}

  - job_name: 'application'
    metrics_path: /metrics
    static_configs:
      - targets:
{% for host in groups['webservers'] %}
          - '{{ hostvars[host].ansible_host | default(host) }}:8080'
{% endfor %}
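The template above reads its scrape targets from inventory groups and expects an alertmanager_host variable. A minimal YAML inventory that would satisfy it might look like the sketch below. All host names and addresses are hypothetical, and the elasticsearch and logstash groups are included only because later plays target them.

```yaml
# inventory/hosts.yml (hypothetical hosts and addresses)
all:
  vars:
    alertmanager_host: 10.0.0.10
  children:
    monitoring_server:
      hosts:
        mon1:
          ansible_host: 10.0.0.10
    webservers:
      hosts:
        web1:
          ansible_host: 10.0.1.11
        web2:
          ansible_host: 10.0.1.12
    elasticsearch:
      hosts:
        es1:
          ansible_host: 10.0.2.21
    logstash:
      hosts:
        ls1:
          ansible_host: 10.0.2.31
```

With this inventory, the node-exporter job would list all five hosts on port 9100 and the application job would list web1 and web2 on port 8080.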
Node Exporter on All Hosts
- name: Deploy Node Exporter
  hosts: all
  become: true
  vars:
    node_exporter_version: "1.8.1"

  tasks:
    - name: Download Node Exporter
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/node_exporter/releases/download/v{{ node_exporter_version }}/node_exporter-{{ node_exporter_version }}.linux-amd64.tar.gz"
        dest: /opt/
        remote_src: true

    - name: Create symlink
      ansible.builtin.file:
        src: "/opt/node_exporter-{{ node_exporter_version }}.linux-amd64/node_exporter"
        dest: /usr/local/bin/node_exporter
        state: link

    - name: Create systemd service
      ansible.builtin.copy:
        content: |
          [Unit]
          Description=Node Exporter
          After=network.target

          [Service]
          User=nobody
          ExecStart=/usr/local/bin/node_exporter
          Restart=always

          [Install]
          WantedBy=multi-user.target
        dest: /etc/systemd/system/node_exporter.service

    - name: Start Node Exporter
      ansible.builtin.systemd:
        name: node_exporter
        state: started
        enabled: true
        daemon_reload: true
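A started service does not prove that metrics are being served. One way to check, sketched here as a separate play, is to poll the exporter's /metrics endpoint on its default port 9100 and assert that a well-known metric is present. The retry counts are arbitrary assumptions.

```yaml
- name: Verify Node Exporter
  hosts: all

  tasks:
    - name: Wait for the metrics endpoint to answer
      ansible.builtin.uri:
        url: http://localhost:9100/metrics
        return_content: true
      register: node_metrics
      retries: 5
      delay: 3
      until: node_metrics.status == 200

    - name: Confirm node metrics are exposed
      ansible.builtin.assert:
        that:
          - "'node_cpu_seconds_total' in node_metrics.content"
```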
Deploy Grafana
- name: Deploy Grafana
  hosts: monitoring_server
  become: true

  tasks:
    # yum_repository applies to RHEL-family hosts only
    - name: Add Grafana repository
      ansible.builtin.yum_repository:
        name: grafana
        description: Grafana Repository
        baseurl: https://rpm.grafana.com
        gpgkey: https://rpm.grafana.com/gpg.key
        gpgcheck: true

    - name: Install Grafana
      ansible.builtin.package:
        name: grafana
        state: present

    - name: Configure Grafana
      ansible.builtin.template:
        src: grafana.ini.j2
        dest: /etc/grafana/grafana.ini
      notify: restart grafana

    - name: Start Grafana
      ansible.builtin.systemd:
        name: grafana-server
        state: started
        enabled: true

    - name: Wait for Grafana to start
      ansible.builtin.uri:
        url: http://localhost:3000/api/health
      register: grafana_health
      retries: 10
      delay: 5
      until: grafana_health.status == 200

    - name: Add Prometheus datasource
      ansible.builtin.uri:
        url: http://localhost:3000/api/datasources
        method: POST
        user: admin
        password: "{{ grafana_admin_password }}"
        force_basic_auth: true
        body_format: json
        body:
          name: Prometheus
          type: prometheus
          url: http://localhost:9090
          access: proxy
          isDefault: true
        status_code: [200, 409]  # 409 means the datasource already exists

    # The import endpoint expects the full dashboard JSON, not a grafana.com ID,
    # so the JSON is downloaded first (requires outbound access to grafana.com).
    - name: Download dashboard JSON from grafana.com
      ansible.builtin.uri:
        url: "https://grafana.com/api/dashboards/{{ item }}/revisions/latest/download"
        return_content: true
      register: dashboard_downloads
      loop:
        - 1860   # Node Exporter Full
        - 3662   # Prometheus 2.0 Stats
        - 11074  # Node Exporter for Prometheus

    - name: Import dashboards
      ansible.builtin.uri:
        url: http://localhost:3000/api/dashboards/import
        method: POST
        user: admin
        password: "{{ grafana_admin_password }}"
        force_basic_auth: true
        body_format: json
        body:
          dashboard: "{{ item.content | from_json }}"
          overwrite: true
          inputs:
            - name: DS_PROMETHEUS
              type: datasource
              pluginId: prometheus
              value: Prometheus
      loop: "{{ dashboard_downloads.results }}"
      loop_control:
        label: "{{ item.item }}"

  handlers:
    - name: restart grafana
      ansible.builtin.systemd:
        name: grafana-server
        state: restarted
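As an alternative to the API call for the datasource, Grafana can load datasources from provisioning files at startup. The sketch below mirrors the values used in the API task. The path assumes the default provisioning directory of a package installation; deploying the file with a template or copy task and restarting Grafana is left out.

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```

A provisioning file keeps the datasource definition in version control and avoids handling the admin password in the play, at the cost of requiring a Grafana restart to pick up changes.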
Deploy ELK Stack
Elasticsearch
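- name: Deploy Elasticsearch
  hosts: elasticsearch
  become: true

  tasks:
    # Assumes the Elastic package repository is already configured on the host
    - name: Install Elasticsearch
      ansible.builtin.package:
        name: elasticsearch
        state: present

    - name: Configure Elasticsearch
      ansible.builtin.template:
        src: elasticsearch.yml.j2
        dest: /etc/elasticsearch/elasticsearch.yml
      notify: restart elasticsearch

    - name: Set JVM heap
      ansible.builtin.template:
        src: jvm.options.j2
        dest: /etc/elasticsearch/jvm.options.d/heap.options
      notify: restart elasticsearch

    - name: Start Elasticsearch
      ansible.builtin.systemd:
        name: elasticsearch
        state: started
        enabled: true

  # Without this handler the notify above fails with "handler not found"
  handlers:
    - name: restart elasticsearch
      ansible.builtin.systemd:
        name: elasticsearch
        state: restarted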
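The package task in this play only succeeds if the Elastic repository is available on the host. On RHEL-family systems, two tasks like the following could be placed before the install task. This is a sketch that assumes the 8.x release line; adjust the repository path for other versions.

```yaml
- name: Import the Elastic GPG key
  ansible.builtin.rpm_key:
    key: https://artifacts.elastic.co/GPG-KEY-elasticsearch
    state: present

- name: Add the Elastic 8.x repository
  ansible.builtin.yum_repository:
    name: elastic-8.x
    description: Elastic repository for 8.x packages
    baseurl: https://artifacts.elastic.co/packages/8.x/yum
    gpgkey: https://artifacts.elastic.co/GPG-KEY-elasticsearch
    gpgcheck: true
```

The same repository also provides the logstash and filebeat packages used in the following plays.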
Logstash
- name: Deploy Logstash
  hosts: logstash
  become: true

  tasks:
    - name: Install Logstash
      ansible.builtin.package:
        name: logstash
        state: present

    - name: Deploy pipeline config
      ansible.builtin.template:
        src: logstash-pipeline.conf.j2
        dest: /etc/logstash/conf.d/main.conf
      notify: restart logstash

    - name: Start Logstash
      ansible.builtin.systemd:
        name: logstash
        state: started
        enabled: true

  # Without this handler the notify above fails with "handler not found"
  handlers:
    - name: restart logstash
      ansible.builtin.systemd:
        name: logstash
        state: restarted
Filebeat on All Hosts
- name: Deploy Filebeat
  hosts: all
  become: true

  tasks:
    - name: Install Filebeat
      ansible.builtin.package:
        name: filebeat
        state: present

    - name: Configure Filebeat
      ansible.builtin.template:
        src: filebeat.yml.j2
        dest: /etc/filebeat/filebeat.yml
      notify: restart filebeat

    # Enabling a module renames system.yml.disabled to system.yml,
    # so "creates" keeps this task idempotent.
    - name: Enable system module
      ansible.builtin.command:
        cmd: filebeat modules enable system
        creates: /etc/filebeat/modules.d/system.yml
      notify: restart filebeat

    - name: Start Filebeat
      ansible.builtin.systemd:
        name: filebeat
        state: started
        enabled: true

  handlers:
    - name: restart filebeat
      ansible.builtin.systemd:
        name: filebeat
        state: restarted
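The play deploys filebeat.yml.j2 without showing its contents. A minimal sketch of such a template follows, assuming logs are shipped to every host in the logstash inventory group on port 5044 (the conventional Beats port) and that the Logstash pipeline defines a matching beats input.

```yaml
# templates/filebeat.yml.j2 (illustrative sketch)
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

output.logstash:
  hosts:
{% for host in groups['logstash'] %}
    - '{{ hostvars[host].ansible_host | default(host) }}:5044'
{% endfor %}
```

The modules path is what lets the "Enable system module" task take effect: Filebeat loads every enabled file under modules.d.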
Alert Rules
# templates/alert-rules.yml.j2
# Prometheus template braces are escaped so that Jinja2 passes them through.
groups:
  - name: infrastructure
    rules:
      # rate() is used instead of irate(), which is intended for graphing
      # fast-moving counters rather than for alerting.
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ '{{' }} $labels.instance {{ '}}' }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space < 10% on {{ '{{' }} $labels.instance {{ '}}' }}"

      - alert: HostDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ '{{' }} $labels.instance {{ '}}' }} is down"

      - alert: HighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
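Rules like these can be unit-tested with promtool, which ships in the Prometheus tarball. The sketch below covers the HostDown rule and runs against the rendered alert-rules.yml, not the Jinja2 template. The test file name, the instance value, and the job label are assumptions made for the example.

```yaml
# alert-rules.test.yml (hypothetical file name)
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The target reports down for five consecutive samples
      - series: 'up{job="node-exporter", instance="web1:9100"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      # After 3 minutes the 2-minute "for" window has elapsed
      - eval_time: 3m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node-exporter
              instance: web1:9100
            exp_annotations:
              summary: "Host web1:9100 is down"
```

Running promtool test rules alert-rules.test.yml in the directory that holds both files evaluates the rule against the synthetic series and reports any mismatch between expected and actual alerts.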
Alertmanager
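# Assumes Alertmanager is already installed under /opt/alertmanager
# and runs as a systemd service named "alertmanager".
- name: Deploy Alertmanager
  hosts: monitoring_server
  become: true

  tasks:
    - name: Configure Alertmanager
      ansible.builtin.template:
        src: alertmanager.yml.j2
        dest: /opt/alertmanager/alertmanager.yml
      notify: restart alertmanager

  # Without this handler the notify above fails with "handler not found"
  handlers:
    - name: restart alertmanager
      ansible.builtin.systemd:
        name: alertmanager
        state: restarted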
# templates/alertmanager.yml.j2
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '{{ pagerduty_key }}'
  - name: 'slack'
    slack_configs:
      - api_url: '{{ slack_webhook }}'
        channel: '#alerts'
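A malformed alertmanager.yml prevents Alertmanager from starting, so it can help to validate the rendered file before it replaces the live one. The sketch below adds the template module's validate option to the configuration task and calls amtool check-config. The amtool path is an assumption based on the /opt/alertmanager install location used above.

```yaml
- name: Configure Alertmanager
  ansible.builtin.template:
    src: alertmanager.yml.j2
    dest: /opt/alertmanager/alertmanager.yml
    # %s is replaced with the path of the rendered temporary file
    validate: /opt/alertmanager/amtool check-config %s
  notify: restart alertmanager
```

If validation fails, Ansible leaves the existing file untouched and the handler is not notified.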
Auto-Remediation
# Expects alert_host and alert_type (and optionally offending_service)
# to be passed in as extra vars.
- name: Auto-remediate monitoring alerts
  hosts: "{{ alert_host }}"
  become: true

  tasks:
    - name: Clear disk space
      when: alert_type == 'DiskSpaceLow'
      block:
        # apt-get applies to Debian/Ubuntu hosts only
        - name: Clean package cache
          ansible.builtin.command: apt-get clean
          changed_when: true

        - name: Find old logs
          ansible.builtin.find:
            paths: /var/log
            patterns: "*.gz,*.old,*.1"
          register: old_logs

        - name: Delete old logs
          ansible.builtin.file:
            path: "{{ item.path }}"
            state: absent
          loop: "{{ old_logs.files }}"

    - name: Restart high-CPU service
      ansible.builtin.systemd:
        name: "{{ offending_service }}"
        state: restarted
      when: alert_type == 'HighCPU' and offending_service is defined
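The remediation play is driven entirely by variables. One simple way to supply them is an extra-vars file passed with -e @remediation-vars.yml on the ansible-playbook command line. The file name and values below are hypothetical.

```yaml
# remediation-vars.yml (hypothetical values)
alert_host: web1
alert_type: DiskSpaceLow
# offending_service is only needed for HighCPU alerts, for example:
# offending_service: myapp
```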
Best Practices
• Monitor the monitoring: Prometheus should scrape itself; Grafana should have its own dashboard
• Retention planning: Set appropriate data retention for Prometheus and Elasticsearch
• Template everything: All configs should be Ansible templates for environment-specific values
• Alert on symptoms, not causes: "High error rate" over "High CPU"
• Runbook links in alerts: Include documentation URLs in alert annotations
• Test alerts regularly: Use Prometheus alerting rules tests
• Separate metrics and logs: Different storage backends for different data types
• Role-based access in Grafana: Different dashboards for different teams
FAQ
Prometheus vs ELK for metrics?
Prometheus is purpose-built for metrics (pull-based, efficient time-series storage). ELK is for logs. Use both — Prometheus for metrics/alerting, ELK for log aggregation/search.
How to scale Prometheus?
For large environments, add Thanos or Cortex for long-term storage and multi-cluster federation. Ansible can deploy these as well.
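With Cortex, each Prometheus server typically forwards its samples through remote_write. A minimal sketch of the extra section in prometheus.yml.j2 follows; cortex_url is a hypothetical variable, and the /api/v1/push path is the Cortex push endpoint. Thanos is usually attached differently, through a sidecar next to Prometheus, which is not shown here.

```yaml
# Addition to templates/prometheus.yml.j2 (cortex_url is a hypothetical variable)
remote_write:
  - url: '{{ cortex_url }}/api/v1/push'
```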
Grafana dashboards as code?
Yes — export dashboards as JSON, store in Git, deploy with Ansible's uri module to the Grafana API.
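As a sketch of that workflow, the play below pushes every JSON file from a files/dashboards directory next to the playbook to Grafana's /api/dashboards/db endpoint. The directory layout is an assumption; grafana_admin_password is the same variable used in the Grafana deployment play.

```yaml
- name: Deploy Grafana dashboards from Git
  hosts: monitoring_server

  tasks:
    - name: Push each dashboard JSON file to the Grafana API
      ansible.builtin.uri:
        url: http://localhost:3000/api/dashboards/db
        method: POST
        user: admin
        password: "{{ grafana_admin_password }}"
        force_basic_auth: true
        body_format: json
        body:
          # The file lookup reads from the control node, where the Git checkout lives
          dashboard: "{{ lookup('ansible.builtin.file', item) | from_json }}"
          overwrite: true
      loop: "{{ query('ansible.builtin.fileglob', playbook_dir ~ '/files/dashboards/*.json') }}"
      loop_control:
        label: "{{ item | basename }}"
```

Exported dashboards should keep a stable uid and have their numeric id set to null, so that overwrite updates the existing dashboard instead of failing on an id that does not exist on the target instance.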
Conclusion
Ansible brings infrastructure-as-code discipline to your observability stack. By automating Prometheus, Grafana, and ELK deployment, you ensure monitoring is consistent, reproducible, and scales with your infrastructure.