Ansible for AI Infrastructure: Deploy LLMs, GPUs & ML Pipelines (2026 Guide)
By Luca Berton · Published 2024-01-01 · Category: installation
Complete guide to automating AI infrastructure with Ansible. Deploy GPU clusters, configure NVIDIA drivers, serve LLMs with vLLM and TGI, manage model training.
AI infrastructure is the biggest IT spending category of 2026. Deloitte calls it an "AI infrastructure reckoning": organizations must balance GPU costs, model choice, inference optimization, and deployment architecture. Ansible automates the entire AI compute stack, from bare-metal GPU provisioning to model serving.
AI Infrastructure Stack
┌─────────────────────────────────┐
│ Applications & Agents │ ← Agent frameworks, APIs
├─────────────────────────────────┤
│ Model Serving (Inference) │ ← vLLM, TGI, Triton
├─────────────────────────────────┤
│ Training & Fine-tuning │ ← PyTorch, DeepSpeed
├─────────────────────────────────┤
│ ML Platform │ ← MLflow, Kubeflow, Ray
├─────────────────────────────────┤
│ Container Runtime │ ← Docker + NVIDIA Toolkit
├─────────────────────────────────┤
│ GPU Drivers & CUDA │ ← NVIDIA drivers, CUDA, cuDNN
├─────────────────────────────────┤
│ Bare Metal / Cloud VMs │ ← Ansible manages this entire stack
└─────────────────────────────────┘
See also: Ansible for Agentic AI: Automate Multi-Agent Systems Infrastructure (2026 Guide)
Provision GPU Servers
Install NVIDIA Drivers
- name: Provision GPU servers for AI workloads
  hosts: gpu_servers
  become: true
  vars:
    nvidia_driver_version: "550"
    # Informational only: the distro nvidia-cuda-toolkit package below is not pinned to this version
    cuda_version: "12.6"
  tasks:
    - name: Add NVIDIA driver repository
      ansible.builtin.apt_repository:
        repo: "ppa:graphics-drivers/ppa"
        state: present
      when: ansible_os_family == "Debian"

    - name: Install NVIDIA drivers
      ansible.builtin.apt:
        name:
          - "nvidia-driver-{{ nvidia_driver_version }}"
          - "nvidia-utils-{{ nvidia_driver_version }}"
        state: present
        update_cache: true
      notify: reboot for nvidia

    - name: Install CUDA toolkit
      ansible.builtin.apt:
        name: "nvidia-cuda-toolkit"
        state: present

    # Run the pending reboot handler now; otherwise nvidia-smi runs before the new driver is loaded
    - name: Flush handlers before verification
      ansible.builtin.meta: flush_handlers

    - name: Verify GPU detection
      ansible.builtin.command: nvidia-smi
      register: nvidia_smi
      changed_when: false

    - name: Display GPU info
      ansible.builtin.debug:
        msg: "{{ nvidia_smi.stdout_lines[:5] }}"
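The play notifies a handler named `reboot for nvidia`, which the article does not show. A minimal sketch of what it might look like follows; the message text and the 600-second timeout are assumptions, not values from the original.

```yaml
# Add alongside `tasks:` in the GPU provisioning play
handlers:
  - name: reboot for nvidia
    ansible.builtin.reboot:
      msg: "Rebooting to load the NVIDIA kernel modules"
      reboot_timeout: 600  # assumed value; slow-booting GPU servers may need more
```

Because handlers normally run at the end of a play, any task that depends on the new driver should come after the reboot has actually happened.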
Configure NVIDIA Container Toolkit
- name: Add NVIDIA Container Toolkit repo
  ansible.builtin.shell: |
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
      gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
      sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
      tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  args:
    creates: /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

- name: Install NVIDIA Container Toolkit
  ansible.builtin.apt:
    name: nvidia-container-toolkit
    state: present
    update_cache: true

- name: Configure Docker for NVIDIA runtime
  ansible.builtin.command: nvidia-ctk runtime configure --runtime=docker
  notify: restart docker
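The last task notifies `restart docker`, a handler the article leaves out. The sketch below adds that handler plus a quick check that Docker now lists the `nvidia` runtime. It assumes Docker runs as a service named `docker`; the verification task is an addition, not part of the original.

```yaml
# Keys to merge into the play that installs the NVIDIA Container Toolkit
tasks:
  - name: Apply the pending Docker restart before checking
    ansible.builtin.meta: flush_handlers

  - name: Check that Docker lists the nvidia runtime
    ansible.builtin.command: docker info
    register: docker_info
    changed_when: false
    failed_when: "'nvidia' not in docker_info.stdout"

handlers:
  - name: restart docker
    ansible.builtin.service:
      name: docker
      state: restarted
```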
Deploy Model Inference Servers
vLLM — High-Throughput LLM Serving
- name: Deploy vLLM inference server
  hosts: inference_servers
  become: true
  vars:
    # gpu_ids and tensor_parallel are illustrative assumptions for an 8-GPU host.
    # Each model gets its own GPUs so the two instances do not compete for the same memory.
    models:
      - name: "meta-llama/Llama-3.1-70B-Instruct"
        port: 8000
        gpu_memory: 0.9
        max_model_len: 8192
        gpu_ids: ["0", "1", "2", "3"]
        tensor_parallel: 4
      - name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
        port: 8001
        gpu_memory: 0.85
        max_model_len: 32768
        gpu_ids: ["4", "5", "6", "7"]
        tensor_parallel: 4
  tasks:
    - name: Deploy vLLM instances
      community.docker.docker_container:
        name: "vllm-{{ item.port }}"
        image: vllm/vllm-openai:latest
        state: started
        restart_policy: unless-stopped
        ipc_mode: host  # vLLM relies on shared memory for tensor-parallel inference
        ports:
          - "{{ item.port }}:8000"
        volumes:
          - /models:/root/.cache/huggingface
        env:
          HUGGING_FACE_HUB_TOKEN: "{{ vault_hf_token }}"
        command: >
          --model {{ item.name }}
          --gpu-memory-utilization {{ item.gpu_memory }}
          --max-model-len {{ item.max_model_len }}
          --tensor-parallel-size {{ item.tensor_parallel }}
          --enable-prefix-caching
        device_requests:
          - driver: nvidia
            device_ids: "{{ item.gpu_ids }}"
            capabilities: [["gpu"]]
      loop: "{{ models }}"
      no_log: true  # keeps the Hugging Face token out of task output

    - name: Wait for inference servers
      ansible.builtin.uri:
        url: "http://localhost:{{ item.port }}/health"
        method: GET
      loop: "{{ models }}"
      register: health
      until: health is succeeded
      retries: 30
      delay: 10
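A passing `/health` check only shows that the server process is up. As a further sketch, the tasks below query vLLM's OpenAI-compatible `/v1/models` endpoint and assert that each instance reports the model it was started with. They assume the `models` variable from the play above and that no custom served model name is configured.

```yaml
# Tasks to append to the vLLM play above
- name: List models served by each vLLM instance
  ansible.builtin.uri:
    url: "http://localhost:{{ item.port }}/v1/models"
    method: GET
  loop: "{{ models }}"
  register: served

- name: Assert each instance serves the expected model
  ansible.builtin.assert:
    that:
      - item.item.name in (item.json.data | map(attribute='id') | list)
    fail_msg: "Port {{ item.item.port }} does not report {{ item.item.name }}"
  loop: "{{ served.results }}"
  loop_control:
    label: "{{ item.item.name }}"
```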
NVIDIA Triton Inference Server
- name: Deploy Triton for multi-model serving
  hosts: inference_servers
  become: true
  tasks:
    - name: Create model repository
      ansible.builtin.file:
        path: /models/triton-repo/{{ item }}/1
        state: directory
      loop:
        - llama-3
        - embedding-model
        - reranker

    - name: Deploy Triton server
      community.docker.docker_container:
        name: triton
        image: nvcr.io/nvidia/tritonserver:24.10-py3
        state: started
        # Host ports 8000 and 8001 collide with the vLLM example if both run on the same host;
        # remap one side before combining the two plays.
        ports:
          - "8000:8000" # HTTP
          - "8001:8001" # gRPC
          - "8002:8002" # Metrics
        volumes:
          - /models/triton-repo:/models
        command: tritonserver --model-repository=/models
        device_requests:
          - driver: nvidia
            count: -1
            capabilities: [["gpu"]]
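The Triton play has no readiness check comparable to the vLLM one. A possible sketch polls Triton's HTTP readiness endpoint, `/v2/health/ready`, on the HTTP port mapped above. It assumes model files and configuration have already been placed in the repository directories; with empty directories Triton will not become ready.

```yaml
# Task to append to the Triton play above
- name: Wait for Triton to report ready
  ansible.builtin.uri:
    url: "http://localhost:8000/v2/health/ready"
    method: GET
    status_code: 200
  register: triton_ready
  until: triton_ready.status == 200
  retries: 30   # assumed values, mirroring the vLLM wait task
  delay: 10
```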
Training Infrastructure
PyTorch Distributed Training
- name: Configure distributed training cluster
  hosts: training_nodes
  become: true
  vars:
    nccl_socket_ifname: "eth0"
    master_addr: "{{ hostvars[groups['training_nodes'][0]]['ansible_host'] }}"
    master_port: 29500
  tasks:
    - name: Install training dependencies
      ansible.builtin.pip:
        name:
          - torch
          - torchvision
          - deepspeed
          - transformers
          - accelerate
          - wandb
        virtualenv: /opt/training/venv

    - name: Configure NCCL for multi-node training
      ansible.builtin.copy:
        content: |
          NCCL_SOCKET_IFNAME={{ nccl_socket_ifname }}
          NCCL_DEBUG=INFO
          MASTER_ADDR={{ master_addr }}
          MASTER_PORT={{ master_port }}
          WORLD_SIZE={{ groups['training_nodes'] | length }}
          RANK={{ groups['training_nodes'].index(inventory_hostname) }}
        dest: /opt/training/.env

    - name: Configure shared storage for checkpoints
      ansible.posix.mount:
        path: /opt/training/checkpoints
        src: "nfs-server:/exports/checkpoints"
        fstype: nfs4
        opts: rw,hard,intr
        state: mounted
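The play writes the rendezvous settings but stops short of starting a job. One way to use them, sketched here, is a small `torchrun` launcher rendered per node. The launcher path, the training script `/opt/training/train.py`, and the fallback of 8 GPUs per node are hypothetical. Note that `WORLD_SIZE` and `RANK` in the `.env` file hold the node count and node index, and `torchrun` sets the per-process values itself.

```yaml
# Task to append to the distributed training play above
- name: Write a torchrun launcher (sketch)
  ansible.builtin.copy:
    dest: /opt/training/launch.sh
    mode: "0755"
    content: |
      #!/usr/bin/env bash
      set -euo pipefail
      # Export the NCCL settings written by the previous task
      set -a; source /opt/training/.env; set +a
      exec /opt/training/venv/bin/torchrun \
        --nnodes "{{ groups['training_nodes'] | length }}" \
        --node_rank "{{ groups['training_nodes'].index(inventory_hostname) }}" \
        --nproc_per_node "{{ gpus | default(8) }}" \
        --master_addr "{{ master_addr }}" \
        --master_port "{{ master_port }}" \
        /opt/training/train.py
```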
GPU Monitoring and Cost Optimization
- name: Deploy GPU monitoring
  hosts: gpu_servers
  become: true
  tasks:
    - name: Deploy DCGM exporter for GPU metrics
      community.docker.docker_container:
        name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.3.0-ubuntu22.04
        state: started
        restart_policy: unless-stopped
        ports:
          - "9400:9400"
        device_requests:
          - driver: nvidia
            count: -1
            capabilities: [["gpu"]]

    # The raw block keeps Ansible from evaluating the Prometheus template expressions as Jinja2
    - name: Create GPU utilization alert rules
      ansible.builtin.copy:
        content: |
          {% raw %}
          groups:
            - name: gpu_alerts
              rules:
                - alert: GPULowUtilization
                  expr: DCGM_FI_DEV_GPU_UTIL < 20
                  for: 30m
                  labels:
                    severity: warning
                  annotations:
                    summary: "GPU {{ $labels.gpu }} underutilized on {{ $labels.instance }}"
                    description: "GPU utilization below 20% for 30 minutes; consider consolidating workloads"
                - alert: GPUMemoryNearFull
                  expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
                  for: 5m
                  labels:
                    severity: critical
                  annotations:
                    summary: "GPU memory >95% on {{ $labels.instance }}"
                - alert: GPUTemperatureHigh
                  expr: DCGM_FI_DEV_GPU_TEMP > 85
                  for: 10m
                  labels:
                    severity: warning
                  annotations:
                    summary: "GPU temperature {{ $value }}°C on {{ $labels.instance }}"
          {% endraw %}
        dest: /etc/prometheus/rules/gpu-alerts.yml
      notify: reload prometheus
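The rules task notifies `reload prometheus`, which is not defined in the article. The sketch below supplies that handler and a `promtool` syntax check. It assumes Prometheus runs on the same host under a service named `prometheus` that supports reload, and that `promtool` is on the PATH; if Prometheus lives elsewhere, the rules file and handler belong in a play targeting that host.

```yaml
# Keys to merge into the GPU monitoring play above
tasks:
  - name: Validate GPU alert rules
    ansible.builtin.command: promtool check rules /etc/prometheus/rules/gpu-alerts.yml
    changed_when: false

handlers:
  - name: reload prometheus
    ansible.builtin.service:
      name: prometheus
      state: reloaded
```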
See also: Ansible for Domain-Specific AI Models: Deploy & Manage Enterprise DSLMs (2026 Guide)
MLOps Platform Deployment
- name: Deploy MLflow tracking server
  hosts: mlops
  become: true
  vars:
    mlflow_port: 5000
    artifact_store: s3://ml-artifacts
  tasks:
    - name: Install MLflow
      ansible.builtin.pip:
        name:
          - mlflow
          - boto3
          - psycopg2-binary
        virtualenv: /opt/mlflow/venv

    - name: Deploy MLflow service
      ansible.builtin.copy:
        content: |
          [Unit]
          Description=MLflow Tracking Server
          After=network.target postgresql.service

          [Service]
          Type=simple
          User=mlflow
          ExecStart=/opt/mlflow/venv/bin/mlflow server \
            --backend-store-uri postgresql://mlflow:{{ vault_mlflow_db_pass }}@localhost/mlflow \
            --default-artifact-root {{ artifact_store }} \
            --host 0.0.0.0 \
            --port {{ mlflow_port }}
          Restart=always

          [Install]
          WantedBy=multi-user.target
        dest: /etc/systemd/system/mlflow.service
      no_log: true
      notify: restart mlflow
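The unit file runs as `User=mlflow` and the task notifies `restart mlflow`, but neither the account nor the handler appears in the article. A minimal sketch of both follows; the account settings (system user, no login shell, no home directory) are assumptions.

```yaml
# Keys to merge into the MLflow play above
tasks:
  - name: Create the mlflow service account
    ansible.builtin.user:
      name: mlflow
      system: true
      shell: /usr/sbin/nologin
      create_home: false

handlers:
  - name: restart mlflow
    ansible.builtin.systemd:
      name: mlflow
      state: restarted
      enabled: true
      daemon_reload: true  # pick up the new or changed unit file
```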
Inventory for AI Infrastructure
# inventory/ai-infrastructure.yml
# Group names match the hosts: lines used by the playbooks in this guide
all:
  children:
    gpu_servers:
      children:
        inference_servers:
          hosts:
            inf01: { ansible_host: 10.0.1.10, gpus: 8, gpu_type: H100 }
            inf02: { ansible_host: 10.0.1.11, gpus: 4, gpu_type: A100 }
        training_nodes:
          hosts:
            train01: { ansible_host: 10.0.2.10, gpus: 8, gpu_type: H100 }
            train02: { ansible_host: 10.0.2.11, gpus: 8, gpu_type: H100 }
    mlops:
      hosts:
        mlops01: { ansible_host: 10.0.3.10 }
    vector_db:
      hosts:
        qdrant01: { ansible_host: 10.0.4.10 }
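Host variables such as `gpus` are only useful if they match reality. As a sketch, the play below compares the `gpus` value from the inventory with what the driver reports on each host. It assumes the NVIDIA driver is already installed and that every host in `gpu_servers` defines `gpus`.

```yaml
- name: Check detected GPUs against the inventory
  hosts: gpu_servers
  tasks:
    - name: List GPUs seen by the driver
      ansible.builtin.command: nvidia-smi --query-gpu=name --format=csv,noheader
      register: gpu_list
      changed_when: false

    - name: Compare with the gpus host variable
      ansible.builtin.assert:
        that:
          - (gpu_list.stdout_lines | length) == (gpus | int)
        fail_msg: "Expected {{ gpus }} GPUs, found {{ gpu_list.stdout_lines | length }}"
```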
Cost Optimization Strategies
• Right-size GPU allocation: Use Ansible to deploy appropriate model quantizations (4-bit, 8-bit) on smaller GPUs
• Schedule workloads: Cron-based Ansible jobs to scale down inference servers during off-peak hours
• Spot instance management: Dynamic inventory for cloud spot instances with automatic failover
• Model caching: Pre-download models to local NVMe storage to avoid repeated HuggingFace downloads
• Batch inference: Configure vLLM continuous batching parameters for higher throughput per GPU dollar
FAQ
How does Ansible help with AI infrastructure management?
Ansible automates the entire AI stack: GPU driver installation, CUDA toolkit setup, container runtime configuration, model deployment, inference server management, training cluster orchestration, and monitoring. It ensures consistent, reproducible environments across development, staging, and production.
Can Ansible manage GPU clusters?
Yes. Ansible installs NVIDIA drivers, configures CUDA, deploys the NVIDIA Container Toolkit, provisions inference servers (vLLM, Triton), manages distributed training configurations, and monitors GPU utilization with DCGM exporter.
What is the best way to deploy LLMs with Ansible?
Use containerized inference servers like vLLM or NVIDIA Triton, deployed via community.docker.docker_container. Pre-download models to shared storage, configure GPU memory allocation, and use health checks to verify deployment.
How do I optimize AI inference costs with Ansible?
Deploy model quantization (GPTQ, AWQ) for lower GPU memory usage, configure continuous batching in vLLM, use dynamic inventory for spot instances, schedule auto-scaling based on traffic, and monitor GPU utilization with alerts for underused resources.
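As one illustration of scheduled scale-down, the sketch below pairs a small play that stops or starts the vLLM containers with a cron entry on the control node. The playbook file name `scale-inference.yml`, the 22:00 schedule, and the `vllm_state` variable are hypothetical; the container names follow the `vllm-<port>` pattern used earlier, and the containers are assumed to exist already.

```yaml
# scale-inference.yml (hypothetical file name)
- name: Stop or start vLLM containers
  hosts: inference_servers
  become: true
  vars:
    vllm_state: stopped  # pass -e vllm_state=started to bring them back
  tasks:
    - name: Set vLLM container state
      community.docker.docker_container:
        name: "vllm-{{ item }}"
        state: "{{ vllm_state }}"
      loop: [8000, 8001]

- name: Schedule the off-peak scale-down on the control node
  hosts: localhost
  tasks:
    - name: Install cron entry
      ansible.builtin.cron:
        name: "vllm off-peak scale-down"
        minute: "0"
        hour: "22"  # assumed off-peak start
        job: "ansible-playbook -i inventory/ai-infrastructure.yml scale-inference.yml -e vllm_state=stopped"
```

A matching morning entry with `-e vllm_state=started` would restore service; rerunning the full deployment play is the safer option if container settings may have changed in the meantime.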
Conclusion
AI infrastructure in 2026 demands the same automation discipline as traditional infrastructure. Ansible provides the tooling to deploy GPU clusters, manage model serving, orchestrate training pipelines, and optimize costs — turning AI infrastructure from artisanal GPU management into production-grade, version-controlled automation.
Related Articles
• Ansible for Agentic AI: Multi-Agent Systems
• Ansible Kubernetes k8s Module
• Ansible for AWS: Complete Guide