About Luca Berton

Luca Berton is an Ansible automation expert, author of 8 Ansible books published by Apress and Leanpub including "Ansible for VMware by Examples" and "Ansible for Kubernetes by Example", and creator of the Ansible Pilot YouTube channel. He shares practical automation knowledge through tutorials, books, and video courses to help IT professionals and DevOps engineers master infrastructure automation.

Ansible for AI Infrastructure: Deploy LLMs, GPUs & ML Pipelines (2026 Guide)

By Luca Berton · Published 2024-01-01 · Category: installation

Complete guide to automating AI infrastructure with Ansible. Deploy GPU clusters, configure NVIDIA drivers, serve LLMs with vLLM and TGI, manage model training.

AI infrastructure is the biggest IT spending category of 2026. Deloitte calls it an "AI infrastructure reckoning" — organizations must balance GPU costs, model choice, inference optimization, and deployment architecture. Ansible automates the entire AI compute stack from bare metal GPU provisioning to model serving.

AI Infrastructure Stack

┌─────────────────────────────────┐
│     Applications & Agents       │ ← Agent frameworks, APIs
├─────────────────────────────────┤
│     Model Serving (Inference)   │ ← vLLM, TGI, Triton
├─────────────────────────────────┤
│     Training & Fine-tuning      │ ← PyTorch, DeepSpeed
├─────────────────────────────────┤
│     ML Platform                 │ ← MLflow, Kubeflow, Ray
├─────────────────────────────────┤
│     Container Runtime           │ ← Docker + NVIDIA Toolkit
├─────────────────────────────────┤
│     GPU Drivers & CUDA          │ ← NVIDIA drivers, CUDA, cuDNN
├─────────────────────────────────┤
│     Bare Metal / Cloud VMs      │ ← Ansible manages this entire stack
└─────────────────────────────────┘

Provision GPU Servers

Install NVIDIA Drivers

- name: Provision GPU servers for AI workloads
  hosts: gpu_servers
  become: true
  vars:
    nvidia_driver_version: "550"
    cuda_version: "12.6"

  tasks:
    # PPAs are Ubuntu-specific, so gate on the distribution rather than the OS family
    - name: Add NVIDIA driver repository
      ansible.builtin.apt_repository:
        repo: "ppa:graphics-drivers/ppa"
        state: present
      when: ansible_distribution == "Ubuntu"

    - name: Install NVIDIA drivers
      ansible.builtin.apt:
        name:
          - "nvidia-driver-{{ nvidia_driver_version }}"
          - "nvidia-utils-{{ nvidia_driver_version }}"
        state: present
        update_cache: true
      notify: reboot for nvidia

    # Installs the distribution's CUDA package; cuda_version above is not consumed here
    - name: Install CUDA toolkit
      ansible.builtin.apt:
        name: "nvidia-cuda-toolkit"
        state: present

    # Run the pending reboot handler now so nvidia-smi sees the loaded driver
    - name: Flush handlers before verification
      ansible.builtin.meta: flush_handlers

    - name: Verify GPU detection
      ansible.builtin.command: nvidia-smi
      register: nvidia_smi
      changed_when: false

    - name: Display GPU info
      ansible.builtin.debug:
        msg: "{{ nvidia_smi.stdout_lines[:5] }}"

Configure NVIDIA Container Toolkit

    - name: Add NVIDIA Container Toolkit repo
      ansible.builtin.shell: |
        curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
          gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
        curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
          sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
          tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
      args:
        creates: /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

    - name: Install NVIDIA Container Toolkit
      ansible.builtin.apt:
        name: nvidia-container-toolkit
        state: present
        update_cache: true

    - name: Configure Docker for NVIDIA runtime
      ansible.builtin.command: nvidia-ctk runtime configure --runtime=docker
      notify: restart docker
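
The tasks above notify `reboot for nvidia` and `restart docker`, but the excerpt does not show the handlers themselves. A minimal sketch of a matching `handlers:` section might look like the following. It assumes Docker runs as a service named `docker`, and the 600-second reboot timeout is an illustrative value rather than one taken from the article. The section sits at the same indentation level as `tasks:` in the play.

```yaml
  handlers:
    # Reboot so the newly installed NVIDIA kernel module is loaded
    - name: reboot for nvidia
      ansible.builtin.reboot:
        reboot_timeout: 600  # assumed value; tune for your hardware

    # Restart Docker so it picks up the NVIDIA runtime configuration
    - name: restart docker
      ansible.builtin.service:
        name: docker  # assumed service name
        state: restarted
```

Handler names must match the `notify:` strings exactly, otherwise the play fails when the task reports a change.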

Deploy Model Inference Servers

vLLM — High-Throughput LLM Serving

- name: Deploy vLLM inference server
  hosts: inference_servers
  become: true
  vars:
    models:
      - name: "meta-llama/Llama-3.1-70B-Instruct"
        port: 8000
        gpu_memory: 0.9
        max_model_len: 8192
      - name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
        port: 8001
        gpu_memory: 0.85
        max_model_len: 32768

  tasks:
    - name: Deploy vLLM instances
      community.docker.docker_container:
        name: "vllm-{{ item.port }}"
        image: vllm/vllm-openai:latest
        state: started
        restart_policy: unless-stopped
        ports:
          - "{{ item.port }}:8000"
        volumes:
          - /models:/root/.cache/huggingface
        env:
          HUGGING_FACE_HUB_TOKEN: "{{ vault_hf_token }}"
        command: >
          --model {{ item.name }}
          --gpu-memory-utilization {{ item.gpu_memory }}
          --max-model-len {{ item.max_model_len }}
          --enable-prefix-caching
        device_requests:
          - driver: nvidia
            count: -1
            capabilities: [["gpu"]]
      loop: "{{ models }}"
      no_log: true

    - name: Wait for inference servers
      ansible.builtin.uri:
        url: "http://localhost:{{ item.port }}/health"
        method: GET
      loop: "{{ models }}"
      register: health
      until: health is succeeded
      retries: 30
      delay: 10
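
A health check only shows that the server process is up. As a possible follow-up, the sketch below queries the OpenAI-compatible `/v1/models` endpoint that vLLM exposes and asserts that each instance reports the model it was configured with. The task names and the `served` variable are illustrative; the tasks assume they run in the same play, so the `models` variable is in scope.

```yaml
    - name: List models served by each vLLM instance
      ansible.builtin.uri:
        url: "http://localhost:{{ item.port }}/v1/models"
        method: GET
      loop: "{{ models }}"
      register: served  # illustrative variable name

    - name: Confirm each instance reports its configured model
      ansible.builtin.assert:
        that:
          # The response lists served models under data[].id
          - item.item.name in (item.json.data | map(attribute='id') | list)
        fail_msg: "{{ item.item.name }} is not served on port {{ item.item.port }}"
      loop: "{{ served.results }}"
      loop_control:
        label: "{{ item.item.name }}"
```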

NVIDIA Triton Inference Server

- name: Deploy Triton for multi-model serving
  hosts: inference_servers
  become: true
  tasks:
    - name: Create model repository
      ansible.builtin.file:
        path: /models/triton-repo/{{ item }}/1
        state: directory
      loop:
        - llama-3
        - embedding-model
        - reranker

    - name: Deploy Triton server
      community.docker.docker_container:
        name: triton
        image: nvcr.io/nvidia/tritonserver:24.10-py3
        state: started
        ports:
          - "8000:8000"  # HTTP
          - "8001:8001"  # gRPC
          - "8002:8002"  # Metrics
        volumes:
          - /models/triton-repo:/models
        command: tritonserver --model-repository=/models
        device_requests:
          - driver: nvidia
            count: -1
            capabilities: [["gpu"]]
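
As with vLLM, it can help to block until the server is ready before moving on. A minimal sketch, assuming the HTTP port mapping shown above and Triton's standard `/v2/health/ready` endpoint; the retry count and delay are assumed values mirroring the vLLM check, and `triton_ready` is an illustrative variable name.

```yaml
    - name: Wait for Triton to report ready
      ansible.builtin.uri:
        url: "http://localhost:8000/v2/health/ready"
        method: GET
        status_code: 200
      register: triton_ready  # illustrative variable name
      until: triton_ready.status == 200
      retries: 30  # assumed value
      delay: 10    # assumed value, in seconds
```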

Training Infrastructure

PyTorch Distributed Training

- name: Configure distributed training cluster
  hosts: training_nodes
  become: true
  vars:
    nccl_socket_ifname: "eth0"
    master_addr: "{{ hostvars[groups['training_nodes'][0]]['ansible_host'] }}"
    master_port: 29500

  tasks:
    - name: Install training dependencies
      ansible.builtin.pip:
        name:
          - torch
          - torchvision
          - deepspeed
          - transformers
          - accelerate
          - wandb
        virtualenv: /opt/training/venv

    - name: Configure NCCL for multi-node training
      ansible.builtin.copy:
        content: |
          NCCL_SOCKET_IFNAME={{ nccl_socket_ifname }}
          NCCL_DEBUG=INFO
          MASTER_ADDR={{ master_addr }}
          MASTER_PORT={{ master_port }}
          WORLD_SIZE={{ groups['training_nodes'] | length }}
          RANK={{ groups['training_nodes'].index(inventory_hostname) }}
        dest: /opt/training/.env

    - name: Configure shared storage for checkpoints
      ansible.posix.mount:
        path: /opt/training/checkpoints
        src: "nfs-server:/exports/checkpoints"
        fstype: nfs4
        opts: rw,hard,intr
        state: mounted
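
One way to consume these settings is to start a `torchrun` agent on every node. The sketch below is an assumption about how a job could be launched, not the article's method: `/opt/training/train.py` is a hypothetical script, `gpus` is the per-host GPU count from the inventory, and the 24-hour `async` limit is an illustrative value. Note that `torchrun` takes node-level values (`--nnodes`, `--node_rank`) and sets the per-process `RANK` and `WORLD_SIZE` itself, with the world size equal to nodes times processes per node.

```yaml
    - name: Launch one torchrun agent per node (sketch)
      ansible.builtin.command:
        argv:
          - /opt/training/venv/bin/torchrun
          - "--nnodes={{ groups['training_nodes'] | length }}"
          - "--node_rank={{ groups['training_nodes'].index(inventory_hostname) }}"
          - "--nproc_per_node={{ gpus }}"  # one worker process per GPU
          - "--master_addr={{ master_addr }}"
          - "--master_port={{ master_port }}"
          - /opt/training/train.py  # hypothetical training script
      environment:
        NCCL_SOCKET_IFNAME: "{{ nccl_socket_ifname }}"
      async: 86400  # assumed upper bound of 24 hours
      poll: 0       # start the job and return without waiting
```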

GPU Monitoring and Cost Optimization

- name: Deploy GPU monitoring
  hosts: gpu_servers
  become: true
  tasks:
    - name: Deploy DCGM exporter for GPU metrics
      community.docker.docker_container:
        name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.3.0-ubuntu22.04
        state: started
        restart_policy: unless-stopped
        ports:
          - "9400:9400"
        device_requests:
          - driver: nvidia
            count: -1
            capabilities: [["gpu"]]

    - name: Create GPU utilization alert rules
      ansible.builtin.copy:
        # raw/endraw stops Ansible from templating Prometheus's own {{ ... }} expressions
        content: |
          {% raw %}
          groups:
            - name: gpu_alerts
              rules:
                - alert: GPULowUtilization
                  expr: DCGM_FI_DEV_GPU_UTIL < 20
                  for: 30m
                  labels:
                    severity: warning
                  annotations:
                    summary: "GPU {{ $labels.gpu }} underutilized on {{ $labels.instance }}"
                    description: "GPU utilization below 20% for 30 minutes; consider consolidating workloads"

                # Used fraction of framebuffer memory: used / (used + free)
                - alert: GPUMemoryNearFull
                  expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
                  for: 5m
                  labels:
                    severity: critical
                  annotations:
                    summary: "GPU memory >95% on {{ $labels.instance }}"

                - alert: GPUTemperatureHigh
                  expr: DCGM_FI_DEV_GPU_TEMP > 85
                  for: 10m
                  labels:
                    severity: warning
                  annotations:
                    summary: "GPU temperature {{ $value }}°C on {{ $labels.instance }}"
          {% endraw %}
        dest: /etc/prometheus/rules/gpu-alerts.yml
      notify: reload prometheus
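
The rule task notifies `reload prometheus`, which the excerpt does not define. A minimal sketch of that handler, assuming Prometheus runs on the same host as a service named `prometheus` whose unit supports reload:

```yaml
  handlers:
    # Ask Prometheus to re-read its configuration and rule files
    - name: reload prometheus
      ansible.builtin.service:
        name: prometheus  # assumed service name
        state: reloaded
```

Running `promtool check rules /etc/prometheus/rules/gpu-alerts.yml` before the reload is a cheap way to catch a malformed rule file.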

MLOps Platform Deployment

- name: Deploy MLflow tracking server
  hosts: mlops
  become: true
  vars:
    mlflow_port: 5000
    artifact_store: s3://ml-artifacts

  tasks:
    - name: Install MLflow
      ansible.builtin.pip:
        name:
          - mlflow
          - boto3
          - psycopg2-binary
        virtualenv: /opt/mlflow/venv

    - name: Deploy MLflow service
      ansible.builtin.copy:
        content: |
          [Unit]
          Description=MLflow Tracking Server
          After=network.target postgresql.service

          [Service]
          Type=simple
          User=mlflow
          ExecStart=/opt/mlflow/venv/bin/mlflow server \
            --backend-store-uri postgresql://mlflow:{{ vault_mlflow_db_pass }}@localhost/mlflow \
            --default-artifact-root {{ artifact_store }} \
            --host 0.0.0.0 \
            --port {{ mlflow_port }}
          Restart=always

          [Install]
          WantedBy=multi-user.target
        dest: /etc/systemd/system/mlflow.service
      no_log: true
      notify: restart mlflow
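
The unit file task notifies `restart mlflow`. A minimal sketch of that handler follows; enabling the service at boot is an assumption, not something the article states. Because the unit file was just written, systemd has to reload its unit definitions before the restart.

```yaml
  handlers:
    - name: restart mlflow
      ansible.builtin.systemd:
        name: mlflow
        state: restarted
        enabled: true        # assumed: start MLflow at boot
        daemon_reload: true  # pick up the new or changed unit file
```

The unit runs as `User=mlflow`, so that account and a reachable PostgreSQL database need to exist before the service can start.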

Inventory for AI Infrastructure

# inventory/ai-infrastructure.yml
all:
  children:
    gpu_servers:
      children:
        # Group names match the hosts: patterns used by the playbooks above
        inference_servers:
          hosts:
            inf01: { ansible_host: 10.0.1.10, gpus: 8, gpu_type: H100 }
            inf02: { ansible_host: 10.0.1.11, gpus: 4, gpu_type: A100 }
        training_nodes:
          hosts:
            train01: { ansible_host: 10.0.2.10, gpus: 8, gpu_type: H100 }
            train02: { ansible_host: 10.0.2.11, gpus: 8, gpu_type: H100 }
    mlops:
      hosts:
        mlops01: { ansible_host: 10.0.3.10 }
    vector_db:
      hosts:
        qdrant01: { ansible_host: 10.0.4.10 }
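
Host variables such as `gpus` and `gpu_type` are available to any play that targets these hosts. As a small illustration, the play below prints the GPU capacity recorded for each server; the play itself is not part of the original article.

```yaml
- name: Report GPU capacity per host
  hosts: gpu_servers
  gather_facts: false
  tasks:
    - name: Show GPU count and type from inventory variables
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }}: {{ gpus }} x {{ gpu_type }}"
```

Saved as, say, `report.yml` (a hypothetical file name), it could be run with `ansible-playbook -i inventory/ai-infrastructure.yml report.yml`.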

Cost Optimization Strategies

- Right-size GPU allocation: Use Ansible to deploy appropriate model quantizations (4-bit, 8-bit) on smaller GPUs
- Schedule workloads: Cron-based Ansible jobs to scale down inference servers during off-peak hours
- Spot instance management: Dynamic inventory for cloud spot instances with automatic failover
- Model caching: Pre-download models to local NVMe storage to avoid repeated HuggingFace downloads
- Batch inference: Configure vLLM continuous batching parameters for higher throughput per GPU dollar
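
The scheduling strategy can be sketched with the `ansible.builtin.cron` module. This is one possible interpretation, not the article's implementation: it installs host-level cron entries that stop and start the `vllm-8001` container from the vLLM example, and the 22:00 and 06:00 times are assumed off-peak boundaries.

```yaml
- name: Schedule off-peak shutdown of an inference container
  hosts: inference_servers
  become: true
  tasks:
    - name: Stop the secondary vLLM instance at night
      ansible.builtin.cron:
        name: "stop vllm-8001 off-peak"
        minute: "0"
        hour: "22"  # assumed start of the off-peak window
        job: "docker stop vllm-8001"
        user: root

    - name: Start it again before business hours
      ansible.builtin.cron:
        name: "start vllm-8001 peak"
        minute: "0"
        hour: "6"  # assumed end of the off-peak window
        job: "docker start vllm-8001"
        user: root
```

Because the container uses `restart_policy: unless-stopped`, Docker leaves it stopped after an explicit `docker stop` until the morning job starts it again.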

FAQ

How does Ansible help with AI infrastructure management?

Ansible automates the entire AI stack: GPU driver installation, CUDA toolkit setup, container runtime configuration, model deployment, inference server management, training cluster orchestration, and monitoring. It ensures consistent, reproducible environments across development, staging, and production.

Can Ansible manage GPU clusters?

Yes. Ansible installs NVIDIA drivers, configures CUDA, deploys the NVIDIA Container Toolkit, provisions inference servers (vLLM, Triton), manages distributed training configurations, and monitors GPU utilization with DCGM exporter.

What is the best way to deploy LLMs with Ansible?

Use containerized inference servers like vLLM or NVIDIA Triton, deployed via community.docker.docker_container. Pre-download models to shared storage, configure GPU memory allocation, and use health checks to verify deployment.
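
Pre-downloading can be sketched as a separate play that fills the cache directory the vLLM containers mount. The example assumes the `huggingface-cli` tool from the `huggingface_hub` package is installed at the hypothetical path `/opt/hf/venv/bin`, and that the installed version still provides the `download` subcommand under that name. The model list is trimmed to a single entry for brevity.

```yaml
- name: Pre-download model weights to the shared cache
  hosts: inference_servers
  become: true
  vars:
    models:
      - name: "meta-llama/Llama-3.1-70B-Instruct"
  tasks:
    - name: Download each model into /models
      ansible.builtin.command:
        argv:
          - /opt/hf/venv/bin/huggingface-cli  # hypothetical install path
          - download
          - "{{ item.name }}"
      environment:
        HF_HOME: /models  # cache lands in /models/hub, matching the container mount
        HF_TOKEN: "{{ vault_hf_token }}"
      loop: "{{ models }}"
      no_log: true
```

The task reports `changed` on every run, although the download itself skips files that are already cached.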

How do I optimize AI inference costs with Ansible?

Deploy model quantization (GPTQ, AWQ) for lower GPU memory usage, configure continuous batching in vLLM, use dynamic inventory for spot instances, schedule auto-scaling based on traffic, and monitor GPU utilization with alerts for underused resources.

Conclusion

AI infrastructure in 2026 demands the same automation discipline as traditional infrastructure. Ansible provides the tooling to deploy GPU clusters, manage model serving, orchestrate training pipelines, and optimize costs — turning AI infrastructure from artisanal GPU management into production-grade, version-controlled automation.
