Ansible for AI Infrastructure: Deploy LLMs, GPUs & ML Pipelines (2026 Guide)

By Luca Berton · Published 2024-01-01 · Category: installation

Complete guide to automating AI infrastructure with Ansible. Deploy GPU clusters, configure NVIDIA drivers, serve LLMs with vLLM and TGI, manage model training.

AI infrastructure is one of the biggest IT spending categories of 2026. Deloitte calls it an "AI infrastructure reckoning": organizations must balance GPU costs, model choice, inference optimization, and deployment architecture. Ansible can automate the AI compute stack from bare-metal GPU provisioning through model serving.

AI Infrastructure Stack

┌─────────────────────────────────┐
│     Applications & Agents       │ ← Agent frameworks, APIs
├─────────────────────────────────┤
│     Model Serving (Inference)   │ ← vLLM, TGI, Triton
├─────────────────────────────────┤
│     Training & Fine-tuning      │ ← PyTorch, DeepSpeed
├─────────────────────────────────┤
│     ML Platform                 │ ← MLflow, Kubeflow, Ray
├─────────────────────────────────┤
│     Container Runtime           │ ← Docker + NVIDIA Toolkit
├─────────────────────────────────┤
│     GPU Drivers & CUDA          │ ← NVIDIA drivers, CUDA, cuDNN
├─────────────────────────────────┤
│     Bare Metal / Cloud VMs      │ ← Ansible manages this entire stack
└─────────────────────────────────┘

Provision GPU Servers

Install NVIDIA Drivers

- name: Provision GPU servers for AI workloads
  hosts: gpu_servers
  become: true
  vars:
    nvidia_driver_version: "550"
    cuda_version: "12.6"

  tasks:
    - name: Add NVIDIA driver repository
      ansible.builtin.apt_repository:
        repo: "ppa:graphics-drivers/ppa"
        state: present
      when: ansible_os_family == "Debian"

    - name: Install NVIDIA drivers
      ansible.builtin.apt:
        name:
          - "nvidia-driver-{{ nvidia_driver_version }}"
          - "nvidia-utils-{{ nvidia_driver_version }}"
        state: present
        update_cache: true
      notify: reboot for nvidia

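    - name: Install CUDA toolkit
      # The distro package ships whatever CUDA release the distribution
      # provides; it is not pinned by cuda_version.
      ansible.builtin.apt:
        name: "nvidia-cuda-toolkit"
        state: present

    - name: Run the reboot handler now if the driver was just installed
      ansible.builtin.meta: flush_handlers

    - name: Verify GPU detection
      ansible.builtin.command: nvidia-smi
      register: nvidia_smi
      changed_when: false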

    - name: Display GPU info
      ansible.builtin.debug:
        msg: "{{ nvidia_smi.stdout_lines[:5] }}"
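
Beyond printing the nvidia-smi header, the play can check that every expected GPU is visible. The tasks below are a minimal sketch, assuming each host defines a gpus variable with its expected GPU count, as the inventory later in this guide does. The task names and the gpu_list variable are illustrative.

```yaml
    - name: List detected GPUs, one per line
      ansible.builtin.command: nvidia-smi --query-gpu=name --format=csv,noheader
      register: gpu_list
      changed_when: false

    - name: Fail if the detected GPU count differs from the inventory
      ansible.builtin.assert:
        that:
          - gpu_list.stdout_lines | length == gpus | int
        fail_msg: "Expected {{ gpus }} GPUs, found {{ gpu_list.stdout_lines | length }}"
      when: gpus is defined
```

A mismatch usually points to a GPU that fell off the bus or a driver that did not load, which is cheaper to catch here than during model deployment.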

Configure NVIDIA Container Toolkit

    - name: Add NVIDIA Container Toolkit repo
      ansible.builtin.shell: |
        curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
          gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
        curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
          sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
          tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
      args:
        creates: /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

    - name: Install NVIDIA Container Toolkit
      ansible.builtin.apt:
        name: nvidia-container-toolkit
        state: present
        update_cache: true

    - name: Configure Docker for NVIDIA runtime
      ansible.builtin.command: nvidia-ctk runtime configure --runtime=docker
      notify: restart docker
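
The tasks above notify two handlers, reboot for nvidia and restart docker, that the playbook does not define. A minimal sketch of a handlers section for the same play, assuming Docker runs as a service named docker and that a 15-minute reboot timeout is enough for the hardware:

```yaml
  handlers:
    - name: reboot for nvidia
      ansible.builtin.reboot:
        reboot_timeout: 900   # assumption: large GPU servers can take several minutes to POST

    - name: restart docker
      ansible.builtin.service:
        name: docker
        state: restarted
```

Handler names must match the notify strings exactly, including case.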

Deploy Model Inference Servers

vLLM — High-Throughput LLM Serving

- name: Deploy vLLM inference server
  hosts: inference_servers
  become: true
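  vars:
    # Each instance gets its own GPUs. If both shared every GPU, 0.9 + 0.85
    # memory utilization would exceed what the devices have.
    # The indices assume an 8-GPU host; adjust them per host.
    models:
      - name: "meta-llama/Llama-3.1-70B-Instruct"
        port: 8000
        gpu_ids: ["0", "1", "2", "3"]
        gpu_memory: 0.9
        max_model_len: 8192
      - name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
        port: 8001
        gpu_ids: ["4", "5", "6", "7"]
        gpu_memory: 0.85
        max_model_len: 32768

  tasks:
    - name: Deploy vLLM instances
      community.docker.docker_container:
        name: "vllm-{{ item.port }}"
        image: vllm/vllm-openai:latest
        state: started
        restart_policy: unless-stopped
        ipc_mode: host  # shared memory for tensor-parallel workers
        ports:
          - "{{ item.port }}:8000"
        volumes:
          - /models:/root/.cache/huggingface
        env:
          HUGGING_FACE_HUB_TOKEN: "{{ vault_hf_token }}"
        command: >
          --model {{ item.name }}
          --tensor-parallel-size {{ item.gpu_ids | length }}
          --gpu-memory-utilization {{ item.gpu_memory }}
          --max-model-len {{ item.max_model_len }}
          --enable-prefix-caching
        device_requests:
          - driver: nvidia
            device_ids: "{{ item.gpu_ids }}"
            capabilities: [["gpu"]]
      loop: "{{ models }}"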
      no_log: true

    - name: Wait for inference servers
      ansible.builtin.uri:
        url: "http://localhost:{{ item.port }}/health"
        method: GET
      loop: "{{ models }}"
      register: health
      until: health is succeeded
      retries: 30
      delay: 10
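
A passing /health check only shows that the server process is up. To confirm that each model actually generates text, a smoke test against vLLM's OpenAI-compatible completions endpoint can follow. This is a sketch that reuses the models list from the play; the prompt, token limit, and the smoke_test variable name are arbitrary choices.

```yaml
    - name: Request a short completion from each model
      ansible.builtin.uri:
        url: "http://localhost:{{ item.port }}/v1/completions"
        method: POST
        body_format: json
        body:
          model: "{{ item.name }}"
          prompt: "Ansible is"
          max_tokens: 8
        status_code: 200
      loop: "{{ models }}"
      register: smoke_test

    - name: Show the generated text
      ansible.builtin.debug:
        msg: "{{ item.json.choices[0].text }}"
      loop: "{{ smoke_test.results }}"
      loop_control:
        label: "{{ item.item.name }}"
```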

NVIDIA Triton Inference Server

- name: Deploy Triton for multi-model serving
  hosts: inference_servers
  become: true
  tasks:
    - name: Create model repository
      ansible.builtin.file:
        path: /models/triton-repo/{{ item }}/1
        state: directory
      loop:
        - llama-3
        - embedding-model
        - reranker

    - name: Deploy Triton server
      community.docker.docker_container:
        name: triton
        image: nvcr.io/nvidia/tritonserver:24.10-py3
        state: started
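        ports:
          # Host ports 8000 and 8001 are already published by the vLLM play
          # on these hosts, so Triton is exposed on 8100-8102 instead.
          - "8100:8000"   # HTTP
          - "8101:8001"   # gRPC
          - "8102:8002"   # Metrics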
        volumes:
          - /models/triton-repo:/models
        command: tritonserver --model-repository=/models
        device_requests:
          - driver: nvidia
            count: -1
            capabilities: [["gpu"]]

Training Infrastructure

PyTorch Distributed Training

- name: Configure distributed training cluster
  hosts: training_nodes
  become: true
  vars:
    nccl_socket_ifname: "eth0"
    master_addr: "{{ hostvars[groups['training_nodes'][0]]['ansible_host'] }}"
    master_port: 29500

  tasks:
    - name: Install training dependencies
      ansible.builtin.pip:
        name:
          - torch
          - torchvision
          - deepspeed
          - transformers
          - accelerate
          - wandb
        virtualenv: /opt/training/venv
        virtualenv_command: "python3 -m venv"  # needs python3-venv on Debian/Ubuntu

    - name: Configure NCCL for multi-node training
      ansible.builtin.copy:
        content: |
          NCCL_SOCKET_IFNAME={{ nccl_socket_ifname }}
          NCCL_DEBUG=INFO
          MASTER_ADDR={{ master_addr }}
          MASTER_PORT={{ master_port }}
          # Node-level values for torchrun, which derives the per-process
          # RANK and WORLD_SIZE itself
          NNODES={{ groups['training_nodes'] | length }}
          NODE_RANK={{ groups['training_nodes'].index(inventory_hostname) }}
        dest: /opt/training/.env

    - name: Configure shared storage for checkpoints
      ansible.posix.mount:
        path: /opt/training/checkpoints
        src: "nfs-server:/exports/checkpoints"
        fstype: nfs4
        opts: rw,hard
        state: mounted
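
With the cluster configured, a job can be started by running one torchrun agent on every node. The task below is a sketch only: /opt/training/train.py is a placeholder script, the gpus host variable is assumed to hold the GPU count per node (defaulting to 8 here), and the node count and node rank are computed from the training_nodes group.

```yaml
    - name: Launch one torchrun agent per node (train.py is a placeholder)
      ansible.builtin.command:
        argv:
          - /opt/training/venv/bin/torchrun
          - "--nnodes={{ groups['training_nodes'] | length }}"
          - "--node_rank={{ groups['training_nodes'].index(inventory_hostname) }}"
          - "--nproc_per_node={{ gpus | default(8) }}"
          - "--master_addr={{ master_addr }}"
          - "--master_port={{ master_port }}"
          - /opt/training/train.py
      environment:
        NCCL_SOCKET_IFNAME: "{{ nccl_socket_ifname }}"
      async: 86400   # let the job run for up to 24 hours
      poll: 0        # do not block the play while training runs
```

torchrun spawns one worker per GPU and sets each worker's rank and world size itself, so the playbook only needs to supply node-level values.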

GPU Monitoring and Cost Optimization

- name: Deploy GPU monitoring
  hosts: gpu_servers
  become: true
  tasks:
    - name: Deploy DCGM exporter for GPU metrics
      community.docker.docker_container:
        name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.3.0-ubuntu22.04
        state: started
        restart_policy: unless-stopped
        ports:
          - "9400:9400"
        device_requests:
          - driver: nvidia
            count: -1
            capabilities: [["gpu"]]

    - name: Create GPU utilization alert rules
      ansible.builtin.copy:
        # raw/endraw keeps Ansible from rendering Prometheus's own
        # $labels and $value templates as Jinja2.
        content: |
          {% raw %}
          groups:
            - name: gpu_alerts
              rules:
                - alert: GPULowUtilization
                  expr: DCGM_FI_DEV_GPU_UTIL < 20
                  for: 30m
                  labels:
                    severity: warning
                  annotations:
                    summary: "GPU {{ $labels.gpu }} underutilized on {{ $labels.instance }}"
                    description: "GPU utilization below 20% for 30 minutes; consider consolidating workloads"

                - alert: GPUMemoryNearFull
                  expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
                  for: 5m
                  labels:
                    severity: critical
                  annotations:
                    summary: "GPU memory >95% on {{ $labels.instance }}"

                - alert: GPUTemperatureHigh
                  expr: DCGM_FI_DEV_GPU_TEMP > 85
                  for: 10m
                  labels:
                    severity: warning
                  annotations:
                    summary: "GPU temperature {{ $value }}°C on {{ $labels.instance }}"
          {% endraw %}
        dest: /etc/prometheus/rules/gpu-alerts.yml
      notify: reload prometheus
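
The rules task notifies a reload prometheus handler that the play does not define, and a syntax error in the rule file would only surface when Prometheus reloads. The sketch below adds a validation task and the handler. It assumes promtool is installed on the host and that Prometheus runs there as a service named prometheus whose unit supports reload.

```yaml
    - name: Validate the rule file syntax
      ansible.builtin.command: promtool check rules /etc/prometheus/rules/gpu-alerts.yml
      changed_when: false

  handlers:
    - name: reload prometheus
      ansible.builtin.service:
        name: prometheus
        state: reloaded
```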

MLOps Platform Deployment

- name: Deploy MLflow tracking server
  hosts: mlops
  become: true
  vars:
    mlflow_port: 5000
    artifact_store: s3://ml-artifacts

  tasks:
    - name: Install MLflow
      ansible.builtin.pip:
        name:
          - mlflow
          - boto3
          - psycopg2-binary
        virtualenv: /opt/mlflow/venv
        virtualenv_command: "python3 -m venv"  # needs python3-venv on Debian/Ubuntu

    - name: Deploy MLflow service
      ansible.builtin.copy:
        content: |
          [Unit]
          Description=MLflow Tracking Server
          After=network.target postgresql.service

          [Service]
          Type=simple
          User=mlflow
          ExecStart=/opt/mlflow/venv/bin/mlflow server \
            --backend-store-uri postgresql://mlflow:{{ vault_mlflow_db_pass }}@localhost/mlflow \
            --default-artifact-root {{ artifact_store }} \
            --host 0.0.0.0 \
            --port {{ mlflow_port }}
          Restart=always

          [Install]
          WantedBy=multi-user.target
        dest: /etc/systemd/system/mlflow.service
        mode: "0600"  # the unit embeds the database password
      no_log: true
      notify: restart mlflow
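
The unit file runs as User=mlflow and the task notifies restart mlflow, but the play creates neither the user nor the handler. A minimal sketch of both follows; the nologin shell path is an assumption for Debian-family hosts, and the user task should run before the service is deployed.

```yaml
    - name: Create the mlflow system user
      ansible.builtin.user:
        name: mlflow
        system: true
        shell: /usr/sbin/nologin
        create_home: false

  handlers:
    - name: restart mlflow
      ansible.builtin.systemd:
        name: mlflow
        state: restarted
        enabled: true
        daemon_reload: true   # pick up the new or changed unit file
```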

Inventory for AI Infrastructure

# inventory/ai-infrastructure.yml
all:
  children:
    gpu_servers:
      children:
        inference_servers:
          hosts:
            inf01: { ansible_host: 10.0.1.10, gpus: 8, gpu_type: H100 }
            inf02: { ansible_host: 10.0.1.11, gpus: 4, gpu_type: A100 }
        training_nodes:
          hosts:
            train01: { ansible_host: 10.0.2.10, gpus: 8, gpu_type: H100 }
            train02: { ansible_host: 10.0.2.11, gpus: 8, gpu_type: H100 }
    mlops:
      hosts:
        mlops01: { ansible_host: 10.0.3.10 }
    vector_db:
      hosts:
        qdrant01: { ansible_host: 10.0.4.10 }
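
The gpus and gpu_type host variables can also drive targeting at run time. As an illustration (the gpu_ group prefix is an arbitrary choice), group_by builds one group per accelerator type so that a later play can address only the H100 hosts:

```yaml
- name: Group GPU hosts by accelerator type
  hosts: gpu_servers
  gather_facts: false
  tasks:
    - name: Create runtime groups such as gpu_h100 and gpu_a100
      ansible.builtin.group_by:
        key: "gpu_{{ gpu_type | lower }}"

- name: Run only on H100 hosts
  hosts: gpu_h100
  gather_facts: false
  tasks:
    - name: Show the GPU count from the inventory
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} has {{ gpus }} x {{ gpu_type }}"
```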

Cost Optimization Strategies

Right-size GPU allocation — Use Ansible to deploy appropriate model quantizations (4-bit, 8-bit) on smaller GPUs
Schedule workloads — Cron-based Ansible jobs to scale down inference servers during off-peak hours
Spot instance management — Dynamic inventory for cloud spot instances with automatic failover
Model caching — Pre-download models to local NVMe storage to avoid repeated HuggingFace downloads
Batch inference — Configure vLLM continuous batching parameters for higher throughput per GPU dollar
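
As one way to implement the scheduling item above, the play below installs cron entries that stop a secondary vLLM container overnight and start it again in the morning. It is a sketch: the container name vllm-8001 follows the naming used in the vLLM play, and the 22:00 and 06:00 times are placeholders for a real off-peak window.

```yaml
- name: Schedule off-peak scale-down for a vLLM instance
  hosts: inference_servers
  become: true
  tasks:
    - name: Stop the secondary model at 22:00
      ansible.builtin.cron:
        name: "stop vllm-8001 off-peak"
        minute: "0"
        hour: "22"
        job: "docker stop vllm-8001"

    - name: Start it again at 06:00
      ansible.builtin.cron:
        name: "start vllm-8001 peak"
        minute: "0"
        hour: "6"
        job: "docker start vllm-8001"
```

Because the container uses restart_policy unless-stopped, Docker leaves it stopped after docker stop until the morning job starts it.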

FAQ

How does Ansible help with AI infrastructure management?

Ansible automates the entire AI stack: GPU driver installation, CUDA toolkit setup, container runtime configuration, model deployment, inference server management, training cluster orchestration, and monitoring. It ensures consistent, reproducible environments across development, staging, and production.

Can Ansible manage GPU clusters?

Yes. Ansible installs NVIDIA drivers, configures CUDA, deploys the NVIDIA Container Toolkit, provisions inference servers (vLLM, Triton), manages distributed training configurations, and monitors GPU utilization with DCGM exporter.

What is the best way to deploy LLMs with Ansible?

Use containerized inference servers like vLLM or NVIDIA Triton, deployed via community.docker.docker_container. Pre-download models to shared storage, configure GPU memory allocation, and use health checks to verify deployment.

How do I optimize AI inference costs with Ansible?

Deploy model quantization (GPTQ, AWQ) for lower GPU memory usage, configure continuous batching in vLLM, use dynamic inventory for spot instances, schedule auto-scaling based on traffic, and monitor GPU utilization with alerts for underused resources.
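
For the quantization point, vLLM can load an AWQ checkpoint when started with --quantization awq. The play below is a sketch under stated assumptions: example-org/example-model-AWQ is a hypothetical model ID to replace with a real AWQ checkpoint, host port 8010 is arbitrary, and a single GPU is assumed to be enough for the quantized weights.

```yaml
- name: Serve an AWQ-quantized model on a single GPU
  hosts: inference_servers
  become: true
  vars:
    quantized_model: "example-org/example-model-AWQ"  # hypothetical model ID
  tasks:
    - name: Run vLLM with AWQ quantization
      community.docker.docker_container:
        name: vllm-awq
        image: vllm/vllm-openai:latest
        state: started
        restart_policy: unless-stopped
        ports:
          - "8010:8000"
        volumes:
          - /models:/root/.cache/huggingface
        command: >
          --model {{ quantized_model }}
          --quantization awq
          --gpu-memory-utilization 0.9
        device_requests:
          - driver: nvidia
            count: 1
            capabilities: [["gpu"]]
```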

Conclusion

AI infrastructure in 2026 demands the same automation discipline as traditional infrastructure. Ansible provides the tooling to deploy GPU clusters, manage model serving, orchestrate training pipelines, and optimize costs — turning AI infrastructure from artisanal GPU management into production-grade, version-controlled automation.
