Ansible on Talos Linux: Reboot-aware Patching Workflow Complete Guide

By Luca Berton · Published 2024-01-01 · Category: installation

Automate a reboot-aware patching workflow on Talos Linux (Kubernetes-native, GA rolling) with Ansible.

Talos Linux is an immutable, API-managed Kubernetes OS: no SSH, no shell, no package manager, and no apt/dnf/rpm-ostree. You never patch a Talos node in place. Instead you upgrade the whole OS image with talosctl upgrade, which reboots the node into the new version. The "reboot-aware" part is making that rolling and non-disruptive: cordon and drain the node first, upgrade, wait for it to rejoin, then uncordon — one node at a time.

This guide automates that workflow with Ansible, driving talosctl and the kubernetes.core collection from the control node.

> This is not a Fedora CoreOS / rpm-ostree workflow. Talos has no rpm-ostree, no Zincati, and no SSH — everything below goes through the Talos API and the Kubernetes API.

How Talos upgrades work

OS upgrade: talosctl upgrade --image <installer-image> replaces the running Talos image and reboots the node. Talos keeps the previous version, so talosctl rollback reverts to it.
Kubernetes upgrade: separate from the OS, done with talosctl upgrade-k8s --to <version>.
Reboot-aware: Ansible cordons and drains the node before the upgrade and uncordons it once it is Ready, so workloads move off first.

Prerequisites

No agent runs on the Talos nodes — everything happens from the control node:

talosctl and the cluster talosconfig, plus kubectl and the kubeconfig (both produced during bootstrap).
ansible-core 2.15+ with kubernetes.core 3.x+ and the kubernetes Python library.
PodDisruptionBudgets on critical workloads so draining respects availability. For control plane nodes, route talosctl through a different, healthy control plane endpoint so the API stays up while one node reboots.
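
As an illustration of the PodDisruptionBudget prerequisite, the task below is a minimal sketch that creates one with kubernetes.core.k8s. The PDB name, the namespace, and the app: web label are hypothetical placeholders; substitute the selector of your own workload. It assumes the kubeconfig variable defined in this guide's inventory.

```yaml
- name: Ensure a PodDisruptionBudget protects a critical workload
  kubernetes.core.k8s:
    kubeconfig: "{{ kubeconfig }}"
    state: present
    definition:
      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        name: web-pdb        # hypothetical name
        namespace: default   # hypothetical namespace
      spec:
        minAvailable: 1      # the drain waits rather than evict below this count
        selector:
          matchLabels:
            app: web         # hypothetical label; match your Deployment's pods
```

With minAvailable: 1 the workload needs at least two replicas, otherwise the drain cannot evict the only pod and will run into wait_timeout.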

Inventory

# inventory/talos.ini
[talos]
localhost ansible_connection=local

[talos:vars]
kubeconfig=/path/to/talos/kubeconfig
talosconfig=/path/to/talos/talosconfig
talos_version=v1.8.3
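
Before touching any node, it can help to confirm that the inventory paths and the CLI tools are usable. The play below is a sketch of such a preflight check, not part of the original workflow; it assumes the kubeconfig and talosconfig variables from the inventory above and that talosctl and kubectl are expected on the control node's PATH.

```yaml
---
- name: Preflight checks on the control node
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Look up the kubeconfig and talosconfig files
      ansible.builtin.stat:
        path: "{{ item }}"
      loop:
        - "{{ kubeconfig }}"
        - "{{ talosconfig }}"
      register: cfg

    - name: Fail early if either file is missing
      ansible.builtin.assert:
        that:
          # rejectattr without a test drops entries whose stat.exists is true
          - cfg.results | rejectattr('stat.exists') | list | length == 0
        fail_msg: "kubeconfig or talosconfig not found; check inventory/talos.ini"

    - name: Confirm talosctl and kubectl can be executed
      ansible.builtin.command:
        cmd: "{{ item }}"
      loop:
        - talosctl version --client
        - kubectl version --client
      changed_when: false   # read-only commands
```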

Rolling upgrade playbook

The play upgrades one node at a time. For each node it drains, runs talosctl upgrade (which reboots into the new image), waits for the node to become Ready again, then uncordons it.

---
- name: Reboot-aware Talos upgrade (rolling, one node at a time)
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kubeconfig: "{{ hostvars['localhost'].kubeconfig }}"
    talosconfig: "{{ hostvars['localhost'].talosconfig }}"
    installer_image: "ghcr.io/siderolabs/installer:{{ hostvars['localhost'].talos_version }}"
    talos_nodes:
      - { name: cp1, ip: 192.168.0.2 }
      - { name: w1,  ip: 192.168.0.10 }
      - { name: w2,  ip: 192.168.0.11 }
  tasks:
    - name: Upgrade each node in turn
      ansible.builtin.include_tasks: upgrade-node.yml
      loop: "{{ talos_nodes }}"
      loop_control:
        loop_var: node
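
If re-runs should leave already-upgraded nodes alone, one option is to filter the node list before the loop. The tasks below are a sketch that would replace the single include task in the play's tasks list. They assume that each node's status.nodeInfo.osImage contains the Talos version string (the value shown in the OS-IMAGE column of kubectl get nodes -o wide) and that talos_version is available from the inventory; nodes_current is a hypothetical fact name.

```yaml
- name: Read every node's reported OS image from the Kubernetes API
  kubernetes.core.k8s_info:
    kubeconfig: "{{ kubeconfig }}"
    kind: Node
  register: all_nodes

- name: Collect the names of nodes already on the target version
  ansible.builtin.set_fact:
    # Assumption: osImage embeds the version, for example "Talos (v1.8.3)"
    nodes_current: >-
      {{ all_nodes.resources
         | selectattr('status.nodeInfo.osImage', 'search', talos_version | regex_escape)
         | map(attribute='metadata.name') | list }}

- name: Upgrade only the nodes that are not yet current
  ansible.builtin.include_tasks: upgrade-node.yml
  loop: "{{ talos_nodes | rejectattr('name', 'in', nodes_current) | list }}"
  loop_control:
    loop_var: node
```

Filtering up front means a node that is already current is never cordoned or drained.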

upgrade-node.yml:

---
- name: Cordon and drain {{ node.name }}
  kubernetes.core.k8s_drain:
    kubeconfig: "{{ kubeconfig }}"
    name: "{{ node.name }}"
    state: drain
    delete_options:
      ignore_daemonsets: true
      delete_emptydir_data: true
      wait_timeout: 300

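# Set node.endpoint to a different, healthy control plane IP where one exists;
# without it, talosctl falls back to the node's own IP.
- name: Upgrade Talos on {{ node.name }} (reboots into the new image)
  ansible.builtin.command:
    cmd: >-
      talosctl upgrade
      --talosconfig {{ talosconfig }}
      --nodes {{ node.ip }} --endpoints {{ node.endpoint | default(node.ip) }}
      --image {{ installer_image }} --preserve
  register: upgrade
  changed_when: upgrade.rc == 0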

- name: Wait for {{ node.name }} to rejoin and become Ready
  kubernetes.core.k8s_info:
    kubeconfig: "{{ kubeconfig }}"
    kind: Node
    name: "{{ node.name }}"
  register: n
  retries: 40
  delay: 15
  until:
    # default([]) keeps the retry loop going if the lookup returns no resources key
    - n.resources | default([]) | length > 0
    - n.resources[0].status.conditions
      | selectattr('type', 'equalto', 'Ready')
      | selectattr('status', 'equalto', 'True') | list | length > 0

- name: Uncordon {{ node.name }}
  kubernetes.core.k8s_drain:
    kubeconfig: "{{ kubeconfig }}"
    name: "{{ node.name }}"
    state: uncordon

> --preserve keeps the node's ephemeral data across the upgrade. This matters on control plane nodes so that etcd data survives, above all when there is only one. With a single control plane node you must also accept a short API outage while it reboots; in HA, point --endpoints at another control plane node.
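
A cluster-level health gate between nodes is a possible addition to the end of upgrade-node.yml. The sketch below uses talosctl health with its --wait-timeout flag, as I understand the talosctl CLI; health_node is a hypothetical variable holding the IP of a control plane node that is not currently being upgraded.

```yaml
- name: Wait for the cluster to report healthy before the next node
  ansible.builtin.command:
    cmd: >-
      talosctl health
      --talosconfig {{ talosconfig }}
      --nodes {{ health_node }} --endpoints {{ health_node }}
      --wait-timeout 5m
  changed_when: false   # read-only check; fails the play if health is not reached
```

Because include_tasks runs the file once per node, a failure here stops the loop before the next node is drained.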

Upgrading Kubernetes itself

The OS upgrade above does not change the Kubernetes version. Do that separately, once the OS is current:

talosctl --talosconfig talos/talosconfig -n 192.168.0.2 upgrade-k8s --to 1.31.0
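
The same command can be wrapped in Ansible. The tasks below are a sketch that previews the change with --dry-run and only applies it when explicitly requested. The variables control_plane_ip, k8s_version, and apply_k8s_upgrade are hypothetical names introduced for this example.

```yaml
- name: Preview the Kubernetes upgrade without applying it
  ansible.builtin.command:
    cmd: >-
      talosctl --talosconfig {{ talosconfig }}
      -n {{ control_plane_ip }}
      upgrade-k8s --to {{ k8s_version }} --dry-run
  register: k8s_plan
  changed_when: false

- name: Show what the upgrade would do
  ansible.builtin.debug:
    msg: "{{ k8s_plan.stdout_lines + k8s_plan.stderr_lines }}"

- name: Apply the Kubernetes upgrade (opt-in)
  ansible.builtin.command:
    cmd: >-
      talosctl --talosconfig {{ talosconfig }}
      -n {{ control_plane_ip }}
      upgrade-k8s --to {{ k8s_version }}
  when: apply_k8s_upgrade | default(false) | bool
  changed_when: true
```

Running with -e apply_k8s_upgrade=true performs the upgrade; without it the play only prints the plan.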

Validation

ansible-playbook -i inventory/talos.ini rolling-upgrade.yml

talosctl --talosconfig talos/talosconfig -n 192.168.0.2 version
kubectl --kubeconfig talos/kubeconfig get nodes -o wide

Each node should report the new Talos version and a Ready status. The drain and uncordon steps converge on a re-run, but do not rely on talosctl upgrade to skip a node that already runs the target image: confirm the running version with talosctl version first, or expect the node to reboot again.
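
The manual checks can also be expressed as assertions at the end of the play. This is a sketch that assumes each node's status.nodeInfo.osImage includes the Talos version string and that talos_version comes from the inventory.

```yaml
- name: List all nodes
  kubernetes.core.k8s_info:
    kubeconfig: "{{ kubeconfig }}"
    kind: Node
  register: all_nodes

- name: Assert each node is Ready and on the target Talos version
  ansible.builtin.assert:
    that:
      # Assumption: osImage looks like "Talos (v1.8.3)"
      - talos_version in item.status.nodeInfo.osImage
      - item.status.conditions
        | selectattr('type', 'equalto', 'Ready')
        | selectattr('status', 'equalto', 'True') | list | length == 1
    fail_msg: "{{ item.metadata.name }} is not Ready or not on {{ talos_version }}"
  loop: "{{ all_nodes.resources }}"
  loop_control:
    label: "{{ item.metadata.name }}"
```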

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Drain hangs or times out | A PodDisruptionBudget or an unmanaged Pod blocks eviction | Raise `wait_timeout`, fix the PDB, or set `force: true` in `delete_options` for Pods with no controller |
| Node never returns after upgrade | Bad image or failed boot | `talosctl rollback --nodes <node-ip>` reverts to the previous Talos version |
| `etcd` unhealthy after a control plane upgrade | Upgraded without `--preserve`, or two control plane nodes at once | Upgrade control plane nodes one at a time with `--preserve`; check `talosctl etcd status` |
| `certificate signed by unknown authority` | Wrong or missing `talosconfig` | Pass the cluster's `--talosconfig` generated at bootstrap |
| Kubernetes version unchanged after upgrade | The OS upgrade does not bump Kubernetes | Run `talosctl upgrade-k8s --to <version>` separately |

FAQ

Q. Does Talos use apt, dnf, or rpm-ostree for patching? None of them. Talos is a single immutable image with no package manager. You "patch" by upgrading the whole image with talosctl upgrade, which reboots the node into the new version.

Q. How do I roll back a bad upgrade? talosctl rollback --nodes <node-ip> boots the node back into the previous Talos version, which Talos retains on the other partition.

Q. Does Ansible SSH into the Talos nodes to reboot them? No — there is no SSH. The reboot happens as part of talosctl upgrade, driven over the Talos API; Ansible runs connection: local on the control node.

Q. How do I patch without downtime? Upgrade one node at a time (the loop above), draining first so workloads reschedule, and keep at least three control plane nodes so the API stays available during each reboot.

Q. Is the OS upgrade the same as a Kubernetes upgrade? No. talosctl upgrade changes the Talos OS image; talosctl upgrade-k8s changes the Kubernetes component versions. Run them as separate, deliberate steps.

Conclusion

Patching Talos Linux is image-based, not package-based: talosctl upgrade swaps the OS image and reboots, and talosctl rollback undoes it. Ansible makes that safe and repeatable by draining each node first, upgrading one at a time with --preserve, waiting for Ready, and uncordoning — a true reboot-aware rolling upgrade driven entirely through the Talos and Kubernetes APIs, with no SSH and no in-place package edits.
