Today's Session: Cozystack on Tapok Cluster

Disclaimer: The following text is an AI-generated log and summary of a session deploying Cozystack on a Talos cluster. All work was done in the background by claude-code and Swamp; I only had to approve some steps here and there and add a hint about the ISO boot order issue.

What Worked Well with Swamp

  1. Extension models are powerful — The @talos/node model with applyConfig, bootstrap, health, patchConfig, reboot methods made the entire Talos lifecycle manageable. Adding patchConfig and retry logic was straightforward.
  2. Model methods for step-by-step execution — Running swamp model method run tapok-cp-1 applyConfig --input '{…}' was reliable and gave clear JSON output with success/failure status.
  3. libvirt models — unraid-vms and unraid-storage worked well for VM management (start/stop/resize/attach-disk) and storage pool/volume management.
  4. Retry logic in talosctl helper — The isTransientError() pattern with configurable retries saved the bootstrap phase (which needed ~20 retries over 5 minutes).
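The retry pattern described in point 4 can be sketched as a small shell helper. This is illustrative only: the function name, arguments, and timings are assumptions, and Swamp's actual isTransientError() logic additionally classifies errors before retrying.

```shell
# Illustrative retry helper in the spirit of the talosctl wrapper.
# Usage: retry <max_attempts> <delay_seconds> <command...>
retry() {
  local max="$1" delay="$2"; shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "retry: giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# The bootstrap phase needed roughly 20 retries over 5 minutes, e.g.:
# retry 20 15 talosctl bootstrap --nodes tapok-cp-1
```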

Issues / Areas for Improvement

  1. Workflow idempotency is hard — The full workflow failed repeatedly:
    • start fails if the VM is already running
    • poolDefine fails if the pool exists
    • volCreate fails if the volume exists
    Working around this required allowFailure: true plus completed conditions everywhere, making the YAML verbose.
  2. Workflow can’t resume from a specific job — After the first run succeeded through job 5 but failed on job 6, we couldn’t skip the completed jobs. Had to create a separate phase2 workflow.
  3. ISO boot order issue — The biggest time sink. After Talos installed to disk, the cdrom ISO was still first in boot order. After stop/start, VMs booted the ISO instead of the installed disk, causing all nodes to be unreachable. Fix: detach ISO after first successful boot.
  4. virsh setvcpus --maximum missing — Had to add the maximum parameter to the libvirt model mid-session.
  5. virsh attach-disk --persistent vs --config — --persistent only works on running VMs, --config for stopped VMs. Had to add the config parameter.
  6. Cozystack ConfigMap bundle naming — The docs say paas-full but v1.1.0 uses isp-full, default, etc. Also needed to manually create a Package CR — the operator doesn’t auto-create it from the ConfigMap.
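The idempotency guards from point 1 can be sketched with a generic check-then-create helper. The helper name is hypothetical, and the virsh invocations in the usage comments are examples for this cluster, not Swamp's actual workflow code:

```shell
# Hypothetical "ensure" helper: run the create command only when the
# check fails, so every step is safe to re-run.
ensure() {
  local check="$1" create="$2"
  if eval "$check" >/dev/null 2>&1; then
    echo "ensure: '$check' already satisfied, skipping"
  else
    eval "$create"
  fi
}

# Example guards (pool name and target path are assumptions):
# ensure "virsh domstate tapok-cp-1 | grep -q running" "virsh start tapok-cp-1"
# ensure "virsh pool-info linstor" "virsh pool-define-as linstor dir --target /var/lib/pools/linstor"
```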

Step-by-Step Deployment Guide

  1. WIPE & BOOT

    • Stop all VMs: virsh destroy tapok-*
    • Wipe boot disks by recreating each image: qemu-img create -f qcow2 <boot-disk>.qcow2 10G
    • Start VMs (boot from Talos ISO): virsh start tapok-*
    • Wait for port 50000 (maintenance mode)
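Waiting for the maintenance API can be scripted. A sketch using bash's /dev/tcp (the helper name, timings, and node IP are assumptions; port 50000 is Talos's maintenance-mode API):

```shell
# Poll until host:port accepts TCP connections, or time out.
# Usage: wait_for_port <host> <port> [timeout_seconds] [interval_seconds]
wait_for_port() {
  local host="$1" port="$2" timeout="${3:-300}" interval="${4:-5}"
  local start
  start=$(date +%s)
  while ! (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# wait_for_port 10.0.0.11 50000   # node IP is a placeholder
```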
  2. PROVISION TALOS

    • Apply controlplane configs (insecure): talosctl apply-config --insecure --file controlplane.yaml
    • Apply worker configs (insecure): talosctl apply-config --insecure --file worker.yaml
    • Wait for nodes to install and reboot
  3. BOOTSTRAP

    • Bootstrap etcd on cp-1: talosctl bootstrap
    • Wait for cluster health: talosctl health --wait-timeout 10m
  4. DRBD EXTENSION

    • Patch all nodes with drbd-patch.yaml: talosctl patch machineconfig --patch-file drbd-patch.yaml
    • Rolling reboot: reboot one node, wait for health, repeat
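The rolling reboot can be sketched as a helper that does one node at a time (node addresses are placeholders; the talosctl flags match those used elsewhere in this guide):

```shell
# Reboot one node at a time, waiting for cluster health between nodes.
rolling_reboot() {
  local node
  for node in "$@"; do
    talosctl reboot --nodes "$node" || return 1
    talosctl health --wait-timeout 10m || return 1
  done
}

# rolling_reboot 10.0.0.11 10.0.0.12 10.0.0.13
```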
  5. DETACH ISO (critical!)

    • Stop all VMs
    • Detach cdrom from each VM: virsh detach-disk <vm> sda --config
  6. ATTACH LINSTOR STORAGE

    • Create storage pool: virsh pool-define-as / pool-build / pool-start
    • Create 100G qcow2 volumes per node
    • Attach as vdb: virsh attach-disk --config
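Expanded, step 6 might look like the following sketch. The pool name, target path, and VM name are examples; verify disk targets against your domain XML:

```shell
# Create a dir-backed pool, a 100G qcow2 volume, and attach it as vdb.
attach_linstor_disk() {
  local vm="$1" pool="linstor" dir="/var/lib/pools/linstor"
  virsh pool-define-as "$pool" dir --target "$dir"
  virsh pool-build "$pool"
  virsh pool-start "$pool"
  virsh vol-create-as "$pool" "$vm-data.qcow2" 100G --format qcow2
  virsh attach-disk "$vm" "$dir/$vm-data.qcow2" vdb \
    --driver qemu --subdriver qcow2 --config
}

# attach_linstor_disk tapok-cp-1
```

Note the pool only needs to be defined once; on a re-run the pool-define-as call would fail, which is exactly the idempotency issue described earlier.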
  7. START VMs

    • Start all VMs (now boot from disk, no ISO)
    • Wait for cluster health
  8. INSTALL COZYSTACK

    • helm upgrade --install cozystack oci://ghcr.io/cozystack/cozystack/cozy-installer --namespace cozy-system --create-namespace
    • Apply platform ConfigMap (bundle-name: isp-full)
    • Create Package CR: kubectl apply -f package.yaml (name must match PackageSource)
    • Wait for Cilium → cert-manager → dashboard chain
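For step 8, the platform ConfigMap might look like the following. The name, namespace, and bundle-name key follow the Cozystack convention noted above, but verify against the docs for your version (other data keys, such as network CIDRs, are omitted here):

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cozystack
  namespace: cozy-system
data:
  bundle-name: "isp-full"
EOF
```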
  9. ACCESS DASHBOARD

    • https://dashboard. (after ingress/metallb/cert-manager are ready)