Today's Session: Cozystack on Tapok Cluster
Disclaimer: The following text is an AI-generated log and summary of a session deploying Cozystack on a Talos cluster. All work was done in the background by claude-code and Swamp; I only had to approve some steps here and there and add a hint about the ISO boot order issue.
What Worked Well with Swamp
- Extension models are powerful — The @talos/node model with applyConfig, bootstrap, health, patchConfig, reboot methods made the entire Talos lifecycle manageable. Adding patchConfig and retry logic was straightforward.
- Model methods for step-by-step execution — Running swamp model method run tapok-cp-1 applyConfig --input '{…}' was reliable and gave clear JSON output with success/failure status.
- libvirt models — unraid-vms and unraid-storage worked well for VM management (start/stop/resize/attach-disk) and storage pool/volume management.
- Retry logic in talosctl helper — The isTransientError() pattern with configurable retries saved the bootstrap phase (which needed ~20 retries over 5 minutes).
Issues / Areas for Improvement
- Workflow idempotency is hard — The full workflow failed repeatedly because:
  - start fails if the VM is already running
  - poolDefine fails if the pool already exists
  - volCreate fails if the volume already exists
  Working around this required allowFailure: true plus completed conditions everywhere, making the YAML verbose.
- Workflow can’t resume from a specific job — After the first run succeeded through job 5 but failed on job 6, we couldn’t skip the completed jobs. Had to create a separate phase2 workflow.
- ISO boot order issue — The biggest time sink. After Talos installed to disk, the cdrom ISO was still first in boot order. After stop/start, VMs booted the ISO instead of the installed disk, causing all nodes to be unreachable. Fix: detach ISO after first successful boot.
- virsh setvcpus --maximum missing — Had to add the maximum parameter to the libvirt model mid-session.
- virsh attach-disk --persistent vs --config — --persistent only works on running VMs, --config for stopped VMs. Had to add the config parameter.
- Cozystack ConfigMap bundle naming — The docs say paas-full but v1.1.0 uses isp-full, default, etc. Also needed to manually create a Package CR — the operator doesn’t auto-create it from the ConfigMap.
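One generic way to tame the idempotency problem from the first bullet above: wrap each imperative step and treat "already in the desired state" failures as success. The wrapper and the matched error strings here are illustrative, not Swamp's actual mechanism:

```shell
# Run a command; if it fails only because the desired state already
# holds (VM running, pool/volume exists), treat that as success.
idempotent() {
  if out=$("$@" 2>&1); then
    printf '%s\n' "$out"
    return 0
  fi
  case "$out" in
    *"already exists"*|*"already active"*|*"Domain is already active"*)
      return 0 ;;   # already in the desired state: not an error
    *)
      printf '%s\n' "$out" >&2
      return 1 ;;
  esac
}

# e.g.:
# idempotent virsh start tapok-cp-1
# idempotent virsh pool-start linstor
```

This keeps the workflow YAML free of allowFailure/completed boilerplate, at the cost of string-matching error messages, which is fragile across libvirt versions.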
Step-by-Step Deployment Guide
1. WIPE & BOOT
- Stop all VMs: virsh destroy tapok-*
- Wipe boot disks by recreating the images: qemu-img create -f qcow2 <boot-disk>.qcow2 10G
- Start VMs (boot from Talos ISO): virsh start tapok-*
- Wait for port 50000 (maintenance mode)
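The "wait for port 50000" step can be sketched as a small poll loop. This assumes bash (the /dev/tcp redirection is a bash feature, not POSIX), and the node names are placeholders:

```shell
# Poll until a TCP port accepts connections, or give up.
# wait_for_port HOST PORT [TRIES] [INTERVAL_SECONDS]
wait_for_port() {
  host=$1; port=$2; tries=${3:-60}; interval=${4:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # The subshell opens fd 3 to the port and closes it on exit.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timed out waiting for ${host}:${port}" >&2
  return 1
}

# e.g.:
# for node in tapok-cp-1 tapok-cp-2 tapok-cp-3; do
#   wait_for_port "$node" 50000
# done
```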
2. PROVISION TALOS
- Apply controlplane configs (insecure): talosctl apply-config --insecure --file controlplane.yaml
- Apply worker configs (insecure): talosctl apply-config --insecure --file worker.yaml
- Wait for nodes to install and reboot
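The two apply steps above can be looped over all nodes. The node names below are placeholders following this cluster's tapok-* naming; in practice talosctl wants each machine's address via --nodes:

```shell
# Apply machine configs to every node in maintenance mode.
# Node names are illustrative; real runs would pass IPs to --nodes.
apply_all_configs() {
  for node in tapok-cp-1 tapok-cp-2 tapok-cp-3; do
    talosctl apply-config --insecure --nodes "$node" --file controlplane.yaml
  done
  for node in tapok-worker-1 tapok-worker-2; do
    talosctl apply-config --insecure --nodes "$node" --file worker.yaml
  done
}
```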
3. BOOTSTRAP
- Bootstrap etcd on cp-1: talosctl bootstrap
- Wait for cluster health: talosctl health --wait-timeout 10m
4. DRBD EXTENSION
- Patch all nodes with drbd-patch.yaml: talosctl patch machineconfig --patch-file drbd-patch.yaml
- Rolling reboot: reboot one node, wait for health, repeat
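The rolling reboot can be sketched as a loop that bails out if either the reboot or the subsequent health check fails (node names are placeholders):

```shell
# Reboot nodes one at a time, waiting for cluster health in between,
# so etcd quorum is never lost. Stops at the first failure.
rolling_reboot() {
  for node in "$@"; do
    talosctl reboot --nodes "$node" || return 1
    talosctl health --wait-timeout 10m || return 1
  done
}

# rolling_reboot tapok-cp-1 tapok-cp-2 tapok-cp-3
```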
5. DETACH ISO (critical!)
- Stop all VMs
- Detach cdrom: virsh detach-disk <vm> sda --config
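A sketch of the fix as a loop; VM names are placeholders, and the sda target is from this session, so confirm yours with virsh domblklist first:

```shell
# Stop each VM and remove the cdrom from its persistent definition,
# so the next start boots from the installed disk instead of the ISO.
detach_isos() {
  for vm in "$@"; do
    virsh shutdown "$vm" || true   # tolerate "domain is not running"
    virsh detach-disk "$vm" sda --config
  done
}

# detach_isos tapok-cp-1 tapok-cp-2 tapok-cp-3
```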
6. ATTACH LINSTOR STORAGE
- Create storage pool: virsh pool-define-as / pool-build / pool-start
- Create 100G qcow2 volumes per node
- Attach as vdb: virsh attach-disk --config
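The storage step could look like the following. The pool name, target path, and volume naming are assumptions; only the 100G size, qcow2 format, and vdb target come from the bullets above:

```shell
# Define/build/start a directory-backed pool, then create one 100G
# qcow2 volume per VM and attach it as vdb in the persistent config.
provision_linstor_storage() {
  pool=linstor
  virsh pool-define-as "$pool" dir --target /var/lib/libvirt/linstor
  virsh pool-build "$pool"
  virsh pool-start "$pool"
  for vm in "$@"; do
    virsh vol-create-as "$pool" "${vm}-linstor.qcow2" 100G --format qcow2
    virsh attach-disk "$vm" \
      "/var/lib/libvirt/linstor/${vm}-linstor.qcow2" vdb \
      --subdriver qcow2 --config
  done
}

# provision_linstor_storage tapok-cp-1 tapok-cp-2 tapok-cp-3
```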
7. START VMs
- Start all VMs (now boot from disk, no ISO)
- Wait for cluster health
8. INSTALL COZYSTACK
- helm upgrade --install cozystack oci://ghcr.io/cozystack/cozystack/cozy-installer --namespace cozy-system --create-namespace
- Apply platform ConfigMap (bundle-name: isp-full)
- Create Package CR: kubectl apply -f package.yaml (name must match PackageSource)
- Wait for Cilium → cert-manager → dashboard chain
9. ACCESS DASHBOARD
- https://dashboard. (after ingress/metallb/cert-manager are ready)