swamp-go-brr, the brain

Last post I built a cage — a Firecracker microVM to hold an agent I had let off the leash. This post is about what I put in it, and why the thing I put in it is a state machine and not a swarm.

It started with a question I had been circling for a while. My engineering already runs through issue-lifecycle: Claude researches and plans, I review and correct the plan, and it implements against tests. It works, and over time the repository fills with more than code — the methodologies, the architecture decisions, the patterns and antipatterns Claude hit along the way, and a set of UAT/BDD scenarios that pin the behaviour. The process writes its own knowledge base. But it needs my hand on the wheel for every plan and every correction. So the question was: what if the principles and the tests are good enough that I could throw a task at it and walk away?

其ノ壹 01 / 07

Not a swarm

There is a fashionable answer to that, and I did not want it. You can hand the whole thing to a swarm of agents — Gas Town and its kin — and let them improvise their way to a result, spawning sub-agents, negotiating, retrying, until something falls out the end. It is genuinely impressive, and I cannot reason about it. I cannot tell you why it did what it did, I cannot replay it, and when it goes wrong I cannot point at the transition that broke. For a loop that runs unattended, with my credentials, on real code, “impressive and impossible to reason about” is exactly the wrong trade. I wanted the orchestration deterministic and the non-determinism quarantined to the leaves.

其ノ貳 02 / 07

Brain and hands

So I asked Claude to design swamp-go-brr the way I build everything else — through issue-lifecycle — but with the human removed from the execution loop. What came out has a clean split.

The brain is gobrr: a pure state machine, a Run aggregate over a dynamic task DAG, and nothing it does is an LLM call. It seeds the DAG in one batch; it leases ready tasks to workers, one per call, up to a concurrency cap; it reaps any task whose lease lapses on a heartbeat TTL, so a worker that dies in its microVM does not wedge the run; it caps total attempts so a task that keeps failing surfaces instead of looping forever; and it moves each task through its states on the verdicts it is handed. Every one of those is a deterministic transition I can inspect and replay, because it is swamp state on disk, not an agent’s short-term memory.
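To make those transitions concrete, here is a minimal sketch of a lease-and-reap state machine. It is not gobrr's code: the names `Run`, `Task`, `State`, `lease`, `heartbeat`, `reap` and `report`, and all the default numbers, are assumptions made for illustration. The clock is passed in as an argument so every transition stays deterministic and replayable.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PENDING = "pending"
    LEASED = "leased"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Task:
    id: str
    deps: frozenset = frozenset()
    state: State = State.PENDING
    attempts: int = 0
    lease_expires: float = 0.0


@dataclass
class Run:
    tasks: dict                  # task id -> Task
    max_concurrency: int = 2     # hypothetical defaults
    max_attempts: int = 3
    lease_ttl: float = 30.0

    def _leased(self):
        return [t for t in self.tasks.values() if t.state is State.LEASED]

    def lease(self, now):
        """Lease at most one ready task per call, respecting the concurrency cap."""
        if len(self._leased()) >= self.max_concurrency:
            return None
        for t in self.tasks.values():
            ready = all(self.tasks[d].state is State.DONE for d in t.deps)
            if t.state is State.PENDING and ready:
                t.state = State.LEASED
                t.attempts += 1
                t.lease_expires = now + self.lease_ttl
                return t.id
        return None

    def heartbeat(self, task_id, now):
        """A live worker extends its own lease."""
        t = self.tasks[task_id]
        if t.state is State.LEASED:
            t.lease_expires = now + self.lease_ttl

    def reap(self, now):
        """Take back every lease whose TTL has lapsed; return the reaped ids."""
        reaped = []
        for t in self._leased():
            if t.lease_expires <= now:
                self._retry_or_fail(t)
                reaped.append(t.id)
        return reaped

    def report(self, task_id, green):
        """Apply a verdict handed in from outside; the machine never judges work itself."""
        t = self.tasks[task_id]
        if t.state is not State.LEASED:
            raise ValueError(f"{task_id} is not leased")
        if green:
            t.state = State.DONE
        else:
            self._retry_or_fail(t)

    def _retry_or_fail(self, t):
        # The attempt cap makes a repeatedly failing task surface instead of looping.
        t.state = State.FAILED if t.attempts >= self.max_attempts else State.PENDING
```

A driver in this sketch would call `reap(now)` and then `lease(now)` on each tick. Nothing here reads the wall clock or calls a model, which is what makes the sequence of transitions something you can replay.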

The hands are a thin driver: ask gobrr for the next ready task, build the work order, spawn a claude to do it, report the result back, repeat until the DAG is green. The only non-determinism in the whole system lives inside one leaf at a time — inside a microVM, behind a gate.

其ノ參 03 / 07

The agent proposes; the state machine disposes

The part I like most is how decomposition works, because it is where most “autonomous” systems quietly cheat. The agent decomposes the goal: it reads the repository and breaks the work into tasks, each with a spec, an explicit write-allowlist, and its dependencies. But gobrr does not trust it. It validates the decomposition mechanically and refuses a bad one — it derives each task’s gate and forces the tests to be separated from the code, and it rejects a task whose write-allowlist smears across both. The creative step is allowed to be an LLM; the structural rules that keep it honest are code.
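A minimal sketch of that kind of mechanical refusal follows. The post does not say how gobrr tells test files from code, so the `tests/` prefix rule, the `ProposedTask` type and the `validate` function are all assumptions for illustration.

```python
from dataclasses import dataclass

TEST_PREFIX = "tests/"  # hypothetical convention for what counts as a test file


@dataclass(frozen=True)
class ProposedTask:
    id: str
    spec: str
    write_allowlist: tuple
    deps: tuple = ()


def is_test_path(path):
    return path.startswith(TEST_PREFIX)


def validate(tasks):
    """Return a list of rejection reasons; an empty list means the plan is accepted."""
    errors = []
    ids = {t.id for t in tasks}
    for t in tasks:
        if not t.write_allowlist:
            errors.append(f"{t.id}: empty write-allowlist")
        # A task may write tests or code, never both.
        kinds = {is_test_path(p) for p in t.write_allowlist}
        if len(kinds) > 1:
            errors.append(f"{t.id}: write-allowlist mixes tests and code")
        for d in t.deps:
            if d not in ids:
                errors.append(f"{t.id}: unknown dependency {d}")
    return errors
```

The design choice worth copying is the return type: the validator gives reasons rather than a boolean, so a refused decomposition can go back to the agent with something specific to fix.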

The other half of a good decomposition is sizing, and this is where Claude’s planning mode quietly earns its keep. Left to plan, it produces tasks at a grain where — with the issue-lifecycle skill loaded — a single agent carries one from zero to completion inside a 1M-token context window, without running out of room mid-task. And on the off chance a task’s scope is still too large, the agent does not thrash against the ceiling: it files a follow-up and finishes what it can. Right-sizing is essential here, because it is what lets a leaf run to done unattended.

The DAG is not just bookkeeping; it is the schedule. File-disjoint tasks with no dependency between them are independent by construction, so they run at the same time. A task that needs another's output simply waits for it. The shape of the decomposition is the parallelism, which is why the validator cares so much about getting it right.
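The independence rule can be sketched as a small predicate: two tasks may overlap only if neither depends on the other, directly or transitively, and their write sets do not intersect. The function names and the dict-of-sets representation are assumptions for illustration, not gobrr's data model.

```python
def depends_on(task_id, other_id, deps):
    """True if task_id depends on other_id, directly or through other tasks."""
    stack, seen = list(deps.get(task_id, ())), set()
    while stack:
        cur = stack.pop()
        if cur == other_id:
            return True
        if cur not in seen:
            seen.add(cur)
            stack.extend(deps.get(cur, ()))
    return False


def can_run_together(a, b, deps, writes):
    """Two tasks may run at the same time only if unordered and file-disjoint."""
    ordered = depends_on(a, b, deps) or depends_on(b, a, deps)
    disjoint = writes[a].isdisjoint(writes[b])
    return not ordered and disjoint
```

For example, with `deps = {"api": {"model"}}` and disjoint write sets, `model` and `docs` can share a round while `api` has to wait for `model`.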

From there the loop is four models, one job each, and no model trusts the next one’s output. gobrr schedules. The driver leases a task and hands the microVM fabric a claude -p with a crafted work order. The leaf does not write to the repository — it emits its files inside a fenced work-contract envelope, and a separate integration model parses that envelope and applies it as one base-isolated change behind the task’s write-allowlist. That is the isolation invariant: code is only ever authored inside a VM, and the host applies a reviewed diff rather than running anything the agent produced. Then a containerised verifier gates the change in a --network none, read-only box and returns an exit code. If the exit code is green, the base advances; if it is red, the task bounces back to pending for another attempt.
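A minimal sketch of the parse-then-check step on the host side. The post does not describe the envelope's wire format, so everything about it here is an assumption: the `~~~work-contract` fence, the JSON body mapping paths to contents, and the function names are hypothetical. The point is only that the host treats leaf output as data to validate, never as something to execute.

```python
import json
import posixpath
import re

# Hypothetical envelope: a ~~~work-contract fence around a JSON object {path: content}.
ENVELOPE = re.compile(r"^~~~work-contract\n(.*?)\n~~~$", re.DOTALL | re.MULTILINE)


def parse_envelope(output):
    """Extract {path: content} from exactly one envelope; reject anything else."""
    matches = ENVELOPE.findall(output)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one envelope, found {len(matches)}")
    files = json.loads(matches[0])
    if not isinstance(files, dict) or not all(isinstance(v, str) for v in files.values()):
        raise ValueError("envelope must map file paths to string contents")
    return files


def check_allowlist(files, allowlist):
    """Return the files keyed by normalised path, refusing any path not allowlisted."""
    checked = {}
    for path, content in files.items():
        norm = posixpath.normpath(path)
        # Absolute paths and parent-directory escapes are refused before the lookup.
        if posixpath.isabs(norm) or norm.startswith("..") or norm not in allowlist:
            raise PermissionError(f"write outside allowlist: {path}")
        checked[norm] = content
    return checked
```

Only what survives `check_allowlist` would be turned into the change that the verifier then gates. Prose the leaf writes outside the envelope is ignored.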

The one structural subtlety worth calling out: every task branches off a single fixed base, so a task that imports a sibling’s brand-new file will fail its own isolated gate — the file is not on its base yet. The fix is to sequence by dependency. Build the independent units first, rebase the green ones into a linear stack, advance the base to the top of that stack, and only then seed the dependent round — embedding the exact export signatures from the finished units into the next round’s prompts so the imports resolve on the first try.
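Sequencing by dependency amounts to layering the DAG into rounds. The sketch below is standard topological layering, not gobrr's scheduler; `plan_rounds` and its input shape are assumptions for illustration. Rebasing the green tasks and advancing the base between rounds is not shown.

```python
def plan_rounds(deps):
    """Group tasks into rounds so each round depends only on earlier rounds.

    deps maps each task id to the ids it depends on.
    """
    remaining = {task: set(d) for task, d in deps.items()}
    done, rounds = set(), []
    while remaining:
        # A task is ready once everything it depends on is already done.
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError(f"cycle or unknown dependency among: {sorted(remaining)}")
        rounds.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return rounds
```

Each inner list is a set of tasks that can branch off the same base. The next list would only be seeded after the base has advanced past the previous one, which is the point at which a sibling's new file actually exists to import.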

其ノ肆 04 / 07

The fork I did not take

There was a more ambitious design sitting right there, and I want to be honest that I looked at it. I could have made the thing recursive: put swamp itself inside each microVM and let a leaf spawn its own sub-VMs, a fractal of agents decomposing their own subtrees all the way down. It was too much — more blast radius, more state to reason about, more ways to deadlock, for a payoff I did not need. I kept the orchestration in the one main VM and had Opus split the task into subtasks that feed a single flat pool. The final design has one conductor, many hands, and no recursion. It is simple, it works for now, and it is a good enough place to start.

其ノ伍 05 / 07

Why it is safe to walk away

None of this would be safe to leave alone if the agent were starting cold. It is not. The reason I could even consider taking my hand off the wheel is that issue-lifecycle has spent months writing the foundation down: the architecture decisions, the patterns and antipatterns, and — the load-bearing part — the UAT and BDD scenarios that say what “working” means. An autonomous loop is only as trustworthy as its definition of done, and mine is concrete: it is a test suite the process built for itself. The agent in the leaf is free to be creative, but the gate it has to pass is deterministic.

其ノ陸 06 / 07

If you are building one of these

I think a lot of people are about to build some version of this, so here is what actually held and what bit me.

The pattern that carries the whole thing: the agent proposes, the state machine disposes. Let the LLM do the creative, fuzzy steps (decomposition, code) and put every rule you actually depend on into deterministic code that can refuse its output. If your orchestrator is itself an LLM, you have not removed the non-determinism, only moved it up a level.

Quarantine the non-determinism to the leaves: one agent, one task, one sandbox, one gate. Everything between the leaves should be replayable state you can open after the fact and read in order. When something goes wrong, and it will, you want a specific transition to point at, not a transcript to re-read.

Make “done” a gate the agent cannot argue with. A green test suite in a --network none, read-only container is worth more than any amount of the agent telling you it is finished. No model trusts the next one’s output; each one checks.
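As a sketch of what such a gate can look like, assuming the Docker CLI as the container runtime (the post says only "containerised"). The image name, mount point and test command are hypothetical. The flags themselves (`--rm`, `--network none`, `--read-only`, a `:ro` bind mount, `-w`) are standard `docker run` options.

```python
import subprocess


def gate_command(image, repo_dir, test_cmd):
    """Build a docker invocation with no network and a read-only filesystem."""
    return [
        "docker", "run", "--rm",
        "--network", "none",           # no network inside the box
        "--read-only",                 # container root filesystem is read-only
        "-v", f"{repo_dir}:/work:ro",  # candidate change mounted read-only
        "-w", "/work",
        image, *test_cmd,
    ]


def run_gate(image, repo_dir, test_cmd):
    """The verdict is the exit code and nothing else."""
    return subprocess.run(gate_command(image, repo_dir, test_cmd)).returncode == 0
```

A call such as `run_gate("verifier:local", "/srv/candidate", ["pytest", "-q"])` returns a boolean for the state machine to act on. Whatever the agent printed along the way does not enter into it.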

What bit me. The verify gate ran the tests but not the formatter or the linter, so those failures only surfaced later, at publish time. Your gate is exactly as good as what you put in it, and no broader. Isolation is always along some axis: network namespaces isolated the network but not /tmp, and a shared control-plane file was clobbered by concurrent VMs until I keyed it per namespace. A read-only verify box means the toolchain has to be baked in and offline ahead of time, with lockfile writing turned off, or the run dies on a write it is not allowed to make. And retries only help when they come with better instructions: a task that bounced on a type error passed only once I put the exact cast into its spec. A second attempt with sharper guidance is a different input; a second attempt with the same spec is just another roll of the dice.
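The first lesson suggests treating the gate as an explicit list of checks rather than "the tests". A minimal sketch, where `run_checks` and the example commands are assumptions: the post does not name the toolchain, so pytest and ruff stand in for whatever runner, formatter and linter a project uses.

```python
import subprocess

# Hypothetical toolchain; substitute the project's own test, format and lint commands.
EXAMPLE_CHECKS = [
    ("tests", ["pytest", "-q"]),
    ("format", ["ruff", "format", "--check", "."]),
    ("lint", ["ruff", "check", "."]),
]


def run_checks(checks, cwd="."):
    """Run every named check in order and return the names of the ones that failed."""
    failed = []
    for name, cmd in checks:
        result = subprocess.run(cmd, cwd=cwd, capture_output=True)
        if result.returncode != 0:
            failed.append(name)
    return failed
```

The gate is green only when `run_checks(...)` returns an empty list. Returning names instead of a bare boolean also gives the next attempt something specific to put into its spec.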

其ノ柒 07 / 07

The one gate I keep

I keep exactly one transition for myself — the publish gate — for the reason I gave in the cage post: the loop is allowed to write the code and prove it green, but deciding what is allowed out into the world is mine. I proved the rest on a real task, self-hosting a Bluesky PDS end to end while I watched, and I have not shipped what it built, because there is no adversarial-review pass on it yet.

That is the shape I am comfortable with: autonomy I can reason about, the creativity boxed into leaves, the structure held in a state machine, and a human on the one gate that matters. The swarm can have the Overton window. I will take the thing I can replay.

This is a first working version, and there is more coming. Next I am hardening swamp-go-brr and wiring in telemetry.