How I ship with agents
Why serious agentic coding needs gates, independent review, shared work state, and handoffs that survive the chat window.
TL;DR
Serious agentic work needs two things at the same time: gates that stop bad changes from moving, and shared work state that survives the chat window.
The durable object is not the chat window. It is the work state: what changed, what was decided, what is still open, what is blocked, what evidence exists, and what the next agent needs before touching the repo.
I use agents for implementation, Codex as a second reader, GitHub as the ledger, sp3cmar as the shared playbook, and 3ngram as shared context and follow-through for real work across AI tools. PocketOS is a useful warning, but not the whole thesis. The real point is that agentic work only becomes serious when the surrounding system can say no.
The Agents Are Not The Magic. The Gates Are.
AI agents are good enough to do real work now.
That also means they are good enough to do real damage.
The useful question is no longer whether agents can write code. They can. The sharper question is what system keeps them from shipping the wrong thing, deleting the wrong thing, or forgetting why the work started.
The PocketOS incident made this concrete. The Guardian reported that PocketOS founder Jeremy Crane said a Cursor coding agent powered by Anthropic’s Claude Opus 4.6 deleted the company’s production database and backups. He said customers were left without software they relied on, and that the company had to restore from an older offsite backup and rebuild from other systems.
The important detail is not just that an agent broke something. It is that it apparently broke something despite written safety rules, then explained which rules it ignored.
Written rules matter. They are not enough.
The workflow around the agent has to decide what moves. A model can propose a patch, open a PR, and explain itself with confidence. None of that is proof. Proof is the issue it closes, the diff it produced, the tests it passed, the review comments it resolved, the preview someone checked, and the state it leaves behind for the next session.
That is the core of my workflow: gates plus shared work state.
The Stack
I use Claude Code and Codex as primary implementation drivers. Claude Code runs in my terminal and VS Code, can spawn background sub-agents, and can isolate them in git worktrees. Codex is useful when I want a separate implementation or review pass. The handoff matters more than which agent typed the diff.
Codex Code Review is the second reader. It reviews PRs from a fresh context, without inheriting the implementation conversation. That separation matters. The implementation session wants to finish. The review session has no reason to defend the patch.
GitHub is the ledger. Issues define the work, PRs carry the proof, Actions enforce the gates, Projects show state, and branch protection keeps the path to production narrow. I do not ask an agent whether something is done and take the answer on faith. I ask GitHub what changed, what checks ran, what review comments remain, and what state moved.
sp3cmar is my open-source skill and reviewer-agent library at github.com/b3dmar/sp3cmar. It gives Claude Code and Codex the same operating playbook: how to split work, review it, ship it, debrief it, and reconcile state afterward. Skills are useful when they encode decisions, not when they are thin wrappers around commands.
3ngram is shared context and follow-through for real work across AI tools. It is not the thesis of the workflow by itself, and I do not lead with “memory layer” because that undersells the problem. The point is continuity: commitments, decisions, blockers, patterns, and next steps that Claude Code, Codex, ChatGPT, Cursor, and other MCP clients can read and write.
None of this is exotic. That is the point. The stack is valuable because it makes agent speed survivable.
The Gated Pipeline
Agents write code fast.
That includes bad code.
The pipeline exists because agent confidence is not evidence.
The branch shape is simple: main <- staging <- feat/* | fix/* | chore/*. Both main and staging are protected. No direct push. Every change lands through a PR with required checks.
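In practice that flow is a handful of commands. A minimal sketch with the GitHub CLI, assuming an authenticated clone; the issue number, branch name, and titles are placeholders, not from a real repo:

git switch -c feat/123-export-rate-limit staging
# ...implementation commits happen here, by me or by an agent...
git push -u origin feat/123-export-rate-limit
gh pr create --base staging --title "feat: rate limit headers on export" --body "Closes #123"

The required checks and the review gate do the rest; nothing reaches staging or main except through a PR that passes them.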
PRs stay small. My target is 200 lines or less. Bigger work gets stacked so each PR is reviewable on its own. An agent can work one layer while I review the one below, but the stack only works if each layer has a clear contract.
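Stacking looks almost the same, except the second layer branches off the first and its PR targets the first layer's branch instead of staging, so each diff stays small enough to review on its own. Branch names here are illustrative:

git switch -c feat/123-part-2 feat/123-part-1
git push -u origin feat/123-part-2
gh pr create --base feat/123-part-1 --title "feat: part 2 of export rate limit" --body "Stacked on the part-1 PR"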
GitHub Issues are the work packets. A good issue has context, acceptance criteria, priority, milestone, and links to related decisions. If an agent finds broader work while fixing a narrow bug, it opens a follow-up issue instead of smuggling scope into the PR.
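A work packet can be opened straight from the command line. Everything in this sketch is illustrative rather than copied from a real issue, including the label and milestone names:

gh issue create \
  --title "Rate limit headers missing on /api/export" \
  --label "priority:p2" \
  --milestone "v1.4" \
  --body "Context: the export route bypasses the shared limiter.
Acceptance criteria: headers present on all export responses, unit test added, limiter defaults unchanged.
Out of scope: changing limiter defaults; open a follow-up issue if that surfaces."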
GitHub Actions is the referee. Lint, type checks, unit tests, backend tests, contract tests, smoke tests, coverage rules, and environment checks run where they belong. In 3ngram, PRs get isolated Neon Postgres branches named ci-pr-{PR#}-{run_id} so tests do not share state. MCP contract tests run against a Dockerized Postgres with savepoint isolation so they roll back cleanly.
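The per-PR branch name is the least interesting part, but it shows the shape: derive it from the event, create the database branch, and hand its connection string to the test jobs. A hedged sketch of a run step inside a pull_request workflow, with the Neon provisioning call elided and PR_NUMBER assumed to be passed in from github.event.pull_request.number:

# GITHUB_RUN_ID is a default Actions variable; PR_NUMBER is supplied by the workflow
DB_BRANCH="ci-pr-${PR_NUMBER}-${GITHUB_RUN_ID}"
echo "Provisioning isolated Postgres branch: ${DB_BRANCH}"
# ...create the Neon branch here and export its connection string for the test steps...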
The exact checks change by repo, but the principle does not: the board can say ready, the agent can say done, and the PR body can look polished. None of that matters if the gate fails.
This is the practical lesson from PocketOS. A prompt can tell an agent not to do dangerous things. GitHub can make the dangerous path harder: protected branches, required reviews, CI gates, environment separation, and a public trail of what changed before it reaches production.
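Protected branches are usually clicked together in repo settings, but they can also be set through the REST API so the rule itself lives in a script. A sketch with the GitHub CLI, assuming OWNER/REPO placeholders and check names that match your own workflows:

gh api --method PUT repos/OWNER/REPO/branches/main/protection --input - <<'EOF'
{
  "required_status_checks": { "strict": true, "contexts": ["lint", "unit-tests", "contract-tests"] },
  "enforce_admins": true,
  "required_pull_request_reviews": { "required_approving_review_count": 1 },
  "restrictions": null
}
EOF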
Strict Where It Counts
Staging tolerates some noise. Main does not.
One example from my own workflow is a test governance rule called TG005. It checks that a test’s directory matches its marker, so a file under backend/tests/unit/ cannot carry pytest.mark.integration. On PRs to staging, TG005 can be advisory. On PRs to main, it runs strict.
That caught a real release issue. A test added in PR #3130 sat happily in staging CI for days, then failed strict mode when the release PR opened against main. The path and marker disagreed. I cut a small rename PR before release. Annoying, correct.
That is the gate doing its job.
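The core of a rule like TG005 fits in a few lines of shell. This is a simplified sketch of the idea, not the real implementation, and TG005_MODE is a variable invented for illustration:

# flag unit-test files under backend/tests/unit/ that carry an integration marker
violations=$(grep -rl "pytest.mark.integration" backend/tests/unit/ || true)
if [ -n "$violations" ]; then
  echo "TG005: unit-test path carries an integration marker:"
  echo "$violations"
  # advisory on PRs to staging, strict on PRs to main
  [ "$TG005_MODE" = "strict" ] && exit 1
fi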
The rhythm is simple: test new functionality on staging before merge, smoke test production after merge, and make main stricter than the place where active integration happens.
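The production smoke test does not have to be elaborate. Mine is closer to this than to a full suite; the URL is a placeholder:

curl -fsS https://app.example.com/healthz > /dev/null && echo "production healthy" || { echo "SMOKE TEST FAILED"; exit 1; }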
Multi-Agent Work
Parallel agents are useful only when the work has boundaries.
The pattern is: split a spec into independent tasks, put each agent in an isolated worktree, give each one a self-contained brief, and have each one stop when a PR is open. The parent session watches CI, fixes obvious failures, rebases dependent branches, and merges green PRs to staging.
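Setting that up is plain git. A sketch with three hypothetical tasks, each in its own worktree branched off staging:

git worktree add -b feat/task-a ../wt-task-a staging
git worktree add -b feat/task-b ../wt-task-b staging
git worktree add -b feat/task-c ../wt-task-c staging
# each agent gets one worktree path plus its brief, and stops once its PR is open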
This is where “tech lead for an AI team” stops being a metaphor. You are not only asking for code. You are deciding boundaries, order, dependencies, acceptance criteria, and what counts as proof.
The brief matters more than the model. It needs file paths, setup notes, acceptance criteria, commit and PR format, and a clear stopping point. Agents do not inherit the parent conversation, so vague context becomes bad work.
The gotchas are real. Stacked PRs can strand commits if PR B gets merged into PR A’s branch after A already landed on staging. Staging can move under dependent branches, and PR CI evaluates the merge ref, so a branch that looked green locally can fail in CI. Claude Code’s Bash sessions also do not persist cd between tool calls, so worktree agents need git -C <worktree-path> or a single chained command when committing.
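The worktree workaround is mundane but worth writing down. Both forms below do the same thing; the path is a placeholder:

# explicit path on every git call, so no cd needs to survive between tool calls
git -C ../wt-task-a add -A && git -C ../wt-task-a commit -m "feat: task a"
# or one chained command, so the cd and the commit run in the same shell
cd ../wt-task-a && git add -A && git commit -m "feat: task a"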
Those details are not the thesis. They are reminders that orchestration is a real engineering surface. Five agents in parallel means five code reviews, five CI runs, and five merge decisions. The orchestrator’s job is to be a reliable merge queue, not to be impressed by the volume of generated code.
The Second Reader
CI is necessary. It is not enough.
I run Codex Code Review as an independent reviewer on every PR. Same spec, same acceptance criteria, separate context. It does not matter whether the implementation came from Claude Code, Codex, or a background sub-agent. The review is a fresh pass against the diff and proof.
That catches a different class of problem: removed dependencies still imported by a route, lockfile mismatches that would break install, duplicate decorators that double-consume rate-limit budget, workflow branches that hide Playwright failures, regex linters reading string literals as comments, and CI conditions that skip the workspace they were supposed to audit.
The honest caveat: my sample is too small to claim Codex is a better reviewer than Claude Code. This is a process claim, not a model-ranking claim.
The bigger lesson is that two independent passes expose weak acceptance criteria. If both reviewers miss the same issue, I usually find the spec was thin. Better prompts are not the first fix. Better issue shape is.
sp3cmar: The Shared Playbook
sp3cmar is the part of the system that keeps the operating style consistent across tools.
It installs into both Claude Code and Codex from the same source:
sp3cmar install --ai claude && sp3cmar install --ai codex
The daily verbs are:
/sp3cmar-worktree plan N to split a spec into independent tasks.
/sp3cmar-implement to run a spec-to-PR loop.
/sp3cmar-review to run a structured review pass.
/sp3cmar-ship to lint, commit, push, and open a PR.
/sp3cmar-post-merge to reconcile issue, project, changelog, and memory state after merge.
/sp3cmar-done to debrief a session and check dirty state before close.
The useful part is not the command syntax. It is that the decisions are encoded: what a good PR body needs, when to split a branch, when to open a follow-up issue, when to stop and escalate, what “done” means, and what state must be captured before the session ends.
That matters because agent work fails when every session starts from zero. The same playbook makes cross-review meaningful. Claude Code and Codex may disagree about a patch, but they are judging it against the same operating rules.
3ngram: Shared Work State
Multi-agent work breaks if the work state only exists inside one chat transcript.
3ngram is how I keep that state portable. It gives every AI and agent I use the same context: what we decided, what is open, what is blocked, what should not be reopened, and what the next agent needs before touching the repo.
The records are typed, not just free-form notes. A record can be a commitment, decision, blocker, pattern, note, or preference. Records can be open or resolved, have due dates, show up in stale or overdue reports, and be retrieved semantically when a later agent needs them.
This is why I avoid leading with “memory layer”. Memory sounds passive, like a better notebook. The useful behavior is active continuity. A commitment opened in Claude Code on Monday can be seen by Codex during review on Wednesday. A decision captured while drafting in ChatGPT can guide a Claude Code implementation session later. A blocker found during review can become a GitHub issue, then land back in 3ngram as open work.
The wider category is moving toward this shape. Andrej Karpathy’s LLM Wiki gist is useful because it names a related move away from stateless RAG: raw sources stay immutable, the model maintains a compiled artifact, and a schema tells the model how to keep that artifact useful. His object is closer to a knowledge wiki. In my workflow, the object is cross-tool work state.
For agentic coding, that distinction matters. Source material is not enough. The next agent needs decisions, open commitments, blockers, release state, branch state, and caveats. The chat transcript can be evidence, but it should not be the operating state.
That is also where 3ngram becomes product-relevant without needing to dominate the article. I built it because every AI tool I used forgot too much between sessions. Shared context is the entry point. Follow-through is the value: surfacing stale commitments, carrying blockers across tools, reminding the next agent what already happened, and keeping work from being reopened because the chat window changed.
The Handoff Loop
The most underrated part of this stack is the handoff.
Agents are useful for bursts of work, but real projects span days. If the next session starts with “remind me where we were”, the system is leaking.
My rule is simple: do not leave important state only in the transcript. At the end of a session, /sp3cmar-done checks the dirty worktree, summarizes what shipped, extracts decisions, records open commitments, resolves completed items, and writes the next pickup point into 3ngram.
A good handoff is concrete. It does not say “finish this tomorrow.” It says which PR is open, what blocks it, what branch it targets, which files are intentionally dirty, which issues it closes, and which prior decisions matter. The next agent should not have to infer the noun.
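For illustration only, a handoff in that shape might read like this; every number, path, and name in it is invented:

(example handoff; all identifiers invented)
Open: PR #412 (feat/export-rate-limit into staging), blocked on the contract-test job.
Intentionally dirty: backend/app/limits.py carries an unstaged TODO for the follow-up issue.
Closes: #398. Do not reopen the limiter-defaults decision; it is recorded and resolved.
Next: fix the contract-test fixture, re-run CI, then /sp3cmar-ship.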
This is how work moves across tools. A follow-up found during Codex review can become a GitHub issue with acceptance criteria. The issue can be implemented by Claude Code. The release state can be captured into 3ngram. ChatGPT can later answer what is waiting, because the state was not trapped in the coding session.
The handoff is where retrieval becomes follow-through.
What Still Bites
The workflow works for me. That does not make it universal.
Agents still make mistakes I do not catch, especially on surfaces where no one has written tests and I do not have strong product taste yet. The gates catch what the gates are built to catch. They do not catch surprise.
Stacked PRs have foot-guns. Worktrees have foot-guns. CI can be green while the user experience is wrong. A second reader can miss the same problem as the first reader if the acceptance criteria are weak.
Codex versus Claude is not settled. I have seen Codex catch real bugs that implementation agents missed, but I do not treat that as a model leaderboard. It is evidence for independent review, not proof that one model is permanently better.
The dependency is real too. Drop me into a vanilla editor with no hooks, no shared work state, no review gates, and no orchestrator, and my output drops. That may be the wrong tradeoff for some teams.
The strongest disproof would be simple: if smaller manual workflows ship cleaner over time, or if maintaining the gates costs more than parallel agents save, then this is too much machinery.
So far, my experience points the other way.
The Shift
The bottleneck moved.
It used to be “can you write the code”. Now it is “can you design the gates and shared state that let agents work without turning speed into damage”.
If the gates are good, a model that is 80% right can be useful. If the gates are bad, the same model becomes a liability generator.
Taste still matters. Architecture still matters. Problem decomposition matters more than before, because you are directing work, not only doing it. Knowing which five PRs to stack, in what order, with what boundaries, so three agents can run in parallel without colliding, is the actual skill.
The payoff is concrete: I can move work through in a weekend that I would previously have planned as a multi-person sprint. Not because the agents are magic. Because the gates and shared work state let me trust more of the speed.
The durable object is not the chat window. It is the work state.