Code Review Is Becoming Work Review

The code is no longer the only thing that needs review.

That sentence sounds wrong if you grew up inside pull requests.

I did. I still care about readable diffs, small patches, tests, comments that explain the non-obvious parts, and reviewers who actually think. Code review has been one of the best social inventions in software engineering because it put a human pause between private certainty and shared production reality.

But the economics changed.

Agents can now generate more code than humans can responsibly read. Not someday. Now. The new bottleneck is not typing. It is not even implementation. It is deciding whether the work was the right work, constrained in the right way, verified with enough evidence, rolled out with enough care, and owned after it lands.

So code review is becoming work review.

That does not mean “stop reading code.” It means line-by-line review is no longer the default scarce human act. The scarce act is judgment over the whole unit of work.

TL;DR

Agents increase code volume faster than human review capacity.
Reading every generated line is not a strategy. It is a coping mechanism with a time limit.
Humans should move more review upstream and outward: intent, constraints, acceptance criteria, permissions, verification, rollout, observability, rollback, and ownership.
Code still needs human review when the blast radius is high: auth, payments, data loss, security, infrastructure, migrations, irreversible operations.
Agents can review syntax, style, test gaps, common bugs, consistency, and even adversarial cases. Humans still decide what matters, what risk is acceptable, and what proof is enough.
The next review culture will reward engineers who can specify, constrain, verify, and operate work, not engineers who heroically skim giant diffs at 11 p.m.

The line-by-line model is breaking

The old code review model assumed a manageable amount of code.

A developer works on a change. They open a PR. Another developer reads the diff, leaves comments, asks for tests, checks for weird edge cases, and eventually approves. It is a decent system when the unit of work is human-paced and the reviewer has enough context to understand it.

That assumption is rotting.

When agents join the workflow, output volume jumps. A single engineer can run multiple implementation sessions in parallel. One agent fixes tests while another adds an endpoint while a third refactors a component. Even without full autonomy, the number of diffs per engineer rises. The reviewer does not get a second brain for free. They get a queue.

Latent.Space’s “How to Kill the Code Review” makes the uncomfortable version of this argument: teams already struggled to keep up with human-written review queues, and AI increases both the number and size of changes. The proposed answer is to move human review upstream to specs, plans, constraints, acceptance criteria, verification rules, permissions, and adversarial checks.

I agree with the direction, but I would phrase it less theatrically.

Code review is not dead. It is being demoted.

The diff is still evidence. It is just not the whole trial.

If a reviewer opens a 900-line agent-generated PR and starts scrolling for vibes, the process has already failed. The question should have been asked before the code existed: what problem is this solving, what is allowed to change, what must not change, how will we prove it works, how will we reverse it, and who owns the outcome?

Without those answers, line comments are busywork with a moral costume.

Code review checks the artifact. Work review checks the job.

Code review asks: is this code acceptable?

Work review asks: was this work defined, bounded, verified, shipped, and owned correctly?

That can sound like vague process language. It is more concrete than that. A good work review can be run from a PR, issue, design doc, deployment plan, CI run, eval report, production dashboard, and rollback note. The code is one artifact among several.

The unit of review moves from “this diff” to “this change in the world.”

For an agent-generated migration that touches billing records, the code is the least interesting part to review first. I want to know the invariant, the dry run result, the backup plan, the exact permission boundary, the affected accounts, the idempotency story, the rollback path, the monitoring query, and the person who will watch it after deploy. Then I will read the code.

Humans are bad at evaluating risk after they have been hypnotized by implementation detail. Agents make this worse because they can produce plausible code at industrial speed. Plausible is dangerous. Plausible invites the reviewer to ask “does this look right?” instead of “what would make this unacceptable?”

Work review starts with unacceptable.

The checklist I want before I trust the diff

Here is the checklist I want for agent-heavy work. Not every item needs a ceremony. Every item needs an answer proportional to risk.

Problem. What user, system, or business problem is this change supposed to solve? “Refactor X” is not enough unless the problem is maintainability and the acceptance criteria make that concrete.

Constraints. What is allowed to change? What is explicitly out of scope? What files, services, dependencies, data shapes, APIs, and behaviors must remain stable?

Acceptance criteria. What observable facts make the work done? These should be specific enough that another agent, another engineer, or CI can verify them without reading the author’s mind.

Risk class. Is this disposable, reversible, internal, user-facing, regulated, financial, security-sensitive, destructive, or infrastructure-level work? A prototype and a payment capture path should not pass through the same review gate.

Permissions. What could the agent touch? What required escalation? Did it modify dependencies, auth, secrets, database schemas, CI, deployment config, production data, or permissions?

Tests and evals. What deterministic checks prove the core behavior? Unit tests, integration tests, regression evals, contract tests, smoke tests, fixture comparisons, golden files. The names matter less than the evidence.

Rollout. How does this reach users? Feature flag, staged deploy, migration window, beta cohort, percentage rollout, config switch, internal-only release. “Merge equals ship” is sometimes fine. It should be a decision, not a default.

Observability. What will tell us it worked or failed? Logs, metrics, traces, alerts, dashboards, support tags, queryable events, error budgets. If you cannot observe the outcome, you are not done reviewing the work.

Rollback. How do we undo it? Revert commit, disable flag, restore backup, replay events, run inverse migration, drain queue. If rollback is hard, review gets stricter.

Ownership. Who is responsible after merge? Not who typed the prompt. Who watches the deploy, answers the incident, handles the follow-up, and updates the docs when reality disagrees with the plan?

That checklist is the review.

The diff supports it.
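
One way to make "an answer proportional to risk" concrete is to treat the checklist as a record and let the risk class decide how many fields must be filled in. The following is a minimal sketch, not a prescribed tool: the `WorkReview` type, the `HIGH_RISK` set, and the choice of which four items are always required are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields

# Hypothetical tiering: which risk classes demand a full checklist is an assumption.
HIGH_RISK = {"financial", "security-sensitive", "destructive", "infrastructure"}

# Assumed minimum for lower-risk work: these items always need an answer.
ALWAYS_REQUIRED = ("problem", "acceptance_criteria", "risk_class", "ownership")


@dataclass
class WorkReview:
    """One free-text answer per checklist item; empty means unanswered."""
    problem: str = ""
    constraints: str = ""
    acceptance_criteria: str = ""
    risk_class: str = ""
    permissions: str = ""
    tests_and_evals: str = ""
    rollout: str = ""
    observability: str = ""
    rollback: str = ""
    ownership: str = ""


def missing_answers(review: WorkReview) -> list[str]:
    """Return the checklist items still unanswered for this risk class."""
    if review.risk_class in HIGH_RISK:
        required = [f.name for f in fields(review)]  # every item, in checklist order
    else:
        required = list(ALWAYS_REQUIRED)
    return [name for name in required if not getattr(review, name).strip()]
```

A CI step or a reviewing agent could call `missing_answers` on the PR's metadata and refuse to proceed while the list is non-empty. The point of the sketch is the shape: the gate scales with risk instead of applying one ceremony to everything.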

Test-first becomes less optional

The most practical shift is test-first work.

Akash Bajwa’s write-up of a 2026 Anthropic engineering roundtable, “The Future Of Software Engineering with Anthropic”, describes teams defining test cases before agents implement, using regression evals that must stay green and frontier evals for new capabilities. It also captures the awkward middle state of code review: some teams already see human review turning into a quick approval because AI review catches enough low-level issues, while people still disagree on when that is safe.

That discomfort is useful. It is the sound of a practice being re-priced.

If an agent writes the implementation first, then writes tests that confirm its own implementation, you have not created proof. You have created a courtroom where the defendant drafted the law.

The better pattern is to define the behavior before the implementation run. Human writes or approves the acceptance criteria. Agent writes code against them. CI runs deterministic checks. A separate reviewer, human or agent, attacks the result. The important thing is separation. The generator should not be the only judge of correctness.
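
The separation described above can be partly enforced mechanically. A minimal sketch, assuming the acceptance tests live in files that a human approves before the implementation run: record a fingerprint of those files at approval time, then check after the run that the agent did not quietly rewrite them. The function names and the example path are hypothetical.

```python
import hashlib


def fingerprint(files: dict[str, bytes]) -> str:
    """Hash acceptance-test files (path -> content) in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator so path and content cannot blur together
        digest.update(files[path])
        digest.update(b"\0")
    return digest.hexdigest()


def criteria_unchanged(approved: str, files_after_run: dict[str, bytes]) -> bool:
    """True if the tests the agent was judged against are the ones a human approved."""
    return fingerprint(files_after_run) == approved


# Recorded when the human approves the criteria, before the agent runs.
approved = fingerprint({"tests/test_refund.py": b"assert refund(10) == 10\n"})
```

This does not prove the tests are good. It only proves the generator did not edit the law after reading the verdict, which is the narrow property the pattern depends on.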

Long-horizon tasks make this separation matter more, because more decisions pile up between human checkpoints. The review artifact is not “look, lots of code.” It is “here is the trail of decisions and proof.”

Small PRs still matter

Work review is not an excuse for giant diffs.

Chris Roth’s “Building An Elite AI Engineering Culture In 2026” points at the review bottleneck created by AI-generated code and argues for stacked PRs, AI first-pass review, and humans acting more like editors and architects than line-by-line gatekeepers. That matches my experience.

Small PRs are still one of the cheapest quality controls we have.

Not because humans should read every token. Because small changes make work review possible. Scope is legible. Risk is bounded. Rollback is plausible. Ownership is clear. CI failure points to something specific. A second agent can review it without needing to reconstruct a week of wandering context.

Large PRs hide weak thinking. They let agents bury decisions in volume. They turn reviewers into archaeologists.

Stacked PRs are not just a git technique. They are a review design pattern. Each PR should carry one claim: this work changes this part of the system, for this reason, with this proof, under this risk class. If the claim is too big to state simply, the PR is probably too big to review honestly.

What humans should still read

There is a lazy version of this argument that says humans should stop reading code entirely.

No.

That is not engineering. That is abdication with a productivity dashboard.

Humans should still read code when the change carries serious blast radius or when the code itself is the only faithful representation of the risk. I want human eyes on auth, permissions, payments, billing, privacy, encryption, data deletion, migrations, infrastructure, deployment pipelines, concurrency, dependency upgrades, public APIs, and anything irreversible.
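
A rough first pass at routing can be done from file paths alone. The sketch below is an illustration under stated assumptions: the glob patterns and area names are hypothetical, every repository would need its own, and a path match is only a trigger for human review, not a substitute for judging the risk.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; a real repository would define its own.
# Note that "*" in fnmatch also matches "/" characters.
HUMAN_REVIEW_PATTERNS = {
    "auth": ["*/auth/*", "*/permissions/*"],
    "payments": ["*/billing/*", "*/payments/*"],
    "migrations": ["*/migrations/*"],
    "infrastructure": ["infra/*", ".github/workflows/*", "Dockerfile"],
    "dependencies": ["*.lock", "package-lock.json", "requirements*.txt"],
}


def human_review_reasons(changed_files: list[str]) -> dict[str, list[str]]:
    """Map each sensitive area to the changed files that triggered it."""
    reasons: dict[str, list[str]] = {}
    for area, patterns in HUMAN_REVIEW_PATTERNS.items():
        hits = [
            path for path in changed_files
            if any(fnmatchcase(path, pattern) for pattern in patterns)
        ]
        if hits:
            reasons[area] = hits
    return reasons
```

An empty result means no pattern fired, not that the change is safe. Concurrency bugs and public API drift, for example, rarely announce themselves in a file path.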

I also want human eyes when the agent is operating in a weakly specified area. If the acceptance criteria are vague, the architecture is unsettled, or the system has hidden institutional rules, code review becomes discovery. That discovery should feed back into better specs and better guardrails, but skipping it would be foolish.

There is another category: taste.

Agents can imitate local patterns. They can often improve ugly code. They can enforce style. But product taste, API taste, domain taste, and operational taste still need humans. “This endpoint technically works but teaches the wrong mental model” is a human review comment. “This abstraction is clean but points the team in the wrong direction” is a human review comment. “This feature is correct and still not worth shipping” is the most human review comment of all.

The goal is not less human judgment.

It is less human attention wasted on the wrong layer.

What agents can review

Agents are already useful reviewers.

They can compare a diff to a spec. They can find missing tests. They can check naming consistency. They can flag obvious security smells. They can inspect whether public APIs changed. They can run local commands. They can summarize risk. They can generate adversarial test cases. They can compare two implementations. They can ask why a migration has no rollback. They can notice that a PR says “no dependency changes” while the lockfile changed.
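
The last check in that list is simple enough to show. A minimal sketch, assuming the PR description is plain text and the changed file list is available: the lockfile names and the exact phrase matched are assumptions, and a real check would need a broader set of both.

```python
import re

# Hypothetical set of lockfile names; extend for the ecosystems in use.
LOCKFILES = {"package-lock.json", "yarn.lock", "poetry.lock", "Cargo.lock", "go.sum"}


def contradicts_dependency_claim(description: str, changed_files: list[str]) -> list[str]:
    """Return changed lockfiles if the description claims no dependency changes."""
    claims_none = re.search(r"\bno dependency changes\b", description, re.IGNORECASE)
    if not claims_none:
        return []
    # Compare on the file name only, so nested lockfiles are caught too.
    return [path for path in changed_files if path.rsplit("/", 1)[-1] in LOCKFILES]
```

A non-empty result is a finding with evidence attached, which is the kind of output an agent reviewer should produce: a specific contradiction a human can confirm in seconds.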

That is real value.

But agent review has a trap: it can make weak work look processed.

A beautiful AI review summary does not mean the work is safe. It means one more model produced one more artifact. The question is whether the review was grounded in the right contract. If the spec is mush, the agent will review against mush. If the acceptance criteria miss the real risk, the agent may confidently validate the wrong thing.

So I use agents as reviewers, but I do not outsource accountability to them. I want them to be annoying, literal, tireless, and separate from the implementation context. I want them to produce findings and evidence. I do not want them to decide that the business risk is acceptable.

Agents can review consistency.

Humans own consequence.

My working rule

My current rule is simple: review less generated text, review more intent and proof.

In my own agent workflow, GitHub is the ledger. Issues define the work. PRs carry the evidence. CI enforces the boring checks. A second reader reviews the diff and acceptance proof. Shared context, including tools like 3ngram, keeps decisions and handoffs from living only in a chat transcript. None of that is magic. It is just an admission that agentic work needs an operating system around it.

The more code agents generate, the less I trust review processes that only stare at code.

I want the issue to say what problem we are solving. I want the prompt or plan to say what the agent is allowed to touch. I want the PR to name the risk class. I want tests and evals that exist because of the acceptance criteria, not because the agent needed something green. I want rollout and rollback to be explicit. I want observability before confidence. I want ownership after merge.

Then, for risky changes, I want the code too.

This is the part many teams will get wrong. They will keep the old review ritual, add AI code generation underneath it, and wonder why reviewers are exhausted. Or they will swing too far the other way, let agents approve agents, and call it modern. Both are weak.

The strong version is more demanding.

It asks engineers to become better spec writers, risk classifiers, verification designers, release operators, and owners. It treats code as an artifact of work, not the entire work. It makes agents faster without letting their speed define the standard.

The future reviewer is not the person who reads the most generated lines.

It is the person who knows which lines matter, which proof matters more, and when the right answer is “do not ship this yet.”

Review less generated text. Review more intent and proof. That is where the human job is moving.