What Is Harness Engineering? The Framework Behind Reliable AI Agents

[Image: AI core wrapped in a structured harness showing planner, generator, and evaluator]

How Harness Engineering Makes AI Agents Actually Reliable

Kim Jongwook · 2026-03-29

TL;DR

[Image: Harness engineering layers contrasted with a solo AI agent]

Harness engineering is a way to design AI agents that structurally prevent their own common failures.
It separates planning, generation, and evaluation, inspired by GANs, to beat solo-agent performance by a wide margin.
Real systems with 51 agents, 85 skills, and 21 security hooks already automate content, CRM, and government tenders.
Security hooks, routing rules, and feedback loops block AI misuse of data and tools by design, not by hope.
As models improve, Harness systems become simpler, making this architecture a long-term productivity advantage.

How Harness Engineering Makes AI Agents Actually Reliable

Harness engineering is the missing layer between “cool AI demo” and “production system you can trust with real work.” Instead of assuming a single smart model will magically handle everything, it starts from the opposite assumption: the model will fail, and your job is to box those failures in with structure.

The source content here comes from a deep-dive talk backed by Anthropic’s own engineering blog and a real in-house system already running 51 agents, 85 skills, and 21 security hooks. When I compared this Harness-style approach to typical “just ask the model” workflows in my own projects, the gap in reliability and review effort was striking.

Key points covered below:

What Harness engineering is and why Anthropic formalized it
The two fundamental AI failure modes Harness is designed to fix
The three-stage Planner–Generator–Evaluator architecture
Real cost/quality data: solo agents vs Harness systems
How a 51-agent, multi-pipeline setup runs real business operations
The three core principles to start building your own Harness

What Is Harness Engineering and Why Does It Matter?

[Image: Context instability and self-evaluation bias illustrated for AI agents]

Harness engineering is an AI system design methodology that anticipates where a model will fail and compensates with structure, rules, and specialized agents. The term “Harness” evokes a saddle or safety belt: something that lets you safely channel a powerful force that would otherwise be hard to control.

Anthropic’s own description is telling:

“Every component in Harness is an assumption about what the model cannot do on its own.”

The design starts with model limitations, not model hype. Instead of believing “this LLM can code a whole app alone,” Harness assumes it will forget context, cut corners on evaluation, misuse tools, and touch data it shouldn’t. Then it builds guardrails, roles, and pipelines that make those failure modes structurally hard — or impossible.

Harness isn’t just prompt engineering or sprinkling tools into a chat. It’s a multi-agent orchestration system where different agents own different roles, each operating inside hard constraints: rules, hooks, and pipelines. Planning, building, and checking are separated into distinct processes.

In my own tests with multi-agent vs solo-agent flows, the most visible difference showed up in debugging. With Harness-style separation, I could pinpoint exactly which stage failed — spec, implementation, or testing — rather than staring at one monolithic answer and guessing where things went wrong.

For context, Anthropic’s concept is documented publicly on their engineering blog, alongside their descriptions of Claude models and tools such as Claude Code and Playwright MCP:

Anthropic blog: https://www.anthropic.com/research
Claude model overview: https://www.anthropic.com/claude

What Are the Two Fundamental Ways AI Agents Fail?

[Image: Planner, Generator, and Evaluator AI agents in a three-stage pipeline]

AI agent failure modes are recurring patterns of breakdown that appear when a single agent tries to handle complex, long-running work. The talk identifies two root causes: context instability and self-evaluation bias.

Context instability is the tendency of an AI to lose track of earlier work as a session grows. Anyone who’s used ChatGPT or Claude in long threads has seen this play out:

You remind it “We already discussed that earlier,” and it behaves like it’s the first time it’s heard it.

For small Q&A tasks, that’s annoying. For multi-step codebases or documents requiring dozens of incremental edits, it’s catastrophic. Forgotten constraints and earlier decisions pile up into subtle inconsistencies and regressions until the final artifact is internally broken.

Self-evaluation bias is even more dangerous. It appears when the same model that created the work is also asked to judge it. Anthropic’s engineer put it bluntly:

“I watched it identify legitimate issues, then talk itself into deciding they weren’t a big deal and approve the work anyway.”

This mirrors human psychology: most people overestimate the quality of their own writing or code during self-review. Large language models make it worse, because they are optimized to “make things sound okay” and are therefore exceptionally good at rationalizing away problems they created.

I’ve seen this firsthand in automated test generation: the model writes code and tests, then confidently declares everything passes — even when obvious edge cases are missing. Harness engineering breaks this loop by forcing different agents to plan, produce, and evaluate separately.

For broader context on LLM limitations and hallucinations, see:

OpenAI system card (general model failure modes): https://openai.com/research
Google DeepMind on evaluation bias: https://deepmind.google

How Does the Planner–Generator–Evaluator Architecture Work?

[Image: Multi-agent Harness system with routing, rules, and security hooks]

The Planner–Generator–Evaluator architecture is a three-stage multi-agent pattern inspired by GANs (Generative Adversarial Networks). In GANs, a Generator creates candidates while a Discriminator evaluates them — the two networks improve by competing. Harness applies the same separation to AI agents, then adds a Planner upfront.

At a high level:

Planner turns vague requests into detailed specs.
Generator implements those specs step-by-step.
Evaluator tests the implementation against strict criteria.

What Does the Planner Do?

The Planner converts short, fuzzy human requests into precise, structured plans and specifications.

Input: “Build a login feature.”
Output: a spec containing requirements, edge cases, security constraints, and non-functional expectations.

It behaves like a product manager plus systems designer combined. Forcing a separate spec-generation step surfaces hidden assumptions before any code is written. In my own workflows, this single change cut rework dramatically — disagreements about scope came up before implementation, not after.
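To make the idea of a structured spec concrete, here is a minimal Python sketch of what a Planner's output could look like as data. The class name `Spec`, its field names, and the `missing_sections` helper are assumptions made for this illustration; the talk does not describe the actual spec format.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    """Hypothetical Planner output; all field names are illustrative."""
    request: str
    requirements: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)
    security_constraints: list[str] = field(default_factory=list)
    non_functional: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of spec sections that are still empty."""
        sections = {
            "requirements": self.requirements,
            "edge_cases": self.edge_cases,
            "security_constraints": self.security_constraints,
            "non_functional": self.non_functional,
        }
        return [name for name, items in sections.items() if not items]


login_spec = Spec(
    request="Build a login feature.",
    requirements=["Email and password sign-in"],
    edge_cases=["Wrong password", "Unknown email"],
    security_constraints=["Rate-limit failed attempts"],
)

# In this sketch, the Generator would not start while any section is empty.
print(login_spec.missing_sections())  # ['non_functional']
```

Checking for empty sections before handing off is one simple way to surface hidden assumptions early, which is the point of having a separate planning step.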

What Does the Generator Do?

The Generator is an implementation agent that receives the Planner’s spec and turns it into working artifacts using sprint contracts — small, bounded units of work with clear acceptance criteria. Think agile sprints, but at a much finer granularity. For example:

Sprint 1: Implement backend API for login.
Sprint 2: Implement frontend form and validation.
Sprint 3: Integrate error handling and security checks.

Chunking work this way prevents the Generator from trying to swallow the whole app at once, which is exactly where solo agents tend to hallucinate or quietly drop requirements.
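A sprint contract can be pictured as a small record with a name and explicit acceptance criteria. The sketch below is one possible shape, assuming hypothetical names (`SprintContract`, `next_sprint`) and made-up criteria; it is not the structure used in Anthropic's or the speaker's system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SprintContract:
    """Hypothetical bounded unit of work handed to the Generator."""
    name: str
    # Maps each acceptance criterion to whether it has been verified.
    acceptance: dict[str, bool] = field(default_factory=dict)

    def is_done(self) -> bool:
        # A sprint with no criteria is never considered done.
        return bool(self.acceptance) and all(self.acceptance.values())


def next_sprint(sprints: list[SprintContract]) -> Optional[SprintContract]:
    """Return the first unfinished sprint so work proceeds one chunk at a time."""
    for sprint in sprints:
        if not sprint.is_done():
            return sprint
    return None


backlog = [
    SprintContract("Backend API for login", {"POST /login returns a session": True}),
    SprintContract("Frontend form and validation", {"Empty email is rejected": False}),
    SprintContract("Error handling and security checks", {"Lockout after repeated failures": False}),
]

print(next_sprint(backlog).name)  # Frontend form and validation
```

Because only one unfinished contract is handed out at a time, the Generator never sees the whole application as a single task.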

What Does the Evaluator Do?

The Evaluator is a separate agent responsible for running tests, inspecting results, and scoring quality. Critically, it can’t be the same agent — or the same role — as the Generator.

In Anthropic’s implementation:

The Evaluator uses Playwright MCP (a browser automation tool) to run real UI tests.
It applies explicit scoring criteria, including an aesthetic bar labeled “museum quality.”
If tests or criteria fail, it sends structured feedback back to the Generator.

This creates a feedback loop: Generator produces a version → Evaluator tests and scores it → if below threshold, Evaluator returns actionable feedback → Generator revises → cycle repeats until approval.
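The loop described above can be sketched in a few lines of Python. Everything here is a simplified assumption: the function names, the 0.8 threshold, the round limit, and the toy generator and evaluator stand in for real model calls and real Playwright-driven tests.

```python
from typing import Callable, Optional

Feedback = list[str]


def run_feedback_loop(
    spec: str,
    generate: Callable[[str, Feedback], str],
    evaluate: Callable[[str], tuple[float, Feedback]],
    threshold: float = 0.8,
    max_rounds: int = 5,
) -> tuple[Optional[str], int]:
    """Alternate generation and independent evaluation until approval.

    Returns (approved_artifact, rounds_used), or (None, max_rounds) if the
    artifact never reaches the threshold.
    """
    feedback: Feedback = []
    for round_no in range(1, max_rounds + 1):
        artifact = generate(spec, feedback)
        score, feedback = evaluate(artifact)
        if score >= threshold:
            return artifact, round_no
    return None, max_rounds


# Toy stand-ins for the two agents (hypothetical, for demonstration only).
REQUIRED = ("validation", "error handling")


def toy_generate(spec: str, feedback: Feedback) -> str:
    # Rebuild the artifact from the spec plus everything the evaluator asked for.
    return spec + "".join(f" + {item}" for item in feedback)


def toy_evaluate(artifact: str) -> tuple[float, Feedback]:
    # Score is the fraction of required items present; feedback lists the rest.
    missing = [item for item in REQUIRED if item not in artifact]
    return 1 - len(missing) / len(REQUIRED), missing


artifact, rounds = run_feedback_loop("login form", toy_generate, toy_evaluate)
print(rounds, artifact)  # 2 login form + validation + error handling
```

The key design choice is that `evaluate` is passed in separately from `generate`, so approval never comes from the code path that produced the artifact.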

One detail worth noting: simply adding the phrase “museum quality” into the Evaluator’s rubric measurably improved the visual and design polish of outputs. A single well-chosen constraint word can steer an entire system’s standards in ways that are hard to predict.

To connect the GAN analogy, see the original paper:

“Generative Adversarial Networks,” Goodfellow et al.: https://arxiv.org/abs/1406.2661

Solo Agent vs Harness: Which Delivers Real-World Quality?

Solo-agent development is where one AI agent receives the full request, plans the work, writes the code, and claims to test it — all in one conversational flow. Cheap and quick to set up. But for anything non-trivial, Anthropic’s data show the results often look done while being unusable.

Harness systems cost more in model time and orchestration. They actually ship.

Cost and outcome comparison

Below is a comparison from the reported experiments building a test app called “Retro 4G.”

Approach	Model / Version	Time	Cost	Result Quality
Solo Agent	Claude (single agent)	Not specified	$9	Core functionality broken; code not usable in practice
Harness v1	Claude Sonnet 4.5	6 hours	$200	16 features delivered across 10 sprints; test app fully working
Harness v2	Claude Opus 4.6	~4 hours	$120	~70% complete app in 3 dev + 3 QA cycles; service-level quality

The $9 solo agent run produced something that looked like a complete app. The core functionality was broken. Anyone who’s tried “vibe coding” with LLMs will recognize this pattern: large, impressive-looking files that fail basic smoke tests.

The Harness variants cost roughly 13x to 22x more in API spend ($120 and $200 against $9). But they shipped working software, created reliable test coverage and QA history, and did so under observable, repeatable sprints.

There’s another finding buried in the same data: as the underlying models improved, the Harness scaffolding could be simplified. When Claude Sonnet 4.5 arrived, explicit context-reset mechanisms became unnecessary. With Opus 4.6, even the sprint structure could relax, because the model could self-chunk work more intelligently.

This points to an important design philosophy:

Harness is not a fixed, heavy framework. It’s a living system that should get lighter as models get smarter.

I’ve seen this play out in my own deployments. Older models needed rigid step-by-step prompts; newer ones can handle broader tasks inside the same Harness, which means some rules and stages can be quietly retired over time.

How Does a 51-Agent Harness System Work in Practice?

A 51-agent Harness system is a real-world deployment where dozens of specialized AI agents cooperate under shared rules, hooks, and pipelines. This particular system was built independently before Anthropic’s public Harness blog — and converged on nearly the same architecture anyway.

How are the agents and skills organized?

The system currently includes:

51 agents (specialized by department and role)
85 skills (discrete capabilities each agent can invoke)
21 security hooks (blocking actions around data and tools)
90 rules (governing behavior and routing)

Agents are grouped similarly to an enterprise org chart: development, code review, and QA; business and strategy; marketing and creative; research and operations; investment, legal, finance, and HR.

On an “agent office” dashboard, different Claude models appear as org roles:

Claude Opus → department head (crown)
Claude Sonnet → manager (necktie)
Claude Haiku → intern

That visual metaphor matters more than it might seem. Treating models as team members with distinct seniority helps humans reason about where to apply which model, and how much authority to grant each one.

How Do Harness Pipelines and Security Hooks Keep AI Safe?

Harness pipelines are structured workflows that define how work moves from planning to execution to validation across multiple agents. Security hooks are rule-driven guards that intercept dangerous actions — especially around data access and tool usage.

What do the pipelines look like?

Multiple pipelines mirror real business functions: development, legal, CRM, and planning/strategy. Each one is like a set menu at a restaurant — a predefined sequence of steps and agents that must run in order.

In a development pipeline, for example:

Planner agent writes the spec.
Generator agent codes against that spec.
Evaluator agent runs tests with Playwright MCP.
Only on pass does the system mark the task complete.

A special “Tipification” rule enforces that Claude can’t say “Claude has completed this” until a separate verification section passes. This mechanically prevents self-evaluation bias: any completion statement is tied to an independent evaluator’s approval. The model can’t just declare victory.
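One way to picture such a completion gate is a function that refuses to produce the completion message unless independent checks have passed. This is a hedged sketch: `completion_message`, `VerificationRequired`, and the check names are hypothetical, and the real rule is enforced inside the speaker's system in a way the talk does not detail.

```python
class VerificationRequired(Exception):
    """Raised when a completion claim lacks independent approval."""


def completion_message(task: str, verification: dict[str, bool]) -> str:
    """Return a completion message only if every independent check passed.

    `verification` maps a check name to its result as reported by a
    separate evaluator, never by the agent that did the work.
    """
    failed = [check for check, passed in verification.items() if not passed]
    if not verification or failed:
        detail = ", ".join(failed) if failed else "no checks were run"
        raise VerificationRequired(f"Cannot mark '{task}' complete: {detail}")
    return f"Claude has completed this: {task}"


print(completion_message("login feature", {"ui_tests": True, "spec_review": True}))
```

An empty verification result is treated the same as a failure, so skipping the evaluator cannot produce a completion statement.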

Why are security hooks essential?

They emerged from real incidents.

In one early mistake, an AI agent was asked to analyze data without any hooks. It queried a Supabase database, and critical internal company data appeared directly in the terminal output. The AI had no sense of risk. It was simply being helpful.

To prevent repeats, the team added hooks like “Salary Guard,” which immediately blocks any attempt to access employee salary data. If an agent requests that data, the hook denies it and can trigger alerts.
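A hook of this kind can be sketched as a function that inspects a tool call before it runs. The code below is an illustration only: the pattern list, `HookDecision`, `salary_guard`, and `run_tool` are assumptions, not the actual "Salary Guard" implementation or any specific hook API.

```python
import re
from dataclasses import dataclass
from typing import Callable, Sequence

# Substring match on purpose: over-blocking is safer than missing
# a column such as "employee_salary".
SENSITIVE = re.compile(r"salar(y|ies)|payroll", re.IGNORECASE)


@dataclass
class HookDecision:
    allowed: bool
    reason: str = ""


def salary_guard(tool_name: str, tool_input: str) -> HookDecision:
    """Deny any tool call whose input mentions salary-related data."""
    match = SENSITIVE.search(tool_input)
    if match:
        return HookDecision(
            False, f"Salary Guard blocked {tool_name}: matched '{match.group(0)}'"
        )
    return HookDecision(True)


def run_tool(
    tool_name: str,
    tool_input: str,
    execute: Callable[[str], str],
    hooks: Sequence[Callable[[str, str], HookDecision]] = (salary_guard,),
) -> str:
    """Run every hook before the tool; the first denial stops execution."""
    for hook in hooks:
        decision = hook(tool_name, tool_input)
        if not decision.allowed:
            raise PermissionError(decision.reason)
    return execute(tool_input)


print(salary_guard("sql", "SELECT employee_salary FROM staff").allowed)  # False
```

The hook runs before `execute` is ever called, so a blocked query never reaches the database. Alerting could be added at the point where `PermissionError` is raised.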

Tool misuse was a separate problem. In this setup, Playwright is driven from Claude Code and is intended for testing against a local server. But the AI showed a consistent bias toward pointing it at external websites, which is outside that intended use. The fix was a routing rule:

URL is local (e.g., localhost) → use Playwright.
URL is external → route to a dedicated browser automation tool like Browser Use.

That’s a textbook example of what the speaker called “understanding the AI’s behavioral bias and correcting it with rules.” Dangerous behaviors become structurally hard to execute, rather than depending on the model’s vague sense of what’s appropriate.
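The routing rule itself is simple enough to express directly. The sketch below uses only Python's standard library; the function names and the returned tool labels (`"playwright"`, `"browser-use"`) are illustrative assumptions rather than the speaker's actual configuration.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_host(host: str) -> bool:
    """Return True for localhost names and loopback IP addresses."""
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # A regular domain name, so not local.


def pick_browser_tool(url: str) -> str:
    """Route local URLs to Playwright and everything else to an external-browsing tool."""
    host = urlparse(url).hostname
    if host is None:
        raise ValueError(f"Cannot route a URL without a host: {url!r}")
    return "playwright" if is_local_host(host) else "browser-use"


print(pick_browser_tool("http://localhost:3000/login"))   # playwright
print(pick_browser_tool("https://example.com/pricing"))   # browser-use
```

Because the decision is made in code from the parsed hostname, it does not depend on the model choosing the right tool on its own.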

For more on Supabase and secure data access:

Supabase docs: https://supabase.com/docs

How Does Agent Auto-Routing and Business Automation Actually Work?

Agent auto-routing maps incoming user requests to the most relevant specialist agent, without requiring the user to know which agent to call. It turns a swarm of 51 agents into a single coherent interface.

How does auto-routing work?

The system analyzes keywords in the request and context from previous interactions, then routes automatically:

“Predict next year’s VAT” → financial accountant agent
Contract questions → legal agent
Hiring or policy questions → HR agent
Code review or architecture critiques → review agent
IP and patent prompts → patent agent

All 51 agents sit on standby, and the router dispatches the right expert. From my own experiments with auto-routing, this changed how non-technical teammates used the system. They stopped needing to remember which agent name or command to use. They just asked.
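A very reduced version of keyword-based routing might look like the following. It covers only the keyword half of what the talk describes (the real router also uses context from previous interactions), and the agent labels, keyword sets, and `route` function are all hypothetical.

```python
import re

# Hypothetical keyword sets per specialist agent.
ROUTES = {
    "finance": {"vat", "tax", "invoice"},
    "legal": {"contract", "nda", "clause"},
    "hr": {"hiring", "policy", "leave"},
    "review": {"review", "architecture", "refactor"},
    "patent": {"patent", "ip", "trademark"},
}


def route(request: str, fallback: str = "general") -> str:
    """Pick the agent whose keywords overlap most with the request's words."""
    words = set(re.findall(r"[a-z]+", request.lower()))
    scores = {agent: len(words & keywords) for agent, keywords in ROUTES.items()}
    best = max(scores, key=scores.get)
    # With no keyword hit at all, hand off to a general-purpose agent.
    return best if scores[best] > 0 else fallback


print(route("Predict next year’s VAT"))           # finance
print(route("Can you review this architecture?"))  # review
```

Matching on whole words rather than substrings keeps a short keyword like "ip" from firing on unrelated words such as "description".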

What real-world pipelines run on this Harness?

Three flagship automation flows sit on top of this architecture:

YouTube content pipeline — Starts from a topic, runs research and outline, generates a script, produces synthetic voice, builds motion graphics and thumbnails, and uploads the final video. Multiple channels run in parallel. The thumbnail shown in the original talk came from this pipeline.
CRM automation — Incoming emails are auto-captured. A CRM manager agent classifies customers. A quote agent prepares pricing. A copywriter agent drafts the follow-up email. Humans mostly handle final review and approval.
Government support and bidding feed — Every morning, the system fetches new public tender announcements, scores each opportunity for company fit, summarizes deadlines and key requirements, and outputs a prioritized report.

Additional flows handle invoice checking, client follow-ups, and other repetitive work that runs overnight. All of them rest on the same foundation: rules and hooks as safety net, pipelines as workflow skeleton.

How Is an AI Team Managed and Measured with GitHub and Claude Code?

AI team productivity in this context is measured by combining GitHub commits with Claude Code usage logs. GitHub becomes the canonical record of AI-assisted work; Claude Code is the primary interface for doing the work.

How does this replace traditional time tracking?

When an employee starts using Claude Code, that moment acts as “clock-in.” As they work, Claude Code automatically documents and commits outputs to GitHub. When they record “clock-out,” the system can reconstruct all commits, docs, and changes; which agents and pipelines were used; and the intensity and scope of work.
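The reconstruction step amounts to selecting the commits that fall between clock-in and clock-out. Here is a minimal sketch that works on in-memory records; the `Commit` class, the sample data, and `commits_in_session` are invented for illustration, and a real system would read this history from GitHub or the local repository instead.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Commit:
    """Hypothetical commit record; a real one would come from the Git history."""
    sha: str
    authored_at: datetime
    message: str


def commits_in_session(
    commits: list[Commit], clock_in: datetime, clock_out: datetime
) -> list[Commit]:
    """Return commits authored between clock-in and clock-out, oldest first."""
    if clock_out < clock_in:
        raise ValueError("clock_out must not be earlier than clock_in")
    in_window = [c for c in commits if clock_in <= c.authored_at <= clock_out]
    return sorted(in_window, key=lambda c: c.authored_at)


# Made-up sample data for the demonstration.
history = [
    Commit("c3", datetime(2026, 3, 27, 16, 40), "Add QA notes for login flow"),
    Commit("c1", datetime(2026, 3, 27, 8, 15), "Yesterday's leftover fix"),
    Commit("c2", datetime(2026, 3, 27, 10, 5), "Draft login spec"),
]

session = commits_in_session(
    history, datetime(2026, 3, 27, 9, 0), datetime(2026, 3, 27, 18, 0)
)
print([c.sha for c in session])  # ['c2', 'c3']
```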

Manual time sheets, separate daily logs, and status report documents all become redundant. Managers can see team-wide activity and velocity, recent decisions, project updates, and current workload from a single dashboard.

Tying AI output to Git-style history has a side benefit that often gets overlooked: you get built-in version control and review for AI work. That’s far safer than ephemeral chat logs, which disappear and leave no audit trail.

GitHub is used here because it’s a natural sink for code, documents, and structured artifacts. A similar approach could work with other version control or document systems, but GitHub integrates cleanly with tooling like Claude Code:

GitHub docs: https://docs.github.com

How does this empower solo founders?

With this structure, a one-person company can operate like a multi-department organization. A single CEO can “employ” development, review, business, marketing, creative, research, operations, investment, legal, finance, and HR agents — delegate specialist work to each, and orchestrate them via pipelines and dashboards.

According to the talk, around 70% of Fortune 100 companies already use Claude Code, and large domestic enterprises are reportedly close to adoption after security review. On that account, this style of Harness-based system isn’t speculative. It’s an emerging operating model for both enterprises and advanced solo operators.

What Are the 3 Core Principles of Harness Engineering and How Do You Start?

Harness engineering principles are three foundational rules for designing AI systems that are reliable, safe, and improvable. Once understood, they give anyone — developer or not — a blueprint for designing robust AI workflows.

Principle 1: Separate planning, making, and checking

One agent plans. A different agent produces. A third evaluates.

“If the same entity makes and evaluates, self-bias is inevitable.”

This mirrors GANs: generators are never their own discriminators. Applying this principle alone can dramatically increase trust in outputs compared to a solo agent that handles everything in one pass.

Principle 2: Block “must-not-do” actions with rules

Forbidden behaviors need to become hard rules and hooks — not suggestions.

No guardrails means no brakes. AI will happily access sensitive data if asked, and use inappropriate tools if nothing stops it. Security hooks like “Salary Guard” and routing rules for Playwright vs browser tools codify the difference between allowed and forbidden behavior. The system doesn’t rely on the model’s vague sense of “safety.” It relies on code.

Anthropic and others have written extensively about guardrails and policy enforcement around LLMs:

Anthropic’s Constitutional AI overview: https://www.anthropic.com/news/constitutional-ai

Principle 3: Build feedback loops

Make something. Test it. Fix what failed. Retest. Repeat.

Not a new idea in software engineering — but Harness makes it explicit at the agent level. The Evaluator’s cycles with the Generator aren’t optional QA. They’re the core engine of quality.

Even a simple version of this works: run tests, parse failures, ask the model to fix only the failing sections, rerun tests. That loop alone can push AI-generated code from “demo” to “mergeable.”

Can non-developers use Harness?

Yes. Once the three principles are clear, designing a Harness is mostly process design:

Define roles: planner, doer, checker.
Write rules for what AI must never do.
Specify feedback cycles and acceptance criteria.

The talk reports that non-developers have already built dashboards based on these ideas, automated around 80% of their company workflows with agents, and adopted Claude Code as a default working environment. The entry point is clearer than most people assume.

Comparison Table: Solo Agent vs Harness Engineering

Option	Key Features	Pros	Cons
Solo Agent	Single AI handles planning, generation, and evaluation	Cheap, fast to set up, minimal orchestration	Context instability, self-evaluation bias, brittle outputs, often unusable for production
Harness Engineering	Planner–Generator–Evaluator, pipelines, security hooks, auto-routing	Production-grade quality, safer data usage, clear workflows, scalable across departments	Higher API cost, more complex to design, requires process thinking

Frequently Asked Questions

Q: What is Harness engineering in AI systems?

A: Harness engineering is a design methodology that assumes AI models will fail on their own and compensates with structure, rules, and multi-agent workflows. It separates planning, generation, and evaluation, and adds security hooks and pipelines to make reliable outputs more likely. The approach was articulated by Anthropic and validated in real systems with dozens of agents and skills.

Q: How does Harness fix context instability and self-evaluation bias?

A: Harness fixes context instability by breaking work into structured plans and sprint-sized tasks that agents can handle without losing track. It tackles self-evaluation bias by ensuring the agent that created an output never approves it — a separate Evaluator agent tests and scores results, forming a feedback loop with the Generator until quality thresholds are met.

Q: Is a Harness system really worth the extra cost compared to a solo agent?

A: The reported experiments show a solo agent spent $9 to produce an app whose core functions were broken — effectively worthless. Harness versions spent $120–$200 but produced working, service-level software with multiple features and QA cycles. The higher API cost buys reliability and saves human debugging time, which tends to be far more expensive anyway.

Q: How many agents, skills, and rules does a real Harness deployment use?

A: The showcased deployment runs with 51 specialized agents, 85 skills, 21 security hooks, and 90 behavior rules. These are grouped by business function — development, legal, marketing, finance, HR — and coordinated through pipelines and auto-routing. At that scale, end-to-end automation becomes viable for content production, CRM, and government tender monitoring.

Q: Can non-developers build a Harness-style AI system?

A: Yes, because the core of Harness is process design rather than low-level coding. Non-developers can define roles (planner, maker, checker), write clear “must-not-do” rules, and design feedback loops — then use AI to assist with implementation details. There are already cases where people without coding backgrounds have automated the majority of their company’s work using this approach.

Conclusion

Harness engineering is a mindset shift from “hope the AI gets it right” to “assume it will fail and design around that.” Separate planning, making, and checking. Enforce rules through security hooks. Institutionalize feedback loops. Done consistently, these three moves turn raw model power into something you can actually trust with real work.

The real-world deployment (51 agents, 85 skills, 21 security hooks, 90 rules) shows this isn’t theoretical. It’s running content pipelines, CRM flows, and government tender monitoring while Fortune 100 companies adopt tools like Claude Code at scale.

As models advance, Harness systems will shed complexity and drift toward invisible infrastructure — something that simply keeps everything safe and coherent in the background. The gap will grow between people who merely prompt an AI and those who can architect a Harness. The latter will have the leverage — not because their models are better, but because their structures are.

One response to “Harness engineering for reliable AI agents | 2026”

ProductiveTechTalk

March 29, 2026 at 3:12 pm

I really liked this line: “Every component in Harness is an assumption about what the model cannot do on its own.” That mindset flip—from “look how smart the model is” to “plan for where it breaks”—feels like the only sane way to build real systems. It reminds me of how we design production backend services: you assume timeouts, retries, bad inputs, not perfection. Curious how you decide when a new failure mode deserves its own agent vs just a tighter rule or hook.

Source: https://www.youtube.com/watch?v=ZpdPG8128Vs
