The Agent Harness: Infrastructure That Makes LLM Agents Actually Work

When people compare agent benchmarks, they usually talk about the model. But in practice, two teams running the same LLM can get very different results—because what wraps the model matters.

That wrapper is the Agent Harness: the runtime infrastructure that turns model outputs into sustained, long-horizon behavior. Think Claude Code, OpenClaw, or the scaffolding behind any agent that runs for hours instead of one chat turn.

Zhongwei Xie led our new survey on this topic—A Survey on AI Agent Harness—with collaborators at HKUST. We map 150+ papers and production systems, and release a living bibliography at Awesome-Agent-Harness.

A useful mental model:

Agent = Model (stochastic intelligence) + Harness (deterministic infrastructure)

The model proposes; the harness schedules, remembers, connects to tools, and decides what is allowed.

Four layers, one stack

We organize the harness as four outward-expanding layers:

Layer 1 — Execution & Orchestration
The control loop: when to call the model, which tool or sub-agent runs next, how to retry after failure, when to stop. This covers model routing, multi-agent coordination, and resilience primitives like checkpointing and fallback.

Layer 2 — Context & Trajectory Management
What persists across turns: context compression, agent memory stores, trajectory logging, and observability. As trajectories grow, irrelevant history causes context rot—reasoning degrades even when the model itself is capable.

Layer 3 — Interaction Surface & Execution Environment
How the agent touches the world: function calling, MCP servers, browsers, code interpreters, and sandboxes. LLMs speak tokens; this layer translates intent into concrete state changes.

Layer 4 — Constraints & Guardrails
What the agent is not allowed to do: access control, permission scoping, prompt injection defense, and post-hoc auditing. Higher autonomy means higher stakes—this layer does not shrink as models improve.

The insight that reframed our thinking

The relationship between model and harness is asymmetric co-evolution.

When models gain longer context or native tool use, some Layer 1–3 scaffolding becomes obsolete—simple routing templates, basic wrappers, short-horizon memory tricks. The harness should be built to delete: modular enough that obsolete parts can be retired as capabilities move into the model.

Layer 4 is different. More capable models often get broader action spaces—more tools, more privileges, more exposure to external systems. Guardrails are less likely to be absorbed; they become more central, not less.

A second finding hit us during the literature review: the harness is a hidden confound in evaluation. Most benchmarks score the combined model–harness system. Change the retry logic, memory policy, or tool schema and the same model can look much better—or much worse. We may be measuring harness engineering as much as model capability. Future evaluation needs cross-harness robustness: does the model still succeed under different orchestration, memory, and tool interfaces?

Why this matters now

Agent research has exploded across reasoning, memory, and tool use—but usually one mechanism at a time. The harness is where those pieces must actually compose: durable execution across API failures, inspectable trajectories for debugging, intent-scoped permissions for safety.

If operational grounding (our companion survey) asks what knowledge an enterprise agent needs, the Agent Harness asks what runtime turns that knowledge into reliable action over time.

Paper: A Survey on AI Agent Harness (PDF)

Curated resources: HKUST-KnowComp/Awesome-Agent-Harness

Lead author: Zhongwei Xie (HKUST)

Contributions welcome on the awesome list—we are especially tracking new work on cross-harness evaluation and durable, governable execution.