8news

Tech • IA • Crypto


Harness Engineering Is AI’s New Gold Rush

AI • AI Revolution • June 7, 2026 at 11:31 PM • 13:06

TL;DR

The AI industry is shifting focus from improving models to building better “harnesses,” with research showing system design can boost the same model’s performance by up to sixfold.

KEY POINTS

Rise of Harness Engineering

Major AI players are increasingly emphasizing harness engineering, a concept describing the full system surrounding a model. This includes tools, memory, permissions, verification layers, and workflows that guide how AI operates. The shift reflects a move away from one-off prompt optimization toward building reliable, repeatable systems.

Performance Gains Without New Models

A joint study by Stanford University and Tsinghua University found that identical models can vary in effectiveness by as much as 6x depending on their harness design. This suggests competitive advantage is moving from model capability to system architecture.

From Prompting to Systems Design

Prompt engineering focuses on improving a single interaction, while harness engineering aims to prevent entire classes of errors. The approach prioritizes long-term reliability by embedding checks, fallback paths, and feedback loops into the system rather than relying on retries.

Industry-Wide Adoption

Companies including OpenAI, Anthropic, and LangChain are already implementing harness-like systems. In large-scale coding workflows, OpenAI processed roughly 1 million lines of code and 1,500 pull requests in five months, highlighting a shift toward AI-assisted system orchestration rather than manual output generation.

Economic Potential vs. Slow Adoption

Despite Goldman Sachs projections that generative AI could add 7% to global GDP over a decade, adoption remains limited. By April 2024, only 4% of U.S. firms had deployed generative AI, and even in information services the figure was just 16%, indicating that access to models is not the primary bottleneck.

Agentic AI Raises Complexity

Unlike chatbots, AI agents must act over time, interacting with files, APIs, and environments. Their performance depends not just on reasoning but on system layers such as orchestration loops, memory, and safety checks, making harness quality critical.

Context Management Challenges

Larger context windows do not guarantee better performance. Systems face “context rot,” where useful information is buried under noise. Advanced setups now summarize, filter, and selectively expose data, sometimes limiting outputs to small previews before deeper analysis.
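
The "small preview first" idea can be sketched as follows. This is a minimal illustration, not the implementation of any particular product: the function name `preview_tool_output`, the returned dictionary keys, and the spill directory are all assumptions, and the 8 KB default simply mirrors the figure mentioned in the transcript.

```python
import tempfile
from pathlib import Path

PREVIEW_BYTES = 8 * 1024  # figure quoted in the transcript; a tunable assumption here


def preview_tool_output(output: str, spill_dir: Path, limit: int = PREVIEW_BYTES) -> dict:
    """Return small outputs unchanged; spill large ones to disk and return a preview.

    Hypothetical helper: the dict keys are illustrative, not a real harness API.
    """
    data = output.encode("utf-8")
    if len(data) <= limit:
        return {"text": output, "truncated": False, "path": None, "total_bytes": len(data)}

    # Keep the full output on disk so the agent can read more of it on demand.
    spill_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.NamedTemporaryFile(dir=spill_dir, suffix=".log", delete=False) as fh:
        fh.write(data)
        path = fh.name

    # Only the first `limit` bytes go into the model's context.
    preview = data[:limit].decode("utf-8", errors="ignore")
    return {"text": preview, "truncated": True, "path": path, "total_bytes": len(data)}
```

The model sees only `text`; when `truncated` is true, the harness can offer a follow-up tool that reads further ranges from `path`.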

Risks of Faulty Memory

Persistent memory can introduce errors when outdated information is treated as current. This “stale but confident” problem has led to designs where memory is treated as a hint and must be verified against real-time data before actions are taken.
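
One way to picture "memory as a hint" is to attach a live check to every remembered claim and re-run it before acting. The sketch below assumes a hypothetical `MemoryNote` type and `usable_notes` helper; neither comes from the paper or from any named system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MemoryNote:
    claim: str                      # what the agent remembers
    still_true: Callable[[], bool]  # hypothetical probe of the live environment


def usable_notes(notes: list[MemoryNote]) -> tuple[list[str], list[str]]:
    """Split remembered claims into (verified, stale) by re-checking each one now."""
    verified, stale = [], []
    for note in notes:
        try:
            ok = note.still_true()
        except Exception:
            ok = False  # a probe that errors out counts as unverified
        (verified if ok else stale).append(note.claim)
    return verified, stale


# Toy environment: the code was refactored, so one remembered fact is out of date.
live_files = {"auth/session.py"}
notes = [
    MemoryNote("login logic lives in auth/login.py", lambda: "auth/login.py" in live_files),
    MemoryNote("sessions live in auth/session.py", lambda: "auth/session.py" in live_files),
]
verified, stale = usable_notes(notes)
```

Only `verified` claims would be allowed to drive edits; `stale` ones are candidates for cleanup or re-investigation.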

Tool Use and Skill Routing

Expanding an agent’s toolset increases complexity. Effective harnesses must decide which tools to use, when to use them, and how to verify results. Without proper routing and validation, even accurate-looking outputs can be incorrect.
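
A toy sketch of routing plus validation is shown below. The `Tool` type and `route_and_run` function are hypothetical names invented for illustration; real harnesses typically route with model judgment and richer checks, but the shape is the same: a result counts only after an independent check accepts it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    handles: Callable[[str], bool]      # routing rule: can this tool take the task?
    run: Callable[[str], str]           # does the work
    verify: Callable[[str, str], bool]  # independent check of (task, result)


def route_and_run(task: str, tools: list[Tool]) -> tuple[str, str]:
    """Try matching tools in order; return (tool name, result) for the first verified result."""
    for tool in tools:
        if not tool.handles(task):
            continue
        result = tool.run(task)
        if tool.verify(task, result):
            return tool.name, result
    raise LookupError(f"no tool produced a verified result for: {task!r}")


def _operands(task: str) -> list[int]:
    return [int(x) for x in task.split()[1:]]


def _check_sum(task: str, result: str) -> bool:
    return result.isdigit() and int(result) == sum(_operands(task))


tools = [
    # Produces a plausible-looking but wrong answer ("23" for "sum 2 3").
    Tool("concat", lambda t: t.startswith("sum"), lambda t: "".join(t.split()[1:]), _check_sum),
    Tool("adder", lambda t: t.startswith("sum"), lambda t: str(sum(_operands(t))), _check_sum),
]
chosen, answer = route_and_run("sum 2 3", tools)
```

Here the first tool returns a confident-looking string that fails verification, so the harness falls through to the second.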

Self-Improving Systems

New research from Microsoft Research Asia and City University of Hong Kong introduces Retrospective Harness Optimization (RHO), which lets agents improve their own harnesses by analyzing past tasks. Using GPT-5.5, RHO raised the SWE-bench Pro score from 0.59 to 0.78 without external labels.

Learning From Failure

RHO works by rerunning difficult past tasks, comparing multiple attempts, and identifying inconsistencies or errors. It then proposes and tests harness updates, keeping only those that produce measurable improvements.
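
The final "keep only what measurably helps" step can be sketched as a simple acceptance gate. This is not the paper's algorithm: `select_harness` and the `score` callable are hypothetical stand-ins (RHO is described as scoring with the agent's own preference over attempts, which is not modeled here), and the numbers are toy values.

```python
from typing import Callable, Sequence, TypeVar

H = TypeVar("H")


def select_harness(
    current: H,
    candidates: Sequence[H],
    score: Callable[[H], float],  # hypothetical evaluator, e.g. replaying past tasks
    min_gain: float = 0.0,
) -> H:
    """Return the best candidate if it beats `current` by more than `min_gain`, else `current`."""
    baseline = score(current)
    best, best_gain = current, min_gain
    for cand in candidates:
        gain = score(cand) - baseline
        if gain > best_gain:  # strictly better than anything accepted so far
            best, best_gain = cand, gain
    return best


# Toy scores for illustration only.
toy_scores = {"old": 0.5, "candidate_a": 0.4, "candidate_b": 0.7}
kept = select_harness("old", ["candidate_a", "candidate_b"], toy_scores.__getitem__)
unchanged = select_harness("old", ["candidate_a"], toy_scores.__getitem__)
```

Falling back to the current harness when no candidate shows a positive gain is what keeps the loop from drifting on noise.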

Emerging Risks

Self-improving harnesses introduce new concerns. Systems that adapt based on their own judgments may reinforce flawed behaviors or unsafe shortcuts, making audit logs, human oversight, and governance essential.

CONCLUSION

As AI models become more standardized, the decisive factor is shifting to how they are deployed and controlled, with harness engineering emerging as the next major frontier in performance and reliability.

Full transcript

The AI race may be entering a strange new phase. For years, everyone obsessed over the model itself. But now, some of the biggest names in AI are starting to focus on something else entirely. They are calling it harness engineering. Because apparently the same AI model can become up to six times more effective just by changing the system around it. Same model, same raw capability, completely different result. So the question becomes: what is the real difference? That difference is the harness.
The easiest way to understand it is this. The model is the intelligence engine. But the harness is everything around it that turns that intelligence into reliable work. It includes the rules, tools, memory, skill libraries, verification systems, context management, permissions, fallback paths, audit logs, and feedback loops that guide the model before it acts, while it acts, and after it gives an answer.
Mitchell Hashimoto, the co-founder of HashiCorp and creator of Terraform, helped push the term into the mainstream earlier in 2026. His framing was very direct. When an AI agent makes a mistake, the answer should not be to just rerun the same prompt and hope it works next time. The better answer is to change the system so that entire class of mistakes stops coming back. That is the real shift here. Prompt engineering was mostly about getting the model to do something right in one interaction. Harness engineering is about building an environment where the model keeps doing the right thing over time. It is the difference between correcting an AI once and designing the system so the same error becomes much harder to repeat.
And that is why the phrase spread so quickly. OpenAI, Anthropic, LangChain, and other parts of the AI industry have all been moving in this direction, even when they use slightly different words. OpenAI published its own essay around the idea and described how this works inside large code generation workflows.
According to one article, OpenAI processed roughly 1 million lines of code and around 1,500 pull requests in 5 months, with humans moving away from writing every line manually and towards shaping the environment around the agent. LangChain compressed the idea into a simple message that people could repeat. Martin Fowler's site gave it a more formal engineering frame. Anthropic has often been more practical than terminological, focusing on the actual systems and safety layers rather than the label itself.
And that matters because harness engineering is not some random new buzzword for old prompting. Prompting, context, and harness work are related, but they are not the same thing. If you change the words the model directly reads, that is prompt work. If you change what information the model receives, that is context work. But if you change the invisible structure around the model, like the tools it can call, the checks it must pass, the memory it can trust, the permissions it has, and the recovery process when something goes wrong, that is harness work. A tool by itself is not the harness. An MCP server by itself is not the harness. A skill library by itself is not the harness. Those are components. The harness is the assembled system that decides how all those pieces work together.
And this is where the AI race starts to look very different. A Stanford and Tsinghua University joint study reportedly found that the same model with different harness designs could vary in performance by up to six times. The model stayed the same. The surrounding scaffold changed. That is a massive result because it suggests that as frontier models become more widely available and more similar in capability, the advantage moves to the team that builds the better system around them.
This also helps explain why AI adoption in the economy still looks strange. On one side, Goldman Sachs argued in April 2023 that generative AI could raise global GDP by 7%, or nearly $7 trillion, over a decade.
That is a huge macro claim. But by April 2024, Goldman said only 4% of US firms had actually adopted generative AI. Even in information services, where you would expect adoption to be much higher, the number was just 16%, with 23% expected within six months. So the promise is massive, but the rollout is still uneven. That gap is not only about access to models. Plenty of companies can access strong models now. The bigger issue is that they do not yet have the system layer that turns AI capability into repeatable productivity. The model may be powerful, but without the harness, it remains fragile. It can answer one question, generate one file, write one piece of code, or summarize one document, but it may struggle to operate reliably inside a real workflow with memory, permissions, tools, deadlines, edge cases, and consequences.
This is especially clear with agentic AI. A normal chatbot gives an answer. An agent has to operate over time. It may need to open a terminal, search files, read documentation, write code, test the result, call an API, update a database, ask for clarification, store memory, recover from a failed command, and decide whether an action is safe before it touches a live environment. Once an AI model is embedded inside tools, browsers, terminals, repositories, memory stores, and external services, its behavior is no longer determined by the model alone. It is determined by the whole system.
That is why a new UC Berkeley paper argues that for agentic AI, model scaling alone is no longer the full story. For normal chatbots, the model matters the most. But once an AI becomes an agent, once it starts using tools, opening files, running commands, remembering things, and taking actions, the model is only one part of the machine. The paper says the next major bottleneck is system scaling, or scaling the harness. A real agent needs several layers working together. It needs the LLM itself, which is the reasoning engine.
It needs memory so it can remember useful information across tasks. It needs a context system so it knows what information to put in front of the model and what to leave out. It needs skill routing so it can pick the right tool or workflow at the right time. It needs an orchestration loop, which controls the sequence of steps. And it needs verification and governance, so the agent cannot just take risky actions without checks, permissions, logs, or a rollback path. That sounds technical, but this is already happening in serious AI systems. Claude Code, OpenClaw, and CheetahClaws are different kinds of agent systems, but they all face the same basic problem. How do you control what the AI sees, remembers, uses, checks, and changes?
And the first major problem is context. A lot of people think a bigger context window automatically makes an AI agent better, but the UC Berkeley paper makes a sharper point. The hard part is not giving the model more tokens. The hard part is giving it the right tokens. A million-token context window does not help much if the useful detail is buried under old logs, stale notes, irrelevant files, and conflicting information. That is where context rot comes in. The model technically has the information somewhere in the window, but the signal gets drowned in noise. That is why real systems already fight this aggressively. Recent analyses of Claude Code describe a five-tier compaction system, with things like micro-compact to clean up old tool results and context collapse to summarize long conversations. And when a tool produces a massive output, like a giant server error log, the system does not just dump everything into the model. It can write the full file to local disk and give the model only an 8-kilobyte preview first. So the agent behaves more like a developer: check the top of the log, understand the shape of the problem, then dig deeper only when needed.
The second problem is memory.
Memory sounds useful, but bad memory can be dangerous. An agent might remember an old note about how a codebase works, miss the fact that the code was refactored yesterday, and then confidently apply the wrong fix. The paper calls this the stale but confident problem. The memory is outdated, but the agent treats it like truth. So, a serious harness treats memory with suspicion. Something like a MEMORY.md file should act more like a hint than a fact. Before the agent edits files or takes a risky action, it has to check the live environment and verify that the memory is still true. Some systems even clean memory in the background during idle time, removing contradictions, compressing useful lessons, and stopping the agent from slowly filling up with old or messy information.
The third problem is skills. Giving an agent more skills sounds like an obvious upgrade, but it creates another problem: choosing the right one. The agent has to know which skill to use, when to use it, how to combine it with other skills, and how to check the result. A specialized tool can produce an answer that looks confident and useful while still being completely wrong. So the real issue is not just having skills. It is routing and checking them. That is where harness engineering becomes practical. A strong harness does not just give the model more tools and hope it behaves. It connects those tools to checks. Did the task actually finish? Did the output match the request? Did the system change safely? Was the tool result verified? Is the agent even allowed to continue?
And now researchers are taking this one step further. They are asking whether AI agents can improve their own harnesses from experience. That is where Retrospective Harness Optimization, or RHO, comes in. A new paper from Microsoft Research Asia and City University of Hong Kong introduces RHO as a way for an agent to improve its harness by looking back at its own past work.
Instead of needing a labeled validation set with correct answers, the system studies old trajectories, finds difficult and diverse tasks, reruns them, compares different attempts, diagnoses what went wrong, and proposes harness updates. The key part is that RHO does not need ground truth labels. It uses the agent's own preference over different attempts.
First, it selects a small group of past tasks that are both hard and diverse. The paper uses a method called DPP to balance those two things, because choosing only the hardest tasks can focus too narrowly on one type of failure, while choosing only for variety can miss the serious problems. Then it runs multiple attempts on each task and looks for two signals. Self-validation checks whether the agent actually completed the task properly and catches things like false assumptions, wrong tool calls, and stopping too early. Self-consistency compares different attempts on the same task and looks for major disagreements in the plan, the tools used, or the final answer. Those signals become instructions for improving the harness. Then, RHO generates several candidate harnesses, tests them against the old one, and keeps the candidate that performs better, but only if the score is actually positive. So, it is not randomly changing the agent. It is using past failures to decide what the system around the model should learn.
And the results are the part that makes this hard to ignore. Using Codex with GPT-5.5, RHO improved SWE-bench Pro from 0.59 to 0.78 without external grading. And it also improved Terminal-Bench 2 and Gaia2. So the gains showed up across coding, technical work, and knowledge tasks. What makes this more interesting is that RHO was not just giving the agent more memory. It was changing the actual system around it: the tools, skills, instructions, and checks that shape how the agent works.
After optimization, the agent verified its work more often, used tools more carefully, and performed better on long tasks where normal agents usually start falling apart. That points to the bigger shift. Future agents may improve by learning from their own work history. Every task leaves a trail. Every failure leaves a clue. And repeated mistakes can become updates to the harness itself. Of course, that also creates risk. If an AI can update persistent behavior from its own judgments, it can also reinforce bad habits or unsafe shortcuts. So, serious systems still need audit logs, human approval, and safety checks. And the next phase of AI may be won by whoever builds the best harness around the model.
Also, if you want more content around science, space, and advanced tech, we've launched a separate channel for that. Links in the description. Go check it out. If you think harness engineering is really the next big AI advantage, drop your take in the comments. Hit subscribe if this made you look at AI agents differently. Thanks for watching and I'll catch you in the next one.
