
AI models are rapidly absorbing developer-built scaffolding, turning complex agent systems into simpler, more autonomous tool-driven workflows.
AI systems are evolving beyond simple input-output models into integrated tool ecosystems. Capabilities such as tool use, memory handling, and execution are now embedded directly within models, reducing the need for external engineering layers. This marks a structural shift in how developers design AI applications.
Earlier systems required handcrafted routing logic to decide which tools a model should use, often relying on brittle heuristics. Modern models can now evaluate available tools and select the appropriate one autonomously. Improved reliability also allows models to retry failed tool calls without external intervention, eliminating the need for custom retry loops.
Providing models with both input parameters and expected output schemas enhances efficiency. By understanding the structure of tool responses in advance, models can better plan actions such as ranking or filtering results, reducing unnecessary back-and-forth calls and improving response quality.
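For instance, a tool definition can fold the output shape into its description. This is a minimal sketch in the Anthropic tool-use format; the `search_docs` tool and its return fields are illustrative, mirroring the example from the talk below:

```python
# A tool definition that documents its output schema in the description,
# so the model knows what a call will return before it makes one.
# The tool name and fields (id, title, snippet, score) are illustrative.
search_docs_tool = {
    "name": "search_docs",
    "description": (
        "Search the documentation. Returns a JSON list of results, each "
        "with: id (str), title (str), snippet (str), and score (float, "
        "0-1 relevance). Results are unordered; rank by score if "
        "ordering matters."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "limit": {"type": "integer", "description": "Max results"},
        },
        "required": ["query"],
    },
}
```

Because the description promises a `score` field, the model can plan to rank results in the same turn instead of calling the tool first to discover what it returns.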
Long-context limitations are being addressed through 1-million-token context windows, flat pricing, and built-in server-side compaction. Previously, developers relied on techniques like chunking, retrieval systems, and summarization loops. These are increasingly replaced by native context management features requiring minimal configuration.
Removing stale tool outputs—such as screenshots or large file reads—while preserving the decisions derived from them can significantly reduce token usage. This allows systems to maintain reasoning continuity without carrying unnecessary data overhead.
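A hand-rolled version of this pruning, before it became a native feature, might look like the following sketch. It assumes Anthropic-style messages where tool results arrive as content blocks on user messages; the retention threshold is illustrative:

```python
from typing import Any

PRUNE_AFTER_TURNS = 10  # illustrative: keep the last 10 messages intact

def prune_stale_tool_results(
    messages: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Stub out old tool_result payloads while keeping every decision.

    Assistant text (the reasoning those results informed) is never
    touched, so the transcript keeps its continuity.
    """
    cutoff = len(messages) - PRUNE_AFTER_TURNS
    pruned = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if i >= cutoff or not isinstance(content, list):
            pruned.append(msg)  # recent or plain-text message: keep as-is
            continue
        new_content = [
            {**block, "content": "[pruned to save context]"}
            if block.get("type") == "tool_result"
            else block
            for block in content
        ]
        pruned.append({**msg, "content": new_content})
    return pruned
```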
Models now include built-in code execution tools with hosted sandbox environments. This replaces complex pipelines where developers had to generate, execute, and validate code externally. The full write-run-debug loop can now occur within a single interaction, streamlining development workflows.
The execution model distinguishes between a model-controlled sandbox and a user’s local system. This enables safe experimentation, dependency installation, and data processing without affecting local environments, while still allowing access to local resources when necessary.
Advances in computer interaction eliminate the need for manual image scaling and coordinate transformations. Models can now process native-resolution screenshots and generate precise click coordinates up to 1440p, simplifying automation of graphical interfaces.
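For contrast, this is the kind of glue that disappears: the old downscale-then-rescale coordinate math, sketched with a hypothetical pixel limit. Up to 1440p, both functions can now be deleted and native screenshots sent directly:

```python
MODEL_MAX_WIDTH = 1280  # hypothetical pixel limit the model used to require

def to_model_space(native_w: int, native_h: int) -> tuple[float, int, int]:
    """Compute the downscale factor and the resized dimensions to send."""
    scale = min(1.0, MODEL_MAX_WIDTH / native_w)
    return scale, round(native_w * scale), round(native_h * scale)

def to_screen_space(x: int, y: int, scale: float) -> tuple[int, int]:
    """Map a click sampled on the downscaled image back to real pixels."""
    return round(x / scale), round(y / scale)

# e.g. a 1920x1080 screenshot: scale 0.667, send 1280x720, and a model
# click at (640, 360) maps back to (960, 540) on the native display.
```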
Performance benchmarks show significant improvement in complex software interaction. On the OSWorld evaluation, task completion rates have risen from below 50% to approximately 78%, signaling growing reliability in handling real-world applications.
AI agents can now test user interfaces, reproduce bugs, apply fixes, and retest workflows independently. This closes the loop between development and quality assurance, enabling systems to interact with software in the same way humans do.
Code designed to compensate for model weaknesses—such as validators, planners, and retry systems—is becoming obsolete quickly. As models improve, these layers are absorbed into core capabilities, reducing their long-term value.
The most durable engineering effort lies in connecting models to unique data, tools, and workflows. Unlike generic reliability fixes, this integration cannot be replicated easily and becomes a key source of differentiation.
As AI models internalize more capabilities, development is shifting away from maintaining model reliability toward building unique integrations and data-driven systems that define real competitive value.
Hello everybody. How are folks doing today? My name is Lucas, I'm a research PM here at Anthropic, and today I'll be talking about the expanding toolkit. But first of all, I want to say thank you, everybody, for joining us at our Code with Claude conference. We're very grateful you're here, and we love speaking directly to our users.

So what am I going to talk about today? The overarching theme of today's talk is that the scaffolding you had to build last year actually ships with the model today. I want you to think of the model no longer as just an input-output LLM box, but as a series of tools around that model that expand its capabilities and lead to better performance. In other words, we see the model itself as an expanding toolkit.

This talk will be a series of befores and afters. On the left side, you'll see what things looked like previously. On the right side, you'll see the same task, but now in 2026, heavily simplified. What you'll notice is not just better reliability but much simpler development as well: you focus less on the retries, wrappers, and so forth, and more on getting the outcome you're looking for. We'll be covering tool use, context management, code execution, and computer use, and I'll share practical tips for each. For the Claude Code fans in the audience, I'll also have quick tips specific to Claude Code.

A year ago, building an agent really meant building around the model. You might have routers to pick the right tools, retry loops, output validators, context compaction. You might even have to do some coordinate math if you're doing computer use. You'd have hundreds of lines of scaffolding before you even built any product. Now, that scaffolding hasn't disappeared, but it moved: it ships with the model itself. The point isn't that the work went away; it's that you don't have to own it anymore.

The first capability I want to talk about is tool use, specifically tool routing and retries. On the left side, you can see what this looked like previously. You couldn't trust the model with the full tool set; it would eat into the context window, so you'd build a router. And the way you might build that router is with string-matching heuristics: if the model mentioned SQL, give it the database tool. Then you also needed a retry decorator on top of that, because tools failed often enough that you actually needed backoff. Routers like those are basically guesses about user intent written as conditional if statements. They're brittle, and they're the first thing that breaks when you try adding a new tool.

On the right is the new paradigm. The model can search through tools and pick the right one itself. The model is intelligent enough, and tool selection accuracy is now high enough, that tool routers and prefiltering usually make things worse, not better. We want the model to decide which tools are relevant in the context it's working in. And when a tool errors, you can trust that Claude will see the error, recover on its own, and call the tool again. So: no more pesky tool routers, no more heuristics about when to bring in certain tools. That's all built into the model today.
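To make the contrast concrete, here is a minimal sketch of the before and after. The tool names, model id, and the `ALL_TOOLS` list are all placeholders for your own setup:

```python
import time

from anthropic import Anthropic

# Before: a keyword router plus a hand-rolled retry loop (illustrative).
def route_tool(user_message: str) -> str:
    if "sql" in user_message.lower():
        return "database_query"  # brittle: breaks on "fetch the orders table"
    if "search" in user_message.lower():
        return "web_search"
    return "default_tool"

def call_with_retries(fn, *args, attempts=3, backoff=2.0):
    for attempt in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff ** attempt)  # exponential backoff

# After: hand Claude the full tool list and let it route, call, and
# recover from tool errors itself.
ALL_TOOLS = [...]  # placeholder: your complete tool definitions
client = Anthropic()
response = client.messages.create(
    model="claude-opus-4-5",  # illustrative model id
    max_tokens=1024,
    tools=ALL_TOOLS,  # no prefiltering, no router
    messages=[{"role": "user", "content": "Which customers churned last month?"}],
)
```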
Now, as promised, a quick tip for tool use, and this is a powerful one that I use quite frequently. When you're giving a tool to Claude, most developers typically give Claude the input to that tool: here are the parameters you need in order to call this function. But you can also give Claude a description of the output schema. In the example I have here, the description outlines that this search docs tool will search the docs and return the ID, title, snippet, and score. By doing this, you let Claude know what to expect from the tool call. For example, if Claude wants to rank the outputs of this tool, it already knows a score will be returned, effectively saving it a round trip through the harness. You get more efficient and more intelligent outputs from Claude when using the tool.

And now for a Claude Code quick tip, another one I use frequently: you can define pre- and post-tool-use hooks in your Claude Code settings. Before Claude calls a specific tool, or after it does, you can have something happen programmatically. You might use this to block certain tool calls in specific situations, or to analyze and log outputs programmatically after the tool call is made.

Next, I'll be speaking about context management. Long-running agents previously meant building your own memory system. You might do chunking. You might do RAG, which is very popular, especially for managing those pesky context windows. You might even call another model to summarize what's going on after every N turns or N tokens. Again, the idea is that you were building scaffolding to practically extend the model's context window. And you might also have cache breakpoints that you had to move by hand to save on cost and cache previous turns.

Well, we've simplified all of that tremendously. Offering a 1-million-token context length at flat pricing already reduces most of the window pressure. Pair that with server-side compaction and context editing, and the rest turns into just a few lines of config, which you can see on the right side. This is how we get much closer to the feeling of an infinite context window that was mentioned in the morning keynote today. Again, this is another example of scaffolding you previously had to build that is now built into the API, a single API call away.

Now for a quick tip on context management. We recommend that every N turns you clear tool results. By pruning stale tool outputs, think screenshots, search results, or file reads, you can save tremendously on context while keeping the decisions they informed, which Claude records in its transcript. Imagine a transcript where the model read a huge file, took a screenshot, made a decision based on that, then ran a search that dumped a ton of text. By clearing those results and keeping just the core task, the decisions made from those tool results, and the agent's own analysis, you save on tokens in real time.
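Here is a sketch of what those "few lines of config" might look like, doing natively the same pruning you previously scripted by hand. The beta flag, the `context_management` parameter, and the edit type string are assumptions about the beta API shape, so verify them against the current docs before relying on this:

```python
from anthropic import Anthropic

client = Anthropic()

# Native context management in a single call. Parameter names and beta
# flags below are assumptions; check the current API docs for the exact
# strings.
response = client.beta.messages.create(
    model="claude-opus-4-5",  # illustrative model id
    max_tokens=2048,
    betas=["context-management-2025-06-27"],  # assumed beta flag
    context_management={
        "edits": [
            # Automatically clear stale tool results from the window.
            {"type": "clear_tool_uses_20250919"},
        ]
    },
    messages=conversation_history,  # placeholder: your running transcript
)
```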
And now for the Claude Code fans, a quick tip. I suspect a lot of you already know this one, but I like it a lot, and if you have Claude Code open right now, I suggest you try it. Run /context to get a live, colored grid breakdown of what's filling your context window. That's a great way to viscerally see what I'm describing here: how much space messages, tool results, the system prompt, and MCP definitions take up in your context window. You'll also see some optimization suggestions there as well.

Next up is code execution. Previously, the write-run-fix loop was the developer's job. You might find a VM provider and spin up a sandbox on the VM they provide. You'd then have the model output some code, put that code on the VM, run it, parse the traceback, feed that back into the model, and repeat until the model succeeded at the task. We wanted to massively simplify that. So we now offer a code execution tool which automatically gives Claude a hosted sandbox on the server side. The entire loop I just described effectively happens inside a single API turn. No more harness round trips between Claude and whatever VM you're using; on the API side, Claude taps into a separate computer used purely as its scratchpad.

This next one is maybe less a tip and more a mental model for thinking about code execution versus your local bash. When we give Claude the code execution tool, it basically gets its own computer. Think of it like giving Claude a little calculator, except it's an entire computer it can actually use. Claude can use it for stateless compute and data analysis; it can install custom libraries there; and it can do all that work without disrupting or cluttering your local file system. Then, when Claude does need something that only exists on your local machine, maybe your repo, a Python venv you have installed, or any other local context, it can go back to the real bash on your computer, and it intelligently knows which of the two to use.

Now for another Claude Code quick tip: you can use /schedule to set up cron-triggered autonomous runs. Think of the self-iteration loop I described on the previous slide, but now on a timer, happening exactly when you need it, completely autonomously, done by Claude.
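As a sketch of how little harness code this leaves, a call might look like the following. The tool type and beta flag reflect my understanding of Anthropic's code execution beta, but treat the exact strings as assumptions and check the current docs:

```python
from anthropic import Anthropic

client = Anthropic()

# The full write-run-debug loop happens in a hosted sandbox, inside this
# one call. Beta flag and tool type string are assumptions; verify them
# against the current docs.
response = client.beta.messages.create(
    model="claude-opus-4-5",  # illustrative model id
    max_tokens=4096,
    betas=["code-execution-2025-05-22"],  # assumed beta flag
    tools=[{"type": "code_execution_20250522", "name": "code_execution"}],
    messages=[{
        "role": "user",
        "content": "Simulate 10,000 coin flips and report the longest run of heads.",
    }],
)
print(response.content)  # includes the code Claude ran and its results
```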
And now, last but certainly not least, an area where I spend a lot of my time working with Claude: computer use. Previously, say you wanted Claude to drive your laptop. Most laptops have a 1080p screen, and you had to send that image to Claude, so in order to get reliable clicks, you needed a pile of image glue. You'd take your 1080p image, downscale it to fit Claude's pixel limits, and track the factor of that downscaling. Then, when the model sampled a click, you'd scale that click back up to your original resolution. You had to write all that code and then wrap it in retries and verify statements.

This was a very big pain point, and we've heard the feedback. Opus 4.7 can now take native-resolution screenshots and return 1:1 pixel coordinates up to 1440p, which captures the vast majority of display resolutions. The scaling math is completely gone: you simply send your image and trust that Claude will click exactly where it needs to.

We're really excited about this capability, because computer use has made great leaps and bounds over the last 12 months. Our headline evaluation here is OSWorld, an eval that tracks how well the model completes complicated tasks on professional as well as consumer-grade software. Less than 12 months ago, Claude scored below 50% on this eval; it could not complete half the tasks asked of it. Now we're about to hit 80%, and we just reported 78% on Opus 4.7. Making computer use easier to use is very exciting for us, as we see it as a capability just now at the cusp of broad usability.

As I mentioned, we support resolutions up to 1440p, but we'd really encourage developers to experiment across resolutions and formats. If you're doing really high-res work like 4K, we still recommend downscaling on your side. For anything up to 1440p, though, try different resolutions to see what works best, as well as different image formats like JPEG, PNG, or WebP. Each compresses images differently and creates different compression artifacts, so by testing on your use case and the kinds of UIs you're automating, you can find what works best for you.
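As a starting point for that experimentation, here is a small helper, a minimal sketch using Pillow with the 1440p threshold from the talk. The returned dict follows the Anthropic base64 image content-block shape; everything else is adjustable:

```python
import base64
import io

from PIL import Image

MAX_HEIGHT = 1440  # send native resolution up to 1440p; downscale above it

def prepare_screenshot(path: str, fmt: str = "PNG") -> dict:
    """Encode a screenshot as a base64 image block, downscaling only 4K+.

    fmt can be "PNG", "JPEG", or "WEBP"; each compresses differently,
    so it's worth A/B testing per UI.
    """
    img = Image.open(path)
    if img.height > MAX_HEIGHT:
        scale = MAX_HEIGHT / img.height
        img = img.resize((round(img.width * scale), MAX_HEIGHT))
    if fmt == "JPEG":
        img = img.convert("RGB")  # JPEG has no alpha channel
    buf = io.BytesIO()
    img.save(buf, format=fmt)
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": f"image/{fmt.lower()}",
            "data": base64.b64encode(buf.getvalue()).decode(),
        },
    }
```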
Now, a Claude Code quick tip here: Claude Code can leverage your Chrome browser session itself. If you're in Claude Code and you have Claude in Chrome installed, which you can get at claude.ai/chrome, your Claude Code session, that agent's harness, can start to use and navigate the web, and this includes local development as well. I'll show you something really cool you can do with that in a second.

So now we're going to do a short, pre-recorded demo of an agentic coding loop with Claude Code, and you'll see computer use in action with the Claude in Chrome extension. What we have here is a Claude Code session, and we've been working on a project management dashboard, but it has a couple of bugs. The first bug is that the New button isn't actually adding a card, and it should be. So we ask Claude to open the dashboard in Chrome and try it itself, and we'll see the dashboard load as soon as Claude decides to open the browser. You can see Claude is connected through Claude in Chrome here, and the first thing it's going to try to do is reproduce the issue.

You can see it spun up Claude in Chrome, and it's going to test this live board itself. First, Claude tries to type something; you can see in the bottom left of the dashboard that it's typing but hitting issues. In real time, Claude is both testing and debugging side by side, the same way you might do QA in order to fix these bugs. It tried typing, that didn't work, and now Claude goes into the code and makes the changes to wire up card creation successfully. And now you can see that Claude has successfully created the card: it found the issue by actually testing it, tried it, and fixed it.

With creation working, Claude also checks some other features, for example whether cards can be correctly dragged across columns, since Claude can do drag actions as well. Claude just tried to drag the Review PR card into Done, but it accidentally landed in To-Do. It recognizes that's a bug, has that insight within Claude Code, diagnoses the drag-and-drop bug, and writes the fix in real time. Then it retests the flow from the ground up: it once again creates a new item (thank you, Claude, that's working), then tests the drag-and-drop flow, with the Review PR card now correctly landing in the Done column. From there, Claude recaps the fixes it made and summarizes all its changes.

We think this is a really powerful loop, because most software today is created for humans and so has to be tested in a human-like way. Giving Claude the capability to do browser use and computer use during its development cycle closes that loop: it can create human-focused software and solve bugs itself, directly, without the developer needing to come in and handhold Claude to the bug and the solution.

So, to wrap up the talk and bring it back to the main point we really want to get across, the rule to keep in mind is this: any code you write that compensates for model unreliability has a half-life of just months. You should leave that work to us. We will continue to make Claude more reliable and more capable through this expanding toolkit that comes with the model. Retry logic, routers, planners, verification loops: all of these are going to get absorbed into the model, and they have been getting absorbed, as I've shown you in the prior slides. Contrast that with code that connects your model to your world. That code tends to compound: your custom tools, your data, your auth, your specific context. The model can't absorb what it can't see, so giving it that is much more valuable than compensating for model shortcomings today.

And we believe the ecosystem is moving in the same direction. In the near future, every piece of software will be getting a front door for agents. The interesting work is no longer making the model more reliable; the interesting work is what you put on the other side of your agent's front door that nobody else can. Thank you very much for coming to my talk and to the Code with Claude conference. My name is Lucas, and I'll be walking around; if you have any additional questions, feel free to come say hi. Thank you all very much.