
Tech • IA • Crypto
OpenAI’s updated Agents SDK introduces a Codex-style harness, sandboxed execution, and persistent state features to make long-running, production-grade AI agents easier to build and deploy.
AI models are increasingly capable of handling tasks over extended periods, from minutes to days. Internal tools have demonstrated agents running continuously for up to a week, completing complex workflows such as coding, data analysis, and security scanning. This shift is driving demand for infrastructure that supports sustained, autonomous operation.
Despite improved model capabilities, deploying agents in real-world systems presents challenges. Developers must balance performance with flexibility across model providers, manage state across failures, and handle orchestration logic. Issues like container crashes, state loss, and secure handling of secrets complicate scaling.
The SDK now integrates a Codex-inspired harness that automates the agent loop, including tool use, context updates, and task continuation. Features like asynchronous shell execution, command tracking, and automatic context compaction allow agents to operate continuously without manual orchestration.
A key architectural change splits the agent harness from the execution environment. Instead of coupling logic and compute in a single container, agents can run in ephemeral sandboxes while orchestration persists elsewhere. This enables recovery from failures and avoids state loss when containers terminate.
Agents can operate inside isolated sandbox environments with access to files, enabling tasks like code editing, document processing, and asset generation. These sandboxes can run locally via Docker or in cloud platforms such as Modal, Cloudflare, Vercel, and others, offering flexibility for development and production.
The SDK introduces built-in state management by snapshotting both the file system and conversation history. These snapshots can be stored locally or in cloud storage like R2 and later reloaded, allowing agents to resume tasks seamlessly even after interruptions.
A new Skills API allows developers to package domain-specific logic, instructions, and resources into reusable units. Skills can be versioned, stored centrally, and loaded dynamically, enabling agents to perform specialized tasks such as tax preparation or content editing with consistent behavior.
The Responses API now includes a hosted shell tool that spins up temporary containers for single tasks. Developers can upload files, execute code, and retrieve outputs in one call, offering a lightweight alternative to full agent deployment.
Developers can extend agents with custom tools, defined as functions that the model can call automatically. These tools support validation, guardrails, timeouts, and dynamic enablement, enabling integration with external systems such as task trackers or databases.
The SDK introduces a manifest system to define file system structure and data sources. Files can be copied into sandboxes or mounted from external storage like S3-compatible buckets, balancing performance with scalability depending on workload needs.
While still evolving, the framework supports orchestrating multiple agents working in parallel. Coordination can occur through shared storage, messaging, or supervisory agents, with expectations that large-scale multi-agent systems will become more common.
The updated Agents SDK reflects a shift toward persistent, autonomous AI systems by combining structured orchestration, sandboxed execution, and built-in state management into a flexible developer framework.
Hey everyone, welcome to OpenAI Build Hours. I'm Christine. I'm on the startup marketing team and I'm here with Steve. >> Hi everybody. >> So you might might remember Steve. Welcome back to build hours. >> Thank you. It's great to be back >> from our last session on responses API. Um but today we're going to be talking about the agents SDK. So if this is your first build hour, uh the purpose of build hours is to empower you with the best practices, tools, AI expertise to help you build with our APIs and models. Down below on your screen is our homepage. So you can find all of our past sessions including Steve's one on responses API on demand and this will also be on demand after our live session. So to give you a little snapshot on what we're going to be talking about today. So, first we'll be going through a lot of the new updates on agents SDK. There's now a codec style harness um that's really improved it and Steve will walk you through all of these new releases. Um we also have some new releases in the API that we'll walk through and then we'll move into a live demo. So, this will be the majority of the time that we'll be spending today. Um and we'll be building an agentic uh task tracker and then as always we have a lot of time set aside for Q&A. So on the right side of your screen, you'll see a Q&A tab. So feel free to submit your questions. Um we have our team who is dialing in virtually to answer them during the session as well as saving some to answer live towards the end. And we'll have a very special guest be joining us virtually uh to help answer some of these. So with that, I will turn it back over to Steve to walk through some of these slides. >> Awesome. Thanks, Christine. Uh so great to be back on Build Hour. My name is Steve. I'm an engineer on the API team here at OpenAI. First thing I want to talk about is some of the updates we've made to the agents SDK in the last couple months. Um kind of uh kind of the backstory here is that models are getting way better at doing work over long periods of time. Uh a lot of folks have seen the sort of meter chart that shows the ability of models to kind of work uh on their own for longer and longer periods of time that's just kind of going continually up and to the right. And there are a lot of great examples of this uh even within our own products. So many of you know and love Codeex, our agent coding tool that you can use to kind of build software and do tasks from end to end. You've probably seen it run for maybe minutes or up to an hour. If you've used the goal feature, maybe you've seen it run for much longer than this internally. Folks have gotten Codex to run for days, up to a week on tasks. So these models are really really able to kind of do things over really long trajectories, which is really cool. Um we also uh have this sort of security agent that's built on top of Codeex. um you know the models are getting really really good at finding security vulnerabilities especially in legacy software and so we have an agent that's powered by codecs that actually scans repos and our our own code and code that we depend on for vulnerabilities that might affect the security of our systems and then we also have an internal data agent that's kind of powered again by codeex that's connected to our data links able to answer any questions so I can say something like how many people are using the agents SDK how many requests were there in the responses API two days ago and where previously I might have spent an hour writing that SQL query, now it's just a simple prompt and a few minutes later I have an answer which is which is really awesome. Uh the reality is though that building production grade agents in your own systems is is still pretty hard. Uh it's pretty hard to kind of strike this balance between maximizing performance and getting and having sort of a really uh flexible maybe cross model provider platform. um because it more and more the models are kind of being tailored to the individual harnesses and so it's kind of hard to stay in distribution while also maximizing that flexibility and that customization. Um there's also a lot of challenges that come with sort of modern computer using agents. So we think of codeex as a coding agent uh that can do things within a sandbox. Most most of the time when you're running codeex that sandbox is just your local laptop. So those two things are together. We'll talk about how we switched that up with the agents SDK. But if you're deploying agents in production, oftentimes you want to be able to run these agents in sort of an isolated environment and then you have to deal with problems like the container dies, it expires, it goes away, what do you do? Does that thing loadbearing? Did you have state on that then rehydrate? Uh this stuff is is really really hard. Uh and then kind of the third thing is like at the end of the day, you want to be able to have this sort of like really customizable framework you can use to bring your own data in and uh maybe add your own skills and and all of this stuff like this. And so this combination of things is kind of makes deploying agents in production a little bit challenging. Uh the cool thing about the agent SDK, we've added we've made some updates to it, but sort of the baseline is this open- source customizable framework that maybe you've used since we released the agent SDK last March. It's been around for a while, continues to be open source and and super flexible. Uh we've added in this concept of sandbox using agents and you get to decide where these sandboxes are run. So if you are you know maybe a modal user or an EDB user or a cloudflare user and you're kind of already using those sandbox products or maybe you're new to sandboxes and new to sandbox using agents and you want to but you're already using one of these platforms we have provided first class support for using the sandbox products in a bunch of these uh different platforms like ETB modal cloudflare in addition versell uh blackel daya uh and just things like docker and uh if you just want to use your local laptop for testing. Those are kind of all supported first class. And then on top of this is this Codex style harness. So bringing in all of the great stuff that makes CEX Codex, all the great tools that are in distribution with our models, but work great with other models as well. And then some of the cool things like skill use, uh, computer use, memory, and we'll be demoing some of this stuff here in a little bit. So uh, yeah. So basically what the agents SDK is is it provides this sort of model native harness that the that the agent can use in order to kind of do productive work over long periods of time. So if we think about maybe what building an agent might have looked like you know a year a year and a half ago where you're just kind of using an API directly an LLM API directly you would have done something like you know you set up a loop you make a request to an LLM API you get some stuff back maybe you get a tool you execute that tool you update the context and then you kind of repeat until the model has no is no longer calling any tools you might be layering in stuff in addition to that like web search file search MCP kind of like building spending most of your time building this orchestration layer around this loop where actually you should be spending most of the time building product building stuff into your product that that makes it it much better for your end users and so agents SDK is kind of designed to help you do all that stuff and not really think about the orchestration. So the agent loop is built in again very CEX-like in its uh design. Uh also web search, file search, MCPS, code interpreter, skills, these are things are all you get out of the box or really easy to turn on. And then we've kind of added in some of this like sandbox using functionality that I'll talk about uh pretty shortly. The thing that I'm most excited about that we've kind of done here is we've split the harness from the compute. And uh what does that mean basically? So if you can kind of imagine a world where the model is working sort of directly on a uh computer that it itself is running on. So if you can kind of imagine like codecs running on your laptop is the sort of example where the harness and the compute are tied together. Uh imagine putting that into production right so you have sort of a container where the agent is both running. So like the the loop that is calling your LLM API in addition to working on the same file system those things are together. This kind of creates a few problems when you run this in production at scale. A your sandboxes become all of the sudden loadbearing and if one dies or goes away then all of a sudden all that state is gone and you don't have some sort of external place where maybe you can refresh from. In addition, you have to sort of do interesting uh sort of gymnastics to manage your secrets. You ideally don't want to have any secrets on your sandbox otherwise you might be vulnerable to prompt injection attacks or exfiltration. And so if you kind of split those things up, then you can treat the sandbox as this totally ephemeral thing that you don't really have to worry if it lives or dies. It can expire, go away. And then the uh harness, which is running maybe in a temporal job or in AWS or you know kind of wherever your infrastructure is already running can sort of handle the rehydration and snapshotting and all this stuff. And the agents SDK is going to make that really easy. And I'll kind of show some examples of this. Um, so agent SDK kind of like already fully packed with a ton of features that you might have already been using things like web search, file search, uh, you know, agent memory, text in addition to some modalities that you've probably been using, uh, text and voice. Um, there are some new things that are we're pretty excited about that we've added. So this is kind of first class support for skills. Uh, containers is kind of a new thing sort of in the API, also in the SDK. And then we've also added agent memory so your tasks can improve over time um and get better as they as they go. So some of the kind of core features here again uh code a really you know codeex inspired harness a lot of those same tools sandbox use so your models can work with files in sort of a virtualized environment you know do work over long periods of time write bash write python write whatever to kind of accomplish tasks and then as always open source super customizable socy harness what kind of concretely does that mean um the kind of things we've brought from codeex into the agency SDK are these things like uh autoco compaction uh especially you know this computer use through through uh the shell. So the model is able to write shell commands. Um we've you know the codec style loop is this sort of async shell interaction loop where it can write a command maybe wait a little bit if it doesn't finish it can go do something else and then kind of come back to those things and it kind of keeps track of what commands it has running at any given time. And so we've kind of brought that into the agent SDK to allow you know coding agents to work kind of in any domain in any infrastructure. And then sandboxes are super crucial because you know modern work means working over a lot of files. If you think about kind of the work that you and I do every day maybe you're working in a big codebase with a bunch of files editing files and and making your changes. Maybe you're working with a bunch of PDFs. Maybe you are you know creating word documents or powerpoints. a agents are no different and they need access to those files and a kind of a a a place to create those files and store them. And so that's kind of where sandboxes come into the picture. It's an isolated environment where the model has access to some files and can do stuff and and produce uh meaningful output. And uh kind of again agent SDK open source and customizable. It's been open source since day one. Super, you know, uh model agnostic. If you want to use other model providers, you're more than welcome to. Anything that kind of uses the responses API format will plug in really well. Uh it you can make this multi-tenant super easily. So if you want to have a system that's running the agents SDK where you're kind of processing many requests from users at once again super easy. Made session management really simple. So agents SDK is going to manage the snapshotting uh rehydration of tasks. You can kind of start from where you left off. All that stuff is really simple. Kind of show a demo of that. And then really configurable. You can bring your own tools, bring your own MCPs and and kind of all that stuff. So I want to switch gears a little bit and talk about some other cool things that we've shipped in the API in the last couple months. Uh first one is this hosted shell tool in the responses API. So idea here is I want to make a just a quick API call. I want to provide some files there that I want the model to do something with. Um and behind the scenes we're going to spin up a container, load that up with the files that you've asked us to. The model can write code, do some stuff in that container and then return the results back to you. So think about this as like a lightweight version of the agents SDK. It's kind of this ephemeral container. You can put a few files in there. The model can work in that container and then it can finish and you can you know create these out of band. We have a containers endpoint where you can go and create a container, upload files to it, spin it up, attach it to a responses API request or you can kind of use it in auto mode where you just add some files and then we'll spin up the container for you and then spin it back down when you're done. Really easy way to kind of get started with this, you know, sandbox using paradigm. We've also added the ability for you to kind of control network access in the containers you spin up. So if you want to do things like domain allow lists, lock it down entirely so you have no egress or ingress into that container. Uh these are kind of like all things we've added so you have a little bit more security control over your containers. And we've also added the skills API. So if anybody has used skills in the past, there's kind of like a pretty common problem that comes into play which is that you kind of need a central source of truth for these skills. whether that's maybe a GitHub or a bucket or now we have a skills API where you can kind of upload your skills too. Uh and a skill is kind of this bundle for for those who haven't used it. It's kind of a bundle of files that has sort of one skill MD file that has all the top level stuff in it, but they can also contain scripts or other resources that the model can use to kind of do a pretty specific task. So, the example I always use is like maybe you have a tax prep skill that kind of defines like, okay, here's all the things that I need to know as an agent in order to do somebody's taxes for like tax year 2025. Has all the like IRS rules and maybe some scripts to like process documents and like fill in a 1040, something like that. Um, in the with the skills API now, you could upload that, you can iterate on it over time and create versions. You can set up default version and then reference those versions with the hosted shell and all that stuff works together super seamlessly. Uh yeah, so kind of a couple curl snippets here of how you might use this thing. Pretty easy. If you already have zip files of your skills, you can just kind of upload them directly to the API. And then the last thing is we've also shipped uh the sandbox agent stuff that we will be talking about today in Typescript. This is a big ask from a lot of folks after we ship the Python version in April. Uh sand the TypeScript version is now available. So if you are a TypeScript user, uh knock yourself out. So, uh, want to switch gears a little bit and kind of go over to the demo. Uh, we're going to be kind of like building a task tracker with the agents SDK and kind of making it aic and then offloading all the work so I don't have to do anything anymore. So, we're going to Oops. Flip over here. And uh I kind of put this kind of like vibe coded this uh kind of like sort of like a linear board. Doesn't really do anything right now, but I can kind of create tasks and uh assign them to folks. So the sort of idea here is that we are going to be doing some conference planning. So you can kind of imagine this is a tool that I might use as a conference planner to put together some talks. So I might have tasks for you know creating program assets or refining somebody's talk or you know uploading stuff to the website things like that. Um my goal here is not to do any work myself. So we're going to kind of automate a lot of this with the agents SDK. So I'm going to create a fake task here and I'm going to say uh you know create program assets and we'll say we can leave this blank and we have a couple of things down here. where we have we can have a assignment. Uh I need to refresh this. Sorry guys, I need to reset my demo. Cool. Okay, there we go. So I can uh I can create a blank task here that's we can say create program assets and uh I can assign this. Right now it's Right now it's just me, but just got a sneak peek at what might be coming in a few minutes. Um, and I can upload files to this. So, if I just go ahead and grab my fake files here, can drag these in. I can create an issue. Uh, and in a pre-agentic world, I would assign this to somebody. Maybe I'd assign it to myself and I would I would do this work, but that's not really that much fun. So, we're going to go ahead and create an agent to kind of help us automate this. So, if I hop back here, uh, we are just going to comment in this agent definition. So, I want to take a sec to kind of walk through what we're looking at here. So, if you've been an agent SDK user in the past, this will look pretty familiar to you. Uh, the sandbox agent is a new type of agent. Uh, it actually is just a subclass of the agent class that you might have been using before. So, a lot of the same things will still apply. A lot of the same parameters are still going to be there. We are having we have a couple new things that we'll kind of talk about in a sec, but suffice to say that we are adding some instructions that are kind of defined up here. Uh, and then we'll be adding some other stuff later. So, if I go back to my demo, refresh, should be able to assign this to the program editor agent. And what I'll say is like, please take a look at these files and uh, edit them for clarity. I'll hit send and then we should be able to see that kickoff. So the first thing that's going to happen is uh because I've configured this first agent to use docker as its sandbox can actually see that docker container running and uh the agents SDK is going to handle uploading these files to that container. So if I pop this might be a little hard to see but uh we can see that a lot of these files have now made it into the container. So the agent is going to be sandboxed to just this folder and we'll have access to all of these files that it can then use to look at. We can as we kind of look over here, we can see that it's executing commands. It's kind of like looking through the files and figuring out what's what. And then it's going to kind of produce some output. And so it just finished and it gave us it just kind of gave us a little log of of what it did and kind of told us that it's missing some data here. That's kind of okay for now. So, uh, this is great, but we kind of want to be able to customize this agent and, uh, kind of give it some like really specific context that is pretty particular to what it needs to do. And a really good way to do that is through skills. So, prior to this demo, I kind of put together this skill that's for a conference, the conference program editor. It's got some stuff in here. Again, this is kind of entirely vibe coded, so the actual content doesn't matter. But what I want to point out is that GitHub is actually a really great place to store skills and kind of addition to the skills API. The reason GitHub is so great is because you can have a lot of this a lot of these uh kind of concerns of producing and editing skills are kind of taken care of for you. So you know if you want to have version control over your skills, if you want to have pull requests so that you can review changes to skills, this is all really great and a pretty common way that people are starting to create repositories of their skills is kind of in Git. And we have first class support for that in the agent SDK. So if I come back here and I comment this stuff back in and hop down, we can see that we have this uh skills capability. And I want to pause for a sec to kind of talk about what a capability is in the new SDK. Capability is kind of uh this object that is able to combine a bunch of different concepts all into one. So if you want to bring in uh tools, if you want to bring in additional instructions, maybe add some stuff to the manifest so that a capability can put things into the computer when it starts up. These are kind of all things capability can do. The sort of default set is file system. So this is the ability for the model to view images and run the apply patch tool. So kind of like edit files in line. And then we also have the shell tool which is actually a combination of of two tools. This is as I mentioned earlier exact command and write standard in. This is this async bash command flow. And then we have compaction which is allows the model to compact its context and keep going even when it exceeds the context window. And this allows the model to keep working for a super long period of time, hours, weeks, technically uh there's no limit to to how long it can work. So these this is kind of the default set. We're also adding the skills capability. And then what we're saying here is that we want to load our skills from this git repo. And so I'm pointing it at this git repo that I just created. Uh we're saying we want to pull from main. If you're not using GitHub, you can actually change the host here, but uh it's we're kind of like using GitHub by default. So I'll go ahead and save this. We'll flip back to our demo. I'm going to ask the question, tell me the most interesting thing from your skill file. Will it send? This is going to kind of spin back up, but I want to take a sec now to actually talk about how we do resume uh like pausing and resuming behavior between sessions at the agents SDK. When you start a sandbox agent, the agents SDK will handle spinning up that sandbox for you based on how you define it. So, if I hop back to the code here, we can go and say for by default, we're using Docker. And so, I'm kind of just providing these two classes here to kind of define where I want to run the agent. In this case, I'm saying I want to use the Docker sandbox client. the and I can provide some options. Different backends might have different options. Docker, I can provide an image. In this case, it's just uh the Python 312 image. And uh you can kind of easily swap this out for any of the backends that that I mentioned before. And we'll kind of show an example of that in a little bit. Um but what's cool is that when that when this task spins down, when it stops, the agent will automatically stop that container and then we'll take a snapshot of the file system and put it somewhere you define. Right now that place is just my laptop. So if I go over here uh we can see some of the snapshots. So this is kind of the a couple tasks that I recently created. And if I untar one of these and look, it's basically just the exact file system that the model was using. So when I resume that task, it is grabbing that tarball from the place that I defined, spinning up a new container if the last one doesn't exist anymore and then rehydrating that file system onto that new container so that it to the model looks exactly the same as it did when it left off and the model is actually unaware of the fact that it's running on a new container is totally oblivious to that. And this makes resuming really easily easy and seamless. And if you don't want to use your local file system or tarballs, we have sort of provider specific ways that you can do this. So if you're a modal customer and you use volumes or their snapshot concept, this stuff plugs right in super easily. But kind of by default, we're using this sort of like naive tarball approach to kind of help make this easy across any provider. So uh we can kind of go back and if I refresh should be done. And it was able to kind of like look at the skill file. If I look at the output here, we can see that it probably was kind of graping over this skill file. Uh, it did quite a bit quite a bit of stuff. So, um, but it was able to kind of give me some output here, kind of, uh, telling me what was the most interesting thing here. So, now we have an agent. It can use a skill. We want to kind of hoist this thing up into the cloud. We don't want to be running this in Docker forever. Docker's great for testing, especially on your laptop, because just kind of works out of the box. It uh is a really great place to kind of do local development, but it's a bit hard to deploy in production. There's a bunch of companies that have built really great first class agent sandbox products that you can use that we've added first class support for. In this demo, I'm going to be using modal, but you could also be using Cloudflare, ETB, Verscell, Daytona, Blackel, uh or you know any of the other you can kind of bring your own implementation if if you choose to. So I've already set up a uh modal kind of app here for us. So I'm going to go back to my uh code here and we'll go to step three and we will enable modal and here we're kind of returning this sandbox provider. Um but the kind of the key thing here is just this client and options which are the things that the the agents SDK expects and here we're just kind of providing this modal sandbox client. If I go look at the definition of this thing, uh this just inherits from base sandbox client just like the docker version and we are saying here that uh we are going to use the same image. So that same Python 312 image and then uh in addition we're going to be using R2 for snapshots. So that both our snapshots are now off my laptop and into the cloud. Additionally, the modal sandbox is going to be moving into the cloud as well. So we'll kind of show an example of what this looks like. So save all this. Go back to my here and we'll say uh edit these files for clarity. Go back, load this, and we'll assign it to our program editor and we'll say get started, please. And so now we're starting it. We can see that we're starting a modal sandbox instead. And if I hop over to modal, I can see that now we have a sandbox uh that's starting. Zoom in a little bit if that's hard to see. And this will take uh usually they start up in just a few seconds or around a second. So now it's running. What's happening behind the scenes is the agent SDK is handling sort of the applying all of the files that we have defined onto that sandbox. And so want to take a sec now to actually go and look at this manifest object that we talked about. And a manifest is is pretty simple in concept. You're basically saying this is what I want the file system to look like when the agent spins up. Here's how I kind of attach files to this task in a way that I can share across tasks or share across agents. So manifest is just kind of a simple class. Um we're saying the entries are this tree here basically kind of defining a directory structure and this this is a pretty simple example. So we're just saying we want to kind of establish this base directory structure and then we'll upload files to the sandbox directly. But you could also here plug in things like files if you want to copy things from, you know, wherever the harness is running. Or if you want to plug in an R2 bucket, S3 bucket, an Azure blob store account, all these things are fair game, a GitHub repo, kind of a whole flexible set of primitives here that allow you to put the files on the container that the model needs in order to kind of get started. And what happens is that we render a version of the file tree to the model when it starts up so that it can kind of see an example of its workspace. Uh so that it doesn't really have to do too much grepping or lsing to understand what's on the file system. And so you can kind of add descriptions to these things too, but we won't get too far into that. Um so now if I go back to the demo, we can see that this finished. Uh and what I want to show here is that now if I go to my Cloudflare account and refresh, we can see that we now have snapshots that are stored in the cloud. So if I want to resume this task and I dep have this app deployed in production, it's really easy for the the agent to just kind of like pull its context for the agent to pull its context and the snapshot from R2 instead of you know anywhere else. Uh it just provides like a fast pretty clear interface for this and so just want to give a quick demo of that. So now uh we kind of want to we have this agent it's deployed in the cloud. It's pulling its snapshots from another cloud provider and we want to go ahead and add our own tools into it because you know bringing our own context and our own functionality into agents is is pretty critical. So if I go back to my program editor agent, I'm going to comment this back in and we have some predefined tools that we built sort of before this demo. There's there's three of them. We have one for updating the status of a task. We have one for updating the assenee of a task. And then we have one for searching all of the assignees. So, uh, if you've used the agents SDK before, function tools work exactly the same way as they always have. They're essentially just Python functions or in TypeScript, TypeScript functions, and you decorate them using this function tool decorator. And what this does is it allows us to translate this function call into sort of the API representation of the function tool with all the parameters uh, clearly defined so the model knows what to call. And then when the agents SDK receives from the API an instruction to call a specific tool, it routes it automatically to this function, calls this function and then sends the response right back. So you don't have to sort of like be implementing this function calling loop yourself. Uh and so we will uh just show a quick demo of this. I'll go back here and I'll say um please assign this task to Steve and mark it as ready for review. Let's send. While this is spinning up, I'll kind of cover some of the other cool things you can do with the function tool. Um kind of a bunch of things you can do here. So if you want to kind of override the name, like what the model sees is different from what the name of the function is, you can do that. You can provide a description to kind of help steer the model to how to use this tool. You can provide guard rails around uh whether you know if the model's call trying to call the tool with specific arguments. You can add hooks around that to kind of you know control the execution flow. You can add timeouts and things like this which is which is pretty cool. Um and you can kind of like set whether it's enabled kind of dynamically maybe based on a feature flag or something else. Uh so you don't have to like comment and comment out these these lists of tools which is which is pretty cool. So go back to our demo here and we should see it uh assign the task to me which is cool. So we can kind of see like using these tools and this is a great way to sort of bring your own context or your own functionality from your app into your agent whether maybe you have want to give it access to your database or want to have it be able to you know uh use Slack through an MCP or something like this. These are kind of like great ways to bring in your own functionality into the SDK and into your agent. So, uh, now we are going to talk a little bit about like tool call approvals. So, ideally, um, there are certain tools that maybe kind of require a human of the loop step in order for you to have confidence that it's kind of doing what you ask and not going off the rails. If you have been a Codex user in the past, you're pretty familiar with this. It tends to ask you a lot, do you want me to run this bash command or am I allowed to access the internet? This sort of thing. You can do this in the agent SDK pretty easily by just providing a function. Uh you can I can either say you know true directly or I can provide a function that returns true or false dynamically based on the parameters. And so in this example we've said if you're trying to set the status to done then it needs my approval first. I'm not going to let you mark something as done without taking a quick look at it first. And so if I refresh this and go back and we'll go back to this other task we have and I'll say please mark this task as done. We'll send that and this is going to spin up a sandbox rehydrate the state. Uh it will kind of like examine its current state and then it should call the tool and then we should kind of see a tool call approval um pop over and over here. Uh while this is working another thing we can kind of show is the ability for the agents to kind of hand off to each other. So let's say I want to, you know, I kind of have this built-in mechanism here where we have the ability to search and update the assigne. And this is great for agents to kind of work together. So I have two agents now in my system. I have one for producing assets. Um, and then I have one for just like the program editor agent that we looked at earlier. And so I can go ahead and create an issue new issue here and say like uh please um refine and uh this content and produce some assets. We'll hit uh drop some files in our same set and then we'll assign this to the program editor agent and we'll say like please uh refine this content and then assign to the what's the other one called asset producer agent and we'll create that. Then we'll kick this task off. the program editor agent will be the first one to pick it up. It will kind of do its thing and then it will uh reassign this to the asset producer agent and then that agent will kind of kick off and then you know be able to create some assets and then put them back in the sandbox. And so kind of coming back to my uh previous task here it looks like this marked as done. If I go to activity here I can see that we now have this request uh for approval. And if I go ahead and approve this, then this will kind of kick the flow back off and then it will eventually kind of get marked as as done here. So, uh let's see. I think the last thing I want to show here is sort of the ability to to mount uh external buckets to your manifests in a way that uh allows you to kind of bring in external data. So, what are the reasons you might want to do this? Let's say you have either a ton of files, let's say you have like hundreds of PDFs that you need the model to be able to work through and it's kind of annoying to just, you know, have to copy all of those to a sandbox every time you want to spin it up and have it do some work. Or let's say you have data with a a strong freshness constraint, so the data is changing all the time and you want the agents to be able to work over data that is like super live. In either one of these cases, mounting data from a centralized source of truth instead of copying it in where it might already be out of date by the time the agent starts working is pretty critical. And we've built that natively into the agents SDK. So if I go to uh if I go here, what I'm going to do is kind of comment this in. And we've built this uh R2 attachment store. And so this is kind of doing a couple things. This is a taking our uploaded files and then instead of just putting them on my laptop and then copying them to the container, we're actually now uploading them straight to R2. So that's going to be our source of truth just for our application layer. And then back in the manifest, oops, uh we're saying if we are using this R2 attachment store, then we're actually going to use a different manifest here. And we have a similar layout. So we have task and then these four folders. But instead of the input being an empty directory that we then copy files into, we're instead saying I want to use this S3 mount type. And we're technically using an R2 bucket, but we're using the S3 mount. The reason is is because R2 and S3 are API compatible. So we have one object for them both. But as we look through the parameters here, we can see we're using this R2 bucket name passing passing we're passing the access key and secret key. And then we're using this uh modal cloud bucket strategy. A lot of these sandbox providers, modal included, have native ways to mount external volumes to their sandboxes. And so we're using the sort of built-in one for modal. But if you're using a Docker container or sort of any generic container, we provide a couple out of the box. You can either use our clone or fuse. And these both work equally well. And it will depend a little bit on your systems which one will work better for you, but they're both provided out of the box. So pretty easy to just plug this stuff in. And so now uh I will go and create another issue here. We'll say edit these assets again please. And we'll grab our files. Drop these in. Hit create. And we should see so this has been I didn't assign this but we'll sign in a sec. Uh what I want to show here though is now we have if we go back to our bucket we now have tasks uh and we have a a reference to our task here that has all the files in it. So now if I go back and then assign it to our program editor, program editor and say get started, hit send, it's going to be able to we'll watch this thing spin up and then it will be able to read and write to this volume just like it would if it were on the file system local to the container. So you can kind of bring in these places of context of you know file storage from kind of anywhere across the internet and the model will be able to access it as if it were on the local file system. It's completely opaque to the model that it's actually over operating over the network. So there are some trade-offs here. You might incur some additional latency as the model is kind of like lsing maybe a large bucket that takes you know a long time to list all the files through. Uh, but there's sort of this, you know, really nice trade-off that you get where you get to, you know, maintain your freshness constraint or have it work over a ton of files without having to sort of like copy all them all onto the container where you might be sort of like limited in space. Um, so cool. We can wait a sec for it to see the spin up, but that kind of brings us to the end of the demo here. Uh, you know, we kind of covered a bunch of cool things. Uh we showed how we can build agents that both inspect files, spin up sandboxes, and kind of any environment that you choose, maybe one that you're already using. Uh agents that can edit code and kind of like work on these long horizon tasks for maybe minutes or hours or longer uh over, you know, whatever period of time that you want and whatever context you want. So yeah, with that, I think we'll do some questions. >> Yeah. >> Cool. >> Okay. Um next slide. I'm really excited to welcome Nish on stage with us. So Nish is product for agents. Nish, do you want to give a quick intro? >> Hey guys. Uh yes. So product manager on agents. Uh really excited about the stuff we're doing with agents SDK. I've been answering some of the questions in the channel. Really great questions and uh yeah, excited to get into it. >> Awesome. Okay. Um next slide here. Perfect. Um, so the first question, when are harnessed based agents best versus building with just responses API? >> Cool. Yeah. So I think that if you the responses API is great for a huge variety of tasks, you know, I think the places where it really shines is and it's an agentic model by default and or agentic API by default and we kind of talked about that in the responses API build hour from October. um it really shines I think if you are using if maybe you're using a language where you can't find a really good uh sort of agentic framework that you like. So maybe you're building an elixir, maybe you're building in HASLL or something a little bit more say esoteric, you can always just build on the API. You can kind of recreate a lot of these concepts yourself. It might be a little bit more work. Um the you know sort of like harnesses are becoming uh sort of really key to how the models operate. So in particular, the codeex harness with openAI models is sort of like becoming more and more coupled as time moves on. And so you know in the future, it's very very possible that you will be able to get the best performance from the harnesses that the models are trained with. And uh so that's where kind of the harnesses are are really going to shine. Um and also if you you are using uh an environment where you can kind of drop these things in, it just is making if you're building agents, it's going to make a lot of this stuff super easy. Um, however, if you're doing, you know, kind of oneshot tasks like maybe a translation use case where you just need to like just oneshot translate a document or, you know, taking unstructured data and structuring it into JSON or some other format, sponsor's API just out of the box is going to be really good at that stuff. And you can use all the hosted tools to kind of have these like really light agentic loops that are uh pretty easy to to orchestrate in just one API call. >> Cool. >> Okay, next question. Does the agents SDK feature any built-in out ofthe-box persistent state management to handle this pause and resume behavior natively? >> Yeah, totally. So, I showcase a little bit that snapshot mechanism. That's totally out of the box. That works if you just start using it. Um, the only thing you have to define is kind of where you want to store the snapshots, whether it's in R2 as I mentioned or just kind of local to the to disk which is the default. Um, and so this is kind of that, you know, really out of the box thing where it will take one sandbox's file system and then zip it up and then put it somewhere else. Uh, so that another sandbox can resume from the point where it left off. Um, but also the run state is sort of managed. So you can think about two pieces of state here. It's kind of like the file system and then all of the messages in that conversation so far that the agent will need to continue from agents SDK manages handling kind of like both of those pieces in in one shot. And so you can take that thing that has both a pointer to where the snapshots stored and then also the full roll out and then you can store in your database. It's just a JSON object and then load it back up later and then you can kind of resume losslessly maybe if you have sort of a multi-node system or something like this that just has the database to share state. >> Nish anything to add there or we can move on to next question. Yeah. No, I think that covers it. Like essentially we store the roll out and we also snapshot the file system which is uh a little bit unique compared to other SDKs. Um so you should be able to uh resume very like long running uh dayong agents pretty easily. >> Cool. Okay. Next question. How would you compare the code executing harness in the agents SDK via the sandbox agent versus the actual codeex harness? Any capability? Uh Nish, you want to take this one or? >> Yeah. Yeah. Um so there are some sort of small uh differences between the sort of like actual codeex binary and the agents SDK. Agents SDK does come with a lot of those core capabilities. Uh some of those are like the bash tool. Uh the way codeex edits files, skills, memories. Um there are some things in the codeex binary that uh we are still going to like build into agents SDK eventually. uh the codex binary is like this sort of like multi-threaded thing that you can run uh multiple different sub aents in agents SDK we have this concept of handoffs so you can do something like very similar in agents SDK you should be able to do like a lot of the things that you can do in the codex binary and agents SDK um but yeah ultimately like agents SDK is like very much uh in distribution uh through the Codex harness today. >> Awesome. Thanks Nish. Um, next question. Can agents persist and restore longrunning workflows automatically? >> Yeah, definitely. This is kind of what I showed in in the demo. It's kind of like very easy with very little. We didn't look at a lot of my orchestration code, but there's not really that much to it. Just kind of kicks off the SDK, stores the the roll out, that JSON file in a file just on the laptop, and then is able to kind of resume uh pretty easily. So yeah, I think mostly most of the sort of engineering that you'll do is figuring out what tools to bring in, what what skills should I add to the agent and uh the sort of orchestration of starting and stopping and resuming tasks is is sort of uh just out of the box. >> Okay. Can the sandbox get auto destroyed after the task is completed or is it a better pattern to keep the sandbox running? >> Yeah, this is a good this is a great question. Uh so I I actually only demoed one mode of the ages SDK. But there's two ways that you can use sandboxes. You can do the sort of what we call um you know SDKowned sandbox where when you spin the task up if you just provide it the recipe for creating the sandbox. So you provide it one of the clients like a docker client or an ETB client it will handle spinning up that sandbox for you and then when the when the turn is done it will spin that sandbox down. But you can also eagerly load a sandbox. So you can kind of create a sandbox out of band and then pass the reference of the sandbox to the agent and then that's what we call like user own sandbox and so you would handle managing the lifetime of that thing. It the SDK won't automatically spin it down at the end of the turn. So you can kind of keep it warm and then use it across like multiple turns instead of spinning it up and spinning it down. Um, but the out of- thebox behavior is that we'll handle spinning it up and and and down so that you don't end up in a situation where you've built your demo and then all of a sudden you have 100 sandboxes running in the cloud that you're paying for and it's kind of providing a bad outcome. So there's a bunch of different ways that you can use this. Um, but yeah, by default we the HSDK will will handle the lifetime. >> Hey, how should I populate and persist my file system? Yeah. >> Yeah. Happy to take this one. Um, so essentially we have this concept called the manifest uh in agents SDK. Uh, the thing that I really like about the manifest concept is you can kind of choose how you want to populate and persist uh the file system into the sandbox. Uh, so you can essentially copy over all of your files from cloud storage into the sandbox or you can mount uh, you know, all of those files uh, onto the sandbox. And there's sort of like different benefits of of either if you copy over everything like during run the runtime the the model will be able to read the files a lot more quickly but you know startup will be slower because you are copying like sort of this big blob uh into the sandbox and there's sort of the second world where you can optionally mount something on the sandbox and uh startup will be like uh very very fast but like over the runtime it has to like read these files over the network. Um, but the nice thing about agents SDK is we kind of like let you do multiple different ways to kind kind of like mount or copy over things in the sandbox and then persist that uh after the sandbox spins down. >> Awesome. Thanks, Nish. >> Okay, last question here. Can a supervisor agent monitor and coordinate hundreds of specialized agents simultaneously? >> I think the answer is yes. It would definitely it would take a little bit of work to kind of spin this up. And we have some uh cool stuff around sort of like multi- aent frameworks that's kind of coming in the near like weeks and months. Um so it will get a lot easier to do this in the future, but uh there's nothing kind of stopping you from doing this now. You can kind of, you know, go crazy with agents and spin up a ton and then have one that can kind of check on the status between them. I think like ways we've seen good coordination patterns is either by through messages. You can give the parent agent a tool that allows it to understand what sub aents are running and then check in on them. So kind of like check in on their progress or you can sort of like have them all communicate via the same file system or like a database or something like this with kind of a variety of communication mechanisms you can use to coordinate them. But yeah, nothing stopping you from from doing this. And um you know, I think that this sort of massively parallel work will become sort of the default in the near future. >> Love that. Okay, so let's wrap up with some resources. So on the screen here are everything you need to know to kind of get started with agents SDK. Um we had a ton of questions also in the chat about your demo and where they can where they can find this. So on the very bottom is our build hour repo. Um we have all the code from the last build hours and we'll be we'll be posting that um very shortly right after right after we're off camera. Um then for any upcoming build hours, feel free to check back on our homepage. Um so if you hit the next slide, we show you just the link again. Um and after this session, we'll email out a survey. Let us know what topics you want to hear about next. And we'll also include the link for this recording so you can catch up if you missed the beginning. Um you can catch get up to speed um there. And that's all for now. Thanks so much for joining us and we'll see you next time.