8news



Giving coding agents their own computers: how Cursor built cloud agents

Anthropic • Claude • May 8, 2026 • 14:25

INTRO

AI development is shifting from making models smarter to building systems that let autonomous agents operate, learn, and ship effectively at scale.

KEY POINTS

The bottleneck is shifting from models to infrastructure

Progress in large language models has made raw intelligence less of a binding constraint. The limit now lies in humans' ability to provide tools, context, and structured environments for autonomous action. This refocuses engineering on enabling systems rather than executing work directly.

A three-phase evolution of agent systems

Development has followed three stages: first, equipping agents with tools and context; next, adapting workflows to more capable models; finally, building self-improving systems that reduce human intervention. This last phase aims for systems capable of solving entire workflows.

Onboarding agents like human developers

To improve performance, agents are onboarded like engineers: with access to environments, documentation, and operational context. A cloud onboarding flow lets them explore the code, configure the environment, and run applications before making any changes.

Tools sharply boost agent productivity

Purpose-built tools (command-line interfaces, etc.) let agents manage services, monitor system state, and interact with external platforms. This cuts idle time and improves reliability, creating a loop in which better performance encourages more usage.
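One concrete pain point the talk describes is agents sleeping for fixed intervals while services start. A minimal sketch of the kind of helper such a CLI could expose (the helper name and probe are invented for illustration, not Cursor's actual tool):

```python
# Hypothetical agent-facing helper: instead of sleeping a fixed time,
# poll a readiness probe until the service is up or a timeout expires.
# `check` is whatever probe fits the service (HTTP ping, open port, ...).
import time
from typing import Callable

def wait_for_service(check: Callable[[], bool],
                     timeout: float = 60.0,
                     interval: float = 0.5) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True        # service is ready; no wasted sleeping
        time.sleep(interval)
    return False               # caller can report a startup failure

# Simulated service that becomes ready on the third poll.
state = {"polls": 0}
def fake_probe() -> bool:
    state["polls"] += 1
    return state["polls"] >= 3

ready = wait_for_service(fake_probe, timeout=5.0, interval=0.01)
```

Returning a boolean rather than raising lets the agent treat a startup failure as a reportable issue instead of an unhandled crash.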

Principles of agent autonomy

Autonomy rests on visibility ("eyes"), actionable tools, and quality inputs. Agents must be able to observe state, access interfaces, and interpret changes in real time. Output quality closely tracks the quality of the inputs and of the codebase.
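Those three principles can be sketched as a tiny interface (all names here are invented for illustration): the agent observes before acting, acts only through declared tools, and logs what it saw so humans can audit the run.

```python
# Illustrative sketch of the autonomy principles, not Cursor's API:
# "eyes" (observe), actionable tools, and a reviewable trace.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class AgentStep:
    observe: Callable[[], str]               # "eyes": current env state
    tools: Dict[str, Callable[[str], str]]   # actionable interfaces
    log: List[str] = field(default_factory=list)

    def act(self, tool: str, arg: str) -> str:
        state = self.observe()               # always look before acting
        result = self.tools[tool](arg)
        self.log.append(f"saw={state!r} ran={tool}({arg!r}) -> {result!r}")
        return result

step = AgentStep(observe=lambda: "service=up",
                 tools={"echo": lambda s: s.upper()})
out = step.act("echo", "deploy ok")
```

The log line is the part that matters for debugging: it records what the agent could see at the moment it acted, which is exactly what a human needs when a run goes wrong.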

The rise of computer-use models

A new class of models operates graphical interfaces via mouse and keyboard. Unlike coding, this demands adaptive reasoning closer to a video game: partial information, irreversible actions, dynamic environments.

Parallelizing and scaling the work

Developers increasingly delegate tasks and whole projects to concurrent agents. Tasks are assigned to them directly, with results delivered alongside demos, which speeds up iteration and reduces raw code review.

Cloud environments improve security and adoption

Running agents in isolated environments reduces the risks around sensitive data and system access. It also simplifies resource management and lets developers focus on orchestration, strengthening both productivity and trust.
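One way to picture the isolation benefit (a minimal sketch, assuming a simple environment allowlist; real cloud sandboxes go much further): run the agent's command in a child process whose environment has been scrubbed, so secrets from the parent shell never reach the agent.

```python
# Sketch: run a command with a scrubbed environment so parent-shell
# secrets (API keys, tokens) are not visible to the child process.
# The allowlist below is illustrative, not a complete sandbox.
import os
import subprocess
import sys

ALLOWED_VARS = {"PATH", "HOME", "LANG"}  # illustrative allowlist

def run_isolated(cmd: list) -> subprocess.CompletedProcess:
    clean_env = {k: v for k, v in os.environ.items() if k in ALLOWED_VARS}
    return subprocess.run(cmd, env=clean_env, capture_output=True, text=True)

os.environ["API_SECRET"] = "hunter2"  # present in the parent only
proc = run_isolated([sys.executable, "-c",
                     "import os; print('API_SECRET' in os.environ)"])
```

An environment allowlist is only one layer; the talk's point is that doing this in the cloud, rather than on the developer's own machine, removes a whole class of worries at once.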

Failure handling becomes central

Debugging failures and fixing root causes is essential. Unresolved problems replicate at scale and hurt adoption. Fixing them yields compounding gains and improves reliability.

Self-improving agent ecosystems

Systems are emerging in which agents report, categorize, and contribute fixes for issues. Problems are bucketed (technical gaps, permissions, knowledge) and resolved jointly by humans and agents, with less human intervention over time.
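The "system of record" pattern can be sketched as a triage pass that dedupes reports and buckets them into the three categories the talk names. The category keywords below are invented for illustration; a real system would use a model for classification.

```python
# Sketch of issue triage: dedupe agent reports and bucket them into
# permission gaps, knowledge gaps, and (by default) technical gaps.
from collections import defaultdict

CATEGORIES = {
    "permission": ("denied", "forbidden", "no access"),
    "knowledge": ("unclear", "unknown", "how do i"),
}

def categorize(report: str) -> str:
    text = report.lower()
    for category, keywords in CATEGORIES.items():
        if any(k in text for k in keywords):
            return category
    return "technical"  # default bucket: a fixable technical gap

def triage(reports):
    buckets = defaultdict(set)  # a set dedupes identical reports
    for r in reports:
        buckets[categorize(r)].add(r)
    return dict(buckets)

buckets = triage([
    "Access denied when reading the secrets store",
    "Unclear how do I seed the test database",
    "Flaky startup: port 8080 already in use",
    "Flaky startup: port 8080 already in use",  # duplicate report
])
```

The point of the buckets is routing: permission issues go to humans who can grant access, knowledge issues become documentation, and technical issues can be handed straight back to agents to fix.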

Iterative validation and reliability testing

Rather than single attempts, agents launch parallel runs to test the robustness of a fix. This builds confidence before human review, especially for complex or intermittent cases.
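The parallel-validation idea can be sketched in a few lines: a candidate fix for a flaky problem is exercised many times concurrently and only trusted if every run passes. The `run` callables below stand in for full agent runs.

```python
# Sketch: validate a fix for a flaky problem by running it many times
# in parallel and requiring every run to pass before human review.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def validate(run: Callable[[int], bool], n_runs: int = 21) -> bool:
    """Run `run` n_runs times in parallel; trust only if all pass."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return all(pool.map(run, range(n_runs)))

flaky_fix = lambda i: i % 7 != 0  # fails intermittently (every 7th run)
solid_fix = lambda i: True        # holds across the whole eval set

flaky_ok = validate(flaky_fix)
solid_ok = validate(solid_fix)
```

A single attempt would accept `flaky_fix` six times out of seven; the parallel eval set is what catches the intermittent failure before the PR reaches a reviewer.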

"Agent experience" becomes a priority

Like developer experience, optimizing the agent experience is becoming key. Systems track friction points and continuously refine workflows, creating ongoing improvement.

CONCLUSION

As AI matures, the focus shifts from smarter models to autonomous, self-improving ecosystems able to scale work reliably with minimal human oversight.

Full transcript

Hello. Okay, so models are getting really good, and for more and more work the bottleneck is no longer model intelligence. The bottleneck is humans giving the models the tools, the context, and the increasingly ambitious tasks and objectives to go flex their potential. At Cursor we've been working on removing that bottleneck, and I'm going to share a bit about how we've been doing that; hopefully it's helpful to some of you in your work. With models this good, we see our job as setting our agents free, safely, and having them go off and work on bigger and bigger tasks. We've gone through three stages of that process. The first was giving our agents the tools and the context to be more autonomous. The second was learning how to take advantage of these more capable models; there's actually a lot of work in figuring out how to update your patterns and behaviors. And as you do that, as you're leveraging more agents and getting more work done, you start to see the places where there are stresses and strains, where the agents fall over, and you work to improve that. That brings you to the third stage, which is building the system that builds the system. Instead of spending your day handholding agents from task A to D, you take that time to build the system that can solve for A to Z. So we started thinking about how to make our agents more capable, how to make them work more like we do. We started by looking at what we do to onboard humans to Cursor. When a developer joins Cursor, they get a computer, we help them set up their dev environment, and we have a ton of documentation, maybe too much, about how to do all the different things developers need to do in their daily work. So we give them all this context and all this material and, crucially, their computer.
If you think about what we were doing with the models, the onboarding experience was: throw them into a codebase. From the model's perspective, you have thousands of lines of code whizzing by, you don't know what you're doing, you just have some task you have to go solve. It's kind of incredible that this works as well as it does. If I had to work this way, just sight-reading code without being able to test the application, I would also probably frustrate and annoy the humans who had to do the testing for me. So we decided to build an onboarding agent, and we built a cloud onboarding agent. Anybody can go to cursor.com/onboard, set up your repo, and the cloud onboarding agent gets to work: it starts exploring your codebase, not to make a change but to figure out how to run it. This is a little bit of how it goes. It's exploring, it's looking around, and depending on how fast this one's playing, you'll see it start using the app, and it will come back with a demo. It's not just figuring out what to run or what the services are; there are complex things it needs to do, like figuring out the different environment variables and what permissions it needs. So it's an interactive process with the developer to make sure the services are running properly. Once we had the agents onboarded, we realized there was still quite a bit of work to do, because you have all these developers running many cloud agents every day, so any problems are multiplied out across all those runs. And every time the agents run, they have to start their dev environment from scratch. When you're doing local dev, you probably leave things running; not true for the cloud agents. So we wanted to optimize their devx.
They were spending a lot of time sleeping, waiting for things to start up; they didn't have good ways of waiting for things. So we built an internal CLI tool. With it they were able to start services, wait for the services to start, and check on statuses. They had a sort of Swiss Army knife for things like creating test accounts and signing in to third-party services. What we saw was that as we made the dev environments better, more developers started running more cloud agents and getting more out of it, so a positive feedback loop starts to take place. And of course we also had to give them all the documentation that we give our human developers. We made simplified versions of it, so that when they find themselves in edge cases or hard problems, they have the materials there to lean on. In doing this, we derived some basic principles of autonomy. The first is that you have to give your agents eyes: everything that you can see, the agents need to be able to see too. The basic version of this is that if the agent's running an app, you should be able to see that app; and if you go make changes, use that app yourself, and change something in the state, the agent needs to be able to see what you changed. Another interesting one we found is agent chats: when you wind up needing to debug what happened in a chat, you want the agent to be able to see the other agent's chat just like you do. The next thing is giving the agents tools so they can do all the things that you can do, with reasonable security constraints. A basic but really important one: they should be able to run the applications that you can run, and use the services that you can use. And agents are autoregressive.
So it is also very important that you have a high-quality codebase, instructions, and everything else, so that high quality in gives you high quality out. The foundational primitive, we believe, for agent autonomy is computer use. The first domain where agents have really excelled has been coding, and we think computer use is the next really important domain. This is raw pixels in, mouse and keyboard out, and the Claude family of models is really good at it. The hard part of this is not clicking in the right place. The hard part is better explained like this: if coding is like chess, where you can see all the pieces out on the board, navigating these GUIs is more like a video game, where you can only see a little slice at a time. There are one-way doors; there are game-over states you can get into. So you need this higher-level metacognition and backtracking, these general-intelligence skills that the Claude family is really good at. Claude 4.7 is our computer-use model. Here's an example: it was told to go off and build this private marketplace feature. It goes off and records a demo where it's implemented all the little pieces; we have this URL input, and there are CSVs it needs to add. And what's great is that it's not just using that to do its own end-to-end testing to make sure the changes it made work. You as the developer get a really high-bandwidth way of reviewing the agent's work before you get into the code. That becomes really valuable when you're running many of these agents simultaneously in the cloud and bouncing between them. So that's the next thing: once you have these autonomous agents onboarded, you need to learn to give them more work and bigger and bigger challenges to tackle. There are two basic modes.
The first is that you have a lot of tasks and bugs and issues, and you have to learn when to just stop putting those in your notes or your to-do tracker and kick off prompts instead: put them right into a prompt box and kick them off. When you do, you then have a bunch more of these agents running, and again, it's great to have those demos come back instead of having to review lots and lots of code. The second pattern is much bigger projects, where you can give a cloud agent, an autonomous agent, much larger units of work and have it work for much longer. One of the things we were surprised by: we knew that if we made the agents more powerful, developers would do more exciting things, but we were surprised by the extent to which this was true. And one of the really important things was the security that the cloud provided for developers. There's this idea of security through freedom, which sounds like Orwellian propaganda, but in the same way you're setting your agents free to go take things on in the cloud on their own, you also set yourself free as a developer: you don't have to worry as much about resource management and context switching, and you don't have to worry about the agents going and doing things they really should not with your environment variables. There's a way in which we were all surprised by how much more enjoyable this made programming. A really important skill that we are still working on learning: when the agents fail, it's really worth taking time out to figure out what went wrong, debug it, and implement a fix that's again going to be multiplied out across your whole team and your company.
And it's the same dynamic in reverse: failure modes, if they persist, make the agents compound failure across the entire company, and people don't want to use the agents as much and don't lean on them as much. But the inverse is also true. When you invest in them and make them work better, everybody wants to use them more. People build trust with the agents, they find they can one-shot bigger and bigger tasks, and then they want to invest more. So this is a really important change in how you think about your work, where you're programming the system. So: you've onboarded your agents, we gave them the tools to make them really powerful, we started learning how to leverage them and scale them up, we were seeing where they fell over, and we started working on this loop of how you teach them to work better, and we realized that it starts to feel like coding again. So of course we thought, well, the agents should do this. We've got a few ways we're trying to have the agents do this. The most important one, and the one I'm going to talk about today, is agents iteratively improving their own workflows. We call this the agent experience. There's the developer experience; well, there's also the agent experience, and you need to care just as much, if not more, about the agent experience. The way our system works is that agents go about their business and are told to report issues as they come up. It's really just the pattern I was describing for humans: if you see something wrong, say something, report it, and try to work on a fix. All those reports are accumulated in a system of record. Again, it's very similar to how human systems work.
Managers will go in and review all the issues, categorize them, dedupe them, and bucket them: there are technical problems that can be addressed; there are issues of permission, where the agents just don't have access to something, so they need to get permission; and there are issues of ignorance, where the agents don't know what to do and need the humans to come in and tell them the right way to solve the problem. Then, in the last step, these are fixed by both agents and humans, and the goal over time is to have less and less human involvement, so that the issues the agents solve are solved end to end. We don't really need to review; we can have really high trust in their work, and they have all the context they need and don't need anything else from us. I'm going to go into a couple of examples. This is our most important skill: the WTF skill, which of course stands for Work on The Factory. Every cloud agent gets this skill, and it's almost the exact same prompt I was just showing on the previous slide. Work on the factory is the idea that when something is annoying, broken, or confusing, you take a moment to report it so we can improve the tools and workflows, rather than just grinding through. Models a while ago would not have done a very good job of this, but today they get annoyed, they find things broken, they get confused, they realize it, and they report it. So we get this system of record where they go and report all those issues. Of course, you then have to solve the issues, and that's its own really important challenge if you want this whole system to be self-improving.
And we were finding, with these agent development experience problems, that having the agent just go solve them and try to one-shot them was not that effective, because there's a lot of flake in those problems: something only happens once in a while. So we taught the cloud agent that goes and solves the problem to kick off a bunch more cloud agents with the change to the development experience, and to make sure that across that eval set there's a robust solution. Then, when the PR comes back to the humans to review, there's a high degree of trust that it is a well-validated solution. I think those are both really good examples of how these skills can work as background garbage-collection or cleanup processes in the operating system. And while these are clearly applied to the coding domain, in the same way we speedran a lot of AI development in coding, these patterns are going to generalize well outside of coding into other domains. Even if you think about it already, cloud development experience is not really a coding problem; there are a lot of variables in it, and very few of them are just coding issues. Okay, so now the thank-you slide. If you'd like to talk more about self-improving agentic systems, please talk to me; you can find me on X at my name. And if you'd like to set your agents free, you can go to cursor.com/onboard and have Claude onboard to your codebase and maybe ship your next ambitious feature. So, thank you very much.
