
Anthropic emphasizes that scaling "inference-time compute" via effort controls and adaptive reasoning substantially improves AI performance on complex tasks, while introducing new cost and latency trade-offs.
Progress in large language models increasingly depends not only on training scale but also on inference-time compute, where models spend more time reasoning at execution time. More compute at this stage improves performance in domains such as software engineering, academic reasoning, and agentic search. The results show a direct correlation between increased inference time and better outcomes.
Experiments with Claude Opus 4.7 show that raising the effort level measurably improves results. A low-effort mode produced a basic traffic simulation in about 50 seconds using 4,600 tokens, while higher effort doubled both time and tokens but yielded more realistic behavior. At maximum effort, the system used roughly 10x more compute, generating markedly better graphics and complex driving dynamics.
The system distributes compute across three categories: thinking tokens for internal reasoning, tool-call tokens for interacting with external systems, and text tokens for communicating with the user. Together, they determine how the model plans, acts, and responds. Managing these tokens well is essential for optimizing both performance and user experience.
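A minimal sketch of how usage might be tallied across these three categories; the block types and the (type, count) pair shape are illustrative assumptions, not a documented API:

```python
from collections import Counter

def tally_usage(blocks):
    """Sum token counts per block type ('thinking', 'tool_call', 'text').

    `blocks` is a list of (block_type, token_count) pairs, a hypothetical
    simplification of a real response stream.
    """
    totals = Counter()
    for block_type, tokens in blocks:
        totals[block_type] += tokens
    totals["total"] = sum(tokens for _, tokens in blocks)
    return dict(totals)

# Example: a response that thinks, calls a tool, thinks again, then answers.
usage = tally_usage([
    ("thinking", 1200),
    ("tool_call", 300),
    ("thinking", 400),
    ("text", 250),
])
```

A real integration would read these counts from the provider's usage metadata rather than computing them client-side.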
New adaptive thinking capabilities let models decide dynamically when and how much to reason, rather than following fixed sequences. This replaces approaches where reasoning happened only at the start of a response or between tool calls. The result is more flexible behavior that interleaves reasoning, tool use, and communication in real time according to task complexity.
Users can influence outcomes through an effort setting that adjusts the compute allocated. High effort generally improves results but increases latency and cost, while low effort favors speed and efficiency. A complementary feature, task budgets, sets hard limits on tokens, time, or cost.
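As a concrete illustration, a request combining an effort setting with a task budget might be assembled as below; the `effort` and `budget` field names and the set of effort levels are assumptions for illustration, not a confirmed API schema:

```python
def build_request(prompt, effort="high", max_tokens_budget=None):
    """Assemble a hypothetical request payload with effort and budget controls.

    `effort` trades quality against latency and cost; `max_tokens_budget`
    caps total tokens before the model stops and checks in.
    """
    if effort not in {"low", "medium", "high", "extra_high", "max"}:
        raise ValueError(f"unknown effort level: {effort}")
    request = {"prompt": prompt, "effort": effort}
    if max_tokens_budget is not None:
        request["budget"] = {"max_tokens": max_tokens_budget}
    return request

req = build_request("Simulate traffic at a one-way light",
                    effort="max", max_tokens_budget=100_000)
```

Consult the provider's API reference for the actual parameter names before relying on this shape.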
Although increased effort improves performance, the gains are not always linear. Evaluations show diminishing returns at the highest levels, where additional compute yields only marginal improvement. Medium-to-high settings are often the most cost-effective, especially for coding and agentic workflows.
Low effort does not always mean inferior reasoning. In an evaluation on Pokémon Red, the model adopted speedrun strategies (skipping battles, optimizing item use, minimizing interruptions) to reach its goals faster. Efficiency constraints can therefore drive creative solutions.
Large models at low effort can outperform small models at high effort on complex tasks, offering a better balance of speed and intelligence. Conversely, small models remain advantageous for high-volume, low-complexity tasks such as classification or summarization, and for fast responses.
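A back-of-the-envelope latency model makes this trade-off concrete: total response time is roughly time to first token plus generation time. The throughput and token figures below are invented for illustration:

```python
def total_latency(ttft_s, output_tokens, tokens_per_s):
    """Rough time-to-last-token: time to first token plus generation time."""
    return ttft_s + output_tokens / tokens_per_s

# Hypothetical figures: a small model streams sooner and decodes faster,
# but a large model at low effort may emit far fewer tokens for the task.
small_high = total_latency(ttft_s=0.3, output_tokens=9_000, tokens_per_s=120)
large_low = total_latency(ttft_s=0.9, output_tokens=4_600, tokens_per_s=80)
```

Under these made-up numbers the large model at low effort finishes sooner overall, even though the small model starts responding first.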
Running structured evaluations is recommended, charting performance against cost, time, or tokens to identify optimal configurations. Reviewing outputs in detail is also crucial, since models may take unexpected shortcuts at low effort.
Future systems are expected to extend inference further still, potentially working on problems for days, weeks, or longer. This positions AI as a persistent problem-solving agent, capable of tackling complex challenges over the long term with minimal supervision.
Scaling inference-time compute, through adaptive reasoning and effort controls, is redefining the balance between intelligence, cost, and speed, and is steering AI toward more autonomous and efficient models.
All right. Hello everyone and welcome. My name is Matt Bleifer. I'm a product manager on Anthropic's research team, and today I'll be sharing a little bit about how Claude leverages compute at inference time, otherwise known as test-time compute, to break down and solve some of your hardest software engineering challenges. Along the way, I'll cover the levers you have at your disposal to influence how Claude spends tokens, and I'll share some best practices to help you get the most out of it. One of the key developments in large language models over the last couple of years has been the scaling of test-time compute, creating what we've all come to know as reasoning models. Similar to how we can scale compute at training time by training bigger models over longer time horizons using more data, we can also scale compute at test time by allowing those models to spend more time working on a problem. If you look at the graph on the left, you can see that as we move from Haiku to Sonnet to Opus and the model gets more intelligent, it gets a better score on our agentic coding evaluation. Similarly, in the graph on the right, as that same model, Opus, simply spends more time working on a problem, it gets correspondingly better scores. This is what we mean by scaling test-time compute. And this isn't true just of software engineering; it's true of a whole variety of knowledge work domains, whether it's agentic search, computer use, or PhD-level academic reasoning. If we allow models to spend more time working on a problem, they can achieve better and better results. Looking at charts and graphs is great for understanding the data and the correlations, but nothing beats seeing a tangible example of what this looks like in practice.
So what I did is I ran Opus 4.7 at a few different effort levels, scaling the amount of time it works on a given prompt. In this case, I asked it to create a realistic simulation of cars going down a one-way street at a traffic light. The first result is Opus 4.7 running on low effort. It took about 50 seconds to produce a result and generated about 4,600 output tokens, and I'd say it accomplished something fairly reasonable. We do in fact have cars going down a one-way street, and they stop at the traffic light, but overall it's a pretty basic simulation: the traffic flow is simple and the graphics are limited. And for some reason, Claude thought it would be a great idea to put the traffic light right in the middle of the road, which maybe wasn't the best design decision, but we'll still call it functionally passing. Next, I cranked the effort dial up a bit. When I moved effort up to high, Opus 4.7 took about twice the time and double the output tokens, but it achieved a better result. It has cars of different types, it smartly moved the traffic light over to the side of the road, and Opus told me it even implemented what it called an intelligent driver model, where every car responds more individually to the dynamics of the cars around it, doing a better job of simulating a realistic traffic pattern. Again: twice the time, better results. Finally, I cranked the effort dial all the way up to max. In this setting, Opus 4.7 took 10x the time it did on low effort and 10x the tokens, but it achieved the best result yet.
We have the best graphics, my favorite traffic light of the three, and really realistic driving patterns. This is all an example of how allowing Claude, even the same model, to just spend more time working on a problem gets better results. As we continue to scale test-time compute, Claude isn't just going to work for seconds or minutes or hours on a problem; it's going to work for days, weeks, months, even years, spending tokens to try to solve some of humanity's toughest challenges. When I talk about test-time compute, I really mean any form of Claude spending tokens at inference time to solve your problem. We can break these tokens down into three distinct buckets. The first bucket is thinking tokens. This is the classic form of tokens that underlies what we know as reasoning models. Thinking tokens represent Claude's internal monologue: its space to reason step by step, consider different options, do chain-of-thought reasoning, create a scratchpad where it works through a problem, and ultimately figure out what it needs to do to take the best actions and deliver the best results. The second form of tokens Claude can spend is tool-calling tokens. Tool calling is Claude's way of interfacing with the rest of the world, whether it's executing a search, as in this example giving me more information about the Code with Claude conference, or reading and writing files to build out software engineering projects. There are millions of different tools Claude can call, but in all of these scenarios, tool-calling tokens are Claude's way of interfacing with its environment. The last type of tokens Claude can spend is text, and this is Claude's way of interacting with you.
Whether it needs to give you updates as it works through a really tough problem, give you a summary at the end explaining everything it did, or simply respond to a simple question, text tokens are Claude's way of communicating with an end user. So again, we have three different types of tokens: thinking, tool calling, and text. All three are fundamental to how Claude works and how it responds to problems. But all of these tokens have direct costs to users, in the form of both the token costs we pay and waiting time. When Claude spends more tokens, we as users wait longer for our result. So we think it's really important to give users the ability to influence or constrain how Claude spends tokens. Users can express their preferences and constraints in a couple of ways. The first is the effort dial I mentioned. Effort is a way to tell Claude how you want it to trade off time, cost, and quality when responding to your task. Should Claude spend more time to get a better answer? Should it spend less time to get a faster answer? These are preferences you can give Claude as a user that it will take into account when it spends those tokens. Another form of constraint is budgets. Recently, we launched a feature we call task budgets, which lets you give Claude an upper bound on the tokens it will spend on a task. You might say, "I want you to build out this particular software engineering feature for me, but I don't want you to spend more than 100,000 tokens before you stop and check in with me."
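The check-in behavior just described can be sketched as a simple enforcement loop; the step shape and the numbers here are invented for illustration, and a real agent loop would call the model instead:

```python
def run_with_budget(steps, max_tokens=100_000):
    """Run agent steps until done or the token budget would be exceeded.

    `steps` yields hypothetical (tokens_used, done) pairs.
    Returns (tokens_spent, status).
    """
    spent = 0
    for tokens, done in steps:
        if spent + tokens > max_tokens:
            # Stop before overrunning and hand control back to the user.
            return spent, "paused_for_checkin"
        spent += tokens
        if done:
            return spent, "completed"
    return spent, "completed"

# A task whose third step would overrun a 5,000-token budget.
spent, status = run_with_budget(
    [(2000, False), (2500, False), (3000, True)], max_tokens=5000)
```

The same pattern generalizes to time or cost budgets by charging seconds or dollars instead of tokens.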
Budgets can come in the form of tokens, but also time or cost. And I think this will become increasingly important as we continue up that exponential and Claude works for days, weeks, months, or more: you'll want guidelines for how long it should work on a problem before it stops to check in. Given all of these preferences and constraints, it's up to Claude to figure out the best way to spend those tokens to maximize the outcome. Given the user's effort setting, and possibly a budget, how does Claude allocate tokens across thinking, tool calling, and text to maximize performance and user experience? When reasoning models were first introduced, they followed a very specific pattern for spending these tokens. First they would think, spending thinking tokens to work through a problem; then they would move on to tool calling; and lastly they would produce text. We improved on this when we introduced interleaved thinking, which allowed Claude to reason in between tool calls. In this mode, Claude could call a tool, get a result, think about that result, determine what to call next, and so on until it decides to give a final answer. Recently, we launched adaptive thinking, the next evolution on top of interleaved thinking. In this new paradigm, Claude is free to think whenever appropriate. There's no constraint on when Claude thinks, how much it thinks, or in what order it spends any of these tokens. It can leverage thinking, tool use, and text in whatever order best meets the requirements of your task. Claude could choose to start with a text response acknowledging the user's request, then stop to call a tool.
It could then think about that tool result, respond to the user with an update, continue calling tools, and so on until it provides a final answer describing the work it did. Claude could also choose not to think at all for simple queries that don't require it. In practice, Claude will typically think more often and for longer at higher effort levels, but everything is prompt dependent. Imagine I surveyed someone in the crowd and asked, "What is 2 plus 2?", once telling them to spend a little time on the problem and once telling them to spend a lot. They're going to spend roughly the same amount of time either way. The story changes a lot, though, if I ask you to conduct a really sophisticated research task; the difference between your thinking on low effort and high effort could be dramatic. Note that adaptive thinking is not a model router, and it's not an automated thinking toggle. It's not taking your query, classifying it by difficulty, and deciding whether to use a thinking or non-thinking version of the model. Rather, it's the difference between telling Claude "you must spend at least one thinking token at the start of this response" and telling Claude "you can spend thinking tokens whenever and however needed to solve this problem." It's really about Claude having the option to think at every single step of the process. We've run all of our benchmarks on adaptive thinking since Opus 4.6, and it's our intelligence-maximizing setting, showing performance at parity or better with interleaved thinking while delivering a better user experience. Now I want to dig a little more into effort and contrast it with the ways we've used thinking in the past. Historically, users have treated thinking toggles a lot like an effort dial.
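One way to picture the difference is that under adaptive thinking a response is just an ordered stream of typed blocks, with no fixed think-then-act-then-answer sequence. A toy validator contrasting the old fixed ordering with the adaptive one (the block type names are illustrative, not an actual response format):

```python
def is_fixed_order(blocks):
    """Old pattern: all thinking first, then all tool calls, then all text."""
    phase_of = {"thinking": 0, "tool_call": 1, "text": 2}
    phases = [phase_of[b] for b in blocks]
    return phases == sorted(phases)

def is_adaptive_ok(blocks):
    """Adaptive thinking: any interleaving of the three types is valid."""
    return all(b in {"thinking", "tool_call", "text"} for b in blocks)

# A text acknowledgement first, then a tool call, thinking about the
# result, an interim update, more tool use, and a final answer.
trace = ["text", "tool_call", "thinking", "text", "tool_call", "text"]
```

A trace like this fails the fixed-order check but is perfectly valid under adaptive thinking.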
If you wanted Claude to spend more time on a problem, you might turn on thinking inside claude.ai or Claude Code and expect it to work longer and give you a better result. That's a pretty reasonable instinct. However, thinking toggles are a poor proxy for an effort dial. Rather than expressing how hard you want Claude to work, you're turning a core capability of the model on and off; you're constraining how it's allowed to work, not how hard you want it to work. An effort dial is a much better expression of the idea of "spend more tokens to get a better answer." It moves thinking, tool use, and output text all together instead of toggling just one of them. As an analogy with tool use: we don't tell Claude to always search or never search. We tell Claude to figure out when it should search based on the problem at hand, and that's what allows Claude to be agentic in response to your query. In a similar way, when we work with teammates, we don't ask them to turn their inner monologue on and off in response to a question; we ask them how hard to try, and they decide how hard to think and what actions to take. So let me give you some practical guidance on setting effort levels for your use case. First, whenever possible, it's best to run evals and chart performance, with something like total tokens, time, or cost on the x-axis and performance on the y-axis. This lets you create an effort curve and get a better idea of the trade-offs you make by selecting a given effort level. Higher effort will improve performance on most intelligence-bound tasks, but it may also show diminishing returns.
For your use case, you might look at a graph like this and say, "I'll spend whatever tokens I need to get the best intelligence." Or you might say, "The relative improvement between extra high and max isn't worth the difference in tokens, so extra high is the better setting for my use case." Low effort can instead accomplish a task much quicker and save a lot of tokens, but it will also limit how thorough Claude is. As a quick tip: on low effort, Claude is trying to save tokens as much as possible, so you may sometimes catch it taking unexpected shortcuts. So in addition to looking at evals, we think it's a best practice to spend time reading your transcripts to understand exactly how Claude behaves at a given effort level for the thing you're asking it to do. On the flip side, low effort has also surprised us in some really interesting ways. One of my favorite evaluations we've created is called Claude Plays Pokemon, where Claude works its way through the original Pokemon Red game that many of us grew up knowing and loving. When we ran Claude Plays Pokemon on low effort, something really interesting happened: it ended up treating the game much like a speedrun. It would skip trainer battles to save time, use healing items it had stocked up on instead of going back to Pokemon Centers, and spam an item called Repel to limit disruptive encounters with wild Pokemon, making it through caves much more quickly. What I find most interesting is that we often correlate low effort with lower intelligence, but for anyone who grew up playing this game, you realize this is a super clever strategy.
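The "is the next level up worth it?" judgment can be made concrete by computing the marginal score gained per extra token along an effort curve; the effort levels reuse the names from the talk, but every number below is made up for illustration:

```python
def marginal_gains(curve):
    """Given (effort, tokens, score) points sorted by tokens, return the
    score gained per 1,000 extra tokens at each step up the curve."""
    gains = {}
    for (_, t0, s0), (name, t1, s1) in zip(curve, curve[1:]):
        gains[name] = (s1 - s0) / ((t1 - t0) / 1000)
    return gains

# Hypothetical eval results showing diminishing returns near the top.
curve = [
    ("low", 5_000, 52.0),
    ("medium", 12_000, 61.0),
    ("high", 25_000, 68.0),
    ("extra_high", 45_000, 72.0),
    ("max", 90_000, 73.0),
]
gains = marginal_gains(curve)
```

With numbers like these, the step from extra high to max buys very little score per token, which is exactly the case where you would settle on extra high.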
It takes a certain amount of intelligence to figure out how to minimize token spend and get through these levels as fast as possible. So it was interesting to see how Claude's interpretation of low effort translated to "beat the game as fast as possible," employing genuinely clever strategies along the way. So: lean on evals any time you have them. But I also want to give you some quick rules of thumb for selecting an effort setting in the absence of evals, or even alongside them. First, max effort, no surprise, can deliver gains on your hardest tasks, but as I mentioned, it can show diminishing returns. I recommend testing it for your most intelligence-demanding use cases, but don't assume it's either the ceiling on performance or the best bang for your buck; it may be that one level down gives you roughly equivalent performance at a fraction of the cost. Extra high effort is a new setting we introduced with Claude Opus 4.7, and we've found it to be the best setting for most coding and agentic use cases. It's currently our default in Claude Code and claude.ai for Opus 4.7, and it does a good job of maximizing intelligence without going overboard. High effort is a great setting if you're trying to balance token usage and intelligence, and it's probably the value I'd recommend for any intelligence-sensitive use case: a good place to start and test up from. Medium is good for cost-sensitive use cases where you're willing to trade a little intelligence for a much faster result. And low is best reserved for short-scope tasks and latency-sensitive workloads.
However, as I mentioned, it's always good to put it into practice and see what actually happens, because it might surprise you. I said at the start of the talk that test-time compute is a second way of scaling intelligence, alongside training-time compute. That begs the question: if both offer similar trade-offs between performance, speed, and cost, when should I use a smaller model, and when should I use a lower effort level on a bigger model? As quick guidelines: low effort on a bigger model is good for an intelligence-demanding use case where you're optimizing for speed. Going back to our traffic light simulation, Opus 4.7 on low effort spent about the same number of output tokens and only took a little longer than Haiku 4.5 on max effort, but I'd say it achieved a much better result. So low effort on the larger, more intelligent model can often give you a better bang for your buck when trading off speed versus intelligence on an intelligence-demanding use case. On the flip side, smaller models can be really good when you're optimizing cost and your use case isn't too intelligence-demanding. If you have simpler LLM tasks, especially in bulk, things like classification, information extraction, and basic summarization, that's where small models come in handy and save you a lot of cost when you don't need peak intelligence. Small models are also really useful when your application demands a very low time to first token. If you want Claude to respond as fast as possible to a user query, smaller models often start producing tokens much sooner, giving you a better time to first token. The way I think about it: use small models for a fast time to first token.
Use bigger models at lower effort for a fast time to last token. Wherever possible, as I said before, I recommend evaluating both: build eval curves across a few different model types and various effort levels, and look at the trade-offs for the use case you're trying to optimize. All right, before closing the talk, I want to summarize three key actionable items that I hope you take away. One: enable thinking whenever possible to give Claude that space to reason. Thinking, as I said, is core to how Claude works and gives it the inner monologue to work through your problem as efficiently as possible. If you want to modulate how much time Claude spends thinking, use effort levels or budgets to influence its behavior. Second, I might sound like a broken record, but if you have evals, use them. Use them to find your ideal balance: chart your curves, test different effort settings, different budgets, different models, look at the performance, and decide what makes sense for your use case, without forgetting to dig in and read those transcripts. And lastly, if you're not going to do any of that, you just need to make a choice, and you're working on anything coding or software engineering related, my advice is to go with extra high. It's a pretty good setting that delivers great intelligence at a great bang for your buck. Our north star for Claude overall is that it allocates compute incredibly well when asked: you set a quality bar and a budget, and Claude figures out the rest and gives you the best performance for your use case. Adaptive thinking, effort levels, and budgets are all steps in this direction, but they're really just the beginning, and there's a lot more to come.
I'm excited to share more with you in the future, so stay tuned, and thanks so much for taking the time. If you want to chat more about this, I'll be around the conference and in the audience, and I'm always happy to nerd out about these things. Thank you.