
A surge of AI developments led by xAI, DeepSeek, and Alibaba signals an intensifying global race in autonomous coding systems and research agents.
Elon Musk’s xAI has completed training of Grok V9, a 1.5 trillion parameter model—three times larger than its predecessor—with a public release expected within weeks. The model represents a major escalation in scale aimed at closing the gap with leading systems in coding and reasoning. Despite the size increase, Grok currently trails competitors in benchmark performance and enterprise adoption.
xAI reportedly trained Grok using extensive data from Cursor, a widely used AI coding platform adopted by over 67% of Fortune 500 companies. This dataset includes real-world developer prompts, debugging sessions, and multi-file collaboration patterns. The approach targets a key limitation of current models: moving from syntactic code generation to practical software engineering capabilities.
A $60 billion acquisition option tied to Cursor underscores Musk’s focus on the coding market, with a $10 billion fee even if the deal fails. xAI has also launched Grok Build, a command-line AI programming agent supporting parallel sub-agents, code editing, and execution. Pricing reaches $300 per month, positioning it as a premium enterprise tool.
On the SWE-bench Verified benchmark, GPT-5.5 scores 88.7%, Claude Opus 4.6 reaches 80.8%, while Grok models sit around 72–75%. Enterprise usage reflects a similar gap, with OpenAI at 55%, Anthropic at 47%, Google at 39%, and xAI at just 6%, highlighting the challenge Grok V9 must overcome.
A 46-page paper led by DeepSeek researcher Deli Chen was 99% generated by an AI agent, completed in roughly six days with only two hours of human input. The system processed 648,000 tokens and verified over 100 references, demonstrating rapid acceleration in academic output and raising questions about authorship and research inflation.
The paper proposes a five-level autonomy scale, from basic autocomplete tools to fully self-directed research systems. Current leading systems operate at Level 4, capable of multi-step autonomous work within defined constraints. Key unresolved challenges include self-evaluation, long-term memory, reproducibility, and avoiding failure loops.
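One way to make the five-level scale concrete is a small enumeration. This is only a sketch: the member names and the helper `human_approves_each_action` are made up for illustration, and the one-line descriptions follow how the levels are described later in the transcript, not the paper's exact wording.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """Five-level autonomy scale for research agents (member names are hypothetical)."""

    AUTOCOMPLETE = 1      # human drives every step; the agent only suggests completions
    TASK_EXECUTION = 2    # human specifies the task and approves each action
    CHECKPOINTED = 3      # multi-step operation with review at checkpoints
    BOUNDED_AUTONOMY = 4  # human gives the goal and evaluates only the final output
    SELF_DIRECTED = 5     # human sets the research area; the agent picks its own problems


def human_approves_each_action(level: AutonomyLevel) -> bool:
    """True for the levels where a human is in the loop on every single action."""
    return level <= AutonomyLevel.TASK_EXECUTION


# The summary places current leading systems at Level 4.
current_frontier = AutonomyLevel(4)
```

Using `IntEnum` keeps the levels ordered, so comparisons such as `level <= AutonomyLevel.TASK_EXECUTION` read naturally.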
Qwen 3.7 Max ranked fourth globally on the Code Arena leaderboard with a score of 1541, surpassing GPT-5.5 and Gemini 3.5 Flash. It is the first Chinese model to reach this level, joining top performers dominated by Anthropic’s Claude series.
Qwen’s architecture emphasizes sustained task execution, reportedly running for 35 hours with over 1,100 tool calls without losing coherence or entering loops. Tests show it producing functional applications—such as games and simulations—on first attempt with minimal debugging, indicating strong real-world usability.
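To show what "entering loops" means in practice, here is a minimal sketch of a plan, act, observe, reflect loop with a guard against repeating the same failed action. It is not how Qwen or any named system works; `run_agent`, its parameters, and the status strings are all hypothetical, and the planner and executor are passed in as plain callables.

```python
from collections import Counter
from typing import Callable, List, Tuple

Step = Tuple[str, bool, str]  # (action, succeeded, observation)


def run_agent(
    plan: Callable[[List[Step]], str],
    act: Callable[[str], Tuple[bool, str]],
    is_done: Callable[[List[Step]], bool],
    max_steps: int = 50,
    max_repeated_failures: int = 3,
) -> Tuple[str, List[Step]]:
    """Run a single-agent loop and stop early if one action keeps failing."""
    history: List[Step] = []
    failures: Counter = Counter()
    for _ in range(max_steps):
        action = plan(history)                 # plan the next action from what happened so far
        ok, observation = act(action)          # act and observe the result
        history.append((action, ok, observation))
        if not ok:                             # reflect: count failures per distinct action
            failures[action] += 1
            if failures[action] >= max_repeated_failures:
                return "loop_detected", history
        if is_done(history):
            return "done", history
    return "step_budget_exhausted", history
```

A planner that keeps proposing the same failing action is cut off after three attempts instead of burning the whole step budget. Real agents would need a fuzzier notion of "the same action", which this sketch deliberately leaves out.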
Multiple major releases are expected within the same period, including GPT-5.6, Claude Opus 4.8, and Gemini 3.5 Pro, setting up a concentrated wave of competition. At the same time, regulatory constraints are shaping partnerships, particularly around xAI’s interactions with Cursor during acquisition negotiations.
Rapid advances in model scale, training data, and agent autonomy are converging to reshape software development and research, with competition intensifying across U.S. and Chinese AI leaders.
Elon Musk just pulled the curtain back on what looks like Grok 5, a massive 1.5 trillion parameter model that has already finished training, and it could be xAI's biggest move yet in the AI coding race. xAI reportedly trained it with massive amounts of Cursor programming data, meaning Grok is learning from how real developers actually build, debug, and fix software. DeepSeek just showed a 46-page research paper that was 99% written by an AI agent, while Alibaba's Qwen 3.7 Max suddenly broke into the global top tier of coding models, beating GPT-5.5 and Gemini 3.5 Flash.
But let me start with the Musk stuff because it's probably the most immediately attention-grabbing. Late at night on May 24th, Elon announces that Grok V9, with 1.5 trillion parameters, has completed training. That's exactly three times the size of the current model, and he says it'll be released to the public in two to three weeks.
But here's where it gets really interesting. Almost simultaneously, it comes out that during training, xAI fed a massive amount of Cursor programming data into the model. Now, Cursor is that insanely popular AI coding tool that over 67% of Fortune 500 companies are using. It's expected to hit $6 billion in annualized revenue by the end of 2026, and Jensen Huang from Nvidia has publicly called it his favorite enterprise-level AI service. So feeding Cursor data into Grok is basically like studying for an exam with the answer sheet, except the exam is "how do professional engineers actually write code?" and the answer sheet is millions of real-world interactions.
What makes this so powerful is that we're not talking about basic syntax here. Current language models can already spit out code that looks correct. The real challenge is understanding complex engineering logic, navigating multi-file codebases, debugging in realistic workflows, and collaborating with humans effectively. Cursor has all of that data.
The prompts developers use, how they modify code, their debugging sessions, multi-file collaboration patterns: it's the exact type of training data you need to make an AI that doesn't just write code, but actually engineers software the way humans do. Someone actually asked Grok directly what the Cursor data contains, and it answered that it includes high-quality real programming interactions with developers' prompts, code context, editing operations, and task completion records. So yeah, they're basically teaching Grok to think like a senior developer by showing it how senior developers actually work.
The current V8 small model, with 500 billion parameters, will also be open-sourced by the end of the year, which is interesting because it shows xAI is trying to play both sides: keep the cutting-edge stuff proprietary while building goodwill in the open-source community.
And this is where you realize Musk is not just trying to make Grok smarter. On April 21st, SpaceX made a $60 billion move around Cursor, one of the most important AI coding tools right now. They're getting an option to acquire Cursor, and if they don't exercise it by the end of the year, they still pay a $10 billion cooperation fee. That's how much Musk values the AI programming field. Step one, lock down Cursor with money. Step two, feed their data into your model. Step three, launch your own programming agent, called Grok Build, on May 14th.
Grok Build is pretty interesting, actually. It's a terminal-level AI programming agent that runs on the command line and supports code generation, file editing, dependency management, and shell command execution. The biggest selling point: it supports up to eight sub-agents working in parallel. They're charging 300 bucks a month for the SuperGrok Heavy subscription, though there's a promotional price of $99 for the first six months. And get this, Grok Build is natively compatible with the configuration file format that Claude Code uses.
That's xAI building compatibility with its competitor's ecosystem right into its product. It's practical, but also kind of telling about where they stand in the market. Because let's be real, Grok is behind. On the SWE-bench Verified benchmark, which is what developers actually care about for measuring AI programming capability, GPT-5.5 is at 88.7%, Claude Opus 4.6 is at 80.8%, and the Grok 4 series is sitting around 72% to 75%. In terms of enterprise adoption, as of March 2026, OpenAI has 55% of enterprise users, Anthropic jumped from 20% a year ago to 47%, Google's at 39%, and Grok has a measly 6%. So yeah, tripling the parameters and adding Cursor data might bring about a qualitative change, but Musk's got a lot of ground to cover.
The timing of all this is super deliberate, too. SpaceX is listing on NASDAQ on June 12th with a target valuation of $1.75 trillion, the largest IPO in history if it goes through. The $60 billion Cursor acquisition is expected to complete within 30 days after the IPO, and V9 Medium's public release is scheduled right before the IPO.
But Musk isn't the only one making moves in June. OpenAI's GPT-5.6 has been leaked in the Codex backend, with a 1.5 million token context window successfully tested, and Polymarket is predicting over 85% probability that it releases before the end of June. Anthropic's Claude Opus 4.8 has appeared in the Google Vertex backend. Google's Gemini 3.5 Pro is also scheduled for June. Four leading labs having a head-on confrontation in the same month. This June is going to be absolutely brutal.
But while all this is happening, there's this legal situation brewing. Bloomberg reported that xAI's general counsel sent guidelines last week asking employees to limit interactions with Cursor staff to only what's necessary for implementing their technical partnership. This is standard procedure when acquisition talks are public.
Antitrust rules prohibit merging parties from intermingling assets or making joint business decisions before a deal is approved. The partnership was announced on April 21st, and Cursor posted about leveraging xAI's Colossus infrastructure to dramatically scale up the intelligence of their models. They said they've been bottlenecked by compute and this partnership solves that. So right now it's this careful dance where they're technically collaborating but legally have to keep walls up until regulators sign off on any acquisition.
Now, let's get to the absolutely fascinating part: Deli Chen's paper. This is where things get meta in the best way possible. Deli Chen is a senior researcher at DeepSeek and one of the core contributors to DeepSeek V1, V2, V3, and V4, DeepSeek R1 (which was on the cover of Nature), DeepSeek Coder, and the DeepSeek architecture. He's legitimately a heavy hitter in the field, and he just published a 46-page survey paper titled "From Copilots to Colleagues: A Survey of Autonomous Research Agents," where he openly admits that approximately 1% was written by him and 99% was written by his autonomous research agent framework, called Deli Auto Research Skill.
The statistics on this are kind of insane. The paper went through six iterations total: four for V1, one for V2, one for V3. The first draft took 76 minutes. Total time spent was six days across approximately 108 rounds of agent interaction, consuming about 648,000 tokens and producing 2,234 lines of LaTeX. All 103 references were verified. The paper has seven figures and four tables, totaling 46 pages at a 538 KB file size. And Deli Chen said the actual "CPU time" he spent thinking was less than two hours. His take: code agents are causing crazy inflation in computer science papers. Work that used to take at least a month can now be done in days. The two co-authors listed are DeepSeek V4 Pro, handling the text, and GPT Image 2, handling the images.
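To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It uses only the figures quoted above, several of which are approximate ("about", "approximately"), so treat the derived ratios as rough; the variable names are just for illustration.

```python
# Figures as quoted in the transcript (several are approximate).
total_tokens = 648_000   # tokens consumed across the whole project
rounds = 108             # rounds of agent interaction
latex_lines = 2_234      # lines of LaTeX produced
pages = 46               # length of the final paper
elapsed_days = 6         # total elapsed time

# Derived ratios.
tokens_per_round = total_tokens / rounds   # average tokens per interaction round
rounds_per_day = rounds / elapsed_days     # average interaction rounds per day
lines_per_page = latex_lines / pages       # LaTeX source lines per finished page

print(f"{tokens_per_round:.0f} tokens/round, "
      f"{rounds_per_day:.0f} rounds/day, "
      f"{lines_per_page:.1f} LaTeX lines/page")
```

That works out to roughly 6,000 tokens per round and 18 rounds per day.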
So yeah, a human using AI to write a comprehensive review about AI conducting scientific research. The irony is not lost on anyone, and that's kind of the point. This paper is both a demonstration and an analysis of exactly what it's describing.
The paper itself is actually super valuable, though. It proposes a five-level autonomy taxonomy for research agents, similar to how we classify self-driving cars. Level one is autocomplete, stuff like GitHub Copilot, where the human drives every step and the agent just suggests completions. These systems give you a 30% to 55% productivity boost but have no autonomy. Level two is task execution, where the human specifies the task and approves each action. Think ChatGPT with tools or Claude chat. Level three is multi-step operation with checkpoints, where the human sets the goal and reviews at specific stopping points. This is where Claude Code and Cursor agents sit. Level four is full autonomy within bounded domains, where humans provide the goal and evaluate the final output. This is where Devin, AI Scientist, and SWE-agent operate. Level five is self-directed research, where the human just sets the research area and the agent chooses its own problems. This is still mostly hypothetical.
The paper identifies four dominant architectural patterns. Single-agent loops are the simplest: plan, act, observe, reflect, in a cycle. Multi-agent collaboration has multiple agents with different roles reviewing and supplementing each other. Hierarchical orchestration has a supervisor agent decomposing tasks and delegating to worker agents. Tool-augmented execution gives agents access to external tools like code execution environments, web browsers, database queries, even robotic lab equipment. The most powerful systems combine multiple patterns.
What's really honest is that the paper identifies six fundamental problems that still aren't solved.
First is the cognitive loop trap, where agents get stuck repeating failed strategies without recognizing the failure. AutoGPT is notorious for this; entering infinite loops is its most common issue. Second is context window limitations. A long research session can generate over 100,000 tokens, and early information gets lost. Third is novelty evaluation. How do you judge if AI-generated research is actually novel? Citation prediction is influenced by social factors, and semantic similarity can't distinguish between novel and obscure. Fourth is reproducibility. Language model inference with nonzero temperature produces different outputs each run, and agent behavior is highly sensitive to prompt variations. Fifth is safety and ethics. The same capabilities that make research agents valuable also create dual-use risks. Sixth is cost and accessibility. A single SWE-bench resolution can cost $5 to $50 in API calls, creating economic barriers.
The paper surveyed over 95 papers and analyzed 17 major systems across a six-dimensional feature matrix. The conclusion is pretty clear. Current frontier systems operate at L4, meaning multi-step autonomous execution within bounded domains, while L5 remains aspirational. The most critical barriers to L5 aren't raw capability, but persistent knowledge accumulation across sessions, reliable self-evaluation without human oversight, and principled scaling of agent architectures that doesn't break down as complexity increases.
And speaking of programming capability, we need to talk about what just happened with Qwen 3.7 Max. The Code Arena leaderboard just came out, and Qwen 3.7 Max scored 1,541 points, landing in fourth place globally. That puts it ahead of GPT-5.5 and Gemini 3.5 Flash. Only Claude Opus 4.7 and Opus 4.6 are ahead of it. This is the first time a Chinese model has reached this position in programming. Alibaba is now the only Chinese vendor in the global top five, and Qwen is the only non-Claude model up there.
Before the official leaderboard, developers were already testing it. One comparison had Opus 4.7, GPT-5.5, and Qwen 3.7 Max write a self-training Tetris AI. Qwen 3.7 Max not only beat both competitors, but did it with a token cost of just $1.32 while improving performance by 56%. Another developer used it to build a 3D model of the universe, and the results were impressive. When generating a 3D pixel-style miniature pagoda model, Qwen 3.7 Max outperformed in both output speed and quality.
A more practical test came from another developer who gave Qwen 3.7 Max a prompt to create a racing game, and the result was honestly pretty impressive. It generated a playable HTML file. There was one small bug in the first version, where the left and right steering keys were reversed, but after one quick follow-up to fix it, the whole thing was running properly. The final result had four cars, a three-lap circular track, more than 100 gold coins scattered around, obstacles that slowed the car down when hit, and a post-race results panel with rankings, lap times, gold coin count, and fastest lap.
But two details stood out the most. First, Qwen 3.7 Max created a proper start page. You actually had to click start to begin the race. The other three models tested just started running immediately with no title screen. Second, the original prompt also asked for engine sounds and gold coin collection sounds. That was more like a bonus requirement at the end of the prompt, yet Qwen 3.7 Max was the only model that actually implemented it.
By comparison, Gemini 3.5 Flash had noticeably lower visual quality and a scattered UI, with dashboard info in all four corners, making it hard to focus. Claude Opus 4.6 had very few gold coins, and the three AI cars drove almost in perfect sync, like they were copy-pasted.
GPT-5.5 had better graphics and smoother operation, but made the gold coins look like yellow donuts for some reason, and both it and the others needed multiple rounds of debugging before everything worked properly. Only Qwen 3.7 Max was basically playable on the first generation.
The reason Qwen 3.7 Max performs so well in programming is actually built into its design philosophy. Alibaba positioned it as an agent foundation model specifically designed for long-term autonomous task execution. Internal test data showed it running continuously for 35 hours, executing 1,158 tool calls on an autonomous programming task. The generated code achieved a t-fold geometric mean acceleration compared to the Triton reference implementation. After 30 hours of deduction, the model still remained sharp and kept discovering new optimization opportunities, with zero context degradation, zero instruction drift, and zero infinite loops.
That last part is crucial, because calling tools 1,000 times isn't that uncommon anymore, especially with protocols like MCP. The real challenge is staying coherent for 35 hours without losing the goal, forgetting earlier decisions, or getting trapped in the same failed loop. Most models start breaking down on tasks that long. So Qwen 3.7 Max holding the thread for that many hours is a serious signal.
The training method may explain it. Qwen 3.7 Max was reportedly trained with environment expansion, where the same programming task is tested across different execution frameworks and verification methods, like Claude Code, OpenClaw, and others. So instead of learning shortcuts for one specific setup, the model is forced to learn general problem-solving patterns. That could be why it performs well across different agent frameworks instead of only looking strong inside its own ecosystem.
All right, let me know your thoughts in the comments. Subscribe for more AI updates. Hit the like button if you enjoyed the video.
Thanks for watching and I'll catch you in the next one.