
China’s New Self-Improving Open-Source AI Beats OpenAI

AI • AI Revolution • April 12, 2026 • 14:50

Summary

TL;DR

MiniMax released M2.7, a powerful self-optimizing open-source AI model for software engineering and office tasks. Alongside it, major AI updates emerged from Runnable (Run Claw), Google (Mixboard), OpenAI (a unified Codex app), and Meta (Muse Spark), highlighting rapid advances in AI agent capabilities and multimodal systems.

Key Points

  • MiniMax Launches M2.7: A Strong Open-Source Model with Self-Evolution

    MiniMax has fully open-sourced the weights for M2.7, its most capable model to date, hosted on Hugging Face. The model uses a mixture-of-experts architecture, activating only the subcomponents a given task needs, which improves efficiency (a rough code sketch of this routing pattern appears just after this list). It excels primarily in software engineering, office work, and managing “agent teams”: coordinating multiple AI agents to collaboratively tackle complex assignments end-to-end. On software engineering benchmarks close to real-world scenarios, M2.7 scores competitively: 56.22% on SWE-Pro (near GPT-5.3 Codex level), 57.0% on Terminal Bench 2, and 39.8% on NL2 Repo, demonstrating a strong understanding of entire codebases rather than isolated snippets. Its broader engineering benchmarks are also impressive, including 76.5% on SWE-Multilingual and 52.7% on MultiSWE-Bench.

  • Production Debugging and Real-World Utility

    Beyond coding, M2.7 stands out by assisting in live production debugging: identifying root causes such as missing database migration indexes, correlating monitoring spikes with deployment actions, and delivering fixes that MiniMax says can cut incident recovery time to under 3 minutes. This situates it closer to a Site Reliability Engineer (SRE) role than to a typical code generator.

  • Self-Improvement through Autonomous Optimization

    A pioneering aspect of M2.7 is its internal self-evolution framework: the model autonomously ran more than 100 optimization cycles in which it analyzed its failures, refined scaffold code, tuned parameters (such as temperature and penalties), and added loop detection to avoid redundant agent actions. This process yielded a 30% uplift on internal evaluations, one of the clearest publicly documented examples of an AI-driven improvement loop.

  • Handling Complex Workflows and Office Tasks

    M2.7 supports extended, multi-agent workflows, keeping a 97% skill-compliance rate across 40 complex skills (each longer than 2,000 tokens) and an overall accuracy of 62.7% on MiniMax’s MM-Claw evaluation. It also handles professional office workloads: with an ELO of 1,495 on GDP-Val AA, it is the highest-ranked open-source model for expert-level tasks, able to analyze financial documents, forecast revenues, and generate detailed presentations and reports much as a junior analyst would.

  • Strong Performance on Machine Learning Competitions

    On MLE-Bench Light, a collection of 22 machine learning competitions executable on a single Nvidia A30 GPU, M2.7 earned nine gold, five silver, and one bronze medal in its best 24-hour run, averaging a 66.6% medal rate across three runs. That ties Gemini 3.1 and trails only GPT-5.4 (71.2%) and Opus 4.6 (75.7%).

  • Runnable Introduces Run Claw: Cloud AI Agent Integration in Team Chats

    Runnable launched Run Claw, a cloud-based AI agent accessible through messaging platforms like Slack, Telegram, and Discord. Rather than offering a traditional prompt-response interface, it asks users clarifying questions, drafts a plan, and then executes tasks autonomously. Runnable’s platform already supports creating websites, videos, pitch decks, and more, integrating with tools such as Google, Slack, Notion, GitHub, and Shopify. Having surpassed $2 million in annual recurring revenue (ARR) while shipping frequent feature rollouts, Runnable is positioning itself as a unified AI service enabling task delegation directly from team communication channels.

  • Google Evolves Mixboard with Full Voice Control and Collaborative Features

    Google announced significant enhancements to Mixboard, transforming it from an image-focused canvas into a hybrid collaborative workspace with stickers, voice notes, geometric shapes, and markers. Most notably, Google is working on full voice control that would let users generate images, rearrange elements, and operate the entire interface hands-free, built on the infrastructure underlying Google’s Stitch voice tool. Mixboard also gains experimental PDF export for converting brainstorming sessions into structured documents, streamlining the transition from ideation to documentation. These features might debut at the Google I/O conference on May 19–20.

  • OpenAI Developing a Unified Codex Agent Application

    OpenAI is designing a comprehensive Codex agent app that would merge ChatGPT, the Atlas browser, and coding capabilities into one platform. The app includes a “scratchpad” interface for running multiple Codex tasks concurrently, along with managed agents that operate persistently in the background and check in periodically. Evidence of a “heartbeat” system indicates these agents maintain active long-running processes, resembling functionality seen in Open Claw, whose founder joined OpenAI. This aligns with broader industry trends, as Anthropic is building a similar multi-agent system called Conway. Speculation also surrounds a forthcoming model, possibly GPT-5.5 “Glacier,” hinted at through snowflake emojis posted by OpenAI staff, which may coincide with the platform rollout.

  • Meta Releases Muse Spark: A Natively Multimodal AI from Scratch

    Meta unveiled Muse Spark, developed by its Super Intelligence Labs as a natively multimodal model designed jointly for text and images rather than adding vision on top of a language-only model. On Screen Spot Pro, a benchmark for recognizing UI elements in screenshots, Muse Spark scores 72.2 (84.1 with Python tools), ahead of Opus 4.6 Max (57.7) and GPT 5.46 High (39.0) in the base setup. Its training advances include a rebuilt pre-training stack that Meta says is more than ten times more compute-efficient than Llama 4 Maverick, stable and predictable reinforcement learning gains, and test-time reasoning innovations: “thought compression,” which solves problems with fewer tokens, and a “contemplating mode” in which multiple agents generate, refine, and merge answers in parallel to boost capability without long latency. In that mode, it scores 58.4 on Humanity’s Last Exam (tool-augmented), narrowly trailing GPT 5.4 Pro at 58.7.

  • Domain-Specific Strengths and Weaknesses of Muse Spark

    Muse Spark posts outstanding results on health benchmarks, scoring 42.8 on Health Bench Hard versus 14.8 for Opus 4.6 Max and 20.6 for Gemini 3.1 Pro High, aided by training data curated with input from over 1,000 physicians. In software engineering it performs strongly but trails the leaders, with 77.4% on SWE Bench Verified against 80.8 for Opus and 80.6 for Gemini. Its clearest weakness is abstract reasoning: on ARC AGI 2 it scores 42.5, well behind Gemini and GPT 5.4 models, which both score above 76, signaling ongoing work needed in this area.

  • The AI Landscape is Rapidly Evolving Towards Agent-Based, Collaborative, and Multimodal Systems

    These updates collectively emphasize a shift from simple prompt-based AI generation to fully autonomous AI agents capable of multi-step workflows, complex coordination, self-optimization, and multimodal understanding. Models and tools are increasingly integrated into real-world workflows spanning code debugging, financial analysis, collaborative ideation, and continuous learning. This fast-paced evolution highlights growing competition among AI providers striving to deliver versatile, generalist agents embedded deeply into users' professional ecosystems.
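
For readers curious what “activating only the subcomponents a task needs” looks like in practice, below is a minimal, illustrative sketch of top-k mixture-of-experts routing in PyTorch, referenced in the first key point above. It shows only the generic pattern; MiniMax has not published M2.7’s architecture code, and the expert count, layer sizes, and k value here are arbitrary assumptions.

    # Illustrative top-k mixture-of-experts layer (generic pattern, not MiniMax's code).
    # A small router scores the experts for each token; only the k best-scoring experts
    # actually run, so most parameters stay idle on any given forward pass.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(dim, num_experts)  # per-token expert scores
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(num_experts)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
            weights, idx = self.router(x).topk(self.k, dim=-1)  # pick k experts per token
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.k):  # dispatch tokens only to their chosen experts
                for e in idx[:, slot].unique():
                    mask = idx[:, slot] == e
                    out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
            return out

The practical payoff is that compute per token scales with k rather than with the total number of experts, which is how a very large model can stay relatively cheap to run.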


The AI field continues to witness swift advancements with MiniMax’s M2.7, Runnable’s Run Claw, Google’s Mixboard, OpenAI’s unified Codex app, and Meta’s Muse Spark leading innovation in agent-based and multimodal AI technologies.

Full transcript

MiniMax just released a self-evolving model. Google is adding full voice control to Mixboard. OpenAI is building an all-in-one Codex agent app. And Meta dropped Muse Spark with parallel thinking. Several big updates just rolled out. Let's talk about it.

All right. So, MiniMax just did something pretty big. It open-sourced M2.7 for real, not in that half-open way where you only get an API or some limited version. The actual model weights are now up on Hugging Face. And this is actually MiniMax's strongest open-source model so far. Under the hood, it uses a mixture-of-experts setup, which basically means the whole model does not fire all at once every time you ask it to do something. Only the parts that are needed get activated. But honestly, the architecture is only part of the story. What really makes M2.7 stand out is what it is built to do. MiniMax says the model is designed around three main areas: serious software engineering, serious office work, and something called agent teams. That last part is interesting. It is built to work like a small team. It can coordinate multiple agents, split up work, use tools, and handle bigger tasks from start to finish.

And the benchmark scores actually back that up. On SWE-Pro, which is much closer to real engineering work than basic coding tests, M2.7 scores 56.22%. That puts it right up there with GPT-5.3 Codex. These tasks include things like digging through logs, troubleshooting bugs, reviewing code for security issues, and fixing broken machine learning workflows. Then on Terminal Bench 2, it gets 57.0%, and on NL2 Repo, where the model has to understand full codebases instead of just isolated snippets, it scores 39.8%. On Vibe Pro, which tests repo-level code generation across web, Android, iOS, and simulation tasks, it hits 55.6%, which puts it nearly next to Opus 4.6. It also does really well on broader engineering-style tests. On SWE-Multilingual, it scores 76.5, and on MultiSWE-Bench, it gets 52.7. So, this thing is clearly doing more than just spitting out code. It is showing signs that it can actually understand how software systems work, how issues spread across files, and how real development work usually plays out.

MiniMax also shared some examples from production debugging, and this is where it starts sounding a lot more like an actual engineering assistant than a code bot. They describe cases where M2.7 can jump into a live production issue, connect monitoring spikes with deployment timelines, reason through possible causes, analyze traces, check databases, spot things like missing index migration files, and then suggest fixes that can stop the damage before things get worse. In some of these cases, they say it helped cut recovery time to under 3 minutes. That is a very different level. At that point, you are talking about something that is starting to behave more like an SRE or systems engineer.

Then comes the wildest part, the self-evolution setup. MiniMax had M2.7 work on improving its own programming performance using an internal scaffold. The model ran through more than 100 autonomous rounds where it looked at where it failed, planned changes, modified the scaffold code, ran evaluations, checked whether the changes helped, and then either kept them or rolled them back. And it was not just randomly trying stuff. It found useful improvements on its own.
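
To make that loop concrete, here is a hedged sketch of the evaluate-and-rollback pattern just described. MiniMax has not released this scaffold, so everything here is a hypothetical stand-in: propose_change represents "have the model edit its own scaffold," and run_eval_suite represents "re-run the benchmark suite."

    # Hypothetical sketch of the evaluate-and-rollback loop described above.
    # Not MiniMax's code: propose_change and run_eval_suite are invented stand-ins.
    import copy

    def self_optimize(scaffold: dict, propose_change, run_eval_suite, rounds: int = 100):
        best_score = run_eval_suite(scaffold)      # baseline before any changes
        history = []                               # failure notes the model can learn from
        for _ in range(rounds):
            candidate = copy.deepcopy(scaffold)
            note = propose_change(candidate, history)  # model mutates the candidate
            score = run_eval_suite(candidate)
            if score > best_score:                 # keep changes that measurably help...
                scaffold, best_score = candidate, score
                history.append(f"kept: {note} -> {score:.3f}")
            else:                                  # ...and roll back changes that do not
                history.append(f"rolled back: {note} -> {score:.3f}")
        return scaffold, best_score, history
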
In those rounds, it tuned things like temperature and penalty settings, improved workflows by checking whether the same bug pattern appeared in other files, and even added loop detection so the system would stop getting trapped in repetitive agent behavior. That process led to a 30% performance boost on internal evaluations. That is a huge part of why this release stands out. This is one of the clearest real examples of an AI system helping improve the system around itself in a structured way. MiniMax also says M2.7 is already handling 30 to 50% of its own reinforcement learning team's workflow end-to-end, with humans mainly stepping in for important decisions and discussions. So, the model is already being used inside the company for real work.

They also tested it on MLE-Bench Light, which is an open-source set of 22 machine learning competitions that can run on a single A30 GPU. For that, they gave the model a simple setup with short-term memory, self-feedback, and self-optimization. After each round, it writes down what happened, criticizes its own results, and decides what to try next. Each run lasted 24 hours. In the best run, M2.7 earned nine gold medals, five silver, and one bronze. Across all three runs, its average medal rate was 66.6%. That ties Gemini 3.1 and puts it just under GPT-5.4 at 71.2% and Opus 4.6 at 75.7%.

And this model is not only built for engineers. On GDP-Val AA, which measures professional office work and expert-level task delivery across 45 models, M2.7 gets an ELO score of 1,495. That makes it the highest-ranked open-source model there, behind only models like Opus 4.6, Sonnet 4.6, and GPT-5.4 overall. On Toolathon, it scores 46.3%, and on MiniMax's own MM-Claw evaluation, it keeps a 97% skill compliance rate across 40 complex skills, each longer than 2,000 tokens, while hitting an overall accuracy of 62.7%. So, it is not just capable, it is also staying on task across long and complicated workflows. MiniMax also says M2.7 can handle finance-style work, too. It can read annual reports and earnings call transcripts, compare multiple research reports, build revenue forecast models, and then turn all of that into things like PowerPoint decks and written reports. Basically, it can work through the kind of tasks you would normally hand to a junior analyst. So, overall, MiniMax is showing an open-source system that can code, debug, reason through production issues, handle office work, work across multiple agents, and even improve the workflow around itself. That is why M2.7 feels important.

All right. So, the next one. Runnable just dropped Run Claw. And this one is interesting because it feels like another sign that the AI tool race is quickly turning into an AI agent race. Run Claw is basically a cloud-based AI agent that lives inside Slack, Telegram, and Discord. So, instead of opening some separate app, setting up servers, or building some complicated workflow, you just message it like it's another person on your team, give it a task, and it goes off and handles it. Now, a lot of AI products still live in that old prompt box world, where you type something, get an output, then keep fixing it over and over until it's usable. Run Claw is clearly aimed at the next step, where the system asks questions, figures out what you actually want, builds a plan, and then starts doing the job for you.
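
That clarify-plan-execute shape is worth pinning down, so here is a generic sketch of one such agent turn. It is purely illustrative, not Runnable's implementation; the llm, ask_user, and execute_step callables are invented placeholders.

    # Generic clarify -> plan -> execute turn (illustrative; not Run Claw's API).
    # `llm` is any text-completion callable; ask_user relays a question to the chat;
    # execute_step performs one plan step. All three are invented placeholders.
    def agent_turn(task: str, llm, ask_user, execute_step) -> list:
        # 1. Ask one clarifying question instead of guessing at an ambiguous request.
        reply = llm("Task: " + task + "\nIf anything essential is unclear, reply with "
                    "one clarifying question; otherwise reply OK.")
        if reply.strip() != "OK":
            task += "\nClarification: " + ask_user(reply)

        # 2. Draft a concrete plan before touching anything.
        plan = llm("Task: " + task + "\nWrite a numbered plan, one step per line.")

        # 3. Work through the plan, passing earlier results to later steps.
        results = []
        for step in plan.splitlines():
            if step.strip():
                results.append(execute_step(step, results))
        return results
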
What also makes this launch hit harder is the context around it. Runnable says it just crossed $2 million in annual recurring revenue while also shipping new updates basically every single day. In AI right now, that's a pretty serious combination. Plenty of startups can go viral for a minute. Fewer can show real traction and still move this fast. If you haven't really been following Runnable, the platform was already trying to do way more than one thing. It can generate websites, videos, pitch decks, images, carousels, reports, and documents, all inside one system. It also has file uploads for context, a chat mode for research, a plan mode for bigger builds, model selection, memory for things like brand style and preferences, and connectors for tools like Google, Slack, Notion, GitHub, and Shopify. On the website side, it also goes further than the usual AI demo stuff. Runnable says it can build live deployed sites with databases, Stripe payments, custom domains, SEO, analytics, version history, and even an AI voice agent for support or lead capture. So, when Run Claw shows up on top of all that, it starts to make more sense what they're trying to build here. They're clearly trying to turn Runnable into one system where AI can create the assets, connect to your stack, remember how you work, and now actually execute tasks for you in the background. So, yeah, Run Claw is the headline. Though the bigger story is that tools like this are starting to move past simple generation and into real delegated work. And if Runnable is already at $2 million ARR while shipping at this speed, then this is definitely one to watch.

At the same time, Google is taking a very different angle with Mixboard. Originally, Mixboard was more of an image-based canvas, kind of experimental, nothing too serious. Now, it's evolving into something much closer to a full collaborative workspace. They're adding an experimental section with elements like stickers, voice notes, geometric shapes, and markers. You can layer these alongside generated images, which basically turns it into a hybrid between a generative AI tool and something like Miro or FigJam. The interesting part is how they're integrating voice. There's voice note support already, so you can capture ideas without typing. On top of that, they're working on a full voice mode where you can control the entire board through speech. That includes generating images, rearranging elements, swapping content, all without touching the interface. This seems to be built on the same infrastructure as Google's Stitch tool, which already supports voice interactions. And then there's PDF export. You can take an entire board session and convert it into a structured document. That's actually a big deal in team workflows because it bridges the gap between brainstorming and documentation. Instead of manually summarizing everything, the system handles it automatically. It's still experimental, though. Google hasn't confirmed how it will ship this. It could stay standalone, or it could get folded into Gemini or Workspace. With Google I/O scheduled for May 19th to 20th, the timing suggests we might see a reveal there.

Now, shifting to OpenAI. They're working on a unified Codex app that could basically merge ChatGPT, the Atlas browser, and coding tools into a single platform. One of the new features is called scratchpad. It allows users to trigger multiple Codex tasks in parallel from a dedicated interface. That ties into a broader concept of managed agents, where tasks run in the background, check in periodically, and handle multi-step workflows without constant input. There's also evidence of a heartbeat system inside the app. That's important because it suggests persistent connections with long-running processes. So instead of one-off interactions, you have agents that stay active and keep working.
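
A heartbeat in this sense is a simple mechanism: the agent process periodically reports that it is alive and making progress, so the platform can keep a long-running task attached to a session instead of timing it out. Here is a bare-bones sketch of the idea, with every detail assumed; this is not OpenAI's implementation.

    # Bare-bones heartbeat for a long-running background agent (illustrative only;
    # report_status and do_next_chunk are invented stand-ins, not OpenAI's API).
    import threading

    def run_managed_agent(task, do_next_chunk, report_status, interval_s: float = 30.0):
        done = threading.Event()

        def heartbeat():  # side channel: periodically signal "still alive"
            while not done.is_set():
                report_status(task, state="running")
                done.wait(interval_s)  # sleep, but wake early if the task finishes

        threading.Thread(target=heartbeat, daemon=True).start()
        try:
            while not do_next_chunk(task):  # do_next_chunk returns True when finished
                pass
            report_status(task, state="finished")
        finally:
            done.set()  # always stop the heartbeat, even on error

The point of the side thread is that status reporting keeps working even while a single long-running step blocks the main loop, which is what lets a platform treat the agent as a persistent session rather than a one-off request.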
That kind of persistent agent setup is very similar to what Open Claw has been doing, and the connection gets even more interesting when you consider that Open Claw's founder joined OpenAI. Anthropic is also building something similar with its Conway system. So this is clearly becoming a competitive area. The idea is to reduce friction. Instead of jumping between tools for chat, coding, and browsing, everything happens in one place, powered by agents that can manage tasks end-to-end. There's also speculation about a model called Glacier, possibly GPT 5.5, hinted at by OpenAI employees posting snowflake emojis. If that lines up with the app launch, it would follow their pattern of combining platform updates with model releases. So you're looking at a system where you give a goal and the platform coordinates everything in the background.

And then Meta comes in with Muse Spark. This is the first model from Meta's Super Intelligence Labs, and it was built from scratch as a natively multimodal system. So instead of bolting vision onto a language model later, Meta built this to handle text and images together from the start. That already shows up in benchmarks. On Screen Spot Pro, which tests how well a model can identify UI elements inside screenshots, Muse Spark scores 72.2, or 84.1 with Python tools. That's ahead of Opus 4.6 Max at 57.7 and GPT 5.46 High at 39.0 in the base setup. Meta says Muse Spark improves through three main levers: pre-training, reinforcement learning, and test-time reasoning. They spent months rebuilding the pre-training stack, improving architecture, optimization, and data quality. According to Meta, that made Muse Spark more than 10 times more compute-efficient than Llama 4 Maverick. Then there's reinforcement learning, where Meta says the gains are stable and predictable, with steady improvements in pass@1 and pass@16. Test-time reasoning is where it gets even more interesting. Muse Spark is trained to think before answering, though with pressure to stay efficient. That leads to what Meta calls thought compression, where the model learns to solve problems with fewer tokens while still improving performance. Then there's contemplating mode. Instead of one model thinking longer, multiple agents run in parallel, generate answers, refine them, and combine them into one final output. That helps increase capability without the same latency hit you'd get from one long reasoning chain.
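
That description, several workers drafting in parallel and then refining and merging the results, maps onto a simple fan-out/fan-in pattern. Below is a hedged sketch of that general shape; Meta has not published contemplating mode's internals, so the prompts, worker count, and merge step are all assumptions.

    # Fan-out / fan-in sketch of parallel test-time reasoning, in the spirit of the
    # "contemplating mode" described above (not Meta's code). `llm` is any
    # completion callable; worker count and prompts are arbitrary.
    from concurrent.futures import ThreadPoolExecutor

    def contemplate(question: str, llm, n_workers: int = 4) -> str:
        # Fan out: draft several answers at once, so wall-clock latency stays near
        # one call instead of one long sequential reasoning chain.
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            drafts = list(pool.map(lambda _: llm(question), range(n_workers)))

        # Refine: each draft is revised with the other candidates visible.
        others = "\n---\n".join(drafts)
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            revised = list(pool.map(
                lambda d: llm(question + "\nCandidate answers:\n" + others
                              + "\nImprove this draft: " + d),
                drafts))

        # Fan in: merge the revised drafts into a single final answer.
        return llm(question + "\nMerge these answers into one best answer:\n"
                   + "\n---\n".join(revised))
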
In contemplating mode, Muse Spark scores 58.4 on Humanity's Last Exam with tools, just behind GPT 5.4 Pro at 58.7 and ahead of Gemini 3.1 Deep Think at 53.4. On Frontier Science Research, it reaches 38.3, beating GPT 5.4 Pro at 36.7 and well above Gemini 3.1 at 23.3. Its strongest results show up in health. On Health Bench Hard, Muse Spark scores 42.8, compared to Opus 4.6 Max at 14.8 and Gemini 3.1 Pro High at 20.6. Meta says that is partly because the training data was curated with help from more than 1,000 physicians. On coding, it's strong, though not leading. On SWE Bench Verified, it scores 77.4, behind Opus at 80.8 and Gemini at 80.6. The biggest weakness right now is abstract reasoning. On ARC AGI 2, Muse Spark gets 42.5, while Gemini and GPT 5.4 are both above 76.

Also, if you want more content around science, space, and advanced tech, we've launched a separate channel for that. Links in the description. Go check it out. Anyway, that's where things stand right now. This space is moving ridiculously fast. Drop your thoughts in the comments and let me know which of these moves feels biggest to you. Thanks for watching, and I'll catch you in the next one.
