
New DeepSeek V4 Shocks The World: China Fires Back Hard

AI • AI Revolution • April 25, 2026 • 15:30

Summary

TL;DR

DeepSeek’s V4 models combine near-frontier performance with radically lower costs, signaling a major shift in AI economics, infrastructure competition, and long-context capabilities.

Key Points

Dual Model Release

DeepSeek V4 Pro and V4 Flash launched as a two-tier system targeting different workloads. V4 Pro uses a 1.6 trillion-parameter mixture-of-experts design with 49 billion active parameters per query, while V4 Flash is smaller at 284 billion total and 13 billion active. Both are text-focused models with 1 million token context windows and up to 384,000 output tokens, positioning them for large-scale reasoning and agent workflows.
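For readers unfamiliar with mixture-of-experts, here is a toy sketch of why a model can "have" 1.6 trillion parameters while only "using" about 49 billion per token: a router activates a few experts per token and leaves the rest idle. The sizes, routing rule, and top-k value below are illustrative assumptions, not V4's actual configuration.

```python
# Toy sketch of sparse expert routing: the model "has" all experts, but each
# token only pays for the top_k it routes to. Sizes and routing are
# illustrative, not V4's real configuration.
import numpy as np

def moe_layer(x, experts, router_w, top_k=2):
    """x: (dim,) token activation; experts: list of (W_in, W_out) weight pairs."""
    scores = router_w @ x                        # one routing score per expert
    chosen = np.argsort(scores)[-top_k:]         # activate only the top_k experts
    gate = np.exp(scores[chosen] - scores[chosen].max())
    gate /= gate.sum()                           # softmax gate over chosen experts
    out = np.zeros_like(x)
    for g, i in zip(gate, chosen):
        W_in, W_out = experts[i]
        out += g * (W_out @ np.maximum(W_in @ x, 0.0))  # tiny ReLU FFN "expert"
    return out  # per-token compute scales with top_k, not len(experts)
```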

Aggressive Pricing Strategy

Pricing is the defining disruption. V4 Flash costs $0.14 per million input tokens and $0.28 output, while V4 Pro costs $1.74 input and $3.48 output. Comparable systems are far more expensive, with GPT‑5.5 reportedly at $5/$30 and premium tiers reaching $30/$180, and Claude Opus 4.7 around $5/$25. This places V4 Pro up to 98% cheaper than top-tier competitors, dramatically lowering the cost of large-scale AI deployment.
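To make those percentages concrete, here is a back-of-the-envelope comparison using the per-million-token prices quoted above. The workload mix (1M tokens in, 50K out) is an arbitrary example, and the savings figure shifts with the input/output ratio.

```python
# Back-of-the-envelope cost math using the prices quoted above (USD per
# million tokens). The 1M-in / 50K-out workload is an arbitrary example.
PRICES = {  # model: (input, output)
    "DeepSeek V4 Flash": (0.14, 0.28),
    "DeepSeek V4 Pro":   (1.74, 3.48),
    "GPT-5.5":           (5.00, 30.00),
    "GPT-5.5 Pro":       (30.00, 180.00),
    "Claude Opus 4.7":   (5.00, 25.00),
}

def call_cost(model, input_tokens, output_tokens):
    """USD cost of one request at the quoted per-million-token rates."""
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

for model in PRICES:  # one full-context request: 1M tokens in, 50K out
    print(f"{model:18s} ${call_cost(model, 1_000_000, 50_000):7.2f}")

ratio = call_cost("DeepSeek V4 Pro", 1e6, 5e4) / call_cost("GPT-5.5 Pro", 1e6, 5e4)
print(f"V4 Pro vs GPT-5.5 Pro: {1 - ratio:.0%} cheaper on this mix")  # ~95%;
# output-heavy workloads approach the headline 98% ($3.48 vs $180 output)
```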

Competitive Benchmark Performance

Early benchmarks show strong but not dominant results. V4 Pro ranks third among open models and 14th overall in coding evaluations, while other tests place it near the top of all systems, sometimes trailing leaders by fractions of a percent. It achieves 90.2% on Apex math benchmarks, outperforming some rivals, but still lags models like Gemini 3.1 Pro on reasoning-heavy tests such as GPQA Diamond and Humanity’s Last Exam.

Strength in Coding and Agents

Coding and agent workflows emerge as V4’s strongest domain. Internal testing shows over 90% of developers ranking V4 Pro among top coding tools, with more than half ready to adopt it as default. It integrates with frameworks such as Claude Code, OpenCode, and Code Buddy, and supports complex multi-step agents for research, data analysis, and software generation.

Interleaved Reasoning for Agents

A key technical feature is interleaved thinking, which preserves reasoning state across tool calls. This reduces context loss in multi-step workflows, improving reliability in long chains of actions where earlier models often degraded or reset intermediate reasoning.
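In pseudocode terms, the difference is whether the reasoning state survives each tool call or gets rebuilt from a summary. The sketch below is a hypothetical agent loop to illustrate the idea; the `model` object and its methods are invented stand-ins, not DeepSeek's actual API.

```python
# Hypothetical agent loop illustrating interleaved thinking. `model` and its
# methods are invented stand-ins for whatever SDK you use, not a real API.
def run_agent(model, task, tools, max_steps=20):
    state = model.start(task)            # reasoning state: task + thoughts so far
    for _ in range(max_steps):
        step = model.think(state)        # reasoning RESUMES from `state`
        if step.tool_call is None:
            return step.answer           # finished: no further tool needed
        result = tools[step.tool_call.name](**step.tool_call.args)
        # Interleaved thinking: fold the tool result into the SAME reasoning
        # state rather than restarting from a lossy summary, so a deduction
        # made at step 2 is still available verbatim at step 15.
        state = state.extend(tool_result=result)
    raise RuntimeError("agent did not converge within max_steps")
```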

Breakthrough in Long-Context Efficiency

DeepSeek introduces a hybrid attention system combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). These methods compress token groups and selectively focus computation, enabling efficient scaling to 1 million tokens without prohibitive cost. Compared to earlier versions, V4 Pro cuts compute usage to 27% and memory to 10%, while Flash reduces them further.
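Below is a toy single-head illustration of the two compression ideas, using the block sizes given in the full transcript (4-token fine blocks for CSA, 128-token coarse blocks for HCA). The mean pooling and top-k retrieval are simplifying assumptions, not the published architecture.

```python
# Toy single-head illustration of the hybrid scheme. Block sizes (4 and 128)
# come from the article; mean pooling and top-k retrieval are simplifying
# assumptions, not the published architecture.
import numpy as np

def pool_blocks(x, block):
    """Mean-pool a (tokens, dim) array into (tokens // block, dim) summaries."""
    n = (x.shape[0] // block) * block
    return x[:n].reshape(-1, block, x.shape[1]).mean(axis=1)

def hybrid_attention(q, K, V, fine_block=4, coarse_block=128, top_k=64):
    # HCA-style coarse view: one entry per 128 tokens -> cheap global overview.
    K_coarse, V_coarse = pool_blocks(K, coarse_block), pool_blocks(V, coarse_block)
    # CSA-style fine view: 4-token blocks, then sparse retrieval keeps only
    # the top_k blocks most relevant to this query.
    K_fine, V_fine = pool_blocks(K, fine_block), pool_blocks(V, fine_block)
    idx = np.argsort(K_fine @ q)[-top_k:]
    K_sel = np.concatenate([K_fine[idx], K_coarse])
    V_sel = np.concatenate([V_fine[idx], V_coarse])
    # Ordinary softmax attention, but over ~(top_k + n/128) entries instead of
    # all n tokens -- the source of the compute and KV-memory savings.
    s = K_sel @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V_sel
```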

Engineering and Training Advances

Additional innovations include manifold-constrained hyperconnections for more stable signal propagation and the Muon optimizer for efficient large-scale training. These changes reportedly deliver up to 2× inference acceleration, reinforcing the model’s cost-performance advantage.
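Details on MHC are sparse, but Muon is a publicly documented optimizer: it replaces AdamW's per-coordinate scaling with an orthogonalized momentum update for each weight matrix, computed by a few Newton-Schulz iterations. Below is a minimal sketch of that core step following the public reference formulation; it illustrates Muon in general, not DeepSeek's training code.

```python
# Minimal sketch of Muon's core update, following the public reference
# formulation (momentum, then Newton-Schulz orthogonalization of the update
# for each weight matrix). Illustrates Muon in general, not V4 training code.
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate U V^T from the SVD G = U S V^T without computing an SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic iteration coefficients
    X = G / (G.norm() + 1e-7)                # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # polynomial push toward orthogonality
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    momentum_buf.mul_(beta).add_(grad)                   # momentum smoothing
    update = newton_schulz_orthogonalize(momentum_buf)   # orthogonalized direction
    weight.data.add_(update, alpha=-lr)
```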

Hardware Strategy and Global AI Stack

V4 is designed to run on both Nvidia GPUs and China’s domestic chips, particularly Huawei Ascend NPUs. Nvidia supports deployment on Blackwell and Hopper systems, while Huawei reports up to 1.73× inference acceleration on Ascend hardware. This dual compatibility reflects a broader competition over AI infrastructure and supply chains.

Impact of Export Restrictions

U.S. restrictions on advanced chip exports have pushed Chinese developers toward efficiency and domestic alternatives. While training still partly relies on Nvidia hardware, inference is increasingly shifting to local chips. This suggests the emergence of a parallel AI ecosystem, rather than a complete break from Western infrastructure.

Economic Implications for Developers

The pricing shift changes the feasibility of large-scale applications. Tasks like legal analysis, financial research, codebase review, and enterprise automation become significantly cheaper with million-token context. Smaller teams benefit even more from V4 Flash, enabling low-cost development of chat systems, summarization tools, and lightweight agents.

Open-Weight Advantage

Released under an MIT license, the models can be downloaded, modified, and self-hosted. This gives companies control over customization and deployment, contrasting with closed API-only systems and strengthening the open-weight ecosystem.
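In practice, self-hosting an MIT-licensed open-weight model is a standard Hugging Face workflow. A minimal sketch follows; the repository id is a hypothetical placeholder (check DeepSeek's official organization page for the real one), and the Pro model would need far more hardware than a single machine.

```python
# Standard transformers loading flow for an open-weight checkpoint. The repo
# id is a hypothetical placeholder -- check DeepSeek's official Hugging Face
# organization for the real name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V4-Flash"  # hypothetical repo id
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

prompt = "Summarize the trade-offs of mixture-of-experts inference."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))
```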

Limitations and Gaps

V4 remains text-only, leaving competitors ahead in multimodal capabilities involving image, audio, and video. It also trails leading models in some general reasoning benchmarks, with an estimated 3–6 month gap to frontier systems.

Mixed Early Reception

Initial user feedback varies. Some report performance close to top-tier systems at a fraction of the cost, while others find improvements over previous versions less noticeable in everyday use. This highlights the gap between benchmark performance and real-world experience.

Market Significance

Rather than outperforming all rivals, V4 reshapes expectations around cost and accessibility. By combining strong performance, extreme efficiency, and open deployment, it challenges premium pricing models and signals a broader shift toward cheaper, scalable AI infrastructure.

Full transcript

OpenAI just dropped GPT-5.5, and only a few hours later DeepSeek showed up with V4. They actually released two models, with a 1 million token context window, an MIT license, extremely low pricing, strong coding performance, and support for both Western GPU stacks and China's domestic chip ecosystem. So this launch is about benchmarks, but also about cost, infrastructure, long-context agents, and the bigger fight over who controls the AI stack.

The new family has two versions, DeepSeek V4 Pro and DeepSeek V4 Flash. V4 Pro is the big one. It has 1.6 trillion total parameters with 49 billion active parameters per inference pass. So the full model is massive, but it does not wake up the whole thing every time you ask it something: it uses a mixture-of-experts setup where only the relevant parts activate for each task. V4 Flash is the smaller and faster version, with 284 billion total parameters and 13 billion active parameters. Both are text-only for now, both support 1 million tokens of context, and both can produce up to 384,000 output tokens according to DeepSeek's API docs.

And the pricing is where DeepSeek is trying to punch the market in the face. V4 Flash costs $0.14 per million input tokens and $0.28 per million output tokens. V4 Pro costs $1.74 input and $3.48 output. For comparison, GPT-5.5 reportedly launched at $5 input and $30 output, with GPT-5.5 Pro going as high as $30 input and $180 output per million tokens. Claude Opus 4.7 is also far more expensive, around $5 input and $25 output. So when people say V4 Pro is 98% cheaper than GPT-5.5 Pro, that is the point. And V4 Flash being $0.28 output means it is over 99% cheaper than something like Claude Opus 4.7's output pricing. That is why developers are paying attention. A model does not need to beat every closed-source system in every category to change the market. Sometimes it just needs to be good enough, open enough, and cheap enough.

Now the early benchmarks are already causing a lot of noise. Arena.ai said DeepSeek V4 Pro in thinking mode ranked third among open-source models and 14th overall in its code arena, and described it as a significant jump over DeepSeek V3.2. Vals AI went even harder, saying V4 became the number one open-weight model in its Vibe Code benchmark, beating Kimi K 2.6 and even closed-source models like Gemini 3.1 Pro. Vals also said V4 made about a 10-fold jump over V3.2 on that benchmark: V3.2 only scored five points there, and V4 moved far beyond it. In Vals' broader index, V4 came second overall, only 0.07% behind Kimi K 2.6.

Now, DeepSeek's own wording is more cautious, which is actually interesting. In its own material, the company says V4 Pro has passed mainstream open-source models and is close to closed-source systems like Gemini in knowledge and reasoning, but still has a gap of around 3 to 6 months compared to the most advanced frontier models. So DeepSeek is not pretending it destroys everything everywhere. They are basically saying: in code, agents, math, and STEM we are very close, sometimes ahead, and in general reasoning the best closed models still have an edge. And that's reasonable, because most AI launches cherry-pick the five graphs where they win and pretend the rest does not exist. On Codeforces, V4 Pro scored 3,26, which places it around 23rd among actual human contest participants.
On Apex Shortlist, a difficult math and STEM benchmark, it hit 90.2%, beating Opus 4.6 at 85.9% and GPT-5.4 at 78.1%. On SWE-bench Verified, which tests real GitHub issue resolution, it scored 80.6%, matching Claude Opus 4.6. But it still trails in some areas. On MMLU-Pro, Gemini 3.1 Pro scored 91.0% while V4 Pro scored 87.5%. On GPQA Diamond, Gemini scored 94.3 while V4 Pro scored 90.1. On Humanity's Last Exam, Gemini 3.1 Pro reached 44.4% while V4 Pro scored 37.7%. So the real story is not "DeepSeek beats every model." The real story is that an open-weight model is now competing near the top while costing dramatically less.

The coding and agent side may be the strongest part of the release. DeepSeek says V4 has become the main agentic coding model used internally by its own employees. In an internal survey of 85 experienced developers, more than 90% included V4 Pro among their top choices for coding tasks. Another internal result said 52% considered it ready to become their default coding agent, 39% leaned yes, and fewer than 9% said no. DeepSeek also says V4 Pro works well with agent frameworks like Claude Code, OpenCode, OpenClaw, and Code Buddy. Nvidia also mentioned agentic workflows like Nemoclaw, the AIQ Blueprint, and a data explorer agent, where DeepSeek V4 can be used as the LLM for long-running assistants, deep research systems, data analysis agents, and code generation workflows.

One technical feature behind this is called interleaved thinking. In older agent workflows, when the model made a tool call, searched something, ran code, then came back, parts of the reasoning state could get lost between steps. The model had to rebuild context again and again. V4 is designed to retain reasoning across tool calls, which matters a lot for 20-step agent workflows where one mistake halfway through can ruin everything.

And that brings us to the biggest technical part of V4, the new attention system. Long context is expensive because standard attention scales badly. When context gets longer, the model has to compare more and more pieces of text against each other: double the context and the compute can grow roughly four times. That is why many models advertise huge context windows but then throttle them, slow down, or become expensive when people actually use them. DeepSeek's answer is a hybrid attention architecture built around Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses groups of tokens, for example every four tokens, into smaller information blocks, then uses sparse retrieval to focus only on the most relevant content instead of paying attention to everything equally. HCA is more aggressive: it compresses larger groups, around 128 tokens, into a single entry, giving the model a cheaper global view of the whole context. So V4 gets both detail and overview. It can keep nearby text more complete while compressing older or less important context. That is how DeepSeek is trying to make 1 million token inference actually practical.

The efficiency numbers are pretty wild. At 1 million tokens, V4 Pro uses only 27% of the per-token inference compute required by V3.2, and its KV cache memory burden drops to 10%. V4 Flash goes even further, using just 10% of the compute and 7% of the memory compared to V3.2. Nvidia described this as a 73% reduction in per-token inference FLOPs and a 90% reduction in KV cache memory burden for the Pro model. That is the core reason the pricing can be so aggressive. It is not only a marketing trick.
The architecture is designed to make long-context inference cheaper. DeepSeek also introduced other engineering changes, including MHC (manifold-constrained hyperconnections), which upgrade traditional residual connections to keep signal propagation more stable, and the Muon optimizer, replacing AdamW for large-scale and low-precision training. DeepSeek says full engineering optimization can deliver almost 2× inference acceleration.

The hardware story is just as important. DeepSeek V4 is being positioned as a model that works across both Nvidia infrastructure and Chinese domestic chips. On one side, Nvidia published launch-day support for DeepSeek V4 on Blackwell, including GPU-accelerated endpoints on build.nvidia.com, NIM deployment, vLLM recipes, and SGLang serving recipes for Blackwell and Hopper systems. Nvidia said DeepSeek V4 Pro on GB200 NVL72 showed over 150 tokens per second per user in early out-of-the-box tests. They also tested Blackwell B300 with vLLM's day-zero recipe using the model's native MXFP4 format. Nvidia's point is clear: even if DeepSeek is part of China's AI rise, Nvidia still wants developers running it on Blackwell, Hopper, NIM, vLLM, SGLang, and the whole CUDA ecosystem.

But on the other side, V4 is also a major step toward China's domestic AI stack. DeepSeek verified fine-grained expert-parallel optimization on Huawei Ascend NPU platforms, with acceleration ratios between 1× and 1.73× in general inference workloads. Huawei also said its Ascend super-node products based on the Ascend 950 series would support DeepSeek V4. This matters because the US has restricted high-end Nvidia chip exports to China since 2022. The goal was to slow Chinese AI progress, but DeepSeek is showing a different outcome: the restrictions push Chinese labs to optimize harder, rely more on domestic chips, and build models that are cheaper to run.

That does not mean China has fully replaced Nvidia. MIT Technology Review noted that DeepSeek appears to use Chinese chips mainly for inference, while parts of training may still rely heavily on Nvidia. Tsinghua professor Liu Xiyuan said the technical report suggests only part of the training process was adapted for Chinese chips, and it is unclear whether some long-context features were fully adapted. Multiple sources also said Chinese chips are still weaker than Nvidia chips for training, though better suited for inference. So this is not a clean break from Nvidia. It is more like the first serious proof that China can start building a parallel AI infrastructure.

DeepSeek even ties future V4 pricing to that hardware shift. The company says V4 Pro throughput is currently limited because of high-end compute constraints, but prices could fall significantly after Huawei Ascend 950 super nodes begin shipping at scale in the second half of 2026. That is a big statement. V4 Pro is already cheap, and DeepSeek is basically saying it may become even cheaper once domestic hardware capacity expands.

There is also a market psychology angle here. DeepSeek's R1 release in January 2025 shocked the industry so hard that Nvidia reportedly lost around $600 billion in market value in one day. V4 probably will not create the same kind of instant panic because the market is more prepared now, but it may matter more for actual builders. For enterprise users, V4 Pro changes the economics of large-scale AI workflows.
Legal review, financial research, codebase analysis, document processing, support automation, and internal agents all become cheaper when you can feed 1 million tokens at a time and pay $1.74 input and $3.48 output per million. For solo developers and smaller teams, V4 Flash may be the more interesting one. At $0.14 input and $0.28 output, it becomes extremely cheap to build chat, summarization, routing, coding helpers, and lightweight agents. And because the models are MIT-licensed and available on Hugging Face, companies can download, modify, and self-host them. That open-weight part is crucial. You are not only renting the model through an API; you can build around it, customize it, optimize it, and deploy it on your own infrastructure if you have the hardware.

There are limitations, though. The models are text-only right now, so OpenAI, Google, Xiaomi, and others still have an edge in multimodal systems. Xiaomi just launched Myo V 2.5 Pro with text, image, audio, and video support, and OpenAI and Google are also pushing hard on multimodal agents. DeepSeek says multimodal capabilities are coming, but for now V4 is mainly a text, code, reasoning, long-context, and agentic model.

There is also some early disagreement about real-world experience. Many users on X called it a market-shattering release because of the price-performance ratio. Some claimed V4 Flash feels close to GPT-5.4-level capability at a tiny fraction of the cost. Others were less impressed, saying V4 Flash did not feel clearly better than the already mature V3.2 in daily use. That difference makes sense. Benchmarks often show what a model can do under ideal conditions. Real-world usage shows how it behaves across messy prompts, vague instructions, long conversations, and personal workflows. V4 may be excellent for code and agents while still feeling uneven in some everyday chat situations.

DeepSeek is also retiring the old DeepSeek Chat and DeepSeek Reasoner endpoints on July 24, 2026. For now, those endpoints already route to V4 Flash in non-thinking and thinking modes, so API users may already be interacting with the new system without treating it as a separate model.

The bigger takeaway is simple. V4 is a pricing attack, an open-source attack, a long-context engineering attack, and a hardware strategy move at the same time. OpenAI still has stronger frontier performance in several areas. Gemini still leads on some reasoning and expert-knowledge benchmarks. Claude still has advantages in certain long-context retrieval and premium coding workflows. But DeepSeek is making the gap look smaller while making the bill look ridiculous. And that's why this launch is important: once developers can build serious agents with 1 million token context, strong coding ability, open weights, and output pricing under $4 per million tokens for Pro, the premium-model question changes completely.

Also, if you want more content around science, space, and advanced tech, we've launched a separate channel for that. Links in the description. Go check it out. So yeah, this release may not create the same shock wave as R1, but for developers, startups, and enterprise AI teams, it may be one of the most important model launches of the year. If you found this useful, drop a like. Thanks for watching, and I'll catch you in the next one.
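One practical footnote to the endpoint migration mentioned in the transcript: DeepSeek's API has historically been OpenAI-compatible, so if that still holds for V4, existing callers need no code changes to land on V4 Flash. A minimal sketch, assuming the documented base URL and model names still apply:

```python
# Assumes DeepSeek's API stays OpenAI-compatible with the documented base URL
# and model names; if so, callers of the old endpoints need no code changes.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-chat",  # per the transcript, already routed to V4 Flash (non-thinking)
    messages=[{"role": "user", "content": "Review this function for bugs: ..."}],
)
print(resp.choices[0].message.content)
```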
