
DeepSeek's V4 models pair near-state-of-the-art performance with radically lower costs, signaling a major shift in AI economics, infrastructure competition, and long-context capability.
DeepSeek V4 Pro and V4 Flash launched as a two-tier system aimed at different workloads. V4 Pro uses a 1.6-trillion-parameter mixture-of-experts architecture with 49 billion parameters active per query, while V4 Flash is smaller, at 284 billion total and 13 billion active. Both are text-only, with 1-million-token context windows and up to 384,000 output tokens, suited to large-scale reasoning and agent workflows.
Pricing is the headline disruption. V4 Flash costs $0.14 per million input tokens and $0.28 per million output tokens, while V4 Pro costs $1.74 input and $3.48 output. Comparable systems are far more expensive: GPT-5.5 reportedly runs $5/$30 and up to $30/$180, and Claude Opus 4.7 around $5/$25. V4 Pro can thus be up to 98% cheaper, sharply reducing deployment costs.
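To make the pricing gap concrete, here is a small cost calculator using only the list prices quoted above; the monthly workload figures are invented for illustration. The headline 98% figure compares per-token output prices ($3.48 vs $180); a blended workload like the one below lands slightly lower.

```python
# Cost comparison using the list prices quoted above (USD per million tokens).
# The workload below is a made-up illustration, not a benchmark.
PRICES = {
    "DeepSeek V4 Flash": (0.14, 0.28),
    "DeepSeek V4 Pro":   (1.74, 3.48),
    "GPT-5.5":           (5.00, 30.00),
    "GPT-5.5 Pro":       (30.00, 180.00),
    "Claude Opus 4.7":   (5.00, 25.00),
}

def monthly_cost(model, input_mtok, output_mtok):
    """Cost in USD for a workload measured in millions of tokens."""
    inp, out = PRICES[model]
    return input_mtok * inp + output_mtok * out

# Example: 100M input tokens and 20M output tokens per month.
for model in PRICES:
    print(f"{model:18s} ${monthly_cost(model, 100, 20):>10,.2f}")

savings = 1 - monthly_cost("DeepSeek V4 Pro", 100, 20) / monthly_cost("GPT-5.5 Pro", 100, 20)
print(f"V4 Pro saves {savings:.1%} vs GPT-5.5 Pro")  # ~96% on this blended mix
```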
Early tests show strong results without outright dominance. V4 Pro ranks third among open models and 14th overall on code, and near the top on other tests. It reaches 90.2% on Apex (math) but trails Gemini 3.1 Pro on reasoning benchmarks such as GPQA Diamond and Humanity's Last Exam.
Code and agents are its strong points. Internal testing indicates that more than 90% of developers rank V4 Pro among the best tools, and over half are ready to adopt it as their default. It integrates with Claude Code, OpenCode, and Code Buddy, and handles multi-step agents for research, analysis, and software generation.
A key innovation is interleaved reasoning, which preserves state between tool calls. This reduces context loss in long workflows, improving reliability where other models lose the thread.
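DeepSeek has not published implementation details here, so the sketch below is only a conceptual illustration of what "preserving state between tool calls" means for an agent loop; the `model.generate` interface and its field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Conversation plus the model's reasoning trace, carried across tool calls."""
    messages: list = field(default_factory=list)
    reasoning: list = field(default_factory=list)  # kept, not discarded, each step

def run_agent(model, tools, task, max_steps=20):
    state = AgentState(messages=[{"role": "user", "content": task}])
    for _ in range(max_steps):
        # Hypothetical call: the model sees its own prior reasoning, so it
        # does not have to rebuild its plan after every tool round-trip.
        step = model.generate(messages=state.messages, reasoning=state.reasoning)
        state.reasoning.append(step.reasoning)        # the "interleaved" part
        if step.tool_call is None:
            return step.answer
        result = tools[step.tool_call.name](**step.tool_call.args)
        state.messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent did not finish within max_steps")
```

The contrast is with older workflows where only `messages` survived each round-trip, forcing the model to reconstruct its reasoning from scratch after every tool call.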
DeepSeek introduces a hybrid attention scheme combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). These methods compress tokens and target compute where it matters, making 1 million tokens of context practical. V4 Pro cuts per-token compute to 27% and memory to 10% of the previous generation's, with Flash going even lower.
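The exact CSA/HCA algorithms are not spelled out here, but the arithmetic of block compression is easy to illustrate. In the sketch below, the 4-token and 128-token block sizes come from the fuller description later in this piece; the sparse-retrieval ratio is an invented assumption.

```python
# Back-of-envelope effect of CSA/HCA-style compression on attention size.
# Block sizes (4 for CSA, 128 for HCA) come from the description later in
# this piece; the retrieval ratio is an illustrative assumption.
context = 1_000_000

fine_blocks   = context // 4     # CSA: 4-token blocks    -> 250,000 entries
coarse_blocks = context // 128   # HCA: 128-token summaries -> 7,812 entries

# Suppose sparse retrieval lets each query attend to only the top 2% of fine
# blocks plus the full coarse view (assumed ratio, for illustration only):
attended = int(0.02 * fine_blocks) + coarse_blocks
print(f"entries attended per token: {attended:,} vs {context:,} raw tokens")
print(f"relative attention cost: {attended / context:.1%}")   # ~1.3%
```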
Other innovations include manifold-constrained hyperconnections to stabilize signal propagation and the Muon optimizer for massive-scale training. Together they are said to deliver up to a 2x inference speedup.
V4 runs on Nvidia GPUs and on Chinese chips, notably Huawei Ascend NPUs. Nvidia supports Blackwell and Hopper, while Huawei reports up to a 1.73x speedup. This reflects the broader competition over AI infrastructure.
US export restrictions have pushed Chinese labs toward greater efficiency and local alternatives. Training still relies on Nvidia, but inference is shifting to domestic chips, suggesting a parallel AI ecosystem.
The costs make large-scale use cases viable: legal analysis, financial research, code review, automation. Smaller teams benefit from V4 Flash for low-cost chat, summarization, and agent systems, as sketched below.
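As a concrete example of the low-cost tier: DeepSeek's API has long been OpenAI-compatible, and (as noted in the transcript below) the existing deepseek-chat endpoint already routes to V4 Flash. A minimal summarization call might look like this; treat it as a sketch rather than official documentation.

```python
from openai import OpenAI

# DeepSeek's API is OpenAI-compatible; per the transcript below, the existing
# `deepseek-chat` endpoint already routes to V4 Flash in non-thinking mode.
client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-chat",  # currently served by V4 Flash, per DeepSeek
    messages=[
        {"role": "system", "content": "Summarize the document in five bullet points."},
        {"role": "user", "content": open("report.txt").read()},
    ],
)
print(resp.choices[0].message.content)
```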
Under the MIT license, the models can be downloaded, modified, and self-hosted, offering more control than closed APIs.
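For self-hosting, a minimal sketch with vLLM follows. The Hugging Face repo id is a guess, and serving the 1.6-trillion-parameter Pro model would require a multi-GPU cluster, so Flash is the realistic target for small teams.

```python
from vllm import LLM, SamplingParams

# Hypothetical repo id -- check Hugging Face for the actual name. V4 Flash
# (284B total, 13B active) is the more realistic self-host target; adjust
# tensor_parallel_size to your GPU count.
llm = LLM(model="deepseek-ai/DeepSeek-V4-Flash", tensor_parallel_size=8)

params = SamplingParams(temperature=0.2, max_tokens=1024)
outputs = llm.generate(["Explain this stack trace:\n..."], params)
print(outputs[0].outputs[0].text)
```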
V4 remains text-only, leaving competitors ahead in multimodal, and it trails by an estimated 3 to 6 months on some reasoning benchmarks.
Feedback varies: some see near-frontier performance at a fraction of the cost, others notice little difference in daily use.
Rather than beating every rival, V4 resets expectations on cost and accessibility, combining performance, efficiency, and openness to challenge premium models.
OpenAI just dropped GPT-5.5, and only a few hours later DeepSeek showed up with V4. They actually released two models with a 1-million-token context window, an MIT license, extremely low pricing, strong coding performance, and support for both Western GPU stacks and China's domestic chip ecosystem. So this launch is about benchmarks, but also about cost, infrastructure, long-context agents, and the bigger fight over who controls the AI stack.

The new family has two versions, DeepSeek V4 Pro and DeepSeek V4 Flash. V4 Pro is the big one. It has 1.6 trillion total parameters with 49 billion active parameters per inference pass. So the full model is massive, but it does not wake up the whole thing every time you ask it something. It uses a mixture-of-experts setup where only the relevant parts activate for each task. V4 Flash is the smaller and faster version, with 284 billion total parameters and 13 billion active parameters. Both are text-only for now, both support 1 million tokens of context, and both can produce up to 384,000 output tokens, per DeepSeek's API docs.

And the pricing is where DeepSeek is trying to punch the market in the face. V4 Flash costs $0.14 per million input tokens and $0.28 per million output tokens. V4 Pro costs $1.74 input and $3.48 output. For comparison, GPT-5.5 reportedly launched at $5 input and $30 output, with GPT-5.5 Pro going as high as $30 input and $180 output per million tokens. Claude Opus 4.7 is also far more expensive, around $5 input and $25 output. So when people say V4 Pro is 98% cheaper than GPT-5.5 Pro, that is the point. And V4 Flash at $0.28 output is over 99% cheaper than something like Claude Opus 4.7's output pricing. That is why developers are paying attention. A model does not need to beat every closed-source system in every category to change the market. Sometimes it just needs to be good enough, open enough, and cheap enough.

Now the early benchmarks are already causing a lot of noise. Arena.ai said DeepSeek V4 Pro in thinking mode ranked third among open-source models and 14th overall in its code arena, and described it as a significant jump over DeepSeek V3.2. Vals AI went even harder, saying V4 became the number-one open-weight model in its Vibe Code benchmark, beating Kimi K2.6 and even closed-source models like Gemini 3.1 Pro. Vals also said V4 made about a 10-fold jump over V3.2 on that benchmark: V3.2 only scored five points there, and V4 moved far beyond it. In Vals' broader index, V4 came second overall, only 0.07% behind Kimi K2.6.

Now, DeepSeek's own wording is more cautious, which is actually interesting. In its own material, the company says V4 Pro has passed mainstream open-source models and is close to closed-source systems like Gemini in knowledge and reasoning, but still has a gap of around 3 to 6 months compared to the most advanced frontier models. So DeepSeek is not pretending it destroys everything everywhere. They are basically saying: in code, agents, math, and STEM we are very close, sometimes ahead; in general reasoning, the best closed models still have an edge. And that's reasonable, because most AI launches cherry-pick the five graphs where they win and pretend the rest does not exist. On Codeforces, V4 Pro scored 3,26, which places it around 23rd among actual human contest participants.
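The "does not wake up the whole thing" line is standard top-k mixture-of-experts routing. The toy layer below illustrates the idea with made-up dimensions; it is not DeepSeek's architecture.

```python
import torch, torch.nn as nn

class ToyMoE(nn.Module):
    """Toy top-k mixture-of-experts layer: only k experts run per token,
    so active parameters are a small fraction of total parameters."""
    def __init__(self, d=512, n_experts=64, k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                        # x: (tokens, d)
        scores = self.router(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):               # run only the chosen experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

With 64 experts and k=2, each token touches roughly 1/32 of the expert parameters per pass, which is the same total-versus-active distinction as V4's 1.6T total / 49B active split.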
On Apex Shortlist, a difficult math-and-STEM benchmark, it hit 90.2%, beating Opus 4.6 at 85.9% and GPT-5.4 at 78.1%. On SWE-bench Verified, which tests real GitHub issue resolution, it scored 80.6%, matching Claude Opus 4.6. But it still trails in some areas. On MMLU-Pro, Gemini 3.1 Pro scored 91.0% while V4 Pro scored 87.5%. On GPQA Diamond, Gemini scored 94.3 while V4 Pro scored 90.1. On Humanity's Last Exam, Gemini 3.1 Pro reached 44.4% while V4 Pro scored 37.7%. So the real story is not "DeepSeek beats every model." The real story is that an open-weight model is now competing near the top while costing dramatically less.

The coding and agent side may be the strongest part of the release. DeepSeek says V4 has become the main agentic coding model used internally by its own employees. In an internal survey of 85 experienced developers, more than 90% included V4 Pro among their top choices for coding tasks. Another internal result said 52% considered it ready to become their default coding agent, 39% leaned yes, and fewer than 9% said no. DeepSeek also says V4 Pro works well with agent frameworks like Claude Code, OpenCode, OpenClaw, and Code Buddy. Nvidia also mentioned agentic workflows, like its NeMo agents, AI-Q Blueprint, and Data Explorer agent, where DeepSeek V4 can be used as the LLM for long-running assistants, deep-research systems, data-analysis agents, and code-generation workflows.

One technical feature behind this is called interleaved thinking. In older agent workflows, when the model made a tool call, searched something, ran code, then came back, parts of the reasoning state could get lost between steps. The model had to rebuild context again and again. V4 is designed to retain reasoning across tool calls, which matters a lot for 20-step agent workflows where one mistake halfway through can ruin everything.

And that brings us to the biggest technical part of V4: the new attention system. Long context is expensive because standard attention scales badly. When context gets longer, the model has to compare more and more pieces of text against each other; double the context and the compute can grow roughly four times. That is why many models advertise huge context windows but then throttle them, slow down, or become expensive when people actually use them. DeepSeek's answer is a hybrid attention architecture built around Compressed Sparse Attention and Heavily Compressed Attention. CSA compresses groups of tokens, for example every four tokens, into smaller information blocks, then uses sparse retrieval to focus only on the most relevant content instead of paying attention to everything equally. HCA is more aggressive: it compresses larger groups, around 128 tokens, into a single entry, giving the model a cheaper global view of the whole context. So V4 gets both detail and overview. It can keep nearby text more complete while compressing older or less important context. That is how DeepSeek is trying to make 1-million-token inference actually practical.

The efficiency numbers are pretty wild. At 1 million tokens, V4 Pro uses only 27% of the per-token inference compute required by V3.2, and its KV-cache memory burden drops to 10%. V4 Flash goes even further, using just 10% of the compute and 7% of the memory compared to V3.2. Nvidia described this as a 73% reduction in per-token inference FLOPs and a 90% reduction in KV-cache memory burden for the Pro model. That is the core reason the pricing can be so aggressive. It is not only a marketing trick.
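To see why the KV-cache number matters, here is a rough sizing calculation. None of V4's internal dimensions are given here, so the layer count, head count, and head size below are placeholder assumptions; the point is only that KV-cache memory grows linearly with context, so a 90% reduction is the difference between fitting 1M tokens on a node or not.

```python
# Rough KV-cache sizing. All model dimensions are placeholder assumptions;
# only the scaling logic is the point.
def kv_cache_gib(tokens, layers=60, kv_heads=8, head_dim=128, bytes_per=2):
    # 2x for keys and values; fp16/bf16 is 2 bytes per element.
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 2**30

full = kv_cache_gib(1_000_000)
print(f"naive KV cache @ 1M tokens: {full:.0f} GiB")   # ~229 GiB under these assumptions
print(f"at 10% (V4 Pro's reported ratio): {0.10 * full:.0f} GiB")
```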
The architecture is designed to make long-context inference cheaper. DeepSeek also introduced other engineering changes, including MHC (manifold-constrained hyperconnections), which upgrades traditional residual connections to keep signal propagation more stable, and the Muon optimizer, replacing AdamW for large-scale and low-precision training. DeepSeek says full engineering optimization can deliver almost two times inference acceleration.

The hardware story is just as important. DeepSeek V4 is being positioned as a model that works across both Nvidia infrastructure and Chinese domestic chips. On one side, Nvidia published launch-day support for DeepSeek V4 on Blackwell, including GPU-accelerated endpoints on build.nvidia.com, NIM deployment, vLLM recipes, and SGLang serving recipes for Blackwell and Hopper systems. Nvidia said DeepSeek V4 Pro on GB200 NVL72 showed over 150 tokens per second per user in early out-of-the-box tests. They also tested Blackwell B300 with vLLM's day-zero recipe using the model's native MXFP4 format. Nvidia's point is clear: even if DeepSeek is part of China's AI rise, Nvidia still wants developers running it on Blackwell, Hopper, NIM, vLLM, SGLang, and the whole CUDA ecosystem.

But on the other side, V4 is also a major step toward China's domestic AI stack. DeepSeek verified fine-grained expert-parallel optimization on Huawei Ascend NPU platforms, with acceleration ratios between 1x and 1.73x in general inference workloads. Huawei also said its Ascend supernode products based on the Ascend 950 series would support DeepSeek V4. This matters because the US has restricted high-end Nvidia chip exports to China since 2022. The goal was to slow Chinese AI progress, but DeepSeek is showing a different outcome: the restrictions push Chinese labs to optimize harder, rely more on domestic chips, and build models that are cheaper to run.

That does not mean China has fully replaced Nvidia. MIT Technology Review noted that DeepSeek appears to use Chinese chips mainly for inference, while parts of training may still rely heavily on Nvidia. Tsinghua professor Liu Xiyuan said the technical report suggests only part of the training process was adapted for Chinese chips, and it is unclear whether some long-context features were fully adapted. Multiple sources also said Chinese chips are still weaker than Nvidia chips for training, though better suited for inference. So this is not a clean break from Nvidia. It is more like the first serious proof that China can start building a parallel AI infrastructure.

DeepSeek even ties future V4 pricing to that hardware shift. The company says V4 Pro throughput is currently limited because of high-end compute constraints, but prices could fall significantly after Huawei Ascend 950 supernodes begin shipping at scale in the second half of 2026. That is a big statement: V4 Pro is already cheap, and DeepSeek is basically saying it may become even cheaper once domestic hardware capacity expands.

There is also a market-psychology angle here. DeepSeek's R1 release in January 2025 shocked the industry so hard that Nvidia reportedly lost around $600 billion in market value in one day. V4 probably will not create the same kind of instant panic, because the market is more prepared now, but it may matter more for actual builders. For enterprise users, V4 Pro changes the economics of large-scale AI workflows.
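Muon itself is publicly documented independently of DeepSeek: instead of AdamW's elementwise update, it orthogonalizes the momentum-averaged gradient of each 2D weight with a Newton-Schulz iteration. The sketch below follows the public reference implementation (coefficients included), not DeepSeek's exact large-scale variant.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a 2D matrix via the quintic Newton-Schulz
    iteration used by the public Muon optimizer (coefficients from the
    reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    if G.size(0) > G.size(1):
        X = X.T                       # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if G.size(0) > G.size(1) else X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update for a 2D weight: momentum, then orthogonalize."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    scale = max(1, param.size(0) / param.size(1)) ** 0.5  # shape-based scaling
    param.data.add_(update, alpha=-lr * scale)
    return momentum
```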
Legal review, financial research, codebase analysis, document processing, support automation, and internal agents all become cheaper when you can feed 1 million tokens at a time and pay $1.74 input and $3.48 output. For solo developers and smaller teams, V4 Flash may be the more interesting one. At $0.14 input and $0.28 output, it becomes extremely cheap to build chat, summarization, routing, coding helpers, and lightweight agents. And because the models are MIT-licensed and available on Hugging Face, companies can download, modify, and self-host them. That open-weight part is crucial: you are not only renting the model through an API, you can build around it, customize it, optimize it, and deploy it on your own infrastructure if you have the hardware.

There are limitations, though. The models are text-only right now, so OpenAI, Google, Xiaomi, and others still have an edge in multimodal systems. Xiaomi just launched MiMo V2.5 Pro with text, image, audio, and video support, and OpenAI and Google are also pushing hard on multimodal agents. DeepSeek says multimodal capabilities are coming, but for now V4 is mainly a text, code, reasoning, long-context, and agentic model.

There is also some early disagreement about real-world experience. Many users on X called it a market-shattering release because of the price-performance ratio. Some claimed V4 Flash feels close to GPT-5.4-level capability at a tiny fraction of the cost. Others were less impressed, saying V4 Flash did not feel clearly better than the already mature V3.2 in daily use. That difference makes sense. Benchmarks often show what a model can do under ideal conditions; real-world usage shows how it behaves across messy prompts, vague instructions, long conversations, and personal workflows. V4 may be excellent for code and agents while still feeling uneven in some everyday chat situations.

DeepSeek is also retiring the old DeepSeek Chat and DeepSeek Reasoner endpoints on July 24, 2026. For now, those endpoints already route to V4 Flash in non-thinking and thinking modes, so API users may already be interacting with the new system without treating it as a separate model.

The bigger takeaway is simple. V4 is a pricing attack, an open-source attack, a long-context engineering attack, and a hardware-strategy move at the same time. OpenAI still has stronger frontier performance in several areas. Gemini still leads on some reasoning and expert-knowledge benchmarks. Claude still has advantages in certain long-context retrieval and premium coding workflows. But DeepSeek is making the gap look smaller while making the bill look ridiculous. And that's why this launch is important: once developers can build serious agents with 1-million-token context, strong coding ability, open weights, and output pricing under $4 per million tokens for Pro, the premium-model question changes completely. This release may not create the same shock wave as R1, but for developers, startups, and enterprise AI teams, it may be one of the most important model launches of the year.