
A multi-model “Fusion” approach is emerging as a lower-cost alternative to top-tier AI systems, delivering near-frontier performance on research tasks while exposing trade-offs in long, complex workflows.
A new approach to AI, known as Fusion, combines multiple models instead of relying on a single system. Prompts are sent to several models simultaneously, each producing independent answers using tools like web search and code execution. A separate “judge” model then synthesizes these outputs into a single response, identifying agreement, contradictions, and missing insights.
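The fan-out and synthesis flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual Fusion implementation: the `Model` type, `fuse`, `make_stub`, and `stub_judge` are hypothetical names, the models are local stubs rather than real API calls, and tool use (web search, code execution) is omitted.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical type: any async function that maps a prompt to an answer.
Model = Callable[[str], Awaitable[str]]


async def fuse(prompt: str, panel: List[Model], judge: Model) -> str:
    # Fan out: every panel model answers the same prompt independently.
    answers = await asyncio.gather(*(model(prompt) for model in panel))

    # Fan in: the judge sees all candidate answers and writes one response.
    numbered = "\n".join(f"[{i + 1}] {a}" for i, a in enumerate(answers))
    judge_prompt = (
        f"Question: {prompt}\n"
        f"Candidate answers:\n{numbered}\n"
        "Note agreements, contradictions, and gaps, then give one final answer."
    )
    return await judge(judge_prompt)


def make_stub(name: str, reply: str) -> Model:
    # Stand-in for a real model call; a real panel would hit provider APIs.
    async def call(prompt: str) -> str:
        return f"{name}: {reply}"
    return call


async def stub_judge(prompt: str) -> str:
    # Toy judge: only reports how many candidate answers it received.
    count = sum(1 for line in prompt.splitlines() if line.startswith("["))
    return f"synthesized from {count} answers"


if __name__ == "__main__":
    panel = [make_stub("model_a", "42"), make_stub("model_b", "41")]
    print(asyncio.run(fuse("What is 6 x 7?", panel, stub_judge)))
```

Passing the same stub twice in `panel` mirrors the same-model setup, since `fuse` does not care whether the panel members differ.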
On the Draco benchmark, which evaluates 100 research tasks across domains like law, finance, and medicine, single models topped out in the mid-60s: Claude Fable 5 scored 65.3%, GPT 5.5 scored 60%, and DeepSeek V4 Pro reached 60.3%. The top Fusion configuration scored 69%, ahead of every single model.
Even running the same model twice improved outcomes. Running two instances of Opus 4.8 in parallel and synthesizing the results raised its score from 58.8% to 65.5%, a gain of 6.7 points. This suggests that the improvement comes not only from model diversity but also from comparing multiple reasoning paths and consolidating the stronger elements.
A “budget” Fusion panel combining Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro, with Opus 4.8 as the synthesizer, achieved 64.7%, just 0.6 points below Fable 5. This led to claims of “Fable-level intelligence at half the price,” a pitch aimed mainly at production environments handling large volumes of AI queries.
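Fusion setups are estimated at $1.50–$3 per million input tokens and $4–$6 per million output tokens, compared to $3–$6 input and $9–$15 output for Fable 5. At those output rates, a workload generating 10 million output tokens daily (about 300 million per month) would cost roughly $2,700–$4,500 per month on Fable 5 versus $1,200–$1,800 on Fusion; the quoted monthly drop from roughly $90,000–$150,000 to $40,000–$60,000 corresponds to about 10 billion output tokens per month.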
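The output-token arithmetic can be checked with a short script. This is a rough sketch that covers output tokens only and ignores input-token spend; the function name, the 30-day month, and the two example volumes are assumptions chosen for illustration.

```python
# Price ranges in USD per million output tokens, taken from the figures above.
FUSION_OUTPUT = (4.0, 6.0)
FABLE_OUTPUT = (9.0, 15.0)


def monthly_output_cost(tokens_per_month, price_range):
    """Return (low, high) monthly spend in USD for output tokens only."""
    millions = tokens_per_month / 1_000_000
    low, high = price_range
    return (millions * low, millions * high)


if __name__ == "__main__":
    # Two illustrative volumes: 10 million tokens/day over an assumed
    # 30-day month, and 10 billion tokens/month.
    volumes = [("10M/day", 10_000_000 * 30), ("10B/month", 10_000_000_000)]
    for label, tokens in volumes:
        fable = monthly_output_cost(tokens, FABLE_OUTPUT)
        fusion = monthly_output_cost(tokens, FUSION_OUTPUT)
        print(f"{label}: Fable 5 ${fable[0]:,.0f}-${fable[1]:,.0f}, "
              f"Fusion ${fusion[0]:,.0f}-${fusion[1]:,.0f}")
```

Because the price ranges are linear in volume, the relative saving is the same at any scale; only the absolute dollar amounts change.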
Fusion reduces a key weakness of single-model systems: hidden blind spots. By forcing multiple models to independently analyze a question, the system surfaces disagreements and overlooked assumptions. The final synthesis stage integrates these perspectives into a more balanced and complete answer.
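One simple way to picture how disagreements surface is to compare which claims each model made. The sketch below is only an illustration under strong assumptions: it presumes each answer has already been reduced to a list of normalized claim strings (a hard step in its own right), and `compare_claims` is a hypothetical helper, not part of any Fusion system. It separates claims that every model made from claims only some made; it does not detect logical contradictions.

```python
from collections import Counter


def compare_claims(claims_by_model):
    """Split claims into those every model made and those only some made."""
    # Count each claim once per model, even if a model repeats it.
    counts = Counter(
        claim for claims in claims_by_model.values() for claim in set(claims)
    )
    n_models = len(claims_by_model)
    agreed = sorted(c for c, k in counts.items() if k == n_models)
    partial = sorted(c for c, k in counts.items() if k < n_models)
    return agreed, partial


if __name__ == "__main__":
    # Made-up claims, purely for illustration.
    answers = {
        "model_a": ["statute applies", "deadline is 30 days"],
        "model_b": ["statute applies", "deadline is 60 days"],
        "model_c": ["statute applies"],
    }
    agreed, partial = compare_claims(answers)
    print("agreed:", agreed)
    print("needs review:", partial)
```

Claims in the second list are the ones a synthesizer would need to examine, since they may reflect either a blind spot in the other models or an error in the one that raised them.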
All models in testing used identical tools: web search and fetch via Exa, plus bash execution. This ensured that improvements came from orchestration rather than better tooling. Responses were evaluated using a weighted rubric emphasizing factual accuracy, reasoning depth, and citation quality.
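A weighted rubric of this kind reduces to a weighted average over the criteria. The sketch below shows the mechanics only: the three criteria come from the description above, but the weights, the 0-to-1 score scale, and the `rubric_score` function are assumptions, since the benchmark's actual weights are not given here.

```python
# Hypothetical weights: the criteria are named above, their weights are not.
WEIGHTS = {
    "factual_accuracy": 0.5,
    "reasoning_depth": 0.3,
    "citation_quality": 0.2,
}


def rubric_score(scores, weights=WEIGHTS):
    """Weighted average of per-criterion scores, each assumed to be in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


if __name__ == "__main__":
    # Made-up per-criterion scores for a single response.
    example = {
        "factual_accuracy": 0.8,
        "reasoning_depth": 0.6,
        "citation_quality": 0.5,
    }
    print(f"{rubric_score(example):.1%}")
```

Dividing by the total weight keeps the result on the same scale as the inputs even if the weights do not sum to exactly one.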
The Draco benchmark is limited to text-based, English-only tasks and does not cover long, multi-step workflows. Some evaluation inconsistencies remain, including partial task completion for Fable 5 and differences in judging models. A contamination issue involving access to grading criteria was later corrected by blocking specific domains.
Fusion struggles with long-horizon workflows, where tasks require many dependent steps and consistent memory. Sequential processes—such as coding large systems or managing extended plans—benefit from a single model maintaining coherence. Fusion’s parallel structure can fragment state and reduce reliability in these scenarios.
Fusion is best suited for research-heavy, variable, and high-volume workloads, where cost efficiency and diverse perspectives matter. Single-model systems like Fable 5 remain preferable for compliance-sensitive, reproducible, or deeply sequential tasks requiring stable context and consistent behavior.
Fusion systems demonstrate that combining multiple AI models can rival top-tier performance at significantly lower cost, but they remain complementary rather than a full replacement for single-model systems in complex, long-duration tasks.