
Replit unveiled ByBench, an end-to-end benchmark for AI “vibe coding,” alongside a continuous evaluation system that uses real user data and automated testing to improve coding agents daily.
Replit is moving beyond one-off benchmark scores toward a continuous evaluation loop driven by production data. With models, prompts, and tools changing rapidly, static metrics are insufficient to track real-world performance. The new approach combines offline benchmarks with live system feedback to guide daily improvements.
The platform targets users who provide only natural language prompts and expect fully working applications. Unlike traditional coding benchmarks, there are no predefined tests, frameworks, or partial codebases to start from. Evaluation therefore has to check whether an application actually works as intended, not just whether code patches pass tests.
ByBench is a new open-source benchmark designed to evaluate AI agents building applications from scratch. It uses around 20 real-world product requirement documents (PRDs) and measures functional correctness through automated evaluation. The benchmark supports multiple scenarios, including building from zero, extending existing apps, and modifying agent-generated code.
Instead of human scoring, ByBench uses AI evaluators that read code, launch the application in a browser, and execute natural-language testing plans. These evaluators simulate user actions such as logging in or toggling features, producing scores based on task completion rather than static test suites.
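To make "scores based on task completion" concrete, here is a minimal sketch of how per-step verdicts from a natural-language testing plan could be rolled up into one number. The `StepResult` type, the step descriptions, and the plain pass-fraction formula are assumptions for illustration; the article does not describe ByBench's actual scoring formula.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    description: str  # natural-language step from a testing plan (hypothetical)
    passed: bool      # verdict an AI evaluator would return after trying the step


def completion_score(results: list[StepResult]) -> float:
    """Fraction of plan steps judged complete (assumed formula, not ByBench's)."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


# Made-up verdicts for a made-up to-do app.
plan = [
    StepResult("sign up with email and password", True),
    StepResult("log in with the new account", True),
    StepResult("toggle dark mode in settings", False),
    StepResult("create a to-do item and see it listed", True),
]

score = completion_score(plan)
print(f"task completion: {score:.2f}")
```

A real evaluator would produce the `passed` values by driving a browser; here they are hard-coded so the aggregation step can be read in isolation.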
Established benchmarks like HumanEval and SWE-bench rely on predefined tests and patch-based workflows. These setups fail to capture real-world usage, where applications are built from scratch. Replit identifies a “functional correctness gap” and positions ByBench as a more realistic measure of agent performance.
Early results show roughly a 2× performance gap between frontier proprietary models and open-weight alternatives. The most difficult scenario is extending previously generated code, where compounding errors degrade performance significantly.
Replit processes millions of daily sessions, extracting insights from real usage. Metrics include execution time, cost, user sentiment, and whether users publish their apps. A/B testing is used extensively, though results often involve trade-offs rather than clear winners.
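One way to read "trade-offs rather than clear winners": a variant is a clear winner only if it is at least as good on every metric and strictly better on one. The sketch below applies that rule to two arms of an experiment. The metric names, the numbers, and the dominance rule itself are illustrative assumptions, not Replit's method or data.

```python
# Direction of each metric: True means lower is better (assumed metric names).
LOWER_IS_BETTER = {"seconds": True, "cost_usd": True, "publish_rate": False}


def compare(a: dict, b: dict) -> str:
    """Return 'a', 'b', 'tie', or 'trade-off' using a simple dominance check."""
    a_better = b_better = False
    for metric, lower in LOWER_IS_BETTER.items():
        if a[metric] == b[metric]:
            continue
        a_wins = a[metric] < b[metric] if lower else a[metric] > b[metric]
        if a_wins:
            a_better = True
        else:
            b_better = True
    if a_better and b_better:
        return "trade-off"
    if a_better:
        return "a"
    if b_better:
        return "b"
    return "tie"


# Made-up numbers: the variant is slower and costlier, but more users publish.
control = {"seconds": 120, "cost_usd": 0.40, "publish_rate": 0.20}
variant = {"seconds": 150, "cost_usd": 0.55, "publish_rate": 0.24}

print(compare(control, variant))
```

When the result is a trade-off, no metric alone settles the question, and the decision falls to human judgment.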
The system clusters execution traces to detect recurring failure patterns, including rare issues that affect as few as 1% of cases. By embedding failures and grouping those that are semantically similar, engineers can identify and prioritize fixes for problems that would otherwise stay buried in logs.
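The idea of grouping semantically similar failures can be sketched with a greedy cosine-similarity clustering. The three-dimensional vectors below are hand-made stand-ins for real embeddings, and the trace labels, the 0.9 threshold, and the greedy algorithm are all assumptions for illustration; the article does not say which embedding model or clustering method Replit uses.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster(embeddings: list[list[float]], threshold: float = 0.9) -> list[list[int]]:
    """Greedy clustering: each item joins the first cluster whose seed is similar enough."""
    clusters: list[list[int]] = []  # each cluster is a list of indices; index 0 is the seed
    for i, vec in enumerate(embeddings):
        for members in clusters:
            if cosine(vec, embeddings[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy failure summaries with made-up embedding vectors.
failures = {
    "trace-1: database migration timed out": [0.90, 0.10, 0.00],
    "trace-2: migration step exceeded time limit": [0.85, 0.15, 0.05],
    "trace-3: login button not rendered": [0.00, 0.20, 0.95],
    "trace-4: DB migrate timeout": [0.88, 0.12, 0.02],
}

labels = list(failures)
groups = cluster(list(failures.values()))
for group in sorted(groups, key=len, reverse=True):
    print(len(group), [labels[i] for i in group])
```

Sorting clusters by size gives a rough priority list: the three differently worded migration timeouts show up as one recurring pattern instead of three unrelated log lines.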
An internal system called Telescope automates the loop: detect issues, generate code fixes via agents, validate with ByBench, and deploy or A/B test changes. Many fixes are generated automatically, though human oversight remains critical for prioritization and decision-making.
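The detect, fix, validate, and deploy-or-test loop can be sketched as a small pipeline. Everything here is hypothetical: `run_fix_loop`, the stub functions, the scores, and the rule "send to an A/B test if the benchmark score meets the baseline, otherwise hand to a human" are assumptions for illustration and do not describe Telescope's actual interfaces or policies.

```python
from typing import Callable, Dict, List


def run_fix_loop(
    issues: List[str],
    propose_fix: Callable[[str], str],   # stands in for an agent that writes a patch
    benchmark: Callable[[str], float],   # stands in for a benchmark run on the patched agent
    baseline: float,                     # benchmark score of the current production agent
) -> Dict[str, str]:
    """For each detected issue: generate a fix, validate it, and pick a next step."""
    decisions: Dict[str, str] = {}
    for issue in issues:
        patch = propose_fix(issue)
        score = benchmark(patch)
        # Assumed gate: only patches that do not regress the benchmark move forward.
        decisions[issue] = "send to A/B test" if score >= baseline else "needs human review"
    return decisions


# Stubs with made-up scores so the sketch runs end to end.
def fake_propose_fix(issue: str) -> str:
    return f"patch for: {issue}"


FAKE_SCORES = {
    "patch for: migration timeout": 0.78,
    "patch for: missing login button": 0.61,
}


def fake_benchmark(patch: str) -> float:
    return FAKE_SCORES[patch]


decisions = run_fix_loop(
    ["migration timeout", "missing login button"],
    fake_propose_fix,
    fake_benchmark,
    baseline=0.70,
)
for issue, decision in decisions.items():
    print(f"{issue}: {decision}")
```

The "needs human review" branch reflects the article's point that oversight stays in the loop; in this sketch it is the fallback whenever automated validation is not convincing.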
Despite automation, product decisions rely on human “taste,” especially when metrics conflict. Engineers must balance cost, speed, and user satisfaction, shaping the overall experience rather than optimizing a single metric.
Replit’s approach reframes evaluation as a continuous, data-driven system: ByBench and production feedback form a loop that enables rapid, incremental improvement of AI coding agents.