Evaluating and Improving Replit Agent at Scale

8/10
Anthropic • Claude • May 8, 2026 at 06:50 PM • 27:23

TL;DR

Replit unveiled ByBench, an end-to-end benchmark for AI “vibe coding,” alongside a continuous evaluation system that uses real user data and automated testing to improve coding agents daily.

KEY POINTS

Shift from static to continuous evaluation

Replit is moving beyond one-off benchmark scores toward a continuous evaluation loop driven by production data. With models, prompts, and tools changing rapidly, static metrics are insufficient to track real-world performance. The new approach combines offline benchmarks with live system feedback to guide daily improvements.

Unique demands of “vibe coding”

The platform targets users who provide only natural language prompts and expect fully working applications. Unlike traditional coding benchmarks, there are no predefined tests, frameworks, or partial codebases. This requires evaluating whether an application actually works as intended, not just whether code patches pass tests.

Introduction of ByBench

ByBench is a new open-source benchmark designed to evaluate AI agents building applications from scratch. It uses around 20 real-world product requirement documents (PRDs) and measures functional correctness through automated evaluation. The benchmark supports multiple scenarios, including building from zero, extending existing apps, and modifying agent-generated code.

Automated evaluation via AI agents

Instead of human scoring, ByBench uses AI evaluators that read code, launch the application in a browser, and execute natural-language testing plans. These evaluators simulate user actions such as logging in or toggling features, producing scores based on task completion rather than static test suites.
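A minimal sketch of completion-based scoring, assuming each step of a natural-language testing plan ends with a pass/fail verdict from the evaluator. The names StepResult and completion_score are hypothetical and the equal weighting of steps is an assumption; the source does not describe the actual scoring formula.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    description: str  # natural-language step from the testing plan
    passed: bool      # verdict reported by a (hypothetical) evaluator agent


def completion_score(results: list[StepResult]) -> float:
    """Fraction of plan steps judged complete; 0.0 for an empty plan."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


# Illustrative plan only; these steps and verdicts are invented.
results = [
    StepResult("Open the app in a browser", True),
    StepResult("Log in with a test account", True),
    StepResult("Toggle dark mode", False),
    StepResult("Create a new item", True),
]
score = completion_score(results)
print(f"task completion: {score:.2f}")
```

Scoring per step rather than per test suite mirrors the idea that the evaluator judges whether user-visible tasks succeed, not whether a fixed set of unit tests passes.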

Gap with traditional benchmarks

Established benchmarks like HumanEval and SWE-bench rely on predefined tests and patch-based workflows. These fail to capture real-world usage where applications are built from scratch. Replit identifies a “functional correctness gap” and positions ByBench as a more realistic measure of agent performance.

Performance insights across models

Early results show roughly a 2× performance gap between frontier proprietary models and open-weight alternatives. The most difficult scenario is extending previously generated code, where compounding errors degrade performance significantly.

Online evaluation through user data

Replit processes millions of daily sessions, extracting insights from real usage. Metrics include execution time, cost, user sentiment, and whether users publish their apps. A/B testing is used extensively, though results often involve trade-offs rather than clear winners.
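One way to see why A/B results often show trade-offs rather than clear winners is to check whether one variant is at least as good on every metric. The sketch below does that for two variants; the metric names, the numbers, and the compare_variants helper are illustrative assumptions, not Replit's tooling or data.

```python
# Metrics where a smaller value is better (assumed names).
LOWER_IS_BETTER = {"exec_seconds", "cost_usd"}


def compare_variants(a: dict[str, float], b: dict[str, float]) -> str:
    """Return which variant dominates, or 'trade-off' if each wins somewhere."""
    a_wins = b_wins = 0
    for metric in a:
        if a[metric] == b[metric]:
            continue
        if metric in LOWER_IS_BETTER:
            a_better = a[metric] < b[metric]
        else:
            a_better = a[metric] > b[metric]
        if a_better:
            a_wins += 1
        else:
            b_wins += 1
    if a_wins and b_wins:
        return "trade-off"
    if a_wins:
        return "A dominates"
    if b_wins:
        return "B dominates"
    return "tie"


# Invented numbers: B is faster and cheaper, but fewer users publish.
variant_a = {"exec_seconds": 120.0, "cost_usd": 0.40, "publish_rate": 0.31}
variant_b = {"exec_seconds": 95.0, "cost_usd": 0.33, "publish_rate": 0.28}
print(compare_variants(variant_a, variant_b))
```

When the result is a trade-off, no metric alone settles the question, so the choice falls to human judgment about which dimension matters more.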

Trace clustering to identify failures

The system clusters execution traces to detect recurring failure patterns, including rare issues that affect as few as 1% of cases. By embedding failures and grouping the semantically similar ones, engineers can identify and prioritize fixes for problems that would otherwise stay buried in logs.
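The grouping idea can be sketched with the standard library alone. Here a bag-of-words vector stands in for a learned embedding, and a greedy cosine-similarity threshold stands in for whatever clustering method the real system uses; both substitutions, along with the function names and the sample traces, are assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts. A real system would likely use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_traces(traces: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    clusters: list[tuple[Counter, list[str]]] = []
    for trace in traces:
        vec = embed(trace)
        for representative, members in clusters:
            if cosine(representative, vec) >= threshold:
                members.append(trace)
                break
        else:
            clusters.append((vec, [trace]))
    return [members for _, members in clusters]


# Invented failure messages for illustration.
traces = [
    "npm install failed missing package",
    "database connection refused on startup",
    "npm install failed network timeout",
    "database connection refused after deploy",
]
groups = cluster_traces(traces)
for group in sorted(groups, key=len, reverse=True):
    print(len(group), group[0])
```

Sorting clusters by size gives a rough priority list, while small clusters surface the rare failures that are easy to miss when reading individual logs.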

“Telescope” system for automated improvement

An internal system called Telescope automates the loop: detect issues, generate code fixes via agents, validate with ByBench, and deploy or A/B test changes. Many fixes are generated automatically, though human oversight remains critical for prioritization and decision-making.
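The loop described above (detect, generate a fix, validate on the benchmark, then deploy or A/B test) can be outlined as a single decision step. This is a sketch under stated assumptions, not Telescope's implementation: the improvement_step function, the callables passed into it, and the rule that marginal gains go to an A/B test are all hypothetical.

```python
from typing import Callable


def improvement_step(
    issue: str,
    propose_fix: Callable[[str], str],        # hypothetical: agent that drafts a fix
    benchmark_score: Callable[[str], float],  # hypothetical: offline benchmark run
    baseline: float,
    ab_margin: float = 0.02,                  # assumed threshold for a "clear" gain
) -> str:
    """Return 'reject', 'ab_test', or 'deploy' for one detected issue."""
    fix = propose_fix(issue)
    score = benchmark_score(fix)
    if score < baseline:
        return "reject"   # the fix regresses the offline benchmark
    if score - baseline < ab_margin:
        return "ab_test"  # gain too small to trust offline; check with live traffic
    return "deploy"


# Stub components with invented scores, for illustration only.
def fake_propose(issue: str) -> str:
    return f"patch for: {issue}"


decision = improvement_step(
    "login form times out",
    propose_fix=fake_propose,
    benchmark_score=lambda fix: 0.80,
    baseline=0.70,
)
print(decision)
```

Passing the fix generator and benchmark in as callables keeps the decision logic separate from the agents themselves, and leaves room for a human to review or override the outcome before anything ships.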

Human judgment still decisive

Despite automation, product decisions rely on human “taste,” especially when metrics conflict. Engineers must balance cost, speed, and user satisfaction, shaping the overall experience rather than optimizing a single metric.

CONCLUSION

Replit’s approach reframes evaluation as a continuous, data-driven system, with ByBench and production feedback forming a loop that enables rapid, incremental improvement of AI coding agents.
