
Base44 scaled from a solo founder to an 80-engineer organization by using simple AI-driven workflows to automate onboarding, code review, testing, and product validation while maintaining rapid development speed.
Base44, a “vibe coding” platform designed to let both technical and non-technical users build software, was launched in late 2024 and achieved profitability by April 2025. Strong traction from building in public attracted acquisition interest, leading to integration with Wix, which shared a similar user base and growth ambitions.
After the acquisition, the team expanded rapidly from 2 to 15 engineers, creating bottlenecks in onboarding, code review, and product validation. The company prioritized speed, aiming to preserve the founder’s development velocity while distributing responsibilities across a growing team.
Instead of maintaining static onboarding documents, new engineers used simple AI prompts to analyze commit history and generate real-time system overviews. Additional prompts created architectural diagrams on demand, eliminating the need for manual documentation updates and enabling immediate productivity.
The founder initially reviewed every pull request, creating a scalability issue. The team extracted patterns from historical PR feedback and used AI to replicate review standards, effectively multiplying review capacity without introducing complex governance processes.
These lightweight systems led to major efficiency gains. A complex WhatsApp integration, expected to take one to two weeks, was completed by a newly onboarded engineer in under three days, including onboarding, development, and review, before being shipped to production.
Instead of building a traditional evaluation suite, the company analyzed live user interactions. AI classified chat messages by “frustration level,” allowing teams to detect failures in real time. New features were gradually rolled out and evaluated based on whether they increased or reduced user frustration.
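The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not Base44's implementation: the function names, the sample messages, and the version labels are all hypothetical, and the keyword check is only a stand-in for the small LLM classifier so that the sketch runs on its own.

```python
from collections import defaultdict
from typing import Callable, Iterable

# Hypothetical stand-in for the LLM classifier. A real system would send each
# message to a small model; this keyword check only keeps the sketch runnable.
FRUSTRATION_CUES = ("broken", "not working", "doesn't work", "can't believe")


def keyword_classifier(message: str) -> bool:
    """Return True when the message looks frustrated (assumed heuristic)."""
    text = message.lower()
    return any(cue in text for cue in FRUSTRATION_CUES)


def frustration_rate_by_version(
    messages: Iterable[tuple[str, str]],
    is_frustrated: Callable[[str], bool] = keyword_classifier,
) -> dict[str, float]:
    """Return the share of frustrated messages for each agent version."""
    totals: dict[str, int] = defaultdict(int)
    frustrated: dict[str, int] = defaultdict(int)
    for version, text in messages:
        totals[version] += 1
        if is_frustrated(text):
            frustrated[version] += 1
    return {version: frustrated[version] / totals[version] for version in totals}


# Made-up chat messages tagged with the agent version that served them.
sample = [
    ("baseline", "Add a login page"),
    ("baseline", "Why is this broken?"),
    ("baseline", "Now add a dashboard"),
    ("baseline", "Make the header blue"),
    ("candidate", "Add a login page"),
    ("candidate", "I can't believe it's not working"),
]
rates = frustration_rate_by_version(sample)
print(rates)
```

In a gradual rollout, a candidate version whose rate rises well above the baseline would be a signal to stop and investigate, whether the change was to the prompt, the model, or the infrastructure.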
The organization later grew through external hiring and internal transfers from Wix, then doubled from 40 to nearly 80 people in a single night when a separate vibe-coding product was merged in. This growth required more structured experimentation, evaluation systems, and quality assurance processes without slowing down development.
A system was built to analyze past experiments and generate guidelines for new feature releases. Each pull request received automated recommendations on whether to ship directly, roll out gradually, or run an A/B test, including suggested duration and key performance indicators.
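To make the shape of such a recommendation concrete, here is a minimal Python sketch. The rules, field names, and KPI choices below are assumptions made for illustration; the actual guidelines were distilled by an AI model from past experiments and are not given in the source. Only the three outcomes and the seven-day and one-month durations come from the talk.

```python
from dataclasses import dataclass
from enum import Enum


class Rollout(Enum):
    SHIP = "ship directly"
    GRADUAL = "gradual rollout"
    AB_TEST = "A/B test"


@dataclass(frozen=True)
class Verdict:
    rollout: Rollout
    duration_days: int  # 0 means no experiment window is needed
    kpis: tuple[str, ...]


def recommend(affects_monetization: bool, changes_agent: bool, user_visible: bool) -> Verdict:
    """Hypothetical rules standing in for guidelines learned from past experiments."""
    if affects_monetization:
        # Small shifts in conversion take longer to detect, hence the longer test.
        return Verdict(Rollout.AB_TEST, 30, ("conversion rate", "premium rate"))
    if changes_agent:
        return Verdict(Rollout.AB_TEST, 7, ("frustration level", "AI cost"))
    if user_visible:
        return Verdict(Rollout.GRADUAL, 0, ())
    return Verdict(Rollout.SHIP, 0, ())


def format_comment(verdict: Verdict) -> str:
    """Render the verdict as the text of a pull request comment."""
    lines = [f"Recommendation: {verdict.rollout.value}"]
    if verdict.duration_days:
        lines.append(f"Suggested duration: {verdict.duration_days} days")
    if verdict.kpis:
        lines.append("KPIs to monitor: " + ", ".join(verdict.kpis))
    return "\n".join(lines)


comment = format_comment(
    recommend(affects_monetization=True, changes_agent=False, user_visible=True)
)
print(comment)
```

Keeping the verdict as structured data rather than free text would let the same result drive both the pull request comment and the creation of the experiment itself.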
The company connected tools like BigQuery, GitHub, and experimentation platforms into a unified dashboard built with its own product. This allowed teams to track metrics such as conversion rates, AI costs, and feature adoption in real time.
To validate AI-generated applications, Base44 built a simulation system that mimics real user behavior. The system tests app functionality, iterates on failures, and measures metrics like latency, cost, and interaction steps, forming a continuous integration pipeline for AI features.
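The iterate-on-failure loop can be sketched as follows. This is an assumed structure, not the company's code: `run_eval`, `EvalResult`, the fake agent, and the cost figure are all hypothetical, and a real setup would replace the fake agent with the app builder and the check with a browser-driven simulated user.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    passed: bool
    turns: int
    latency_s: float
    cost: float


def run_eval(
    request: str,
    build: Callable[[str], tuple[dict, float]],  # prompt -> (app, cost of the turn)
    check: Callable[[dict], list[str]],          # app -> list of failure descriptions
    max_turns: int = 3,
) -> EvalResult:
    """Ask the agent to build, feed failures back as fix requests, record metrics."""
    start = time.perf_counter()
    prompt = request
    total_cost = 0.0
    for turn in range(1, max_turns + 1):
        app, cost = build(prompt)
        total_cost += cost
        failures = check(app)
        if not failures:
            return EvalResult(True, turn, time.perf_counter() - start, total_cost)
        # A partial failure does not fail the eval; it becomes the next message.
        prompt = "Please fix: " + "; ".join(failures)
    return EvalResult(False, max_turns, time.perf_counter() - start, total_cost)


def make_fake_agent() -> Callable[[str], tuple[dict, float]]:
    """Fake agent that gets the text wrong on its first attempt only."""
    calls = {"n": 0}

    def build(prompt: str) -> tuple[dict, float]:
        calls["n"] += 1
        text = "Hello" if calls["n"] == 1 else "Hello, world"
        return {"text": text}, 0.02  # made-up cost per turn

    return build


def check_hello_world(app: dict) -> list[str]:
    """Smoke-test assertion: the expected text must be visible."""
    if app["text"] == "Hello, world":
        return []
    return ["expected text 'Hello, world' is not visible"]


result = run_eval("Build a hello world app", make_fake_agent(), check_hello_world)
print(result.passed, result.turns)
```

Counting turns alongside pass or fail matters here: an eval that passes only after several fix requests is a different result from one that passes on the first attempt.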
Quality assurance was scaled using AI agents trained on common workflows. These agents could navigate the product, set up test conditions via internal tools, and execute complex test scenarios. This approach handled roughly 80% of QA needs, reducing reliance on manual testers.
Across all stages, the company emphasized minimal, high-impact solutions rather than heavy processes. Complex systems like evaluation frameworks were delayed until necessary, while simpler AI-driven methods delivered immediate value during earlier growth phases.
Base44’s growth demonstrates how AI can replace traditional engineering processes with lightweight, adaptive systems, enabling rapid scaling without sacrificing speed or product quality.
Hello everyone. My name is Yoav, and I lead product at Base44. Joining me on stage later is Gabriel, who leads our AI. We're going to talk about how Base44 scaled from a solo founder-engineer all the way up to 80 engineers, and how Claude Code helped us facilitate that growth while maintaining our velocity. We split this talk into a short intro and then two phases: going from one engineer to 15 engineers, and then going from 15 engineers to 80 engineers.
So let's talk a little bit about the first phase, which is mostly an intro to Base44 and our solo founder. Base44 is a vibe coding platform. That's a new term; a year ago it was Maor thinking, "I want to build a platform that will let anyone build software, non-technical users and technical users alike. Let's build up the speed." He started the platform at the end of 2024, and by 2025 he already had a working product. He started building in public on LinkedIn and Twitter and gained a lot of traction, and by April 2025 the product was already profitable. That's the moment I joined, because money was starting to flow in. And because this was a profitable product with an AI-focused user base and a crazy founder, it started getting the attention of a lot of companies and acquisition opportunities, which leads us to the next phase: post-acquisition.
Wix has a very similar user base to Base44, so they saw Base44 as a big bet. They wanted to maintain the velocity of Base44, but the team expanded dramatically. We basically went from a two-member team to a 15-engineer team, and we needed to scale as fast as possible. We had four major challenges. One: onboarding doesn't scale. We can't have Maor onboard each engineer to the team. Code review doesn't scale. Maor was really, really cautious about what goes inside the back end of Base44, so he wanted to review each PR on his own.
We can't have each engineer sit with our beta testers to understand whether the product is working as expected, so we needed to find a way to automate that as well. And an interesting part of having very immediate product-market fit is that there's a lot of product surface you need to cover. Whether it's integrations, the agentic flow, or the visual editor, there are so many areas, and you need engineers to ramp up really, really quickly.
So let's jump in. How did we solve each one of the challenges? The key takeaway I want everyone to come out of here with, especially those with small teams, is that you need to keep everything very, very simple. The meetings where we tried to tackle those challenges would start with, "Hey, let's build this process where we review everything, and then build an onboarding doc, and we'll do a nightly job that updates it." And we were thinking, actually, no, let's keep it very simple. Every new engineer that comes into the company, we'll give them a task: use two prompts before starting on their first task.
One: go over all the commits and tell me what everyone cares about. So after we were three or four engineers and people had started building their knowledge in each area, the fifth and sixth engineers came in, wrote this prompt, and already got a map of the organization. You don't need to think about how to keep onboarding docs updated as new engineers come in. A simple prompt gives you the entire map of the organization in real time.
The second thing, before you dive into each area, is to ask Claude, "Hey, can you give me a Mermaid chart of how this component works?" And again, this works in real time, because everything keeps evolving. You don't want to be thinking, "I need to keep this document up to date, I need to keep that document up to date." No, Claude keeps it for you. Very, very simple.
One prompt gives you everything an engineer needs to know in order to start working inside of Base44.
The second thing: as I mentioned, Maor was very, very cautious about what code goes inside our agent and what code goes inside the back end of Base44. So we needed a way to amplify Maor's PR review abilities. After about one or two weeks we already had a big pool of PR comments Maor had added inside our repo. So again, instead of sitting down and brainstorming what instructions we need, let's have Claude review the PRs and say what the most important and most crucial things are that engineers need to keep in mind while writing new code. We put that in the instructions, ran it every couple of days, and had more PR reviewers inside Base44 without having to build a sophisticated and complicated process.
The cool thing is that this is when we really started to see velocity picking up. One of the PRs that we remember and keep referring to: we wanted to do a WhatsApp integration inside Base44, to communicate with the agent using WhatsApp, and we handed it over to a new engineer. It requires an integration, it requires working on the agentic flow, and it requires a new Meta API, so we assumed that for a new engineer it was going to be a one-to-two-week endeavor. And it was really, really awesome to see that we gave it out Thursday night, and Sunday morning everything was ready. He onboarded on Thursday using those simple prompts and sent it over as a PR. The model's PR review had two or three small comments, and we were ready to move on to production.
Okay, so we managed to resolve most of the issues. Now we have the issue of how to make sure that what goes into production, especially our agent, works really well for our customers.
Previously, when we were a tiny team, we would just sit with customers and hear how they interact with Base44. But now we needed to find a way to scale. And like almost every naive AI company out there, we said, "Hey, let's build an eval suite. Everything that comes out, we'll run through our evals. It will work perfectly and we'll understand what's going on." I don't know if you've tried to build an eval mechanism before, but usually a 15-person team is not ready for it. It's a much bigger endeavor.
So we sat down and said, okay, we already have a tremendous amount of traffic in production. How do we use that traffic to understand whether the model is working for our customers or not? We have conversion rate, which is nice, but we want to understand whether the agent itself, especially for paying customers, is working as expected. So we started looking at the conversations, and a very simple pattern emerged: when everything is working well, the user doesn't say anything. They just go on to the next feature, and the next, and the next. But when things start to break, that's when users get really loud inside the chat and say, "Hey, why is this broken? I can't believe it's not working." It's really easy to see that things are broken.
So we said, okay, we have a very strong signal when things aren't working. Why don't we leverage that and ask Claude, using a simple model, a Haiku model, to classify each message by whether the user's frustration level is high or low? Once we have that, for every single version of the agent we want to release, we put a small percentage of customers on that version and track the frustration level. And this works whether we're changing the infrastructure, the prompt, or the model.
And we can understand whether it works as well as expected for our users after the change. The key takeaway, again, is keeping everything super simple without building a sophisticated process around it. We hear a lot of "let's build an agent for this" and agent orchestration, but when you're a small team, there are very simple ways of getting almost the same amount of value while keeping processes really lean. But when you scale from 15 to 80, it becomes a different challenge, and that's what Gabriel is going to walk you through. Thank you very much.
Hello everyone. My name is Gabriel, and I lead the app builder agent for Base44. I had a lot of time to watch Yoav from behind the scenes, so I got a little bit nervous. Yoav has just told you about the first two phases of our growth. In the last couple of months we reached a new point of growth: we started hiring more externally, and we had more internal movers coming from Wix to Base44. Then we even merged in a different product working on vibe coding, and in one single night we doubled our headcount from 40 people to almost 80. That brought a new set of challenges we had to solve.
We had many new challenges; I'd like to focus on the three most interesting ones. The first is how to do experimentation at scale. Yoav just shared how we did the frustration metric and how we A/B tested in production, but you can't expect every new hire to understand exactly which KPIs to test, how long to test things, or whether you can just be brave and ship it. Not everything needs an experiment, right? So we knew we wanted to shift left the product management decisions in A/B testing. We also needed better evals. Back to what Yoav just said: before, we were at a point where we knew evals were not the best ROI for us, but now they became something we really needed to focus on.
And the last thing is how to do QA properly in a very consumer-oriented company without growing the number of testers linearly with the rest of the headcount.
Let's start with experimentation. We started with a general shell of what we wanted. We knew we wanted a process that runs when a pull request is ready. We knew that eventually we wanted a bot commenting on GitHub, telling a developer whether she could just ship it or not. If she needs an A/B test, how long should it run? Which KPIs does the experiment need to monitor? And we also wanted it to open the experiment on PostHog. The shell was the easy part, but we also needed the guidelines, the actual logic of how we operate. We had never sat down and articulated that. We didn't have a guideline committee; we just had really good product sense and intuitions. So one option was to get a multi-stakeholder committee together and sit through a lot of meetings, but we really hate meetings.
We figured that our past actions could convey our guidelines in the best way possible. So we thought, wait, we can just take the last 100 experiments we had on PostHog, plus the matching pull requests, and distill our guidelines from that. So we spawned Claude Code and hooked it up to the PostHog MCP. PostHog is an A/B testing and experimentation platform; pretty great product, by the way. We had Claude suggest the first iteration of the guidelines, and it did a great job. It wasn't perfect, very rough around the edges, but we had a working document.
We could just iterate, and a couple of hours later we had something working: each pull request opened gets a clear verdict on whether you can just ship it, do a gradual rollout, or do an A/B test, and for how long. At our scale, some features deserve seven days of testing; some need a full month, because you might affect conversion rate and premium rates by very small percentages.
To wrap all of that up, we needed a central place where everyone could see what's going on. So it was a great opportunity for us to dogfood our own product: build it on Base44 and connect it to BigQuery (our data warehouse), to PostHog, to GitHub, to everything, and have a central place where everyone could see which experiments are running, how they're moving the needle, whether something is causing more AI cost, whether something is reducing the rate of published apps, all the things we cared about. For now this solved the problem for us and opened up a new paradigm in how we scale our experiments.
Okay, so the next part is evals. This could easily be a full one-hour talk, and maybe next year, depending on the team here, we'll even do that. But our challenge was very short term: we needed something to give us real value. We didn't want it to be a three-month project, and we couldn't afford to take our top AI engineers off other work; we needed them to work on features and improve the product. So we asked ourselves what we really need to build. Do we want to just evaluate the output of the model, or do we want to check the correctness of the apps our users are building? Eventually we had to build a user simulator. For Base44, when a user types in a request for an app and some small part of it doesn't work, that doesn't mean the eval fails. That was a great epiphany moment for us.
It means that our eval suite needs to pipe back the rejection and ask our agent to fix the missing parts. We ended up looking at latency, how many turns things took, how much it cost us, and how many credits we took from our users. And we got to a working CI/CD pipeline where any change in our AI code spins up a real Base44 app instance, and we use Stagehand to simulate real user actions, as if there's an automated QA engineer spun up in a small box. That's how we look at it.
And this is what the internal app we built to support that looks like. Again, a great opportunity for us to dogfood our own platform. You can see here the most canonical eval we have, the hello world app. It doesn't mean that Base44 is performing the way we want. It's a smoke test; it's just making sure we didn't break anything. The way we do it: we ask Base44 to build us a simple hello world app, then assert that the right text is visible and that it looks good. That part is very subjective, but we trust the AI to fail it if not. Then we ask for a very small change, a text change, and then we ask for a small feature. As you can see, most of them just pass. Fun fact: these evals will pass on the smallest model you can think of, which is really cool.
Of course, we have many more complex evals. For example, we have scenarios where we start with an existing app and make many changes. We have scenarios that check our compaction mechanism, which is very complex and requires a lot of user messages. So this brought us to a new paradigm in evals. It's not perfect yet, and we're constantly working on it, but it was the right moment to build such a system.
The third thing I want to share is how we streamlined QA.
We do believe in shifting quality left, of course, like all of us: unit tests, end-to-end tests. It's obvious that everyone working at Base44 needs to have complete ownership of what they build. But most of the time you're working on really deep features that have a lot of edge cases. For example, imagine testing a feature that only affects users at a specific subscription tier when they reach a specific point in their credit limits. And imagine your feature has a lot of permutations that affect that. It would be very tedious for anyone to test manually. That's a classic case where we would hand off to a QA engineer, but then we'd have longer feedback loops, and you have to wait for someone to be available. That wasn't ideal.
We knew that Claude Code could operate a browser, right? Playwright MCP, Browser Use, there are tons of tools out there. But it was missing critical pieces of how to do it well. For example, each time it had to relearn the platform, the selectors, the flows. Each time it had to work out which events to look for in our database and in Mixpanel. So we started wrapping our common flows in skills. For example, we have one skill that taught Claude Code, combined with the browser, how to go through all the major user flows that most features will touch. And of course, for new features, Claude Code can just understand what the feature does and find its way. So we don't need to cover 100% in the skill. You just need to cover the 80% so it has enough context. It's a fine trade-off between the right abstraction level and what you just trust Claude Code to do.
The second challenge we had was how to do proper test setup. Let's take the example from before, where you want to test a very specific edge case.
Now, you could just click through and do it very manually, like a human manual QA person would, but that would be very, very slow, right? What a good QA engineer will do is go to the database and override the setup so they can test just that case. We needed Claude to be able to do the same thing. So we created CLI tools that abstract our APIs and database specifically for the use case of setting up tests, and we built skills that taught Claude how to use them properly.
Eventually we combined all of these efforts and skills into one meta-skill for how to do proper QA, and we got to a flow where a pull request opens, the agent triggers, it creates a test plan (also a great opportunity to dogfood our own product) and sends it to a Base44 app, then starts testing and reports back. This is what it looks like for a single test: you get screenshots, and you know what it tested and what it didn't. Sometimes I get cases where I know I'm stretching the boundary of what it can do, and then it just says, "Right, I couldn't test that," and surfaces the missing capabilities. But that works 80% of the time, and it allowed us to shift left deep, edge-case quality assurance and move faster.
Okay, so that's all for the challenges. I'd love to share a little bit about the common thread running through all of our challenges and solutions. Just as Yoav said before, we really value simplicity. We try to think bold and simple, and sometimes we'll work very hard not to build complex things when it's not the right time. Evals are a great example. We held off until it was the right moment to build them, and then we went all in.
The second thing is that taste is a big word, right?
Recently everyone's talking about taste, and how it's the last moat for us humans against the machines. I believe in that too. But I do think you can encode a big chunk of your team's or company's taste by looking at your past actions. That pipes back to the memory talk from the last session: you can just look at what you actually did in the last week or so and understand what your guidelines are, for code reviews, for A/B testing, you name it.
The third thing: if you're lucky enough to work on a product that you yourself can use, that's also a huge win. I think the team at Anthropic constantly speaks about how magical it is to be working on Claude Code, Cowork, and the whole product suite, and how you get the feedback and insight loop going in a magical way. So if you can do it, do it. Sometimes you have to stretch a little, if you're working on, I don't know, a finance app, but find ways to do it. It will be of value.
And the last thing is that the bottleneck will keep moving. For example, our current challenge is, first of all, how to continue and scale all of the processes I've just shown, but also how to do post-release validation correctly. Once a pull request reaches production, how do you make sure it's moving the right needle? For example, is a bug fix really reducing support tickets? You don't want a human to have to keep that in their head. Is a feature really being used by users? Of course you want it to raise business metrics, but not everything will show up that fast. So we want to automate that.
And that's it for today. We really appreciate you coming, and I really hope you found at least one thing you can take back to your company or organization. Thank you.