
A growing set of tools allows companies to run multiple AI models locally with private data, reducing hallucinations through structured knowledge bases and cross-model validation.
Businesses are increasingly adopting local AI deployments to build internal “second brain” systems that store and process proprietary data. These setups prioritize data sovereignty by keeping sensitive information on local machines or private infrastructure. The approach contrasts with cloud-based AI tools that may reuse data for training, raising compliance and confidentiality concerns.
The Open WebUI platform enables organizations to replicate ChatGPT-like environments while maintaining control over data. Users can configure system prompts, enable web search, generate images, and manage multiple AI workspaces. The interface supports connections to over 1,000 models, including private and locally hosted options.
Through Nvidia’s AI ecosystem, users can access more than 157 models via API without contributing data for training. Available models include families such as Anthropic Claude (Haiku, Sonnet, Opus), Qwen 2.5, Mistral, and DeepSeek V3 variants. While free tiers may introduce latency, they provide a cost-effective entry point for enterprise experimentation.
The system supports integration with multiple providers, including Nvidia, Mistral, OpenAI, and local runtimes like Ollama. By configuring API endpoints and keys, companies can run several models simultaneously. This flexibility allows teams to compare outputs, switch between providers, or combine paid and free solutions within a single workflow.
Tools such as Ollama enable fully local execution of models like Qwen 120B, Gemma 2, and Nemotron, ensuring that no data leaves the machine. This setup is particularly valuable for regulated industries handling confidential or legal data. Hardware requirements typically include at least 64–128 GB RAM and GPUs with 8 GB VRAM or more.
A key component is the creation of structured knowledge bases using vector databases. These systems rely on Retrieval-Augmented Generation (RAG) to ground AI responses in verified internal data. Documents are segmented into "chunks," indexed, and retrieved via similarity search and ranking algorithms like BM25.
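For illustration, here is a minimal sketch of the lexical side of that retrieval step, assuming the open-source rank_bm25 library; the file name and chunk sizes are illustrative, not a prescribed configuration.

```python
# Minimal sketch: split a document into overlapping chunks and rank them
# against a query with BM25, as in the retrieval step described above.
from rank_bm25 import BM25Okapi

text = open("policy.txt", encoding="utf-8").read()   # illustrative file name
size, overlap = 500, 100                             # naive character-based chunking
chunks = [text[i:i + size] for i in range(0, len(text), size - overlap)]

bm25 = BM25Okapi([c.lower().split() for c in chunks])
top = bm25.get_top_n("data retention policy".split(), chunks, n=3)
print(top[0])  # highest-ranked chunk, used to ground the model's answer
```

Production systems typically add token-aware splitting and a semantic (embedding) stage on top of this lexical ranking.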
Raw document ingestion often leads to poor AI performance due to lost formatting and missing context. Applying OCR extraction and structuring data into formats such as JSON with metadata and tables significantly improves comprehension. Clean, well-segmented datasets directly increase answer accuracy and reliability.
AI systems remain probabilistic and prone to hallucinations. One mitigation strategy involves running two or more models in parallel, comparing their outputs, and merging results. Divergences between models can reveal inaccuracies, prompting further verification queries against the knowledge base.
Differences between reasoning models and faster inference models affect output quality. Fast models like DeepSeek V3 Flash may produce less precise answers, while reasoning models better decompose complex queries. Combining both can balance speed and accuracy in enterprise workflows.
Advanced workflows include prompting models to compare responses, identify contradictions, and validate claims step by step. This layered verification reduces the risk of errors propagating into business decisions, especially in high-stakes domains such as legal or financial analysis.
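As an illustration of this layered verification, here is a minimal sketch assuming an OpenAI-compatible endpoint; the model ID, environment variable, and the retrieve() helper (standing in for the knowledge-base lookup) are all illustrative assumptions.

```python
# Minimal sketch: re-check divergent claims against the knowledge base only.
import os
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",  # assumed endpoint
                api_key=os.environ["NVIDIA_API_KEY"])            # illustrative env var

def retrieve(question: str) -> str:
    """Hypothetical stand-in for the vector-database lookup."""
    return "...top-ranked chunks from the internal knowledge base..."

# Divergent claims flagged when comparing two models' answers (illustrative).
claims = [
    "The mechanism uses a local binary mask.",
    "Global tokens are part of the attention pattern.",
]

for claim in claims:
    context = retrieve(claim)
    verdict = client.chat.completions.create(
        model="qwen/qwen2.5-72b-instruct",  # illustrative model ID
        messages=[{"role": "user", "content":
            "Using ONLY the context below, state whether the claim is "
            "supported, contradicted, or not found.\n"
            f"Context:\n{context}\n\nClaim: {claim}"}],
    ).choices[0].message.content
    print(claim, "->", verdict)
```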
Local AI systems combining private data, structured knowledge bases, and multi-model validation offer a practical path to improving accuracy while maintaining strict data governance.
In this video, we're going to cover two very important topics. The first is data governance: I'll teach you how to deploy AI locally to store your data, creating what's called a "second brain" AI system on your own computer. Even better, I'll show you that you can use Nvidia's interfaces for free. Nvidia has made hosted AI endpoints available that don't train on your data, and you'll have access to more than 157 models, including, as we speak, models from Anthropic (Haiku, Opus 4.7, Sonnet 4.6) and many more, such as Qwen 2.5, the latest model from Mistral, and DeepSeek V3 Flash and Pro, all for free. So I'm going to give you all these tips in this video. The main objective is to alert you to one thing, and you'll see it for yourself: AIs are not machines of absolute truth. They are tools that can fool you. At the end of this video, I'll show you how to address this problem of AI hallucinations by building a knowledge base, a kind of second brain. I'll give you advice, tips, and data governance practices, all in this video. We're going to implement AI professionally, for businesses and entrepreneurs. What's absolutely exceptional about this type of interface is that you'll be able to recreate exactly what you have with ChatGPT: a workspace where you use an AI model that you select from a catalog and customize. You'll set up a system prompt, define the model's behavior, and use saved prompts. You can enable internet search and image generation. All of this is done from the Open WebUI interface. For all companies concerned about data privacy, and for those who want to use models that aren't trained on their data, Open WebUI is a major tool. First, because you can use over 1,000 models, a large portion of which can be completely private. I'm going to show you how to integrate different endpoints, the API connection points, using services that are either free and don't train on your data, or paid; in the latter case, you get enterprise SSL connections. You'll be able to choose the model you want for your business from among those 1,000 models, and in particular you'll be able to run models locally or on private processors, for example using the Ollama interface we saw in a previous video. This will let you use the very latest models: DeepSeek V3, DeepSeek V3 Pro, Mistral. I'll show you how to integrate the different API connections into this system, all to ensure data privacy, work privacy, and the ability to use several AIs simultaneously. To install the Open WebUI interface, go to the address above. Those who have taken the courses will find the links and resources in the course descriptions. Open the command line in your Windows terminal and run the command `open-webui serve`. The installation will take approximately fifteen minutes, so pause the video and resume the tutorial right afterward. You can also install it using Docker: if you're familiar with Docker, follow the deployment process described at this address. Next, we'll configure each account so you have free AIs that don't train on your data, ready to use in your interface. AI is now an essential driver of productivity and efficiency. Instead of wasting time on trial and error, I'm offering you the chance to go from casual user to professional in less than 15 days. You'll be able to build useful, clean, auditable, enterprise-ready AI systems.
Stop playing around with artificial intelligence. Become the person companies call to deploy it, structure it, and make it profitable. Today, it takes a company more than 7 months to find a candidate who knows how to use artificial intelligence. Be among those who will remain in the workforce, not those who will be replaced by someone who knows how to use AI. The training includes more than 80 hours of regularly updated courses, because artificial intelligence is a dynamic field; we are certainly number one when it comes to keeping skills current. You learn at your own pace, and you earn Google and Anthropic certifications at your own pace. We're preparing you to be the best in the market. All the information is in the description. To get started, go to this address (I'll put the exact address in the course materials) to install and create an Nvidia account. Here, you'll be able to access 157 Nvidia models. Click on your account, go to the "API Keys" section, and generate an API key. We'll activate this key and use it for the integration. So, how does Nvidia's system work? When you click on the interface, you see the code here. This is what we call the endpoint. This is the endpoint structure we'll retrieve to make our request, and you'll replace the API key located here. So, we'll take this endpoint and go to the Open WebUI interface, in the "Connections" section. There's a section called "API". Click on "Add a connection", and this is where you'll enter the endpoint. Enter your endpoint and your API key, then test the connection. You'll receive a message indicating that the connection is established. From this point on, the models will be loaded. If you want all the models, leave the model IDs blank. But if you only want certain models, you'll enter the exact name used; in this specific case, it will be Qwen 2, and it must be entered exactly the same way. Ideally, you'd repeat this step for each model you want to select. So here, you can add a specific model simply by entering its name and saving. When you send the API connection from your interface, all the models will be loaded into the "Models" section. That's where you'll suddenly discover 157 models appearing. Don't worry: just click the "Disable or Enable All" button, and you'll be able to select the ones you want manually. Among the models, you saw that I have Anthropic Haiku; there are 23 pages of models. As I mentioned, the Anthropic models currently run on Nvidia processors. Some models may also be in high demand, so there might be some latency on certain free models, but you can still enable them by toggling the switch. I'll show you how to integrate providers like Mistral in the same way. So, click the "Try Studio" button. In this section, go to "API Keys" and generate a key by clicking "Create a new key". Once you've generated your key, go back to the main "Connections" interface. Click "More", and you'll see a pre-generated URL: the Mistral API URL (api.mistral.ai/v1). Check the box, enter the Bearer key, specify whether you want only certain models (as we did previously), and save. You'll then have access to the models. If you want the Mistral-specific models, go to the interface and type "Mistral"; this will display all the Mistral models. I've only activated Mistral Large in this section. Some of the others are embedding models, so they're not useful here. To get the latest model, choose either the "Latest" model or the default operating model.
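The same endpoint you paste into Open WebUI can also be called directly. Here is a minimal sketch, assuming Nvidia's OpenAI-compatible base URL; the model ID and the NVIDIA_API_KEY environment variable are illustrative assumptions, not values from the video.

```python
# Minimal sketch: call an OpenAI-compatible endpoint (Nvidia's API catalog here)
# with a key generated in the "API Keys" section.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",   # endpoint entered in "Connections"
    api_key=os.environ["NVIDIA_API_KEY"],             # illustrative env var name
)

response = client.chat.completions.create(
    model="qwen/qwen2.5-72b-instruct",                # illustrative model ID
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```

The same pattern applies to the Mistral connection above: swap the base URL for api.mistral.ai/v1 and use a Mistral key.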
This system will also let you integrate paid services like OpenAI's. So, go to the OpenAI interface. In the "API keys" section, click "Create a secret key". Activate it, retrieve it, return to the interface, and that's how you'll be able to access all the ChatGPT models from your interface, whether paid or free. Models like GPT Qwen 120B, for example, are completely free, and I'll show you how. Go to the Ollama interface. We've already made videos on how to deploy Ollama. As usual, go to the "Download" section, download the Mac, Linux, or Windows version, install it, and launch the Ollama interface. Once Ollama is running, you'll have models accessible in the sidebar; we've already covered this in other videos I've made. From now on, everything you type, create, and write will be private: none of this information will be used to train the models. That gives you access to a large number of models, some of them free. This includes the most advanced models available through Ollama, such as the latest Qwen 2.5 and Qwen 2.5 Max, which are only available on plans starting at $20 per month. However, there are also free models available, including Gemma 2 and Nemotron, as well as older versions of Gemini Flash, which are free within the interface. Of course, if you have a dedicated graphics card, you can also install these models locally. To activate them completely privately and anonymously, you enable the "Ollama API": simply toggle the switch here. This means I now have access to the 120B Qwen models through Ollama in a private system; that is to say, my data won't train these models, and they remain private. These are reasoning models developed by Alibaba. They're part of a family of models, and this will also give you access to other models, including more Qwen models if you wish. So you activate it using the side toggle. From now on, you have everything configured to use these models with your data. This system gives you what's called a workspace. Now, this is extremely advanced, because workspaces work on the same principle as projects in OpenAI. That is to say, you select a model; you choose the one you want to use. For example, I can use DeepSeek V3 Flash. You can give it a system prompt; let's keep it very simple: "respond only in French". And so I'll tag it "V3 FR", so I know it's a model I've set up to respond in French. I'll enable web search on top of it, and I'll be able to switch directly to either a workspace or a conversation. The minus button removes a model. The dropdown menu gives me access to DeepSeek V3 FR. With internet search enabled, I ask: "What are the latest news stories about Claude from Anthropic this week?". The model will make an internet query and therefore manage the tools. We could do better: we could create a system prompt with tool use, and we could even connect Perplexity for a more advanced setup. But you can see that the system was relatively quick. So, in the system we're using, the internet requests are loaded and, just like in the search systems you're used to, the information is sent into the context, and you work with it. The fundamental advantage of these tools is that they maintain data privacy. Here, we'll focus on one thing in particular: document management. In the document section, you'll be able to create knowledge bases.
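Since the "Ollama API" toggle simply points the interface at the local Ollama server, you can also query that server directly. A minimal sketch, assuming Ollama's default port (11434); the model tag is a placeholder for whatever you have pulled locally.

```python
# Minimal sketch: chat with a locally running Ollama model; nothing leaves the machine.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",      # Ollama's default local endpoint
    json={
        "model": "gemma2",                  # placeholder tag; use a model you've pulled
        "messages": [
            {"role": "system", "content": "Respond only in French."},  # same idea as "V3 FR"
            {"role": "user", "content": "Présente-toi en une phrase."},
        ],
        "stream": False,                    # return one complete JSON response
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```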
For those who have already watched my previous videos, where I talked about second brains with Obsidian and the need to create data structures optimized for AI, you have a system that lets you use the default content extraction engine (I'll show you how it works), with the option of default image extraction. So, I'll show you what happens when we create one. What we're going to do now is use two AIs and see how an AI can make a mistake when responding. You trust an AI, but if that AI leads you straight into an error, you won't be able to detect it. The flip side of the problem is that the lack of a working method puts you in a situation where errors can occur, with consequences for your work and your company. What I'm providing is a framework, a path, tools, AI agents, and working methods, and that changes everything in terms of results. To learn how to do this in more detail, go to the training courses. And now, let's move on to the tutorial section. You have the option of either uploading files (you send a document, a PDF, a knowledge-base file) or attaching a vector database. The information therefore remains local, and you'll use AIs that don't train on your data. So we maintain data privacy and data sovereignty, and we can work with vector databases. We essentially have our second brain, directly integrated into the system. I can therefore ask the model to work solely on my data, based on a local, vectorized knowledge base. The system will then retrieve sources from my knowledge base, extract the data using a free model, and show me the citations. Here we have what we call "chunks", meaning the relevant pieces of data retrieved and ranked by a reranking system. So we have a true RAG system established locally. We've created a second brain where data confidentiality is preserved. To access this function, go to the "Space" section within the "Knowledge" area. This is where you generate new knowledge bases. You then access the knowledge base. Now I'm going to show you the impact of data quality on the content we obtain. We're going to do a "raw data" test, meaning we'll take a PDF and work like most people do: take a PDF, add content, and upload it to the system. I'll use the 58-page study from DeepSeek; by default, this is what all the videos on the internet show you, and you end up with a system like this. Here's what happens. Just as I've shown you with other systems like NotebookLM, sending raw data loses all the image content and the formatting, and the model will have a very hard time understanding your data. That's what happens when you use raw data. All the data must be structured and optimized beforehand, and the first step, at a minimum, is to use an OCR extraction function. I recommend Docling-type OCR functions, where you can add JSON structures; I've already mentioned this in the Mistral interfaces. You can also use the data extraction feature directly from this interface, which lets you retrieve structured, optimized data within the interface using your own parameters, and you can, of course, include some of the images. This needs to be configured in the generation interface, so you set your endpoint to Mistral OCR. However, Mistral OCR doesn't natively support image export by default; that feature is only available in the official interface. We've already covered this in other videos.
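As a concrete example of that OCR/structuring step, here is a minimal sketch using the open-source Docling library mentioned above; the file name is hypothetical.

```python
# Minimal sketch: convert a PDF into structured Markdown with Docling so that
# headings and tables survive ingestion into the knowledge base.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("deepseek_study.pdf")        # hypothetical file name

structured = result.document.export_to_markdown()      # layout and tables preserved
with open("deepseek_study.md", "w", encoding="utf-8") as f:
    f.write(structured)
```

Exporting to Markdown (or JSON with metadata) before indexing is what keeps tables and section structure intact for the model.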
I've shown you how to optimize your data for business. It's essential to create structured data and extract it so you get all the content. Here's what this changes in terms of results: if I repeat the same request, save the change, and use an optimized OCR system, I'll upload a new file to the interface, and this time it will go through optimized OCR. You'll see that we already significantly improve the understanding of the data. This time you have the entire structure correctly set up, integrating the images and tables, and it will be much easier for the model to understand because the data is optimized for comprehension, including a "tables" section. Now, I've shown you in other videos that I use, of course, more advanced systems to do this. If you really want to perform well with your data, it's essential to create a cleaned, optimized dataset. So we have a step where we clean, compact, and add the metadata to the system, and everything is perfectly structured for work. This is the best data for working locally and very quickly with your AI. On the one hand, you'll be able to work with your data securely, with optimized data, and that changes everything in terms of model understanding, because you can see the difference on top of the structured data: you have the tables, you have the metadata structures. What you still need to do is define what we call the segments. There are different segmentation methods, which we've discussed in other videos: token-based structures, which is what I usually use, or semantic segmentation. Then we send that into vectors, and you have your own vector database on your computer. Everything stays local; you work locally with your data. We've created a second brain, and we're going to work with AIs that aren't trained on your data. You maintain privacy, you maintain control, and on top of that, you optimize by implementing a hybrid search system with reranking, where you tune the parameters based on the type of data you're working with. So, BM25, which I've already discussed in other videos, is a tool I use for hybrid or grep-style searches. Technically, if I were working with sensitive data, data that shouldn't leave my company, and I didn't have much processing power or a sufficient graphics card, I would use a model that I know isn't trained on my data and is free; it's hosted either on Ollama's servers or Nvidia's, and I'd be able to work with it. If I need a second model running simultaneously to work on the data and get a different perspective on the subject, then I activate a second parallel model. I work with my data, not with a raw PDF as we saw, but with a vector knowledge base, so I use my vector database system instead. We'll have retrieval by similarity plus a hybrid system with BM25 on top, so we'll have a much more precise database, and I can start working with two simultaneous models. Let's say I need complementary approaches. You'll see that by sending two AIs to the same knowledge base, I'll also be able to merge the responses, and therefore use both AIs' output to optimize my final answer. I'll show you what happens. I send an initial request: it's DeepSeek V3 that responds first, using its systems. You see, I asked for "The formula used for the Sparse system part of DeepSeek V3."
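Here is a minimal sketch of that hybrid search, assuming the rank_bm25 and sentence-transformers libraries; the chunk texts, weights, and embedding model name are illustrative, not the exact parameters used in the video.

```python
# Minimal sketch: combine lexical (BM25) and semantic (embedding) relevance.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

chunks = [
    "DeepSeek V3 applies a sparse attention mechanism over long contexts.",
    "The training corpus was filtered and deduplicated before pretraining.",
    "The MoE router dispatches each token to a subset of experts.",
]
query = "Which formula is used for the sparse attention part?"

# Lexical scores: BM25 over whitespace-tokenized chunks.
bm25 = BM25Okapi([c.lower().split() for c in chunks])
lex = bm25.get_scores(query.lower().split())
mx = max(lex) or 1.0  # normalize to [0, 1], guarding against all-zero scores

# Semantic scores: cosine similarity between query and chunk embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
sem = util.cos_sim(encoder.encode(query), encoder.encode(chunks))[0]

# Hybrid score: weighted sum of lexical and semantic relevance.
scores = [0.4 * (l / mx) + 0.6 * float(s) for l, s in zip(lex, sem)]
best = max(range(len(chunks)), key=lambda i: scores[i])
print(f"{scores[best]:.2f}", chunks[best])
```

A reranking stage would then reorder the top candidates before they are injected into the model's context.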
I retrieve all the sections I need from my chunk system. Then I have Moonshot with Qwen 2.5 running. One runs on Nvidia chips, so completely free and independent; they don't train on the data. DeepSeek V3 Flash also runs on Nvidia, but you can use the Ollama connection if you want. You have two different approaches, and that's what interests me: on the same problem, I'll get two approaches, and I'll be able to draw the final answer from both AIs. So I optimize further and reduce the risk of hallucinations. Here you can see the reasoning part: the details of the reasoning process that's running. The source is still my database, and as I'm showing you, the database results are ordered by a ranking system: the relevance is 70% for the first source and 65% for the second. That's why it's crucial to optimize your database: it allows the model to identify high-value content and inject it into the context. We can set a cutoff, for example at 75%, so that anything below 75% is not used and only high-quality content is selected. So you have two responses generated by the LLMs, two different approaches. I can either choose one over the other, or use the merge function, where I merge the two responses: I inject the second model's response into the conversation and create a merged answer from the two. And so we should again reduce the risk of hallucinations, because both AIs have worked on the databases. They did their research, and statistically, the more independent research you do, particularly with reasoning models rather than an immediate-response model, the more you reduce the risk. The potential pitfall is having a single model that might be retrieving data from sources and making a mistake. But by adding your data, the RAG system that's in place, and a second model on top, we can be almost certain that everything we write today is based solely on the two contextual elements we have here. So, if there were an error in one of the two AIs, having a second model take over will generally allow you to significantly reduce and correct this problem. This is also possible once you have the responses from two AIs: "Compare each AI's response using bullet points, check for contradictions. We can also proceed step by step." In this type of situation, it's best to work with a reasoning model so it can break down each AI's response. Here you have the two responses: the one from Moonshot and the one from DeepSeek. So, mechanically, the model will compare the responses of AI 1 and AI 2, identifying the key points of the first response and the key points of the second, and placing them side by side. What we're looking for are contradictions and divergences within each response, because that's where drift and hallucinations show up. This type of tool therefore lets you use different models and architectures simultaneously and highlight any potential contradictions. That's exactly what just happened: the first response mentions a local binary mask, diagonal banding, global tokens, and context; the other mentions no local mask, and compression. So, if we have doubts about an element, the best thing to do is run a further search. We take our system again and ask it to verify: "Formulate in two separate questions, use only the knowledge base and..." and give it the query part. So, as you've seen, an AI is not a truth machine. An AI is a probability machine.
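Here is a minimal sketch of that two-model cross-check, assuming Nvidia's OpenAI-compatible endpoint; both model IDs and the merge prompt are illustrative assumptions, not the exact ones used in the video.

```python
# Minimal sketch: query two models in parallel, then merge and cross-check.
import os
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
                api_key=os.environ["NVIDIA_API_KEY"])   # illustrative env var

QUESTION = "Which formula does DeepSeek V3 use for its sparse attention?"

def ask(model_id: str) -> str:
    """Send the same question to one model and return its answer."""
    r = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": QUESTION}],
    )
    return r.choices[0].message.content

# Run the two models in parallel.
with ThreadPoolExecutor() as pool:
    answer_a, answer_b = pool.map(
        ask, ["deepseek-ai/deepseek-v3", "qwen/qwen2.5-72b-instruct"])  # illustrative IDs

# Merge step: a third call compares the answers and flags contradictions.
merged = client.chat.completions.create(
    model="qwen/qwen2.5-72b-instruct",
    messages=[{"role": "user", "content":
        "Compare these two answers point by point using bullet points, list "
        "any contradictions, then produce a merged answer keeping only the "
        f"points both agree on.\n\nAnswer 1:\n{answer_a}\n\nAnswer 2:\n{answer_b}"}],
)
print(merged.choices[0].message.content)
```

Any contradictions the merge step surfaces can then be turned into the separate verification questions described next, each answered only from the knowledge base.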
So when an AI makes a potential error in the tokens it generates, what absolutely must be reduced and steered is the probability that it runs away with tokens in a direction that isn't factual. The goal of building a knowledge base is, in fact, to steer the model back in the right direction in terms of the probability of its answers. And when your data is important, that is, when your data has consequences for the company or for legal matters, we can do what we just did: put two AIs side by side and compare whether the two formulations point in the same direction. Here, we detected that one of them ultimately lacked precision. We asked it to retrieve the three points and verify them with three questions, and that's what it did: it sent three queries, on points 1, 2, and 3, to the data. Now we'll get the missing precision, because the goal for a company, let's be very clear, is to limit the risk of hallucination and error as much as possible. That's why we build databases. But you can see that with the same databases, we just got two answers that weren't absolutely identical. Why? AIs are probabilistic systems. Consequently, there's no such thing as absolute truth in AI; it's a probability of token injection. It depends on how the models were trained and on the data you've sent them. That's why I told you preparing your data is crucial. And now we have the answer to our points, which tells us that answer number 1 contained an error. Don't forget that answer number 1 was produced by an immediate-response model, one of the "fast" systems, DeepSeek V3 Fast. It's not a reasoning model, which means it's less able to break a problem down block by block, and that leads exactly to what we just saw: imprecision and a lack of depth in understanding the problem. This is typical of MoE systems to begin with. So, we have two points that will allow us to correct the problem. In this tutorial, I've given you many strategies for using AI while maintaining data sovereignty and data governance, accessing developer interfaces, and having accounts that don't train on your data. If you want, you can then create knowledge bases, work on your data locally, and store all your vectors on your computer. The only requirement, of course, is having a graphics card. Regarding memory, I'd say it's good to have at least 128 GB, though 64 GB is fine, in DDR5. For the graphics card, try to have at least 8 GB of VRAM; if you have more, or newer GDDR7 memory, it's better and will run more smoothly. But keep in mind that, in the end, the bulk of the work isn't done by your computer, since we're using Nvidia's or Ollama's processors to run the LLM. The memory we use is for storing the knowledge base, the vector database that lives on your computer. So the data remains local and isn't used to train the AIs.