
I built a REAL RAG Claude 4.7 + Obsidian | Claude Second Brain!

AI • Parlons IA • May 8, 2026 • 29:02

TL;DR

A structured workflow combining Obsidian, vector databases, and RAG pipelines can create a secure, cost-efficient “second brain” while avoiding performance loss from excessive context.

KEY POINTS

Obsidian is not a RAG system

Obsidian functions as a Markdown-based knowledge database, similar in concept to Notion, enabling document storage and link visualization. However, it lacks the retrieval and vector search capabilities of a true Retrieval-Augmented Generation (RAG) system. Treating it as RAG leads to overload, reduced performance, and inefficient use of AI context windows.

Context overload degrades AI performance

Excessive input, described as “context rot,” significantly reduces model efficiency and increases cost. Sending entire documents directly into systems like Claude or ChatGPT exhausts token limits and slows responses. Optimized workflows focus on retrieving only relevant data instead of full datasets.

RAG relies on vector databases

A proper RAG pipeline converts documents into vector embeddings, representing semantic relationships between concepts. When queried, a retrieval system selects only the most relevant chunks. This process ensures faster, more accurate responses compared to brute-force document loading.
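To make this concrete, here is a minimal sketch of embedding-based retrieval: chunks become vectors, and a query returns only the closest ones by cosine similarity. It assumes the official `openai` Python SDK with an `OPENAI_API_KEY` in the environment; the embedding model name and the chunk texts are illustrative placeholders, not the video's exact setup.

```python
# Minimal sketch of embedding-based retrieval (illustrative, not the video's pipeline).
import numpy as np
from openai import OpenAI

client = OpenAI()

chunks = [
    "DeepSeek-V3 is a mixture-of-experts language model.",
    "The paper reports benchmark results on MMLU and GSM8K.",
    "Training used multi-token prediction as an auxiliary objective.",
]

def embed(texts):
    """Turn texts into embedding vectors (semantic coordinates)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vecs = embed(chunks)

def retrieve(query, k=2):
    """Return the k chunks closest to the query by cosine similarity."""
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]

print(retrieve("Which benchmarks does the paper use?"))
```

The key point is that only the top-k chunks, not the whole corpus, ever reach the model's context.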

Data preparation is the critical step

Building an effective system requires multiple stages: data extraction, cleaning, chunking, embedding, and storage. Raw inputs such as PDFs or web pages must be processed to remove noise and irrelevant metadata before use. Poor preprocessing directly harms output quality.

Distractors sharply reduce accuracy

Irrelevant elements in documents, known as distractors, can reduce comprehension accuracy by 8–30% with one distractor and up to 70% with several. Cleaning data before ingestion is essential for maintaining reliable results.

OCR and structured extraction improve quality

Tools like Mistral Document AI enable extraction of text, tables, and images into structured formats. Converting visuals into machine-readable formats such as JSON ensures that no critical information is lost during preprocessing.

Metadata enables efficient navigation

Adding metadata—such as document title, sections, keywords, and version—helps AI systems quickly locate relevant information. Metadata structures vary by domain but are essential for both RAG retrieval and local database navigation.
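As a rough illustration, a per-chunk metadata block might look like the following; the field names are assumptions based on the fields the video mentions (document, page, section, keywords, version, image flags), not a fixed schema.

```python
# Illustrative chunk metadata block (field names are assumptions, not a standard).
chunk_metadata = {
    "document": "DeepSeek-V3 Technical Report",
    "version": "v3.0",
    "page": 12,
    "section": "4.2 Multi-Token Prediction",
    "keywords": ["MTP", "training objective", "speculative decoding"],
    "has_images": True,   # flags that extracted figure data is present in this chunk
    "chunk_id": "dsv3-012",
}
```

In an Obsidian vault, the same fields would typically live as front matter at the top of each chunk file so both humans and retrieval tools can scan them.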

Automated chunking and validation pipelines

Advanced prompts can automate chunk creation, metadata tagging, and validation in a single workflow. These pipelines include self-check mechanisms, logging, and error correction, enabling auditable and repeatable processing.

Low-cost vector storage is accessible

Cloud-based vector databases can cost as little as $0.10 per GB per day, with the first gigabyte free. This removes the need for high-end hardware while enabling scalable storage and retrieval of embeddings.

Indexing reduces reliance on AI context

Creating compact index files summarizing document structure allows rapid navigation without loading full content. These indexes guide retrieval, reducing token usage and improving response speed.

Hybrid retrieval systems enhance efficiency

Combining vector search with traditional methods like BM25, TF-IDF, and keyword search provides faster local retrieval. Conditional logic can escalate queries to more advanced systems only when necessary.

Human-in-the-loop systems improve reliability

Incorporating HITL (Human-in-the-Loop) mechanisms allows intervention when errors occur. Logging and audit trails ensure transparency, making workflows more controllable and adaptable.

Separation of retrieval and reasoning reduces cost

Queries can first retrieve relevant chunks using low-cost systems, then pass only necessary data to premium models like Claude. This separation significantly reduces computational expense while maintaining answer quality.
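A hedged sketch of that split: any low-cost search (vector store, BM25, grep) feeds only the top chunks to the premium model. It assumes the `anthropic` SDK with an `ANTHROPIC_API_KEY`; the model id is a placeholder and `retrieve` stands in for whatever cheap retrieval you use.

```python
# Sketch of "retrieve cheaply, reason expensively" (model id is a placeholder).
from anthropic import Anthropic

claude = Anthropic()

def answer(question, retrieve):
    top_chunks = retrieve(question, k=3)   # cheap step: no premium tokens spent
    context = "\n\n".join(top_chunks)      # only the relevant chunks are sent on
    msg = claude.messages.create(
        model="claude-sonnet-4-5",         # placeholder model id
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQ: {question}",
        }],
    )
    return msg.content[0].text
```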

CONCLUSION

Efficient AI knowledge systems depend less on tools and more on structured data preparation, with RAG pipelines, metadata, and indexing enabling scalable, accurate, and cost-effective “second brain” architectures.

Full transcript

Each of you works with your own documents and company data. So, how do you create a secure second brain for your data? In this video, I'll explain how to create a database with Obsidian in the Claude interface. What I'll teach you for Claude also applies to ChatGPT. The goal is to optimize how our system operates. What we absolutely must avoid is entering the "context rot" zone, the zone where we overload the model with too much context. This will exhaust your usage quota and, more importantly, cause a drop in performance. In this video, I'll explain the strategies I've implemented using Mistral, Ollama, and the VS Code interface to launch RAG directly from Obsidian and Claude.

What's the main difference between Obsidian and a RAG system? Obsidian is a Markdown-formatted database. For those who use Notion, it's exactly the same principle: a system that lets you read text that is normally in Markdown format. The advantage of this system is that it also lets us create visual representations where links are drawn based on keywords within documents. But be careful: it is absolutely not a RAG, as Jonas's video, which I highly recommend, very rightly points out. He's a professional who works with RAG systems for a living. The real problem with Claude Code plus Obsidian is that it is not a RAG system, and unfortunately, many videos on the internet have presented the Obsidian setup as one. This is absolutely not the case; on the contrary, it will completely overload your system. I talked about this in the previous video, which I'll link in the description below, and I encourage you to watch Jonas's video. It will help you understand today's RAG problem and why so many influencers who jumped on this bandwagon have completely missed the point. The main issue is that we completely saturate Claude, and it becomes unmanageable.

A RAG system is based on a vector database. That means at some point you take your documents, send them to a database, and create vectors. Vectors are mathematical coordinates that define the distance between concepts. When you ask a question, your system sends a query to what's called a retrieval system. It queries the database and selects only the documents best suited to answer your question. This is exactly what we call Retrieval-Augmented Generation.

To arrive at this system, you have to follow certain steps before reaching the vector format: I retrieve the data and extract the information from it (this could be OCR, PDFs, spreadsheets, or images), I create a chunked version, then I build my embeddings, which are my vectors, and I store them in a vector database. In practical terms, we have several options. We can create a vector database directly on the computer, but that requires a good graphics card, and not everyone can do that. So I've found other solutions anyone can use. Imagine a vector database that costs less than $0.10 per gigabyte, practically free, where the first gigabyte is completely free. In this video, I'm also going to explain how I cut the chunks and optimized the database. And one thing some influencers on the internet tell you to do is use a plugin to download PDFs or web pages directly to your computer. That's exactly what you shouldn't do. Here's why.
When you browse the internet, you don't see it, but a page is actually a jumble of code, numbers, and metadata behind the scenes. When you download all of that, the LLM (Large Language Model) ingests everything, and the problem is that it will encounter what are called distractors, meaning irrelevant elements. What is the impact of a distractor within a single document? Studies have measured it: with just one distractor, you can lose 8 to 30% of comprehension accuracy; with four distractors, you lose between 30 and 70%. So the quality of your initial data is crucial. The basic principle is to have a raw source from which you extract the information. What you shouldn't do, to avoid burning through your Claude or ChatGPT quota, is perform the extraction with either of them. There are free, high-performance tools that will do this for you. One of them is Mistral: in Mistral's Studio, you can switch to the Document AI section.

For this example, I'll use the documentation for DeepSeek-V3, the brand-new Chinese model. It runs to more than 55 pages, and the problem is that it contains graphs. If I use the PDF as-is, I'll lose all the graph data. The solution is to use Mistral's feature for extracting data from tables and images and integrating it into my file. Here's how. First, we upload the file and select the latest OCR model. What matters is to start removing the elements that could act as distractors. Even though there are 55 pages, from page 44 onward I no longer need what follows: it's essentially citations, and that's exactly the kind of content that can interfere with an LLM's understanding, because it's very keyword-dense. So I mark pages 1 to 44, ask it to integrate the tables and images, and for the images I ask for structured JSON output of the table data. This sequence lets me retrieve all the data. I run it, and in a few moments I get 44 pages cleaned up, structured, and formatted exactly the way an LLM understands, with each element correctly laid out. You've now completed the first step: going from raw data to extracted information. To retrieve the data, you click the "download" button and get the entire document, completely free of charge, thanks to Mistral.

Before we can use it in Claude or the Claude CLI, we need to optimize the sequence, so we need to talk about metadata. To navigate within a database, the model needs to orient itself. For this, I created a metadata function to improve understanding of the document. What you need to understand about metadata is that it depends on the type of document you're working with: medical or scientific metadata won't be the same as customer-service metadata. We'll use the metadata to quickly identify what a sequence contains, when it was published, what the topics are, and where the tables and keywords I'll need are located. How do we create this metadata? That's exactly what we'll do in the next step. Why is a RAG system generally quite expensive? Because it's technical. And that's precisely what I teach in the training courses: configuring your systems and learning to build professional, auditable, manageable setups.
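For those who prefer to script this extraction step rather than use the Studio UI, here is a hedged sketch against Mistral's Document AI. It assumes the `mistralai` Python SDK and its OCR endpoint (the exact response fields may differ by SDK version); the URL and the page-range filter that drops the citation pages are illustrative.

```python
# Hedged sketch: Mistral OCR extraction with a distractor-trimming page filter.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

resp = client.ocr.process(
    model="mistral-ocr-latest",                    # latest OCR model, per the video
    document={"type": "document_url",
              "document_url": "https://example.com/deepseek-v3.pdf"},  # placeholder
)

# Keep only pages 1-44; the trailing citation pages act as keyword-dense distractors.
kept = [p.markdown for p in resp.pages[:44]]
with open("deepseek_v3_clean.md", "w", encoding="utf-8") as f:
    f.write("\n\n".join(kept))
```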
Prompt Engineering Elite represents over €16,900 worth of skills and operational assets. In this training you'll learn to configure operational AI agents on your data, work with datasets, build pipelines, and build HITL systems with humans inside the decision-making loop. I'll also get you ready for the official Google exam and the Claude Code 101 exam. In short, if you want to truly level up in artificial intelligence, you'll find all the information in the description.

Before we can use the second brain with Obsidian and Claude Code, we need to retrieve our document and apply the extraction-optimization step just before chunking. Here's what that means. To improve retrieval navigation: if we're doing RAG (Retrieval-Augmented Generation), we'll have a system that searches the metadata; and if we're using Claude or Obsidian, we need to find keywords quickly. So we're going to identify sections and create metadata within the 58 pages we've obtained. For this, I'm going to use ChatGPT. I could do it with Claude, but because ChatGPT is much cheaper, I'm going to do exactly what professionals in the RAG field do: create a metadata layer within my document. Here's the prompt I'm using. Its purpose is to configure the Claude CLI or ChatGPT to identify section blocks, and for each section block it creates metadata. This metadata depends on the document type: accounting documents and legal articles won't get the same kind of metadata. But in general it's very important to capture the document name, the number of pages, the title, the section, the author, and the version, which lets you update the document when new versions appear. There's always a phase where you have to think about which data will let you navigate quickly within your document, and how to divide it into sections using criteria that make navigation as smooth as possible. It could be a division by subject, or a division by paragraphs. To optimize my workflow, I create a system of separators, which lets me, in the same step and the same prompt, use a Python script that automatically detects the separators and creates my divisions, hence the chunking phase. With a single prompt, you create the meta tags, insert the keywords, check that everything works, create the separators, and slice the chunks. All the functions are in one prompt; I give the details in the lessons just below.

The last part of the prompt is what we call a validation checklist. Remember, we're using an agent, not just a chatbot. When you design a prompt for an agentic loop (and today ChatGPT and Claude are agentic systems), you have to tell it what to retrieve, what action to take, and how to verify the data, and that's exactly what the prompt contains. The goal is for it to verify its own work: check that the meta tags are present, that the semantic content matches the section, and that it has integrated the figures and descriptions; if everything is good, it proceeds with the slicing. Here's how to make the AI work in this agentic mode. First tip: your prompt must be fewer than 500 lines in total. Second: the prompt needs the link to the original file and the directory where you want the job to run. So we provide the directory, the file path, and then the write directory.
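Here is a minimal sketch of the separator-driven chunking that such a prompt automates. The separator marker and the validation checks are assumptions; the real checklist described in the video is richer (meta tags present, semantics match the section, figures integrated).

```python
# Minimal sketch: split on agent-inserted separators, then run a validation checklist.
SEPARATOR = "<!-- CHUNK-BREAK -->"   # hypothetical marker the agent inserts

def chunk_document(text):
    chunks = [c.strip() for c in text.split(SEPARATOR) if c.strip()]
    report = []
    for i, chunk in enumerate(chunks):
        checks = {
            "has_metadata": chunk.startswith("---"),       # front matter present?
            "has_keywords": "keywords:" in chunk.lower(),
            "nonempty_body": len(chunk.splitlines()) > 3,
        }
        report.append({"chunk": i, "passed": all(checks.values()), **checks})
    return chunks, report

# A failed check would trigger a correction pass in the agentic loop.
```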
You copy the final directory where your AI agent, Claude CLI or ChatGPT, will execute the job. To optimize costs, I recommend switching models: I tested this with the GPT 5.3 Codex model in advanced mode, and for this type of task it's generally more than sufficient. Since the prompt is already a sufficiently detailed structure, I can launch the run automatically. Because it's an agentic system, it will follow each step. What I advise you to do (you have the prompts in the training materials) is create what are called error reports. The model starts creating timestamps to track its behavior, so I have log sections in place; in other words, I can verify what it's doing. Here, I'm checking that the system understands the different work phases, and I can see which stage of the workflow it's at. If there's a malfunction, we can follow the logs: the model records each action and any problem it encounters. This lets you intervene in your prompt and correct the workflow if the model's operating sequence needs improvement. When you're working, you need to make the work auditable, meaning the model can detect, say, that I've changed a file name, tell me, "Now I'm waiting for the document," record that in the log, ask me for clarification, and let me intervene. This is what we call a HITL loop: you enter the loop to help ChatGPT or Claude do their work. You're no longer just asking for words; you're asking for work, and you intervene only when the instructions you've coded tell the model to call you to resolve an issue. In other cases, we create decision-making loops: the model checks whether it can resolve the condition; if it can't, it calls a person (that's the HITL loop), and if it can, it takes an action and restarts the cycle.

Now, a point that might surprise you: this work step, for 50 pages, takes about 4 minutes on average. That makes one thing clear: we're not dealing with the kind of system you're led to believe in, where you simply "take a PDF and send it." Not at all. RAG services, implementing Retrieval-Augmented Generation for businesses, are primarily about work: the ability to prepare the data and create what we call an optimized dataset, so that AI can read it easily and quickly and thereby gain stability, responsiveness, and above all lower cost. Ultimately you're the one who benefits, because you spend less time querying the model and, more importantly, get much more precise answers. The preparation phase is one we can't delegate to AI completely autonomously: the ability to segment, and to understand how the blocks are structured, is up to the user, who must craft the prompt based on the document type and metadata. You'll find more advanced material on this in the RAG training. But by starting with the metadata structure and the chunking system, you ensure the model already has optimized data to work with. I'm going to show you something right after this: first, we check the chunk dimensions, because we'll need to vectorize them later. You can see the model hit an error in block 46 and launched a correction pass. As I said, it's crucial to build a complete workflow logic into your protocol, where the system can check its own results, append them, and index them.
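As a rough illustration of that decision loop, here is a toy version: try a step, log every attempt with a timestamp, and escalate to a human (HITL) only when the agent can't resolve the condition itself. `process_step` is a stand-in for whatever work the agent performs.

```python
# Toy HITL decision loop: retry, log with timestamps, escalate to a human on failure.
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="agent.log", level=logging.INFO)

def run_step(step_name, process_step, max_retries=2):
    for attempt in range(1, max_retries + 1):
        ts = datetime.now(timezone.utc).isoformat()
        try:
            result = process_step()
            logging.info("%s | %s | attempt %d | OK", ts, step_name, attempt)
            return result
        except Exception as err:
            logging.error("%s | %s | attempt %d | %s", ts, step_name, attempt, err)
    # The agent could not resolve the condition on its own: enter the HITL loop.
    return input(f"[HITL] Step '{step_name}' failed. Your instruction: ")
```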
It knows what it has done; it goes back over it, checks, and if the result is valid, its task is finished. If it's not valid, it runs again. That's exactly what it does, and here we're genuinely delegating the work. Now, a prerequisite for using Obsidian: download it for Windows (it's also available for Android, Apple, and other platforms). I'll let you install it; click pause and come back right after.

Here's what we just obtained: a set of structured data called chunks, and here's what it looks like in the Obsidian database. From now on, we have a document we can already use in Obsidian. We have a metadata structure at the top that identifies the page, the document, the keywords, and potentially the presence of images, since we've extracted the image data. What's quite interesting with this workflow is that we can render the images in Markdown format, which maintains consistency in the text. So when we work on these sequences, we can retrieve the documentation logic present within the chunks: it's a document enriched with data.

What matters now is defining the chunk size. I'll show you how; it's quite simple. There's a manual step where we copy two or three files into the GPT tokenizer interface to check the size of our chunks. A chunk is usually variable, ranging from 200 to 4,000 tokens. There are rules to follow depending on document type and consistency, but what we want here is the maximum size of our chunks so we can send them to a vector database. I know my chunks average between 800 and 1,200 tokens, which includes the metadata sequence; that part won't be used in the vector calculation. It will be used for retrieval, but not for the vector elements. Now that I know the size, I go to the OpenAI Playground. You create an OpenAI account and open the Storage section in the bottom-left corner. There you have what's called a vector store. You click "Plus," create a vector store, give it a name, and once it's named you can add files: click "Add," then "Upload," find your files, and retrieve the chunks. This is where overlaps come in. In the sequence shown here, I don't need any overlap: my sequences are around 1,500 tokens at most, and they're already pre-sliced. You attach the documents, and in a few seconds the system identifies each chunk, numbers it, and gives it an identifier called a file ID.

We've now built a very large part of a vector database. We retrieved the data, put the schema data inside, extracted it, added metadata, optimized it, sliced it, vectorized it, and stored it in the vector store. We can now query our database, which is in embedded format. The storage cost is $0.10 per gigabyte per day. To give you an idea, before you reach 1 GB of files, I think you can build a pretty decent database, and overall it saves you from needing a high-end PC with a powerful graphics card by delegating all that storage to OpenAI. For the local model side, you use this command line. To optimize performance and avoid downloading the model to your computer, use the cloud version if you don't have a sufficient graphics card. If you have a card with at least 12 GB of VRAM, you can run a GPT Qwen 20B; for it to run smoothly, you'd want an RTX 40-series card.
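The chunk-size check and the vector store upload above can both be scripted. Here is a hedged sketch using `tiktoken` as a stand-in for the web tokenizer, plus an upload loop; the vector-store calls reflect one recent `openai` SDK surface and may differ by version, and the paths are illustrative.

```python
# Chunk-size check with tiktoken, then a hedged upload to an OpenAI vector store.
import glob
import tiktoken
from openai import OpenAI

enc = tiktoken.get_encoding("cl100k_base")
for path in glob.glob("chunks/*.md"):
    n_tokens = len(enc.encode(open(path, encoding="utf-8").read()))
    print(path, n_tokens)          # aiming for roughly 800-1200 tokens per chunk

client = OpenAI()
store = client.vector_stores.create(name="deepseek-v3-docs")
for path in glob.glob("chunks/*.md"):
    f = client.files.create(file=open(path, "rb"), purpose="assistants")
    client.vector_stores.files.create(vector_store_id=store.id, file_id=f.id)
```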
Otherwise, keep the same structure and add `GPTQwen-cloud` to the end of your command line. Let's run a test on the RAG system. What I'm about to show you will surprise many of you; I've already shown it in another video, the one where I connect the Claude CLI to the Ollama system, so feel free to watch it. Here, I launch the Claude CLI with a free model that runs either on my computer (which requires a graphics card) or, if you don't have one, in the cloud via Ollama. The one I have access to is the Qwen 120B, but I could use Claude if I wanted.

From here, I query my RAG database. What you need to do is create a RAG command. When you create a RAG command, you have two options: a search (retrieval) or a fetch. So if I ask it to search (understand that "search" here means retrieval in the RAG sense), it expects the variables: what do I want to find? I want the authors of the DeepSeek-V3 document and its publication date, and, interestingly, I also ask for the file ID and the chunk number. (You can speak to it in French, of course; that's just a habit of mine.) The system now queries the OpenAI RAG database, and I no longer need to store anything on my computer. Since it works over MCP, I'll show you the MCP part right after. It asks whether I'm okay with it running a command on the MCP, and I can see the name of the MCP it's communicating with. I can see it retrieving the vectors along with the sequences: the tables, the documents, and even the chunk identifiers. It's actually using the metadata and identifiers we saw earlier to find what we need faster, and it tells me, "This information is present in chunks 03 and 0."

So if I need to work with this information now, instead of spending my Claude tokens, I run the query on a system that is practically free until I hit my gigabyte limit, retrieve only the sequences I need, and then work with Claude in my terminal. For instance: "DeepSeek-V3 research paper: who are the author(s)? Which models are addressed in the paper?" I can give it the references for those items directly, which I find much simpler; for me, it's all about optimizing costs. Don't forget that I could say, "Go search my entire PC," but then I'd completely burn through my quota. What we just did is pre-select the chunks, and it cost us nothing. Now I know where my directory is, but I'll show you we can do even better. We can build search systems directly into databases, but we can also simply provide the path. To save time (otherwise it scans all the directories), I tell it where the data is. It retrieves the IDs corresponding to the chunks, integrates them into the context, and answers my question. And we got the answer in a few moments. Note that it didn't run a RAG here, because I didn't provide a RAG function; it doesn't have access to RAG by default. It only retrieved the sequences I asked it to retrieve from my database, because semantic vectors are the most precise.
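For reference, here is what the two commands described above, a semantic search and a fetch of a known file ID, might look like in code. The `vector_stores.search` endpoint exists in recent `openai` SDKs but its surface may vary by version, so treat this as a hedged sketch; the store ID is elided.

```python
# Hedged sketch of the "search" (retrieval) and "fetch" commands over a vector store.
from openai import OpenAI

client = OpenAI()
STORE_ID = "vs_..."                  # your vector store ID (elided here)

def search(query, k=5):
    results = client.vector_stores.search(
        vector_store_id=STORE_ID, query=query, max_num_results=k
    )
    return [(r.file_id, r.score) for r in results.data]

def fetch(file_id):
    return client.files.content(file_id)   # pull a whole known document back

print(search("DeepSeek-V3 authors and publication date"))
```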
And you can see that it didn't run any MCP functions, so it relied solely on the retrieval I performed. This allowed me to avoid filling my context with 58 pages while getting exactly the answers I expected: publication date, author (the author isn't actually named; it's DeepSeek), the models, and the different configurations present in the documents. So we've already optimized this first part.

What we can do in addition is use a search system that doesn't consume Claude's entire context: an indexing system. Let me explain. Thanks to the metadata, you can create an index file; I'll show you what it looks like. An index file identifies the structure of an entire document: the name of the original document, how many figures there are, how many tables, and where each element is located within the structure. I add formatted indexing specifically to help the LLM navigate within it. Say I haven't found what I'm looking for: I run a query on my index file, with separate indexes for images, the overall structure, figures, and tables. In fewer than 200 lines, I can condense information from more than 58 pages. Here's the architecture: initially you have the index file; by default the system reads it and then navigates to the chunks from there. That's exactly what I coded with ChatGPT, integrating two complementary functions. Let me explain.

Basically, we have a RAG function that lets me perform searches, and I can easily invoke it from Claude. I can ask Claude to run my RAG function and search my document: "In DeepSeek, how many versions of the LLM exist?" I can do that perfectly well. The only issue is that it costs me context in my system. Since we can delegate this entirely to another model, I use that instead. For me, the RAG system is much more powerful than retrieving chunks manually, but both are possible. As you can see, when I make my request through my external system, I receive a request and a response, and I get the information back: it identifies the files and tells me what the elements are. I could have asked it, "Give me the chunk numbers so I can find them in my local database," and here the information is in chunks 1 and 8. For me, this system is really optimized.

Then we can add a system that works only locally. Here's the logic for the AI agents that need to be coded to work locally. I started with this idea; there are many possibilities, but this is the one I went with. I create a query function, which is a search system, what we call a command. The query command launches two agents: one works with an algorithm called BM25, and the other with TF-IDF, an algorithm that searches for keywords, weighting each keyword according to the length of my document. It's incredibly fast, on the order of milliseconds: if the keyword is present, it identifies the document; if it's not, you miss it. Then there's a grep function, also fast, which complements the BM25 system. If BM25 returns zero (this is the conditional logic I implemented in the code: if my BM25 result = 0 and my grep = 0), I run an AI agent on my document index.
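An index file in the spirit described above might look like the following compact map of where figures, tables, and sections live, so the model can route to a chunk without loading all 58 pages. All entries here are made-up examples.

```python
# Illustrative document index: a compact routing map for the LLM (entries invented).
import json

index = {
    "document": "DeepSeek-V3 Technical Report",
    "pages": 44,
    "figures": {"fig_3_training_curve": {"page": 9, "chunk": "dsv3-008"}},
    "tables": {"table_2_benchmarks": {"page": 21, "chunk": "dsv3-019"}},
    "sections": {
        "2 Architecture": {"pages": [4, 11], "chunks": ["dsv3-003", "dsv3-010"]},
        "5 Evaluation": {"pages": [20, 30], "chunks": ["dsv3-018", "dsv3-027"]},
    },
}

with open("deepseek_v3_index.json", "w", encoding="utf-8") as f:
    json.dump(index, f, indent=2)
```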
The index system immediately routes the model and gives it an instant overview of the sequences. It can then identify the elements, and knowing the figures and the chunk numbers where they're located optimizes the system. This way, we don't need a full RAG system and we keep everything local. What can be interesting with this setup is that we can also use a small local model to retrieve the data, always with the same idea: optimize our usage, because Claude quickly becomes very expensive in terms of context. This system, with its separation of images, tables, directories, and architecture, requires one thing: you must always be very organized. And understand this: whenever you work with documentation, it's always the same story. Sending raw PDFs, as you see in videos from social-media influencers, or documents downloaded straight from the internet, is pointless. There's a reason hiring a data professional, someone who builds RAG (Retrieval-Augmented Generation) systems, or a data engineer costs money: it's work in which the data structure is optimized to speed up the retrieval process.

So the logic you need to implement is a guide system based on the index-file architecture. Here we're working with a single directory, so you need to create an index file for each document, especially for large document blocks, because they contain quite a lot of information. Then there needs to be a master index file that can instruct the model: if it doesn't know where a piece of information is located, the master index in your root document routes it to the right sub-document. Always think of these systems as routing systems. Today, the entire architecture rests on this: optimizing the keywords within the routing system so the LLM understands where the data is located, retrieves the chunk, sends it to the context, and keeps the system efficient. So we actually have two possibilities. If you'd like another video on the second part, just leave a comment and let me know; I'll detail the installation and configuration.

Regarding the MCP RAG part, here are the configurations. The MCP is the one displayed on screen, right here: npx @modelcontextprotocol/inspector. You'll need the identifier of your vector database (the ID shown here) and an environment key from OpenAI, which you'll find in the "API Keys" section; that's where you activate the API key used to connect. Those are the only two pieces of information you need. Then you ask Claude to use the Model Context Protocol (MCP) function, and it writes a Python file for you. You give it the ID of the database you want to use. So if you have different databases, you can easily create one for customer service and another for invoices. The best approach: the more specific the elements are, the faster it will be, the easier things are to find, and the less time is wasted. Personally, I created two functions in the system: a search function and a fetch function. The fetch function is used when you know exactly which document you want to retrieve; as you saw earlier, we obtained the ID number, so if I want to retrieve the entire document, I use a fetch function. The fetch function isn't strictly necessary: you can simply integrate the MCP search tool, and that will be sufficient.
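Pulling the local routing logic together, here is a sketch of the conditional cascade described above: BM25 and a grep-style scan run first (both in milliseconds), and the index-reading agent is called only when both come back empty. It assumes the third-party `rank-bm25` package; `ask_index_agent` is a hypothetical stand-in for the small local model run on the index file.

```python
# Sketch of the local query command: BM25 -> grep -> index agent, in that order.
import re
from rank_bm25 import BM25Okapi

docs = ["chunk one text ...", "chunk two text ..."]     # your local chunks
bm25 = BM25Okapi([d.lower().split() for d in docs])

def ask_index_agent(q):
    # Hypothetical escalation: a local model reads the index file to route the query.
    raise NotImplementedError("local model reads deepseek_v3_index.json here")

def query(q):
    scores = bm25.get_scores(q.lower().split())
    hits = [docs[i] for i, s in enumerate(scores) if s > 0]
    if not hits:                                         # BM25 = 0 -> try grep
        hits = [d for d in docs if re.search(re.escape(q), d, re.IGNORECASE)]
    if not hits:                                         # grep = 0 -> escalate
        return ask_index_agent(q)
    return hits
```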
My advice is to store your system information in environment variables, not within the requests themselves. Those are the key points for getting the system working. As for my directory structure, I started with the layout shown here, including the .env section. This is where my API keys are stored.
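A minimal sketch of that practice, assuming the `python-dotenv` package; the variable names are examples:

```python
# Keep secrets in a .env file at the project root, never hard-coded in prompts.
import os
from dotenv import load_dotenv

load_dotenv()                                    # reads the .env file
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]    # example variable names
VECTOR_STORE_ID = os.environ["VECTOR_STORE_ID"]
```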
