
RAG workflow project

  • Ken Munson
  • Nov 15
  • 4 min read

Updated: Nov 17


Here’s a summary (with a little bit of detail) of the full Load → Chunk → Embed → Save to FAISS → Reload → Query workflow I built.


I probably don't have to say that I used ChatGPT 5.1 for help, especially with the Python coding, as well as some deep troubleshooting of this code.


Because I had to spend a lot of time troubleshooting Chroma (an open-source vector database for AI applications) on Windows 11, this project took longer than it should have. If you want to recreate this project (which I highly recommend), and you are going against the grain and doing the development on a Windows machine, I highly recommend staying away from Chroma as your vector store. Trust me on this.


All things considered, this first part of the project, detailed here, probably took 30 hours. It would have been 20 without the Chroma drama. With what I have written below, and the repo I will eventually put on GitHub, you could probably get it done in 5, depending on your OS, API setup, and other issues.



The project:



RAG Pipeline Overview

RAG stands for Retrieval Augmented Generation: giving a language model the super specific information you want it to be "informed" about (information it wouldn't normally have access to, like a specific company policy).



Load → Chunk → Embed (Vertex AI) → Save to FAISS → Reload → Query



Below is a conceptual summary plus the important implementation details I actually used in this project.



---



 1. Load Documents



Purpose:


Bring raw files (PDF, TXT, etc.) into memory so they can later be split into chunks and embedded.



* I created a `data/` directory and placed the input PDF there.


* The Python ingestion script (`ingest.pipeline`) scans that directory.


It uses **LangChain document loaders** (a short loading sketch appears below):



    * `PyPDFLoader` → extracts text page by page.


    * `TextLoader` → reads raw text files.



Output:


A list of raw `Document` objects, each containing:



* `page_content`


* `metadata` (filename, page number, etc.)
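

Here is a minimal sketch of that loading step (the `load_documents` helper name and the standalone layout are mine for illustration; the real code lives in `ingest.pipeline`):

```python
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader, TextLoader

def load_documents(data_dir: str = "data"):
    """Load every PDF and TXT file in data_dir as LangChain Document objects."""
    docs = []
    for path in sorted(Path(data_dir).iterdir()):
        if path.suffix.lower() == ".pdf":
            docs.extend(PyPDFLoader(str(path)).load())  # one Document per page
        elif path.suffix.lower() == ".txt":
            docs.extend(TextLoader(str(path)).load())
    return docs

docs = load_documents()
print(f"Loaded {len(docs)} raw docs.")
```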



My log output example:



```
Loaded 5 raw docs.
```



(The 5 docs came from splitting the PDF into 5 pages.)



---



2. Chunk the Documents



Purpose:


Break long documents into small, searchable pieces.


LLMs and embedding models work much better on short, coherent chunks.



How I did it:


I used LangChain’s `RecursiveCharacterTextSplitter`:



```python
from langchain_text_splitters import RecursiveCharacterTextSplitter  # older versions: langchain.text_splitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # max characters per chunk
    chunk_overlap=200,   # characters shared between neighboring chunks
)
chunks = splitter.split_documents(docs)  # Documents in, smaller chunk Documents out
```



This produces overlapping chunks so no important sentence gets cut in half.



Output example:



```
Created 16 chunks.
```



So: 5 pages (in the PDF I uploaded) → 16 good, LLM-friendly chunks.



---



3. Embed Each Chunk (Vertex AI Embeddings)



Purpose:


Convert text into numerical vectors so FAISS can index them.

Embedding a chunk using Vertex AI embeddings involves converting a segment of text (a "chunk") into a dense, numerical representation called a vector. This vector, also known as an embedding, captures the semantic meaning of the text and enables efficient retrieval and comparison of related information.



How I did it:



* I used Google's Vertex AI through `langchain_google_vertexai`.


* Model: **text-embedding-004**


* Configured with my Project ID and region (`us-central1`).



```python
from langchain_google_vertexai import VertexAIEmbeddings

emb = VertexAIEmbeddings(
    model_name="text-embedding-004",
    project=settings.project_id,
    location=settings.location,
)
```



Each chunk becomes a high-dimensional vector (768 dimensions for text-embedding-004).



These vectors represent the semantic meaning of the text.
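

If you want to inspect the raw vectors yourself, the standard LangChain embeddings interface exposes `embed_documents`; this step is optional, since FAISS will call the embedder for you in the next section:

```python
# Optional inspection step; FAISS.from_documents() runs the embedder internally.
vectors = emb.embed_documents([chunk.page_content for chunk in chunks])
print(len(vectors), len(vectors[0]))  # e.g. 16 chunks, 768 dimensions each
```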



---



4. Store Vectors in a Persisted FAISS Index



Purpose:


Save embeddings locally so you can reload them later and run similarity search.



I switched away from Chroma (because of Windows segfaults + dependency conflicts) and moved to FAISS (flat index) persisted to disk. This was the breakthrough.  I spent 10 hours trying to get this dumb Chroma vector store to work!



How I did it:



```python
from langchain_community.vectorstores import FAISS

# Embed every chunk and build a flat FAISS index in one call.
vs = FAISS.from_documents(chunks, emb)

# Persist the index (and its docstore) to disk.
vs.save_local(settings.faiss_dir)
```



This writes:



```
vectorstore/
  index.faiss    ← the FAISS index (the raw vectors)
  index.pkl      ← the docstore (documents + metadata)
```



Ingest output example:



```
Done. Vector store persisted at: vectorstore (approx. vectors=16)
```
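

Putting sections 1-4 together, the ingest entry point is conceptually just this (a sketch, not the literal `ingest.pipeline`, reusing the `load_documents`, `splitter`, and `emb` pieces shown above):

```python
def run_ingest() -> None:
    docs = load_documents()                  # 1. Load
    chunks = splitter.split_documents(docs)  # 2. Chunk
    vs = FAISS.from_documents(chunks, emb)   # 3 + 4. Embed and index
    vs.save_local(settings.faiss_dir)        # persist to vectorstore/
    print(f"Done. Vector store persisted at: {settings.faiss_dir} (approx. vectors={len(chunks)})")
```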



---


5. Reload the Index Later



To query the vector store, I reload it from disk:



```python
# Reload with the SAME embedding model used at ingest time.
# Note: newer LangChain versions also require allow_dangerous_deserialization=True,
# because index.pkl is deserialized with pickle.
vs = FAISS.load_local(settings.faiss_dir, emb)
```



Because FAISS stores raw vectors, you must reload using the same embedding model.



---



6. Query: Similarity Search → LLM Answering



Purpose:


Take a user question, find the most relevant chunks, and pass them into the LLM.



I implemented two modes:



Similarity Search



Returns the nearest chunks based on vector distance:



```python
# Returns a list of (Document, relevance_score) pairs.
docs = vs.similarity_search_with_relevance_scores(query, k=5)
```



Max Marginal Relevance (MMR)



Returns diverse chunks to reduce redundancy:



```python
# fetch_k candidates are retrieved first, then k diverse results are kept.
docs = vs.max_marginal_relevance_search(query, k=5, fetch_k=10)
```



Answering with the LLM (Gemini 2.0 Flash-001)



Then:



1. Take the retrieved chunks


2. Format them as context


3. Pass both the context + user question to a Gemini model



This happens in `app/chain.py`.
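

The real `app/chain.py` has more structure, but a minimal sketch of those three steps could look like this (the `answer` function name and the prompt wording are my illustrations; `settings`, the FAISS store `vs`, and the model name come from the project):

```python
from langchain_google_vertexai import ChatVertexAI

llm = ChatVertexAI(
    model_name="gemini-2.0-flash-001",
    project=settings.project_id,
    location=settings.location,
)

def answer(question: str, k: int = 5) -> str:
    # 1. Retrieve the most relevant chunks from the FAISS store.
    docs = vs.similarity_search(question, k=k)

    # 2. Format them as context.
    context = "\n\n".join(d.page_content for d in docs)

    # 3. Pass context + question to Gemini.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm.invoke(prompt).content
```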



---



7. CLI Tool to Ask Questions



The script:



```bash
python -m scripts.query_cli "What does the document say about tracking ML experiments?"
```



Steps:



1. Query is received


2. FAISS retrieves top chunks


3. Gemini 2.0 Flash-001 is invoked


4. You see the final answer
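

For reference, a stripped-down sketch of what such a CLI entry point could look like (not the exact `scripts/query_cli.py`; it assumes the `answer` helper from the chain sketch above):

```python
import sys

from app.chain import answer  # the retrieve-and-answer helper sketched earlier

def main() -> None:
    if len(sys.argv) < 2:
        print('Usage: python -m scripts.query_cli "your question"')
        raise SystemExit(1)

    question = " ".join(sys.argv[1:])
    print(answer(question))

if __name__ == "__main__":
    main()
```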



---



Why This Works (Conceptually)



RAG = Retrieval Augmented Generation



Instead of asking the LLM to "remember" the document, the pipeline does the following:



1. Load


2. Chunk


3. Embed


4. Store in FAISS


5. Retrieve relevant chunks at query time


6. Give retrieved chunks to the LLM


7. LLM answers with grounded citations



This helps ensure:



* Answers grounded in your actual document


* Far fewer hallucinations


* Cost efficiency (embeddings are cheap, retrieval is local)



---




This is a fully functioning local RAG pipeline backed by FAISS, Vertex AI embeddings, and Gemini 2.0: the same kinds of components used in many production RAG systems.


 
 
 
