Introduction
We are currently in a golden age of AI, with the emergence of large language models (LLMs) creating entirely new opportunities. This advancement is benefiting many industries. The technique of Retrieval Augmented Generation (RAG) has gained popularity for answering questions based on specific documents using LLMs. This blog will introduce RAG and demonstrate how it can be applied for Q&A on your PDF files.
RAG has two main components: ingestion and retrieval. Ingestion involves chunking the knowledge base document (the source of truth), passing the chunks through an embedding model, and storing the resulting embeddings in a vector store, which preserves their semantic meaning. Retrieval works like this:
- when a user submits a query, Langchain conducts a similarity search in the VectorDB
- the VectorDB then returns several relevant documents or contexts
- Langchain combines the query, context, and a system prompt, and sends this to the LLM to generate a response
That, in a nutshell, is the basic RAG workflow.
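To build intuition before touching any libraries, here is a tiny, self-contained toy sketch of the retrieve-then-augment idea. It is purely illustrative: plain keyword overlap stands in for real vector similarity, and the chunks, queries, and helper names (retrieve, build_prompt) are made up for this example. The real pipeline with Langchain, FAISS, and an actual embedding model follows in the rest of the article.

# Toy illustration of retrieval-augmented generation (not the real pipeline).
# Keyword overlap stands in for semantic similarity, purely for intuition.

KNOWLEDGE_CHUNKS = [
    "MS Dhoni was born in Ranchi.",
    "Dhoni captained the Indian cricket team.",
    "FAISS stores dense vectors for fast similarity search.",
]

def retrieve(query, chunks, k=2):
    # Rank chunks by how many words they share with the query (stand-in for a vector search).
    words = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(words & set(c.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query, context):
    # Combine a system instruction, the retrieved context, and the user query.
    return f"Use only this context to answer.\nContext: {context}\nQuestion: {query}\nAnswer:"

query = "Who captained the Indian cricket team?"
context = " ".join(retrieve(query, KNOWLEDGE_CHUNKS))
print(build_prompt(query, context))  # This assembled prompt would then go to the LLM.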
Tech stack for a personal ChatGPT/RAG
Here are the tools we will use:
- Programming Language - Python
- Langchain for interacting with the LLM, and the sentence-transformers all-MiniLM-L6-v2 model for embeddings
- FAISS (from Facebook) as the vector store
- Ollama for hosting the model locally
- A small PDF file on MS Dhoni (former Indian cricketer and captain)
Scripts and files can be found in the GitHub repository linked at the end of the article.
Time to jump into the tutorial. Let's do it step by step.
Install required libraries
Ensure pip is installed on your machine.
pip install pypdf sentence-transformers faiss-cpu langchain_huggingface langchain_community
Chunking, embedding, and saving embedding to vectorstore
Let’s begin with the ingestion workflow. First, we divide the PDF file (dhoni.pdf) into manageable chunks using Langchain’s RecursiveCharacterTextSplitter. Next, we apply the sentence-transformers/all-MiniLM-L6-v2 model to generate dense vector representations of the sentences and texts. Finally, Langchain saves these vectors in the FAISS Vector Database, enabling efficient similarity searches. Let's jump into this by importing libraries.
import time

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import Ollama
The key element of the ingestion workflow is generating the embedding vectors. To start, create an ingest.py file and define the paths of the knowledge base and of the vector store that will hold the embeddings. Create these directories first (for example, with the mkdir command on Unix/Linux).
Next, define a Python method, create_vector_db, for generating the embeddings. The create_vector_db function loads the dhoni.pdf file (our knowledge base) and splits it into chunks of 1,000 characters. The RecursiveCharacterTextSplitter from Langchain helps in achieving this. We define a chunk overlap of 200 characters (20%) to minimize the loss of context. The HuggingFaceEmbeddings class from Langchain embeds the chunks using the sentence-transformers embedding model. After generating the embeddings, we save them in the vectorstore/ directory. Below is the code to achieve this.
DATA_PATH = 'data/'
DB_FAISS_PATH = 'vectorstore/'

# Create vector database
def create_vector_db():
    # Load every PDF in the data/ directory
    loader = DirectoryLoader(DATA_PATH, glob='*.pdf', loader_cls=PyPDFLoader)
    documents = loader.load()
    # Split the documents into 1,000-character chunks with a 200-character overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    texts = text_splitter.split_documents(documents)
    # Embed the chunks with the sentence-transformers model (on CPU)
    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2',
                                       model_kwargs={'device': 'cpu'})
    # Build the FAISS index and persist it to disk
    db = FAISS.from_documents(texts, embeddings)
    db.save_local(DB_FAISS_PATH)

if __name__ == "__main__":
    create_vector_db()
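Run the script with python3 ingest.py. To sanity-check that the index was written correctly, you can load it back and run a similarity search directly, before involving any LLM. This is just an optional quick check using FAISS.load_local and similarity_search; the query string here is arbitrary.

# Optional sanity check of the saved index (run after ingest.py).
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2',
                                   model_kwargs={'device': 'cpu'})
# allow_dangerous_deserialization is required because FAISS indexes are pickled;
# only enable it for files you created yourself.
db = FAISS.load_local('vectorstore/', embeddings, allow_dangerous_deserialization=True)
for doc in db.similarity_search("Where was Dhoni born?", k=2):
    print(doc.page_content[:100])  # print the first 100 characters of each matching chunk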
Set a custom prompt
Let’s set up a system prompt that instructs the LLM to act as an MS Dhoni fan and to use only the provided context to answer user queries. The PromptTemplate class from Langchain lets us define this custom template and set the tone of the conversation. To implement this, we’ll create a Python file called model.py and define a function set_custom_prompt(). Here’s a basic structure for the file:
custom_prompt_template = """You are a MS Dhoni fan and you know everything about him.
Use the context to answer the questions. Do not answer anything outside the context.

Context: {context}
Question: {question}

Provide the answer below in a clear and readable format.
Answer: """

def set_custom_prompt():
    """
    Prompt template for QA retrieval for each vectorstore
    """
    prompt = PromptTemplate(template=custom_prompt_template,
                            input_variables=['context', 'question'])
    return prompt
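To see roughly what the LLM will receive, you can render the template with placeholder values. The context and question below are dummies; at query time the chain fills these slots in automatically.

# Preview the rendered prompt (the chain fills these slots in at runtime).
prompt = set_custom_prompt()
print(prompt.format(context="Dhoni was born in Ranchi.",
                    question="Where was Dhoni born?"))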
Load LLM
Ollama is used to load and serve the model locally. Start an Ollama Docker container with:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Pull the llama2 model with the following command (run it inside the container if you are using Docker):
ollama pull llama2
Load the model in model.py
def load_llm():
    # Load the locally downloaded model here
    llm = Ollama(
        model="llama2",
        temperature=0.01,
        verbose=True,
        callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    )
    return llm
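Before wiring the model into the retrieval chain, it is worth a quick smoke test to confirm the Ollama server is reachable (this assumes Ollama is running on the default port 11434 with llama2 pulled, as set up above). Run this once and remove it; it is not part of the final model.py.

# Quick smoke test: confirm the local Ollama server responds before building the chain.
llm = load_llm()
print(llm.invoke("In one sentence, who is MS Dhoni?"))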
Retrieval based on similarity search
Langchain first takes the user query and converts it into an embedding using the same embedding model used during ingestion. It then loads the existing embeddings created in the ingestion workflow and calls a function named retrieval_qa_chain, which performs the semantic search. Langchain combines the system prompt, the user's query, and the context returned by the vector search, and sends this information to the LLM. Finally, the LLM generates a response based on these inputs, which Langchain returns to the user.
Let’s write a function to handle this process.
DB_FAISS_PATH = 'vectorstore/'  # same path used in ingest.py

def qa_bot():
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2",
                                       model_kwargs={'device': 'cpu'})
    try:
        # Load the FAISS index with dangerous deserialization enabled
        # (acceptable here because we created the index ourselves)
        db = FAISS.load_local(DB_FAISS_PATH, embeddings, allow_dangerous_deserialization=True)
    except ValueError as e:
        print(f"ValueError loading Faiss index from {DB_FAISS_PATH}: {str(e)}")
        # Handle the error appropriately (e.g., log, notify, or exit gracefully)
        raise
    except Exception as e:
        print(f"Error loading Faiss index from {DB_FAISS_PATH}: {str(e)}")
        raise
    llm = load_llm()
    qa_prompt = set_custom_prompt()
    qa = retrieval_qa_chain(llm, qa_prompt, db)
    return qa
Langchain's RetrievalQA class is used to perform the semantic search. It takes the LLM instance, the prompt template, and the document database (the FAISS index) as input. It conducts a similarity search with the embedded query, prompting the VectorDB to return documents related to the query based on semantic relationships. RetrievalQA.from_chain_type retrieves the result from the document database and generates the final output based on the top 2 results (the k value is 2).
# Retrieval QA Chain
def retrieval_qa_chain(llm, prompt, db):
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=db.as_retriever(search_type="similarity", search_kwargs={'k': 2}),
        return_source_documents=True,
        chain_type_kwargs={'prompt': prompt}
    )
    return qa_chain
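Because return_source_documents=True, the chain's response dictionary contains both the generated answer (under 'result') and the retrieved chunks (under 'source_documents'). Here is a short, optional snippet to inspect which chunks were used for a given question; the question itself is just an example.

# Inspect which chunks the retriever fed to the LLM for a given question.
qa = qa_bot()
response = qa({'query': "when was dhoni born?"})
print(response['result'])
for i, doc in enumerate(response['source_documents'], start=1):
    print(f"Source {i}: {doc.page_content[:80]}...")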
Since qa_bot is the entry point of the entire workflow, let's create a wrapper function called final_result that takes the user query and calls the qa_bot method (which sets the prompt and loads the LLM using the functions created earlier). Let's also capture the time taken by each query by recording the start and end times with Python's time package.
Here is the function that calls the QA bot and prints the result:
def final_result(query):
    start_time = time.time()
    qa_result = qa_bot()
    response = qa_result({'query': query})
    # ANSI escape code for green color
    green_color_code = '\033[92m'
    # Reset ANSI escape code (to revert to default color)
    reset_color_code = '\033[0m'
    print("\n" + query + "\n")
    print(green_color_code + "\n" + response['result'] + reset_color_code)
    #print(response)
    end_time = time.time()
    response_time = end_time - start_time
    print(f"Response Time:{response_time}")
    return response

final_result("when was dhoni born?")
final_result("where did Dhoni study?")
final_result("where did Dhoni's father work")
final_result("what are the awards Mahendra singh Dhoni won??")
Here are the results for the 4 questions:
python3 model.py

when was dhoni born?

MS Dhoni was born on July 7, 1981.
Response Time:16.51418948173523

where did Dhoni study?

Dhoni studied at DAV Jawahar Vidya Mandir School located in Ranchi.
Response Time:13.603944778442383

where did Dhoni's father work

According to the text, Mahendra Singh Dhoni's father, Pan Singh, worked as a junior manager in Mecon.
Response Time:5.351931810379028

what are the awards Mahendra singh Dhoni won??

Mahendra Singh Dhoni has won several awards throughout his career as a cricketer. Some of the notable awards he has won include:
1. LG's People's Choice Award in 2013.
2. Rajiv Gandhi Khel Ratna, the highest honour for a sportsperson in India, in 2013.
Response Time:12.355783224105835
Wow! I'm thrilled to hear you've finished the tutorial. Congratulations! You've successfully built your own ChatGPT that operates on your document and runs locally.
You can find the code and pdf file here: https://github.com/abhi-singh-123/Custom-RAG-Chatbot