SQLServerCentral Article

Your Personal ChatGPT Tutorial

Introduction

We are currently in a golden age of AI, with the emergence of large language models (LLMs) creating entirely new opportunities. This advancement is benefiting many industries. The technique of Retrieval Augmented Generation (RAG) has gained popularity for answering questions based on specific documents using LLMs. This blog will introduce RAG and demonstrate how it can be applied for Q&A on your PDF files.

RAG has two main components: ingestion and retrieval. Ingestion involves generating embeddings by chunking the knowledge base document—the source of truth—and passing these chunks through an embedding model. We store the embeddings in a Vectorstore, which preserves their semantic meaning. Retrieval works like this:

  • when a user submits a query, Langchain conducts a similarity search in the VectorDB
  • the VectorDB then returns several relevant documents or contexts
  • Langchain combines the query, context, and a system prompt, and sends this to the LLM to generate a response

This is the basic RAG workflow.

Tech stack for personal ChatGPT/RAG

Here are the tools we will use:

  1. Programming Language - Python
  2. Langchain for interacting with the LLM, plus the sentence-transformers all-MiniLM-L6-v2 model for embeddings
  3. FAISS (from Facebook) as the vectorstore
  4. Ollama for hosting the model locally
  5. A small PDF file on MS Dhoni (former Indian cricketer and captain)

Scripts and files can be found in the GitHub repository linked later in the article.

Time to jump into the tutorial. Let's do it step by step.

Install required libraries

Ensure pip is installed on your machine.

pip install pypdf sentence-transformers faiss-cpu langchain_huggingface langchain_community
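
Optionally, you can verify the installation with a quick snippet like the one below (my own addition, not part of the tutorial scripts); it simply imports the packages and prints a couple of version numbers.

import faiss
import pypdf
import sentence_transformers
import langchain_community
import langchain_huggingface

# If any of these imports fail, revisit the pip install step above
print("pypdf:", pypdf.__version__)
print("faiss:", faiss.__version__)
print("sentence-transformers:", sentence_transformers.__version__)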

Chunking, embedding, and saving embeddings to the vectorstore

Let’s begin with the ingestion workflow. First, we divide the PDF file (dhoni.pdf) into manageable chunks using Langchain’s RecursiveCharacterTextSplitter. Next, we apply the sentence-transformers/all-MiniLM-L6-v2 model to generate dense vector representations of the sentences and texts. Finally, Langchain saves these vectors in the FAISS Vector Database, enabling efficient similarity searches. Let's jump into this by importing libraries.

# These imports cover both ingest.py and model.py
from langchain.chains import LLMChain, RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import Ollama
import time

The key element of the ingestion workflow is generating the embedding vectors. To start, create an ingest.py file and define the paths of the knowledge base and of the vector store that will hold the embeddings. Create these directories (use the mkdir command on Unix/Linux).

Next, define a Python method, create_vector_db, for generating the embeddings. The create_vector_db function loads the dhoni.pdf file (our knowledge base) and splits it into chunks of 1,000 characters. The RecursiveCharacterTextSplitter from Langchain helps achieve this. We define a chunk overlap of 200 characters (20%) to minimize the loss of context. The HuggingFaceEmbeddings class from Langchain embeds the PDF content using the sentence-transformers embedding model. After generating the embeddings, we save them in the vectorstore/ directory. Below is the code to achieve this.

DATA_PATH = 'data/'
DB_FAISS_PATH = 'vectorstore/'
# Create vector database
def create_vector_db():
    loader = DirectoryLoader(DATA_PATH,
                             glob='*.pdf',
                             loader_cls=PyPDFLoader)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000,
                                                   chunk_overlap=200)
    texts = text_splitter.split_documents(documents)
    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2',
                                       model_kwargs={'device': 'cpu'})
    db = FAISS.from_documents(texts, embeddings)
    db.save_local(DB_FAISS_PATH)
if __name__ == "__main__":
    create_vector_db()
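
Running python3 ingest.py writes the index files (index.faiss and index.pkl by default) into the vectorstore/ directory. If you want to confirm the ingestion worked, a small optional check like the one below (not part of ingest.py) reloads the saved index and runs a similarity search against it:

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2',
                                   model_kwargs={'device': 'cpu'})
# Reload the index that ingest.py just saved and ask for the two closest chunks
db = FAISS.load_local('vectorstore/', embeddings, allow_dangerous_deserialization=True)
for doc in db.similarity_search("When was Dhoni born?", k=2):
    print(doc.page_content[:200])
    print('---')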

Set a custom prompt 

Let’s set up a system prompt that instructs the LLM to act as an MS Dhoni fan and use the provided context to answer user queries. The PromptTemplate class from Langchain lets us define the custom template and set the tone of the conversation. To implement this, we’ll create a Python file called model.py and define a function set_custom_prompt().

Here’s a basic structure for the file:

custom_prompt_template = """You are a MS Dhoni fan and you know everything about him. Use the context to answer the questions. Do not answer anything outside the context.
Context: {context}
Question: {question}
Provide the answer below in a clear and readable format.
Answer:
"""
def set_custom_prompt():
    """
    Prompt template for QA retrieval for each vectorstore
    """
    prompt = PromptTemplate(template=custom_prompt_template,
                            input_variables=['context', 'question'])
    return prompt
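
To preview exactly what the LLM will receive, you can optionally render the template with sample values. The context string below is made up purely for illustration; at runtime the real context comes from the vector search.

prompt = set_custom_prompt()
# Fill the template with a sample context and question to see the final prompt text
print(prompt.format(context="MS Dhoni was born on 7 July 1981 in Ranchi.",
                    question="When was Dhoni born?"))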

Load LLM 

Ollama is used to load the model. An Ollama Docker container can be started as follows:

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Install llama2 using the following command (if Ollama is running inside the Docker container started above, prefix the command with docker exec -it ollama):

ollama pull llama2
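
Before wiring Ollama into Langchain, you can quickly check that the server is reachable; it should respond with a message like "Ollama is running".

curl http://localhost:11434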

Load the model in model.py

def load_llm():
    # Load the locally downloaded model here
    llm = Ollama(
        model="llama2",
        temperature=0.01,
        verbose=True,
        callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    )
    return llm
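
As an optional smoke test (assuming the Ollama container is running and llama2 has been pulled), you can call the model directly before adding retrieval:

llm = load_llm()
# Ask the raw model a question with no retrieval involved
print(llm.invoke("In one sentence, who is MS Dhoni?"))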

Retrieval based on similarity search

Langchain first takes the user query and converts it into an embedding using the same embedding model. It then loads the embeddings created during the ingestion workflow and calls a function named retrieval_qa_chain that performs the semantic search. Langchain combines the system prompt, the user's embedded query, and the context returned by the vector search, and sends this information to the LLM. Finally, the LLM generates a response based on these inputs, which Langchain returns to the user.

Let’s write a function to handle this process.

DB_FAISS_PATH = 'vectorstore/'  # same path the index was saved to in ingest.py

def qa_bot():
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2",
                                       model_kwargs={'device': 'cpu'})
    try:
        # Load the FAISS index with dangerous deserialization enabled
        db = FAISS.load_local(DB_FAISS_PATH, embeddings, allow_dangerous_deserialization=True)
    except ValueError as e:
        print(f"ValueError loading FAISS index from {DB_FAISS_PATH}: {str(e)}")
        raise  # without the index there is nothing to retrieve from
    except Exception as e:
        print(f"Error loading FAISS index from {DB_FAISS_PATH}: {str(e)}")
        raise
    llm = load_llm()
    qa_prompt = set_custom_prompt()
    qa = retrieval_qa_chain(llm, qa_prompt, db)
    return qa

Langchain's RetrievalQA class is used to perform the semantic search. It takes the LLM instance, the prompt template, and the document database (FAISS index) as input. It conducts a similarity search with the embedded query, prompting the VectorDB to return documents related to the query based on semantic relationships. RetrievalQA.from_chain_type retrieves the result from the document database and generates the final output based on the top 2 results (the k value is 2).

# Retrieval QA Chain
def retrieval_qa_chain(llm, prompt, db):
    qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                           chain_type='stuff',
                                           retriever=db.as_retriever(search_type="similarity",
                                                                     search_kwargs={'k': 2}),
                                           return_source_documents=True,
                                           chain_type_kwargs={'prompt': prompt})
    return qa_chain
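
Because return_source_documents=True, the chain also returns the chunks it retrieved, which is handy for checking what grounded an answer. The snippet below is an optional check of my own, not part of the original model.py:

qa = qa_bot()
result = qa({'query': 'where did Dhoni study?'})
print(result['result'])
# The chunks the retriever pulled from the FAISS index for this question
for doc in result['source_documents']:
    print('SOURCE:', doc.page_content[:120])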

Since qa_bot is the entry point of the entire workflow, let's create a wrapper function called final_result that takes the user query and calls the qa_bot method (which sets the prompt and loads the LLM using the previously created functions). Let's also capture the time taken by each query by recording the start and end times with Python's time package.

Here is the function that will call the QA bot and print the result:

def final_result(query):
    start_time = time.time()
    qa_result = qa_bot()
    response = qa_result({'query': query})
    # ANSI escape code for green color
    green_color_code = '\033[92m'
    # Reset ANSI escape code (to revert to default color)
    reset_color_code = '\033[0m'
    print("\n" + query + "\n")
    print(green_color_code + "\n" +  response['result'] + reset_color_code)
    #print(response)
    end_time = time.time()
    response_time = end_time - start_time
    print(f"Response Time:{response_time}")
    return response
final_result("when was dhoni born?")
final_result("where did Dhoni study?")
final_result("where did Dhoni's father work")
final_result("what are the awards Mahendra singh Dhoni won??")

Here are the results for the 4 questions:

python3 model.py
when was dhoni born?
MS Dhoni was born on July 7, 1981.
Response Time:16.51418948173523
where did Dhoni study?
Dhoni studied at DAV Jawahar Vidya Mandir School located in Ranchi.
Response Time:13.603944778442383
where did Dhoni's father work
According to the text, Mahendra Singh Dhoni's father, Pan Singh, worked as a junior manager in Mecon.
Response Time:5.351931810379028
what are the awards Mahendra singh Dhoni won??
Mahendra Singh Dhoni has won several awards throughout his career as a cricketer. Some of the notable awards he has won include:
1. LG's People's Choice Award in 2013.
2. Rajiv Gandhi Khel Ratna, the highest honour for a sportsperson in India, in 2013.
Response Time:12.355783224105835
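
One thing to note: every call to final_result rebuilds the chain from scratch via qa_bot, reloading the embedding model and the FAISS index each time. If you plan to ask many questions in a single run, a small variation like the sketch below (same assumptions as above) builds the chain once and reuses it:

qa = qa_bot()  # build once: embeddings, index, prompt, and LLM are loaded a single time
questions = ["when was dhoni born?",
             "where did Dhoni study?",
             "where did Dhoni's father work",
             "what are the awards Mahendra singh Dhoni won??"]
for q in questions:
    print(q, "->", qa({'query': q})['result'])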

Wow! I'm thrilled to hear you've finished the tutorial. Congratulations! You've successfully built your own ChatGPT that operates on your document and runs locally.

You can find the code and pdf file here: https://github.com/abhi-singh-123/Custom-RAG-Chatbot

 
