PDF GPT

Use Your Own PDFs and OpenAI's API to Chat with Your Documents

November 13, 2023 · 10 mins read

Summary

This chatbot is designed to interact with users, answering questions about multiple PDF documents uploaded by the user. The code employs Streamlit, PyPDF2, and the LangChain library, along with a small custom module of HTML templates for styling.

The main program begins with the import statements, bringing in the necessary libraries and modules. Streamlit builds the interactive web application, providing the interface through which the user talks to the chatbot. The ‘dotenv’ library loads environment variables, here the OpenAI API key, from a local .env file. The ‘PyPDF2’ library reads text out of PDF documents, and modules from the LangChain library handle the natural language processing tasks: splitting text, embedding it, storing vectors, and running the conversational chain.
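
For reference, the only configuration the app needs is an OpenAI API key in the environment. Here is a minimal sketch of how ‘load_dotenv’ picks it up (the .env file name and the OPENAI_API_KEY variable name are the dotenv and OpenAI defaults, not something defined in this project):

import os
from dotenv import load_dotenv

load_dotenv()  # reads a local ".env" file containing e.g. OPENAI_API_KEY=<your key>
print(bool(os.getenv("OPENAI_API_KEY")))  # True if the key was found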

Moving on to the functions section, several key functions are defined, each handling one step of the pipeline. This encapsulation keeps the code modular and easier to understand.

get_pdf_text(pdf_docs): This function takes a list of PDF documents as input, iterates through them using the ‘PdfReader’ from PyPDF2, and extracts text from each page. The resulting text is then concatenated and returned.

get_text_chunks(text): Here, a ‘CharacterTextSplitter’ from the ‘langchain’ library is employed to split the text into smaller, manageable chunks. This is crucial for efficient processing and analysis, especially when dealing with large volumes of text.

get_vectorstore(text_chunks): This function utilizes embeddings, either from OpenAI or Hugging Face, to convert text chunks into vector representations. These embeddings are then used to create a vector store using the ‘FAISS’ library. This step is essential for comparing and retrieving relevant information during user interactions.
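
To make “vector representations” concrete, here is a minimal sketch (assuming OPENAI_API_KEY is set) that embeds a single chunk and inspects the result:

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("a sample text chunk")
print(len(vector))  # a plain list of floats, e.g. 1536 dimensions for OpenAI's default model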

get_conversation_chain(vectorstore): The conversation chain is the core component of the chatbot. It initializes a chat model (a large language model, or LLM), sets up a memory buffer that stores the conversation history, and creates a conversational retrieval chain. The retriever is built on the vector store, ensuring that the chatbot can fetch the passages relevant to the user’s queries.

handle_userinput(user_question): This function is responsible for processing user input, querying the conversation chain for a response, and displaying the response using the Streamlit interface. The conversation history is also updated, allowing for a continuous and context-aware conversation.

Moving on to the main program, the ‘main()’ function sets up the Streamlit page configuration, including the page title and icon. It also includes some HTML templates for styling, enhancing the visual appeal of the user interface.

The main interface consists of a header prompting the user to ask questions about their documents. A text input field lets users enter their questions; submitting a question triggers the handling of user input.

The ‘handle_userinput’ function is called when the user submits a question. It queries the conversation chain for a response and updates the conversation history, which is then displayed on the Streamlit interface. Additionally, a set of predefined messages between the user and the chatbot is displayed using HTML templates.

The sidebar of the interface allows users to upload multiple PDF documents. Upon clicking the “Process” button, the code processes the PDF documents, extracts text, splits it into chunks, creates a vector store, and initializes the conversation chain. The conversation chain is stored in the Streamlit session state, ensuring that it persists across user interactions.

In summary, this Python code presents a comprehensive implementation of a chatbot capable of interacting with users based on information extracted from multiple PDF documents. The modular design, integration of natural language processing techniques, and the use of external libraries make the code versatile and capable of handling various conversational scenarios. The Streamlit framework provides a user-friendly interface, making it accessible for users to engage with the chatbot seamlessly.

Imports

# streamlit run app.py

import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from htmlTemplates import css, bot_template, user_template

Functions

def get_pdf_text(pdf_docs):
    text = ""
    # loop through the uploaded PDFs and read them
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)  # creates a PDF object with pages
        for page in pdf_reader.pages:
            # extract_text() can return None for image-only pages
            text += page.extract_text() or ""
    return text
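
# Hypothetical quick test outside Streamlit: PdfReader accepts open file
# handles as well as the uploader's file-like objects ("report.pdf" is a
# made-up file name):
# with open("report.pdf", "rb") as f:
#     print(get_pdf_text([f])[:200])  # first 200 characters of extracted text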

def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n", 
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks
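
# To see what the splitter produces, a synthetic sample (made up here) yields
# chunks of roughly 1000 characters with 200 characters of overlap:
# sample = "\n".join(f"This is sentence number {i}." for i in range(300))
# chunks = get_text_chunks(sample)
# print(len(chunks), len(chunks[0]))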

def get_vectorstore(text_chunks):
    # choose between OpenAI or the free Hugging Face embeddings
    embeddings = OpenAIEmbeddings()
    # embeddings = HuggingFaceInstructEmbeddings(model_name='hkunlp/instructor-xl')
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore
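
# The store can be queried directly; a sketch with two made-up chunks:
# vs = get_vectorstore(["LangChain splits documents into chunks.",
#                       "FAISS indexes embedding vectors for fast search."])
# docs = vs.similarity_search("How are vectors indexed?", k=1)
# print(docs[0].page_content)  # the chunk closest to the query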

def get_conversation_chain(vectorstore):
    # initialize the LLM and the conversation memory
    llm = ChatOpenAI()
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain
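
# The chain is callable; each call appends to the memory buffer, so follow-up
# questions can reference earlier answers. Sketch (assumes a built vectorstore):
# chain = get_conversation_chain(vectorstore)
# result = chain({'question': 'What is this document about?'})
# print(result['answer'], result['chat_history'])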


def handle_userinput(user_question):
    response = st.session_state.conversation({'question': user_question})
    st.session_state.chat_history = response['chat_history']
    # st.write(response)  # raw response dict, useful for debugging

    # alternate the user/bot styling; "{{MSG}}" is the placeholder the
    # htmlTemplates module is assumed to use for the message text
    for i, message in enumerate(st.session_state.chat_history):
        if i % 2 == 0:
            st.write(user_template.replace(
                "{{MSG}}", message.content), unsafe_allow_html=True)
        else:
            st.write(bot_template.replace(
                "{{MSG}}", message.content), unsafe_allow_html=True)

Main Program


def main():
    load_dotenv()

    st.set_page_config(page_title="Chat with multiple PDFs", 
                       page_icon=":books:")
    st.write(css, unsafe_allow_html=True)

    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

    st.header("Chat with multiple PDFs :books:")
    user_question = st.text_input("Ask a question about your documents:")
    if user_question:
        handle_userinput(user_question)

    st.write(user_template.replace("{{MSG}}", "Hello Robot"), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", "Hello human"), unsafe_allow_html=True)

    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here, then click on 'Process'", accept_multiple_files=True)
        if st.button("Process"):
            with st.spinner("Processing"):
                # get pdf text
                raw_text = get_pdf_text(pdf_docs)
                #st.write(raw_text)
                # get the text chunks
                text_chunks = get_text_chunks(raw_text)
                #st.write(text_chunks)

                # create vector store
                vectorstore = get_vectorstore(text_chunks)

                # create conversation chain (persisted in session state)
                st.session_state.conversation = get_conversation_chain(vectorstore)


if __name__ == '__main__':
    main()

Streamlit Screenshots

(Three screenshots of the running Streamlit app.)