English | Español
This repository contains the source code for the web application: Ask the PDF. The application enables users to upload a PDF document and, regardless of its size, begin asking questions about it. Additionally, it provides a concise summary in a table-of-contents format as a guide. It was designed for Spanish speakers, but the application also supports responses in English. It utilizes LLaMA 2 or gpt-3.5 Turbo as its language model; further details are provided below.
- Python version: 3.10
- Libraries: langchain, pypdf, nltk, faiss, gensim, gradio, sentence-transformers, hugchat.
- Installation: requirements.txt
The simplest way to try the app is to follow the link:
It was deployed with Docker and using the serverless service from Google, Cloud Run.
To run the web app on your local machine, install python 3.10 and git. Then, run on the terminal:
git clone https://github.com/a-jimenezc/pdf_chat
cd pdf_chat
pip install -r requirements.txt
python download_sentence_transformer.py
python app.py
The main goal of this project is to enable chatting with a PDF document, regardless of its size and without incurring any additional cost. This was achieved with the help of open source libaries and resources:
Providing a concise summary of the document helps the user gain an understanding of its content, making it easier to start chatting. LLMs can summarize documents, but the computational cost for large documents makes using them for this task prohibitively expensive. So, some preprossessing was needed. LDA (Latent Dirichlet Allocation) is a great algorithm for document processing. It produces a list of words per topic; it also allows the selection of the number of topics and words per topic. Then, the output word lists could be feeded into an LLM and ask, prompting it to articulate a description using natural language. This approach aids in extracting the core ideas from the document. In this case, the summary is given in a table-of-content format.
Creating a vector database for a one thousand page book would require a lot of calls to an embedding generator. Luckily, embedding generator models are relarively small and can be run in consumer hardware. The library SentenceTransformers is a "Python framework for state-of-the-art sentence, text and image embeddings". For this scenario, Multi-Lingual Models were used, due to the multilingual purpose of the application. distiluse-base-multilingual-cased-v1 worked fine after some testing, so this was used.
The library hugchat by Soulter offers an unofficial API for Hugging Chat. Currently, the model powering Hugging Chat is LLaMA 2, from Meta. So, the terms of use, limitations, caveats, and licencing stipulated by both LLaMA 2 and Hugging Chat apply when using this web application. Please, follow the previous links for more information.
The application also offers support for the OpenAI gpt-3.5 Turbo model. However, users are required to input their own API key due to associated costs. Please note that this information is kept temporarily in a per sesion basis. Once the sesion is closed, the API key is deleted.
The application was built with gradio. It offers back-end and front-end support for machine learning applications. Also, they have an exelent support for language models, including a very handy chat object.
LangChain made this project possible. It offers a rich set of tools for working wiht LLMs, including template for prompts, vector databases, and more.
- Add multilanguage support
- Add more LLMs to choose from.
GNU General Public License v2.0
This application relies on third-party libraries and resources. Consequently, its utilization is subject to specific terms of use, conditions, and licenses that pertain to these external libraries and resources.