10-K Filings Analyzer Using an LLM

Description

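This is a locally run LLM application that answers questions about the 10-K filings of Google (Alphabet Inc.), Uber, and Tesla. The model is not fine-tuned on the filings; instead, relevant passages are retrieved from them and passed to the model as context. Users can ask any question relevant to these filings.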

Setup

The project uses the following components:

Backend: LangChain v0.2
Frontend: Streamlit
Vector Store: FAISS
Embedding Model: mxbai-embed-large
LLM: llama3:latest

UI Design and Sample Queries

The UI features a sidebar that shows all the text that was passed to the model as context. Each element is a chunk of size 256 with an overlap of 20. For the given queries, I retrieved the four most relevant chunks. Here are the responses to the given questions:

Retrieving more chunks as context gives more detailed answers to the same questions, as shown below:

Development

I extracted the text from the PDFs using PyPDF. (Unstructured would probably be a better choice, as there are a lot of tables that need to be read properly.)
The text was then divided into smaller chunks so that the information is retained properly (I tried huge chunks, such as 10000, but the context was lost completely). These chunks are created using a technique known as recursive chunking.
Using mxbai-embed-large as the embedding model, I generated embedding vectors for the text extracted from the PDFs.
These embedding vectors are then stored in a FAISS vector store. The database is saved locally so that we do not have to scan the PDFs again and again.
At query time, the database is searched for the chunks most relevant to the question, and those chunks are given to the LLM as context so that it can answer the query.
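The chunking and retrieval steps above can be sketched in plain Python. This is a simplified stand-in for the idea, not the project's code and not LangChain's or FAISS's implementation: the function names, the separator list, the choice to measure size in characters, and the squared Euclidean distance metric are all assumptions made for illustration. The defaults (256, 20, and four results) mirror the values mentioned in the UI section.

```python
from typing import List, Sequence, Tuple

# Assumed separator order, coarsest to finest; "" means "cut anywhere".
SEPARATORS = ("\n\n", "\n", " ", "")


def recursive_split(text: str, max_len: int,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Break text into non-empty pieces of at most max_len characters.

    The coarsest separator is tried first; only pieces that are still
    too long are split again with the next, finer separator.
    """
    if len(text) <= max_len:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at a fixed width.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    pieces: List[str] = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, max_len, finer))
    return pieces


def chunk_text(text: str, chunk_size: int = 256, overlap: int = 20) -> List[str]:
    """Merge small pieces into chunks of at most chunk_size characters.

    Each new chunk starts with the last `overlap` characters of the
    previous one. Pieces are joined with single spaces, so the original
    whitespace is not preserved.
    """
    # Leave room for the overlap tail plus one joining space.
    pieces = recursive_split(text, chunk_size - overlap - 1)
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            tail = current[len(current) - overlap:] if overlap else ""
            current = f"{tail} {piece}" if tail else piece
    if current:
        chunks.append(current)
    return chunks


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: Sequence[float], vectors: List[Sequence[float]],
          texts: List[str], k: int = 4) -> List[Tuple[str, float]]:
    """Return the k texts whose vectors are closest to the query, nearest first."""
    scored = [(t, squared_l2(query_vec, v)) for t, v in zip(texts, vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Chunking demo on synthetic text.
text = " ".join(f"word{i}" for i in range(200))
chunks = chunk_text(text)
print(len(chunks), "chunks; longest has", max(len(c) for c in chunks), "characters")

# Retrieval demo with toy 2-D vectors; real vectors would come from the embedding model.
texts = ["revenue", "risk factors", "employees", "legal", "properties"]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.5, 0.5), (-1.0, 0.0)]
results = top_k((1.0, 0.0), vectors, texts, k=4)
for chunk, dist in results:
    print(f"{dist:.2f}  {chunk}")
```

Raising k in top_k corresponds to taking more chunks as reference, which is the trade-off described in the UI section: more context for the model at the cost of a longer prompt.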

Prerequisites

Use Ollama to download the embedding and LLM models easily; see the official Ollama page. Make sure Ollama is running and that you have downloaded the relevant models, i.e. llama3 and mxbai-embed-large, from the Ollama website.
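One way to confirm that both models are available before launching the app is to query the local Ollama server. The sketch below is not part of the project; it assumes Ollama is listening on its default address (http://localhost:11434) and uses its /api/tags endpoint, which lists the locally downloaded models. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, List

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address; adjust if yours differs
REQUIRED = ["llama3", "mxbai-embed-large"]


def missing_models(tags_payload: dict, required: Iterable[str]) -> List[str]:
    """Return the required model names absent from an /api/tags response."""
    # Names come back with a tag suffix, e.g. "llama3:latest"; compare on the base name.
    installed = {m["name"].split(":")[0] for m in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Ask the local Ollama server which models it has downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        missing = missing_models(fetch_tags(), REQUIRED)
    except OSError as exc:
        # Connection errors (e.g. Ollama not running) end up here.
        print(f"Could not reach Ollama at {OLLAMA_URL}: {exc}")
    else:
        print("Missing models:", missing or "none")
```

Any model reported as missing can be downloaded with ollama pull followed by the model name, for example ollama pull llama3.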

How to set up

Clone the GitHub repo: git clone https://github.com/OsafAliSayed/Alemeno-Internship-Assignment/
Create a Python virtual environment in the project directory: python -m venv venv
Activate the environment from a terminal: venv\Scripts\activate on Windows, or source venv/bin/activate on Linux.
Install the dependencies: pip install -r requirements.txt
Finally, launch the application: streamlit run frontend.py