The main objective of this notebook is to demonstrate how to use an LLM to build a QA bot for PDF files. The process can be broken into the following steps:
- Data Preprocessing - PDF files are processed to extract their textual content.
- Content Chunking - The extracted text is split into fixed-size chunks of 512 tokens, with an overlap of 100 tokens between consecutive chunks.
- Embeddings Generation - Each content chunk is transformed into embeddings using the `e5-base-v2` model. (Any other suitable embedding model can be used here.)
- Building the QA bot - The `google/flan-t5-large` LLM was selected for this task.
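
The chunking step described above can be illustrated with a minimal sketch. This is not the notebook's actual code: the function name `chunk_tokens` is hypothetical, and whitespace splitting stands in for the model tokenizer, which the notebook presumably uses to count tokens.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str],
    chunk_size: int = 512,
    overlap: int = 100,
) -> List[List[str]]:
    """Split a token sequence into fixed-size chunks that share `overlap` tokens.

    Hypothetical helper for illustration; the sizes default to the values
    quoted in this README (512-token chunks, 100-token overlap).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the input,
        # so no extra chunk made up only of overlap tokens is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: whitespace tokens stand in for real tokenizer output.
sample_text = " ".join(f"word{i}" for i in range(1000))
sample_chunks = chunk_tokens(sample_text.split())
```

With these defaults the window advances 412 tokens at a time, so each chunk begins with the last 100 tokens of the previous one. The overlap reduces the chance that an answer spanning a chunk boundary is cut in half.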
- Git clone this repo
- Install related packages: `pip3 install -r requirements.txt`
- Start jupyter notebook: `jupyter notebook`
- Click `demo.ipynb` and run all cells.