Scripts to replicate simple file search and RAG in a directory with embeddings and Language Models.
A from-scratch implementation; no vector DBs yet.
A simplified use case: you have thousands of research papers but don't know which ones contain the content you want. You run a search with a rough query and get adequately good results.
Run the following in a terminal in your preferred virtual/conda environment:

`sh setup.sh`

It will install the requirements from the `requirements.txt` file.
- September 4, 2024: Added image, PDF, and text file chat to `ui.py` with multiple Phi model options.
- September 1, 2024: You can now upload PDFs directly to the Gradio UI (`python ui.py`) and start chatting.

You can run `ui.py` and select any PDF file in the Gradio UI to chat interactively with the document (just do `python ui.py` and start chatting).
- (Optional) Download the `papers.csv` file from here and keep it in the `data` directory. You can also keep PDF files in the directory and pass the directory path.
- (Optional) Execute this step only if you downloaded the above CSV file; it is not needed if you have your own text files or PDFs in a directory. Run the `csv_to_text_files.py` script to generate a directory of text files from the CSV file.
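For intuition, here is a hedged sketch of what a CSV-to-text-files conversion like `csv_to_text_files.py` can look like. The `title` and `abstract` column names are assumptions for illustration only; check `papers.csv` for the real schema.

```python
import csv
from pathlib import Path

def csv_to_text_files(csv_path, out_dir, text_columns=("title", "abstract")):
    """Write one .txt file per CSV row, joining the chosen columns.

    NOTE: the column names are illustrative assumptions, not the
    actual papers.csv schema.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            # Join the selected columns into one plain-text document.
            text = "\n".join(row.get(col, "") for col in text_columns)
            (out / f"paper_{i}.txt").write_text(text, encoding="utf-8")
```

Each resulting text file then becomes one searchable document for the embedding step.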
- Run `create_embeddings.py` to generate the embeddings, which are stored in a JSON file in the `data` directory. Check the scripts for the respective file names, and `src/create_embeddings.py` for the relevant command-line arguments.

  General example:

  `python create_embeddings.py --index-file-name index_file_to_store_embeddings.json --directory-path path/to/directory/containing/files/to/embed`
  Additional command-line arguments:

  - `--add-file-content`: Store text chunks in the JSON file if you plan to do RAG along with file search.
  - `--model`: Any Sentence Transformer model tag. Default is `all-MiniLM-L6-v2`.
  - `--chunk-size` and `--overlap`: Chunk size for creating embeddings and the overlap between consecutive chunks.
  - `--njobs`: Number of parallel processes to use. Useful when creating embeddings for hundreds of files in a directory.
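To illustrate what `--chunk-size` and `--overlap` control: each chunk shares a tail of words with the next one, so a sentence cut at a chunk boundary still appears whole in at least one chunk. Below is a minimal word-level sketch; this is an assumption for illustration, and the actual script may chunk by characters or tokens instead.

```python
def chunk_text(text, chunk_size=128, overlap=16):
    """Split text into chunks of `chunk_size` words, where each chunk
    shares its first `overlap` words with the tail of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # advance by this many words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the end of the text
    return chunks
```

Larger overlaps cost more embeddings but reduce the chance that a relevant passage is split across two chunks and matched by neither.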
- Then run `search.py` with the path to the respective embedding file to start the search, and type in the search query.

  General example:

  `python search.py --index-file path/to/index.json`

  The above command outputs a list of the top-K files that match the query.
  Additional command-line arguments:

  - `--extract-content`: Whether to print the related content. Only works if `--add-file-content` was passed during the creation of embeddings.
  - `--model`: Sentence Transformer model tag, if a model other than `all-MiniLM-L6-v2` was used during the creation of embeddings.
  - `--topk`: Top K embeddings to match and output to the user.
  - `--llm-call`: Use an LLM to restructure the answer for the question asked. Only works if `--extract-content` is passed, as the model needs context. Currently the Phi-3 Mini 4K model is used.
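Under the hood, the search step amounts to ranking the stored embeddings by cosine similarity against the query embedding and returning the top K. A minimal sketch of that ranking, with plain Python lists standing in for the real JSON index and Sentence Transformer vectors (the function names here are illustrative, not the repository's API):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def topk_search(query_vec, index, k=3):
    """Return the names of the k index entries most similar to query_vec.

    `index` is a list of (file_name, embedding) pairs, a stand-in for
    the JSON index file the scripts create.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

In the real scripts the query vector would come from encoding the typed query with the same Sentence Transformer model used at indexing time; with `--extract-content`, the stored chunks for the top hits provide the context passed to the LLM.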