This project contains code for experimenting with BM25 and for finding the best embedding settings for LLM attribution.
A demo can be found here 🤗. Enter some text as the query to get the documents it most likely came from (for now the demo only covers some Wikipedia articles).
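As a rough illustration of the query-to-document lookup described above, here is a minimal BM25 ranking sketch in plain Python. It is not the code from this repository or the demo: the function names, the whitespace tokenizer, the parameter defaults (k1=1.5, b=0.75), and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF variant that stays positive for common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, made up for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
scores = bm25_scores("cat on the mat", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

The highest-scoring document is the one the query text most plausibly came from; documents that share no terms with the query score zero.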
- bm25_init.py and bm25_intro: contain code for the implementation and testing of BM25.
- embedding_examples: folder with example embeddings from the different sentence transformer models used for testing (the rest are stored elsewhere).
- test_data: contains all the test data from Wikipedia, TreCovid, and NFCorpus.
- generate_embeddings: creates the different embeddings to compare, using a given model from Huggingface.
- dim_reduce: tests dimensionality reduction using PCA on some embeddings to see how accuracy changes.
- results: contains the pkl files for all the embeddings and settings that were tried.
- analysis: contains all the analysis files for the results above.
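The kind of experiment dim_reduce describes can be sketched as follows: fit PCA on document embeddings, project both documents and queries, and compare top-1 retrieval accuracy before and after. This is only a sketch under assumptions, not the repository's code: the helper names are hypothetical, PCA is done with a plain NumPy SVD to keep the example self-contained (scikit-learn's PCA would be a common alternative), and the embeddings are random synthetic vectors rather than real model outputs.

```python
import numpy as np


def fit_pca(x, n_components):
    # Principal directions are the top right-singular vectors of the centered data.
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(x, mean, components):
    # Center with the training mean, then project onto the kept components.
    return (x - mean) @ components.T


def top1_accuracy(doc_vecs, query_vecs, labels):
    # Cosine similarity between every query and every document.
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    predicted = (q @ d.T).argmax(axis=1)
    return float((predicted == labels).mean())


# Synthetic stand-ins for embeddings: each query is a noisy copy of one document.
rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(50, 64))
query_emb = doc_emb + 0.1 * rng.normal(size=(50, 64))
labels = np.arange(50)

acc_full = top1_accuracy(doc_emb, query_emb, labels)

mean, components = fit_pca(doc_emb, n_components=16)
doc_small = project(doc_emb, mean, components)
query_small = project(query_emb, mean, components)
acc_small = top1_accuracy(doc_small, query_small, labels)

print(f"64 dims: {acc_full:.2f}, 16 dims: {acc_small:.2f}")
```

Fitting PCA on the documents only and reusing the same mean and components for the queries keeps both sets in the same reduced space, which is what makes the accuracy comparison meaningful.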