Linux Python 3.9.19 PyTorch 2.0.1
Please refer to requirements.txt for detail environments
- AQA: dataset from offical
- model_finetune: finetune embedding models and rerank models
- config: accelerate trainning config
- rag-retrieval: trainging code
- processed data: save processed data
- results: save results txt file
- src: main pipline of retriver
- In addition we provide the checkpoint of finetuned gte-large-en-v1.5, you can download it from the Baidu Netdisk (link:https://pan.baidu.com/s/1OOhfzIhedVWpgBYnBcTwKQ?pwd=8zoc, password:8zoc). You can convert the checkpoint to sentence-transformer checkpint form following the guidence of sentence-transformer official site SentenceTransformers Documentation — Sentence Transformers documentation (sbert.net)
- embedding model: NV-Embed-v1,Linq-Embed-Mistral,GritLM-7B, gte-large-en-v1.5, SFR-Embedding-Mistral
- rerank model: bge-reranker-v2-m3
cd model_finetune/rag-retrieval/embedding
bash train_embedding.sh
Note that you need to set up your relevant parameters (e.g., model name, number of negative examples) in .sh file before running it.
cd model_finetune/rag-retrieval/reranker
bash train_rerank.sh
Note that you need to set up your relevant parameters (e.g., model name, number of negative examples) in .sh file before running it.
python3 src/build_retriever.py
Note that you need to set up your relevant parameters (e.g., embedding model path, save_path) in build_retriever.py
file before running it.
python3 src/get_related_doc.py
See the code comments for details on how to use. You need to specify the model name of model path in the code to produce your results for each model.
python3 src/rerank.py
See the code comments for details on how to use.
python src/preprocess.py
python src/hard_negative_mining.py
See the code comments for details on how to use. We use this code to mine the hard negative examples, according to the similarity between docs and queries.
python3 src/RRF.PY
See the code comments for details on how to use. This code can merge the retrieval result produced by different models, using reciprocal rank fusion(RRF).
- hyde.py: Generating hypothetical answers for retrieval via LLM
- doc_classifier.py: Classify documents into different categories
- query_rewrite: Rewrite query
We have explored a lot of ways to improve the effectiveness of our model, but some prove to be only useful in the valid dataset (for example, reranker and hard negative mining), so we only apply some of the functions mentioned above in our best-result-model.
cd model_finetune/rag-retrieval/embedding
bash train_embedding.sh
We finetune gte-large-en-v1.5 using contrastive learning.
python3 src/build_retriever.py
python3 src/get_related_doc.py
Here we construct retrievers of five models: gte-large-en-v1.5(finetuned), GritLm-7B, SFR-Embedding-Mistral, NV-Embed-v1, Linq-Embed-Mistral. We retrieve 100 docs for each query by those five retrievers, and get five result files respectively in the result folder.
python3 src/RRF.PY
We finally combine the result of the mentioned fine models using RRF, and after voting, top-20 results of each query are selected as the final results.