In recent years, dense retrievers based on pre-trained language models have made remarkable progress. To help more developers use these cutting-edge techniques, this repository provides 🚀RocketQA, an easy-to-use toolkit for running and fine-tuning state-of-the-art dense retrievers. The toolkit has the following advantages:
- State-of-the-art: 🚀RocketQA provides our well-trained models, which achieve SOTA performance on many dense retrieval datasets. We will keep adding the latest models.
- First-Chinese-model: 🚀RocketQA provides the first open-source Chinese dense retrieval model, trained on millions of manually annotated examples from DuReader.
- Easy-to-use: By integrating this toolkit with JINA, 🚀RocketQA can help developers build an end-to-end question answering system with a few lines of code.
We provide two installation methods: a Python installation package and a Docker environment.
First, install PaddlePaddle.
# GPU version:
$ pip install paddlepaddle-gpu
# CPU version:
$ pip install paddlepaddle
Second, install the rocketqa package:
$ pip install rocketqa
NOTE: this toolkit requires Python 3.6+ and PaddlePaddle 2.0+.
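A minimal sketch for checking these version requirements from Python. `parse_version` and `meets_minimum` are hypothetical helpers written for this illustration, not part of RocketQA, and the parser assumes plain dotted version strings such as "2.0.1".

```python
import sys


def parse_version(text):
    """Turn a dotted version string such as '2.0.1' into a tuple of ints.

    Only the leading digits of each dot-separated piece are used, so
    '2.1.0rc1' becomes (2, 1, 0).
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version, minimum):
    """Return True if `version` is at least `minimum` (both dotted strings)."""
    have, need = parse_version(version), parse_version(minimum)
    # Pad the shorter tuple with zeros so that '2' compares equal to '2.0'.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


print("Python OK:", sys.version_info[:2] >= (3, 6))
# With PaddlePaddle installed, the same helper can check its version string:
#   import paddle
#   print("Paddle OK:", meets_minimum(paddle.__version__, "2.0"))
```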
docker pull rocketqa/rocketqa
docker run -it docker.io/rocketqa/rocketqa bash
Following the examples below, you can build and run your own search engine with a few lines of code. We also provide a Playground with Jupyter Notebook. Try 🚀RocketQA straight away in your browser!
JINA is a cloud-native neural search framework for building SOTA, scalable deep learning search applications in minutes. Here is a simple example of a search engine built on JINA and RocketQA.
cd examples/jina_example
pip3 install -r requirements.txt
# Generate vector representations and build a library for your Documents
# JINA will automatically start a web service for you
python3 app.py index toy_data/test.tsv
# Try some questions related to the indexed Documents
python3 app.py query_cli
See the JINA example for more details.
We also provide a simple example built on Faiss.
cd examples/faiss_example/
pip3 install -r requirements.txt
# Generate vector representations and build a library for your Documents
python3 index.py en ../marco.tp.1k marco_index
# Start a web service on http://localhost:8888/rocketqa
python3 rocketqa_service.py en ../marco.tp.1k marco_index
# Try some questions related to the indexed Documents
python3 query.py
You can also easily integrate 🚀RocketQA into your own task. We provide two types of models: an ERNIE-based dual encoder for answer retrieval and an ERNIE-based cross encoder for answer re-ranking. To run our models, use the following functions.
available_models(): returns the names of the available RocketQA models. To learn more about the available models, see the code comments.
load_model(): returns the model specified by the input parameters. It can initialize either a dual encoder or a cross encoder. Depending on the parameters, you can load either a RocketQA model listed by "available_models()" or your own checkpoint.
The dual encoder returned by "load_model()" supports the following functions:
encode_query: given a list of queries, returns their representation vectors encoded by the model.
encode_para: given a list of paragraphs and their corresponding titles (optional), returns their representation vectors encoded by the model.
matching: given a list of queries and paragraphs (and titles), returns their matching scores (the dot product between the two representation vectors).
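The dot-product matching score described above can be sketched without the toolkit. The vectors below are made-up toy values, not real model outputs, and `pairwise_scores` is a hypothetical helper that assumes the i-th query is scored against the i-th paragraph.

```python
def pairwise_scores(query_vecs, para_vecs):
    """Dot product of each query vector with the paragraph vector at the same index."""
    if len(query_vecs) != len(para_vecs):
        raise ValueError("expected one paragraph vector per query vector")
    return [
        sum(q * p for q, p in zip(q_vec, p_vec))
        for q_vec, p_vec in zip(query_vecs, para_vecs)
    ]


# Toy 3-dimensional vectors standing in for encoder outputs (assumed values).
q_vecs = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
p_vecs = [[0.5, 0.5, 4.0], [2.0, 3.0, 1.0]]

scores = pairwise_scores(q_vecs, p_vecs)
print(scores)
```

A higher score means the paragraph vector lies closer to the query vector in the embedding space, which is what makes the score usable for ranking.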
The cross encoder returned by "load_model()" supports the following function:
matching: given a list of queries and paragraphs (and titles), returns their matching scores (the probability that the paragraph is the right answer to the query).
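Because one model type targets retrieval and the other re-ranking, the two can be combined in a retrieve-then-re-rank pipeline: a cheap score selects a small candidate set, and a costlier score re-orders it. The sketch below shows only that control flow. `retrieve_then_rerank` and both scoring functions are hypothetical stand-ins (simple word-overlap heuristics), not RocketQA models.

```python
def retrieve_then_rerank(query, paragraphs, retrieval_score, rerank_score, top_k=2):
    """Keep the top_k paragraphs by retrieval_score, then order them by rerank_score."""
    # Stage 1: score every paragraph with the cheap scorer and keep the best top_k.
    candidates = sorted(
        paragraphs, key=lambda p: retrieval_score(query, p), reverse=True
    )[:top_k]
    # Stage 2: re-order only the surviving candidates with the costlier scorer.
    return sorted(candidates, key=lambda p: rerank_score(query, p), reverse=True)


def overlap(query, para):
    """Stand-in for a dual-encoder score: number of distinct shared lowercase words."""
    return len(set(query.lower().split()) & set(para.lower().split()))


def overlap_ratio(query, para):
    """Stand-in for a cross-encoder score: shared words divided by paragraph length."""
    words = para.lower().split()
    return overlap(query, para) / len(words) if words else 0.0


paragraphs = [
    "the trigeminal nerve is a cranial nerve",
    "trigeminal definition : relating to the trigeminal nerve",
    "paris is the capital of france",
]
ranked = retrieve_then_rerank("trigeminal definition", paragraphs, overlap, overlap_ratio)
for para in ranked:
    print(para)
```

In a real setup, the two callables would be replaced by scores from a dual encoder and a cross encoder respectively; the unrelated third paragraph is dropped in the first stage and never reaches the second.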
Following the examples below, you can retrieve the vector representations of your documents and connect 🚀RocketQA to your own tasks.
To run a RocketQA model, set the model parameter of 'load_model()' to one of the model names returned by 'available_models()'.
import rocketqa
query_list = ["trigeminal definition"]
para_list = [
"Definition of TRIGEMINAL. : of or relating to the trigeminal nerve.ADVERTISEMENT. of or relating to the trigeminal nerve. ADVERTISEMENT."]
# init dual encoder
dual_encoder = rocketqa.load_model(model="v1_marco_de", use_cuda=True, device_id=0, batch_size=16)
# encode query & para
q_embs = dual_encoder.encode_query(query=query_list)
p_embs = dual_encoder.encode_para(para=para_list)
# compute dot product of query representation and para representation
dot_products = dual_encoder.matching(query=query_list, para=para_list)
- August 26, 2021: RocketQA v2 was accepted by EMNLP 2021.
- May 5, 2021: PAIR was accepted by ACL 2021.
- March 11, 2021: RocketQA v1 was accepted by NAACL 2021.
If you find the RocketQA v1 models helpful, feel free to cite our publication, "RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering":
@inproceedings{rocketqa_v1,
title="RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering",
author="Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu and Haifeng Wang",
year="2021",
booktitle = "In Proceedings of NAACL"
}
If you find the PAIR models helpful, feel free to cite our publication, "PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval":
@inproceedings{rocketqa_pair,
title="PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval",
author="Ruiyang Ren, Shangwen Lv, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang and Ji-Rong Wen",
year="2021",
booktitle = "In Proceedings of ACL Findings"
}
If you find the RocketQA v2 models helpful, feel free to cite our publication, "RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking":
@inproceedings{rocketqa_v2,
title="RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking",
author="Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang and Ji-Rong Wen",
year="2021",
booktitle = "In Proceedings of EMNLP"
}
This repository is provided under the Apache-2.0 license.
For help or issues using RocketQA, please submit a GitHub issue.
For other communication or cooperation, please contact Jing Liu (liujing46@baidu.com).