SimXNS is a research project for information retrieval by the MSRA NLC IR team, containing official implementations.

Primary language: Python | License: MIT

SimXNS

✨Updates | 📜Citation | 🤘Furthermore | ❤️Contributing | 📚Trademarks

SimXNS is a research project for information retrieval by the MSRA NLC IR team. Some of the techniques are actively used in Microsoft Bing. This repo provides the official code implementations.

Currently, this repo contains SimANS, MASTER, PROD and LEAD, all of which are designed for information retrieval. The brief descriptions below summarize the characteristics of each work:

  • SimANS is a simple, general and flexible ambiguous negatives sampling method for dense text retrieval. It can be easily applied to various dense retrieval methods such as AR2. It is also used in the Bing search engine, where it has proven effective.
  • MASTER is a multi-task pre-trained model that unifies and integrates multiple pre-training tasks with different learning objectives under the bottlenecked masked autoencoder architecture.
  • PROD is a novel distillation framework for dense retrieval, which consists of a teacher progressive distillation and a data progressive distillation to gradually improve the student.
  • LEAD aligns the layer features of the student and the teacher, placing more emphasis on the informative layers through re-weighting.
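
To make the idea of ambiguous negatives sampling more concrete, here is a minimal sketch. It assumes that an "ambiguous" negative is one whose relevance score is close to the score of the positive passage, and it weights candidates with a Gaussian-shaped function of that score gap. The function names, the hyperparameters `a` and `b`, and the weighting form itself are illustrative assumptions for this sketch, not code taken from this repo; see the SimANS paper and code for the actual method.

```python
import math
import random


def ambiguous_negative_weights(neg_scores, pos_score, a=1.0, b=0.0):
    """Return normalized sampling weights for candidate negatives.

    Assumption for this sketch: a weight peaks when a negative's score is
    close to the positive's score (shifted by b), and a controls how
    sharply the weight falls off as the gap grows.
    """
    raw = [math.exp(-a * (s - pos_score - b) ** 2) for s in neg_scores]
    total = sum(raw)
    if total == 0.0:
        # All weights underflowed to zero; fall back to uniform sampling.
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]


def sample_negatives(neg_ids, neg_scores, pos_score, k, a=1.0, b=0.0, seed=None):
    """Draw k negatives (with replacement) according to the weights above."""
    rng = random.Random(seed)
    weights = ambiguous_negative_weights(neg_scores, pos_score, a, b)
    return rng.choices(neg_ids, weights=weights, k=k)


if __name__ == "__main__":
    # Hypothetical retriever scores for three candidate negatives.
    ids = ["d1", "d2", "d3"]
    scores = [0.1, 0.9, 2.0]
    print(ambiguous_negative_weights(scores, pos_score=1.0))
    print(sample_negatives(ids, scores, pos_score=1.0, k=5, seed=0))
```

In this sketch, candidates scored far below the positive (likely too easy) and far above it (possibly false negatives) both receive low weight, while candidates near the positive's score are sampled most often.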

Updates

  • 2023/07/03: upload the pretrained MASTER checkpoints for MARCO and Wikipedia to the Hugging Face model hub.
  • 2023/07/03: update approaches for downloading resources.
  • 2023/05/29: release the official code of LEAD.
  • 2023/02/16: refine the resources of SimANS by uploading the files separately and providing a file list.
  • 2023/02/02: release the official code of PROD.
  • 2022/12/16: release the official code of MASTER.
  • 2022/11/17: release the official code of SimANS.

Citation

If you extend or use this work, please cite the paper in which it was introduced:

  • SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval. Kun Zhou, Yeyun Gong, Xiao Liu, Wayne Xin Zhao, Yelong Shen, Anlei Dong, Jingwen Lu, Rangan Majumder, Ji-Rong Wen, Nan Duan, Weizhu Chen. EMNLP 2022. Code, Paper.
  • MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers. Kun Zhou, Xiao Liu, Yeyun Gong, Wayne Xin Zhao, Daxin Jiang, Nan Duan, Ji-Rong Wen. ECML-PKDD 2023. Code, Paper.
  • PROD: Progressive Distillation for Dense Retrieval. Zhenghao Lin, Yeyun Gong, Xiao Liu, Hang Zhang, Chen Lin, Anlei Dong, Jian Jiao, Jingwen Lu, Daxin Jiang, Rangan Majumder, Nan Duan. WWW 2023. Code, Paper.
  • LEAD: Liberal Feature-based Distillation for Dense Retrieval. Hao Sun, Xiao Liu, Yeyun Gong, Anlei Dong, Jian Jiao, Jingwen Lu, Yan Zhang, Daxin Jiang, Linjun Yang, Rangan Majumder, Nan Duan. arXiv. Code, Paper.
@inproceedings{zhou2022simans,
   title={SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval},
   author={Kun Zhou and Yeyun Gong and Xiao Liu and Wayne Xin Zhao and Yelong Shen and Anlei Dong and Jingwen Lu and Rangan Majumder and Ji-Rong Wen and Nan Duan and Weizhu Chen},
   booktitle = {{EMNLP}},
   year={2022}
}
@inproceedings{zhou2022master,
   title={MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers},
   author={Kun Zhou and Xiao Liu and Yeyun Gong and Wayne Xin Zhao and Daxin Jiang and Nan Duan and Ji-Rong Wen},
   booktitle = {{ECML-PKDD}},
   year={2023}
}
@inproceedings{lin2023prod,
   title={PROD: Progressive Distillation for Dense Retrieval},
   author={Zhenghao Lin and Yeyun Gong and Xiao Liu and Hang Zhang and Chen Lin and Anlei Dong and Jian Jiao and Jingwen Lu and Daxin Jiang and Rangan Majumder and Nan Duan},
   booktitle = {{WWW}},
   year={2023}
}
@article{sun2022lead,
  title={LEAD: Liberal Feature-based Distillation for Dense Retrieval},
  author={Sun, Hao and Liu, Xiao and Gong, Yeyun and Dong, Anlei and Jiao, Jian and Lu, Jingwen and Zhang, Yan and Jiang, Daxin and Yang, Linjun and Majumder, Rangan and others},
  journal={arXiv preprint arXiv:2212.05225},
  year={2022}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.