
OpenNIR

An end-to-end neural ad-hoc ranking pipeline.

Quick start

OpenNIR requires Python 3.6 (other versions are untested) and Java 11 (for Anserini).

  • OpenNIR can also be run in Docker; you can find instructions here.
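
A quick sanity check of the environment (version output format varies by platform):

python --version   # expect Python 3.6.x
java -version      # expect version 11 (needed by Anserini)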

Install dependencies

pip install -r requirements.txt

Train and validate a model (here, ConvKNRM on ANTIQUE):

scripts/pipeline.sh config/conv_knrm config/antique

(Performance on the test set can be obtained by adding pipeline.test=True)
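
For example:

scripts/pipeline.sh config/conv_knrm config/antique pipeline.test=True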

Grid search for BM25 over ANTIQUE, for comparison with neural model performance:

scripts/pipeline.sh config/grid_search config/antique

(As above, performance on the test set can be obtained by adding pipeline.test=True.)

Models, datasets, and vocabularies will be saved in ~/data/onir/. This can be overridden by setting data_dir=~/some/other/place/ as a command line argument, in a configuration file, or in the ONIR_ARGS environment variable.
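
For example, the following are two equivalent ways to redirect storage (a sketch reusing the placeholder path above):

scripts/pipeline.sh config/conv_knrm config/antique data_dir=~/some/other/place/
ONIR_ARGS=data_dir=~/some/other/place/ scripts/pipeline.sh config/conv_knrm config/antique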

Features

Rankers

  • DRMM ranker=drmm paper
  • Duet (local model) ranker=duetl paper
  • MatchPyramid ranker=matchpyramid paper
  • KNRM ranker=knrm paper
  • PACRR ranker=pacrr paper
  • ConvKNRM ranker=conv_knrm paper
  • Vanilla BERT config/vanilla_bert paper
  • CEDR models config/cedr/[model] paper
  • MatchZoo models source
    • MatchZoo's KNRM ranker=mz_knrm
    • MatchZoo's ConvKNRM ranker=mz_conv_knrm
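
A ranker can be combined with a dataset configuration; for example (a sketch, assuming the ranker= settings above can be passed on the command line like other settings):

scripts/pipeline.sh config/antique ranker=knrm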

Datasets

Evaluation Metrics

  • map (from trec_eval)
  • ndcg (from trec_eval)
  • ndcg@X (from trec_eval, gdeval)
  • p@X (from trec_eval)
  • err@X (from gdeval)
  • mrr (from trec_eval)
  • rprec (from trec_eval)
  • judged@X (implemented in python)
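
Most of these mirror standard trec_eval measures. For reference, they can also be computed directly with trec_eval on a run file (a sketch; qrels.txt and run.txt are hypothetical file names):

trec_eval -m map -m ndcg -m ndcg_cut.20 -m P.10 -m recip_rank -m Rprec qrels.txt run.txt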

Vocabularies

  • Binary term matching vocab=binary (i.e., changes the interaction matrix from cosine similarity to binary indicators)
  • Pretrained word vectors vocab=wordvec
    • vocab.source=fasttext
      • vocab.variant=wiki-news-300d-1M, vocab.variant=crawl-300d-2M
      • (information about FastText variants can be found here)
    • vocab.source=glove
      • vocab.variant=cc-42b-300d, vocab.variant=cc-840b-300d
      • (information about GloVe variants can be found here)
    • vocab.source=convknrm
      • vocab.variant=knrm-bing, vocab.variant=knrm-sogou, vocab.variant=convknrm-bing, vocab.variant=convknrm-sogou
      • (information about ConvKNRM word embedding variants can be found here)
    • vocab.source=bionlp
      • vocab.variant=pubmed-pmc
      • (information about BioNLP variants can be found here)
  • Pretrained word vectors w/ single UNK vector for unknown terms vocab=wordvec_unk
    • (with above word embedding sources)
  • Pretrained word vectors w/ hash-based random selection for unknown terms vocab=wordvec_hash (default)
    • (with above word embedding sources)
  • BERT contextualized embeddings vocab=bert
    • Core models (from HuggingFace): vocab.bert_base=bert-base-uncased (default), vocab.bert_base=bert-large-uncased, vocab.bert_base=bert-base-cased, vocab.bert_base=bert-large-cased, vocab.bert_base=bert-base-multilingual-uncased, vocab.bert_base=bert-base-multilingual-cased, vocab.bert_base=bert-base-chinese, vocab.bert_base=bert-base-german-cased, vocab.bert_base=bert-large-uncased-whole-word-masking, vocab.bert_base=bert-large-cased-whole-word-masking, vocab.bert_base=bert-large-uncased-whole-word-masking-finetuned-squad, vocab.bert_base=bert-large-cased-whole-word-masking-finetuned-squad, vocab.bert_base=bert-base-cased-finetuned-mrpc
    • SciBERT: vocab.bert_base=scibert-scivocab-uncased, vocab.bert_base=scibert-scivocab-cased, vocab.bert_base=scibert-basevocab-uncased, vocab.bert_base=scibert-basevocab-cased
    • BioBERT vocab.bert_base=biobert-pubmed-pmc, vocab.bert_base=biobert-pubmed, vocab.bert_base=biobert-pmc
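
Vocabulary settings combine with the ranker and dataset configurations; for example, a Vanilla BERT run on ANTIQUE using SciBERT weights (a sketch, assuming vocab.bert_base can be overridden on the command line like any other setting):

scripts/pipeline.sh config/vanilla_bert config/antique vocab.bert_base=scibert-scivocab-uncased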

Citing OpenNIR

If you use OpenNIR, please cite the following WSDM demonstration paper:

@InProceedings{macavaney:wsdm2020-onir,
  author = {MacAvaney, Sean},
  title = {{OpenNIR}: A Complete Neural Ad-Hoc Ranking Pipeline},
  booktitle = {{WSDM} 2020},
  year = {2020}
}

Acknowledgements

I gratefully acknowledge support for this work from the ARCS Endowment Fellowship. I thank Andrew Yates, Arman Cohan, Luca Soldaini, Nazli Goharian, and Ophir Frieder for valuable feedback on the manuscript and/or code contributions to OpenNIR.