OpenNIR

An end-to-end neural ad-hoc ranking pipeline.

Quick start

OpenNIR requires Python 3.6 (it has not been tested with other versions) and Java 11 (for Anserini).

OpenNIR can also be run in Docker; you can find instructions here.

Install dependencies:

pip install -r requirements.txt

Train and validate a model (here, ConvKNRM on ANTIQUE):

scripts/pipeline.sh config/conv_knrm config/antique

(Performance on the test set can be obtained by adding pipeline.test=True)
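As a minimal sketch, the test-set run of the ConvKNRM example appends the setting to the same command. Placing `pipeline.test=True` after the two config paths is an assumption about argument order, and the `CMD` variable and the existence guard are only there for illustration, so that the snippet does nothing harmful outside the repository root.

```shell
# Same ConvKNRM/ANTIQUE pipeline as above, with test-set evaluation enabled.
# Assumption: key=value settings follow the config paths on the command line.
CMD="scripts/pipeline.sh config/conv_knrm config/antique pipeline.test=True"
echo "$CMD"

# Only run when called from the OpenNIR repository root.
if [ -x scripts/pipeline.sh ]; then
  $CMD
fi
```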

Grid search for BM25 over ANTIQUE, for comparison with neural model performance:

scripts/pipeline.sh config/grid_search config/antique

(Performance on the test set can be obtained by adding pipeline.test=True)

Models, datasets, and vocabularies will be saved in ~/data/onir/. This can be overridden by setting data_dir=~/some/other/place/ as a command line argument, in a configuration file, or in the ONIR_ARGS environment variable.
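Two of the override routes could look like the sketch below. The directory is a placeholder, and the assumption that ONIR_ARGS accepts the same key=value syntax as the command line is mine, not something this README states.

```shell
# Hypothetical alternate location for models, datasets, and vocabularies.
DATA_DIR="$HOME/some/other/place/"

# Option 1: pass data_dir as a command line argument (from the repository root).
if [ -x scripts/pipeline.sh ]; then
  scripts/pipeline.sh config/conv_knrm config/antique "data_dir=$DATA_DIR"
fi

# Option 2: set it once in the environment for later runs.
# Assumption: ONIR_ARGS uses the same key=value form as the command line.
export ONIR_ARGS="data_dir=$DATA_DIR"
```

Using `$HOME` rather than a quoted `~` avoids relying on tilde expansion, which the shell does not perform inside double quotes.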

Features

Rankers

DRMM ranker=drmm paper
Duet (local model) ranker=duetl paper
MatchPyramid ranker=matchpyramid paper
KNRM ranker=knrm paper
PACRR ranker=pacrr paper
ConvKNRM ranker=conv_knrm paper
Vanilla BERT config/vanilla_bert paper
CEDR models config/cedr/[model] paper
MatchZoo models source
- MatchZoo's KNRM ranker=mz_knrm
- MatchZoo's ConvKNRM ranker=mz_conv_knrm

Datasets

TREC Robust 2004 config/robust/fold[x]
MS-MARCO config/msmarco
ANTIQUE config/antique
TREC CAR config/car
New York Times config/nyt -- for content-based weak supervision
TREC Arabic, Mandarin, and Spanish config/multiling/* -- for zero-shot multilingual transfer learning (instructions)

Evaluation Metrics

map (from trec_eval)
ndcg (from trec_eval)
ndcg@X (from trec_eval, gdeval)
p@X (from trec_eval)
err@X (from gdeval)
mrr (from trec_eval)
rprec (from trec_eval)
judged@X (implemented in python)

Vocabularies

Binary term matching vocab=binary (i.e., changes the interaction matrix from cosine similarity to binary indicators)
Pretrained word vectors vocab=wordvec
- vocab.source=fasttext
  - vocab.variant=wiki-news-300d-1M, vocab.variant=crawl-300d-2M
  - (information about FastText variants can be found here)
- vocab.source=glove
  - vocab.variant=cc-42b-300d, vocab.variant=cc-840b-300d
  - (information about GloVe variants can be found here)
- vocab.source=convknrm
  - vocab.variant=knrm-bing, vocab.variant=knrm-sogou, vocab.variant=convknrm-bing, vocab.variant=convknrm-sogou
  - (information about ConvKNRM word embedding variants can be found here)
- vocab.source=bionlp
  - vocab.variant=pubmed-pmc
  - (information about BioNLP variants can be found here)
Pretrained word vectors w/ single UNK vector for unknown terms vocab=wordvec_unk
- (with above word embedding sources)
Pretrained word vectors w/ hash-based random selection for unknown terms vocab=wordvec_hash (default)
- (with above word embedding sources)
BERT contextualized embeddings vocab=bert
- Core models (from HuggingFace): vocab.bert_base=bert-base-uncased (default), vocab.bert_base=bert-large-uncased, vocab.bert_base=bert-base-cased, vocab.bert_base=bert-large-cased, vocab.bert_base=bert-base-multilingual-uncased, vocab.bert_base=bert-base-multilingual-cased, vocab.bert_base=bert-base-chinese, vocab.bert_base=bert-base-german-cased, vocab.bert_base=bert-large-uncased-whole-word-masking, vocab.bert_base=bert-large-cased-whole-word-masking, vocab.bert_base=bert-large-uncased-whole-word-masking-finetuned-squad, vocab.bert_base=bert-large-cased-whole-word-masking-finetuned-squad, vocab.bert_base=bert-base-cased-finetuned-mrpc
- SciBERT: vocab.bert_base=scibert-scivocab-uncased, vocab.bert_base=scibert-scivocab-cased, vocab.bert_base=scibert-basevocab-uncased, vocab.bert_base=scibert-basevocab-cased
- BioBERT: vocab.bert_base=biobert-pubmed-pmc, vocab.bert_base=biobert-pubmed, vocab.bert_base=biobert-pmc
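To illustrate how the vocabulary settings combine, the sketch below selects hashed-UNK GloVe vectors for the quick-start ConvKNRM run. All values come from the lists above, but passing them on the command line after the config paths, and having them take precedence over anything set in the config files, are assumptions. The `CMD` variable and the guard are illustrative only.

```shell
# Vocabulary type, embedding source, and variant, all taken from the lists above.
# Assumption: these key=value settings can be appended to the pipeline command.
VOCAB_ARGS="vocab=wordvec_hash vocab.source=glove vocab.variant=cc-42b-300d"
CMD="scripts/pipeline.sh config/conv_knrm config/antique $VOCAB_ARGS"
echo "$CMD"

# Only run when called from the OpenNIR repository root.
if [ -x scripts/pipeline.sh ]; then
  $CMD
fi
```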

Citing OpenNIR

If you use OpenNIR, please cite the following WSDM demonstration paper:

@InProceedings{macavaney:wsdm2020-onir,
  author = {MacAvaney, Sean},
  title = {{OpenNIR}: A Complete Neural Ad-Hoc Ranking Pipeline},
  booktitle = {{WSDM} 2020},
  year = {2020}
}

Acknowledgements

I gratefully acknowledge support for this work from the ARCS Endowment Fellowship. I thank Andrew Yates, Arman Cohan, Luca Soldaini, Nazli Goharian, and Ophir Frieder for valuable feedback on the manuscript and/or code contributions to OpenNIR.
