terrier-org/pyterrier

Metrics are all zero with webis-touche2020 dataset

Closed this issue · 4 comments

Describe the bug

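import pyterrier as pt

pt.init(version='5.7', helper_version='0.0.7')
dataset = pt.get_dataset('irds:beir/webis-touche2020/v2')
# `index` is the Terrier index built from this dataset's corpus (indexing step not shown)
retriever = pt.BatchRetrieve(index, wmodel='BM25', metadata=['docno','text'])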

All metrics are zero when using BM25 on the webis-touche2020 dataset. The same code works well for other datasets, such as beir/TREC-COVID.

To Reproduce
Steps to reproduce the behavior:

  1. Which index - irds:beir/webis-touche2020/v2
  2. Which retrieval - BM25 with default parameters
  3. What pipeline - just the BM25 retriever
  4. What was the dataframe output - it contains the retrieved results, but all metrics are zero.

Hi @Jia-py -- thanks for reporting.

Terrier indexes have a maximum length for the fields that they store, which includes the docno. The default of 20 is often enough, but some datasets (such as touche) have longer docnos.
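
To make the symptom concrete, here is a minimal sketch in plain Python (no Terrier involved). It assumes that a docno longer than the stored length ends up cut to its first 20 characters; the docnos and the `stored_docno` helper are made up for illustration. Under that assumption, the long docnos coming back from the index no longer match the ones in the qrels, so they never count as relevant.

```python
# Illustrative only: the docnos are made up, and "keep the first max_len
# characters" is an assumption about how an over-long docno ends up stored.
MAX_LEN = 20  # the default docno length mentioned above

# docnos as they appear in the qrels (hypothetical examples)
qrels_docnos = {"c67482ba-2019-04-18T13:32:05Z-00000-000", "short-doc-1"}


def stored_docno(docno, max_len=MAX_LEN):
    """Return the docno as it would look if cut to max_len characters."""
    return docno[:max_len]


# What a retriever would hand back if the index only kept MAX_LEN characters
retrieved = [stored_docno(d) for d in sorted(qrels_docnos)]

# Only docnos that survived unchanged can still be matched against the qrels
matches = [d for d in retrieved if d in qrels_docnos]

print(retrieved)
print(matches)
```

The short docno still matches, while the 39-character one does not. That would be consistent with datasets that have short docnos evaluating fine while this one scores zero.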

To change the maximum length, you'll need to set meta={"docno": 39} when indexing, as follows (the longest docno in this dataset is 39 characters):

indexer = pt.IterDictIndexer('./indices/beir_webis-touche2020_v2', meta={"docno": 39})
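
If you'd rather not hard-code 39, one option is to measure the longest docno before indexing. The sketch below is only that, a sketch: `max_docno_length` is a hypothetical helper and `sample_docs` is a made-up stand-in for the real corpus.

```python
def max_docno_length(docs, field="docno"):
    """Return the length of the longest value of `field` across docs.

    `docs` is any iterable of dicts, e.g. a corpus iterator.
    """
    return max(len(str(doc[field])) for doc in docs)


# Made-up stand-in for the real corpus; each document is a dict with a docno
sample_docs = [
    {"docno": "c67482ba-2019-04-18T13:32:05Z-00000-000", "text": "first document"},
    {"docno": "short-doc-1", "text": "second document"},
]

docno_len = max_docno_length(sample_docs)
print(docno_len)
```

With the real dataset, you would pass the corpus iterator (for example dataset.get_corpus_iter(), assuming it yields dicts with a docno field) instead of sample_docs, and then use the result as meta={"docno": docno_len} when creating the indexer.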

I hope this helps!

Hi @seanmacavaney, thanks for your time and help! It works well now.

No problem, happy to help :)