Metrics are all zero with webis-touche2020 dataset
Closed this issue · 4 comments
Jia-py commented
Describe the bug
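import pyterrier as pt
pt.init(version=5.7, helper_version="0.0.7")
dataset = pt.get_dataset('irds:beir/webis-touche2020/v2')
# `index` is the Terrier index built from this dataset (indexing step not shown)
retriever = pt.BatchRetrieve(index, wmodel='BM25', metadata=['docno','text'])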
All metrics are zero when using BM25 on the webis-touche2020 dataset. The same code works well for other datasets, such as beir/TREC-COVID.
To Reproduce
Steps to reproduce the behavior:
- Which index - irds:beir/webis-touche2020/v2
- Which retrieval - bm25 with default parameters
- What pipeline - just the bm25 retriever
- What was the dataframe output - it has the retrieved results, but with all metrics zero.
Jia-py commented
seanmacavaney commented
Hi @Jia-py -- thanks for reporting.
Terrier indexes have a maximum length for the fields that they store, which includes the docno. The default of 20 is often enough, but some datasets (such as touche) have longer docnos.
To change the maximum length, you'll need to set meta={"docno": 39} when indexing, as follows (the maximum docno is 39 characters in the dataset):
indexer = pt.IterDictIndexer('./indices/beir_webis-touche2020_v2', meta={"docno": 39})
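For readers who want to see why a too-short docno field leads to all-zero metrics, here is a minimal pure-Python sketch. It assumes that stored docnos are cut off at the configured length, so they no longer match the IDs in the qrels. The docnos below are made up for illustration, and `max_docno_length` and `stored_docnos` are hypothetical helpers, not part of PyTerrier.

```python
DEFAULT_DOCNO_LIMIT = 20  # the default docno length mentioned above


def max_docno_length(docs):
    """Return the length of the longest docno in an iterable of dicts."""
    return max(len(doc["docno"]) for doc in docs)


def stored_docnos(docs, limit):
    """Simulate an index that keeps only the first `limit` characters of each docno."""
    return {doc["docno"][:limit] for doc in docs}


# Made-up documents with long docnos (illustrative only, not real dataset entries).
docs = [
    {"docno": "aaaaaaaa-2019-04-18T13:32:05Z-00000-000", "text": "first"},
    {"docno": "bbbbbbbb-2019-04-18T13:32:05Z-00001-000", "text": "second"},
]
# The qrels refer to documents by their full, untruncated docnos.
qrels_docnos = {doc["docno"] for doc in docs}

needed = max_docno_length(docs)
matches_default = stored_docnos(docs, DEFAULT_DOCNO_LIMIT) & qrels_docnos
matches_fixed = stored_docnos(docs, needed) & qrels_docnos

print(needed)                # 39
print(len(matches_default))  # 0: truncated docnos match nothing in the qrels
print(len(matches_fixed))    # 2: full-length docnos match again
```

On the real dataset, the same idea would probably be to pass something like `dataset.get_corpus_iter()` to `max_docno_length` and use the result as the `docno` value in `meta`.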
I hope this helps!
Jia-py commented
Hi @seanmacavaney, thanks for your time and help! It works well now.
seanmacavaney commented
No problem, happy to help :)