facebookresearch/DPR

mismatch between encoded results and wiki passages

Hannibal046 opened this issue · 0 comments

Hi, thanks so much for the great work. I have a question about the sizes of the Wikipedia passage set and the encoded index. After downloading the data as instructed, I found that the number of encoded embeddings doesn't match the number of passages:

import csv
import pickle

# Count embeddings across all 50 encoded index shards.
n_embedding = 0
for idx in range(50):
    index_path = f"DPR/dpr/downloads/data/retriever_results/nq/single/wikipedia_passages_{idx}.pkl"
    with open(index_path, "rb") as f:
        n_embedding += len(pickle.load(f))

# Count passages in the Wikipedia split, skipping the header row.
n_doc = 0
wikidata_path = "DPR/dpr/downloads/data/wikipedia_split/psgs_w100.tsv"
with open(wikidata_path) as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        if row[0] == "id":
            continue
        n_doc += 1

print("n_embedding=", n_embedding)
print("n_doc=", n_doc)

The results are:

n_embedding= 21015300
n_doc= 21015324
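
So 24 of the 21,015,324 passages have no embedding. One way to pinpoint which ones is to diff the ID sets from the two sources. The sketch below demonstrates this on toy stand-in files (the temp paths and toy data are hypothetical); it assumes each pickle shard holds a list of (passage_id, vector) pairs, which matches what DPR's embedding generation script writes, and that the TSV has an `id` header row as above.

```python
import csv
import pickle
import tempfile
from pathlib import Path

# Toy stand-ins for the real shards and psgs_w100.tsv (hypothetical data).
tmp = Path(tempfile.mkdtemp())

# Two toy shards; passage id "4" is deliberately never encoded.
# Assumption: each shard is a pickled list of (passage_id, vector) pairs.
for idx, ids in enumerate([["1", "2"], ["3"]]):
    with open(tmp / f"wikipedia_passages_{idx}.pkl", "wb") as f:
        pickle.dump([(pid, [0.0]) for pid in ids], f)

# Toy TSV with a header row and four passages.
with open(tmp / "psgs_w100.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["id", "text", "title"])
    for pid in ["1", "2", "3", "4"]:
        writer.writerow([pid, f"passage {pid}", f"title {pid}"])

# Collect every encoded passage ID from the shards.
encoded_ids = set()
for idx in range(2):
    with open(tmp / f"wikipedia_passages_{idx}.pkl", "rb") as f:
        for pid, _vec in pickle.load(f):
            encoded_ids.add(pid)

# Collect every passage ID from the TSV, skipping the header.
doc_ids = set()
with open(tmp / "psgs_w100.tsv") as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        if row[0] == "id":
            continue
        doc_ids.add(row[0])

missing = doc_ids - encoded_ids
print("missing ids:", sorted(missing))  # → missing ids: ['4']
```

Pointing the two loops at the real shard and TSV paths from the script above would list the 24 missing IDs, which should show whether specific passages were dropped during encoding.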