mismatch between encoded results and wiki passages
Hannibal046 opened this issue · 0 comments
Hi, thanks so much for the great work. I have a question about the number of wiki passages versus the number of entries in the encoded index. After downloading the data as instructed, I found that the two counts don't match:
import csv
import pickle

# Count the entries across the 50 encoded index shards.
n_embedding = 0
for idx in range(50):
    index_path = f"DPR/dpr/downloads/data/retriever_results/nq/single/wikipedia_passages_{idx}.pkl"
    with open(index_path, "rb") as f:
        data = pickle.load(f)
    n_embedding += len(data)

# Count the passages in the Wikipedia split, skipping the header row.
n_doc = 0
wikidata_path = "DPR/dpr/downloads/data/wikipedia_split/psgs_w100.tsv"
with open(wikidata_path) as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        if row[0] == "id":
            continue
        n_doc += 1

print("n_embedding=", n_embedding)
print("n_doc=", n_doc)
The results are:
n_embedding= 21015300
n_doc= 21015324
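That is a gap of 24 passages. To narrow it down, it might help to compare the ids themselves rather than only the counts. Below is a rough sketch, not something taken from the DPR code: the helper names (`embedded_ids`, `passage_ids`, `compare_ids`) are made up for illustration, and it assumes each pickled shard is a list of `(passage_id, vector)` pairs, which I have not verified for these files.

```python
import csv
import pickle


def embedded_ids(shard_paths):
    """Collect passage ids from the pickled shards.

    Assumption: each shard is a list of (passage_id, vector) pairs.
    """
    ids = set()
    for path in shard_paths:
        with open(path, "rb") as f:
            for entry in pickle.load(f):
                ids.add(str(entry[0]))
    return ids


def passage_ids(tsv_path):
    """Collect passage ids from the TSV file, skipping the header row."""
    ids = set()
    with open(tsv_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row and row[0] != "id":
                ids.add(row[0])
    return ids


def compare_ids(shard_paths, tsv_path):
    """Return (ids only in the TSV, ids only in the shards), both sorted."""
    emb = embedded_ids(shard_paths)
    docs = passage_ids(tsv_path)
    return sorted(docs - emb), sorted(emb - docs)
```

Calling `compare_ids` with the 50 shard paths and the `psgs_w100.tsv` path from the snippet above should list which passages have no embedding. Holding roughly 21 million ids in two sets takes a fair amount of memory, and if the shard ids are formatted differently from the TSV ids (for example with a prefix), they would need to be normalized first.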