Wrong number of items
Closed this issue · 1 comment
celsofranssa commented
After reading all the items (from the 34 categories), I just found 35.39M instead of the reported 48.19M.
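import json
from tqdm import tqdm

# `categories` is the list of the 34 category names (defined elsewhere)
items = {}
for category in tqdm(categories, desc="Reading metadata"):
    with open(f"resource/dataset/Amazon-Reviews-2023/raw/meta_{category}.jsonl", 'r') as category_file:
        for line in category_file:
            item = json.loads(line.strip())
            # key by parent_asin so each item is counted once
            items[item["parent_asin"]] = item["title"]
len(items)
# 35393189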
Did I miss something?
hyp1231 commented
Not all items in the dataset have associated metadata. The total item count is based on the number of distinct parent_asin values in the user-item reviews. As a result, the dataset includes 48.19M items in total, of which 35.39M have metadata.
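For anyone who wants to check the two numbers locally, here is a minimal sketch of the comparison described above. It is not the maintainers' counting script. It assumes the review files are JSON Lines files whose records carry a parent_asin field, like the metadata files; the helper names and the file paths passed in are hypothetical.

```python
import json


def distinct_parent_asins(jsonl_path):
    """Collect the distinct parent_asin values found in one JSON Lines file."""
    asins = set()
    with open(jsonl_path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            asins.add(json.loads(line)["parent_asin"])
    return asins


def compare_counts(review_paths, meta_paths):
    """Compare items seen in reviews with items that have metadata.

    Both arguments are lists of file paths (hypothetical layout; adapt
    them to wherever the raw files are stored).
    """
    reviewed = set()
    for path in review_paths:
        reviewed |= distinct_parent_asins(path)

    with_meta = set()
    for path in meta_paths:
        with_meta |= distinct_parent_asins(path)

    return {
        "items_in_reviews": len(reviewed),
        "items_with_metadata": len(with_meta),
        "reviewed_without_metadata": len(reviewed - with_meta),
    }
```

Calling compare_counts with the review files and the meta files of the same categories returns the three counts. If the explanation above holds, items_in_reviews should be the larger number, and reviewed_without_metadata shows how many reviewed items have no metadata record. Note that holding tens of millions of IDs in Python sets takes several GB of memory.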