hyp1231/AmazonReviews2023

Wrong number of items

After reading the metadata for all items (across the 34 categories), I found only 35.39M items instead of the reported 48.19M.

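import json
from tqdm import tqdm

# `categories` is the list of the 34 category names (defined elsewhere)
items = {}  # parent_asin -> title, so repeated parent_asin values are counted once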
for category in tqdm(categories, desc="Reading metadata"):
    with open(f"resource/dataset/Amazon-Reviews-2023/raw/meta_{category}.jsonl", 'r') as category_file:
        for line in category_file:
            item = json.loads(line.strip())
            items[item["parent_asin"]] = item["title"]
len(items)
# 35393189

Did I miss something?

Not all items in the dataset have associated metadata. The total item count is the number of distinct parent_asin values in the user-item reviews, not the number of metadata records. As a result, the dataset includes 48.19M items in total, of which 35.39M have metadata.
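
To check this locally, one option is to count distinct parent_asin values in the review files and compare them with the metadata files. The following is a minimal sketch, not the maintainers' counting script: the helper names `distinct_parent_asins` and `compare_item_counts` are made up for illustration, and it assumes each review record is a JSON line with a `parent_asin` field, just like the metadata records read above.

```python
import json


def distinct_parent_asins(jsonl_paths):
    """Collect the distinct parent_asin values found across JSONL files."""
    asins = set()
    for path in jsonl_paths:
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                asin = record.get("parent_asin")
                if asin is not None:
                    asins.add(asin)
    return asins


def compare_item_counts(review_paths, meta_paths):
    """Count items seen in reviews, items with metadata, and the gap between them."""
    reviewed = distinct_parent_asins(review_paths)
    with_meta = distinct_parent_asins(meta_paths)
    return {
        "items_in_reviews": len(reviewed),
        "items_with_metadata": len(with_meta),
        "reviewed_without_metadata": len(reviewed - with_meta),
    }
```

Called with the per-category review files and the `meta_{category}.jsonl` files, `items_in_reviews` is the figure that corresponds to the reported total. The location and naming of the review files depend on how the dataset was downloaded, so those paths are left to the caller. Holding tens of millions of ASIN strings in a set needs several gigabytes of memory.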