reczoo/BARS

About the TaobaoAd_x1 dataset

Closed this issue · 3 comments

lemyx commented

TaobaoAd_x1
Dataset description

Taobao is a dataset provided by Alibaba, which contains 8 days of ad click-through data (26 million records) randomly sampled from 1,140,000 users. Following the original data split, we use the first 7 days (i.e., 20170506-20170512) of samples for training and the last day's samples (i.e., 20170513) for testing. We follow the preprocessing steps applied in reproducing the DMR work. Note that a small part (~5%) of the samples has been dropped during preprocessing due to missing user or item profiles. The preprocessed data can be accessed from the BARS benchmark.
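For reference, the temporal split above amounts to cutting the log at the 20170513 boundary. A minimal sketch, assuming the raw log layout of the public Taobao ad dataset (a raw_sample.csv file with a unix time_stamp column); the actual BARS preprocessing script may differ:

```python
import pandas as pd

# Sketch of the temporal split described above (file/column names are
# assumptions based on the public Taobao ad dataset, not the BARS code).
df = pd.read_csv("raw_sample.csv")
df["date"] = pd.to_datetime(df["time_stamp"], unit="s").dt.strftime("%Y%m%d")

train = df[df["date"] <= "20170512"]  # first 7 days: 20170506-20170512
test = df[df["date"] == "20170513"]   # last day held out for testing
```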

The description above states: "a small part (~5%) of samples have been dropped during preprocessing due to missing user or item profiles."

Could you explain how these samples were removed? Was any new logic introduced on top of the code provided for DMR data preprocessing? Thanks!

This dropping comes from the original data preprocessing logic. I also noticed that the number of samples did not match after processing the data; after checking their code, I found that samples whose user/item profiles could not be joined were removed.
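A minimal sketch of how such samples disappear during an inner join, assuming the raw Taobao ad files (raw_sample.csv, user_profile.csv, ad_feature.csv) and their public column names; this is an illustration, not the exact DMR preprocessing code:

```python
import pandas as pd

# Illustration (assumed file/column names): inner joins silently drop
# click logs whose user or ad profile is missing, which is where the
# ~5% of samples go.
samples = pd.read_csv("raw_sample.csv")   # click logs
users = pd.read_csv("user_profile.csv")   # user profiles
ads = pd.read_csv("ad_feature.csv")       # ad/item profiles

merged = (
    samples
    .merge(users, left_on="user", right_on="userid", how="inner")
    .merge(ads, on="adgroup_id", how="inner")
)
print(f"dropped {len(samples) - len(merged)} samples without profiles")
```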

lemyx commented

Thanks for the reply! I noticed that the author of the DMR reproduction generated train_sorted.csv following the idea of "sort by timestamp to prevent future-information leakage". I have two questions about this, thanks!

  1. Regardless of whether train.csv is sorted, it is shuffled in the train_loader via shuffle=True, so is it necessary to generate train_sorted.csv at all?
  2. For the same user_id, ordered by timestamp, the training set may contain two samples (train_timestamp_1, train_timestamp_2) with test_timestamp > train_timestamp_1 > train_timestamp_2. Could this cause the ground truth of the train_timestamp_1 sample to leak to the train_timestamp_2 sample?

  1. The original DMR setup trains with only a single pass over the data and reads batches in temporal order, which is why they sorted the file. In my reproduction I used shuffling, and the results were better than their reproduction (see the sketch after this list).
  2. The train/test split is a single cut by timestamp, so overall there is no future-information leakage across the split. During training, reading batches does allow some leakage between different training samples, but the industry has not paid much attention to this issue, and it does not seem to cause big problems in production.
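A minimal sketch of the difference discussed in point 1, using plain PyTorch with placeholder tensors (the actual BARS/FuxiCTR data pipeline differs):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for the preprocessed TaobaoAd_x1 features.
features = torch.randn(1000, 16)
labels = torch.randint(0, 2, (1000,)).float()
train_set = TensorDataset(features, labels)

# DMR-style: a single pass over the time-sorted data, read in order.
ordered_loader = DataLoader(train_set, batch_size=256, shuffle=False)

# BARS-style reproduction: shuffle=True redraws the batch order every
# epoch, so any pre-sorting of train.csv has no effect on training order.
shuffled_loader = DataLoader(train_set, batch_size=256, shuffle=True)
```

With shuffle=True, the sorted file only determines the train/test cut itself, not the order in which the model sees the training samples.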