reczoo/BARS

About the TaobaoAd_x1 dataset

Closed this issue · 3 comments

lemyx commented

TaobaoAd_x1
Dataset description

Taobao is a dataset provided by Alibaba, which contains 8 days of ad click-through data (26 million records) randomly sampled from 1,140,000 users. Following the original data split, we use the first 7 days (i.e., 20170506-20170512) of samples for training and the last day's samples (i.e., 20170513) for testing. We follow the preprocessing steps applied in reproducing the DMR work. Note that a small part (~5%) of the samples has been dropped during preprocessing due to missing user or item profiles. The preprocessed data can be accessed from the BARS benchmark.
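For reference, the temporal split above amounts to cutting the log at the 20170513 boundary. A minimal sketch, assuming the raw log layout of the public Taobao ad dataset (a raw_sample.csv file with a unix time_stamp column); the actual BARS preprocessing script may differ:

```python
import pandas as pd

# Sketch of the temporal split described above (file/column names are
# assumptions based on the public Taobao ad dataset, not the BARS code).
df = pd.read_csv("raw_sample.csv")
df["date"] = pd.to_datetime(df["time_stamp"], unit="s").dt.strftime("%Y%m%d")

train = df[df["date"] <= "20170512"]  # first 7 days: 20170506-20170512
test = df[df["date"] == "20170513"]   # last day held out for testing
```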

The description above states: "a small part (~5%) of samples have been dropped during preprocessing due to missing user or item profiles."

Could you explain how these samples were removed? Was any new logic introduced on top of the code provided for DMR data preprocessing? Thanks!

This dropping comes from the original data preprocessing logic. I also noticed that the number of samples did not match after processing the data; after checking their code, I found that samples whose user/item profiles could not be joined were removed.
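A minimal sketch of how such samples disappear during an inner join, assuming the raw Taobao ad files (raw_sample.csv, user_profile.csv, ad_feature.csv) and their public column names; this is an illustration, not the exact DMR preprocessing code:

```python
import pandas as pd

# Illustration (assumed file/column names): inner joins silently drop
# click logs whose user or ad profile is missing, which is where the
# ~5% of samples go.
samples = pd.read_csv("raw_sample.csv")   # click logs
users = pd.read_csv("user_profile.csv")   # user profiles
ads = pd.read_csv("ad_feature.csv")       # ad/item profiles

merged = (
    samples
    .merge(users, left_on="user", right_on="userid", how="inner")
    .merge(ads, on="adgroup_id", how="inner")
)
print(f"dropped {len(samples) - len(merged)} samples without profiles")
```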

lemyx commented

Thanks for the reply! I noticed that the author of the DMR reproduction generated train_sorted.csv following the idea of "sort by timestamp to prevent future-information leakage". I have two questions about this, thanks!

  1. Regardless of whether train.csv is sorted, it is shuffled in the train_loader via shuffle=True, so is it necessary to generate train_sorted.csv at all?
  2. For the same user_id, ordered by timestamp, the training set may contain two samples (train_timestamp_1, train_timestamp_2) with test_timestamp > train_timestamp_1 > train_timestamp_2. Could this cause the ground truth of the train_timestamp_1 sample to leak to the train_timestamp_2 sample?

  1. The original DMR setup trains with only a single pass over the data and reads batches in temporal order, which is why they sorted the file. In my reproduction I used shuffling, and the results were better than their reproduction (see the sketch after this list).
  2. The train/test split is a single cut by timestamp, so overall there is no future-information leakage across the split. During training, reading batches does allow some leakage between different training samples, but the industry has not paid much attention to this issue, and it does not seem to cause big problems in production.
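A minimal sketch of the difference discussed in point 1, using plain PyTorch with placeholder tensors (the actual BARS/FuxiCTR data pipeline differs):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for the preprocessed TaobaoAd_x1 features.
features = torch.randn(1000, 16)
labels = torch.randint(0, 2, (1000,)).float()
train_set = TensorDataset(features, labels)

# DMR-style: a single pass over the time-sorted data, read in order.
ordered_loader = DataLoader(train_set, batch_size=256, shuffle=False)

# BARS-style reproduction: shuffle=True redraws the batch order every
# epoch, so any pre-sorting of train.csv has no effect on training order.
shuffled_loader = DataLoader(train_set, batch_size=256, shuffle=True)
```

With shuffle=True, the sorted file only determines the train/test cut itself, not the order in which the model sees the training samples.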