data process
Closed this issue · 6 comments
Hi, in the data processing part, why is the data balanced? Is there a particular reason for it?
Within an impression, positive examples are usually far fewer than negative ones (that is, most of the candidate news in an impression are not clicked).
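Without negative sampling, this is an ordinary binary classification problem: every candidate news in every impression produces one training sample, so the positives end up far fewer than the negatives. See https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18.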
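To make the imbalance concrete, here is a minimal sketch of the pointwise (no negative sampling) case. It assumes an impression is given as a space-separated string of `newsID-label` items, as in the MIND `behaviors.tsv` file; the function name `pointwise_samples` and the example impression are made up for illustration and are not taken from this repo.

```python
from collections import Counter


def pointwise_samples(impression):
    """Turn 'N1-0 N2-1 ...' into a list of (news_id, label) training samples."""
    samples = []
    for item in impression.split():
        # Split on the last '-' so the label is always the trailing 0/1.
        news_id, label = item.rsplit("-", 1)
        samples.append((news_id, int(label)))
    return samples


# Hypothetical impression: six candidates, only one of them clicked.
impression = "N1-0 N2-0 N3-1 N4-0 N5-0 N6-0"
samples = pointwise_samples(impression)

# Every candidate becomes its own sample, so negatives dominate.
label_counts = Counter(label for _, label in samples)
print(label_counts)
```

Each candidate becomes one binary sample, so the class ratio of the training set simply mirrors the click ratio inside the impressions.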
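With negative sampling (which all the model code in this repo uses), one positive example and K negative examples are treated as a single pair, and the loss expression reflects how well the whole pair matches. In this setting, balance means that, among the candidate news of an impression, each positive is matched with K negatives and the leftover negatives are discarded. You can pick one of the papers behind the models in this repo to read about this. If I remember correctly, every paper except the DKN one describes negative sampling.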
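The balancing step described above can be sketched roughly as follows. This is not the repo's actual preprocessing code: the function name `balance_impression`, the `newsID-label` input format (borrowed from MIND `behaviors.tsv`), and the choice to skip a positive when fewer than K negatives remain are all assumptions made for illustration.

```python
import random


def balance_impression(impression, k, seed=None):
    """Match each positive with k negatives; drop the negatives left over.

    Returns a list of (positive_id, [negative_ids]) groups.
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    for item in impression.split():
        news_id, label = item.rsplit("-", 1)
        (positives if label == "1" else negatives).append(news_id)

    # Shuffle so that the negatives given to each positive are random.
    rng.shuffle(negatives)

    groups = []
    for pos in positives:
        if len(negatives) < k:
            # Assumption: positives that cannot get k negatives are skipped.
            break
        groups.append((pos, negatives[:k]))
        negatives = negatives[k:]
    # Whatever remains in `negatives` here is simply discarded.
    return groups


# Hypothetical impression: one click among six candidates, K = 2.
groups = balance_impression("N1-0 N2-0 N3-1 N4-0 N5-0 N6-0", k=2, seed=0)
print(groups)
```

With one positive and five negatives, the result is a single group of one positive plus two negatives; the other three negatives never reach training.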
Thanks for the reply. I'll take another look at the papers. Thank you 🙏
I plan to study your code and add pretrained models on top of it, since I see the embeddings still use GloVe. Also, this is a really nice project. Thanks again 🙏
I just looked at the news recommendation code in recommenders, and its source code handles negative sampling as well.
@yusanshi can you please share the pretrained weights for this model? One more thing: please let me know which config you used for training and evaluation.
Thanks
can you please share pretrained weight for this model
Sorry, but I don't actually have them. Months ago, I trained and tested all the methods on the MIND small dataset and posted the results and checkpoint links in README.md. However, I have since made some small changes to the code and started using the MIND large dataset, so I removed the outdated results. I haven't yet trained and tested on the MIND large dataset.
let me know which config you used before training and evaluation
Just those in src/config.py.