yusanshi/news-recommendation

data process


Hi, in the data processing part, why is the data balanced? Is there a particular reason for it?

Within a single impression, positive examples are usually far fewer than negative ones (that is, most of the candidate news in an impression are not clicked).

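Without negative sampling, this is an ordinary binary classification problem: every candidate news in every impression becomes one training sample, so the final dataset has far fewer positives than negatives. For background, see https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18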
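To make the imbalance concrete, here is a minimal sketch that counts the labels produced when every candidate becomes its own sample. The two impression strings are made up for illustration, and the `newsID-label` layout is an assumption modeled on MIND-style impression logs; `count_labels` is a hypothetical helper, not something from this repo.

```python
from collections import Counter

# Hypothetical impressions, assuming a "newsID-label" layout
# (1 = clicked, 0 = not clicked). Not real data.
impressions = [
    "N1-0 N2-1 N3-0 N4-0 N5-0",
    "N6-0 N7-0 N8-1 N9-0",
]


def count_labels(impressions):
    """Count positive and negative samples when every candidate is one sample."""
    counts = Counter()
    for impression in impressions:
        for candidate in impression.split():
            # Split on the last "-" so news IDs containing "-" still parse.
            _, label = candidate.rsplit("-", 1)
            counts[int(label)] += 1
    return counts


counts = count_labels(impressions)
print(f"positives: {counts[1]}, negatives: {counts[0]}")
```

Even in this tiny example the negatives outnumber the positives several times over, which is the situation the linked article deals with.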

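With negative sampling (which all the model code in this repo uses), one positive and K negatives are treated as a single "pair", and the loss expresses how well the whole pair is matched. In this setting, balance means that, among the candidate news of one impression, each positive is matched with K negatives and the surplus negatives are discarded. For details, pick the paper of any model in this repo and read that part. If I remember correctly, every paper except DKN describes negative sampling.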
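The idea can be sketched roughly as follows. This is not the code from this repo: `balance_impression` and `group_loss` are hypothetical names, sampling without replacement and skipping positives that cannot get K negatives are assumptions, and the loss shown (softmax cross-entropy with the positive at index 0) is one common formulation for a group of 1 + K candidates rather than a confirmed detail of any specific model here.

```python
import math
import random


def balance_impression(clicked, not_clicked, k, rng):
    """Match each clicked news with k negatives; drop the surplus negatives.

    Assumption: negatives are drawn without replacement, and a positive
    that cannot get k negatives is skipped.
    """
    pool = list(not_clicked)
    rng.shuffle(pool)
    groups = []
    for positive in clicked:
        if len(pool) < k:
            break
        negatives, pool = pool[:k], pool[k:]
        groups.append((positive, negatives))
    # Whatever is left in `pool` is simply discarded.
    return groups


def group_loss(scores):
    """Softmax cross-entropy for one group; scores[0] belongs to the positive."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[0]


rng = random.Random(0)
groups = balance_impression(["N2"], ["N1", "N3", "N4", "N5", "N6"], k=2, rng=rng)
print(groups)  # one group: the positive "N2" plus 2 sampled negatives
print(group_loss([2.0, 0.5, -1.0]))  # small when the positive scores highest
```

In this example one positive and five negatives produce a single group of 1 + 2 candidates, and the other three negatives are dropped. The loss is computed over the whole group, so it depends on how the positive ranks against its K negatives rather than on each candidate in isolation.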

Thanks for the reply. I'll go through the papers again. Thank you 🙏

I plan to study your code and add pretraining on top of it; I see the embedding part still uses GloVe. Also, this is a really nice project. Thanks again 🙏

I just looked at the news recommendation code in recommenders, and its source code also handles negative sampling.

@yusanshi can you please share pretrained weight for this model and one more thing please let me know which config you used before training and evaluation.
Thanks

@ayush-angelium

can you please share pretrained weight for this model

Sorry, but in fact I don't have them... Months ago, I trained and tested all the methods on the MIND small dataset and showed the results and checkpoint links in README.md. Since then I have made some small changes to the code and started using the MIND large dataset, so I removed the outdated results. I haven't trained and tested on the MIND large dataset yet.

let me know which config you used before training and evaluation

Just those in src/config.py.