data process

Question

Closed this issue 4 years ago · 6 comments

Hi, in the data-processing part, why is the data balanced? Is there a particular reason for it?

Answer 1 · 2020-08-16T09:25:53.000Z

Within a single impression, positives are usually far fewer than negatives (that is, most of the candidate news in an impression were not clicked).

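Without negative sampling, this is an ordinary binary classification problem: every candidate news in every impression produces one training sample, so positives end up far outnumbered by negatives. See https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18 for ways of dealing with imbalanced data.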
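A minimal sketch of the imbalance described above, assuming impressions are stored as space-separated `newsID-label` strings where label 1 means clicked. The toy data and the helper name `to_binary_samples` are made up for illustration and are not taken from the repo.

```python
from collections import Counter

# Hypothetical impressions in a "newsID-label" format (1 = clicked, 0 = not clicked).
impressions = [
    "N1-1 N2-0 N3-0 N4-0 N5-0",
    "N6-0 N7-0 N8-1 N9-0",
    "N10-0 N11-0 N12-0 N13-1 N14-0 N15-0",
]


def to_binary_samples(impression):
    """Expand one impression into (news_id, label) pairs, one per candidate news."""
    samples = []
    for item in impression.split():
        news_id, label = item.rsplit("-", 1)
        samples.append((news_id, int(label)))
    return samples


# Without negative sampling, every candidate becomes its own training sample.
all_samples = [s for imp in impressions for s in to_binary_samples(imp)]
label_counts = Counter(label for _, label in all_samples)

# For the toy data above this gives 3 positives against 12 negatives.
print("positives:", label_counts[1], "negatives:", label_counts[0])
```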

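With negative sampling (which all the model code in this repo does), one positive and K negatives are treated as one group (a "pair"), and the loss expression reflects how well the whole group is matched. In this case, "balance" means that, among the candidate news of an impression, each positive is matched with K negatives and the leftover negatives are discarded. To read more about this, pick one of the papers for the models in this repo. If I remember correctly, every paper except the DKN one describes negative sampling.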
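The balancing step described above could look roughly like the following. This is only a sketch under stated assumptions, not the repo's actual preprocessing code: the function names `balance_impression` and `group_loss` are hypothetical, the impression format is assumed to be space-separated `newsID-label` strings, a positive that cannot get K negatives is simply skipped here, and the loss is assumed to be the negative log of the softmax probability of the positive among the 1 + K candidates, which is one common choice.

```python
import math
import random


def balance_impression(impression, k, rng):
    """Match each clicked news with k non-clicked news; drop leftover negatives.

    Assumption: a positive that cannot get k negatives is skipped.
    """
    clicked, not_clicked = [], []
    for item in impression.split():
        news_id, label = item.rsplit("-", 1)
        (clicked if label == "1" else not_clicked).append(news_id)

    rng.shuffle(not_clicked)  # pick negatives at random
    groups = []
    for i, positive in enumerate(clicked):
        negatives = not_clicked[i * k:(i + 1) * k]
        if len(negatives) < k:
            break  # not enough negatives left for this positive
        groups.append((positive, negatives))
    return groups


def group_loss(pos_score, neg_scores):
    """Negative log softmax probability of the positive among 1 + K candidates."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - pos_score


# Hypothetical impression: 2 clicked and 5 non-clicked candidates, K = 2.
rng = random.Random(0)
groups = balance_impression("N1-1 N2-0 N3-0 N4-0 N5-0 N6-1 N7-0", k=2, rng=rng)
for positive, negatives in groups:
    print(positive, negatives)
```

With K = 2, the two positives use four of the five negatives and the remaining one is discarded, so every training sample has the same 1 : K shape regardless of how many candidates the impression contained.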

Answer 2 · 2020-08-16T09:35:29.000Z

Thanks for the reply. I'll take another look at the papers. Thanks 🙏

Answer 3 · 2020-08-16T09:40:08.000Z

I plan to study your code and add pretrained components on top of it; I see the embeddings still use GloVe. This is a very nice project, by the way. Thanks again 🙏

Answer 4 · 2020-08-16T09:50:09.000Z

I just looked at the news recommendation code in recommenders, and its source code also does negative sampling.

Answer 5 · 2020-11-03T06:11:38.000Z

@yusanshi Could you please share the pretrained weights for this model? Also, please let me know which config you used for training and evaluation.
Thanks

Answer 6 · 2020-11-03T06:26:44.000Z

@ayush-angelium

"can you please share pretrained weight for this model"

Sorry, but I don't actually have them. Months ago, I trained and tested all the methods on the MIND small dataset and posted the results and checkpoint links in README.md. However, I have since made some small changes to the code and started using the MIND large dataset, so I removed the outdated results. I haven't trained and tested on the MIND large dataset yet.

"let me know which config you used before training and evaluation"

Just those in src/config.py.