thunlp/KNET

Question about the util.build_vocab method

Closed this issue · 1 comments

Hi, it seems that in util.build_vocab, the code loads the pre-trained embedding file only up to the token 'unk', which appears on line 171915 of glove.840B.300d.
Is the intent to treat all words in glove.840B.300d after that line as unk (low-frequency words)?
Is there a particular reason for doing it this way, or is it a common practice in the field?

  1. Yes, all words after line 171915 are regarded as low-frequency words, and they share the same embedding.

  2. It's common practice to treat words below a certain frequency as rare words and use a single shared embedding for them. The most widely used approach is to fix a vocabulary size in advance, say 50k (this is largely determined by the capacity of your model), and then treat all words outside the top 50k as rare.
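
The truncation described above can be sketched as follows. This is a hypothetical illustration, not the repository's actual util.build_vocab: it reads vectors in GloVe text format (word followed by its values) until it reaches the `unk` entry or a preset vocabulary cap, after which any out-of-vocabulary word falls back to the shared `unk` embedding.

```python
import numpy as np

def build_vocab(embedding_lines, max_vocab=50_000, unk_token="unk"):
    # Hypothetical sketch: keep embeddings up to the `unk` entry (or up to
    # `max_vocab` words); every later or unseen word shares unk's vector.
    word2id, vectors = {}, []
    for line in embedding_lines:
        parts = line.rstrip().split(" ")
        word, vec = parts[0], np.asarray(parts[1:], dtype=np.float32)
        word2id[word] = len(vectors)
        vectors.append(vec)
        if word == unk_token or len(vectors) >= max_vocab:
            break
    # Fall back to the last kept row if no explicit unk entry was found.
    unk_id = word2id.get(unk_token, len(vectors) - 1)

    def lookup(w):
        return word2id.get(w, unk_id)  # OOV words map to the shared unk id

    return word2id, np.stack(vectors), lookup

# Toy demo with 2-d vectors; "rare" sits after "unk", so it is cut off.
lines = ["the 0.1 0.2", "of 0.3 0.4", "unk 0.0 0.0", "rare 9.9 9.9"]
word2id, emb, lookup = build_vocab(lines)
print(lookup("the"), lookup("rare"))  # "rare" resolves to unk's id
```

Choosing the cutoff at the pre-trained file's own `unk` line (rather than a fixed top-k count) is just one way to pick the vocabulary boundary; either way, all truncated words end up sharing one embedding row.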