thunlp/KNET

Question about the util.build_vocab method

Closed this issue · 1 comments

Hi, it seems that in util.build_vocab, the code loads the pre-trained embedding file only up to the token 'unk', which appears on line 171915 of glove.840B.300d.
Is the intent to treat all words in glove.840B.300d after that line as unk (low-frequency words)?
Is there a particular reason for doing it this way, or is it a common practice in the field?

  1. Yes, all words after line 171915 are regarded as low-frequency words, and they share the same embedding.

  2. It's common practice to treat words below a certain frequency as rare words and use a single shared embedding for them. The most widely used approach is to fix a vocabulary size in advance, say 50k (this is largely determined by the capacity of your model), and then treat all words outside the top 50k as rare.
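
The truncation described above can be sketched as follows. This is a hypothetical illustration, not the repository's actual util.build_vocab: it reads vectors in GloVe text format (word followed by its values) until it reaches the `unk` entry or a preset vocabulary cap, after which any out-of-vocabulary word falls back to the shared `unk` embedding.

```python
import numpy as np

def build_vocab(embedding_lines, max_vocab=50_000, unk_token="unk"):
    # Hypothetical sketch: keep embeddings up to the `unk` entry (or up to
    # `max_vocab` words); every later or unseen word shares unk's vector.
    word2id, vectors = {}, []
    for line in embedding_lines:
        parts = line.rstrip().split(" ")
        word, vec = parts[0], np.asarray(parts[1:], dtype=np.float32)
        word2id[word] = len(vectors)
        vectors.append(vec)
        if word == unk_token or len(vectors) >= max_vocab:
            break
    # Fall back to the last kept row if no explicit unk entry was found.
    unk_id = word2id.get(unk_token, len(vectors) - 1)

    def lookup(w):
        return word2id.get(w, unk_id)  # OOV words map to the shared unk id

    return word2id, np.stack(vectors), lookup

# Toy demo with 2-d vectors; "rare" sits after "unk", so it is cut off.
lines = ["the 0.1 0.2", "of 0.3 0.4", "unk 0.0 0.0", "rare 9.9 9.9"]
word2id, emb, lookup = build_vocab(lines)
print(lookup("the"), lookup("rare"))  # "rare" resolves to unk's id
```

Choosing the cutoff at the pre-trained file's own `unk` line (rather than a fixed top-k count) is just one way to pick the vocabulary boundary; either way, all truncated words end up sharing one embedding row.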