liuwei1206/LEBERT

Questions about some hyperparameter settings in the NER experiments


--max_scan_num=1000000
--per_gpu_train_batch_size=4
--per_gpu_eval_batch_size=16
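1) Were the NER results in the paper also obtained with the parameters above?
2) Regarding max_scan_num, can it be understood as scanning only the first 1,000,000 words of the pretrained word embeddings? In other words, does the whole experiment use only the first 1,000,000 word vectors in the file https://ai.tencent.com/ailab/nlp/en/data/Tencent_AILab_ChineseEmbedding.tar.gz?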

Hi,

A1: The training batch_size for each dataset is exactly the value described in the paper.

A2: A very good question! Yes, --max_scan_num=1000000 means we only use the first 100w words in the embedding file (w = 万 = 10,000, so 100w = 1,000,000). This is a hyperparameter, and we set a different value for each dataset. Usually, for the small datasets, we search over {150w, 200w, 300w}; for the big ones, we search over {200w, 300w, 500w}. Intuitively, max_scan_num involves a trade-off. A larger value means more gold segmentation words in the corpus can be matched. However, as max_scan_num grows, the number of incorrectly matched words (noise words) also increases. As a result, it becomes harder for the Lexicon Adapter to pick out the correct words.

Hope this helps.

Wei
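To make the trade-off concrete, here is a minimal sketch of what "only use the first max_scan_num words" could look like, together with a naive lexicon matcher. It is not the repository's actual loader: the function names `load_first_words` and `match_lexicon_words`, the `has_header` and `max_word_len` parameters, and the assumption that the embedding file is word2vec-style text (an optional "vocab_size dim" header line, then one "word v1 v2 ..." line per entry) are all illustrative.

```python
from itertools import islice


def load_first_words(path, max_scan_num, has_header=True):
    """Return the words on the first `max_scan_num` entry lines of a
    word2vec-style text embedding file (assumed format, see lead-in)."""
    words = []
    with open(path, encoding="utf-8") as f:
        if has_header:
            next(f, None)  # skip the assumed "<vocab_size> <dim>" header line
        for line in islice(f, max_scan_num):
            parts = line.rstrip().split(" ")
            if len(parts) > 1:  # ignore blank or malformed lines
                words.append(parts[0])
    return words


def match_lexicon_words(sentence, lexicon, max_word_len=4):
    """List every substring of `sentence` with 2..max_word_len characters
    that appears in `lexicon` (hypothetical matcher, for illustration)."""
    lexicon = set(lexicon)
    matches = []
    for start in range(len(sentence)):
        last_end = min(start + max_word_len, len(sentence))
        for end in range(start + 2, last_end + 1):
            if sentence[start:end] in lexicon:
                matches.append(sentence[start:end])
    return matches
```

With a small scan limit the matcher only sees the words at the top of the file; raising the limit adds more correct words, but can also add overlapping candidates that are not part of the gold segmentation, which is the noise described above.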

Also, the max_scan_num value for each dataset can be found in the checkpoints: they include shell scripts that give the max_scan_num used in my paper. Note that those values may not be optimal, since I didn't tune this hyperparameter carefully.

OK, I will have a try. Thank you very much. This paper you wrote is really great!
