Question about some hyperparameter settings in the NER experiments
--max_scan_num=1000000
--per_gpu_train_batch_size=4
--per_gpu_eval_batch_size=16
1) Were the NER results reported in the paper also obtained with the parameters above?
2) For max_scan_num, is it correct to understand that only the first 1,000,000 words of the pretrained word embeddings are scanned? In other words, do the experiments only use the first 1,000,000 word vectors from https://ai.tencent.com/ailab/nlp/en/data/Tencent_AILab_ChineseEmbedding.tar.gz?
Hi,
A1: The training batch size for each dataset is exactly the same as the value described in the paper.
A2: A very good question! Yes, --max_scan_num=1000000 means we only use the first 1,000,000 words in the embedding. This is a hyperparameter, and we set a different value for each dataset. Usually, for a small dataset we search over {1.5M, 2M, 3M}; for a large one we search over {2M, 3M, 5M}. Intuitively, there is a trade-off in max_scan_num: a larger value lets the corpus match more gold segmentation words, but as max_scan_num grows, the number of incorrectly matched words (noise words) also increases, which makes it harder for the Lexicon Adapter to pick out the correct words.
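To make the "scan only the first N entries" idea concrete, here is a minimal sketch of loading just the first max_scan_num vectors from a plain-text embedding file such as the extracted Tencent release (assumed format: a header line with vocab size and dimension, then one word plus its vector per line). The function name and details are illustrative, not the repository's actual loading code.

```python
import numpy as np

def load_first_n_embeddings(path, max_scan_num=1_000_000):
    """Illustrative sketch: read only the first `max_scan_num` word vectors
    from a plain-text embedding file. Words beyond max_scan_num are never
    seen, so they can never be matched as lexicon words."""
    word2vec = {}
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        header = f.readline().split()   # assumed header: "<vocab_size> <dim>"
        dim = int(header[1])
        for i, line in enumerate(f):
            if i >= max_scan_num:       # stop after the first N entries
                break
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if len(vec) != dim:         # skip malformed lines
                continue
            word2vec[word] = np.asarray(vec, dtype=np.float32)
    return word2vec

# Example (hypothetical file name): the lexicon used for matching is then
# drawn only from these first max_scan_num entries.
# lexicon = load_first_n_embeddings("Tencent_AILab_ChineseEmbedding.txt", 1_000_000)
```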
Hope it helps.
Wei
And the value of max_scan_num corresponding to each dataset can be found in the checkpoints: the shell scripts included there give the max_scan_num values I used in the paper. Note that those values may not be the best, since I didn't tune this hyperparameter carefully.
OK, I will give it a try. Thank you very much. Your paper is really great!