allenai/scibert

Question Over Punctuation Charts in Vocab Creation


Hi

Interesting to see you looked at creating your own vocab. It appears that for BERT they used a special variant, and neither the code nor the exact details of what was run have been made available. In your cheatsheet I've found a reference to using Google's SentencePiece through the Python wrapper, by the looks of it. I wondered if you had any more specific details on the preparation and post-processing, since the default SentencePiece output won't include custom tokens and doesn't use the ## convention. I have a pretty good idea of how you likely did most of it, but it would be nice to know for sure. Also, the command in the cheatsheet sets the vocab to 31K rather than 30K, and the length of your vocab files differs from BERT's too.
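(For context, the kind of Python-wrapper invocation being referred to looks roughly like the sketch below; the file names and most options are illustrative assumptions, not the exact cheatsheet command.)

```python
# Minimal sketch of training a subword vocab with the SentencePiece python
# wrapper. Corpus path, model prefix, and options are assumptions here.
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    "--input=corpus.txt "            # one sentence per line
    "--model_prefix=scivocab "       # writes scivocab.model and scivocab.vocab
    "--vocab_size=31000 "            # 31K rather than 30K, as noted above
    "--model_type=bpe "
    "--character_coverage=0.9995"
)
```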

Could you also comment on the significance of the "[unused15]"-style entries? The base BERT vocab has 994 of them, whereas you only have 100. I haven't found any details on what these are used for or what created them.

More significantly (maybe), I was curious whether you had looked at any preprocessing (additional tokenisation) before running SentencePiece. The reason being that the BERT tokeniser does whitespace and punctuation splitting before applying WordPiece tokenisation to the resulting tokens. It looks as though this could be part of what WordPiece does as well, based on the occurrence of lots of single-character entries (with and without ## prefixed) and the lack of entries mixing punctuation characters with letters or digits. For example, you have entries like "(1997),", whereas the BERT vocab doesn't have anything like this (with the exception of symbol characters not classed as punctuation). One issue with these entries as they stand is that when you apply the BERT tokeniser as part of task training you are never going to use them, so even though you have a ~30K vocab size, a portion of it will never be reached and the neural model will have unused capacity. There is also the potential to change the number of UNK tokens resulting from the tokenisation steps.
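To make that concrete, here is a quick check using the HuggingFace BertTokenizer as a stand-in for Google's reference tokenizer (not something from this repo), showing the punctuation being split off before WordPiece ever runs:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("(1997),"))
# -> ['(', '1997', ')', ','] (or similar): the basic tokenizer splits the
# punctuation off first, so a vocab entry like "(1997)," can never be emitted.
```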

Following on from that, there is a question as to whether this has possibly impacted (negatively or positively) the results you have seen, as a side effect of this difference. So any more information on what may have been tried in this area would be of interest to hear.

Thanks

Tony

Good questions.

I wondered if you had any more specific details on preparation and post processing

No, just changing the format of the vocab file that SentencePiece produces to match what BERT is expecting. Notice that they use the ## prefix differently.
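For anyone else reading, a minimal sketch of that kind of reformatting might look like the following (file names and the handling of SentencePiece's control symbols are assumptions, not the repo's actual script). SentencePiece marks word-initial pieces with "▁", whereas BERT marks continuation pieces with "##":

```python
# Convert a SentencePiece .vocab file into a BERT-style vocab.txt.
def sentencepiece_vocab_to_bert(in_path="scivocab.vocab", out_path="vocab.txt"):
    pieces = []
    with open(in_path, encoding="utf-8") as f:
        for line in f:
            piece = line.split("\t")[0]            # format: "<piece>\t<score>"
            if piece in ("<unk>", "<s>", "</s>"):  # SentencePiece control symbols
                continue
            if piece.startswith("\u2581"):         # "▁" = word-initial piece
                pieces.append(piece[1:])
            else:                                  # continuation piece
                pieces.append("##" + piece)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(p for p in pieces if p) + "\n")
```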

31K not 30K

The BERT uncased vocab is close to 31K. I wanted to use the same size, but I don't think it matters that much.

"[unused15]"

As far as I can tell, you can safely remove the [unused*] tokens from your vocab, but I didn't try it and didn't read the BERT code thoroughly enough to be sure. Given that training this thing is slow, I kept them in the vocab just in case.

any preprocessing (additional tokenisation) before running SentencePiece

I didn't, but that's reasonable to do. As you suggested, I think it is even better to have this as part of WordPiece, not before WordPiece (as is the case in BERT).
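For anyone wanting to try this, a simplified sketch of a BERT-style whitespace + punctuation pre-split applied to the corpus before SentencePiece training could look like the following (an assumption for illustration, not what was actually run for SciBERT; BERT's real BasicTokenizer also handles a few extra character classes):

```python
import unicodedata

def basic_split(text):
    """Split on whitespace and make each punctuation character its own token."""
    out, buf = [], []
    for ch in text:
        if ch.isspace():
            if buf: out.append("".join(buf)); buf = []
        elif unicodedata.category(ch).startswith("P"):   # punctuation
            if buf: out.append("".join(buf)); buf = []
            out.append(ch)
        else:
            buf.append(ch)
    if buf: out.append("".join(buf))
    return out

print(basic_split("(1997), see Figure 1."))
# -> ['(', '1997', ')', ',', 'see', 'Figure', '1', '.']
```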

Thanks for following up on these questions.

I haven't found an explanation for these [unused*] entries yet, and couldn't find anything in the code that might be tied to them. Did you add the ones in the SciBERT vocab yourself, and if so, was that a post-processing step or custom symbols passed to SentencePiece? Again, we have not seen how these would get added by SentencePiece, so I'm curious how they got there.

SentencePiece doesn't generate them. They are copied from the BERT vocab. However, I would suggest you train your model without them; most probably it will work just fine.
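If you do want to keep them, a rough sketch of copying the BERT-style special tokens and [unused*] placeholders on top of a converted vocab is shown below (the exact count and ordering in the real BERT/SciBERT vocab files may differ; this is purely illustrative):

```python
# Prepend special tokens and 100 [unused*] placeholders to a converted vocab.
special = ["[PAD]"] + [f"[unused{i}]" for i in range(100)] + ["[UNK]", "[CLS]", "[SEP]", "[MASK]"]

with open("vocab.txt", encoding="utf-8") as f:
    pieces = [line.rstrip("\n") for line in f]

with open("vocab_with_special.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(special + pieces) + "\n")
```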

Hi scibert-team,

First of all, I really appreciate your great work!

I would like to ask an additional question about preprocessing. You said you didn't do any additional tokenisation; how about other preprocessing?
I mean, I wonder whether you replaced superscripts, URLs, or expressions like "Figure 1" or "Table 2".

Ryota

Hey @roy29fuku, we use SciSpacy https://github.com/allenai/SciSpaCy for tokenization and sentence splitting. As for replacing superscripts, URLs, or expressions, we didn't do any of that (unless PDFBox somehow does this).
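For reference, sentence splitting and tokenization with scispacy looks roughly like this (en_core_sci_sm is the small scispacy model package; the exact model and version used for the SciBERT pretraining corpus isn't stated in this thread):

```python
# Requires: pip install scispacy and the en_core_sci_sm model package.
import spacy

nlp = spacy.load("en_core_sci_sm")
doc = nlp("The patient was given 5 mg of drug X. Fig. 2 shows the results.")
for sent in doc.sents:
    print([token.text for token in sent])
```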

Thank you @kyleclo !
Your explanation really helped!

Could you maybe explain why this approach is better than using the regular base vocab from BERT? Do most people use this approach with language modeling? Do you have to do this from scratch, or are there packages out there that automatically create a new vocab.txt file? Just trying to learn, new to this stuff.