pytorch/examples

word_language_model/data.py - remove '<eos>'

drtonyr opened this issue

A long long time ago, back in the days when n-gram modelling ruled, our LMs didn't carry context over from one sentence to another. In the current era of LLMs, we use many past sentences as context.

The supplied data in data/wikitext-2 acknowledges this change and uses '.' as the token to represent the full stop at the end of sentences. However, the code in `data.py` is still n-gram style: it explicitly appends the token `<eos>` at the end of every line.
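For context, here is a condensed sketch of the relevant part of `data.py` (paraphrased, not verbatim; the Dictionary class is simplified):

```python
import os
import torch

class Dictionary:
    def __init__(self):
        self.word2idx, self.idx2word = {}, []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1

class Corpus:
    def __init__(self):
        self.dictionary = Dictionary()

    def tokenize(self, path):
        """Tokenizes a text file, forcing <eos> onto every line."""
        assert os.path.exists(path)
        # First pass: build the vocabulary, including the artificial <eos>.
        with open(path, 'r', encoding='utf8') as f:
            for line in f:
                words = line.split() + ['<eos>']  # the hard-wired token at issue
                for word in words:
                    self.dictionary.add_word(word)
        # Second pass: map every token (including the appended <eos>) to its id.
        with open(path, 'r', encoding='utf8') as f:
            idss = []
            for line in f:
                words = line.split() + ['<eos>']
                ids = [self.dictionary.word2idx[w] for w in words]
                idss.append(torch.tensor(ids).type(torch.int64))
        return torch.cat(idss)
```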

This is (extremely valuable!) example code, so it should be as clean and general as possible. The use of `<eos>` adds nothing and should be removed. Indeed, it can be seen as a bug: the model quickly learns that `<eos>` follows '.' almost deterministically, so the artificial token is nearly free to predict and artificially reduces the reported perplexity. Example code should be general and should not hard-wire tokens left over from a legacy implementation. If people want an `<eos>` token it should be in the data, as with wikitext-2; if they don't want it, it shouldn't be enforced.
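To make the perplexity point concrete, here is a toy calculation (the loss values are made up for illustration): each appended `<eos>` contributes almost no loss while still counting as a token, which drags the average down.

```python
import math

# Hypothetical per-token cross-entropies (nats) for a short test stream.
real_tokens = [5.0, 4.2, 6.1, 3.8]  # genuine words
eos_loss = 0.01                     # <eos> after '.' is ~free to predict

ppl_without = math.exp(sum(real_tokens) / len(real_tokens))
with_eos = real_tokens + [eos_loss]
ppl_with = math.exp(sum(with_eos) / len(with_eos))

print(f"perplexity without <eos>: {ppl_without:.1f}")  # ~118.6
print(f"perplexity with <eos>:    {ppl_with:.1f}")     # ~45.7, much lower
```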

It would be very easy to remove both occurrences of `+ ['<eos>']` in `data.py`, and the resulting example code would be more general and more scientific.
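Concretely, the change amounts to this, in both passes of the tokenize method (the surrounding code is unchanged):

```python
# Before:
words = line.split() + ['<eos>']

# After: tokens come only from the data, as in wikitext-2 itself.
words = line.split()
```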