Tokenizer's `num_words` filtering is based on the word's index
Opened this issue · 2 comments
In the `texts_to_sequences_generator` method of the `Tokenizer`, the `num_words` check is based on the word's index. I understand that this check is fast, but wouldn't it be a problem if the ordering changed (i.e., if it were no longer based on frequency)?
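To make the concern concrete, here is a minimal sketch (not the actual Keras code; the function and variable names are illustrative) of index-based filtering: words are kept only if their index in `word_index` is below `num_words`, which is only equivalent to "keep the most frequent words" when smaller indices mean higher frequency.

```python
def texts_to_sequences(texts, word_index, num_words):
    """Keep only words whose index in word_index is < num_words."""
    sequences = []
    for text in texts:
        seq = []
        for w in text.split():
            i = word_index.get(w)
            # The cutoff compares the *index*, not the word's frequency.
            if i is not None and i < num_words:
                seq.append(i)
        sequences.append(seq)
    return sequences

# Frequency-sorted index: most frequent words get the smallest indices,
# so the cutoff drops the rarest word ("zebra").
freq_index = {"the": 1, "cat": 2, "zebra": 3}
print(texts_to_sequences(["the cat saw the zebra"], freq_index, num_words=3))
# -> [[1, 2, 1]]

# Alphabetically sorted index: the same cutoff now drops "the",
# regardless of how frequent it is.
alpha_index = {"cat": 1, "saw": 2, "the": 3, "zebra": 4}
print(texts_to_sequences(["the cat saw the zebra"], alpha_index, num_words=3))
# -> [[1, 2]]
```

With the alphabetical mapping, the most common word in the text is silently filtered out, which is the failure mode being described.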
keras-preprocessing/keras_preprocessing/text.py, lines 333 to 340 in 5949df1
Hello,
Note: I'm far from an expert in NLP.
Do you have an example where you wouldn't use frequency?
As long as `word_index` is sorted in order of importance, it should work, I think.
Hi,
In my current project, we defined an external index/word mapping, as our dataset often changes but our vocabulary does not. So the tokens won't always be sorted in order of importance.
For the record, I don't need this particular method (yet, I think...), but I found the assumption the check makes about the data a little bit "hard".
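For a fixed vocabulary with a changing dataset, one possible workaround (a sketch, not an official Keras API; the helper name and sample data are my own) is to re-index the external vocabulary by corpus frequency before building sequences, so that smaller indices again mean more frequent words and the `num_words` cutoff behaves as intended:

```python
from collections import Counter

def reindex_by_frequency(vocabulary, texts):
    """Return a word_index where more frequent words get smaller indices."""
    counts = Counter(w for text in texts for w in text.split() if w in vocabulary)
    # Index 1 goes to the most frequent word; ties broken alphabetically.
    ranked = sorted(vocabulary, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ranked, start=1)}

vocab = {"cat", "the", "zebra"}
corpus = ["the cat saw the zebra", "the cat slept"]
print(reindex_by_frequency(vocab, corpus))
# -> {'the': 1, 'cat': 2, 'zebra': 3}
```

The vocabulary itself stays fixed; only the index assignment is recomputed when the dataset changes.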