word_language_model/data.py - two areas of redundant code
drtonyr opened this issue · 0 comments
As this is (extremely useful!) example code, it should be as clean as possible.
I'm looking at `word_language_model/data.py` and there are two areas where the clarity and speed could be improved by removing redundant code.
- `tokenize()` runs in two passes, labelled `# Add words to the dictionary` and `# Tokenize file content`. The first pass calls `add_word()`, which both adds the word to the dictionary and returns its token, so everything can be done in a single pass. Cleanest is to remove the first pass entirely and change the line `ids.append(self.dictionary.word2idx[word])` to `ids.append(self.dictionary.add_word(word))` (a combined sketch follows below the list).
- In `# Tokenize file content`, a list of torch tensors is built and then `torch.cat()` is used to merge them into the final tensor. It is both cleaner and faster to skip the intermediate tensors and simply do:
  ```python
  # Tokenize file content
  with open(path, 'r', encoding="utf8") as f:
      ids = []
      for line in f:
          words = line.split() + ['<eos>']
          for word in words:
              ids.append(self.dictionary.word2idx[word])
  return torch.tensor(ids).type(torch.int64)
  ```
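
Putting the two suggestions together, `tokenize()` could be reduced to something like the sketch below. This mirrors the `Dictionary`/`Corpus` layout currently in `data.py`; it is meant to show the shape of the change, not a drop-in patch:

```python
import os
import torch


class Dictionary(object):
    """Word-to-token mapping, as in word_language_model/data.py."""

    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        # Registers the word if unseen and always returns its token,
        # which is why the separate first pass is redundant.
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        """Tokenizes a text file in a single pass."""
        assert os.path.exists(path)
        with open(path, 'r', encoding="utf8") as f:
            ids = []
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    # add_word() returns the token, so no second pass
                    # and no intermediate per-line tensors are needed.
                    ids.append(self.dictionary.add_word(word))
        return torch.tensor(ids).type(torch.int64)
```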
In both cases I've simply removed redundant code to make things cleaner to read and faster to execute (data load took about 20 minutes for the billion word corpus).
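
For anyone who wants to check the speedup locally, a minimal timing harness might look like this (assuming `data.py` is importable and a dataset directory in the usual train/valid/test layout, such as the bundled `./data/wikitext-2`):

```python
import time

from data import Corpus  # word_language_model/data.py

start = time.perf_counter()
corpus = Corpus('./data/wikitext-2')  # any train.txt/valid.txt/test.txt directory
elapsed = time.perf_counter() - start
print(f"{len(corpus.dictionary)} vocabulary entries, "
      f"{corpus.train.numel()} training tokens, loaded in {elapsed:.1f}s")
```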