word_language_model/data.py - two areas of redundant code
drtonyr opened this issue · 0 comments
As this is (extremely useful!) example code, it should be as clean as possible.
I'm looking at `word_language_model/data.py` and there are two areas where the clarity and speed could be improved by removing redundant code.
- `tokenize()` runs in two passes, labelled `# Add words to the dictionary` and `# Tokenize file content`. The first pass calls `add_word()`, which both adds the word to the dictionary and returns its token, so everything can be done in a single pass. Cleanest is to remove the first pass entirely and change the line `ids.append(self.dictionary.word2idx[word])` to `ids.append(self.dictionary.add_word(word))` (a combined sketch follows below the list).
- In `# Tokenize file content`, a list of torch tensors is built and then `torch.cat()` is used to merge them into the final tensor. It is both cleaner and faster to skip the intermediate tensors and simply do:
  ```python
  # Tokenize file content
  with open(path, 'r', encoding="utf8") as f:
      ids = []
      for line in f:
          words = line.split() + ['<eos>']
          for word in words:
              ids.append(self.dictionary.word2idx[word])
  return torch.tensor(ids).type(torch.int64)
  ```
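
Putting the two suggestions together, `tokenize()` could be reduced to something like the sketch below. This mirrors the `Dictionary`/`Corpus` layout currently in `data.py`; it is meant to show the shape of the change, not a drop-in patch:

```python
import os
import torch


class Dictionary(object):
    """Word-to-token mapping, as in word_language_model/data.py."""

    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        # Registers the word if unseen and always returns its token,
        # which is why the separate first pass is redundant.
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        """Tokenizes a text file in a single pass."""
        assert os.path.exists(path)
        with open(path, 'r', encoding="utf8") as f:
            ids = []
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    # add_word() returns the token, so no second pass
                    # and no intermediate per-line tensors are needed.
                    ids.append(self.dictionary.add_word(word))
        return torch.tensor(ids).type(torch.int64)
```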
In both cases I've simply removed redundant code to make things cleaner to read and faster to execute (data load took about 20 minutes for the billion word corpus).
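
For anyone who wants to check the speedup locally, a minimal timing harness might look like this (assuming `data.py` is importable and a dataset directory in the usual train/valid/test layout, such as the bundled `./data/wikitext-2`):

```python
import time

from data import Corpus  # word_language_model/data.py

start = time.perf_counter()
corpus = Corpus('./data/wikitext-2')  # any train.txt/valid.txt/test.txt directory
elapsed = time.perf_counter() - start
print(f"{len(corpus.dictionary)} vocabulary entries, "
      f"{corpus.train.numel()} training tokens, loaded in {elapsed:.1f}s")
```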