bentrevett/pytorch-sentiment-analysis

Theoretical doubt on FastText

hect1995 opened this issue · 1 comment

After reading the original paper and your repo, I still do not understand the use of bi-grams. You add them at the end of the sentence, but in theory they do not exist in GloVe (as they are groups of words, not single words), so I do not know what their objective is or how they are used. Moreover, since you are reducing the vocabulary to 25k words, it seems very unlikely that bi-grams will be among them, as their probability of appearing is lower than that of single words. I would appreciate it if you could clarify this a little further for me.
Thanks

Actually, it's not true that bigrams are unlikely to appear in the vocabulary.
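
The bi-grams end up in the vocabulary because they are appended to every example as a preprocessing step before the vocabulary is built. Here is a minimal sketch of that preprocessing, along the lines of the generate_bigrams function used in the FastText notebook (the exact code in the repo may differ slightly):

def generate_bigrams(x):
    # pair each token with its successor, e.g. ['film', 'is', 'great']
    # gives the bi-grams {'film is', 'is great'}
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(' '.join(n_gram))
    return x

print(generate_bigrams(['this', 'film', 'is', 'terrible']))
# e.g. ['this', 'film', 'is', 'terrible', 'film is', 'this film', 'is terrible']
# (the bi-gram order varies because a set is used)

Because these space-joined bi-grams are present in every training example, they are counted like any other token when the vocabulary keeps the most frequent tokens, which is exactly what the snippet below demonstrates.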

count = 0
total = 0

# TEXT and MAX_VOCAB_SIZE come from the FastText notebook
for token, _ in TEXT.vocab.freqs.most_common(MAX_VOCAB_SIZE):
    if ' ' in token:  # if a token contains a space, it is most probably a bi-gram
        count += 1
    total += 1

print(count / total * 100)

The above code prints ~65, meaning roughly 65% of the 25,000-token vocabulary, i.e. about 16,250 tokens, are bi-grams.

It is true that none of these bi-grams will appear in the GloVe embeddings, so their entries in the embedding layer will be initialized randomly - however, their embeddings are still learned by the neural network during training.
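
To make that concrete, here is a minimal, self-contained sketch in plain PyTorch (the names vocab_tokens and glove_lookup are hypothetical stand-ins for the notebook's TEXT.vocab and the GloVe vectors) showing how rows for tokens missing from GloVe can be randomly initialized while staying trainable:

import torch
import torch.nn as nn

EMBEDDING_DIM = 100

# hypothetical stand-ins for the notebook's vocabulary and GloVe lookup
vocab_tokens = ['the', 'film', 'was', 'great', 'the film', 'was great']
glove_lookup = {'the': torch.randn(EMBEDDING_DIM),   # pretend pretrained vectors
                'film': torch.randn(EMBEDDING_DIM),
                'was': torch.randn(EMBEDDING_DIM),
                'great': torch.randn(EMBEDDING_DIM)}

embedding = nn.Embedding(len(vocab_tokens), EMBEDDING_DIM)

with torch.no_grad():
    for i, token in enumerate(vocab_tokens):
        if token in glove_lookup:
            # uni-grams: copy the pretrained GloVe vector
            embedding.weight[i] = glove_lookup[token]
        else:
            # bi-grams (and other tokens missing from GloVe): random initialization
            embedding.weight[i].normal_()

# every row is a trainable parameter
print(embedding.weight.requires_grad)  # True

During training, gradients flow into every row of embedding.weight, so the randomly initialized bi-gram rows are updated just like the pretrained uni-gram rows (unless the embedding layer is explicitly frozen).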