bentrevett/pytorch-sentiment-analysis

Question about imdb dataset and bert tokenizer (tut. 6)

antgr opened this issue · 2 comments

antgr commented

Although IMDB seems to already provide token ids for its sentence tokens,

print(vars(train_data.examples[6]))
{'text': [5949, 1997, 2026, 2166, 1010, 1012, 1012, 1012, 1012, 1996, 2472, 2323, 2022, 10339, 1012, 2339, 2111, 2514, 2027, 2342, 2000, 2191, 22692, 5691, 2097, 2196, 2191, 3168, 2000, 2033, 1012, 2043, 2016, 2351, 2012, 1996, 2203, 1010, 2009, 2081, 2033, 4756, 1012, 1045, 2018, 2000, 2689, 1996, 3149, 2116, 2335, 2802, 1996, 2143, 2138, 1045, 2001, 2893, 10339, 3666, 2107, 3532, 3772, 1012, 11504, 1996, 3124, 2040, 2209, 9895, 2196, 4152, 2147, 2153, 1012, 2006, 2327, 1997, 2008, 1045, 3246, 1996, 2472, 2196, 4152, 2000, 2191, 2178, 2143, 1010, 1998, 2038, 2010, 3477, 5403, 3600, 2579, 2067, 2005, 2023, 10231, 1012, 1063, 1012, 6185, 2041, 1997, 2184, 1065], 'label': 'neg'}

we use BertTokenizer, and we also pass tokenize_and_cut to data.Field.

What I do not understand is:
Shouldn't the mapping from tokens to token ids in IMDB be the same as the one used by the BERT tokenizer?
Are they the same, and how do we know?

What should we do if we had a dataset of the form:
["This is a sentence", sentiment]

The IMDB dataset does not provide token ids; it just contains the raw strings that make up the movie reviews.

The token ids are provided by the BertTokenizer, which basically has a dictionary that maps from strings to ids. This is the same thing our TEXT.vocab.stoi did in the previous tutorials, where we created our own vocabulary.
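
For example (a minimal sketch, assuming the transformers library and the 'bert-base-uncased' checkpoint used in the tutorial), you can inspect this mapping directly:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# strings -> tokens (BERT's wordpiece tokenization, lowercased for this checkpoint)
tokens = tokenizer.tokenize('This film is great!')

# tokens -> ids, looked up in BERT's fixed vocabulary
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)  # ['this', 'film', 'is', 'great', '!']
print(ids)     # the matching ids from BERT's vocabulary
print(tokenizer.convert_ids_to_tokens(ids))  # round-trips back to the tokens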

Those ids are created when we do:

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

Here the TEXT field handles the movie reviews (raw strings) by first tokenizing them with tokenize_and_cut (its tokenize argument) and then converting the tokens into ids with tokenizer.convert_tokens_to_ids (its preprocessing argument).
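
For reference, the field setup in the tutorial looks roughly like this (a sketch, assuming the legacy torchtext data.Field API used in these tutorials):

from torchtext import data

# BERT models have a fixed maximum input length (512 for bert-base-uncased)
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']

def tokenize_and_cut(sentence):
    # tokenize with BERT's tokenizer, leaving room for the [CLS] and [SEP] tokens
    tokens = tokenizer.tokenize(sentence)
    return tokens[:max_input_length - 2]

TEXT = data.Field(batch_first=True,
                  use_vocab=False,  # we use BERT's vocabulary, not one built by torchtext
                  tokenize=tokenize_and_cut,                      # strings -> tokens
                  preprocessing=tokenizer.convert_tokens_to_ids,  # tokens -> ids
                  init_token=tokenizer.cls_token_id,
                  eos_token=tokenizer.sep_token_id,
                  pad_token=tokenizer.pad_token_id,
                  unk_token=tokenizer.unk_token_id)

As for a dataset of ["This is a sentence", sentiment] pairs: nothing changes on the tokenization side. You would use the same TEXT and LABEL fields and just load the data differently, e.g. (an illustrative sketch, assuming your data sits in a hypothetical CSV file with text and label columns):

train_data = data.TabularDataset(path='data.csv',
                                 format='csv',
                                 fields=[('text', TEXT), ('label', LABEL)])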

antgr commented

Thank you very much!