Need to set "sort_key" in BucketIterator()?

Question

diavy opened this issue 4 years ago · 3 comments

Thank you for the great code examples!

I noticed the following code segment:

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits( (train_data, valid_data, test_data), batch_size = BATCH_SIZE, device = device)

According to the official documentation:

sort_key – A key to use for sorting examples in order to batch together examples with similar lengths and minimize padding.

So it seems that if you do not set the key, there's no difference between Iterator and BucketIterator. I also tested it myself, and confirmed it.

Could you please check this in your code?

Answer 1 · 2020-01-20T08:54:11.000Z

If you do not provide a sort_key to an iterator, TorchText will take the sort_key from the dataset; see here.

The IMDB dataset used in these tutorials has its sort key set here, which is used by the BucketIterator.

If you were using your own dataset then you would need to set your own sort key.
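To make the fallback concrete, here is a minimal pure-Python sketch. It is not TorchText's actual implementation: `ToyDataset`, `resolve_sort_key`, `make_batches`, and `padding_needed` are hypothetical names invented for illustration, and the batching is simplified to one global sort followed by fixed-size chunks. It only shows the idea described above: an explicit sort_key wins, otherwise the dataset's own key is used, and with no key at all the batches are not grouped by length.

```python
import random


class ToyDataset:
    """Hypothetical stand-in for a dataset that may carry its own sort key."""

    def __init__(self, examples, sort_key=None):
        self.examples = examples
        self.sort_key = sort_key


def resolve_sort_key(dataset, sort_key=None):
    # Fallback as described in the answer: an explicit key wins,
    # otherwise the dataset's own key (possibly None) is used.
    return sort_key if sort_key is not None else dataset.sort_key


def make_batches(dataset, batch_size, sort_key=None):
    # Simplification: sort everything once, then cut into fixed-size chunks.
    key = resolve_sort_key(dataset, sort_key)
    examples = list(dataset.examples)
    if key is not None:
        examples.sort(key=key)
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]


def padding_needed(batches):
    # Each example is padded up to the longest example in its batch.
    return sum(max(len(ex) for ex in batch) - len(ex)
               for batch in batches for ex in batch)


random.seed(0)
sentences = [["tok"] * random.randint(1, 50) for _ in range(64)]

plain = ToyDataset(sentences)                # no sort key anywhere
keyed = ToyDataset(sentences, sort_key=len)  # dataset supplies the key

pad_plain = padding_needed(make_batches(plain, batch_size=8))
pad_keyed = padding_needed(make_batches(keyed, batch_size=8))
pad_explicit = padding_needed(make_batches(plain, batch_size=8, sort_key=len))

print("no key:", pad_plain)
print("dataset key:", pad_keyed)
print("explicit key:", pad_explicit)
```

For a custom TorchText dataset, the analogous step would be to pass something like sort_key=lambda ex: len(ex.text) to the iterator, assuming the text field is named text.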

Answer 2 · 2020-01-20T17:04:06.000Z

Thank you, that's very helpful. BTW, could you please add an LSTM-CRF model to the POS tagging folder if possible? Looking forward to it!

Answer 3 · 2020-01-21T15:40:31.000Z

I'll look into it!