harvardnlp/seq2seq-attn

Understanding the data preprocessing

vikram-gupta opened this issue · 6 comments

I am sure that I am missing something simple here, but please bear with me :)

What is the philosophy behind loading the pkl files of the movietriple corpus (which are in the form of indices), using the format_data() function in preprocess.py to update the word indices, converting them to a file of words, and then using the get_data() function to convert them back to indices?

I was thinking of using the txt files of the corpus and passing them to the get_data() function directly. This would create the vocab as well as do the conversion to indices. What would I miss here?

This is to ensure that we have the vocabulary (and the corresponding word frequencies) before taking the top K most frequent words.
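A minimal sketch of that idea, assuming a plain-text corpus with one sentence per line: count word frequencies first, keep the top K words, then map each sentence to indices. This is not the code in the repository's preprocess.py; the names `build_vocab` and `to_indices` and the special tokens are hypothetical and chosen only for illustration.

```python
from collections import Counter

def build_vocab(lines, top_k, specials=("<blank>", "<unk>", "<s>", "</s>")):
    """Count word frequencies, then keep only the top_k most frequent words.

    The special tokens and their order are assumptions for this sketch.
    """
    counts = Counter(word for line in lines for word in line.split())
    # Special tokens take the first indices.
    vocab = {tok: i for i, tok in enumerate(specials)}
    for word, _ in counts.most_common(top_k):
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab, counts

def to_indices(line, vocab, unk="<unk>"):
    """Map each word to its index; words outside the vocab map to unk."""
    return [vocab.get(word, vocab[unk]) for word in line.split()]

corpus = ["the cat sat", "the dog sat", "the bird flew"]
vocab, counts = build_vocab(corpus, top_k=2)
indices = to_indices("the bird sat", vocab)
print(indices)  # [4, 1, 5]: "bird" is outside the top 2 words, so it maps to <unk>
```

The frequencies have to be known before truncating to the top K, which is why the counting pass comes before any word is assigned an index.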

Thanks @yoonkim

But can't that be done by loading the text file of the corpus with words (and not the indices)? In fact, the get_data() function calls make_vocab() to do the same.

Why do we call the format_data() function then?

Sorry, I don't see the format_data() function... is it in preprocess.py?

I think you may have a different version. There is no format_data() function in preprocess.py.

@yoonkim you are right! Sorry for the inconvenience. I mixed up the versions. Thanks again!