Understanding the data preprocessing
vikram-gupta opened this issue · 6 comments
I am sure that I am missing something simple here, but please bear with me :)
What is the rationale for loading the pkl files of the movietriple corpus (which store the text as word indices), using the format_data() function in preprocess.py to update the word indices, writing them out to a file of words, and then using the get_data() function to convert the words back to indices?
I was thinking of passing the txt files of the corpus to get_data() directly. This would create the vocab as well as do the conversion to indices. What would I miss by doing that?
This is to ensure that we have the vocabulary (and the corresponding word frequencies) before taking the top K most frequent words.
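For illustration, here is a minimal sketch of that two-pass idea: count word frequencies over the whole corpus first, keep the top K words, and only then map the text to indices. It is not the code from preprocess.py; build_vocab, to_indices, the `<unk>` token, and the toy corpus are all hypothetical names and data chosen for the example.

```python
from collections import Counter


def build_vocab(lines, k, unk="<unk>"):
    """Pass 1: count every token, then keep only the k most frequent words.

    Hypothetical helper, not the make_vocab() from preprocess.py.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    top_words = [word for word, _ in counts.most_common(k)]
    # Index 0 is reserved for the unknown-word token (an assumption of this sketch).
    word2idx = {word: i for i, word in enumerate([unk] + top_words)}
    return word2idx, counts


def to_indices(line, word2idx, unk="<unk>"):
    """Pass 2: map each token to its index, falling back to the unknown token."""
    return [word2idx.get(tok, word2idx[unk]) for tok in line.split()]


# Toy corpus standing in for the text files of the real dataset.
corpus = ["the cat sat", "the dog sat", "the bird flew"]
word2idx, counts = build_vocab(corpus, k=2)
encoded = [to_indices(line, word2idx) for line in corpus]
```

The full frequency table has to exist before the cutoff is applied, which is why the counting pass over the corpus comes before any conversion to indices.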
Thanks @yoonkim
But can't that be done by loading the text file of the corpus with words (rather than the indices)? In fact, the function get_data() calls make_vocab() to do the same.
Why do we call the format_data() function then?
Sorry, I don't see the format_data() function... is it in preprocess.py?
@yoonkim yes.
I think you may have a different version. There is no format_data function in preprocess.py.
@yoonkim you are right! Sorry for the inconvenience. I mixed up the versions. Thanks again!