wasiahmad/NeuralCodeSum

Problem about the symbol in python data.

HaorenD opened this issue · 2 comments

Hi~ Your work is really great!
I know from your paper that you preprocessed the original dataset with SnakeCase and CamelCase splitting, and I also found the .py file you used to implement these two approaches. But I also noticed that, compared to code.original, the code in code.original_subtoken doesn't have symbols like ':', ',', '=', '(', ')', but still has some symbols like '{', '[' and so on. Could you please share the scripts or the strategy you used to process those symbols, as well as indentation and blank space?

Do symbols and spaces have any impact on the prediction? I have the same question as well.

Hi~ Your work is really great!
I know from your paper that you preprocessed the original dataset with SnakeCase and CamelCase splitting, and I also found the .py file you used to implement these two approaches. But I also noticed that, compared to code.original, the code in code.original_subtoken doesn't have symbols like ':', ',', '=', '(', ')', but still has some symbols like '{', '[' and so on. Could you please share the scripts or the strategy you used to process those symbols, as well as indentation and blank space?

For the Python dataset, we used the data that the authors of Bolin et al., 2019 shared with us. With our own preprocessing, we were unable to reproduce some of their results, so we asked for their preprocessed Python dataset and they shared it with us. Therefore, our preprocessing script's output won't match the data we shared.
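Since the shared data did not come from the script in this repository, one way to see which symbols were dropped is to compare the two files directly. The following is a minimal sketch, assuming each line of code.original and code.original_subtoken is a whitespace-separated token sequence; the helper names `symbol_counts` and `dropped_symbols` are hypothetical and not part of the repository.

```python
from collections import Counter
import string

# Assumption: a "symbol" is a token made up only of ASCII punctuation.
PUNCT = set(string.punctuation)


def symbol_counts(lines):
    """Count whitespace-separated tokens that consist only of punctuation."""
    counts = Counter()
    for line in lines:
        for tok in line.split():
            if all(ch in PUNCT for ch in tok):
                counts[tok] += 1
    return counts


def dropped_symbols(original_lines, subtoken_lines):
    """Return symbols that occur in the original lines but never in the subtoken lines."""
    orig = symbol_counts(original_lines)
    sub = symbol_counts(subtoken_lines)
    return sorted(s for s in orig if s not in sub)
```

To apply it to the dataset, pass the lines of the two files (for example, `open("code.original").read().splitlines()`); the paths depend on where the data was downloaded.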

Do symbols and spaces have any impact on the prediction? I have the same question as well.

Answering this requires an experiment. Small changes in the source code (e.g., keeping or removing punctuation symbols) may produce slightly different results. However, we shared exactly the dataset we used in our work, so we believe that if you use our provided dataset, the results should be comparable to the numbers reported in our paper.
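For anyone who wants to try that experiment, a variant of the data with selected symbols removed could be built along these lines. This is only a sketch, assuming whitespace-separated tokens; the function name `strip_symbols` and the default symbol set (taken from the symbols listed in the question) are illustrative assumptions, not the preprocessing used for the shared dataset.

```python
# Assumption: these are the symbols to ablate; adjust the set as needed.
DEFAULT_DROP = frozenset({":", ",", "=", "(", ")"})


def strip_symbols(line, drop=DEFAULT_DROP):
    """Remove tokens in `drop` from a whitespace-separated token line."""
    return " ".join(tok for tok in line.split() if tok not in drop)
```

Training once on the original lines and once on the stripped lines, with everything else fixed, would show how much these symbols matter.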