Questions about the Python and Java datasets.
shengqiangzhang opened this issue · 3 comments
Hi @wasiahmad ,
The model in A Transformer-based Approach for Source Code Summarization takes a sequence of tokens as input, but my model takes an abstract syntax tree (AST). I therefore need to find the original source code (an executable source code snippet) corresponding to each token sequence and then parse it into an AST.
I have downloaded the data from the original work, but the dataset used in your paper differs in size from the original dataset. For example, the training set of the original Python dataset has more than 100,000 examples, while yours has about 50,000.
I want to compare against your model, so I am using the experimental dataset provided with your paper.
Since a token sequence cannot be parsed into an AST, I need to find the corresponding source code in the original work.
Unfortunately, I cannot find the original source code for every token sequence.
If you could provide the corresponding original code files (the sizes of your experimental datasets are inconsistent with the original datasets), I believe I could convert them to ASTs and compare my experimental results with yours.
Thank you.
Hi, I understand your need. A few things to note.
- The preprocessed Python dataset we used was shared by the authors of Bolin et al., 2019, as we were unable to reproduce their results using the dataset we preprocessed ourselves.
- Note that these datasets are extremely noisy, so you may not be able to use the full data with AST-based methods.
- We also performed some naive experiments using ASTs; you can find the details in the paper. We did this only for the Java dataset, and you can find the dataset (java_with_sbt.zip) in the Google Drive link we provided.
- The AST extraction from the original Java code was done by our co-author Saikat (https://github.com/saikat107). I have asked him to reply in this thread.
Thanks!
Hi @shengqiangzhang ,
As @wasiahmad mentioned, we used the same processed dataset that Bolin et al., 2019 used. However, to the best of my knowledge, the Python dataset is from this paper and can be found here. You can find the description of the raw data here.
I hope that helps. Let me know if you have further questions. Feel free to close the issue if not.
Thanks!
Hi @saikat107 @wasiahmad ,
Thank you for your help. I am trying to transform the input data into AST format.
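For anyone following the same route, here is a minimal sketch of the parse-and-filter step for the Python data. It assumes the original snippets are available as plain strings and uses only the standard-library `ast` module. The helper names `to_ast` and `filter_parseable` are hypothetical and are not part of the paper's code. Because the datasets are noisy, snippets that fail to parse are recorded and skipped rather than treated as errors. Snippets written for Python 2 may also fail under a Python 3 interpreter.

```python
import ast


def to_ast(source):
    """Return the ast.Module for a Python snippet, or None if it does not parse."""
    try:
        return ast.parse(source)
    except (SyntaxError, ValueError):
        # SyntaxError: malformed code; ValueError: e.g. null bytes on older Pythons.
        return None


def filter_parseable(snippets):
    """Return (parsed, failed): parsed holds (index, tree) pairs, failed holds indices."""
    parsed, failed = [], []
    for i, src in enumerate(snippets):
        tree = to_ast(src)
        if tree is None:
            failed.append(i)
        else:
            parsed.append((i, tree))
    return parsed, failed


# Illustrative inputs only; these are not taken from the dataset.
snippets = [
    "def add(a, b):\n    return a + b\n",
    "def add ( a , b ) : return a + b",  # space-separated one-liner still parses
    "def broken(:\n    pass\n",          # malformed, so it lands in `failed`
]
parsed, failed = filter_parseable(snippets)
```

Keeping the original index alongside each tree makes it possible to drop the matching summaries for the failed snippets, so the code and comment sides of the dataset stay aligned.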