Generating code documentation with code2seq
balysMorkunas opened this issue · 8 comments
Hi,
I am currently writing a bachelor's project in which I aim to test how inline code comments in the training dataset affect the performance of code documentation generation with the code2seq model. Could you briefly describe what this model can do for automatic code comment generation, and how one would set that up? I am very thankful for your time.
Edit: I can see that issue #34 has related information. I will of course follow it for now, but any additional tips would be much appreciated!
Hi @balysMorkunas ,
Thank you for your interest in code2seq!
I believe that the paper may give a better intuition than what I can describe in brief: https://openreview.net/pdf?id=H1gKYo09tX
Let me know if you have any specific questions!
Uri
Thank you, I'll get in touch again if any serious questions arise!
Hi again,
I started looking at how to retrain the model and preprocess the dataset for documentation generation. I followed your suggestion in issue #34, where you suggest changing JavaExtractor to output documentation instead of method names.
Could you please elaborate or give an example of how to do that? Do you by chance mean to use node.getJavaDoc() instead of node.getName()? What other changes should I be aware of?
Thank you very much for your time and effort, I really appreciate it,
Balys.
Hi @urialon, I am in a very similar situation to @balysMorkunas and would also like to hear your input about this question. Thank you for your time!
Hi @balysMorkunas and @bacevicius ,
Thank you for your interest in code2seq!
Do you by chance mean to use node.getJavaDoc() instead of node.getName()
Basically yes!
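Whatever string node.getJavaDoc() yields would still need to be turned into a target label. Below is a minimal sketch of that post-processing step in Python, assuming the target keeps the '|'-separated subtoken format used for method names and that only the first sentence of the Javadoc is wanted. The names javadoc_to_target and max_tokens are hypothetical and not part of code2seq.

```python
import re


def javadoc_to_target(javadoc: str, max_tokens: int = 30) -> str:
    """Turn a raw Javadoc comment into a '|'-joined target label (hypothetical helper)."""
    # Strip the opening '/**' and closing '*/' delimiters.
    body = re.sub(r"^\s*/\*\*|\*/\s*$", "", javadoc.strip())
    # Strip the leading '*' that decorates each comment line.
    lines = [re.sub(r"^\s*\*\s?", "", line) for line in body.splitlines()]
    text = " ".join(lines)
    # Cut everything from the first block tag (@param, @return, ...) onwards.
    text = re.split(r"\s@\w+", " " + text, maxsplit=1)[0]
    # Remove HTML tags and unwrap inline tags such as {@code x} or {@link Foo}.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\{@\w+\s+([^}]*)\}", r"\1", text)
    # Keep only the first sentence (naive split on a period followed by whitespace).
    first_sentence = re.split(r"\.(?:\s|$)", text.strip(), maxsplit=1)[0]
    # Lowercase alphanumeric tokens, truncated to max_tokens.
    tokens = re.findall(r"[a-z0-9]+", first_sentence.lower())
    return "|".join(tokens[:max_tokens])


example = """/**
 * Returns the {@code name} of this user.
 * Never null.
 *
 * @param id the user id
 * @return the name
 */"""
print(javadoc_to_target(example))
```

The sentence split is deliberately naive, so abbreviations such as "e.g." would cut the summary short. A real pipeline may need a better splitter.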
Another option, if you wish to train on an existing dataset, is to set the target to a unique ID, and then replace the unique ID with the documentation later. See also #45 for additional scripts and hyperparameters.
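The unique-ID route could look roughly like the sketch below. It assumes each preprocessed line has the form "target context context ...", with the ID written in the target position, and that docs_by_id (a hypothetical mapping you would build yourself) already holds documentation in the '|'-separated label format. Lines without matching documentation are dropped.

```python
from typing import Dict, Iterable, Iterator


def replace_ids_with_docs(
    lines: Iterable[str], docs_by_id: Dict[str, str]
) -> Iterator[str]:
    """Swap the leading unique ID of each example for its documentation label."""
    for line in lines:
        # The first space separates the target field from the path contexts.
        target, _, contexts = line.rstrip("\n").partition(" ")
        doc = docs_by_id.get(target)
        # Skip examples with no documentation or no contexts at all.
        if not doc or not contexts:
            continue
        yield f"{doc} {contexts}"


raw_lines = ["id42 a,p1,b c,p2,d\n", "id43 e,p3,f\n"]
docs = {"id42": "returns|the|name"}
for rewritten in replace_ids_with_docs(raw_lines, docs):
    print(rewritten)
```

Keeping the replacement as a separate pass means the extracted contexts can be reused with different documentation variants, for example with and without inline comments.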
Best,
Uri
Thanks for your answer @urialon !
Do you think that the hyperparameters config.SUBTOKENS_VOCAB_MAX_SIZE = 190000 and config.TARGET_VOCAB_MAX_SIZE = 27000 are enough for documentation generation, or should they be increased?
Is there anything else to watch out for regarding the hyperparameters, for example max_code_len and min_code_len in JavaExtractor?
Thank you very much for your time,
Balys.
Hi @balysMorkunas ,
Sorry for the delayed response.
These hyperparameters look OK to me, but they depend on the exact dataset and can never really be known in advance.
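One way to check a vocabulary limit against a concrete dataset is to measure how many target token occurrences the most frequent tokens would cover. This is a minimal sketch, assuming targets are '|'-separated labels. vocab_coverage is a hypothetical helper, not something shipped with code2seq.

```python
from collections import Counter
from typing import Iterable


def vocab_coverage(labels: Iterable[str], max_size: int) -> float:
    """Fraction of token occurrences covered by the max_size most frequent tokens."""
    counts = Counter(tok for label in labels for tok in label.split("|") if tok)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(max_size))
    return covered / total


labels = ["returns|the|name", "returns|the|id", "sets|the|name"]
print(vocab_coverage(labels, max_size=2))
print(vocab_coverage(labels, max_size=10))
```

If coverage at a candidate value such as 27000 is already close to 1.0 on your documentation targets, raising the limit is unlikely to change much. If it is noticeably lower, a larger value may be worth trying.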
max_code_len and min_code_len refer to the size of the functions that you consider, so the right values depend on the dataset you are working with.
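To pick such bounds from the data, one could look at the distribution of method sizes first. The sketch below assumes size is counted in non-blank lines, which may differ from how JavaExtractor measures it. code_length and suggest_length_bounds are hypothetical helpers.

```python
from typing import Sequence, Tuple


def code_length(method_source: str) -> int:
    """Number of non-blank lines in a method (assumed size measure)."""
    return sum(1 for line in method_source.splitlines() if line.strip())


def suggest_length_bounds(
    methods: Sequence[str], lower_pct: int = 5, upper_pct: int = 95
) -> Tuple[int, int]:
    """Return the method lengths at the given lower and upper percentiles."""
    lengths = sorted(code_length(m) for m in methods)
    if not lengths:
        raise ValueError("no methods given")
    last = len(lengths) - 1
    # Integer arithmetic keeps the percentile index deterministic.
    return lengths[last * lower_pct // 100], lengths[last * upper_pct // 100]


# Toy corpus: 21 methods with 1 to 21 non-blank lines each.
methods = ["\n".join(["x;"] * n) for n in range(1, 22)]
print(suggest_length_bounds(methods))
```

The returned pair is only a starting point for min_code_len and max_code_len. Very short methods often carry little documentation, so the lower bound may deserve a manual look.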
Best,
Uri
Thanks for your help!