Question about dataset

SAMMY-KIM opened this issue 2 years ago · 2 comments

Hello.
Thank you for the wonderful paper and code.
The code works well, too. :)

I have a question about the Python dataset.
Data source: https://github.com/EdinburghNLP/code-docstring-corpus
I know the dataset comes from the GitHub repository above, but could you tell me how you built the dataset files from it, that is, how the code was parsed?

If possible, I would like to run the model on general source code, not only on the train and test sets.
For that, I need to know how to preprocess the code, for example how underscores and other characters are removed.
If you know, I would appreciate detailed information.

Thank you:-)

@saikat107
Hello. I read your reply in another issue, #19.
If you have any further information, could you share the details needed to build the dataset structure?
Thank you :)

Answer 1 · 2022-09-26T17:12:25.000Z

You can take a look at the code summarization dataset from CodeXGLUE. Typically, we use the docstring of a function as the summary of that function, so a summarization dataset can be created from any code source.
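As a rough illustration of the docstring-as-summary idea, here is a minimal sketch that collects (function source, summary) pairs from a Python file using the standard `ast` module. The name `extract_pairs` and the choice to keep only the first docstring line as the summary are assumptions for this example; the actual CodeXGLUE preprocessing may differ.

```python
import ast


def extract_pairs(source):
    """Return (function_source, summary) pairs for every documented function.

    Assumption: the summary is the first line of the docstring.
    """
    pairs = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if not doc or not doc.strip():
                continue  # skip functions without a usable docstring
            code = ast.get_source_segment(source, node)  # Python 3.8+
            summary = doc.strip().splitlines()[0]
            pairs.append((code, summary))
    return pairs


example = '''
def get_flashed_messages(with_categories=False):
    """Return the flashed messages."""
    flashes = []
    return flashes
'''

for code, summary in extract_pairs(example):
    print(summary)
```

Each pair can then be written out as one code line and one summary line, after whatever tokenization the target model expects.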

Answer 2 · 2022-09-28T00:40:16.000Z

@wasiahmad
Thank you for your reply. :)
But my question is about general source code like A:

A.
def get_flashed_messages(with_categories=False):
    """ comment for the function """
    flashes = _request_ctx_stack.top.flashes
    # some more code here
    return flashes

But the 'code.original_subtoken' file in the train/test set consists of lines like B.
B.
def get flashed messages with categories False ~~~

It looks like the code was split by some rules (for example, the underscores are removed).
So when we want to run the model on general code, how do we convert it into this form? There is probably a script for this, so I am asking where to find it. :)
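
For anyone with the same question, form B can be approximated with Python's standard `tokenize` module. The sketch below is not the authors' preprocessing script. It assumes that docstrings and comments are dropped, that identifiers are split on underscores and camelCase boundaries, and that punctuation is removed (which is what the B example suggests). The names `split_identifier` and `code_to_subtokens` and the `keep_punctuation` flag are hypothetical.

```python
import io
import re
import tokenize

# Zero-width boundaries inside camelCase / PascalCase names,
# e.g. "getHTTPResponse" -> "get HTTP Response".
_CAMEL_RE = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

# Token types that only carry layout information.
_LAYOUT = {tokenize.NL, tokenize.COMMENT, tokenize.INDENT, tokenize.DEDENT}


def split_identifier(name):
    """Split an identifier on underscores and camelCase boundaries."""
    parts = []
    for chunk in name.split("_"):
        if chunk:
            parts.extend(_CAMEL_RE.sub(" ", chunk).split())
    return parts


def code_to_subtokens(source, keep_punctuation=False):
    """Turn Python source into a flat list of subtokens (assumed format)."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    subtokens = []
    at_statement_start = True
    for i, tok in enumerate(tokens):
        if tok.type in _LAYOUT:
            continue
        if tok.type == tokenize.NEWLINE:
            at_statement_start = True
            continue
        # A bare string statement is treated as a docstring and dropped.
        if (tok.type == tokenize.STRING and at_statement_start
                and tokens[i + 1].type == tokenize.NEWLINE):
            continue
        if tok.type == tokenize.NAME:
            subtokens.extend(split_identifier(tok.string))
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            subtokens.append(tok.string)
        elif tok.type == tokenize.OP and keep_punctuation:
            subtokens.append(tok.string)
        at_statement_start = False
    return subtokens


code_a = '''def get_flashed_messages(with_categories=False):
    """ comment for the function """
    flashes = _request_ctx_stack.top.flashes
    # some more code here
    return flashes
'''

print(" ".join(code_to_subtokens(code_a)))
```

Before feeding the output to the model, compare it against a few lines of the released 'code.original_subtoken' file, since details such as string literals, numbers, and punctuation may be handled differently there.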