coinse/autofl

Generating JSON Data Files from new Raw Data files

Closed this issue · 11 comments

I would like to apply this program to bugs outside Defects4J. However, the available information is insufficient to generate the required JSON files. Could you please share the method you used to create snippet.json, field_snippet.json, test_snippet.json, and failing_test.txt from the raw Defects4J data, or at least a detailed specification of the items in these files?

Hi Sungtae,

Thank you for your interest in our work. In the spirit of a quick response, I'll provide what I believe to be a detailed description of the items in the JSON files. If you would like to know anything more specific, please let us know via a comment, and we would be happy to provide further information. (@agb94, please correct me if I am wrong in my descriptions below.)

  • snippet.json contains Method Declaration information from a project's Java source files. My understanding is that it covers only the classes that appear in the coverage data. Each item in the JSON list describes a single method. The fields seem self-explanatory - src_path is the file where the method is defined, classname is the name of the class, signature is the method signature, etc. Let us know if you are having difficulty with a particular element.
  • field_snippet.json is the same as snippet.json except it deals with Field Declarations.
  • test_snippet.json is the same as snippet.json except it deals with test Java files from a project.
  • failing_tests is an output of the defects4j test command when run on a buggy version of the code. It is not printed to stdout, but a file of the same name is generated in the project directory where defects4j test was run. In particular, it is the stack trace of the failing test.
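
To make the description above concrete, here is a minimal Python sketch of what one snippet.json item might look like. Only src_path, classname, and signature are named in this thread; the remaining keys (snippet, begin_line, end_line), the helper names, and the example values are assumptions for illustration and may not match the actual files.

```python
import json


def make_method_entry(src_path, classname, signature, snippet, begin_line, end_line):
    """Build one snippet.json-style record describing a single method."""
    return {
        # Keys mentioned in this thread:
        "src_path": src_path,    # file where the method is defined
        "classname": classname,  # name of the enclosing class
        "signature": signature,  # method signature
        # Assumed keys, shown only for illustration:
        "snippet": snippet,
        "begin_line": begin_line,
        "end_line": end_line,
    }


def dump_entries(entries):
    """Serialize a list of records as a JSON list, one record per method."""
    return json.dumps(entries, indent=2)


# Hypothetical example values.
entries = [
    make_method_entry(
        src_path="src/main/java/org/example/Foo.java",
        classname="org.example.Foo",
        signature="org.example.Foo.bar(int)",
        snippet="public int bar(int x) { return x + 1; }",
        begin_line=10,
        end_line=10,
    )
]
print(dump_entries(entries))
```

Under the same assumptions, field_snippet.json and test_snippet.json would use the same shape for field declarations and test methods respectively.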

Again, let us know if you would like clarification in any way - we would be happy to help.

I have tried to generate the three corresponding JSON files for Defects4J and other datasets I have through my automated Python script, but have not been successful. Therefore, I would greatly appreciate it if you could share the specific method you used to create these JSON files in your research.

These files are the ones I tried to create similarly to the existing format. While they do not fully match the original data, running them through AutoFL still detected the final buggy method without issues. This example is for Lang-47.
Now, I would like to efficiently generate JSON files for other bugs and execute them through AutoFL.

field_snippet.json
snippet.json
test_snippet.json

Hi. Please try this: https://github.com/coinse/autofl_defects4j_data
Let me know if you encounter any issues running the script. Thanks.

First of all, thank you for your kind consideration.
After reviewing the code and README, I understand that you use Docker to explore the Defects4J dataset and extract the necessary information. However, I believe AutoFL can localize faults not only in the existing Defects4J data but also in other code, given coverage and test information. Although it is just another dataset within Defects4J, is there a way to create the intended JSON files from more limited information? If the entire Defects4J bug dataset is necessary to create the JSON files that AutoFL requires, I would greatly appreciate it if you could explain whether that was the original intention, or whether it serves to train the system to perform FL effectively with minimal information on other data.

@Sungtae124 Well, D4J makes it easier for us to extract certain data, but that does not compromise the generalizability of the technique. Our intention was not to overfit to D4J, and I believe it is possible to create those JSON files for projects outside D4J.

Dear Sungtae,

Thank you for your continued interest in our work. What @ntrolls said is accurate - AutoFL was designed to operate not only on Defects4J, but on other benchmarks as well (indeed, we perform experiments on the Python benchmark BugsInPy). Necessarily, different circumstances call for different pipelines to provide information to LLMs. If your question is on a theoretical level, i.e. "can AutoFL work on benchmarks other than Defects4J", the answer is yes - one just needs to construct the appropriate pipeline.

On the other hand, if your question is practical, i.e. "how can I get AutoFL to work in my scenario" - I find it difficult to help you based on the given information, as neither your objective nor your problem is clear to me. It is unclear why the repository Gabin provided is not a good enough example for you to make your own JSON files. It is also unclear what technical challenges you are facing - it is difficult to answer your question, "is there a way to create the intended JSON files with more limited information?", without knowing what more limited information you are referring to. I would be happy to help if you could clarify.

Thank you for your kind response.

Using the Lang20 dataset from Defects4J as an example, my goal is to generate the JSON files using only the original code, config.properties, coverage.json, npe.traces.json, tests.json, and diff_Lang-20.txt files. It is unclear whether all of this data is included in the existing Defects4J dataset, so I am attaching the relevant materials.
(Please understand that I am unable to attach the properties file.)

diff_Lang-20.txt
coverage.json
npe.traces.json
tests.json

@Sungtae124 The required JSON files can be created from the original sources, as explained by @smkang96 above (#4 (comment)). If you have coverage information for each individual test case as well as access to the source code, you can create the first three files, as they are simply lists of structural elements from the original source code - you just may have to parse the code to match them to the coverage traces. The final file, failing_test, is generated by Defects4J, but I think this is a trivial input for an FL technique, as we always start from some tests failing.
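
As a rough illustration of matching parsed code elements to coverage traces, the Python sketch below keeps only the method records whose class is covered by at least one failing test. The coverage format (a mapping from test name to covered class names), the choice to filter by failing tests, and all names here are assumptions; a real coverage.json would first need to be converted into this shape.

```python
def covered_classes(coverage, failing_tests):
    """Collect the class names covered by any of the given failing tests.

    `coverage` is a hypothetical mapping: test name -> list of covered class names.
    """
    classes = set()
    for test in failing_tests:
        classes.update(coverage.get(test, []))
    return classes


def select_covered(entries, coverage, failing_tests):
    """Keep the method records whose class appears in the failing tests' coverage."""
    keep = covered_classes(coverage, failing_tests)
    return [entry for entry in entries if entry["classname"] in keep]


# Hypothetical example data.
coverage = {
    "org.example.FooTest::testBar": ["org.example.Foo"],
    "org.example.FooTest::testOk": ["org.example.Baz"],
}
entries = [
    {"classname": "org.example.Foo", "signature": "bar()"},
    {"classname": "org.example.Baz", "signature": "baz()"},
]
selected = select_covered(entries, coverage, ["org.example.FooTest::testBar"])
print([entry["signature"] for entry in selected])
```

Producing the records themselves requires parsing the Java sources, which this sketch leaves out.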

Dear Sungtae,

I have checked the files you have provided, and it seems that one could recreate the json files from the information that you have.

  • The snippet.json and field_snippet.json files can be populated by using the information in the coverage.json file and parsing the target Java files in the original code appropriately.
  • test_snippet.json can be populated using the information in your tests.json file, and again parsing the Java files in original code containing the test(s) appropriately.
  • failing_tests seemingly can be constructed from npe.traces.json, which seems to contain an error stack; the error type would always be java.lang.NullPointerException I suppose based on the file name.
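
For the last item, a minimal Python sketch of rendering failure records as failing_tests-style text might look as follows. It assumes that each failing test is introduced by a line of the form `--- TestClass::testMethod` followed by its stack trace, which is how Defects4J output is commonly laid out; please verify this against a real failing_tests file. The input keys (test_class, test_method, stack_trace) are hypothetical and would have to be mapped from whatever npe.traces.json actually contains.

```python
def format_failing_tests(failures):
    """Render failure records as failing_tests-style text.

    Each record is a dict with hypothetical keys:
      test_class  - fully qualified test class name
      test_method - name of the failing test method
      stack_trace - list of stack trace lines, exception line first
    """
    blocks = []
    for failure in failures:
        header = f"--- {failure['test_class']}::{failure['test_method']}"
        blocks.append("\n".join([header, *failure["stack_trace"]]))
    return "\n".join(blocks) + "\n"


# Hypothetical example record.
failures = [
    {
        "test_class": "org.example.FooTest",
        "test_method": "testBar",
        "stack_trace": [
            "java.lang.NullPointerException",
            "\tat org.example.Foo.bar(Foo.java:42)",
        ],
    }
]
text = format_failing_tests(failures)
print(text, end="")
```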

Thus, my understanding is that provided that you build an appropriate pipeline to process the files you have, you should be able to use AutoFL. Of course, providing that specific pipeline is not within the scope of this repository, as it deals with a scenario that we the authors are unfamiliar with. Consequently, I will close this issue with this comment. If you have additional questions, feel free to leave them below.

Hi. Please try this: https://github.com/coinse/autofl_defects4j_data If you encounter any issues running the script, please let me know. Thanks.

I'm sorry to bother you. I saw that you used the file "sbfl_method_ranks_full.json". Could you please give a detailed description of how to construct this file? Looking forward to your reply. Thanks.