Generating JSON Data Files from new Raw Data files
I would like to apply this program to new bugs other than those in Defects4J. However, the available information is insufficient to generate the required JSON files. Could you please share the method you used to create snippet.json, field_snippet.json, test_snippet.json, and failing_test.txt from raw data for Defects4J, or at least a detailed specification of the items in the files?
Hi Sungtae,
Thank you for your interest in our work. In the spirit of a quick response, I'll provide what I believe to be a detailed description of the items in the JSON files. If you would like to know anything more specific, please let us know via a comment, and we would be happy to provide further information. (@agb94, please correct me if I am wrong in my descriptions below.)
snippet.json contains Method Declaration information from source Java files from a project. My understanding is that it contains information of covered classes. Each item in the JSON list provides information about a single method. The information provided for a method seems self-explanatory - the src_path is the file where the method is defined, the classname is the classname, the signature is the method signature, etc. Let us know if you are having difficulty with a particular element.
field_snippet.json is the same as snippet.json except it deals with Field Declarations.
test_snippet.json is the same as snippet.json except it deals with test Java files from a project.
failing_tests is an output of the defects4j test command when run on a buggy version of the code. It is not printed to stdout, but a file of the same name is generated in the project directory where defects4j test was run. In particular, it is the stack trace of the failing test.
Again, let us know if you would like clarification in any way - we would be happy to help.
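As a rough illustration of the description above, the sketch below builds a snippet.json-style list with one record per method. Only src_path, classname, and signature are named in this thread; the begin_line, end_line, and snippet keys, the example values, and the helper name make_method_record are assumptions for illustration and may not match the files AutoFL actually expects.

```python
import json


def make_method_record(src_path, classname, signature, begin_line, end_line, snippet):
    """Build one method entry (hypothetical helper).

    Keys other than src_path, classname, and signature are assumed.
    """
    return {
        "src_path": src_path,
        "classname": classname,
        "signature": signature,
        "begin_line": begin_line,  # assumed key
        "end_line": end_line,      # assumed key
        "snippet": snippet,        # assumed key
    }


# One made-up method; a real pipeline would emit one record per parsed method.
records = [
    make_method_record(
        "src/main/java/example/Foo.java",
        "example.Foo",
        "example.Foo.bar(int)",
        10,
        12,
        "int bar(int x) {\n    return x + 1;\n}",
    )
]

# Serialize the list the way a snippet.json-style file would be written.
snippet_json = json.dumps(records, indent=2)
```

The same record shape could be reused for field_snippet.json and test_snippet.json, swapping in field declarations or test methods.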
I have tried to generate the three corresponding JSON files for Defects4J and other datasets I have through my automated Python script, but have not been successful. Therefore, I would greatly appreciate it if you could share the specific method you used to create these JSON files in your research.
These files are the ones I tried to create similarly to the existing format. While they do not fully match the original data, running them through AutoFL still detected the final buggy method without issues. This example is for Lang-47.
Now, I would like to efficiently generate JSON files for other bugs and execute them through AutoFL.
Hi. Please try this: https://github.com/coinse/autofl_defects4j_data
Let me know if you encounter any issues running the script. Thanks.
First of all, thank you for your kind consideration.
After reviewing the code and README, I understood that, through Docker, you explore the data within the Defects4J dataset to extract and use the necessary information. However, I believe that AutoFL can automatically localize defects not only with the existing Defects4J data but also with other code, coverage information, and test information. Although my data is just another dataset within Defects4J, is there a way to create the intended JSON files with more limited information? If the entire Defects4J dataset is necessary to create the JSON files required for the overall AutoFL operation, I would greatly appreciate it if you could explain whether this is the original intention or if it is to train the system to effectively perform FL with minimal information on other data.
@Sungtae124 Well, the D4J makes it easier for us to extract certain data but that does not compromise the generalizability of the technique. Our intention was not to overfit to D4J, and I believe it is possible to create those JSON files for other projects outside D4J.
Dear Sungtae,
Thank you for your continued interest in our work. What @ntrolls said is accurate - AutoFL was designed to operate not only on Defects4J, but on other benchmarks as well (Indeed, we perform experiments on the Python benchmark BugsInPy). Necessarily, different circumstances lead to the development of different pipelines to provide information to LLMs. If your question is on a theoretical level, i.e. "can AutoFL work on other benchmarks than Defects4J", the answer is yes - one just needs to construct the appropriate pipeline.
On the other hand, if your question is practical, i.e. "how can I get AutoFL to work in my scenario" - I find it difficult to help you based on the given information, as neither your objective nor your problem is clear to me. It is unclear why the repository Gabin provides is not a good enough example for you to make your own JSON files. It is also unclear what technical challenges you are facing - it is difficult to answer your question, "is there a way to create the intended JSON files with more limited information?", without knowing what more limited information you are referring to. I would be happy to help if you could clarify.
Thank you for your kind response.
Using the Lang20 dataset from Defects4J as an example, my goal is to generate JSON files using only the original code, config.properties, coverage.json, npe.traces.json, tests.json, and diff_Lang-20.txt files. It is unclear whether all this data is included in the existing Defects4J dataset, so I am attaching the relevant materials.
(Please understand that I am unable to attach the properties file.)
@Sungtae124 The required JSON files can be created from the original sources, as explained by @smkang96 above (#4 (comment)). If you have coverage information for each individual test case as well as access to the source code, you can create the first three files, as they are simply lists of structural elements from the original source code - you just may have to parse the code to match them to the coverage traces. The final file, failing_test, is generated by Defects4J, but I think this is a trivial input for an FL technique as we always start from some tests failing.
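The matching step described above could look roughly like the following sketch: given parsed method ranges and the set of covered lines per file, keep the methods that contain at least one covered line. The input shapes (covered_lines as a mapping from file path to line numbers, methods as dicts with begin_line and end_line) and the function name covered_methods are hypothetical; an actual coverage.json may be organized differently.

```python
def covered_methods(methods, covered_lines):
    """Return the methods whose line range contains at least one covered line.

    methods: list of dicts with src_path, begin_line, end_line (assumed shape).
    covered_lines: dict mapping src_path to a set of covered line numbers
    (assumed shape).
    """
    selected = []
    for method in methods:
        lines = covered_lines.get(method["src_path"], set())
        if any(method["begin_line"] <= line <= method["end_line"] for line in lines):
            selected.append(method)
    return selected


# Made-up example data for illustration only.
methods = [
    {"src_path": "Foo.java", "signature": "Foo.a()", "begin_line": 5, "end_line": 9},
    {"src_path": "Foo.java", "signature": "Foo.b()", "begin_line": 11, "end_line": 20},
    {"src_path": "Bar.java", "signature": "Bar.c()", "begin_line": 3, "end_line": 4},
]
covered_lines = {"Foo.java": {6, 7}, "Bar.java": set()}

result = covered_methods(methods, covered_lines)
```

Here only Foo.a() is kept, because lines 6 and 7 fall inside its range while the other two methods have no covered lines.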
Dear Sungtae,
I have checked the files you have provided, and it seems that one could recreate the json files from the information that you have.
- Information for the snippet.json and field_snippet.json files can be populated by using the information in the coverage.json file and parsing target Java files in the original code appropriately.
- test_snippet.json can be populated using the information in your tests.json file, and again parsing the Java files in original code containing the test(s) appropriately.
- failing_tests seemingly can be constructed from npe.traces.json, which seems to contain an error stack; the error type would always be java.lang.NullPointerException I suppose based on the file name.
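For the last item, a minimal sketch is shown below, assuming each trace is already available as a dict with a test class, test method, optional message, and a list of stack frames. These key names and the function format_failing_tests are hypothetical, since the layout of npe.traces.json is not shown in this thread. The output follows the layout of the failing_tests file written by Defects4J, as far as I understand it: each entry starts with a "--- Class::method" header followed by the stack trace.

```python
def format_failing_tests(traces):
    """Render traces as failing_tests-style text.

    Each trace is a dict with test_class, test_method, frames, and an
    optional message (assumed shape, not the real npe.traces.json layout).
    """
    blocks = []
    for trace in traces:
        header = f"--- {trace['test_class']}::{trace['test_method']}"
        error = "java.lang.NullPointerException"
        if trace.get("message"):
            error += f": {trace['message']}"
        frames = [f"\tat {frame}" for frame in trace["frames"]]
        blocks.append("\n".join([header, error, *frames]))
    return "\n".join(blocks) + "\n"


# Made-up trace for illustration only.
traces = [
    {
        "test_class": "example.FooTest",
        "test_method": "testBar",
        "frames": [
            "example.Foo.bar(Foo.java:11)",
            "example.FooTest.testBar(FooTest.java:20)",
        ],
    }
]

failing_tests_text = format_failing_tests(traces)
```

The resulting string could then be written to a file named failing_tests in the project directory.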
Thus, my understanding is that provided that you build an appropriate pipeline to process the files you have, you should be able to use AutoFL. Of course, providing that specific pipeline is not within the scope of this repository, as it deals with a scenario that we the authors are unfamiliar with. Consequently, I will close this issue with this comment. If you have additional questions, feel free to leave them below.
Hello. Please try this: https://github.com/coinse/autofl_defects4j_data If you encounter any problems running the script, please let me know. Thanks.
I'm sorry to bother you. I saw that you used the file "sbfl_method_ranks_full.json". Could you please give a detailed description of how to construct this file? Looking forward to your reply. Thanks.