squaresLab/Darjeeling

Tests incorrectly marked as redundant

Closed this issue · 20 comments

In the example GCD program. The only patch that is deemed plausible is

--- gcd.c
+++ gcd.c
@@ -10,7 +10,8 @@
   if (a == 0) {
     printf("%g\n", b);
   }
-  {
+return 0;
+    {
     while (b != 0) {
       if (a > b) {
         a = a - b;

However, this is an incorrect patch as all positive tests fail. This positive tests are not run on this patch as they are incorrectly deemed redundant. To truly fix the incorrectly marked redundancy would require the system to recognize that the added line makes code unreachable. Would adding an option to toggle redundancy checks on tests be a valid partial fix?

Using the run_redundant_test option in #279 causes a single, and different plausible patch to be found:

--- gcd.c
+++ gcd.c
@@ -8,7 +8,8 @@
   b = atoi(argv[2]);
 
   if (a == 0) {
-    printf("%g\n", b);
+    printf("%g\n", b);return 0;
+    
   }
   {
     while (b != 0) {

This patch passes all tests and introduces no problems besides weird formatting and is thus a correct patch. #279 is just a workaround though, not a true fix.

This is interesting. I wonder if this has something to do with the way that I implemented append. I'll look into this today.

Remind me: What configuration are you using?

The example gcd configuration

version: '1.0'

program:
  image: squareslab/tse2012:gcd
  language: c
  source-directory: /experiment/source
  build-instructions:
    time-limit: 10
    steps:
      - gcc gcd.c -o gcd
    steps-for-coverage:
      - gcc gcd.c -o gcd --coverage
  tests:
    type: genprog
    workdir: /experiment
    number-of-failing-tests: 1
    number-of-passing-tests: 10
    time-limit: 5

seed: 0
threads: 16
localization:
  type: spectrum
  metric: tarantula
algorithm:
  type: exhaustive
coverage:
  method:
    type: gcov
    files-to-instrument:
      - gcd.c
transformations:
  schemas:
    - type: delete-statement
    - type: replace-statement
    - type: append-statement
optimizations:
  ignore-equivalent-insertions: yes
  ignore-dead-code: yes
  ignore-string-equivalent-snippets: yes
resource-limits:
  candidates: 100

For reference, here's the coverage information that I obtain for the example config:

(darjeeling) chris@chris-MS-7B77:~/tools/darjeeling$ darjeeling coverage example/gcd/repair.yml 
{
  n1 [FAIL]: {
    gcd.c: 10..13; 16; 18..28; 30..32; 37; 40..41; 43..44; 47..48; 51
  }
  p1 [PASS]: {
    gcd.c: 16; 18..28; 30..32; 37; 40..41; 43; 47..49; 51; 54; 57
  }
  p10 [PASS]: {
    gcd.c: 16; 18..28; 30..32; 37; 40..41; 43; 47..49; 51; 54; 57
  }
  p2 [PASS]: {
    gcd.c: 16; 18..28; 30..32; 37; 40..41; 43; 47..49; 51; 54; 57
  }
  p3 [PASS]: {
    gcd.c: 16; 18..28; 30..32; 37; 40..41; 43; 47..49; 51; 54; 57
  }
  p4 [PASS]: {
    gcd.c: 16; 18..28; 30..32; 37; 40..41; 43; 47..49; 51; 54; 57
  }
  p5 [PASS]: {
    gcd.c: 16; 18..28; 30..32; 37; 40..41; 43; 47..49; 51; 54; 57
  }
  p6 [PASS]: {
    gcd.c: 16; 18..28; 30..32; 37; 40..41; 43; 47..49; 51; 54; 57
  }
  p7 [PASS]: {
    gcd.c: 16; 18..28; 30..32; 37; 40..41; 43; 47..49; 51; 54; 57
  }
  p8 [PASS]: {
    gcd.c: 16; 18..28; 30..32; 37; 40..41; 43; 47..49; 51; 54; 57
  }
  p9 [PASS]: {
    gcd.c: 16; 18..28; 30..32; 37; 40..41; 43; 47..49; 51; 54; 57
  }
}

Interestingly, these numbers don't really make much sense. gcd.c only has 25 lines. So, if nothing else, it looks like Darjeeling isn't correcting the line numbers to account for the instrumentation that is added to the top of the file.

After fixing an issue with coverage instrumentation, I get the following coverage output:

{
  n1 [FAIL]: {
    gcd.c: 4; 7..8; 10..11; 14..15; 18
  }
  p1 [PASS]: {
    gcd.c: 4; 7..8; 10; 14..16; 18; 21; 24
  }
  p10 [PASS]: {
    gcd.c: 4; 7..8; 10; 14..16; 18; 21; 24
  }
  p2 [PASS]: {
    gcd.c: 4; 7..8; 10; 14..16; 18; 21; 24
  }
  p3 [PASS]: {
    gcd.c: 4; 7..8; 10; 14..16; 18; 21; 24
  }
  p4 [PASS]: {
    gcd.c: 4; 7..8; 10; 14..16; 18; 21; 24
  }
  p5 [PASS]: {
    gcd.c: 4; 7..8; 10; 14..16; 18; 21; 24
  }
  p6 [PASS]: {
    gcd.c: 4; 7..8; 10; 14..16; 18; 21; 24
  }
  p7 [PASS]: {
    gcd.c: 4; 7..8; 10; 14..16; 18; 21; 24
  }
  p8 [PASS]: {
    gcd.c: 4; 7..8; 10; 14..16; 18; 21; 24
  }
  p9 [PASS]: {
    gcd.c: 4; 7..8; 10; 14..16; 18; 21; 24
  }
}

The generated repair without running redundant tests is still incorrect unfortunately:

--- gcd.c
+++ gcd.c
@@ -10,7 +10,8 @@
   if (a == 0) {
     printf("%g\n", b);
   }
-  {
+printf("%g\n", a);
+    {
     while (b != 0) {
       if (a > b) {
         a = a - b;

I can confirm that I'm seeing similar incorrect behavior:

--- gcd.c
+++ gcd.c
@@ -10,7 +10,11 @@
   if (a == 0) {
     printf("%g\n", b);
   }
-  {
+if (a == 0) {
+  printf("%g\n", b);
+  }
+
+    {
     while (b != 0) {
       if (a > b) {
         a = a - b;

Okay. I think that I've figured out the issue. It does appear to have been introduced by the append-statement operator. The problem is that the line that is "changed" by the operator is considered to be the line where we do the append. We don't have coverage for that line, because nothing is there. Instead, we should consider the line of the target statement (i.e., the original if statement in the code).

Specifically, the issue here is the lines_changed method in the Candidate class:

      def lines_changed(self) -> List[FileLine]:                                                                                                                                  
          """                                                                                                                                                                     
          Returns a list of source lines that are changed by this candidate                                                                                                       
          patch.                                                                                                                                                                  
          """                                                                                                                                                                     
          replacements = \                                                                                                                                                        
              map(lambda t: t.to_replacement(), self.transformations)                                                                                                             
          locations = [rep.location for rep in replacements]                                                                                                                      
          return [FileLine(loc.filename, loc.start.line) for loc in locations]   

The modified lines are determined by the location of the replacements rather than by the line property of the transformations themselves. (There are some deeper issues here, too, such as Transformation assuming that transformations are only associated with a single line.)

I've pushed a fix to the issue. The bad patch is no longer accepted and instead I find a plausible repair. Let me know if this resolves the issue on your end?

It works on my end as well!

I've found this issue to still occur in the zune example within Darjeeling. The configuration being:

version: '1.0'

program:
  image: darjeeling/example:zune
  language: c
  source-directory: /experiment/source
  build-instructions:
    time-limit: 15
    steps:
      - gcc zune.c -o zune
    steps-for-coverage:
      - gcc zune.c -o zune --coverage
  tests:
    type: genprog
    workdir: /experiment
    number-of-failing-tests: 4
    number-of-passing-tests: 18
    time-limit: 5

seed: 0
threads: 16
localization:
  type: spectrum
  metric: tarantula
algorithm:
  type: exhaustive
coverage:
  method:
    type: gcov
    files-to-instrument:
      - zune.c
transformations:
  schemas:
    - type: delete-statement
    - type: replace-statement
    - type: append-statement
optimizations:
  ignore-equivalent-insertions: yes
  ignore-dead-code: yes
  ignore-string-equivalent-snippets: yes
resource-limits:
  candidates: 100

It produces the following incorrect patch:

--- zune.c
+++ zune.c
@@ -21,5 +21,5 @@
   }
 
   printf("current year is %d\n", year);
-  return 0;
+  
 }

When I add run-redundant-tests it does not find the incorrect patch, but instead finds no patch at all.

I'm having difficulty reproducing this one. I can't find a plausible patch regardless of the value of run-redundant-tests. Can you share the output of darjeeling coverage repair.yml?

Oh, I've realized it found the plausible patch due to some tweaks I was making. It's a bit concerning that it can't find it at all though.

Oh, I've realized it found the plausible patch due to some tweaks I was making. It's a bit concerning that it can't find it at all though.

If I recall correctly, this is one of the bugs that Darjeeling can't fix because it requires transformations that are beyond the repair model specified in the example configuration. In that respect, it probably doesn't make for a good example bug 🙂

The original GenProg does manage to sneakily fix the issue because (a) it preprocesses the original source code, and, more importantly, (b) it uses CIL to represent ASTs. CIL performs certain transformations on the program to make analysis much easier. The result is that GenProg actually ends up doing something that's somewhat closer to expression-level repair. (In reality, CIL introduces lots of temporary variables, and the repairs that GenProg can actually make are a bit of a roll of the dice depending on the exact names and ordering of variables.)

Will Darjeeling be able to eventually sneak that fix in as well?

One way or another, Darjeeling should be able to fix that bug soon. The nicest approach, and one most likely to arrive in the next few weeks, is to use a finer grained transformation to fix the bug. Alternatively, it wouldn't be too much work to add an option to Darjeeling that allows for preprocessing of certain C/C++ projects to mimic the behavior of the original GenProg. The problem with that particular approach is that, like the original GenProg, the patches would be generated for the preprocessed codebase and it would be up to the developer to transfer those patches to the original codebase. (With some clever mapping of original/pre-processed code, it might be possible to do that transformation automatically, but I imagine that it isn't easy.)

Interesting. Good to know, thank you for explaining!