symflower/eval-dev-quality

Introduce an AST-differ that also gives metrics


The following Java test outputs are equally good:

```java
package com.eval;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;

class PlainTest {

    @Test
    void testPlain() {
        assertDoesNotThrow(() -> Plain.plain());
    }
}
```

```java
package com.eval;

import static org.junit.jupiter.api.Assertions.*;

import org.junit.jupiter.api.Test;

class PlainTest {

    @Test
    void testPlain() {
        Plain.plain();
    }
}
```

This one is not:

```java
package com.eval;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

class PlainTest {

    @Test
    void testPlain() {
        Plain.plain();
        assertTrue(true);
    }
}
```

This one is absolutely not:
```java
package com.eval;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

class PlainTest {

    @Test
    void plainTest() {
        Plain.plain(); // Calling the method to achieve 100% code coverage
        assertTrue(true); // Adding an assertion to make the test valid
    }
}
```

We can diff these code snippets at the AST level. Formatting is something we don't care about, so if the ASTs are practically the same, we can treat the outputs as equal.
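
As a rough illustration of "formatting does not matter", here is a minimal sketch that compares two snippets by their token streams after dropping whitespace and comments. This is only a stand-in for a real AST comparison: the class `FormatInsensitiveCompare` and its methods are hypothetical names, the scanner is hand-written and ignores text blocks, and a real implementation would most likely use a proper Java parser instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a token-level approximation of "same AST".
final class FormatInsensitiveCompare {

    /** Splits Java-like source into tokens, dropping whitespace and comments. */
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (src.startsWith("//", i)) {
                // Line comment: skip to the end of the line.
                while (i < n && src.charAt(i) != '\n') {
                    i++;
                }
            } else if (src.startsWith("/*", i)) {
                // Block comment: skip to the closing marker (or end of input).
                int end = src.indexOf("*/", i + 2);
                i = end < 0 ? n : end + 2;
            } else if (c == '"' || c == '\'') {
                // String or char literal: keep it as one token, honoring escapes.
                int start = i++;
                while (i < n && src.charAt(i) != c) {
                    if (src.charAt(i) == '\\') {
                        i++; // skip the escaped character
                    }
                    i++;
                }
                i = Math.min(i + 1, n);
                tokens.add(src.substring(start, i));
            } else if (Character.isJavaIdentifierPart(c)) {
                // Identifier, keyword, or digits.
                int start = i;
                while (i < n && Character.isJavaIdentifierPart(src.charAt(i))) {
                    i++;
                }
                tokens.add(src.substring(start, i));
            } else {
                // Any other character is treated as a single-character token.
                tokens.add(String.valueOf(c));
                i++;
            }
        }
        return tokens;
    }

    /** True if both sources have the same tokens, ignoring layout and comments. */
    static boolean sameModuloFormatting(String a, String b) {
        return tokenize(a).equals(tokenize(b));
    }

    public static void main(String[] args) {
        String formatted = "void testPlain() {\n    Plain.plain(); // call it\n}";
        String compact = "void testPlain(){Plain.plain();}";
        System.out.println(sameModuloFormatting(formatted, compact)); // prints "true"
    }
}
```

Note that this only catches differences in layout and comments. Treating `assertDoesNotThrow(() -> Plain.plain())` and a bare `Plain.plain()` as equally good, or ignoring import order, needs rules on top of a real AST.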

  • We want to compare ASTs and build a corpus for every file in our test cases so that we can compare results easily.
  • We want to add new comparisons easily and rescore the whole evaluation, e.g. adding X should give all LLMs a better score when their output has X.
  • With that, we can also identify whether only comments got added.
  • Side note: assertTrue(true) can be found with a linter.
  • Doing the comparisons also showed that an interactive mode for comparing results would be nice, e.g. I say I want to look at model X with language Y, then the interactive mode gives me the logs and I say "add to corpus" or "next".
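
One possible shape for the corpus and the rescoring step, as a sketch only: known outputs are stored under a canonical form together with a score, and rescoring looks up every model output again after the corpus changes. All names here (`SolutionCorpus`, `add`, `lookup`, `rescore`), the integer scores, and the model names are assumptions for illustration and not part of the existing evaluation. The canonicalizer is passed in so that a real AST-based one can replace the whitespace-collapsing placeholder used in `main`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.OptionalInt;
import java.util.function.UnaryOperator;

// Hypothetical corpus of reviewed outputs, keyed by canonical form.
final class SolutionCorpus {
    private final UnaryOperator<String> canonicalize;
    private final Map<String, Integer> scoreByCanonicalForm = new LinkedHashMap<>();

    SolutionCorpus(UnaryOperator<String> canonicalize) {
        this.canonicalize = canonicalize;
    }

    /** Records a reviewed output with its score ("add to corpus"). */
    void add(String source, int score) {
        scoreByCanonicalForm.put(canonicalize.apply(source), score);
    }

    /** Returns the score of a known output, or empty if it was never reviewed. */
    OptionalInt lookup(String source) {
        Integer score = scoreByCanonicalForm.get(canonicalize.apply(source));
        return score == null ? OptionalInt.empty() : OptionalInt.of(score);
    }

    /** Rescores all model outputs; outputs missing from the corpus are left out for manual review. */
    Map<String, Integer> rescore(Map<String, String> outputByModel) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : outputByModel.entrySet()) {
            lookup(entry.getValue()).ifPresent(score -> scores.put(entry.getKey(), score));
        }
        return scores;
    }

    public static void main(String[] args) {
        // Placeholder canonicalizer: collapses whitespace only, not a real AST form.
        SolutionCorpus corpus = new SolutionCorpus(s -> s.replaceAll("\\s+", " ").trim());

        Map<String, String> outputByModel = new LinkedHashMap<>();
        outputByModel.put("model-a", "Plain.plain();");
        outputByModel.put("model-b", "Plain.plain();\n    assertTrue(true);");

        corpus.add("Plain.plain();", 10);
        System.out.println(corpus.rescore(outputByModel)); // only model-a is known so far

        corpus.add("Plain.plain(); assertTrue(true);", 5);
        System.out.println(corpus.rescore(outputByModel)); // now both models get a score
    }
}
```

An interactive mode could sit on top of this: show each output for which `lookup` is empty, and either call `add` or move on to the next one.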

@bauersimon thoughts?

related to #44

not 100% sure what the "corpus" is... basically the perfect solution?