Introduce an AST-differ that also gives metrics
Opened this issue · 3 comments
zimmski commented
The following Java test outputs are equally good:
```java
package com.eval;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;

class PlainTest {
    @Test
    void testPlain() {
        assertDoesNotThrow(() -> Plain.plain());
    }
}
```

```java
package com.eval;

import static org.junit.jupiter.api.Assertions.*;
import org.junit.jupiter.api.Test;

class PlainTest {
    @Test
    void testPlain() {
        Plain.plain();
    }
}
```
This is not:

```java
package com.eval;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

class PlainTest {
    @Test
    void testPlain() {
        Plain.plain();
        assertTrue(true);
    }
}
```
This absolutely is not:
```java
package com.eval;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

class PlainTest {
    @Test
    void plainTest() {
        Plain.plain(); // Calling the method to achieve 100% code coverage
        assertTrue(true); // Adding an assertion to make the test valid
    }
}
```
We can diff these files on the AST level. We don't care about formatting, so if the ASTs are practically the same, we can say the files are equal.
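A "practically the same" check could be sketched as follows. Note that this is not a true AST diff: without a parser it only normalizes away formatting, comments and import order, so semantically equivalent but textually different tests (like the `assertDoesNotThrow` variant above) would still need a real tree comparison. All class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of a "practically equal" check. This is not a real AST diff
// (no parser is involved) but a token-level normalization that ignores
// formatting, comments and import order. A real implementation would
// parse both files and compare the resulting trees.
public class TestEquivalence {

    // Remove comments, sort imports, and collapse whitespace so that
    // purely cosmetic differences disappear.
    public static String normalize(String source) {
        String s = source.replaceAll("(?s)/\\*.*?\\*/", " "); // block comments
        s = s.replaceAll("//[^\n]*", " ");                    // line comments

        List<String> imports = new ArrayList<>();
        StringBuilder rest = new StringBuilder();
        for (String line : s.split("\n")) {
            String t = line.trim();
            if (t.isEmpty()) continue;
            if (t.startsWith("import ")) {
                imports.add(t); // collect imports so their order is ignored
            } else {
                rest.append(t).append('\n');
            }
        }
        Collections.sort(imports);

        String body = rest.toString().replaceAll("\\s+", " ").trim();
        return String.join("\n", imports) + "\n" + body;
    }

    public static boolean practicallyEqual(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        String a = "package com.eval;\n"
                + "import org.junit.jupiter.api.Test;\n"
                + "class PlainTest { @Test void testPlain() { Plain.plain(); } }";
        String b = "package com.eval;\n"
                + "// call the method to achieve coverage\n"
                + "import org.junit.jupiter.api.Test;\n"
                + "class PlainTest {\n"
                + "    @Test\n"
                + "    void testPlain() {\n"
                + "        Plain.plain();\n"
                + "    }\n"
                + "}";
        System.out.println(practicallyEqual(a, b)); // prints "true"
    }
}
```

The normalized form could also serve as the key into the per-file corpus: two outputs landing on the same key need no further manual review.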
- We want to compare ASTs and build a corpus for every file in our test cases so we can compare easily
- We want to add new comparisons easily and redo the scoring of the whole evaluation, e.g. adding X should give all LLMs a better score when they have X
- With that we can also identify if only comments got added
- Sidenote: `assertTrue(true)` can be found with a linter
- Doing the comparisons also showed that an interactive mode for comparing results would be nice, e.g. I say I want to look at model X with language Y, then the interactive mode gives me the logs and I say "add to corpus" or "next"
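The `assertTrue(true)` case from the sidenote can indeed be caught without an AST, e.g. by a simple pattern-based lint pass. The pattern and class name below are illustrative only, not from any existing linter:

```java
import java.util.regex.Pattern;

// Toy linter check for trivially-true assertions such as assertTrue(true).
// A real linter would match on the AST instead of raw text, but a regex
// is enough to flag this particular pattern.
public class TrivialAssertionLint {

    private static final Pattern TRIVIAL =
            Pattern.compile("assertTrue\\s*\\(\\s*true\\s*\\)");

    public static boolean hasTrivialAssertion(String source) {
        return TRIVIAL.matcher(source).find();
    }

    public static void main(String[] args) {
        System.out.println(hasTrivialAssertion("assertTrue(true);"));    // prints "true"
        System.out.println(hasTrivialAssertion("assertTrue(x > 0);"));   // prints "false"
    }
}
```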
zimmski commented
@bauersimon thoughts?
bauersimon commented
related to #44
bauersimon commented
not 100% sure what the "corpus" is... basically the perfect solution?