vcu-swim-lab/KnowHows

use SrcML for parsing diffs

Closed this issue · 5 comments

Try parsing diffs with srcML first. If that doesn't work, we should probably parse entire files with srcML and then correlate the result to the diff (or figure out what to do with diffs in general).

SrcML.NET is out there, but it's not maintained and it relies on an old version of srcML that uses two executables. I think we can accomplish what we need with a few custom external calls to the current version.

I agree. Let's use the latest version of srcML.

We can now generate srcML documents of type XDocument using the functions in c05d7b6. srcML must be in your PATH. TODO:

  • Throw exception or gracefully fail when srcML is not present or returns an error.
  • Filter to only parse additions on diffs.
  • Determine what's relevant from a returned srcML document.
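For the first TODO item, here is a rough Python sketch of what failing gracefully could look like. It is not the code from c05d7b6; `SrcMLError` and `parse_with_srcml` are hypothetical names, and the only srcML behavior it assumes is that `srcml --position <file>` writes the srcML XML to stdout.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET


class SrcMLError(RuntimeError):
    """Hypothetical error type: srcML is missing or exited with an error."""


def parse_with_srcml(path, executable="srcml"):
    """Run srcML on a source file and return the root XML element."""
    # Fail early with a clear message if srcML is not on PATH.
    if shutil.which(executable) is None:
        raise SrcMLError(f"'{executable}' was not found on PATH")

    # Capture raw bytes so the XML parser can honor the declared encoding.
    result = subprocess.run([executable, "--position", path], capture_output=True)
    if result.returncode != 0:
        message = result.stderr.decode(errors="replace").strip()
        raise SrcMLError(f"srcML exited with code {result.returncode}: {message}")

    return ET.fromstring(result.stdout)
```

Callers can then catch `SrcMLError` in one place and decide whether to skip the file or abort the whole commit.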

After trying different diffs, I think we will run into problems trying to parse them in isolation. You can't count on the context a diff provides being enough to parse it. We should instead parse the patched file for each commit with srcML using the --position flag, which makes it possible to correlate the result with the diffs by line and column number. Some pseudocode:

For each file in commit_files
    keyword_list = []
    srcMLdoc = raw_url parsed with srcML
    Filter to only include useful nodes
    For each @@ ---- @@ diff block in patch
        filtered_diff = filtered for additions
        For each line in filtered_diff
            Find nodes matching pos:line in srcMLdoc
            Add to keyword_list
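
The innermost step of the pseudocode above (find nodes matching a line, add them to the keyword list) might look roughly like this in Python. This is only a sketch under assumptions: it assumes positions appear as a `pos:line` attribute in the `http://www.srcML.org/srcML/position` namespace, and that `name` elements are the "useful" nodes. The attribute format may differ between srcML versions, and `keywords_on_lines` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for srcML elements and position attributes.
SRC_NS = "http://www.srcML.org/srcML/src"
POS_NS = "http://www.srcML.org/srcML/position"


def keywords_on_lines(root, added_lines):
    """Collect the text of <name> nodes whose pos:line is in added_lines."""
    line_attr = f"{{{POS_NS}}}line"
    keywords = []
    for node in root.iter(f"{{{SRC_NS}}}name"):
        line = node.get(line_attr)
        text = (node.text or "").strip()
        # Compound names wrap other <name> nodes and carry no text of
        # their own, so they are skipped here; their children still match.
        if line is not None and int(line) in added_lines and text:
            keywords.append(text)
    return keywords
```

Here `added_lines` would be the set of line numbers that a diff block reports as additions for the file.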

We can now process the hunk blocks in unified diffs to get the added lines for each file. When processing files, we should check the status of each file in the commit and only process modified and created files, since removed files are irrelevant. The next step is correlating those line numbers with the full files parsed by srcML to pick out the values we want.
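
For reference, the hunk processing described above can be sketched in Python as follows. This is not the project's implementation. It assumes the patch text for a single file contains only hunks (no `---`/`+++` file headers after the first hunk), and the status strings `"added"` and `"modified"` are assumptions standing in for whatever values the commit data actually uses. `added_lines` and `should_process` are hypothetical names.

```python
import re

# Matches a unified diff hunk header such as "@@ -10,7 +12,9 @@"
# and captures the starting line number in the new file.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(patch):
    """Map new-file line numbers to the text of each added line."""
    added = {}
    new_line = None  # current line number in the patched file
    for line in patch.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            new_line = int(header.group(1))
        elif new_line is None:
            continue  # anything before the first hunk header
        elif line.startswith("+"):
            added[new_line] = line[1:]
            new_line += 1
        elif line.startswith("-") or line.startswith("\\"):
            continue  # removals and "\ No newline" markers do not advance
        else:
            new_line += 1  # context line present in the new file
    return added


def should_process(status):
    """Only modified and newly created files are worth parsing."""
    return status in {"added", "modified"}
```

The keys of the returned dictionary are the line numbers to look up in the position-annotated srcML document.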