Use srcML for parsing diffs
Closed this issue · 5 comments
Try parsing diffs with srcML first. If that doesn't work, parse entire files with srcML and then correlate the result to the diff (or figure out what to do with diffs in general).
SrcML.NET exists, but it's not maintained and it uses an old version of srcML that relies on two executables. I think we can accomplish what we need with a few custom external calls to the current version.
I agree. Let's use the latest version of srcML.
We can now generate srcML documents of type XDocument using the functions in c05d7b6. srcML must be in your PATH. TODO:
- Throw exception or gracefully fail when srcML is not present or returns an error.
- Filter to only parse additions on diffs.
- Determine what's relevant from a returned srcML document.
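For the first TODO item, here is a minimal sketch of a guarded external call. It is written in Python purely for illustration and is not the code from c05d7b6. The names `run_srcml` and `SrcMLError` are hypothetical, and it assumes the executable is named `srcml`, prints the srcML document to standard output when given a source file, and accepts the `--position` flag.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET


class SrcMLError(RuntimeError):
    """Raised when srcML is missing, fails, or returns unparsable output."""


def run_srcml(source_path, executable="srcml", with_position=True):
    """Run srcML on one file and return the parsed XML root element.

    Assumption: `srcml <file>` writes the srcML document to stdout.
    """
    exe = shutil.which(executable)
    if exe is None:
        # srcML is not on PATH: fail with a clear message, not a raw OSError.
        raise SrcMLError(f"{executable!r} was not found on PATH")

    cmd = [exe, source_path]
    if with_position:
        cmd.append("--position")

    result = subprocess.run(cmd, capture_output=True)
    if result.returncode != 0:
        stderr = result.stderr.decode("utf-8", errors="replace").strip()
        raise SrcMLError(f"srcML exited with code {result.returncode}: {stderr}")

    try:
        return ET.fromstring(result.stdout)
    except ET.ParseError as exc:
        raise SrcMLError(f"srcML output was not valid XML: {exc}") from exc
```

Callers can catch `SrcMLError` and skip the file, which covers the "gracefully fail" option; letting it propagate covers the "throw exception" option.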
After trying different diffs, I think we will run into problems trying to parse them in isolation, because you can't count on the context they provide. We should instead parse the patched file for each commit with srcML using the --position flag, which makes it possible to correlate the result with the diffs by line and column number. Some pseudocode:
For each file in commit_files
    keyword_list = []
    srcMLdoc = raw_url parsed with srcML
    Filter to only include useful nodes
    For each @@ ---- @@ diff block in patch
        filtered_diff = filtered for additions
        For each line in filtered_diff
            Find nodes matching pos:line in srcMLdoc
            Add to keyword_list
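The inner lookup in the pseudocode might look roughly like the Python sketch below. It rests on several assumptions: the position namespace URI is `http://www.srcML.org/srcML/position`, nodes carry either a `pos:line` attribute (as in the pseudocode) or a `pos:start="line:column"` attribute (which newer srcML releases appear to emit), and "useful nodes" means `name` elements only. `keywords_on_lines` and `node_line` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for srcML position attributes.
POS_NS = "http://www.srcML.org/srcML/position"


def node_line(node):
    """Return the 1-based source line of a node, or None if it has no position."""
    line = node.get(f"{{{POS_NS}}}line")
    if line is None:
        start = node.get(f"{{{POS_NS}}}start")  # assumed "line:column" form
        if start:
            line = start.split(":")[0]
    return int(line) if line else None


def keywords_on_lines(root, added_lines, useful_tags=("name",)):
    """Collect the text of useful nodes that start on one of the added lines."""
    keywords = []
    for node in root.iter():
        tag = node.tag.rsplit("}", 1)[-1]  # strip the "{namespace}" prefix
        if tag not in useful_tags:
            continue
        text = (node.text or "").strip()
        if text and node_line(node) in added_lines:
            keywords.append(text)
    return keywords


# Hand-written sample standing in for real srcML output.
SAMPLE = """<unit xmlns="http://www.srcML.org/srcML/src"
      xmlns:pos="http://www.srcML.org/srcML/position">
  <name pos:line="1" pos:column="5">foo</name>
  <name pos:line="2" pos:column="5">bar</name>
  <name pos:start="3:1" pos:end="3:3">baz</name>
</unit>"""

print(keywords_on_lines(ET.fromstring(SAMPLE), {2, 3}))  # ['bar', 'baz']
```

Which tags count as useful is still the open TODO item, so `useful_tags` is left as a parameter rather than fixed.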
We can now process hunk blocks in unified diffs to get the line additions for files. When processing files, we should check the status for each file in the commit and only process modified and created files, as removed files are irrelevant. The next step is correlating line numbers with full files parsed by srcML to pick out the values we want.
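As a rough illustration of the hunk processing described above (a Python sketch, not the project's implementation), the function below walks a single-file unified diff and records the new-file line number of each addition. It assumes the patch text contains only hunks for one file, with no `---`/`+++` file headers after the first hunk. The status strings and the dictionary keys `status` and `patch` are taken from the wording in this thread and are assumptions about the commit data, and `added_lines` and `files_to_process` are hypothetical names.

```python
import re

# Matches a hunk header such as "@@ -10,7 +12,9 @@" and captures the new-file start line.
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(patch):
    """Map new-file line numbers to the text of lines added by the patch."""
    added = {}
    new_line = None
    for line in patch.splitlines():
        match = HUNK_RE.match(line)
        if match:
            new_line = int(match.group(1))
            continue
        if new_line is None:
            continue  # ignore anything before the first hunk header
        if line.startswith("+"):
            added[new_line] = line[1:]
            new_line += 1
        elif line.startswith("-") or line.startswith("\\"):
            continue  # removals and "\ No newline" markers do not advance the new file
        else:
            new_line += 1  # context line
    return added


def files_to_process(commit_files, statuses=("modified", "created")):
    """Keep only files whose status is worth parsing (removed files are skipped)."""
    return [f for f in commit_files if f.get("status") in statuses]


patch = "@@ -1,3 +1,4 @@\n context\n-old\n+new\n+extra\n tail\n"
print(added_lines(patch))  # {2: 'new', 3: 'extra'}
```

The keys of the returned dictionary are the line numbers to look up in the srcML document for the patched file.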