cs50/compare50

In match_#.html, distro code is sometimes highlighted as an Exact match and not filtered out

dmalan opened this issue · 3 comments

Tricky one there. In short, the current noise threshold is too high for function definitions. These definitions carry relatively few tokens (especially in speller) and are, by compare50's current standards, not big enough to be relevant. We could lower the noise threshold and see how that performs here, or allow the distribution code comparison to use a lower noise threshold than the actual comparison.

Full story: to exclude distro code from student code, compare50 uses the same method it uses for comparison. Namely, it breaks up the file into k-grams, short sequences of tokens. If any such sequence from the distro code matches a sequence in the student's file, that sequence gets removed/ignored. The problem here lies in the "noise" threshold, which is effectively the length of the sequences. If this length is too short, almost everything will match, but if it's too long, almost nothing will match. Through experimentation we landed on the "magic number" 25 (tokens):

comparator = comparators.Winnowing(k=25, t=35)
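
To make the threshold problem concrete, here is a rough sketch of k-gram based distro filtering. It is not compare50's actual code: the names kgrams and distro_token_mask are made up for illustration, tokens are plain strings, and the hashing and fingerprint selection that winnowing does on top are left out. It only shows why a short function definition slips through when k is larger than the definition itself.

```python
def kgrams(tokens, k):
    """Return the set of all k-token windows in a token sequence."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def distro_token_mask(student, distro, k):
    """Flag each student token covered by a k-gram that also occurs in distro.

    Hypothetical helper: flagged tokens are the ones that would be ignored.
    """
    distro_grams = kgrams(distro, k)
    mask = [False] * len(student)
    for i in range(len(student) - k + 1):
        if tuple(student[i:i + k]) in distro_grams:
            for j in range(i, i + k):
                mask[j] = True
    return mask


# A distro stub and a student version that keeps the signature but changes the body.
signature = ["bool", "check", "(", "const", "char", "*", "word", ")"]
distro = signature + ["{", "return", "false", ";", "}"]
student = signature + ["{", "return", "true", ";", "}"]

# With k=25 the whole function is shorter than one k-gram, so nothing is filtered.
mask_k25 = distro_token_mask(student, distro, k=25)
# With a small k the shared signature is recognized as distro code.
mask_k5 = distro_token_mask(student, distro, k=5)
```

With k=25 none of the 13 student tokens are flagged, so the signature stays in and can later show up as a match between two students. With k=5 the signature is flagged, at the cost of more accidental matches elsewhere, which is the trade-off described above.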

Hm, here too could we do more thorough comparisons after the initial filtration, such that we re-examine all ~50 matches, diff out distro code, then exact-match other lines before sending to the GUI? Probably pretty fast for just 50 pairs?

That would essentially be a new method, exact-by-line. But it would be interesting to try out, given that for text/exact, lines are somewhat of a logical unit of information. Small gotchas though, perhaps, with this technique: there are always uninteresting lines in code:

{
}
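
A minimal sketch of what such an exact-by-line pass could look like, assuming we normalize whitespace, drop lines that also appear in the distro, and skip lines made up only of braces, parentheses and semicolons. None of this exists in compare50 as far as this thread goes; interesting_lines, exact_line_matches and the TRIVIAL pattern are hypothetical names, and which lines count as "uninteresting" is a guess.

```python
import re

# Assumption: a line with nothing but whitespace, braces, parens or semicolons is noise.
TRIVIAL = re.compile(r"^[\s{}();]*$")


def interesting_lines(source, distro_lines=frozenset()):
    """Return the normalized lines of source that are neither trivial nor distro."""
    lines = set()
    for raw in source.splitlines():
        line = " ".join(raw.split())  # collapse runs of whitespace
        if TRIVIAL.match(line) or line in distro_lines:
            continue
        lines.add(line)
    return lines


def exact_line_matches(sub_a, sub_b, distro):
    """Lines shared verbatim by two submissions after diffing out distro lines."""
    distro_lines = interesting_lines(distro)
    return interesting_lines(sub_a, distro_lines) & interesting_lines(sub_b, distro_lines)


distro = "bool check(const char *word)\n{\n    return false;\n}\n"
sub_a = "bool check(const char *word)\n{\n    int h = hash(word);\n    return table[h] != NULL;\n}\n"
sub_b = "bool check(const char *word)\n{\n  int h = hash(word);\n  return false;\n}\n"

shared = exact_line_matches(sub_a, sub_b, distro)
```

Here only the line `int h = hash(word);` survives as a match: the signature is removed because it is in the distro, and the lone braces never count. Run on just the ~50 top pairs, a set intersection like this should be cheap, though it loses line positions, which the GUI would need for highlighting.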