jpeddicord/askalono

How much should whitespace matter?


This is a deeply open-ended question. The attached file, LICENSE-RIGHTCLICK.txt, is a direct copy of https://spdx.org/licenses/AGPL-1.0.html made "by hand" (right-click, copy, paste), yet askalono id scores it at only 0.999 instead of the 1.0 you get by printing the license text extracted from the JSON.

It's not clear to me which is the canonical version and thus which is (arguably) a license violation. It's also not clear to me that askalono should fudge the line breaks here. It's also not clear to me that askalono should NOT fudge the line breaks here.

I can't find an option in the library API to "massage newline differences like this one". I think adding one might be worth it, as an option layered on top of the existing behavior where the return value is a ratio reflecting how well the text scores as a match.
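
To make the idea concrete, here is a rough sketch of what such an option could look like. It is in Python rather than Rust, it uses the standard library's `difflib` as a stand-in scorer, and the names `collapse_whitespace`, `similarity`, and the `normalize` flag are all hypothetical. None of this is askalono's actual API or scoring algorithm, and the sample strings are made up; they are not the AGPL text.

```python
import re
from difflib import SequenceMatcher


def collapse_whitespace(text: str) -> str:
    """Replace every run of spaces, tabs, and newlines with a single space."""
    return re.sub(r"\s+", " ", text).strip()


def similarity(a: str, b: str, normalize: bool = False) -> float:
    """Return a match ratio in [0, 1]; optionally ignore whitespace layout.

    Stand-in scorer only: difflib is NOT what askalono uses internally.
    """
    if normalize:
        a, b = collapse_whitespace(a), collapse_whitespace(b)
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Same words, different line breaks (as a copy-paste from HTML might produce).
wrapped = "This program is free software;\nyou may share it\n\nunder these terms."
flowed = "This program is free software; you may share it under these terms."

raw_score = similarity(wrapped, flowed)
normalized_score = similarity(wrapped, flowed, normalize=True)
print(raw_score, normalized_score)
```

With the flag off, the two strings score slightly below 1.0 purely because of the line breaks. With it on, they compare as identical. Keeping it opt-in would let callers who care about exact layout keep the stricter score.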

That said, the original issue seems to be a problem in the underlying data: SPDX's HTML and JSON renderings differ subtly in how they emit spaces.