Feature request: add support for multi-license texts to license comparison logic
pmonks opened this issue · 18 comments
I was recently evaluating org.spdx.utility.compare.LicenseCompareHelper
, and noticed that it doesn't detect any licenses or exceptions in a text that contains multiple licenses and/or exceptions. Specifically, I was testing with the license text from the JavaMail project which is CDDL-1.1 OR GPL-2.0 WITH Classpath-exception-2.0
, and which I would expect to find a match for all three SPDX identifiers in that expression (albeit I wasn't expecting it to construct a license expression, though that would be very nice too!).
More specifically, I expected the following code to print false
three times, but instead it prints true
three times:
String text = /* Code to read https://raw.githubusercontent.com/javaee/javamail/master/LICENSE.txt */
System.out.println(LicenseCompareHelper.isTextStandardLicense(text, LicenseInfoFactory.getListedLicenseById("CDDL-1.1")).isDifferenceFound());
System.out.println(LicenseCompareHelper.isTextStandardLicense(text, LicenseInfoFactory.getListedLicenseById("GPL-2.0")).isDifferenceFound()); // Perhaps this should be GPL-2.0-only?
System.out.println(LicenseCompareHelper.isTextStandardException(text, LicenseInfoFactory.getListedExceptionById("Classpath-exception-2.0")).isDifferenceFound());
(apologies for any syntax errors in this code - this is hand translated from the actual Clojure code I wrote)
Is matching of multi-license texts something that's in scope for the project, and if so might this be a potential new feature?
I was recently evaluating
org.spdx.utility.compare.LicenseCompareHelper
, and noticed that it doesn't detect any licenses or exceptions in a text that contains multiple licenses and/or exceptions.
You're correct - the LicenseCompareHelper
is designed to report a match using the license matching guidelines for the entire license text parameter.
I wouldn't recommend using this method to "find" license text within a larger body of text - it is actually very inefficient at that task.
In my applications, I use other algorithms to find potential license matching (e.g. using the Sørensen Dice coefficient) then passing the suspected text to the LicenseCompareHelper
to do the more rigorous (and expensive) matching.
Is matching of multi-license texts something that's in scope for the project, and if so might this be a potential new feature?
I think this would be valuable. There's probably 2 different features we could implement:
isStandardLicenseWithinText
(or something a bit better named) which would report true if a listed license text is found anywhere within the text passed into the method. This is what you were expecting in the example above. Note - this is actually implemented to some degree in the LicenseListPublisher Match.java code.getAllListedLicenseWithText
(or something a bit better named) which would return a list of SPDX license ID's for all licenses found within a stream of text. The design could be similar to what I've done in other products.
There is a 3rd feature we could consider, but I think it would be very challenging to implement:
- a method which would return an appropriate license expression for text containing multiple listed licenses.
I probably won't have much time to work on this in the near term, but I would welcome any pull requests to add the functionality.
@goneall I'm terribly sorry to have potentially wasted your time, but is org.spdx.utility.compare.LicenseCompareHelper.matchingStandardLicenseIds()
already mostly doing this (albeit it doesn't appear to check for standard exceptions, something I might be able to add in a PR)?
@pmonks Good point on the standard exceptions - a PR for this would be much appreciated.
@goneall I'll take a look at what's involved in a PR and hopefully get something for you to review in the next week or two.
And just to confirm - my impression that org.spdx.utility.compare.LicenseCompareHelper.matchingStandardLicenseIds()
does what I was looking for in the original feature request (minus exceptions), is correct?
[edit] to partially answer my own question, when tested with the JavaMail license text, this method returns an empty array, so perhaps it isn't what I'm looking for.
I took a swing at an implementation of isStandardLicenseWithinText
here, but it isn't finding CDDL-1.1 in the JavaMail license text either, and I don't think that's due to an issue in the new code. It's been almost a decade since I did any Java in anger though, so would love another set of eyeballs to take a look and see if I'm missing anything obvious, if that's not too onerous of a request?
I took a look at the code and found one problem in the readTextFromURL(String url)
method. If you replace the code with the following, it will read the text correctly:
try {
BufferedReader rdr = new BufferedReader(new InputStreamReader(conn.getInputStream(), encoding));
String line = rdr.readLine();
while (line != null) {
sb.append(line);
sb.append('\n');
line = rdr.readLine();
}
}
finally {
conn.getInputStream().close();
}
After fixing the readUrl, it is still not recognizing the license.
It is failing the pattern matching in line 792. Here's the pattern the code compiled for the match:
\Q1.\E\s*\QDefinitions.\E\s*\Q1.1.\E\s*\Q"Contributor"\E\s*\Qmeans\E\s*\Qeach\E\s*\Qindividual\E\s*\Qor\E\s*\Qentity\E\s*\Qthat\E\s*\Qcreates\E\s*\Qor\E\s*\Qcontributes\E\s*\Qto\E\s*\Qthe\E\s*\Qcreation\E\s*\Qof\E\s*\QModifications.\E\s*\Q1.2.\E\s*\Q"Contributor\E\s*\QVersion"\E\s*\Qmeans\E\s*\Qthe\E\s*\Qcombination\E\s*\Qof\E\s*\Qthe\E\s*\QOriginal\E\s*\QSoftware,\E\s*\Qprior\E\s*\QModifications\E\s*\Qused\E\s*\Qby\E\s*\Qa\E\s*\QContributor\E\s*\Q(if\E\s*\Qany),\E\s*\Qand\E\s*\Qthe\E\s*\QModifications\E\s*\Qmade\E\s*\Qby\E\s*\Qthat\E\s*\Qparticular\E\s*\QContributor.\E\s*\Q1.3.\E\s*\Q"Covered\E\s*\QSoftware"\E\s*\Qmeans\E\s*\Q(a)\E\s*\Qthe\E\s*\QOriginal\E\s*\QSoftware,\E\s*\Qor\E\s*\Q(b)\E\s*\QModifications,\E\s*\Qor\E\s*\Q(c)\E\s*\Qthe\E\s*\Qcombination\E\s*\Qof\E\s*\Qfiles\E\s*\Qcontaining\E\s*\QOriginal\E\s*\QSoftware\E\s*\Qwith\E\s*\Qfiles\E\s*\Qcontaining\E\s*\QModifications,\E\s*\Qin\E\s*\Qeach\E\s*\Qcase\E\s*\Qincluding\E\s*\Qportions\E\s*\Qthereof.\E\s*\Q1.4.\E\s*\Q"Executable"\E\s*\Qmeans\E\s*\Qthe\E\s*\QCovered\E\s*\QSoftware\E\s*.*.*\Qbe\E\s*\Qdeemed\E\s*\Qto\E\s*\Qconstitute\E\s*\Qany\E\s*\Qadmission\E\s*\Qof\E\s*\Qliability.\E\s*\QNOTICE\E\s*\QPURSUANT\E\s*\QTO\E\s*\QSECTION\E\s*\Q9\E\s*\QOF\E\s*\QTHE\E\s*\QCOMMON\E\s*\QDEVELOPMENT\E\s*\QAND\E\s*\QDISTRIBUTION\E\s*\QLICENSE\E\s*\Q(CDDL)\E\s*\QThe\E\s*\Qcode\E\s*\Qreleased\E\s*\Qunder\E\s*\Qthe\E\s*\QCDDL\E\s*\Qshall\E\s*\Qbe\E\s*\Qgoverned\E\s*\Qby\E\s*\Qthe\E\s*\Qlaws\E\s*\Qof\E\s*\Qthe\E\s*\QState\E\s*\Qof\E\s*\QCalifornia\E\s*\Q(excluding\E\s*\Qconflict-of-law\E\s*\Qprovisions).\E\s*\QAny\E\s*\Qlitigation\E\s*\Qrelating\E\s*\Qto\E\s*\Qthis\E\s*\QLicense\E\s*\Qshall\E\s*\Qbe\E\s*\Qsubject\E\s*\Qto\E\s*\Qthe\E\s*\Qjurisdiction\E\s*\Qof\E\s*\Qthe\E\s*\QFederal\E\s*\QCourts\E\s*\Qof\E\s*\Qthe\E\s*\QNorthern\E\s*\QDistrict\E\s*\Qof\E\s*\QCalifornia\E\s*\Qand\E\s*\Qthe\E\s*\Qstate\E\s*\Qcourts\E\s*\Qof\E\s*\Qthe\E\s*\QState\E\s*\Qof\E\s*\QCalifornia,\E\s*\Qwith\E\s*\Qvenue\E\s*\Qlying\E\s*\Qin\E\s*\QSanta\E\s*\QClara\E\s*\QCounty,\E\s*\QCalifornia.\E\s*
To understand why it is failing, we could analyze the above Regex and find out why it's not matching the JavaMail text.
I took a look at the code and found one problem in the
readTextFromURL(String url)
method.
Good catch! I was missing a cast - now fixed in my fork.
To understand why it is failing, we could analyze the above Regex and find out why it's not matching the JavaMail text.
As a preliminary step to that, I pulled out just the CDDL-1.1 text from the JavaMail license and ran it through LicenseCompareHelper.isTextStandardLicense()
. That returned differences (i.e. no match), so I think your theory that it's the regex that's failing to match is correct.
Any tips for troubleshooting regexes & texts this big?
[edit] should this be a separate issue (i.e. the CDDL-1.1 regex not matching this specific license text)?
Any tips for troubleshooting regexes & texts this big?
There's some online resources - regex101 is one I've used in the past.
should this be a separate issue (i.e. the CDDL-1.1 regex not matching this specific license text)?
There's a few possible issues:
- The JavaMail CDDL text really isn't CDDL per the SPDX license matching guidelines
- The License XML for CDDL-1.1 doesn't include the right optional and/or variable text tags to allow the match
- There's a bug in the license matching code
Based on prior experiences, 2 above is the most likely, followed by 1 followed by 3.
Once we find the specific text that is different, we can figure out the cause. If the issue is 2 above, then we should open an issue and/or a pull request in the license-list-XML repo. If it is 3 above, then we can open a new issue here.
So I had some time to look at this, and the regex above is not matching the CDDL-1.1 section of the JavaMail license for two reasons:
- The optional text at the start (
<<beginOptional>>COMMON DEVELOPMENT AND DISTRIBUTION LICENSE (CDDL) Version 1.1<<endOptional>>
) is not accounted for in the regex. Perhaps this is handled some other way (e.g. by pre-processing the license text)? - The regex doesn't account for the horizontal rule on line 352 of the JavaMail license text, which is possibly related to spdx/license-list-XML#1617 ?
I think this means that explanation 1 or 2 is the relevant one here, and if so that the non-matching of the JavaMail license text is not related to this issue (and I'm happy to raise a separate issue if you'd like?).
Separately, I'll add some more multi-license text unit tests to my fork, using different source texts, and we can temporarily drop the JavaMail license text from the unit tests until that separate issue is sorted out.
Here's the latest diff with updated unit tests as described above. However it's still not matching a synthetic multi-license text constructed from single license texts that are exercised individually elsewhere in the unit tests, so I'm not sure if the code in org.spdx.utility.compare.LicenseCompareHelper.isStandardLicenseWithinText()
is buggy, or if there's something wrong with the algorithm itself. As I say it's been a long time since I wrote Java in anger, so additional eyeballs would be welcome!
@pmonks - sorry for the long absence on this one.
I found a couple of issues in the isStandardLicenseWithinText()
that is causing the problem.
Basically, we are need to normalize more of the text. In this specific case, the regex has a copyright symbol ©
and the test text has a (c)
which is equivalent per the matching guidelines but is not taken into account when comparing the text. It is interesting that we're getting different strings from the same license information, but I think the isStandardLicenseWithinText()
should take this into account.
I'll see if I can come up with a fix and submit a PR.
@pmonks - I just created a PR against your repo/branch with changes to make that work. See the comments in the PR for details.
If this works for you, go ahead and merge the PR and create a PR against this repo.
@pmonks - I'm about to do a new release of the library - if you can get the PR created in the next day or so, I'll merge it into this release - otherwise we'll include it in the next follow-on release.
Thank you @goneall! Will review tomorrow and get a PR back to the main repository tomorrow.