Bad `LicenseFilesReader#createFileDetails` logic
Vampire opened this issue · 0 comments
I spent the last 3 days to find out why for the LICENSE.txt
file of javax.activation:javax.activation-api:1.2.0
I got ASLv2, CDDL 1.0, and CDDL 2.0 as found licenses.
I have significant amount of customization including own normalizer bundle without reusing the built-in one (probably at some point will create a PR with the improvements), so I initially thought it is something I broke.
I was even more confused when removing the normalizer rule
{
"bundleName": "CDDL-1.0",
"licenseNamePattern": "Common Development and Distribution License( \\(CDDL\\),?)? (version )?(.?\\s?)?1\\.0"
},
suddenly made the CDDL 1.0 disappear, while this rule should not even look at the license file contents.
It turned out, that the logic in com.github.jk1.license.reader.LicenseFilesReader#createFileDetails
is highly questionable imho.
When creating the LicenseFileDetails
instance it looks at the text content.
If the text contains a certain EPLv2 string representation the file is assumed to be EPLv2.
If not, but the text contains literally "CDDL", the file is assumed to be CDDL 1.0.
If not, but the text contains a certain ASLv1.1 string representation the file is assumed to be ASLv1.1.
If not, but the text contains a certain ASLv2 string representation the file is assumed to be ASLv2.
If not, but the text contains another certain EPLv2 string representation the file is assumed to be EPLv2 but with slightly different values than the most significant case.
This is highly problematic per-se as the inspected license files could for example also contain multiple licenses and especially the CDDL case is extremely bad.
What happened here for the javax.activation:javax.activation-api:1.2.0
license file is, that it did not contain the most significant EPLv2 string, but it contained a CDDL 1.1 license. As that method just looks for "CDDL" it was recognized as CDDL 1.0.
The bundle normalizer then in com.github.jk1.license.filter.LicenseBundleNormalizer#normalizeLicenseFileDetails
sends not only the license text, but also the prematurely "detected" license name and url to the transformation rules. After the rules are applied the previous information is discarded and the outcome of the normalizer rules is used instead. So without that rule, no rule matches the prematurely wrongly detected CDDL 1.0 license and it is thrown out. With the rule, it survives the bundle normalizer and is wrongly reported by any reporter using the licenses detected in the license file.
While the EPLv2, ASLv1.1, and ASLv2 detections in that method are pretty unlikely to produce false-positives, the CDDL 1.0 rule is pretty likely as we have seen.
My question is, why this premature lookup is done at all, except maybe for legacy reasons.
Well, true, the license bundle normalizer needs to be configured manually to do any work.
But especially with license files that can contain any text and multiple licenses it is a major flaw to not use the bundle normalizer anyway imho.
And in the default HTML reports the license file licenses are not rendered at all anyway, so currently there would not be lost too much when removing these detections at all.
If they stay, then at least the EPLv2 rules should be changed to produce consistent license name and URL,
and the CDDL rule needs to be made more specific or removed.
Or a CDDL 1.1 rule with higher precedence (lower in the if-block cascade) needs to be added so that it is more unlikely that CDDL 1.1 is misrecognized as CDDL 1.0 as it is right now.