jpeddicord/askalono

`remove_common_tokens` poorly handles "odd" licenses

Closed this issue · 7 comments

I am using askalono as a library and was trying to figure out why I was getting extremely low confidence scores compared with the 0.3.0 CLI I had installed, which correctly identifies one of the problematic license files, e.g. https://github.com/rust-random/rand/blob/master/rand_core/LICENSE-MIT

Here's an example run where I print the result of each preprocessing step. `remove_common_tokens` runs first and truncates basically all of the relevant license text, which leaves the analysis with almost nothing to work with.

[2019-05-29T07:31:48Z DEBUG askalono::preproc] INPUT: 
Copyright 2018 Developers of the Rand project
Copyright (c) 201 The Rust Project Developers

Permission is hereby granted, free of charge, to any
person obtaining a copy of this software and associated
documentation files (the 'Software'), to deal in the
Software without restriction, including without
limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and or sell copies of
the Software, and to permit persons to whom the Software
is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice
shall be included in all copies or substantial portions
of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF
ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.

---
[2019-05-29T07:31:48Z DEBUG askalono::preproc] AFTER: 
 2018 Developers of the Rand project
 (c) 201 The Rust Project Developers
---
[2019-05-29T07:31:48Z DEBUG askalono::preproc] AFTER: 
 2018 Developers of the Rand project
 (c) 201 The Rust Project Developers
---
[2019-05-29T07:31:48Z DEBUG askalono::preproc] AFTER: 
 2018 Developers of the Rand project
 c 201 The Rust Project Developers
---
[2019-05-29T07:31:48Z DEBUG askalono::preproc] AFTER: 
 2018 developers of the rand project
 c 201 the rust project developers
---
[2019-05-29T07:31:48Z DEBUG askalono::preproc] AFTER: 
 2018 developers of the rand project
 c 201 the rust project developers
---
[2019-05-29T07:31:48Z DEBUG askalono::preproc] AFTER: 
 2018 developers of the rand project c 201 the rust project developers
---
[2019-05-29T07:31:48Z DEBUG askalono::preproc] AFTER: 
2018 developers of the rand project c 201 the rust project developers
---
[2019-05-29T07:31:48Z DEBUG askalono::preproc] Aggressively normalized to:
2018 developers of the rand project c 201 the rust project developers
---

This license is a bit odd in that it has two copyright headers at the beginning; indeed, removing one of them no longer triggers the truncation.
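The thread does not show how `remove_common_tokens` is implemented, so the following is only a hypothetical illustration of why a second header could matter to a common-prefix heuristic. The function names `common_prefix` and `leading_shared_prefix` are invented for this sketch and are not askalono APIs. With two headers, the first two non-empty lines share the prefix `Copyright `; with one header they share nothing.

```rust
/// Longest common prefix of two strings, cut on a char boundary.
fn common_prefix<'a>(a: &'a str, b: &str) -> &'a str {
    let end = a
        .char_indices()
        .zip(b.chars())
        .take_while(|((_, ca), cb)| ca == cb)
        .last()
        .map(|((i, ca), _)| i + ca.len_utf8())
        .unwrap_or(0);
    &a[..end]
}

/// Prefix shared by the first two non-empty lines of `text`.
/// Hypothetical heuristic for illustration only; NOT askalono's code.
fn leading_shared_prefix(text: &str) -> &str {
    let mut lines = text.lines().filter(|l| !l.trim().is_empty());
    match (lines.next(), lines.next()) {
        (Some(a), Some(b)) => common_prefix(a, b),
        _ => "",
    }
}

#[test]
fn second_header_creates_a_shared_prefix() {
    let two = "Copyright 2018 Developers of the Rand project\n\
               Copyright (c) 201 The Rust Project Developers\n\
               \n\
               Permission is hereby granted, free of charge, to any";
    let one = "Copyright 2018 Developers of the Rand project\n\
               \n\
               Permission is hereby granted, free of charge, to any";
    assert_eq!(leading_shared_prefix(two), "Copyright ");
    assert_eq!(leading_shared_prefix(one), "");
}
```

Any step that then treats such a shared prefix as removable boilerplate would behave differently on the two inputs, which is consistent with the logs above but not confirmed by them.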

[2019-05-29T07:39:09Z DEBUG askalono::preproc] INPUT: 
Copyright 2018 Developers of the Rand project

Permission is hereby granted, free of charge, to any
person obtaining a copy of this software and associated
documentation files (the 'Software'), to deal in the
Software without restriction, including without
limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and or sell copies of
the Software, and to permit persons to whom the Software
is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice
shall be included in all copies or substantial portions
of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF
ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.

---
[2019-05-29T07:39:09Z DEBUG askalono::preproc] Aggressively normalized to:
permission is hereby granted free of charge to any person obtaining a copy of this software and associated documentation files the software to deal in the software without restriction including without limitation the rights to use copy modify merge publish distribute sublicense and or sell copies of the software and to permit persons to whom the software is furnished to do so subject to the following conditions the above copyright notice and this permission notice shall be included in all copies or substantial portions of the software the software is provided as is without warranty of any kind express or implied including but not limited to the warranties of merchantability fitness for a particular purpose and noninfringement in no event shall the authors or copyright holders be liable for any claim damages or other liability whether in an action of contract tort or otherwise arising from out of or in connection with the software or the use or other dealings in the software
---

Yikes -- that's no good at all. Good digging. I'm going to:

  1. Disable the LCS removal for now (with intent to fix + re-enable later)
  2. Add a regression test to ensure this doesn't break in the future
  3. Fix in master & release 0.4.0 (was looking at 1.0 for a time, but I'd like to get this fixed first and then prep that)
  4. Backport to 0.3 as 0.3.1

Going to see how far I can get on all of this today. :)
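A regression test for step 2 could look roughly like the sketch below. It is a minimal sketch under stated assumptions: the real test would call askalono's own preprocessing, whose API is not shown in this thread, so `normalize` here is a stand-in that only lowercases, drops punctuation, and collapses whitespace. The point is the shape of the assertion: after preprocessing a license with two copyright headers, the license body must still be present.

```rust
/// Stand-in normalizer (an assumption, NOT askalono's preprocessor):
/// lowercase, turn non-alphanumeric characters into spaces, collapse whitespace.
fn normalize(text: &str) -> String {
    let cleaned: String = text
        .chars()
        .map(|c| if c.is_alphanumeric() { c.to_ascii_lowercase() } else { ' ' })
        .collect();
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Opening of the problematic file as it appears in the debug log above.
const TWO_HEADER_MIT: &str = "Copyright 2018 Developers of the Rand project\n\
Copyright (c) 201 The Rust Project Developers\n\
\n\
Permission is hereby granted, free of charge, to any\n\
person obtaining a copy of this software and associated\n\
documentation files (the 'Software'), to deal in the\n\
Software without restriction";

#[test]
fn two_copyright_headers_do_not_truncate_body() {
    let out = normalize(TWO_HEADER_MIT);
    // The body of the license must survive preprocessing.
    assert!(out.contains("permission is hereby granted free of charge"));
    assert!(out.contains("documentation files the software to deal"));
}
```

Swapping `normalize` for the library's real preprocessing entry point would turn this into a test that fails on the buggy code path and passes once the truncation is gone.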

Sounds good, thanks for the quick response, but no need to rush! :)

Oh gosh. I just realized that remove_common_tokens isn't even in 0.3.0, which explains why my cherry pick didn't resolve cleanly. But it probably snuck into the askalono.linux-static build I manually added a few months later.

@Jake-Shadle were you using the static build off of GitHub releases, or was it via cargo install? I think that might have been the issue with 0.3. This is fixed in master for the next release, however.

I was using cargo install.

cargo install from crates.io (cargo install askalono-cli) or via this repository? I'm trying to track down where you experienced this bug so I can make sure it's eradicated everywhere.

Oh sorry, cargo install from crates.io. I narrowed down the cause by looking at the commits that had landed after the 0.3.0 version bump. Initially I had assumed some of the changes on my fork were the cause, but an unmodified HEAD exhibited the same behavior.

Ok, I think what might have happened is you did cargo install from crates.io originally, but at some point might have run it directly against this repository (which still has the 0.3.0 version number in source; I haven't yet bumped that). I just did a fresh cargo install --force askalono-cli to get the latest version published to crates.io and it's giving me expected results:

❯❯❯ cargo install --force askalono-cli
    Updating crates.io index
  Installing askalono-cli v0.3.0
     [...]
   Compiling env_logger v0.5.13
   Compiling askalono v0.3.0
   Compiling ignore v0.4.7
   Compiling askalono-cli v0.3.0
    Finished release [optimized] target(s) in 4m 26s
   Replacing /Users/peddicor/.cargo/bin/askalono

❯❯❯ which askalono
/Users/peddicor/.cargo/bin/askalono

❯❯❯ askalono id ~/Desktop/testlicense
License: MIT (original text)
Score: 0.994

So there is definitely a bug in master (that's currently being worked around by disabling that text preprocessor) but 0.3.0 should be fine, in both library and executable form as published to crates.io.

If I've missed something, please let me know. If you can put together a reproducible test case, I'll dig into this more for that version.