nexB/debian-inspector

debut-0.9.4 not detecting GPLv2 from copyright texts

rnjudge opened this issue · 2 comments

Tern uses the debut package to parse Debian copyright files and find package licenses. I understand that debut is now debian-inspector, but as far as I can tell the code is the same at the moment, so I am opening the issue in this repo. Debut does not find a license in the following copyright text (libgpm2copyright.txt) from the libgpm2 package. Here is how we collect the licenses, which yields no results:

>>> from debut import debcon
>>> from debut import copyright as debut_copyright

>>> with open('libgpm2copyright.txt') as file:
...     libgpm2copy = file.read()

>>> collected_paragraphs = []
>>> for paragraph in debcon.get_paragraphs_data(libgpm2copy):
...     if 'license' in paragraph:
...         cp = debut_copyright.CopyrightLicenseParagraph.from_dict(paragraph)
...         collected_paragraphs.append(cp)
...
>>> collected_paragraphs
[CopyrightLicenseParagraph(license=LicenseField(name='', text=None), comment=FormattedTextField(text=None), extra_data={})]


>>> deb_pkg_data = debut_copyright.DebianCopyright(collected_paragraphs).to_dict()
>>> deb_pkg_data
{'paragraphs': [{'license': '', 'comment': ''}]}

Can debian-inspector parse the licenses out of this text?

@rnjudge Hi! 👋 and thanks for the report!
This is one of the many unstructured copyright files.
There are two things we can do:

  1. Try harder to infer some structure from files like this: #6
  2. Improve license detection in Debian copyright files: nexB/scancode-toolkit#2390

Separately, these are related:

  • Determine the primary license from a copyright file #8
  • Improve tracing of license detection in package manifests nexB/scancode-toolkit#2389
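
As a rough illustration of what "unstructured" means above: a machine-readable debian/copyright file opens with a header paragraph that carries a `Format:` field, while free-form files like the one in this report have no such header. The helper below is only a heuristic sketch of that check. `looks_machine_readable` is a hypothetical name and is not part of debut or debian-inspector.

```python
def looks_machine_readable(copyright_text):
    """Heuristic: does the first paragraph contain a `Format:` field?

    Hypothetical helper for illustration only; it is not part of debut,
    debian-inspector, or scancode-toolkit.
    """
    first_paragraph = []
    for line in copyright_text.splitlines():
        if not line.strip():
            # A blank line ends the first paragraph once it has started.
            if first_paragraph:
                break
            continue
        first_paragraph.append(line)
    return any(l.lower().startswith("format:") for l in first_paragraph)
```

A file that fails this check is a candidate for the free-text handling discussed in this thread rather than field-by-field parsing.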

We now correctly process and report the license in unstructured copyright files at https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/debian_copyright.py#L393
Since these files have no structure whatsoever, we recover when structured parsing fails and handle them differently.
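
The general idea of recovering from a failed structured parse can be sketched as follows. This is not the scancode-toolkit implementation linked above, just a minimal self-contained illustration: every function name and the one-entry pattern table are assumptions made for the example, and a real detector relies on a far larger rule set.

```python
import re

# Hypothetical, deliberately tiny pattern table used only for this sketch.
LICENSE_PATTERNS = {
    "GPL-2.0": re.compile(r"GNU General Public License.{0,80}?version 2", re.I),
}


def parse_structured_licenses(text):
    """Collect the names given on `License:` field lines, skipping empty ones."""
    names = []
    for line in text.splitlines():
        if line.lower().startswith("license:"):
            name = line.split(":", 1)[1].strip()
            if name:
                names.append(name)
    return names


def detect_licenses(text):
    """Try the structured fields first, then fall back to free-text matching."""
    names = parse_structured_licenses(text)
    if names:
        return names
    # Fallback: collapse whitespace and match the whole file as plain text.
    normalized = " ".join(text.split())
    return [key for key, pattern in LICENSE_PATTERNS.items()
            if pattern.search(normalized)]
```

With this shape, a file whose `License:` line is empty, as in the paragraph shown in the report, no longer ends in an empty result as long as the license notice appears somewhere in the text.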

Some files, like the one for pulseaudio, are semi-structured, and we have #6 open for those.
Here is the result of scanning the copyright file mentioned in the issue:
libgpm2_copyright.json

Other related issues are tracked elsewhere, so closing for now. Thanks!