gold per-file requirements hurt sustainability and deter contributors, why "MUST"?
ljharb opened this issue · 13 comments
Having boilerplate/frontmatter in every file is an annoyance to contributors. It's an additional CI check that needs to be run, it's additional friction for people creating new files (but especially newcomers), and it does nothing to increase the security of the project itself. It certainly would make auditing easier for folks who depend on projects that have copy-pasted files from my project - but that's not related to my project's security.
At the very least, I think these should be downgraded to "SHOULD" - but my personal preference would be to invert them and say that one should actively NOT do this. Duplicating information that's already available in the right place at the root of the repo is just noise.
I believe you mean the gold criteria [copyright_per_file] and [license_per_file].
This is a proposed change (reduction) in requirements, so we need to hear from others. If anyone has comments for or against this proposed change, please say so in this issue!
For simplicity, here is more information about each criterion.
Criterion copyright_per_file:
- Requirement: The project MUST include a copyright statement in each source file, identifying the copyright holder (e.g., the [project name] contributors). {Met justification} [copyright_per_file]
- Details: This MAY be done by including the following inside a comment near the beginning of each file: "Copyright the [project name] contributors.". See "Copyright Notices in Open Source Software Projects" by Steve Winslow.
- Rationale: This isn't legally required in most jurisdictions, per the Berne Convention. For example, copyright notices have not been required in the US since 1989. On the other hand, this is not hard to add. Ben Balter's "Copyright notices for open source projects" provides some good arguments for why it should be included: "First, someone may want to use your work in ways not allowed by your license; notices help them determine who to ask for permission. Explicit notices can help you prove that you and your collaborators really are the copyright holders. They can serve to put a potential infringer on notice by providing an informal sniff test to counter the 'Oh yeah, well I didn't know it was copyrighted' defense. For some users the copyright notice may suggest higher quality, as they expect that good software will include a notice... Git can track these things, but people may receive software outside of git or where the git history has not been retained." In addition, we have been informed by the Linux Foundation's SPDX community that having this information is extremely valuable for relicensing and for checking to determine if a copyrighted work is derived from another. While version control systems do track versioning within a project, when files are copied between projects this information is often lost. Having the copyright notice information helps those researching sources, e.g., if they wish to try to relicense something.
Criterion license_per_file:
- Requirement: The project MUST include a license statement in each source file. This MAY be done by including the following inside a comment near the beginning of each file: SPDX-License-Identifier: [SPDX license expression for project]. {Met justification} [license_per_file]
- Details: This MAY also be done by including a statement in natural language identifying the license. The project MAY also include a stable URL pointing to the license text, or the full license text. Note that the criterion license_location requires the project license be in a standard location. See this SPDX tutorial for more information about SPDX license expressions. Note the relationship with copyright_per_file, whose content would typically precede the license information.
- Rationale: Files are sometimes individually copied from one project into another. Per-file license information increases the likelihood that the original license will be honored. SPDX provides a simple standard way to identify common licenses, without having to embed the full license text in each file; since this makes the criterion easier to do, we specifically mention it. Technically, the text after "SPDX-License-Identifier" is a SPDX license expression, not an identifier, but the tag "SPDX-License-Identifier" is what is used for backwards-compatibility.
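To make the cost of the "additional CI check" concrete, here is a minimal sketch of what such a check might look like. It is not the badge program's own tooling. The 10-line header window, the regular expressions, and the names `check_header` and `find_missing` are all assumptions made for illustration. It only tests whether the two notices appear near the top of a file and does not validate the SPDX license expression itself.

```python
import re
from pathlib import Path

# Assumption: a notice only counts if it appears in the first few lines.
HEADER_LINES = 10

# Matches the tag followed by at least one token of a license expression.
LICENSE_RE = re.compile(r"SPDX-License-Identifier:\s*\S+")
# Deliberately loose: any "Copyright" word in the header is accepted.
COPYRIGHT_RE = re.compile(r"Copyright\b", re.IGNORECASE)


def check_header(text: str, header_lines: int = HEADER_LINES) -> dict:
    """Report which notices appear in the first `header_lines` lines of `text`."""
    head = "\n".join(text.splitlines()[:header_lines])
    return {
        "license_per_file": bool(LICENSE_RE.search(head)),
        "copyright_per_file": bool(COPYRIGHT_RE.search(head)),
    }


def find_missing(root: str, suffixes: tuple = (".py",)) -> list:
    """List (path, missing criteria) for source files under `root`."""
    problems = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            missing = [name for name, ok in check_header(text).items() if not ok]
            if missing:
                problems.append((str(path), missing))
    return problems
```

A CI job could call `find_missing(".")` and fail when the returned list is non-empty. Which suffixes count as "source files" is left to the project.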
I would be in favor of changing this from MUST to SHOULD, but perhaps we could compromise somewhat and leave it as 'MUST' if a particular source file was "sourced" from another project as you mention in the rationale (or add an appropriate one if not present for that file). I think that would especially be important when the license type of that particular "borrowed" source file is different than the license type that you are releasing your project under. For instance, a project releasing under LGPL 2.1 license but pulling in a Java class that was licensed under (say) Apache 2 license. During all the code reviews that I've done in the past 10 years, I certainly have seen teams "borrow" something like the source code for org.apache.commons.lang3.StringUtils and pull that directly into their repo rather than including a dependency for Apache Commons Lang 3.
The disadvantage of putting a copyright notice in every source file is that when the company or organization name changes, updating it everywhere becomes tedious. I'm facing that now: for OWASP ESAPI, most (if not all) of our Java source files have a copyright notice for "Open Web Application Security Project", and then OWASP had to go and change its name to "Open Worldwide Application Security Project". It's more of an annoyance, but it still illustrates the point.
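As a rough illustration of that kind of bulk update, here is a small sketch. It assumes the notice sits on a line containing the word "copyright" within the first 10 lines of the file. The function name `rename_holder` and the sample header are hypothetical and do not reflect ESAPI's actual file layout.

```python
def rename_holder(text: str, old: str, new: str, header_lines: int = 10) -> str:
    """Replace `old` with `new`, but only on copyright lines in the file header."""
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines[:header_lines]):
        # Only touch lines that look like a copyright notice.
        if "copyright" in line.lower():
            lines[i] = line.replace(old, new)
    return "".join(lines)


OLD = "Open Web Application Security Project"
NEW = "Open Worldwide Application Security Project"

sample = (
    "// Copyright (c) Open Web Application Security Project\n"
    "package example;\n"
    "// Mentions Open Web Application Security Project outside a notice.\n"
)
updated = rename_holder(sample, OLD, NEW)
```

Restricting the replacement to header lines that mention copyright leaves other occurrences of the old name untouched, such as the last line of `sample`. Running it over a whole repository still means touching every file.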
That makes total sense for vendored code, it's just exceedingly rare to ever do that in the ecosystems I participate in.
I always thought of this requirement as a way to prove to everyone that the provenance is in order. We know the copyright and license status for every individual file. And I think a gold project should live up to that.
But complying with REUSE does allow for also providing the same info out of file when necessary and as long as that info exists, I think it's fine. That is how we in curl comply with this.
I've seen too many files copied piecemeal from one project to another. IMHO, NOT having the info there is asking for trouble.
I'd even go so far as to suggest making it a SHOULD for SILVER, but leaving it a MUST for GOLD.
Trouble for who? The point of this program is to make the project itself more secure.
Being required to add a copyright notice to every single source file seems odd to me. As far as I am aware, copyright notices have little to no effect in most jurisdictions, at least in recent decades. The contents of a single source file might not even meet the threshold of originality in many jurisdictions.
Of course, if projects wish to use copyright notices in every single file as a deterrent against infringement, that's perfectly fine, but I don't see why it would be mandated for security reasons here.
If this recommendation is supposed to ensure that the origin and license of each file is clear, I am pretty sure someone could come up with a criterion that causes less friction.
If we're going to require Copyright notice on every source file, can we at least make an exclusion for configuration files? That gets confusing because they often get heavily edited by library users.
@kwwall - The requirements are only for "source files". Usually configuration files aren't source files, so typical configuration files are already excluded from this requirement.
With files commonly being copied from project to project as a development pattern, retaining the license information is key to easing analysis. And if a file's license causes problems when its contents are combined with other files, only the copyright holder can change the license, so keeping this original metadata with the file helps de-risk use of the software.
As vulnerabilities occur at the file level, having this information handy helps with notifying the copyright holder, who may have used the contents of this file in other locations as well.
Agree with David that configuration files and other generated evidence do not typically have copyright asserted on the contents.
@kestewart In which ecosystems is this a common and "non-discouraged" pattern?
> Agree with David that configuration files and other generated evidence do not typically have copyright asserted on the contents.
A minor nit: The gold badge currently only requires per-file license statements for source files. Configuration files are often not generated evidence, but since they also aren't normally source files, there's no gold requirement for per-file license statements in typical configuration files (unless they're also source files). I am NOT a lawyer, but my understanding is that it's often pointless to try to claim copyright on configuration files. Copyright law only covers expression. This makes claiming copyright over configuration files often dubious (depending on the circumstance). See this discussion: https://groups.drupal.org/node/17555. No one here is suggesting that license statements be required in every configuration file, but I thought it'd be worth clarifying that there are good reasons for that. You can do it if you want to of course :-).
The line between configuration file and source file can get a bit fuzzy under the whole "infrastructure as code" paradigm, but as long as the gold badge standard doesn't get too draconian and leaves those choices to the development teams, I think it will work out fine.