Curation guideline

From OSSelot

This information is intended to provide guidelines on how data are curated for the OSSelot project and how contributing works. The curator should be familiar with their preferred scanning tool (ours is Fossology) and have a general understanding of copyright law and in particular knowledge of FOSS licensing.

Note: Whenever information is given that is specific to Fossology, it is prepended with the keyword fossy.

Preparation

  • Obtain the component in source code form.
    • Note the download URL.
  • Naming convention:
    • Try to follow the project’s naming and version convention, e.g. as given by the release’s git tag.
    • If this is not consistent, use only lowercase letters.
    • [package name]-[version number], e.g. angular-15.1.0.
  • Analyze the component with a license scan tool (e.g. Fossology, Scancode).
    • fossy: Fossology default settings for analysis:
      • 7. Select optional analysis:
        • Upload from file
        • Copyright/Email/URL/Author Analysis
        • Monk License Analysis, scanning for licenses performing a text comparison
        • Nomos License Analysis, scanning for licenses using regular expressions
        • Ojo License Analysis, scanning for licenses using SPDX-License-Identifier
      • 10. ScanCode Toolkit, scan for
        • License
        • Copyright
    • Scancode default options for analysis:
      scancode -cli --license-text --json [package name]-[version number].json [package]

      c: copyrights; l: licenses; i: file information; --license-text: include full license text; --json: write the results to the given JSON file
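The naming convention and the Scancode call above can be sketched in Python. This is only an illustration, not part of the OSSelot tooling: the helper names component_name and scancode_command are hypothetical, the command assumes the --json spelling of the ScanCode output option, and the helper only builds the argument list without running the scan.

```python
def component_name(package: str, version: str, lowercase: bool = False) -> str:
    """Return "[package name]-[version number]".

    Set lowercase=True when the upstream naming is not consistent.
    """
    name = f"{package}-{version}"
    return name.lower() if lowercase else name


def scancode_command(package: str, version: str, source_dir: str) -> list[str]:
    """Build the scancode argument list; the result file follows the naming convention."""
    output = component_name(package, version) + ".json"
    return ["scancode", "-cli", "--license-text", "--json", output, source_dir]


print(component_name("angular", "15.1.0"))  # angular-15.1.0
print(" ".join(scancode_command("angular", "15.1.0", "angular-15.1.0")))
```

The returned list can be handed to subprocess.run if the scan is to be started from a script.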

Data curation

  • A licensing expert reviews and analyzes the scanning results.
  • Fossology can directly be used to review the results. The Scancode results must be reviewed with an external tool, e.g. Opossum.
  • Review is done on file level, i.e. every file in the source code tree for which at least one scanner found a result is analyzed.
    • fossy: In Fossology, you can browse through the relevant files by selecting "Go through all files with licenses and no clearing result".
  • That means:
    • scanner findings are confirmed, or
    • scanner findings are corrected.
  • If there are no findings for a file, the conclusion is NOASSERTION (for SPDX tag LicenseConcluded).
    • fossy: In Fossology, this is given by the clearing decision types "No license known" or "Irrelevant" or "Non-functional".

LicenseComments

In case a license conclusion is not obvious, the decision is explained.

  • This is done with the following template:

    The information in the file is:
    "[Quote licensing information in the source code file]"
    [Give reason for conclusion] Therefore, [license] is concluded.

  • Example 1: No version

    The information in the file is:
    "This file is GPL'd."
    As no version of the GPL is given, GPL-1.0-or-later is concluded.

  • Example 2: URL for license text

    The information in the file is:
    "This file is licensed under License A. You can find the license text at https://www.LicenseTextOfLicenseA.com."
    The URL contains the license text of License A, therefore License A is concluded. The information was retrieved on [date].

  • fossy: In Fossology, the explanations are given in the "Comment" section which maps to the SPDX tag LicenseComments.
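The comment pattern shown above can be filled in programmatically, for example when comments are prepared outside the scanning tool. The following is a minimal sketch; the function name license_comment is hypothetical and the wording simply follows the pattern given in this section.

```python
def license_comment(quote: str, reason: str, license_id: str) -> str:
    """Format an explanation for a license conclusion that is not obvious."""
    return (
        "The information in the file is:\n"
        f'"{quote}"\n'
        f"{reason} Therefore, {license_id} is concluded."
    )


print(
    license_comment(
        quote="This file is GPL'd.",
        reason="No version of the GPL is given.",
        license_id="GPL-1.0-or-later",
    )
)
```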

Correcting scanner findings

The following list includes typical cases where scanner findings have to be corrected and how to do so.

Not a license

The scanner concludes a license from an expression in a file that is not actually a license expression at all. In this case, the incorrect license finding is removed.

  • fossy: In Fossology, the source of the scanner finding is highlighted when clicking on the number (#1) behind the scanner.

Not the file's license

The scanner concludes a license from a license expression that is part of the file’s content but not the license of the file itself. In this case, the incorrect license finding is removed.

License text

Files that contain only a license text (e.g. COPYING) are concluded by the scanners to be licensed under the respective license. This is usually not correct. Most license texts are not explicitly licensed, so the finding is removed. The GNU licenses, however, contain a license statement for the license text itself; in these cases, that license is concluded (License-of-GNU-licenses).

Imprecise finding

The scanner finding might be imprecise, e.g. with respect to the version of a license (for instance, when the finding carries no version number). If this is the case, the imprecise finding is removed and the license and version specified in the file are concluded. If no version is given, the lowest existing version with the -or-later extension is concluded.
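The rule for missing versions can be expressed as a small lookup. The sketch below is an assumption-laden illustration: the table LOWEST_VERSION and the function conclude_unversioned are hypothetical names, and only two license families are listed, using the lowest versions that carry an -or-later identifier on the SPDX License List.

```python
# Lowest version per license family; deliberately incomplete, extend as needed.
LOWEST_VERSION = {
    "GPL": "1.0",
    "LGPL": "2.0",
}


def conclude_unversioned(family: str) -> str:
    """Return the identifier to conclude when a file names a license without a version."""
    try:
        lowest = LOWEST_VERSION[family]
    except KeyError:
        raise ValueError(f"no lowest version recorded for {family!r}") from None
    return f"{family}-{lowest}-or-later"


print(conclude_unversioned("GPL"))  # GPL-1.0-or-later
```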

Dual licensing

A file might offer a choice of two or more licenses under which it can be used. If the context requires choosing one specific license, this choice must be noted. However, all applicable licenses must be concluded. Also, dual license cases require additional post-processing, see section "Post-processing" below.

  • fossy: In Fossology, add the following text to the "Acknowledgement" section of the "Dual-license" finding to note the license choice, if applicable:

    To the extent files may be licensed under License A or License B, in this context License B has been chosen. This shall not restrict the freedom of other users to choose either License A or License B. For convenience, all license texts are provided.

License exceptions

In particular for the GNU licenses, there are a number of license exceptions.

  • fossy: Fossology notes the license and the exception as separate findings. This is corrected to one finding using the SPDX license expression [License] WITH [exception], e.g. GPL-2.0-or-later WITH GCC-exception-2.0.
  • fossy: If the Fossology license database does not yet contain these licenses, they have to be added.

Generic license texts

For some licenses, especially the BSD-type licenses, many variants of the license texts exist. The scanners often provide only the generic license texts. If an individual text differs from the generic text, the individual license text is provided.

  • fossy: In Fossology, click the match percentage to see the differences.
  • fossy: The individual text is copied from the file into the "License" section of Fossology.

External references

Sometimes the file does not contain the name or text of a license but references an external resource such as a COPYRIGHT file in the root directory or a URL. In these cases, the external reference is checked, the detected license is concluded, and the process is documented as a LicenseComment (in case of a URL, the date of access is noted).

(Partially) global license assignment

Sometimes there is a Readme file or similar that contains a statement assigning a license to several files within the source tree (e.g. all files in a specific directory). As such information is often outdated or does not account for individual licensing of files, it is not used to assign a license to a file here.

Acknowledgment

If a license has an acknowledgment requirement, the respective acknowledgment text is given. In particular for CC-BY licenses, the acknowledgment must contain the following information (if available): name of the creator, copyright notice, license notice, disclaimer, link to the material.

  • fossy: In Fossology, the acknowledgment text is given in the "Acknowledgement" section.

fossy: Bulk statements

In Fossology, scanner findings can be confirmed, removed or corrected with bulk statements.

  • When doing so, it is crucial to start with the shorter bulk statements: a short statement can be part of a longer one, and running the short statement after the long one would then modify the findings already made for the long one. For example (abbreviated):

    Short bulk statement: "This file is licensed under GPL version 2.0."
    Long bulk statement: "This file is licensed under GPL version 2.0. As a special exception, you may..."

    Here, the short bulk statement will modify the findings for the file with the long bulk statement. It should therefore be run first so that afterwards, the long bulk statement can correct the conclusion for the relevant files.
  • Do not limit the scope of bulk statements; choose unique bulk statements instead. When bulk statements are reused for future uploads, the initial scope is not preserved and they are applied to the entire upload, which might yield false results.
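The ordering rule for bulk statements can be checked before they are entered. The following sketch only works on plain strings and does not talk to Fossology; the function names order_bulk_statements and overlapping_pairs are hypothetical.

```python
def order_bulk_statements(statements: list[str]) -> list[str]:
    """Return the statements sorted so that shorter ones are run first."""
    return sorted(statements, key=len)


def overlapping_pairs(statements: list[str]) -> list[tuple[str, str]]:
    """Return (short, long) pairs where the short statement is part of the long one."""
    return [
        (short, longer)
        for short in statements
        for longer in statements
        if short != longer and short in longer
    ]


short = "This file is licensed under GPL version 2.0."
long_ = short + " As a special exception, you may..."

for statement in order_bulk_statements([long_, short]):
    print(statement)
print(len(overlapping_pairs([long_, short])))  # 1
```

Pairs reported by overlapping_pairs are the cases where the order of execution matters.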

Curating copyright statements

  • Remove findings that were incorrectly identified as a copyright statement (e.g. license texts, code, etc.).
  • Remove content from copyright statements that is not part of the copyright notice (e.g. formatting signs, license notices, comments on content, code, etc.).
  • If the source code tree contains an AUTHORS file, the content of this is given as value to the SPDX tag PackageCopyrightText in the post-processing stage (see section “Post-processing” below).

Package license

Only if there is a LICENSE, COPYING or similar file in the root directory that states a main license for the package do we give this information as value to the SPDX tag PackageLicenseDeclared.

  • fossy: In Fossology, this is marked as the "main license" by activating the star symbol. Caution: If the main license is a custom text, Fossology takes the standard template text anyway. This has to be corrected manually in the post-processing stage (see section “Post-processing” below).

Report export and post-processing

In the SPDX standard, licenses are denoted by a short identifier (e.g. GPL-2.0-only or LicenseRef-MIT-customized). Licenses that are not listed in the SPDX License List are prefixed by "LicenseRef-", and in the section "License information" of the SPDX tag:value file, the full license text is given. Licenses with standard texts according to the SPDX License List do not carry the "LicenseRef-" prefix, and their license text is not given in the tag:value file. For the OSSelot project, however, the SPDX tag:value file is intended to be self-contained, i.e. for every short license identifier the corresponding full license text must be given.

  • fossy: In order to achieve this while ensuring the SPDX file can be valid, we have patched our Fossology installation to add the "LicenseRef-" prefix to all license identifiers. For details, see the article on Fossology.

Export reports

When all license information and copyright statements of the entire package are curated, the result is exported as SPDX tag:value and OSS Disclosure files.

  • fossy: The Fossology settings for report generation must be changed for every new package. Go to Conf → SPDX Report Settings, select "Show SPDX license comments" and submit the change.
  • fossy: Export SPDX tag:value report.
  • fossy: Export ReadMe_OSS (OSS disclosure report).

Post-processing

Some post-processing operations on the SPDX tag:value and the OSS disclosure reports are required. At least some of these operations can be easily scripted.

  • Rename files to fit naming convention
    • SPDX tag:value report: [package name]-[version number]-SPDX2TV.spdx, e.g. angular-15.1.0-SPDX2TV.spdx.
    • OSS disclosure file: [package name]-[version number]-OSS-disclosure.txt, e.g. angular-15.1.0-OSS-disclosure.txt.

Both reports

  • Set line break to 80 characters.
  • Remove empty lines.
  • Only required for FOSSology versions 4.2 or lower:
    • For "or later" license references, replace "+" with "-or-later", e.g. GPL-2.0+ → GPL-2.0-or-later.
    • For GNU licenses without "or later" extension, add "-only", e.g. GPL-2.0 → GPL-2.0-only.
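The operations for both reports lend themselves to scripting. The sketch below is one possible approach, not the OSSelot script: the function names are hypothetical, the regular expression only covers GPL, LGPL and AGPL identifiers of the form [family]-[major].[minor], and words longer than the line width (e.g. URLs) are left unbroken. The identifier rewrite applies to the whole text, so the result should be reviewed.

```python
import re
import textwrap

# Matches e.g. GPL-2.0, LGPL-2.1, AGPL-3.0 (assumed identifier shape).
GNU_ID = r"\b((?:A|L)?GPL-\d+\.\d+)"


def modernize_gnu_ids(text: str) -> str:
    """Replace "+" with "-or-later" and add "-only" to bare GNU identifiers."""
    text = re.sub(GNU_ID + r"\+", r"\1-or-later", text)
    # Add "-only" unless a suffix is already present or the version continues.
    return re.sub(GNU_ID + r"(?![\d+]|-or-later|-only)", r"\1-only", text)


def reflow(text: str, width: int = 80) -> str:
    """Wrap lines at the given width and drop empty lines."""
    lines: list[str] = []
    for line in text.splitlines():
        # textwrap.wrap returns an empty list for blank lines, which removes them.
        lines.extend(
            textwrap.wrap(
                line, width=width, break_long_words=False, break_on_hyphens=False
            )
        )
    return "\n".join(lines) + "\n"


print(modernize_gnu_ids("GPL-2.0+ AND LGPL-2.1"))  # GPL-2.0-or-later AND LGPL-2.1-only
```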

OSS disclosure report

  • Remove the headings "Main license" and "Other licenses", and replace them with the heading "Licenses".

SPDX tag:value report

To see how the SPDX tag:value file is generally used in OSSelot, have a look at the SPDX2TV template.

The following tags must be edited:

  • Creator: Person: [name of creator]
  • CreatorComment: <text>This document was created using license information and a generator from Fossology. It contains the license and copyright analysis of [package]. Please check "LicenseComments" for explanations of concluded licenses.</text>
  • PackageLicenseConcluded: NOASSERTION
  • (Not required for FOSSology versions 4.3 or higher) If main license is not a template license text, add correct customized license reference to PackageLicenseDeclared.
  • Dual licensing conclusions: Remove "LicenseRef-Dual-license" and correct AND operator to OR (e.g. LicenseA AND LicenseB AND LicenseRef-Dual-license → LicenseA OR LicenseB). If there is dual licensing and multiple licenses, be aware of the SPDX operator hierarchy (default order of precedence: WITH, AND, OR). For only two licenses, this is not required for FOSSology versions 4.3 or higher, but for three or more licenses, manual editing is still necessary.
  • As the SPDX standard does not contain template license texts but the OSSelot variant does, we need to add the prefix "LicenseRef-" to all license IDs that do not yet carry it to obtain a valid SPDX document. See patch in FOSSology#Customization.
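The dual licensing correction for the simple two-license case can be sketched as follows. The function name resolve_dual_license is hypothetical, and the sketch assumes a flat expression joined only by AND; for three or more licenses it refuses to guess, because the grouping has to be decided manually.

```python
DUAL_MARKER = "LicenseRef-Dual-license"


def resolve_dual_license(expression: str) -> str:
    """Turn "A AND B AND LicenseRef-Dual-license" into "A OR B"."""
    parts = [part.strip() for part in expression.split(" AND ")]
    if DUAL_MARKER not in parts:
        return expression  # not a dual licensing conclusion, leave unchanged
    choices = [part for part in parts if part != DUAL_MARKER]
    if len(choices) != 2:
        # With more licenses, the operator precedence (WITH, AND, OR) matters.
        raise ValueError(f"manual editing required: {expression}")
    return " OR ".join(choices)


print(resolve_dual_license("LicenseA AND LicenseB AND LicenseRef-Dual-license"))
# LicenseA OR LicenseB
```

Because WITH binds more tightly than OR, an operand such as GPL-2.0-or-later WITH GCC-exception-2.0 stays intact as one choice.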

The SPDX tag:value file must be validated either with the SPDX online tools or with the CLI tools. When the SPDX tag:value file is valid, it is converted to the spdx.json, spdx.rdf.xml and spdx.yaml formats.

Contribution

The contribution of a newly curated package must contain the following artifacts:

  • README with download URL, purl, creator name
  • OSS disclosure file
  • SPDX tag:value file
  • SPDX json file
  • SPDX rdf.xml file
  • SPDX yaml file

To contribute, the repository https://github.com/Open-Source-Compliance/package-analysis must be forked and a pull request must be created.

  • The contribution must be licensed under CC0-1.0.
  • The pull request must contain a "Signed-off-by: [Name] <Email>" statement to indicate acceptance of the Developer Certificate of Origin.
  • The contribution will be reviewed. If changes are required, we kindly ask the contributor to be persistent and resubmit the reworked contribution. When it is accepted, the artifacts will be published.

Contact

Please direct any questions or remarks to info@osselot.org. We will be happy to help.