@sharksurfauto-byte

Summary

Implements a parser for CITATION.cff files, which carry software citation metadata. Supports CFF spec versions 1.0.0 and later with best-effort extraction.

Changes

  • New parser module: src/packagedcode/citation.py (231 lines)
  • Comprehensive test suite: tests/packagedcode/test_citation.py (8 tests)
  • Test fixtures covering minimal, full-featured, edge cases, and backward compatibility

Key Features

  • PyYAML-based parsing with silent error handling (no crashes, no noise)
  • Comprehensive author format support: family+given names, name-only, ORCID storage
  • Best-effort extraction: Only cff-version is required, all other fields optional
  • Package type: generic (citation metadata ≠ ecosystem-specific packages)
  • Backward compatible with older CFF versions (tested with 1.0.0)
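
The author handling listed above can be illustrated with a minimal sketch. It assumes the YAML has already been loaded into plain Python objects; the function names and the returned dict shape are hypothetical and are not taken from src/packagedcode/citation.py. The keys given-names, family-names, name, and orcid are the ones CFF uses for author entries.

```python
def normalize_cff_author(author):
    """Return {'name': ..., 'orcid': ...} for one CFF author entry, or None."""
    if not isinstance(author, dict):
        # Best effort: skip malformed entries instead of raising.
        return None
    given = str(author.get("given-names") or "").strip()
    family = str(author.get("family-names") or "").strip()
    # Person entries carry given-names/family-names; entity entries carry name.
    name = " ".join(part for part in (given, family) if part)
    if not name:
        name = str(author.get("name") or "").strip()
    if not name:
        return None
    orcid = author.get("orcid")
    # The ORCID is stored as found, without validation.
    return {"name": name, "orcid": orcid if isinstance(orcid, str) else None}


def normalize_cff_authors(authors):
    """Normalize a CFF authors list, dropping entries that cannot be read."""
    if not isinstance(authors, list):
        return []
    normalized = (normalize_cff_author(entry) for entry in authors)
    return [entry for entry in normalized if entry is not None]
```

Malformed entries are dropped silently here, which mirrors the "no crashes, no noise" goal; a real parser might prefer to log them at debug level.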

Design Decisions

Why package type generic?

CFF describes citation metadata, not distributable artifacts. A CFF file can describe software, datasets, papers, or other citable objects. The type: software field in CFF ≠ package ecosystem (npm, pypi, etc.). Using generic avoids misleading over-inference.

Why extracted_license_statement not declared_license_expression?

CFF license field may be free-form text or non-SPDX identifiers. Using extracted_license_statement allows ScanCode's license detection to normalize later, following best-effort extraction principles.
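
As a rough sketch of this mapping, the helper below keeps the license value verbatim so that later license detection can normalize it. The function name is hypothetical, and joining list values with ", " is an assumption made for illustration, not necessarily what the PR does.

```python
def extract_license_statement(cff_data):
    """Return the raw CFF license value as text, or None when absent."""
    if not isinstance(cff_data, dict):
        return None
    value = cff_data.get("license")
    if isinstance(value, str):
        # Free-form text and SPDX identifiers are both passed through as-is.
        return value.strip() or None
    if isinstance(value, list):
        # The license field may also be a list of identifiers; keep each one
        # verbatim and leave interpretation to license detection.
        parts = [str(item).strip() for item in value if str(item).strip()]
        return ", ".join(parts) or None
    return None
```

No attempt is made to build a license expression at this stage, which is the point of using extracted_license_statement.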

Why only cff-version is required?

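The parser treats cff-version as the only strictly required field and uses it to recognize a file as CFF. The CFF 1.2.0 schema also marks message, title, and authors as required, but the parser extracts them when present rather than rejecting files that omit them. This aligns with ScanCode's best-effort extraction philosophy.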
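
The cff-version gate could look roughly like the sketch below. It assumes the file content has already been loaded into a Python object (for example by a YAML loader); the function name is hypothetical.

```python
def usable_cff_mapping(data):
    """Return data if it looks like CFF (has a cff-version), else None."""
    if not isinstance(data, dict):
        # Empty files, lists, or scalars are not CFF mappings.
        return None
    version = data.get("cff-version")
    # A YAML loader may yield a string ("1.2.0") or a number (1.2),
    # so compare on the string form.
    if version is None or not str(version).strip():
        return None
    return data
```

Everything else (message, title, authors) would then be read with dict.get and treated as optional.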

Test Coverage

  • File recognition (uppercase/lowercase)
  • Minimal parsing (only required fields)
  • Full-featured parsing
  • Multiple author formats with ORCID
  • Error handling (invalid YAML, missing cff-version)
  • Backward compatibility (CFF v1.0.0)

Fixes #3580 (Add support for citation file format)

Moved free-unknown_88.RULE to inactive directory as it was causing
false positives when scanning casual documentation that mentions
'open source license'. This resolves issue aboutcode-org#4221 where ts-jest README
was incorrectly flagged with LicenseRef-scancode-free-unknown.

The rule matched only 3 words ('open source license') with low relevance
(50), making it too broad and triggering on non-licensing contexts.

Added test case to prevent regression:
- tests/licensedcode/data/datadriven/unknown/ts-jest-no-false-positive.md
- tests/licensedcode/data/datadriven/unknown/ts-jest-no-false-positive.md.yml

Fixes: aboutcode-org#4221
Signed-off-by: Aliasghar Jawadwala <sharksurfauto@gmail.com>

Implements parser for CITATION.cff files used for software citation
metadata. Supports CFF spec versions 1.0.0+ with best-effort extraction.

Key features:
- PyYAML-based parsing with silent error handling
- Comprehensive author format support (family/given, name-only, ORCID)
- Field mappings: license→extracted_license_statement, etc.
- Package type: generic (citation metadata, not ecosystem-specific)
- Backward compatible with older CFF versions

Test suite includes 8 tests covering: file recognition, minimal/full
parsing, multiple author formats, error handling, and backward compatibility.

Fixes aboutcode-org#3580

Signed-off-by: Aliasghar Jawadwala <sharksurfauto@gmail.com>