Add support for parsing CITATION.cff metadata files #4728

sharksurfauto-byte · 2026-02-04T06:31:28Z

Summary

Implements a parser for CITATION.cff files, which carry software citation metadata. Supports CFF spec versions 1.0.0 and later with best-effort extraction.

Changes

New parser module: src/packagedcode/citation.py (231 lines)
Comprehensive test suite: tests/packagedcode/test_citation.py (8 tests)
Test fixtures covering minimal, full-featured, edge cases, and backward compatibility

Key Features

PyYAML-based parsing with silent error handling (no crashes, no noise)
Comprehensive author format support: family+given names, name-only, ORCID storage
Best-effort extraction: only cff-version is required; all other fields are optional
Package type: generic (citation metadata ≠ ecosystem-specific packages)
Backward compatible with older CFF versions (tested with 1.0.0)
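The author handling listed above can be sketched roughly as follows. This is an illustration, not the code in src/packagedcode/citation.py: the function name `build_party` and the shape of the returned mapping (including the `orcid` key) are assumptions made for the example. Only the CFF keys `given-names`, `family-names`, `name`, and `orcid` come from the CFF format itself.

```python
from typing import Optional


def build_party(author: object) -> Optional[dict]:
    """Return a simple mapping for one CFF author entry, or None if unusable.

    Hypothetical helper: the name and the output keys are illustrative only.
    """
    if not isinstance(author, dict):
        # Malformed entries are skipped silently rather than raising.
        return None

    given = str(author.get("given-names") or "").strip()
    family = str(author.get("family-names") or "").strip()

    # Person entries carry given/family names; entity entries carry "name".
    name = " ".join(part for part in (given, family) if part)
    if not name:
        name = str(author.get("name") or "").strip()
    if not name:
        return None

    return {"role": "author", "name": name, "orcid": author.get("orcid")}
```

An entry with only `name` (for example an organization) yields a party with that name and `orcid` set to `None`, while an entry with no usable name is dropped.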

Design Decisions

Why package type `generic`?

CFF describes citation metadata, not distributable artifacts. A CFF file can describe software, datasets, papers, or other citable objects. The `type: software` field in CFF does not identify a package ecosystem (npm, pypi, etc.). Using generic avoids misleading over-inference.

Why `extracted_license_statement` not `declared_license_expression`?

The CFF license field may contain free-form text or non-SPDX identifiers. Storing the raw value in extracted_license_statement lets ScanCode's license detection normalize it later, following best-effort extraction principles.
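A minimal sketch of the two mapping decisions above, assuming the CFF file has already been loaded into a plain dict. The function name `to_package_fields` is hypothetical and the returned dict is a simplified stand-in for ScanCode's package data, not its real model.

```python
def to_package_fields(cff_data: dict) -> dict:
    """Map a loaded CFF mapping to a few package-style fields (illustrative)."""
    # Always "generic": the CFF "type" key is deliberately not consulted.
    fields = {"type": "generic"}

    license_value = cff_data.get("license")
    if license_value:
        # Keep the raw value untouched; normalization is left to a later
        # license detection step instead of being guessed here.
        fields["extracted_license_statement"] = license_value

    return fields
```

Because the raw value is passed through unchanged, a non-SPDX string such as "Custom terms, see LICENSE" is preserved instead of being rejected or rewritten.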

Why only `cff-version` is required?

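The parser requires only cff-version to accept a file. The CFF specification itself marks more keys as required (the 1.2.0 schema requires authors, cff-version, message, and title), but the parser treats everything except cff-version as optional so that incomplete files still yield data. This follows ScanCode's best-effort extraction philosophy.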
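The acceptance rule can be illustrated with a small check on already-loaded data. This is a sketch under assumptions: `accept_cff` is a hypothetical name, and `data` stands for whatever a YAML loader such as `yaml.safe_load` returned, which may be a mapping, a list, a scalar, or `None` for an empty file.

```python
from typing import Any, Optional


def accept_cff(data: Any) -> Optional[dict]:
    """Return the mapping if it looks like CFF data, otherwise None."""
    if not isinstance(data, dict):
        # Empty files, lists, and scalars are not CFF documents.
        return None
    if not data.get("cff-version"):
        # cff-version is the only key this sketch insists on.
        return None
    return data
```

Returning `None` instead of raising matches the silent error handling described under Key Features: a caller can simply skip files for which nothing is returned.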

Test Coverage

File recognition (uppercase/lowercase)
Minimal parsing (only required fields)
Full-featured parsing
Multiple author formats with ORCID
Error handling (invalid YAML, missing cff-version)
Backward compatibility (CFF v1.0.0)
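The file recognition case in the list above could be exercised with a check like the following. This is only a sketch: `is_citation_file` is a hypothetical helper, and case-insensitive matching on the file name is an assumption drawn from the "uppercase/lowercase" wording, not from the parser's actual path patterns.

```python
from pathlib import PurePosixPath


def is_citation_file(location: str) -> bool:
    """Return True if the path's file name is CITATION.cff in any letter case."""
    return PurePosixPath(location).name.lower() == "citation.cff"
```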

Fixes #3580 (Add support for citation file format)

Moved free-unknown_88.RULE to inactive directory as it was causing false positives when scanning casual documentation that mentions 'open source license'. This resolves issue aboutcode-org#4221 where ts-jest README was incorrectly flagged with LicenseRef-scancode-free-unknown.

The rule matched only 3 words ('open source license') with low relevance (50), making it too broad and triggering on non-licensing contexts.

Added test case to prevent regression:
- tests/licensedcode/data/datadriven/unknown/ts-jest-no-false-positive.md
- tests/licensedcode/data/datadriven/unknown/ts-jest-no-false-positive.md.yml

Fixes: aboutcode-org#4221
Signed-off-by: Aliasghar Jawadwala <sharksurfauto@gmail.com>

Implements parser for CITATION.cff files used for software citation metadata. Supports CFF spec versions 1.0.0+ with best-effort extraction.

Key features:
- PyYAML-based parsing with silent error handling
- Comprehensive author format support (family/given, name-only, ORCID)
- Field mappings: license→extracted_license_statement, etc.
- Package type: generic (citation metadata, not ecosystem-specific)
- Backward compatible with older CFF versions

Test suite includes 8 tests covering: file recognition, minimal/full parsing, multiple author formats, error handling, and backward compatibility.

Fixes aboutcode-org#3580
Signed-off-by: Aliasghar Jawadwala <sharksurfauto@gmail.com>

sharksurfauto-byte added 2 commits February 1, 2026 17:40
