feat: comprehensive metadata extraction across all language bindings#144
Merged
feat: comprehensive metadata extraction across all language bindings#144
Conversation
Implement convert_with_metadata() function that extracts rich document metadata during HTML-to-Markdown conversion in a single pass. Extracted metadata includes document-level info (title, description, keywords, author, canonical URL), header hierarchy (h1-h6 with nesting), link extraction with categorization, image metadata with type classification, structured data extraction (JSON-LD), and language attributes. Implementation uses new metadata module with complete type system, single-pass collection via MetadataCollector during tree traversal, feature-gated behind metadata Cargo feature for zero overhead when disabled, and follows the InlineImageCollector pattern for consistency. Testing includes 20 metadata unit tests, 4 integration tests, 14 doctests, and all 430+ existing tests passing with zero clippy warnings. Performance optimizations include pre-allocated collections, single tree traversal with no double-pass, and less than 5% overhead when enabled.
Implement complete metadata extraction API across Python, TypeScript, Ruby, PHP, and WASM bindings with comprehensive type safety and test coverage. ## Python (PyO3) - Added convert_with_metadata() returning tuple[str, ExtendedMetadata] - Implemented 11 type conversion functions (document, headers, links, images, structured data) - Created MetadataConfig class with full property accessors - Added complete TypedDict type stubs in _rust.pyi (6 metadata types) - Implemented 51 comprehensive integration tests across 9 test classes - All tests passing with mypy --strict validation ## TypeScript (NAPI-RS) - Added 8 NAPI object structs for complete type coverage - Implemented 10 Rust-to-JS conversion helpers with proper enum mapping - Added convertWithMetadata() and convertWithMetadataBuffer() functions - Enabled metadata feature in default Cargo features for npm packages - Added hasMetadataSupport() runtime feature detection - Created 14 comprehensive vitest tests covering all use cases - Added METADATA.md (600+ lines) and QUICK_START.md documentation ## Ruby (Magnus) - Implemented convert_with_metadata() returning [markdown, metadata_hash] - Added 15+ type conversion functions for Ruby Hash construction - Created comprehensive RBS type signatures with 9 type aliases - Added Ruby wrapper method with proper aliasing - Enabled metadata feature in default Cargo features - Fixed RBS type annotations (removed redundant ? symbols) - Implemented 40+ RSpec tests with complete coverage - Added METADATA.md with API documentation ## PHP (ext-php-rs) - Implemented convert_html_with_metadata() with ZendHashTable conversion - Created 5 readonly Value Objects (DocumentMetadata, HeaderMetadata, LinkMetadata, ImageMetadata, StructuredData) - Added comprehensive payload validation with InvalidOption exceptions - Implemented Bridge and Converter service layer integration - Added global helper function and HtmlToMarkdown facade method - Created 21 PHPUnit tests with PHPStan level max compliance - All tests passing with proper type safety ## WASM (wasm-bindgen) - Added WasmMetadataConfig struct with getters/setters - Implemented convertWithMetadata() and convertBytesWithMetadata() - Used serde_wasm_bindgen for efficient JSON serialization - Fixed 3 clippy style violations (Default impl, unwrap_or_default, struct init) - Added comprehensive WASM-specific tests - Updated README.md with complete API documentation - Zero clippy warnings with -D warnings flag ## Cross-Binding Features - Feature-gated implementation: zero overhead when disabled - Single-pass metadata collection during tree traversal - Pre-allocated collections for optimal performance - Type-safe conversions across all bindings - Comprehensive error handling with panic guards - Full documentation with examples in each language - Test coverage: 51 (Python) + 14 (TypeScript) + 40+ (Ruby) + 21 (PHP) ## Breaking Changes None. All new APIs are additive only. ## Testing - All language bindings compile cleanly - Zero linting warnings across all languages - Python: 51 tests passing, mypy --strict validated - TypeScript: 14 tests ready, NAPI types auto-generated - Ruby: 40+ tests ready, RBS types validated - PHP: 21 tests passing, PHPStan level max compliant - WASM: Tests passing, zero clippy warnings
Update all package manifests, lock files, and CHANGELOG.md for 2.13.0 release with comprehensive metadata extraction feature across all language bindings.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
Adds comprehensive metadata extraction API to the html-to-markdown library across all language bindings (Python, TypeScript, Ruby, PHP, WASM) with full type safety and test coverage.
Key Features
New
convert_with_metadata()FunctionmetadataCargo featureMetadata Extracted
Language Binding Status
Implementation Details
Python
tuple[str, ExtendedMetadata]TypeScript
hasMetadataSupport()Ruby
PHP
WASM
Breaking Changes
None. All new APIs are additive only. Existing
convert()andconvert_with_inline_images()functions remain unchanged.Testing
Documentation
_rust.pyiVersion
All packages bumped to 2.13.0 with synchronized version management across 39 manifest files.
Related
Closes #[issue-number] (if applicable)