Description
Below is a problem I solved with Claude:
Bug Report: kRSUnicode missing from UNIHAN_MANIFEST in Unihan_IRGSources.txt
Summary
The kRSUnicode and kTotalStrokes fields are missing from the UNIHAN_MANIFEST mapping for Unihan_IRGSources.txt, preventing these fields from being extracted via the Python API even though they exist in the source data file.
Environment
- unihan-etl version: 0.39.0 (bug also present in 0.37.0)
- Python version: 3.13
- Installation method: uv pip install unihan-etl
Bug Description
When requesting kRSUnicode or kTotalStrokes fields through the Python API, the library fails to load Unihan_IRGSources.txt because these fields are not listed in the internal UNIHAN_MANIFEST constant, despite being present in the actual Unicode data file.
Expected Behavior
When fields are requested that exist in Unihan_IRGSources.txt, the library should:
- Recognize that these fields require loading Unihan_IRGSources.txt
- Load the file and extract the field data
- Include the data in the exported output
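The lookup described above can be sketched roughly as follows. This is only an illustration of the expected behavior, not the library's actual code: `files_for_fields` is a hypothetical helper, and the `manifest` dict is a trimmed stand-in for UNIHAN_MANIFEST.

```python
def files_for_fields(manifest, fields):
    """Return the sorted file names whose manifest entry lists any requested field."""
    wanted = set(fields)
    return sorted(
        filename
        for filename, file_fields in manifest.items()
        if wanted.intersection(file_fields)
    )


# Trimmed, hypothetical manifest; the real UNIHAN_MANIFEST is much larger.
manifest = {
    "Unihan_Readings.txt": ("kDefinition", "kMandarin"),
    "Unihan_IRGSources.txt": ("kIICore", "kRSUnicode", "kTotalStrokes"),
}

needed = files_for_fields(manifest, ["kDefinition", "kMandarin", "kRSUnicode"])
print(needed)  # ['Unihan_IRGSources.txt', 'Unihan_Readings.txt']
```

With kRSUnicode listed under Unihan_IRGSources.txt, that file is selected for loading; without the entry it would be skipped.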
Actual Behavior
The library:
- Does not recognize kRSUnicode and kTotalStrokes as belonging to Unihan_IRGSources.txt
- Does not load Unihan_IRGSources.txt when these fields are requested
- Silently omits these fields from the output
Steps to Reproduce
import pathlib

from unihan_etl.core import Packager
from unihan_etl.options import Options

unihan_datapath = pathlib.Path(__file__).parent / 'unihan_data.json'
unihan_options = Options(
    format='json',
    destination=unihan_datapath,
    fields=['kDefinition', 'kMandarin', 'kRSUnicode'],
    expand=False,
)
unihan_pack = Packager(options=unihan_options)
unihan_pack.download()
unihan_pack.export()
# Check output - kRSUnicode will be missing from all entries

Output: The generated JSON contains kDefinition and kMandarin, but kRSUnicode is completely absent.
Debug info from logs:
Loading data: .../Unihan_Readings.txt, .../Unihan_RadicalStrokeCounts.txt
Notice that Unihan_IRGSources.txt is NOT being loaded.
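Because the omission is silent, a small check on the exported records makes it visible. The sketch below assumes the JSON export is a list of objects keyed by field name (an assumption about the output shape, not something confirmed here). `missing_fields` is a hypothetical helper and the sample records are placeholders, not real Unihan data.

```python
import json


def missing_fields(records, requested):
    """Return the requested fields that appear in none of the exported records."""
    seen = set()
    for record in records:
        seen.update(record.keys())
    return [field for field in requested if field not in seen]


# Hypothetical two-record sample standing in for unihan_data.json.
sample = json.loads(
    '[{"ucn": "U+3400", "kDefinition": "placeholder", "kMandarin": "placeholder"},'
    ' {"ucn": "U+3401", "kDefinition": "placeholder", "kMandarin": "placeholder"}]'
)

absent = missing_fields(sample, ["kDefinition", "kMandarin", "kRSUnicode"])
print(absent)  # ['kRSUnicode']
```

To run it against the real export, load the file with `json.loads(unihan_datapath.read_text(encoding="utf-8"))` in place of `sample`, provided the output really is a list of per-character objects.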
Root Cause
In src/unihan_etl/constants.py, the UNIHAN_MANIFEST mapping for Unihan_IRGSources.txt is missing kRSUnicode and kTotalStrokes:
'Unihan_IRGSources.txt': (
    'kCompatibilityVariant', 'kIICore',
    'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource',
    'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource',
    'kIRG_SSource', 'kIRG_TSource', 'kIRG_USource',
    'kIRG_UKSource', 'kIRG_VSource'
),
# kRSUnicode and kTotalStrokes are missing!

However, the actual Unihan_IRGSources.txt file from Unicode does contain these fields:
# Unihan_IRGSources.txt
# This file contains data on the following fields from the Unihan database:
# kCompatibilityVariant
# kIICore
# kIRG_GSource
# ...
# kRSUnicode <-- Present in file
# kTotalStrokes <-- Present in file
Verification
You can verify that the fields exist in the source file:
grep "kRSUnicode" ~/.cache/unihan_etl/downloads/Unihan_IRGSources.txt | head -5

Output:
U+3400 kRSUnicode 1.4
U+3401 kRSUnicode 1.5
U+3402 kRSUnicode 1.5
U+3403 kRSUnicode 2.2
U+3404 kRSUnicode 2.2
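The same check can be done in Python, which is handy where grep is unavailable. Unihan data files are tab-separated (code point, field name, value) with `#` comment lines. `fields_in_unihan_lines` is a hypothetical helper, and the sample lines are copied from the grep output above.

```python
def fields_in_unihan_lines(lines):
    """Collect field names from Unihan data lines (tab-separated: code point, field, value)."""
    fields = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#' comment header.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) >= 3:
            fields.add(parts[1])
    return fields


sample_lines = [
    "# Unihan_IRGSources.txt",
    "U+3400\tkRSUnicode\t1.4",
    "U+3401\tkRSUnicode\t1.5",
]

found = fields_in_unihan_lines(sample_lines)
print(sorted(found))  # ['kRSUnicode']
```

An open file object can be passed directly, for example `with open(path, encoding="utf-8") as f: fields_in_unihan_lines(f)`.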
Workaround
Users can monkey-patch the constant before creating a Packager:
from unihan_etl import constants
# Fix the manifest
constants.UNIHAN_MANIFEST['Unihan_IRGSources.txt'] = (
    'kCompatibilityVariant', 'kIICore',
    'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource',
    'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource',
    'kIRG_SSource', 'kIRG_TSource', 'kIRG_USource',
    'kIRG_UKSource', 'kIRG_VSource',
    'kRSUnicode',  # Add this
    'kTotalStrokes',  # Add this
)

After applying this fix, Unihan_IRGSources.txt is correctly loaded and the fields appear in the output.
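A slightly more defensive variant of the workaround appends only the missing names instead of rewriting the whole tuple, so it stays harmless once the manifest is fixed upstream. `add_fields` is a hypothetical helper, shown here on a trimmed stand-in dict. It assumes the manifest is a mutable mapping of tuples, as the assignment above suggests.

```python
def add_fields(manifest, filename, extra_fields):
    """Append extra_fields to manifest[filename], skipping names already listed."""
    current = tuple(manifest.get(filename, ()))
    additions = tuple(f for f in extra_fields if f not in current)
    manifest[filename] = current + additions
    return manifest[filename]


# Trimmed stand-in for constants.UNIHAN_MANIFEST.
manifest = {"Unihan_IRGSources.txt": ("kCompatibilityVariant", "kIICore")}

add_fields(manifest, "Unihan_IRGSources.txt", ["kRSUnicode", "kTotalStrokes"])
add_fields(manifest, "Unihan_IRGSources.txt", ["kRSUnicode"])  # no-op the second time
print(manifest["Unihan_IRGSources.txt"])
```

In real use the first argument would be `constants.UNIHAN_MANIFEST`, patched before the Packager is created.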
Proposed Fix
Update src/unihan_etl/constants.py to include the missing fields in the manifest:
'Unihan_IRGSources.txt': (
    'kCompatibilityVariant', 'kIICore',
    'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource',
    'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource',
    'kIRG_SSource', 'kIRG_TSource', 'kIRG_USource',
    'kIRG_UKSource', 'kIRG_VSource',
    'kRSUnicode',  # Add
    'kTotalStrokes',  # Add
),

Additional Context
- kRSUnicode (radical-stroke) is a commonly needed field for Chinese character analysis and lookup systems
- According to UAX #38, kRSUnicode has been in Unihan_IRGSources.txt since at least Unicode 15.1.0
- The manifest appears to be out of sync with the actual Unicode data file structure
Related Issues
- Issues "krsUnicode: Upcoming 3 apostrophe change" #318 and "kRSUnicode bug" #315 dealt with kRSUnicode apostrophe handling
- PR "UNIHAN: 11.0.0 -> 15.1.0 (2019 -> 2023)" #309 updated UNIHAN from 11.0.0 to 15.1.0 and removed deprecated fields (kRSJapanese, kRSKanWa, kRSKorean)
It's possible the manifest was not updated when these field locations changed in the Unicode specification.
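If that is what happened, a regression check comparing the manifest against the fields actually observed in each data file would catch future drift. The sketch below is one possible shape for such a check. `manifest_drift` is a hypothetical helper and both dicts are trimmed stand-ins, not the library's real data.

```python
def manifest_drift(manifest, observed):
    """Map each file name to the observed fields that its manifest entry does not list."""
    drift = {}
    for filename, fields in observed.items():
        missing = sorted(set(fields) - set(manifest.get(filename, ())))
        if missing:
            drift[filename] = missing
    return drift


# Trimmed stand-ins: the manifest entry versus fields seen in the data file.
manifest = {"Unihan_IRGSources.txt": ("kCompatibilityVariant", "kIICore")}
observed = {"Unihan_IRGSources.txt": {"kIICore", "kRSUnicode", "kTotalStrokes"}}

drift = manifest_drift(manifest, observed)
print(drift)  # {'Unihan_IRGSources.txt': ['kRSUnicode', 'kTotalStrokes']}
```

An empty result would mean every field seen in the data files is covered by the manifest.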
Impact
This bug affects anyone using the Python API to extract kRSUnicode or kTotalStrokes fields, which are essential for:
- Character lookup systems based on radical-stroke indexing
- Chinese language learning applications
- Font and typography tools
- CJK character analysis
Users receive incomplete data without any error or warning, which can silently corrupt results in their applications.