Bug Report: kRSUnicode missing from UNIHAN_MANIFEST in Unihan_IRGSources.txt #345

@Andrei-Pozolotin

Below is a problem I solved with Claude:


Summary

The kRSUnicode and kTotalStrokes fields are missing from the UNIHAN_MANIFEST mapping for Unihan_IRGSources.txt, preventing these fields from being extracted via the Python API even though they exist in the source data file.

Environment

  • unihan-etl version: 0.39.0 (bug also present in 0.37.0)
  • Python version: 3.13
  • Installation method: uv pip install unihan-etl

Bug Description

When kRSUnicode or kTotalStrokes is requested through the Python API, the library never loads Unihan_IRGSources.txt, because these fields are not listed for that file in the internal UNIHAN_MANIFEST constant, even though they are present in the actual Unicode data file. No error is raised.

Expected Behavior

When fields are requested that exist in Unihan_IRGSources.txt, the library should:

  1. Recognize that these fields require loading Unihan_IRGSources.txt
  2. Load the file and extract the field data
  3. Include the data in the exported output
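
The three steps above amount to a reverse lookup from field name to source file. The following is a minimal sketch of that lookup, assuming the manifest is a plain mapping of file name to a tuple of field names. `files_for_fields` is a hypothetical helper written for illustration, not a function in unihan-etl, and the manifest here is a trimmed toy copy.

```python
def files_for_fields(manifest, fields):
    """Return the manifest files that declare at least one requested field."""
    wanted = set(fields)
    return [name for name, declared in manifest.items() if wanted & set(declared)]


# Toy manifest, trimmed for illustration (not the library's real constant).
manifest = {
    "Unihan_Readings.txt": ("kDefinition", "kMandarin"),
    "Unihan_IRGSources.txt": ("kIICore", "kIRG_GSource"),
}

requested = ["kDefinition", "kMandarin", "kRSUnicode"]

# With the field undeclared, the IRG sources file is never selected.
before = files_for_fields(manifest, requested)

# Declaring the fields under the file makes the lookup pick it up.
manifest["Unihan_IRGSources.txt"] += ("kRSUnicode", "kTotalStrokes")
after = files_for_fields(manifest, requested)

print(before)
print(after)
```

Under this model a field that is not declared for a file can never cause that file to be loaded, which matches the behavior reported here.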

Actual Behavior

The library:

  1. Does not recognize kRSUnicode and kTotalStrokes as belonging to Unihan_IRGSources.txt
  2. Does not load Unihan_IRGSources.txt when these fields are requested
  3. Silently omits these fields from the output

Steps to Reproduce

import pathlib
from unihan_etl.core import Packager
from unihan_etl.options import Options

unihan_datapath = pathlib.Path(__file__).parent / 'unihan_data.json'

unihan_options = Options(
    format='json',
    destination=unihan_datapath,
    fields=['kDefinition', 'kMandarin', 'kRSUnicode'],
    expand=False,
)

unihan_pack = Packager(options=unihan_options)
unihan_pack.download()
unihan_pack.export()

# Check output - kRSUnicode will be missing from all entries

Output: The generated JSON contains kDefinition and kMandarin, but kRSUnicode is completely absent.
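
One way to confirm the omission is to count how many exported records carry each field. This is a small sketch, assuming the JSON export is a list of objects keyed by field name (that layout is an assumption here). `field_coverage` is a hypothetical helper, and the sample values are placeholders rather than real Unihan data.

```python
from collections import Counter


def field_coverage(records):
    """Count how many records contain each key."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


# Stand-in for the parsed export; values are placeholders, not real data.
sample = [
    {"ucn": "U+3400", "kDefinition": "placeholder"},
    {"ucn": "U+3401", "kMandarin": "placeholder"},
]

coverage = field_coverage(sample)

# A Counter reports 0 for keys it has never seen.
print(coverage["kRSUnicode"])
```

Against the real file, the records could be loaded with `json.loads(unihan_datapath.read_text(encoding="utf-8"))` and passed to the same function.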

Debug info from logs:

Loading data: .../Unihan_Readings.txt, .../Unihan_RadicalStrokeCounts.txt

Notice that Unihan_IRGSources.txt is NOT being loaded.

Root Cause

In src/unihan_etl/constants.py, the UNIHAN_MANIFEST mapping for Unihan_IRGSources.txt is missing kRSUnicode and kTotalStrokes:

'Unihan_IRGSources.txt': (
    'kCompatibilityVariant', 'kIICore', 
    'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 
    'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 
    'kIRG_SSource', 'kIRG_TSource', 'kIRG_USource', 
    'kIRG_UKSource', 'kIRG_VSource'
),
# kRSUnicode and kTotalStrokes are missing!

However, the actual Unihan_IRGSources.txt file from Unicode does contain these fields:

# Unihan_IRGSources.txt
# This file contains data on the following fields from the Unihan database:
#	kCompatibilityVariant
#	kIICore
#	kIRG_GSource
#	...
#	kRSUnicode    <-- Present in file
#	kTotalStrokes <-- Present in file

Verification

You can verify that the fields exist in the source file:

grep "kRSUnicode" ~/.cache/unihan_etl/downloads/Unihan_IRGSources.txt | head -5

Output:

U+3400	kRSUnicode	1.4
U+3401	kRSUnicode	1.5
U+3402	kRSUnicode	1.5
U+3403	kRSUnicode	2.2
U+3404	kRSUnicode	2.2
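
The same check can be done without grep. The sketch below collects every field name that appears in the tab-separated data lines of a Unihan file. `fields_in_unihan_lines` is a hypothetical helper, and the sample lines are copied from the grep output above.

```python
def fields_in_unihan_lines(lines):
    """Collect field names from tab-separated Unihan data lines."""
    found = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        parts = line.split("\t")
        if len(parts) >= 3:
            found.add(parts[1])  # columns: code point, field, value
    return found


sample = [
    "# Unihan_IRGSources.txt\n",
    "U+3400\tkRSUnicode\t1.4\n",
    "U+3401\tkRSUnicode\t1.5\n",
]

fields = sorted(fields_in_unihan_lines(sample))
print(fields)
```

For the real file, pass an open handle instead of the sample list, for example `with open(path, encoding="utf-8") as fh: fields_in_unihan_lines(fh)`.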

Workaround

Users can monkey-patch the constant before creating a Packager:

from unihan_etl import constants

# Fix the manifest
constants.UNIHAN_MANIFEST['Unihan_IRGSources.txt'] = (
    'kCompatibilityVariant', 'kIICore', 
    'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 
    'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 
    'kIRG_SSource', 'kIRG_TSource', 'kIRG_USource', 
    'kIRG_UKSource', 'kIRG_VSource',
    'kRSUnicode',    # Add this
    'kTotalStrokes', # Add this
)

After applying this fix, Unihan_IRGSources.txt is correctly loaded and the fields appear in the output.
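
A slightly less brittle variant of the workaround appends only the missing names instead of restating the whole tuple, so it keeps working if the upstream list changes. This is a sketch under the same assumption that the manifest is a dict of tuples. `ensure_fields` is a hypothetical helper, and the dict below is a stand-in for the library constant.

```python
def ensure_fields(manifest, filename, fields):
    """Append fields not yet declared for filename, keeping existing order."""
    current = tuple(manifest.get(filename, ()))
    missing = tuple(f for f in fields if f not in current)
    manifest[filename] = current + missing
    return missing


# Stand-in dict; with the library installed you would pass
# unihan_etl.constants.UNIHAN_MANIFEST instead.
manifest = {"Unihan_IRGSources.txt": ("kCompatibilityVariant", "kIICore")}

added = ensure_fields(
    manifest, "Unihan_IRGSources.txt", ("kRSUnicode", "kTotalStrokes")
)

# A second call is a no-op, so the patch is safe to apply repeatedly.
added_again = ensure_fields(manifest, "Unihan_IRGSources.txt", ("kRSUnicode",))

print(added, added_again)
```

As with the monkey-patch above, this would need to run before the Packager is created.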

Proposed Fix

Update src/unihan_etl/constants.py to include the missing fields in the manifest:

'Unihan_IRGSources.txt': (
    'kCompatibilityVariant', 'kIICore', 
    'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 
    'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 
    'kIRG_SSource', 'kIRG_TSource', 'kIRG_USource', 
    'kIRG_UKSource', 'kIRG_VSource',
    'kRSUnicode',    # Add
    'kTotalStrokes', # Add
),

Additional Context

  • kRSUnicode (radical-stroke) is a commonly needed field for Chinese character analysis and lookup systems
  • According to UAX #38, kRSUnicode has been in Unihan_IRGSources.txt since at least Unicode 15.1.0
  • The manifest appears to be out of sync with the actual Unicode data file structure
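
The drift described in the last bullet could be caught automatically by comparing a manifest entry with the field list in the data file's comment header. The sketch below assumes the header layout shown in the excerpt under Root Cause (a `#`, a tab, then the field name). `header_fields` and `undeclared_fields` are hypothetical helpers, not part of unihan-etl.

```python
import re

# Header layout assumed from the excerpt: "#<tab>kFieldName".
HEADER_FIELD = re.compile(r"^#\s+(k[A-Za-z0-9_]+)\s*$")


def header_fields(lines):
    """Return field names listed in the comment header of a Unihan file."""
    names = []
    for line in lines:
        match = HEADER_FIELD.match(line.rstrip("\n"))
        if match:
            names.append(match.group(1))
    return names


def undeclared_fields(declared, lines):
    """Fields named in the file header but absent from the manifest entry."""
    known = set(declared)
    return [name for name in header_fields(lines) if name not in known]


header = [
    "# Unihan_IRGSources.txt\n",
    "# This file contains data on the following fields from the Unihan database:\n",
    "#\tkCompatibilityVariant\n",
    "#\tkIICore\n",
    "#\tkRSUnicode\n",
    "#\tkTotalStrokes\n",
]
declared = ("kCompatibilityVariant", "kIICore")

drift = undeclared_fields(declared, header)
print(drift)
```

A check like this could run in the test suite against a freshly downloaded file, so that a future relocation of fields fails loudly instead of silently dropping data.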

Related Issues

It's possible the manifest was not updated when these field locations changed in the Unicode specification.


Impact

This bug affects anyone using the Python API to extract kRSUnicode or kTotalStrokes fields, which are essential for:

  • Character lookup systems based on radical-stroke indexing
  • Chinese language learning applications
  • Font and typography tools
  • CJK character analysis

Users receive incomplete data without any error or warning, so the missing fields can propagate unnoticed into their applications.
