Changes
This PR improves the normalization of numbers in Chinese text. The main changes are:
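1. Handling of consecutive Chinese numerals
   11、12、13 / 22、23 / 15、16 (separated by 、)
2. New range recognition rule
   21-22 (a range formed by two consecutive complete numbers)
3. Handling of mixed digits and English letters
   4a级景区
4. Improved handling of unit words
   6万 (the digit and the unit are separated correctly)
5. Extended date keywords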
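The rules listed above can be illustrated with a minimal Python sketch. This is not the PR's implementation: the function name `normalize`, the regular expressions, the date keyword list, and the unit list are all hypothetical, chosen only to reproduce the example outputs. It assumes that a range is emitted when two complete two-digit numbers appear back to back in ascending order, and that a four-character digit string in front of a date keyword is read digit by digit.

```python
import re

ALL_DIGITS = "零一二三四五六七八九"
DIGITS = {ch: i for i, ch in enumerate(ALL_DIGITS)}
NONZERO = "一二三四五六七八九"
TENS = "二三四五六七八九"
# Assumed lists for illustration only; the PR does not enumerate them.
DATE_KEYWORDS = ("财年", "年度", "年")
UNIT_CHARS = "万亿"

# A "complete" two-digit number: optional tens digit, 十, one unit digit.
_COMPLETE = f"([{TENS}]?)十([{NONZERO}])"
RANGE_RE = re.compile(_COMPLETE + _COMPLETE)
# Shared tens followed by two or more unit digits, not immediately
# followed by 十 (which would start another complete number).
SERIES_RE = re.compile(f"([{TENS}]?)十([{NONZERO}]{{2,}})(?!十)")
# Four digit characters directly before a date keyword.
YEAR_RE = re.compile(f"([{ALL_DIGITS}]{{4}})(?=" + "|".join(DATE_KEYWORDS) + ")")
# A lone digit (not preceded by another numeral) before a letter or unit word.
SINGLE_RE = re.compile(
    f"(?<![{ALL_DIGITS}十百千])([{NONZERO}])(?=[A-Za-z{UNIT_CHARS}])"
)


def _two_digit(tens: str, unit: str) -> int:
    """Convert e.g. ("二", "一") to 21; an empty tens part means 1 (十一 = 11)."""
    return (DIGITS[tens] if tens else 1) * 10 + DIGITS[unit]


def _range(m: re.Match) -> str:
    a = _two_digit(m.group(1), m.group(2))
    b = _two_digit(m.group(3), m.group(4))
    # Only ascending pairs are treated as a range; otherwise leave the text alone.
    return f"{a}-{b}" if b > a else m.group(0)


def _series(m: re.Match) -> str:
    units = [DIGITS[ch] for ch in m.group(2)]
    # Require unit digits that increase by one (e.g. 一二三); otherwise skip.
    if any(y - x != 1 for x, y in zip(units, units[1:])):
        return m.group(0)
    tens = DIGITS[m.group(1)] if m.group(1) else 1
    return "、".join(str(tens * 10 + u) for u in units)


def normalize(text: str) -> str:
    """Apply the sketched rules in order: range, series, year, single digit."""
    text = RANGE_RE.sub(_range, text)
    text = SERIES_RE.sub(_series, text)
    text = YEAR_RE.sub(lambda m: "".join(str(DIGITS[c]) for c in m.group(1)), text)
    text = SINGLE_RE.sub(lambda m: str(DIGITS[m.group(1)]), text)
    return text


if __name__ == "__main__":
    for sample in ["十一二三月份", "四a级景区和六万游客", "二零二三财年", "从二十一二十二章"]:
        print(sample, "→", normalize(sample))
```

The range rule runs before the series rule so that 二十一二十二 is read as two complete numbers rather than as 二十 followed by the unit digits 一二. A real implementation would likely need a full Chinese numeral parser and a broader unit list; this sketch covers only the patterns named in the PR description.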
Test case examples
"十一二三月份" → output: "11、12、13月份"
"四a级景区和六万游客" → output: "4a级景区和6万游客"
"二零二三财年" → output: "2023财年" (requires date recognition)
"从二十一二十二章" → output: "从21-22章"
Problems solved
Scope of impact