- We introduce MegaHan97K, a mega-category, large-scale dataset covering 97,455 Chinese character categories, the largest category count of any such dataset.
- With 97,455 categories, MegaHan97K has at least six times as many categories as existing datasets and is also the largest in data volume.
- MegaHan97K is the first dataset to support the latest Chinese GB18030-2022 standard, ensuring the most comprehensive coverage and compatibility with modern Chinese processing systems.
- MegaHan97K contains three distinct subsets: handwritten, historical, and synthetic. Each subset contains more character categories than existing datasets, giving it clear advantages in scale and diversity.
- MegaHan97K effectively mitigates long-tail distribution issues by providing a balanced and sufficient number of samples for each category, ensuring robust training and validation of CCR models.
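One way to check the balance of any labeled subset is to count samples per category and compare the rarest category with the most common one. The following is a minimal sketch and not part of the MegaHan97K code: the function name `balance_summary` and the toy labels are hypothetical, and how you obtain the list of labels depends on your own loading code.

```python
from collections import Counter


def balance_summary(labels):
    """Summarize class balance for an iterable of category labels.

    Returns (num_categories, min_count, max_count, imbalance_ratio),
    where imbalance_ratio = max_count / min_count (1.0 means perfectly balanced).
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    min_count = min(counts.values())
    max_count = max(counts.values())
    return len(counts), min_count, max_count, max_count / min_count


if __name__ == "__main__":
    # Toy labels for illustration only; these are not dataset statistics.
    toy_labels = ["一", "一", "丁", "丁", "丁", "七"]
    print(balance_summary(toy_labels))
```

An imbalance ratio close to 1.0 indicates that no category dominates; a large ratio points to a long-tail distribution.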
The original data of the dataset is sourced from public channels such as the Internet, and its copyright remains with the original providers. The collated and annotated dataset presented here is for non-commercial use only and is currently licensed to universities and research institutions. To apply for access, please fill in the application form as specified on the dataset’s official website. The applicant must be a full-time employee of a university or research institute and must sign the application form. To simplify review, we recommend affixing an official seal (a seal of a secondary-level department is acceptable).
| Setting | Dataset | Status |
|---|---|---|
| General CCR | Baiduyun:k4ch/OneDrive | Released |
| Zero-Shot CCR | Baiduyun:bxde/OneDrive | Released |
- Clone this repo:

  `git clone https://github.com/SCUT-DLVCLab/MegaHan97K.git`

- Execute the following command to obtain example samples from the MegaHan97K dataset:

  `python MegaHan_Dataloader.py`

Note:
- The MegaHan97K dataset can only be used for non-commercial research purposes. Scholars or organizations who want to use the MegaHan97K dataset should first fill in this Application Form, sign the Legal Commitment, and email both to us (eelwjin@scut.edu.cn, cc: lianwen.jin@gmail.com). When submitting the application form, please list or attach 1-2 of your publications from the past 6 years to show that you (or your team) do research in fields related to handwriting analysis and recognition, document image processing, and so on.
- We will give you the decompression password after your application has been received and approved.
- All users must follow all use conditions; otherwise, the authorization will be revoked.
- To access the entire dataset, please first download it, update the `data_root` in the Python `MegaHan_Dataloader.py` script, and then execute `python MegaHan_Dataloader.py`.

If you have any questions, feel free to contact Yuyi Zhang at yuyi.zhang11@foxmail.com.
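Before running the loader on the full download, it can help to confirm that `data_root` points at the expected folder. The sketch below is not taken from `MegaHan_Dataloader.py`; it assumes a hypothetical layout with one sub-folder per character category containing image files, which may differ from the actual structure of the released archives. The function name `count_images_per_category`, the extension list, and the placeholder path are all illustrative.

```python
import os

# Assumed image extensions; adjust to whatever the extracted archives contain.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp"}


def count_images_per_category(data_root):
    """Count image files in each immediate sub-folder of data_root.

    ASSUMPTION: data_root/<category>/<image files>. This layout is
    hypothetical and is not confirmed by the MegaHan97K repository.
    """
    counts = {}
    for name in sorted(os.listdir(data_root)):
        folder = os.path.join(data_root, name)
        if not os.path.isdir(folder):
            continue  # skip stray files at the top level
        counts[name] = sum(
            1
            for f in os.listdir(folder)
            if os.path.splitext(f)[1].lower() in IMAGE_EXTS
        )
    return counts


if __name__ == "__main__":
    data_root = "path/to/MegaHan97K"  # placeholder; use your own data_root
    if os.path.isdir(data_root):
        counts = count_images_per_category(data_root)
        print(f"{len(counts)} category folders, {sum(counts.values())} images")
    else:
        print(f"data_root not found: {data_root}")
```

If the reported folder count is zero or far from what you expect, the path or the assumed layout is probably wrong, and the loader script itself is the place to check how the data is organized.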
- Illustration of the handwritten-augmented data in MegaHan97K


MegaHan97K should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License for non-commercial research purposes.
- This repository can only be used for non-commercial research purposes.
- For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).
- Copyright 2025, Deep Learning and Vision Computing Lab (DLVC-Lab), South China University of Technology.






