📢:Good news! 21,800 hours of multi-label Cantonese speech data are also available at ⭐WenetSpeech-Yue⭐.
📢:Good news! 8,000 hours of multi-label Wu dialect speech data are also available at ⭐WenetSpeech-Wu⭐.
WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing
Yuhang Dai1,*, Ziyu Zhang1,*, Shuai Wang4,5, Longhao Li1, Zhao Guo1, Tianlun Zuo1, Shuiyuan Wang1, Hongfei Xue1, Chengyou Wang1, Qing Wang3, Xin Xu2, Hui Bu2, Jie Li3, Jian Kang3, Binbin Zhang5, Lei Xie1,†
1 Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
2 Beijing AISHELL Technology Co., Ltd.
3 Institute of Artificial Intelligence (TeleAI), China Telecom
4 School of Intelligence Science and Technology, Nanjing University
5 WeNet Open Source Community
🤖 ASR Models | 👨‍💻 TTS Models
📑 Paper | 🎤 Demo Page | 💬 Contact Us
This is the official repository 👑 for the WenetSpeech-Chuan dataset and the source code for the Chuan-Pipe speech data preprocessing pipeline.
- The WenetSpeech-Chuan dataset is available at WenetSpeech-Chuan.
- The WSChuan-eval benchmark is available at WSChuan-ASR-eval and WSChuan-TTS-eval.
- The ASR models are available at WSChuan-ASR.
- The TTS models are available at WSChuan-TTS.
Note: Bold indicates best performance, underlined indicates second-best performance, and a light green background indicates models finetuned on a high-quality internal corpus (to show the system's potential as a foundation model). All results are character error rates (CER %, lower is better).
| Model | Model Size | WSC-Eval-ASR Easy | WSC-Eval-ASR Hard | WSC-Eval-ASR Total | MagicData Conversation | MagicData Daily-Use | Avg. |
|---|---|---|---|---|---|---|---|
| **with LLM** | | | | | | | |
| Kimi-Audio | 7B | 16.65 | 28.66 | 17.66 | 24.67 | 5.77 | 18.68 |
| FireRedASR-LLM | 8.3B | 12.80 | 25.27 | 14.40 | 17.68 | 6.69 | 15.37 |
| Qwen2.5-Omni | 3B | 16.94 | 26.01 | 18.20 | 20.40 | 6.32 | 17.69 |
| Qwen2.5-Omni-WSC-Finetune⭐ | 3B | 14.36 | 24.14 | 15.61 | 18.45 | 6.15 | 15.74 |
| Qwen2.5-Omni + internal data⭐ | 3B | 13.17 | 23.36 | 14.81 | 18.50 | 5.88 | 15.14 |
| Qwen2.5-Omni-WSC-Finetune + internal data⭐ | 3B | 12.93 | 23.19 | 14.25 | 17.95 | 5.89 | 14.84 |
| **without LLM** | | | | | | | |
| SenseVoice-small | 234M | 17.43 | 28.38 | 18.39 | 23.50 | 8.77 | 19.29 |
| Whisper | 244M | 52.06 | 63.99 | 53.59 | 55.88 | 52.03 | 55.51 |
| FireRedASR-AED | 1.1B | 13.29 | 23.64 | 14.62 | 17.84 | 6.69 | 15.14 |
| Paraformer | 220M | 14.34 | 24.61 | 15.66 | 19.81 | 8.16 | 16.52 |
| Paraformer-WSC-Finetune⭐ | 220M | 12.15 | 22.60 | 13.51 | 16.60 | 8.02 | 14.58 |
| Paraformer + internal data⭐ | 220M | 11.93 | 21.82 | 13.14 | 15.61 | 6.77 | 13.85 |
| Paraformer-WSC-Finetune + internal data⭐ | 220M | 11.59 | 21.59 | 12.87 | 14.59 | 6.28 | 13.38 |
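The numbers above are character error rates. As a quick sanity check, here is a minimal scoring sketch using the jiwer library; the file names are placeholders (one utterance per line, aligned by line number), and text normalization such as punctuation removal is omitted:

```python
# Minimal CER scoring sketch with jiwer; hypotheses.txt / references.txt
# are placeholder files, not artifacts shipped with this repository.
import jiwer

with open("hypotheses.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("references.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

# jiwer.cer computes the corpus-level character error rate.
print(f"CER: {jiwer.cer(refs, hyps) * 100:.2f}%")
```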
Note: Bold indicates best performance, underlined indicates second-best performance, and a light green background indicates models trained on WenetSpeech-Chuan or additionally finetuned on an internal high-quality dataset.
WSC-Eval-TTS-easy results:
| Model | CER(%)↓ | SIM(%)↑ | IMOS↑ | SMOS↑ | AMOS↑ |
|---|---|---|---|---|---|
| Step-Audio-TTS | 10.83 | 67.66 | 3.81 | 2.86 | 3.15 |
| CosyVoice 2.0 | 7.14 | 70.27 | 3.88 | 3.10 | 3.69 |
| Qwen-TTS | 4.13 | - | 3.95 | - | 3.90 |
| CosyVoice2-WSC | 4.28 | 72.78 | 4.13 | 3.94 | 4.05 |
| CosyVoice2-WSC-SFT | 4.08 | 78.84 | 4.10 | 4.16 | 4.20 |
WSC-Eval-TTS-hard results:
| Model | CER(%)↓ | SIM(%)↑ | IMOS↑ | SMOS↑ | AMOS↑ |
|---|---|---|---|---|---|
| Step-Audio-TTS | 12.52 | 54.52 | 3.75 | 2.77 | 3.06 |
| CosyVoice 2.0 | 9.06 | 60.10 | 3.96 | 2.73 | 3.81 |
| Qwen-TTS | 7.35 | - | 4.02 | - | 3.88 |
| CosyVoice2-WSC | 8.78 | 62.59 | 3.85 | 2.78 | 3.92 |
| CosyVoice2-WSC-SFT | 7.22 | 67.96 | 4.01 | 3.03 | 3.98 |
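SIM(%) in these tables measures speaker similarity between the prompt and the synthesized audio. A hedged sketch of how such a score is typically computed, assuming embeddings from any speaker-verification model (the embedding-extraction step is not shown and is not code from this repo):

```python
import numpy as np

def sim_percent(prompt_emb: np.ndarray, synth_emb: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings, scaled to percent."""
    cos = np.dot(prompt_emb, synth_emb) / (
        np.linalg.norm(prompt_emb) * np.linalg.norm(synth_emb)
    )
    return float(cos) * 100.0
```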
- Contains 10,000 hours of Chuan-Yu (Sichuan-Chongqing) dialect speech with rich annotations, making it the largest open-source resource for Chuan-Yu dialect speech research.
- Stores metadata in a single JSON file, including audio path, duration, text confidence, speaker identity, SNR, DNSMOS, age, gender, and character-level timestamps; additional metadata tags may be added in the future (see the loading sketch after this list).
- Covers ten domains, including short videos, entertainment, live streams, documentaries, audiobooks, drama, interviews, news, and others.
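A minimal loading-and-filtering sketch, assuming the metadata JSON is a top-level list of per-utterance dicts with keys such as "confidence" and "dnsmos"; the file name, key names, and thresholds are illustrative, so check the released JSON for the exact schema:

```python
import json

# Placeholder path; use the metadata file shipped with the dataset.
with open("wenetspeech_chuan_metadata.json", encoding="utf-8") as f:
    utterances = json.load(f)

# Keep utterances with reliable transcripts and decent perceptual quality.
selected = [
    u for u in utterances
    if u.get("confidence", 0.0) >= 0.90 and u.get("dnsmos", 0.0) >= 3.0
]
print(f"kept {len(selected)} of {len(utterances)} utterances")
```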
To address the unique linguistic characteristics of Chuan-Yu dialect, we propose WSChuan-eval, a comprehensive benchmark encompassing both Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) tasks.
We introduce WSC-Eval-ASR, a test set developed for Automatic Speech Recognition (ASR) as a key task in speech understanding. It features multi-round manual annotations, including text transcripts and emotion, age, and gender labels. The set is divided into Easy and Hard subsets according to source domain and acoustic environment.
| Set | Main domain | Hours (h) |
|---|---|---|
| Easy | audiobook, reading | 8.55 |
| Hard | short videos, entertainment, drama | 1.15 |
We introduce WSChuan-TTS-eval, a standardized test set constructed to address the lack of benchmarks for Sichuanese dialect text-to-speech (TTS). It comprises two subsets:
- WSC-Eval-TTS-easy, which contains sentences with dialectal words covering various domains;
- WSC-Eval-TTS-hard, which includes long sentences and sentences of diverse styles (e.g., tongue twisters, folk sayings, emotional speech) generated by Large Language Models (LLMs).
For audio prompts, 10 speakers (5 male, 5 female) are selected from MagicData and internal recordings, each recording 200 sentences to ensure balance across gender, age, and accent.
Chuan-Pipe Overview:
Chuan-Pipe collects large-scale, in-the-wild speech recordings across diverse domains such as storytelling, drama, commentary, vlogs, food, entertainment, news, and education. These long recordings are segmented into short clips with voice activity detection (VAD), yielding utterance-level data for transcription and quality evaluation.
The initial stage of the pipeline focuses on data acquisition, segmentation, and the enrichment of speech segments with multi-dimensional paralinguistic labels. Raw data acquisition begins with mining metadata from online video platforms to identify content potentially containing Sichuanese dialects. Following an initial manual verification to confirm the presence of the target dialect, the acquired audio streams undergo a multi-stage workflow:
VAD & Segmentation: Long audio streams are segmented into 5-25 second clips using Voice Activity Detection (VAD), removing non-speech portions like silence and noise.
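A hedged sketch of this step using Silero VAD as a stand-in (the pipeline's exact VAD tool is not specified here); only the 5-25 second bound comes from the text above:

```python
import torch

# Silero VAD ships a model plus helper utilities via torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

SR = 16000
wav = read_audio("long_recording.wav", sampling_rate=SR)  # placeholder path
segments = get_speech_timestamps(wav, model, sampling_rate=SR)

# Keep clips in the 5-25 s window used by Chuan-Pipe; shorter or longer
# spans would be dropped or re-split in a real pipeline.
clips = [s for s in segments if 5 * SR <= s["end"] - s["start"] <= 25 * SR]
```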
Single-Speaker Selection & Clustering: We first employ the pyannote toolkit to isolate single-speaker segments. Subsequently, speaker embeddings are extracted with the CAM++ model and clustered to assign a consistent speaker ID to all utterances from a single individual.
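A sketch of the clustering half of this step, assuming CAM++ embeddings have already been extracted into a matrix with one row per utterance; the distance threshold is an illustrative assumption, not the pipeline's value:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def assign_speaker_ids(embeddings: np.ndarray) -> np.ndarray:
    """Cluster utterance embeddings; one cluster label = one speaker ID."""
    clusterer = AgglomerativeClustering(
        n_clusters=None,          # let the threshold determine cluster count
        distance_threshold=0.7,   # assumed cosine-distance cutoff
        metric="cosine",          # named "affinity" on scikit-learn < 1.2
        linkage="average",
    )
    return clusterer.fit_predict(embeddings)
```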
Paralinguistic Annotation: Speaker gender is identified using a pre-trained classifier (98.7% accuracy). Age is estimated via the Vox-Profile benchmark and categorized into age stages (children, teenager, young, middle-aged, old). Emotion is labeled by a majority vote over predictions from Emotion2vec and SenseVoice, covering seven categories (happy, angry, sad, neutral, fearful, surprised, and disgusted).
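A minimal sketch of the majority vote; the per-system inference calls for Emotion2vec and SenseVoice are omitted, and the tie-breaking rule is an assumption:

```python
from collections import Counter

EMOTIONS = {"happy", "angry", "sad", "neutral", "fearful", "surprised", "disgusted"}

def vote_emotion(predictions: list[str]) -> str:
    """Majority vote over per-system labels; fall back to 'neutral' on ties."""
    votes = Counter(p for p in predictions if p in EMOTIONS)
    if not votes:
        return "neutral"
    label, count = votes.most_common(1)[0]
    return label if count > 1 else "neutral"  # require at least two systems to agree
```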
To ensure the audio quality of processed data, we implement an automated quality assessment stage. To select data across different quality levels, we use timestamp-aligned speech as input and extract metrics such as duration and Signal-to-Noise Ratio (SNR). These features are then used to compute a Word-level Virtual Mean Opinion Score (WVMOS), which serves as a proxy for perceptual audio quality. Low-quality audio samples are then discarded.
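A hedged sketch of this quality gate; the SNR estimate and all thresholds below are illustrative assumptions rather than the values used to build the corpus:

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """SNR in dB from speech samples and a noise estimate (e.g., VAD gaps)."""
    eps = 1e-10
    return 10.0 * np.log10((np.mean(speech**2) + eps) / (np.mean(noise**2) + eps))

def keep_clip(duration_s: float, snr: float, wvmos: float) -> bool:
    # Assumed cutoffs: duration within the VAD bounds, audible speech, fair MOS.
    return 5.0 <= duration_s <= 25.0 and snr >= 10.0 and wvmos >= 3.0
```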
We select the three models with the best performance on Sichuanese to perform multi-system labeling: SenseVoice, TeleASR, and FireRed-ASR. For each audio file, we obtain transcriptions from all three systems.
To enhance the accuracy of automatic speech recognition (ASR) transcriptions, and building upon prior research (GER, MMGER), we propose a robust transcription framework tailored to Sichuanese dialects. Our approach, LLM Generative Error Correction based ROVER (LLM-GER), merges outputs from multiple ASR systems into a single accurate and reliable transcription. First, three ASR systems (FireRed-ASR, SenseVoice-Small, and TeleASR) produce initial candidate transcriptions. These are then merged by Qwen3, which leverages its strong dialectal understanding and our carefully designed prompt to correct errors without altering the original semantics or token length. Finally, transcription confidence is calculated from the four transcriptions, as detailed in the figure below.
The prompt is presented in the figure below:
*(Figures: the LLM-GER prompt and the confidence-computation diagram; see the paper for details.)*
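A conceptual sketch of LLM-GER, assuming Qwen3 is served behind an OpenAI-compatible endpoint; the endpoint, model name, prompt wording, and the agreement-based confidence below are simplified stand-ins for what the paper describes:

```python
import jiwer
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

def llm_ger(hyps: list[str]) -> tuple[str, float]:
    """Fuse candidate transcriptions with an LLM and score their agreement."""
    prompt = (
        "The following are candidate Sichuanese ASR transcriptions of the same "
        "utterance. Merge them into one correct transcription without changing "
        "the meaning or the token length:\n" + "\n".join(hyps)
    )
    resp = client.chat.completions.create(
        model="qwen3",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    fused = resp.choices[0].message.content.strip()
    # Confidence proxy: average character agreement (1 - CER) between the
    # fused output and each candidate transcription.
    conf = sum(max(0.0, 1 - jiwer.cer(fused, h)) for h in hyps) / len(hyps)
    return fused, conf
```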
Please cite our paper if you find this work useful:
@misc{dai2025wenetspeechchuan,
title = {WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing},
author = {Yuhang Dai and Ziyu Zhang and Shuai Wang and Longhao Li and Zhao Guo and Tianlun Zuo and Shuiyuan Wang and Hongfei Xue and Chengyou Wang and Qing Wang and Xin Xu and Hui Bu and Jie Li and Jian Kang and Binbin Zhang and Lei Xie},
year = {2025},
eprint = {2509.18004},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2509.18004}
}
If you are interested in leaving a message to our research team, feel free to email yhdai@mail.nwpu.edu.cn or ziyu_zhang@mail.nwpu.edu.cn.
You're also welcome to join our WeChat group for technical discussions, updates, and, as mentioned above, access to pre-processed audio data.











