Neyshekar is an open, community-driven Persian speech dataset collected via a web-based crowdsourcing platform at https://ney.shekar.io. It is designed to support research and development in text-to-speech (TTS), automatic speech recognition (ASR), speech representation learning, and other downstream Persian speech applications.
The recordings are provided by a combination of volunteer contributors and paid voice actors, all of whom are native Persian speakers.
Neyshekar is released incrementally. Each release is a stable snapshot of the dataset at the time of publication, enabling reproducible research and consistent benchmarking.
v2 — 2026-01-15 (download)
- Total samples: 20,020
- Total duration: 29.08 hours
- Average clip duration: 5.23 seconds
- Total tokens: 208,472
- Vocabulary size: 20,853
- Total samples: 10,044
- Total duration: 14.42 hours
- Average clip duration: 5.17 seconds
- Total tokens: 103,757
- Vocabulary size: 15,224
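
The average clip durations above follow directly from the totals: total duration in seconds divided by the number of samples. The following is a minimal sketch that recomputes them from the published figures. The key `second_block` and the helper `average_clip_seconds` are names chosen here for illustration only (the second block of statistics carries no release label above), and the tokens-per-clip figure is a derived value, not one reported by the dataset.

```python
# Published summary statistics, copied from the two lists above.
# "second_block" is a placeholder key: that block has no release label here.
releases = {
    "v2": {"samples": 20_020, "hours": 29.08, "tokens": 208_472},
    "second_block": {"samples": 10_044, "hours": 14.42, "tokens": 103_757},
}


def average_clip_seconds(samples: int, hours: float) -> float:
    """Mean clip length in seconds: total seconds divided by clip count."""
    return hours * 3600 / samples


for name, stats in releases.items():
    avg = average_clip_seconds(stats["samples"], stats["hours"])
    # Derived for illustration; the dataset itself does not report this value.
    tokens_per_clip = stats["tokens"] / stats["samples"]
    print(f"{name}: {avg:.2f} s per clip, {tokens_per_clip:.2f} tokens per clip")
```

Rounded to two decimals, the recomputed averages are 5.23 and 5.17 seconds, which agree with the listed values.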
This dataset is released under the CC0 1.0 Universal license.
It may be used, modified, and redistributed for any purpose without restriction.