Arabic Text-to-Speech remains underdeveloped compared to English, largely because of scarce high-quality open datasets and the critical role of diacritics in determining meaning. This repository supports the paper “A Comprehensive Study of Neural Models for Arabic Text-to-Speech Synthesis”, providing full implementations, training pipelines, evaluation tools, and inference notebooks for all models studied.
The work focuses exclusively on publicly available datasets, using the Arabic Speech Corpus and ClArTTS. Four architectures are investigated:
- FastSpeech2 – Non-autoregressive, stable, and fast.
- FastPitch – Duration-predictive model with pitch-aware control.
- Mixer-TTS – Lightweight hybrid architecture delivering strong quality with limited data.
- Spark-TTS – End-to-end LLM-based TTS with zero-shot voice cloning and prosody control.
All spectrogram-based models use HiFi-GAN as the neural vocoder.
Our experiments reveal clear trade-offs: FastPitch and Mixer-TTS produce high-quality speech even under data constraints, while Spark-TTS delivers noticeably better audio fidelity and naturalness at the cost of heavier computation and slower inference. The repository aims to make these models reproducible, comparable, and accessible for anyone working on Arabic TTS research or development.
This work uses two publicly available Arabic TTS datasets:
- Arabic Speech Corpus (ASC) – 1 speaker, 4.1 hours of South Levantine (Damascian) speech, professionally recorded. Includes phoneme-level annotations generated with Halabi’s phonemizer.
- Classical Arabic Text-to-Speech (ClArTTS) – 1 male speaker, 12 hours of Classical Arabic speech from a LibriVox audiobook, manually segmented and annotated.
Preprocessing highlights:
- Non-diacritized transcripts converted to phoneme sequences for accurate pronunciation.
- Audio resampled to 22.05 kHz for uniformity across models (a minimal resampling sketch follows below).
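The resampling step can be reproduced with a short script. The snippet below is a minimal sketch, assuming a flat folder of WAV files and using `librosa`/`soundfile`; the folder names are placeholders rather than the repository's actual preprocessing layout.

```python
# Minimal preprocessing sketch (assumed file layout, not the repository's own
# scripts): resample every WAV to 22.05 kHz so all models share one audio config.
from pathlib import Path

import librosa
import soundfile as sf

TARGET_SR = 22050                 # sampling rate used by all models in this study
RAW_DIR = Path("data/raw")        # placeholder input folder
OUT_DIR = Path("data/resampled")  # placeholder output folder
OUT_DIR.mkdir(parents=True, exist_ok=True)

for wav_path in sorted(RAW_DIR.glob("*.wav")):
    # librosa resamples on load when an explicit sr is given
    audio, _ = librosa.load(wav_path, sr=TARGET_SR, mono=True)
    sf.write(OUT_DIR / wav_path.name, audio, TARGET_SR)
```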
| Dataset | Language | Speakers | Duration (hrs) | Open-source |
|---|---|---|---|---|
| ASC | Arabic | 1 | 4.1 | ✓ |
| ClArTTS | Arabic | 1 | 12 | ✓ |
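For training, each utterance is typically paired with its transcript and duration. The snippet below is a hypothetical sketch of a NeMo-style JSON-lines manifest; the field names and file paths are illustrative assumptions, not the repository's exact format.

```python
# Hypothetical example of building a NeMo-style JSON-lines manifest entry;
# audio_filepath / text / duration follow a common TTS manifest convention,
# but the repository's actual training pipelines may use a different schema.
import json
from pathlib import Path

entries = [
    {
        "audio_filepath": "data/resampled/clartts_0001.wav",  # placeholder path
        "text": "السَّلامُ عَلَيْكُمْ",                        # diacritized transcript
        "duration": 2.4,                                       # seconds
    },
]

Path("manifests").mkdir(exist_ok=True)
with open("manifests/train.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```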
This repository includes four TTS architectures:
| Model | Description | Details | Paper |
|---|---|---|---|
| FastSpeech2 | Transformer-based, non-autoregressive spectrogram prediction with variance adaptor for duration, pitch, and energy. | FastSpeech2 | FastSpeech2 Paper |
| FastPitch | Duration-predictive model with explicit pitch conditioning for expressive speech. | FastPitch | FastPitch Paper |
| Mixer-TTS | Lightweight MLP-Mixer based spectrogram predictor with CTC alignment, fully parallel and data-efficient. | MixerTTS | Mixer-TTS Paper |
| Spark-TTS | End-to-end LLM-based TTS, generates waveform directly with semantic and global tokens. Supports zero-shot voice cloning. | SparkTTS | Spark-TTS Paper |
All spectrogram-based models use HiFi-GAN for vocoding.
You can test each model directly using the provided Colab notebooks:
| Model | Colab Notebook | Audio Examples |
|---|---|---|
| FastSpeech2 | Open in Colab | Audio Samples |
| FastPitch | Open in Colab | Audio Samples |
| Mixer-TTS | Open in Colab | Audio Samples |
| Spark-TTS | Open in Colab | Audio Samples |
Each notebook includes a full inference pipeline: load model, input text (with diacritics), generate mel-spectrogram or waveform, and play the resulting audio. The linked folder contains example outputs for all models to compare audio quality directly.
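For reference, the sketch below illustrates the spectrogram-model-plus-vocoder path using the public NeMo API for FastPitch and HiFi-GAN; the checkpoint paths are placeholders, and the notebooks themselves may load and run the models differently.

```python
# Sketch of the mel-spectrogram + HiFi-GAN inference path, assuming NeMo-style
# FastPitch and HiFi-GAN checkpoints; the .nemo paths are placeholders, not
# files shipped with this repository.
import soundfile as sf
import torch
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_model = FastPitchModel.restore_from("checkpoints/fastpitch_arabic.nemo").eval()
vocoder = HifiGanModel.restore_from("checkpoints/hifigan_arabic.nemo").eval()

text = "السَّلامُ عَلَيْكُمْ"  # fully diacritized Arabic input text

with torch.no_grad():
    tokens = spec_model.parse(text)                           # text -> token IDs
    spec = spec_model.generate_spectrogram(tokens=tokens)     # tokens -> mel-spectrogram
    audio = vocoder.convert_spectrogram_to_audio(spec=spec)   # mel -> waveform

sf.write("sample.wav", audio.squeeze().cpu().numpy(), 22050)
```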
We evaluated all models both objectively and subjectively:
- Objective Metrics: Training and validation losses were monitored to measure convergence and reconstruction quality.
- Subjective Metrics: Mean Opinion Score (MOS) tests were conducted to assess perceptual naturalness and clarity.
A large-scale MOS study was conducted with 200 native Arabic speakers. Each model was evaluated using 20 randomly selected utterances, and the presentation order was randomized to avoid bias. Participants rated each sample on a 5-point scale (1 = Bad, 5 = Excellent).
| Model | MOS ↑ |
|---|---|
| Spark-TTS | 4.57 ± 0.42 |
| FastPitch | 3.57 ± 0.17 |
| Mixer-TTS | 3.51 ± 0.22 |
| FastSpeech2 | 3.11 ± 0.34 |
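For context on how such scores are aggregated, the sketch below computes a mean opinion score with a spread term from a list of listener ratings; the ratings shown are hypothetical, and since the table above does not restate whether its ± term is a standard deviation or a confidence interval, both are computed.

```python
# Minimal MOS aggregation sketch for one model, assuming a flat list of 1-5
# listener ratings (hypothetical values below).
import math
import statistics

ratings = [5, 4, 5, 3, 4, 5, 4, 4]  # hypothetical listener scores for one model

mean = statistics.mean(ratings)
stdev = statistics.stdev(ratings)              # sample standard deviation
ci95 = 1.96 * stdev / math.sqrt(len(ratings))  # normal-approximation 95% CI

print(f"MOS = {mean:.2f} ± {stdev:.2f} (std), ± {ci95:.2f} (95% CI)")
```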
- FastPitch: High-quality speech on both datasets; particularly effective on ClArTTS with small batch sizes.
- FastSpeech2: Easy to train and fast, but lower naturalness due to limited data and lack of pretrained weights.
- Mixer-TTS: Stable training and reasonable quality, slightly lower MOS than FastPitch.
- Spark-TTS: Best overall performance, benefiting from pretrained weights, zero-shot voice cloning, and prosody control.
- HiFi-GAN Vocoder: Best performance when conditioned on ground-truth mel-spectrograms, particularly when trained on ClArTTS.
These results highlight the trade-offs between model complexity, training requirements, and audio quality for open-source Arabic TTS systems.
@INPROCEEDINGS{11231929,
author={Ashraf, Abdelhalim and Fathi, Abdelrahman Ramadan and Ahmed, Ali Adel Sayed and Hassanin, Omar Tamer Mohammed Ameen and Hifny, Yasser and Yehia, Eid Osama Eid El Sayed},
booktitle={2025 7th Novel Intelligent and Leading Emerging Sciences Conference (NILES)},
title={A Comprehensive Study of Neural Models for Arabic Text-to-Speech Synthesis},
year={2025},
pages={153-157},
keywords={Deep learning;Costs;Vocoders;Semantics;Cloning;Text to speech;Mixers},
doi={10.1109/NILES68063.2025.11231929}
}