Arabic Text-to-Speech remains underdeveloped compared to English, largely because of scarce high-quality open datasets and the critical role of diacritics in determining meaning. This repository supports the paper “A Comprehensive Study of Neural Models for Arabic Text-to-Speech Synthesis”, providing full implementations, training pipelines, evaluation tools, and inference notebooks for all models studied.
The work focuses exclusively on publicly available datasets, using the Arabic Speech Corpus and ClArTTS. Four architectures are investigated:
- FastSpeech2 – Non-autoregressive, stable, and fast.
- FastPitch – Duration-predictive model with pitch-aware control.
- Mixer-TTS – Lightweight hybrid architecture delivering strong quality with limited data.
- Spark-TTS – End-to-end LLM-based TTS with zero-shot voice cloning and prosody control.
All spectrogram-based models use HiFi-GAN as the neural vocoder.
Our experiments reveal clear trade-offs: FastPitch and Mixer-TTS produce high-quality speech even under data constraints, while Spark-TTS delivers noticeably better audio fidelity and naturalness at the cost of heavier computation and slower inference. The repository aims to make these models reproducible, comparable, and accessible for anyone working on Arabic TTS research or development.
This work uses two publicly available Arabic TTS datasets:
- Arabic Speech Corpus (ASC) – 1 speaker, 4.1 hours of South Levantine (Damascian) speech, professionally recorded. Includes phoneme-level annotations generated with Halabi’s phonemizer.
- Classical Arabic Text-to-Speech (ClArTTS) – 1 male speaker, 12 hours of Classical Arabic speech from a LibriVox audiobook, manually segmented and annotated.
Preprocessing highlights:
- Non-diacritized transcripts converted to phoneme sequences for accurate pronunciation.
- Audio resampled to 22.05 kHz for uniformity across models (a minimal resampling sketch follows below).
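The resampling step can be reproduced with a short script. The snippet below is a minimal sketch, assuming a flat folder of WAV files and using `librosa`/`soundfile`; the folder names are placeholders rather than the repository's actual preprocessing layout.

```python
# Minimal preprocessing sketch (assumed file layout, not the repository's own
# scripts): resample every WAV to 22.05 kHz so all models share one audio config.
from pathlib import Path

import librosa
import soundfile as sf

TARGET_SR = 22050                 # sampling rate used by all models in this study
RAW_DIR = Path("data/raw")        # placeholder input folder
OUT_DIR = Path("data/resampled")  # placeholder output folder
OUT_DIR.mkdir(parents=True, exist_ok=True)

for wav_path in sorted(RAW_DIR.glob("*.wav")):
    # librosa resamples on load when an explicit sr is given
    audio, _ = librosa.load(wav_path, sr=TARGET_SR, mono=True)
    sf.write(OUT_DIR / wav_path.name, audio, TARGET_SR)
```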
| Dataset | Language | Speakers | Duration (hrs) | Open-source |
|---|---|---|---|---|
| ASC | Arabic | 1 | 4.1 | ✓ |
| ClArTTS | Arabic | 1 | 12 | ✓ |
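For training, each utterance is typically paired with its transcript and duration. The snippet below is a hypothetical sketch of a NeMo-style JSON-lines manifest; the field names and file paths are illustrative assumptions, not the repository's exact format.

```python
# Hypothetical example of building a NeMo-style JSON-lines manifest entry;
# audio_filepath / text / duration follow a common TTS manifest convention,
# but the repository's actual training pipelines may use a different schema.
import json
from pathlib import Path

entries = [
    {
        "audio_filepath": "data/resampled/clartts_0001.wav",  # placeholder path
        "text": "السَّلامُ عَلَيْكُمْ",                        # diacritized transcript
        "duration": 2.4,                                       # seconds
    },
]

Path("manifests").mkdir(exist_ok=True)
with open("manifests/train.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```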
This repository includes four TTS architectures:
| Model | Description | Details | Paper |
|---|---|---|---|
| FastSpeech2 | Transformer-based, non-autoregressive spectrogram prediction with variance adaptor for duration, pitch, and energy. | FastSpeech2 | FastSpeech2 Paper |
| FastPitch | Duration-predictive model with explicit pitch conditioning for expressive speech. | FastPitch | FastPitch Paper |
| Mixer-TTS | Lightweight MLP-Mixer based spectrogram predictor with CTC alignment, fully parallel and data-efficient. | MixerTTS | Mixer-TTS Paper |
| Spark-TTS | End-to-end LLM-based TTS, generates waveform directly with semantic and global tokens. Supports zero-shot voice cloning. | SparkTTS | Spark-TTS Paper |
All spectrogram-based models use HiFi-GAN for vocoding.
You can test each model directly using the provided Colab notebooks:
| Model | Colab Notebook | Audio Examples |
|---|---|---|
| FastSpeech2 | Open in Colab | Audio Samples |
| FastPitch | Open in Colab | Audio Samples |
| Mixer-TTS | Open in Colab | Audio Samples |
| Spark-TTS | Open in Colab | Audio Samples |
Each notebook includes a full inference pipeline: load model, input text (with diacritics), generate mel-spectrogram or waveform, and play the resulting audio. The linked folder contains example outputs for all models to compare audio quality directly.
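For reference, the sketch below illustrates the spectrogram-model-plus-vocoder path using the public NeMo API for FastPitch and HiFi-GAN; the checkpoint paths are placeholders, and the notebooks themselves may load and run the models differently.

```python
# Sketch of the mel-spectrogram + HiFi-GAN inference path, assuming NeMo-style
# FastPitch and HiFi-GAN checkpoints; the .nemo paths are placeholders, not
# files shipped with this repository.
import soundfile as sf
import torch
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_model = FastPitchModel.restore_from("checkpoints/fastpitch_arabic.nemo").eval()
vocoder = HifiGanModel.restore_from("checkpoints/hifigan_arabic.nemo").eval()

text = "السَّلامُ عَلَيْكُمْ"  # fully diacritized Arabic input text

with torch.no_grad():
    tokens = spec_model.parse(text)                           # text -> token IDs
    spec = spec_model.generate_spectrogram(tokens=tokens)     # tokens -> mel-spectrogram
    audio = vocoder.convert_spectrogram_to_audio(spec=spec)   # mel -> waveform

sf.write("sample.wav", audio.squeeze().cpu().numpy(), 22050)
```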
We evaluated all models both objectively and subjectively:
- Objective Metrics: Training and validation losses were monitored to measure convergence and reconstruction quality.
- Subjective Metrics: Mean Opinion Score (MOS) tests were conducted to assess perceptual naturalness and clarity.
A large-scale MOS study was conducted with 200 native Arabic speakers. Each model was evaluated using 20 randomly selected utterances, and the presentation order was randomized to avoid bias. Participants rated each sample on a 5-point scale (1 = Bad, 5 = Excellent).
| Model | MOS ↑ |
|---|---|
| Spark-TTS | 4.57 ± 0.42 |
| FastPitch | 3.57 ± 0.17 |
| Mixer-TTS | 3.51 ± 0.22 |
| FastSpeech2 | 3.11 ± 0.34 |
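For context on how such scores are aggregated, the sketch below computes a mean opinion score with a spread term from a list of listener ratings; the ratings shown are hypothetical, and since the table above does not restate whether its ± term is a standard deviation or a confidence interval, both are computed.

```python
# Minimal MOS aggregation sketch for one model, assuming a flat list of 1-5
# listener ratings (hypothetical values below).
import math
import statistics

ratings = [5, 4, 5, 3, 4, 5, 4, 4]  # hypothetical listener scores for one model

mean = statistics.mean(ratings)
stdev = statistics.stdev(ratings)              # sample standard deviation
ci95 = 1.96 * stdev / math.sqrt(len(ratings))  # normal-approximation 95% CI

print(f"MOS = {mean:.2f} ± {stdev:.2f} (std), ± {ci95:.2f} (95% CI)")
```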
- FastPitch: High-quality speech on both datasets; particularly effective on ClArTTS with small batch sizes.
- FastSpeech2: Easy to train and fast, but lower naturalness due to limited data and lack of pretrained weights.
- Mixer-TTS: Stable training and reasonable quality, slightly lower MOS than FastPitch.
- Spark-TTS: Best overall performance, benefiting from pretrained weights, zero-shot voice cloning, and prosody control.
- HiFi-GAN Vocoder: Best performance when conditioned on ground-truth mel-spectrograms, particularly when trained on ClArTTS.
These results highlight the trade-offs between model complexity, training requirements, and audio quality for open-source Arabic TTS systems.
@INPROCEEDINGS{11231929,
author={Ashraf, Abdelhalim and Fathi, Abdelrahman Ramadan and Ahmed, Ali Adel Sayed and Hassanin, Omar Tamer Mohammed Ameen and Hifny, Yasser and Yehia, Eid Osama Eid El Sayed},
booktitle={2025 7th Novel Intelligent and Leading Emerging Sciences Conference (NILES)},
title={A Comprehensive Study of Neural Models for Arabic Text-to-Speech Synthesis},
year={2025},
pages={153-157},
keywords={Deep learning;Costs;Vocoders;Semantics;Cloning;Text to speech;Mixers},
doi={10.1109/NILES68063.2025.11231929}
}