
A Comprehensive Study of Neural Models for Arabic Text-to-Speech Synthesis

Problem Overview

Arabic Text-to-Speech remains underdeveloped compared to English, largely because of scarce high-quality open datasets and the critical role of diacritics in determining meaning. This repository supports the paper “A Comprehensive Study of Neural Models for Arabic Text-to-Speech Synthesis”, providing full implementations, training pipelines, evaluation tools, and inference notebooks for all models studied.

The work focuses exclusively on publicly available datasets, using the Arabic Speech Corpus and ClArTTS. Four architectures are investigated:

  • FastSpeech2 – Non-autoregressive, stable, and fast.
  • FastPitch – Duration-predictive model with pitch-aware control.
  • Mixer-TTS – Lightweight hybrid architecture delivering strong quality with limited data.
  • Spark-TTS – End-to-end LLM-based TTS with zero-shot voice cloning and prosody control.

All spectrogram-based models use HiFi-GAN as the neural vocoder.

Our experiments reveal clear trade-offs: FastPitch and Mixer-TTS produce high-quality speech even under data constraints, while Spark-TTS delivers noticeably superior audio fidelity and naturalness at the cost of heavier computation and slower inference. The repository aims to make these models reproducible, comparable, and accessible for anyone working on Arabic TTS research or development.


Datasets

This work uses two publicly available Arabic TTS datasets:

  • Arabic Speech Corpus (ASC) – 1 speaker, 4.1 hours of professionally recorded South Levantine (Damascene) speech. Includes phoneme-level annotations generated with Halabi’s phonemizer.
  • Classical Arabic Text-to-Speech (ClArTTS) – 1 male speaker, 12 hours of Classical Arabic from a LibriVox audiobook, manually segmented and annotated.

Preprocessing highlights:

  • Non-diacritized transcripts converted to phoneme sequences for accurate pronunciation.
  • Audio resampled to 22.05 kHz for uniformity across models.
| Dataset | Language | Speakers | Duration (hrs) | Open-source |
|---------|----------|----------|----------------|-------------|
| ASC     | Arabic   | 1        | 4.1            | ✓           |
| ClArTTS | Arabic   | 1        | 12             | ✓           |
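
A minimal sketch of the resampling step, assuming librosa and soundfile and hypothetical dataset paths; the repository’s own preprocessing scripts may differ in detail:

```python
from pathlib import Path

import librosa
import soundfile as sf

SRC_DIR = Path("data/clartts/wavs")       # hypothetical input folder
DST_DIR = Path("data/clartts/wavs_22k")   # hypothetical output folder
TARGET_SR = 22050                         # uniform sample rate across all models

DST_DIR.mkdir(parents=True, exist_ok=True)

for wav_path in SRC_DIR.glob("*.wav"):
    # librosa resamples on load when an explicit sr is given
    audio, _ = librosa.load(wav_path, sr=TARGET_SR)
    sf.write(DST_DIR / wav_path.name, audio, TARGET_SR)
```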

Models

This repository includes four TTS architectures:

| Model | Description | Details | Paper |
|-------|-------------|---------|-------|
| FastSpeech2 | Transformer-based, non-autoregressive spectrogram prediction with a variance adaptor for duration, pitch, and energy. | FastSpeech2 | FastSpeech2 Paper |
| FastPitch | Duration-predictive model with explicit pitch conditioning for expressive speech. | FastPitch | FastPitch Paper |
| Mixer-TTS | Lightweight MLP-Mixer-based spectrogram predictor with CTC alignment, fully parallel and data-efficient. | MixerTTS | Mixer-TTS Paper |
| Spark-TTS | End-to-end LLM-based TTS that generates the waveform directly from semantic and global tokens; supports zero-shot voice cloning. | SparkTTS | Spark-TTS Paper |

All spectrogram-based models use HiFi-GAN for vocoding.
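
To illustrate the two-stage design, the sketch below chains a FastPitch acoustic model and a HiFi-GAN vocoder using the NVIDIA NeMo implementations; the checkpoint paths and output file name are hypothetical, and the Colab notebooks in the next section show the exact steps used in this work:

```python
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# hypothetical Arabic checkpoints trained as described in the paper
spec_gen = FastPitchModel.restore_from("checkpoints/fastpitch_arabic.nemo").eval()
vocoder = HifiGanModel.restore_from("checkpoints/hifigan_arabic.nemo").eval()

text = "مَرْحَبًا بِكُمْ"                                    # diacritized Arabic input
tokens = spec_gen.parse(text)                              # text -> token ids
spectrogram = spec_gen.generate_spectrogram(tokens=tokens) # tokens -> mel-spectrogram
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)  # mel -> waveform

sf.write("sample.wav", audio.squeeze().detach().cpu().numpy(), 22050)
```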


Test

You can test each model directly using the provided Colab notebooks:

| Model | Colab Notebook | Audio Examples |
|-------|----------------|----------------|
| FastSpeech2 | Open in Colab | Audio Samples |
| FastPitch | Open in Colab | Audio Samples |
| Mixer-TTS | Open in Colab | Audio Samples |
| Spark-TTS | Open in Colab | Audio Samples |

Each notebook provides a full inference pipeline: load the model, enter input text (with diacritics), generate a mel-spectrogram or waveform, and play the resulting audio. The linked folder contains example outputs from all models so audio quality can be compared directly.
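
For instance, the final playback step of a notebook might look like the following (the file name is hypothetical; the sample rate is read from the saved file):

```python
import soundfile as sf
from IPython.display import Audio, display

audio, sr = sf.read("sample.wav")   # waveform produced by the synthesis pipeline
display(Audio(audio, rate=sr))      # inline audio player in Colab
```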


Results & Comparison

We evaluated all models both objectively and subjectively:

  • Objective Metrics: Training and validation losses were monitored to measure convergence and reconstruction quality.
  • Subjective Metrics: Mean Opinion Score (MOS) tests were conducted to assess perceptual naturalness and clarity.

Subjective Evaluation (MOS)

A large-scale MOS study was conducted with 200 native Arabic speakers. Each model was evaluated using 20 randomly selected utterances, and the presentation order was randomized to avoid bias. Participants rated each sample on a 5-point scale (1 = Bad, 5 = Excellent).
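
One plausible way to aggregate such ratings into MOS ± deviation figures, shown here with placeholder data rather than the study’s actual ratings:

```python
import numpy as np

# one 1-5 rating per (listener, utterance) pair for a single model
# placeholder data: 200 listeners x 20 utterances, NOT the study's ratings
scores = np.random.randint(1, 6, size=(200, 20))

mos = scores.mean()                  # overall mean opinion score
spread = scores.mean(axis=0).std()   # spread of per-utterance mean scores
print(f"MOS = {mos:.2f} ± {spread:.2f}")
```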

MOS Comparison Across Models (ClArTTS)

| Model | MOS ↑ |
|-------|-------|
| Spark-TTS | 4.57 ± 0.42 |
| FastPitch | 3.57 ± 0.17 |
| Mixer-TTS | 3.51 ± 0.22 |
| FastSpeech2 | 3.11 ± 0.34 |

Summary of Findings

  • FastPitch: High-quality speech on both datasets; particularly effective on ClArTTS with small batch sizes.
  • FastSpeech2: Easy to train and fast, but lower naturalness due to limited data and lack of pretrained weights.
  • Mixer-TTS: Stable training and reasonable quality, slightly lower MOS than FastPitch.
  • Spark-TTS: Best overall performance, benefiting from pretrained weights, zero-shot voice cloning, and prosody control.
  • HiFi-GAN Vocoder: Best performance when conditioned on ground-truth mel-spectrograms, particularly when trained on ClArTTS.

These results highlight the trade-offs between model complexity, training requirements, and audio quality for open-source Arabic TTS systems.


Citation

@INPROCEEDINGS{11231929,
  author={Ashraf, Abdelhalim and Fathi, Abdelrahman Ramadan and Ahmed, Ali Adel Sayed and Hassanin, Omar Tamer Mohammed Ameen and Hifny, Yasser and Yehia, Eid Osama Eid El Sayed},
  booktitle={2025 7th Novel Intelligent and Leading Emerging Sciences Conference (NILES)}, 
  title={A Comprehensive Study of Neural Models for Arabic Text-to-Speech Synthesis}, 
  year={2025},
  pages={153-157},
  keywords={Deep learning;Costs;Vocoders;Semantics;Cloning;Text to speech;Mixers},
  doi={10.1109/NILES68063.2025.11231929}
}
