Text-to-Speech Sample

Interactive interface powered by the Kokoro Text-To-Speech model running locally in Unity using Sentis.

Runtime Inference

This sample runs the Kokoro-82M-v1.0-ONNX model, an 82-million-parameter text-to-speech model, for inference at runtime.

The system processes text inputs through:

Text tokenization and grapheme-to-phoneme conversion using our C# implementation of Misaki for English
Neural voice synthesis using the Kokoro ONNX model
Real-time audio generation with multiple voice options
Configurable speech speed and voice selection

Together, these steps turn typed text into natural-sounding speech, with all inference running locally.
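The hand-off between phoneme conversion and the synthesis model can be sketched as follows. This is a minimal, language-agnostic illustration in Python, not the sample's C# code. It assumes the model consumes integer token IDs wrapped in a padding ID of 0 at both ends and accepts at most 510 phoneme tokens per call, as in the usage example published with the Kokoro ONNX export; verify both against the model you download. The VOCAB table and the encode function are hypothetical names used only for this example.

```python
PAD_ID = 0          # assumed padding token, placed at both ends of the sequence
MAX_PHONEMES = 510  # assumed limit: a 512-token context minus the two pad tokens

# Hypothetical symbol table; the real mapping ships with the model configuration.
VOCAB = {" ": 1, "k": 2, "æ": 3, "t": 4, "ˈ": 5}


def encode(phonemes, vocab=VOCAB):
    """Map a phoneme string to padded token IDs, skipping unknown symbols."""
    ids = [vocab[ch] for ch in phonemes if ch in vocab]
    if len(ids) > MAX_PHONEMES:
        raise ValueError(f"{len(ids)} phoneme tokens exceed the {MAX_PHONEMES} limit")
    return [PAD_ID, *ids, PAD_ID]


token_ids = encode("kˈæt")
```

Long inputs would need to be split into chunks before encoding; where to split (sentence boundaries, punctuation) is a design choice this sketch leaves open.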

Features

Multiple Voices: Choose from various pre-trained voice styles
Speed Control: Adjustable speech rate for different use cases
Real-time Generation: Fast GPU-accelerated inference using Sentis
Editor Integration: Available as an Editor window for development and testing
Cross-Platform: Supports all Unity-supported platforms thanks to a pure C# implementation
Model Management: Automated model downloading and setup
Phoneme Processing: Automatic grapheme-to-phoneme conversion using dictionary-based lookup with comprehensive English lexicon

Getting Started

Open the Unity project
Download models by navigating to Sentis > Sample > Text-To-Speech > Download Models in the menu
Navigate to Sentis > Sample > Text-To-Speech > Start Kokoro in the menu
Enter text and generate speech with your chosen voice!

Alternatively, you can use the runtime scene at TextToSpeechSample/Assets/Scenes/App.unity, but make sure to download the models beforehand using the editor menu.

The Text-To-Speech interface supports:

Multi-line text input
Voice selection from available models
Real-time speech generation
Audio playback controls

Grapheme-to-Phoneme Conversion

This sample features a complete C# implementation of dictionary-based grapheme-to-phoneme conversion, inspired by the Misaki project. The system includes:

Comprehensive English Lexicon: Uses gold and silver pronunciation dictionaries with over 130,000 word entries
Intelligent Text Processing: Handles contractions, possessives, punctuation, and Unicode characters
Morphological Analysis: Automatic handling of word variants with suffixes (-s, -ed, -ing)
Context-Aware Pronunciation: Adjusts pronunciation based on surrounding context (e.g., "the" before vowels vs consonants)
No External Dependencies: Self-contained implementation requiring no external tools or processes

The conversion pipeline tokenizes the input text, looks each token up in the pronunciation dictionaries, and applies fallback mechanisms when no entry is found, producing the phoneme sequence that the Kokoro Text-To-Speech model expects.
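The lookup order described above (gold dictionary, then silver, then suffix handling, plus a context rule for "the") can be illustrated with a small sketch. It is written in Python for brevity and is not the sample's C# implementation. The lexicon entries, the phoneme strings, the simplified suffix rules, and every function name are assumptions made for illustration; the real dictionaries and rules are far more complete.

```python
# Toy dictionaries: entries and phoneme strings are illustrative only.
GOLD = {"cat": "kˈæt", "walk": "wˈɔk", "apple": "ˈæpəl"}
SILVER = {"kokoro": "kəkˈɔɹoʊ"}

VOWELS = set("aeiouæɑɔəɛɪʊʌ")
STRESS = "ˈˌ"


def lookup(word):
    """Exact dictionary hit, preferring the gold tier over the silver tier."""
    return GOLD.get(word) or SILVER.get(word)


def _plural(stem):
    # Voiceless final consonant takes "s", anything else takes "z".
    return stem + ("s" if stem[-1] in "ptkfθ" else "z")


def _past(stem):
    if stem[-1] in "td":
        return stem + "ɪd"
    return stem + ("t" if stem[-1] in "pkfθsʃ" else "d")


def _gerund(stem):
    return stem + "ɪŋ"


SUFFIX_RULES = (("ing", _gerund), ("ed", _past), ("s", _plural))


def phonemize_word(word):
    """Dictionary lookup with a suffix-stripping fallback; None if unknown."""
    word = word.lower()
    hit = lookup(word)
    if hit is not None:
        return hit
    for suffix, attach in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = lookup(word[: -len(suffix)])
            if stem is not None:
                return attach(stem)
    return None


def phonemize(text):
    """Phonemize whitespace-separated words; "the" depends on the next word."""
    words = text.split()
    phones = [phonemize_word(w) for w in words]
    out = []
    for i, (word, ph) in enumerate(zip(words, phones)):
        if word.lower() == "the":
            nxt = phones[i + 1] if i + 1 < len(phones) else None
            before_vowel = bool(nxt) and nxt.lstrip(STRESS)[:1] in VOWELS
            ph = "ði" if before_vowel else "ðə"
        out.append(ph if ph is not None else "?")  # "?" marks an unknown word
    return " ".join(out)
```

In this sketch an unknown word is emitted as "?"; a production system would instead hand it to a further fallback, such as letter-by-letter spelling or a learned model.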

Technical Implementation

The sample demonstrates:

Integration of Text-To-Speech models in Unity
Asynchronous model inference with Kokoro
AppUI for modern editor interfaces
State management using Redux patterns
Model scheduling and resource management
Cross-platform audio generation
Advanced text-to-phoneme processing with comprehensive English language support
