Commit 6b45356

Turbo model plus other enhancements
Updated README to reflect new features and improvements in Chatterbox TTS Server, including Chatterbox-Turbo support, hot-swappable engines, and paralinguistic tags.
1 parent 69b1bfc commit 6b45356

1 file changed

+69
-3
lines changed

README.md

Lines changed: 69 additions & 3 deletions
@@ -1,12 +1,12 @@
11
# Chatterbox TTS Server: OpenAI-Compatible API with Web UI, Large Text Handling & Built-in Voices
22

3-
**Self-host the powerful [Chatterbox TTS model](https://github.com/resemble-ai/chatterbox) with this enhanced FastAPI server! Features an intuitive Web UI, a flexible API endpoint, voice cloning, large text processing via intelligent chunking, audiobook generation, and consistent, reproducible voices using built-in ready-to-use voices and a generation seed feature.**
3+
**Self-host Resemble AI's [Chatterbox](https://github.com/resemble-ai/chatterbox) open-source TTS family (Original + Chatterbox‑Turbo) behind an OpenAI‑compatible API and a modern Web UI. Chatterbox‑Turbo is a streamlined 350M-parameter model with dramatically improved throughput and native paralinguistic tags like `[laugh]`, `[cough]`, and `[chuckle]` for more expressive voice agents and narration. Features voice cloning, large text processing via intelligent chunking, audiobook generation, and consistent, reproducible voices using built-in ready-to-use voices and a generation seed feature.**
44

55
> 🚀 **Try it now!** Test the full TTS server with voice cloning and audiobook generation in Google Colab - no installation required!
66
>
77
> [![Open Live Demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/devnen/Chatterbox-TTS-Server/blob/main/Chatterbox_TTS_Colab_Demo.ipynb)
88
9-
This server is based on the architecture and UI of our [Dia-TTS-Server](https://github.com/devnen/Dia-TTS-Server) project but uses the distinct `chatterbox-tts` engine. Runs accelerated on NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (MPS) GPUs, with a fallback to CPU.
9+
This server is based on the architecture and UI of our [Dia-TTS-Server](https://github.com/devnen/Dia-TTS-Server) project but uses the distinct `chatterbox-tts` engine. Runs accelerated on NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (MPS) GPUs, with a fallback to CPU. Also check out our [Kitten-TTS-Server](https://github.com/devnen/Kitten-TTS-Server) project.
1010

1111
[![Project Link](https://img.shields.io/badge/GitHub-devnen/Chatterbox--TTS--Server-blue?style=for-the-badge&logo=github)](https://github.com/devnen/Chatterbox-TTS-Server)
1212
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](LICENSE)
@@ -28,6 +28,45 @@ This server is based on the architecture and UI of our [Dia-TTS-Server](https://
2828

2929
---
3030

31+
## 🆕 What's New
32+
33+
### ⚡ Chatterbox‑Turbo support (new)
34+
35+
- Added full support for **Chatterbox‑Turbo**, Resemble AI's latest efficiency-focused Chatterbox model.
36+
- Turbo is built on a **streamlined 350M‑parameter architecture**, designed to use less compute/VRAM while keeping high-fidelity output.
37+
- Turbo distills the speech-token-to-mel "audio diffusion decoder" from **10 steps → 1 step**, removing a major inference bottleneck.
38+
- Resemble positions Turbo for real-time/agent workflows and highlights significantly faster-than-real-time performance on GPU (performance varies by hardware/settings).
39+
40+
### 🔁 Hot‑swappable TTS engines (UI)
41+
42+
- Added a new **engine selector** dropdown at the top of the Web UI.
43+
- Instantly hot-swap between **Original Chatterbox** and **Chatterbox‑Turbo**; the backend auto-loads the selected engine.
44+
- All UI + API requests route through the active engine so you can A/B test quality vs latency without changing client code.
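
The hot-swap behaviour described above can be sketched as a small manager that keeps one engine loaded and routes every request through it. This is a minimal illustration only, not the server's actual implementation: `EngineManager`, `load_original`, `load_turbo`, and the engine names `"original"` and `"turbo"` are hypothetical, and the loaders return stand-in functions instead of real models.

```python
from typing import Callable, Dict, Optional

# A stand-in "engine" is any function that turns text into audio bytes.
Engine = Callable[[str], bytes]


class EngineManager:
    """Keeps at most one TTS engine loaded and routes requests to it."""

    def __init__(self, loaders: Dict[str, Callable[[], Engine]]):
        # Maps an engine name to a zero-argument function that loads it.
        self._loaders = loaders
        self._active_name: Optional[str] = None
        self._active_engine: Optional[Engine] = None

    @property
    def active_name(self) -> Optional[str]:
        return self._active_name

    def select(self, name: str) -> None:
        """Switch engines; loading happens only when the selection changes."""
        if name not in self._loaders:
            raise ValueError(f"unknown engine: {name!r}")
        if name == self._active_name:
            return
        # Release the old engine before loading so it can be freed first.
        self._active_engine = None
        self._active_name = None
        engine = self._loaders[name]()
        self._active_engine = engine
        self._active_name = name

    def synthesize(self, text: str) -> bytes:
        """Route a request through whichever engine is currently active."""
        if self._active_engine is None:
            raise RuntimeError("no engine selected")
        return self._active_engine(text)


# Hypothetical loaders: real ones would load model weights onto a device.
def load_original() -> Engine:
    return lambda text: b"original:" + text.encode("utf-8")


def load_turbo() -> Engine:
    return lambda text: b"turbo:" + text.encode("utf-8")


manager = EngineManager({"original": load_original, "turbo": load_turbo})
manager.select("turbo")
audio = manager.synthesize("Hello")
```

Because callers only ever talk to `synthesize`, switching engines does not change client code, which is the property the bullets above describe.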
45+
46+
### 🎭 Paralinguistic tags (Turbo)
47+
48+
- Turbo adds **native paralinguistic tags** you can write directly into your text, e.g. `…calling you back [chuckle]…`.
49+
- Supported tags include `[laugh]`, `[cough]`, and `[chuckle]`, plus text-based prompting for reactions like sigh, gasp, and cough.
50+
- Added **new presets** in `ui/presets.yaml` demonstrating paralinguistic prompting for agent-style scripts and expressive reads.
51+
52+
### ✅ Original Chatterbox remains first‑class
53+
54+
- The original Chatterbox models remain available (including multilingual), with support for **23 languages**, a **0.5B LLaMA backbone**, **emotion exaggeration control**, and training on **0.5M hours** of cleaned data.
55+
- Chatterbox outputs are **watermarked** (PerTh) for responsible AI usage.
56+
57+
### 🖥️ New NVIDIA / CUDA support
58+
59+
- Updated to support **NVIDIA CUDA 12.8** and **RTX 5090 / Blackwell** generation GPUs.
60+
61+
### 🧰 Automated launcher + easy updates
62+
63+
- New **Automated Launcher** (Windows + Linux) that creates/activates a venv, installs the right dependencies, downloads model files, starts the server, and opens the Web UI.
64+
- Easy maintenance commands:
65+
  - `--upgrade` to update code + dependencies.
66+
  - `--reinstall` for a clean reinstall when environments get messy.
67+
68+
---
69+
3170
## 🗣️ Overview: Enhanced Chatterbox TTS Generation
3271

3372
The [Chatterbox TTS model by Resemble AI](https://github.com/resemble-ai/chatterbox) provides capabilities for generating high-quality speech. This project builds upon that foundation by providing a robust [FastAPI](https://fastapi.tiangolo.com/) server that makes Chatterbox significantly easier to use and integrate.
@@ -37,6 +76,9 @@ The [Chatterbox TTS model by Resemble AI](https://github.com/resemble-ai/chatter
3776
The server expects plain text input for synthesis. It removes the complexity of setting up and running the model by offering:
3877

3978
* A **modern Web UI** for easy experimentation, preset loading, reference audio management, and generation parameter tuning.
79+
* **Multi-engine support (Original + Turbo):** Choose the TTS engine directly in the Web UI, then generate via the same UI/API surface.
80+
* **Paralinguistic prompting (Turbo):** Native tags like `[laugh]`, `[cough]`, and `[chuckle]` for natural non-speech reactions inside the same generated voice.
81+
* **Original Chatterbox strengths:** High-quality English output plus unique "emotion exaggeration control" and a 0.5B LLaMA backbone.
4082
* **Multi-Platform Acceleration:** Full support for **NVIDIA (CUDA)**, **AMD (ROCm)**, and **Apple Silicon (MPS)** GPUs, with an automatic fallback to **CPU**, ensuring you can run on any hardware.
4183
* **Large Text Handling:** Intelligently splits long plain text inputs into manageable chunks based on sentence structure, processes them sequentially, and seamlessly concatenates the audio.
4284
* **📚 Audiobook Generation:** Perfect for creating complete audiobooks - simply paste an entire book's text and the server automatically processes it into a single, seamless audio file with consistent voice quality throughout.
@@ -56,6 +98,13 @@ This server application enhances the underlying `chatterbox-tts` engine with the
5698

5799
**🚀 Core Functionality:**
58100

101+
* **Multi-Engine Support:**
102+
    * Choose between **Original Chatterbox** and **Chatterbox‑Turbo** via a hot-swappable engine selector in the Web UI.
103+
    * Turbo offers significantly faster inference with a streamlined 350M-parameter architecture.
104+
    * Original Chatterbox provides multilingual support (23 languages) and emotion exaggeration control.
105+
* **Paralinguistic Tags (Turbo):**
106+
    * Write native tags like `[laugh]`, `[cough]`, and `[chuckle]` directly in your text when using Chatterbox‑Turbo.
107+
    * New presets demonstrate paralinguistic prompting for agent-style scripts and expressive narration.
59108
* **Large Text Processing (Chunking):**
60109
    * Automatically handles long plain text inputs by intelligently splitting them into smaller chunks based on sentence boundaries.
61110
    * Processes each chunk individually and seamlessly concatenates the resulting audio, overcoming potential generation limits of the TTS engine.
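
The chunking idea above can be sketched as follows. This is a rough illustration under stated assumptions, not the server's actual splitter: the regex-based sentence boundary is deliberately naive, the function names are hypothetical, the default `chunk_size` of 120 characters is an arbitrary choice for the example, and the byte join stands in for real audio concatenation.

```python
import re
from typing import Callable, List

# Naive sentence boundary: whitespace that follows ., !, or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_chunks(text: str, chunk_size: int = 120) -> List[str]:
    """Group whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept intact as its own chunk.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_long_text(
    text: str, synthesize: Callable[[str], bytes], chunk_size: int = 120
) -> bytes:
    """Generate audio per chunk, in order, and join the results."""
    # Joining raw bytes is a placeholder for proper audio concatenation.
    return b"".join(synthesize(chunk) for chunk in split_into_chunks(text, chunk_size))


chunks = split_into_chunks("One. Two! Three?", chunk_size=9)
```

Splitting only at sentence boundaries keeps each chunk a natural unit for the model, at the cost of chunks that vary in length.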
@@ -96,12 +145,16 @@ This server application enhances the underlying `chatterbox-tts` engine with the
96145
* **Core Chatterbox Capabilities (via [Resemble AI Chatterbox](https://github.com/resemble-ai/chatterbox)):**
97146
* 🗣️ High-quality single-speaker voice synthesis from plain text.
98147
* 🎤 Perform voice cloning using reference audio prompts.
148+
* **Chatterbox‑Turbo** for significantly faster inference with paralinguistic tag support.
149+
* 🌍 **Original Chatterbox** with high-quality English output and emotion exaggeration control.
99150
* **Enhanced Server & API:**
100151
* ⚡ Built with the high-performance **[FastAPI](https://fastapi.tiangolo.com/)** framework.
101152
* ⚙️ **Custom API Endpoint** (`/tts`) as the primary method for programmatic generation, exposing all key parameters.
102153
* 📄 Interactive API documentation via Swagger UI (`/docs`).
103154
* 🩺 Health check endpoint (`/api/ui/initial-data` also serves as a comprehensive status check).
104155
* **Advanced Generation Features:**
156+
* 🔁 **Hot-Swappable Engines:** Switch between Original Chatterbox and Chatterbox‑Turbo directly in the Web UI.
157+
* 🎭 **Paralinguistic Tags (Turbo):** Native support for `[laugh]`, `[cough]`, `[chuckle]` and other expressive tags.
105158
* 📚 **Large Text Handling:** Intelligently splits long plain text inputs into chunks based on sentences, generates audio for each, and concatenates the results seamlessly. Configurable via `split_text` and `chunk_size`.
106159
* 📖 **Audiobook Creation:** Perfect for generating complete audiobooks from full-length texts with consistent voice quality and automatic chapter handling.
107160
* 🎤 **Predefined Voices:** Select from curated synthetic voices in the `./voices` directory.
@@ -110,6 +163,7 @@ This server application enhances the underlying `chatterbox-tts` engine with the
110163
* 🔇 **Audio Post-Processing:** Optional automatic steps to trim silence, fix internal pauses, and remove long unvoiced segments/artifacts (configurable via `config.yaml`).
111164
* **Intuitive Web User Interface:**
112165
* 🖱️ Modern, easy-to-use interface.
166+
* 🔁 **Engine Selector:** Hot-swap between Original Chatterbox and Chatterbox‑Turbo.
113167
* 💡 **Presets:** Load example text and settings dynamically from `ui/presets.yaml`.
114168
* 🎤 **Reference/Predefined Audio Upload:** Easily upload `.wav`/`.mp3` files.
115169
* 🗣️ **Voice Mode Selection:** Choose between Predefined Voices or Voice Cloning.
@@ -776,17 +830,29 @@ docker compose -f docker-compose-cu128.yml up -d --build
776830

777831
The most intuitive way to use the server:
778832

833+
* **Engine Selector:** Use the dropdown at the top to switch between **Original Chatterbox** and **Chatterbox‑Turbo**. The backend auto-loads the selected engine.
779834
* **Text Input:** Enter your plain text script. **For audiobooks:** Simply paste the entire book text - the chunking system will automatically handle long texts and create seamless audio output.
780835
* **Voice Mode:** Choose:
781836
    * `Predefined Voices`: Select a curated voice from the `./voices` directory.
782837
    * `Voice Cloning`: Select an uploaded reference file from `./reference_audio`.
783-
* **Presets:** Load examples from `ui/presets.yaml`.
838+
* **Presets:** Load examples from `ui/presets.yaml`. New presets demonstrate Turbo's paralinguistic tags.
784839
* **Reference/Predefined Audio Management:** Import new files and refresh lists.
785840
* **Generation Parameters:** Adjust Temperature, Exaggeration, CFG Weight, Speed Factor, Seed. Save defaults to `config.yaml`.
786841
* **Chunking Controls:** Toggle "Split text into chunks" and adjust "Chunk Size" for long texts.
787842
* **Server Configuration:** View/edit parts of `config.yaml` (requires server restart for some changes).
788843
* **Audio Player:** Play generated audio with waveform visualization.
789844
845+
### Using Paralinguistic Tags (Turbo)
846+
847+
When the engine selector is set to **Chatterbox‑Turbo**, you can include paralinguistic tags inline:
848+
849+
```
850+
Hi there [chuckle] — thanks for calling back.
851+
One moment… [cough] sorry about that. Let's get this fixed.
852+
```
853+
854+
Turbo supports native tags like `[laugh]`, `[cough]`, and `[chuckle]` for more realistic, expressive speech. These tags are ignored when using Original Chatterbox.
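
A client that reuses one script with both engines may want to detect or strip these tags itself instead of relying on them being ignored. The sketch below is a hypothetical client-side helper, not part of the server. It recognises only the three tags named in this README, and the function names are invented for illustration.

```python
import re
from typing import List

# Only the tags named in the README; a fuller tag list is not assumed here.
KNOWN_TAGS = ("laugh", "cough", "chuckle")
_TAG_PATTERN = re.compile(r"\[(" + "|".join(KNOWN_TAGS) + r")\]")


def find_tags(text: str) -> List[str]:
    """Return the known paralinguistic tags in order of appearance."""
    return _TAG_PATTERN.findall(text)


def strip_tags(text: str) -> str:
    """Remove known tags and collapse the extra spaces they leave behind."""
    without_tags = _TAG_PATTERN.sub("", text)
    # Collapse runs of spaces/tabs only, so line breaks are preserved.
    return re.sub(r"[ \t]{2,}", " ", without_tags).strip()


script = "Hi there [chuckle] and thanks for calling back."
tags = find_tags(script)
plain = strip_tags(script)
```

Bracketed text that is not a known tag is left untouched, so ordinary uses of square brackets in a script survive.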
855+
790856
### API Endpoints (`/docs` for interactive details)
791857

792858
The primary endpoint for TTS generation is `/tts`, which offers detailed control over the synthesis process.
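
As a rough sketch of calling `/tts` from Python, the example below builds a JSON POST request using only the standard library. Several details are assumptions and should be checked against `/docs`: the `text` field name, the base URL and port, and the example `chunk_size` value. Only `split_text` and `chunk_size` are named in this README. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_tts_request(
    base_url: str, text: str, split_text: bool = True, chunk_size: int = 120
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /tts endpoint."""
    payload = {
        "text": text,              # assumed field name; verify in /docs
        "split_text": split_text,  # named in the README
        "chunk_size": chunk_size,  # named in the README; value is illustrative
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The host and port below are placeholders for wherever the server runs.
request = build_tts_request(
    "http://localhost:8004", "Hi there [chuckle], thanks for calling back."
)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     audio_bytes = response.read()
```

Since all requests route through the active engine, the same payload should work whichever engine is selected in the Web UI.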
