ROS Package bob_coquitts
This ROS package provides a robust node that interfaces with the Coqui TTS library, allowing a ROS 2 system to convert text into speech. It intelligently processes incoming text streams, handles voice cloning with models like XTTS, and offers extensive configuration options.
- Intelligent Text Buffering: Waits for pauses in the incoming text stream before processing to ensure complete thoughts are synthesized.
- Dual Splitting Modes: Choose between two modes for sentence splitting via the split_sentences parameter:
  - Manual Mode (Default): Use custom delimiters (sentence_delimiters) to precisely control how text is split into sentences.
  - Automatic Mode: Let Coqui's powerful internal splitter handle long, unstructured text blocks.
- Advanced Text Normalization:
  - Automatically filters pictorial emojis and symbols by default using a Unicode-aware regex filter.
  - Removes user-defined characters (e.g., typographical quotes „“).
  - Strips leading and trailing characters (e.g., spaces, punctuation) from sentences before synthesis.
  - Normalizes numbers by removing thousands separators (e.g., 2.500 -> 2500) to ensure correct pronunciation.
- Real-time Feedback: Publishes the exact text chunk being synthesized to a separate ROS topic (/text_speaking), allowing other nodes to synchronize with the speech output.
- Wide Model Support & Voice Cloning: Supports a vast range of Coqui models, including zero-shot voice cloning with XTTS.
- Flexible Output: Optionally plays audio directly or saves it to a WAV file with automatic unique filename generation.
- Hardware Acceleration: Supports both GPU (cuda) and CPU inference.
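
The manual splitting mode described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the node's actual code: the function name split_manual is hypothetical, and the defaults mirror the documented sentence_delimiters and sentences_max parameters. Text is cut after each delimiter character, and up to sentences_max sentences are grouped into one chunk.

```python
from typing import List


def split_manual(text: str, delimiters: str = ".!?\n", sentences_max: int = 1) -> List[str]:
    """Cut text after each delimiter, then group sentences into chunks.

    Hypothetical helper; the real node may split differently.
    """
    sentences: List[str] = []
    current = ""
    for ch in text:
        current += ch
        if ch in delimiters:
            # Keep the sentence only if it holds more than delimiters and spaces.
            if current.strip(" " + delimiters):
                sentences.append(current.strip())
            current = ""
    if current.strip(" " + delimiters):
        sentences.append(current.strip())  # trailing text without a delimiter

    size = max(1, sentences_max)
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]


print(split_manual("Hello there. How are you? Fine", sentences_max=2))
# ['Hello there. How are you?', 'Fine']
```

In this sketch a period inside a number such as 2.500 would also count as a delimiter, so number normalization would have to run before splitting.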
- ROS 2 (Humble, Iron, or newer).
- Python 3.8+
- NVIDIA GPU with CUDA installed for GPU acceleration (optional but recommended for XTTS).
- An audio output device.
- System dependencies for sounddevice and libsndfile:

# For Debian/Ubuntu-based systems
sudo apt-get update
sudo apt-get install libportaudio2 libasound-dev libsndfile1
- Clone the Package: Clone this repository into your ROS 2 workspace's src directory.
- Install Python Dependencies: This node requires the regex library for full Unicode support (e.g., filtering emojis). It is recommended to use a Python virtual environment.

cd ~/ros2_ws
# If using a virtual environment, activate it first
pip install -r src/bob_coquitts/requirements.txt

The requirements.txt file should contain:

TTS
sounddevice
numpy
soundfile
regex
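
After installing, it can help to confirm that the Python environment actually sees these packages. The following is a small sketch, not part of the package: the helper name missing_modules is hypothetical, and the import names are assumed to match the entries in requirements.txt. It looks each module up with importlib without importing it.

```python
import importlib.util

# Import names assumed to correspond to the packages listed in requirements.txt.
REQUIRED_MODULES = ["TTS", "sounddevice", "numpy", "soundfile", "regex"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing Python modules:", ", ".join(missing))
else:
    print("All required Python modules were found.")
```

Run it with the same interpreter (or virtual environment) that will run the node, since a different environment may report different results.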
Source your ROS 2 installation and build the package using colcon.
cd ~/ros2_ws
source /opt/ros/humble/setup.bash
colcon build --packages-select bob_coquitts

After building, source the workspace's setup.bash file and run the node. For detailed troubleshooting, pass --log-level DEBUG after --ros-args.

source ~/ros2_ws/install/setup.bash
ros2 run bob_coquitts tts

This example uses the XTTS v2 model for voice cloning. It overrides the sentence_strip_chars parameter so that only colons are stripped, because a trailing colon can cause unnatural-sounding audio.
ros2 run bob_coquitts tts --ros-args \
-p model_name:='tts_models/multilingual/multi-dataset/xtts_v2' \
-p reference_wav:='/path/to/your/voice.wav' \
-p language:='en' \
-p device:='cuda' \
-p sentence_strip_chars:="':'"
# In another terminal, publish text with a colon
ros2 topic pub --once /text std_msgs/msg/String "data: 'Here is my statement:'"
# In a third terminal, listen to the cleaned text being spoken
ros2 topic echo /text_speaking
# Output will be: data: Here is my statement

If you are feeding a large, unstructured block of text, it's best to let Coqui handle the splitting.
ros2 run bob_coquitts tts --ros-args -p split_sentences:=True
# Publish a long paragraph
ros2 topic pub --once /text std_msgs/msg/String "data: 'This is the first sentence. This is the second sentence which is much longer and might exceed the character limit if not handled properly. Coquis splitter will take care of it.'"

The node subscribes to the following topic:

| Topic Name | Message Type | Description |
|---|---|---|
| /text | std_msgs/msg/String | The text to be synthesized. The node buffers incoming text and processes it after a pause. |
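
The buffering behaviour mentioned for /text can be pictured as a debounce: fragments accumulate, and the buffer is released only once no new text has arrived for some time. The sketch below is an assumption-laden illustration, not the node's implementation. The class name PauseBuffer and the 0.5 second pause are hypothetical, and the clock is injectable so the logic can be tried without waiting.

```python
import time
from typing import Callable, List, Optional


class PauseBuffer:
    """Accumulate text fragments and release them after a quiet period."""

    def __init__(self, pause_seconds: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.pause_seconds = pause_seconds  # hypothetical value, not a documented default
        self._clock = clock
        self._parts: List[str] = []
        self._last_update = 0.0

    def add(self, fragment: str) -> None:
        """Store a fragment and remember when it arrived."""
        self._parts.append(fragment)
        self._last_update = self._clock()

    def poll(self) -> Optional[str]:
        """Return the buffered text once the stream has been quiet long enough."""
        if not self._parts:
            return None
        if self._clock() - self._last_update < self.pause_seconds:
            return None
        text = "".join(self._parts)
        self._parts.clear()
        return text


# Demonstration with a fake clock instead of real waiting.
now = [0.0]
buf = PauseBuffer(pause_seconds=0.5, clock=lambda: now[0])
buf.add("Hello ")
now[0] = 0.2
buf.add("world.")
now[0] = 0.4
print(buf.poll())  # None: only 0.2 s have passed since the last fragment
now[0] = 0.8
print(buf.poll())  # Hello world.
```

In a ROS 2 node, add would typically be called from the subscription callback and poll from a periodic timer, though how this package wires it up is not stated here.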

The node publishes the following topic:

| Topic Name | Message Type | Description |
|---|---|---|
| /text_speaking | std_msgs/msg/String | Publishes the cleaned, normalized sentence or chunk of text exactly as it is being sent to the TTS model. |
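
To make "cleaned, normalized" concrete, here is a rough sketch of how such a cleanup could be chained. It is not the node's code: the function name normalize_chunk and the order of the steps are assumptions, and the defaults are copied from the documented text processing parameters. The emoji filter is a stand-in built on the standard re module with explicit code point ranges, because the documented default pattern relies on the third-party regex package.

```python
import re

# Rough stand-in for the default text_filter_regex; the \p{...} classes in the
# documented default need the regex package, so explicit ranges are used here.
EMOJI_APPROX = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Documented default of text_filter_chars, written with escapes for safety.
DEFAULT_FILTER_CHARS = "\u201e\u201d\u2018\u201c\u2019*\u2014#<>"


def normalize_chunk(text, filter_chars=DEFAULT_FILTER_CHARS, thousands_sep=".",
                    strip_chars=".,:!?", min_chars=3):
    """Hypothetical cleanup pipeline; step order is an assumption."""
    # 1. Drop pictographic symbols.
    text = EMOJI_APPROX.sub("", text)
    # 2. Remove every user-defined character anywhere in the text.
    text = text.translate({ord(c): None for c in filter_chars})
    # 3. Remove the separator wherever it sits between two digits.
    #    In this sketch that also affects decimals such as 3.14.
    if thousands_sep:
        text = re.sub(r"(?<=\d)" + re.escape(thousands_sep) + r"(?=\d)", "", text)
    # 4. Strip whitespace and the configured characters from both ends.
    text = text.strip(" \t\r\n" + strip_chars)
    # 5. Pad very short chunks with a period (0 disables the padding).
    if text and len(text) < min_chars:
        text += "."
    return text


print(normalize_chunk("\u201eCosts 2.500 euros!\u201c \U0001F680"))  # Costs 2500 euros
print(normalize_chunk("Here is my statement:", strip_chars=":"))    # Here is my statement
print(normalize_chunk("Ok"))                                        # Ok.
```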

Parameters, with the corresponding environment variable in parentheses:

| Parameter Name | Type | Default Value | Description |
|---|---|---|---|
| General | | | |
| model_name | string | tts_models/en/ljspeech/vits | The Coqui TTS model to use. (env: COQUITTS_MODEL_NAME) |
| language | string | '' | Language code for multi-lingual models (e.g., en, de). (env: COQUITTS_LANGUAGE) |
| device | string | cpu | Compute device for inference (cuda or cpu). (env: COQUITTS_DEVICE) |
| reference_wav | string | '' | Path to a reference WAV file for voice cloning. (env: COQUITTS_REFERENCE_WAV) |
| Audio Output | | | |
| sample_rate | integer | 24000 | Audio sample rate for playback. Must match the model's native rate. (env: COQUITTS_SAMPLE_RATE) |
| play_audio | boolean | True | If true, plays the generated audio directly. (env: COQUITTS_PLAY_AUDIO) |
| output_wav_path | string | '' | Path to save the output WAV file. (env: COQUITTS_OUTPUT_WAV_PATH) |
| Text Processing | | | |
| split_sentences | boolean | False | Mode switch for splitting. If True, Coqui handles splitting. If False (default), the node uses manual splitting below. (env: COQUITTS_SPLIT_SENTENCES) |
| sentence_delimiters | string | .!?\n | Characters used for manual splitting (only when split_sentences is False). (env: COQUITTS_SENTENCE_DELIMITERS) |
| sentences_max | integer | 1 | Max number of sentences to process at once in manual mode. (env: COQUITTS_SENTENCES_MAX) |
| min_char_length_for_synthesis | integer | 3 | If a text chunk is shorter than this, a period is appended to stabilize TTS synthesis. Set to 0 to disable. (env: COQUITTS_MIN_CHAR_LENGTH) |
| number_thousands_separator | string | . | Character to remove from between digits (e.g., . in 1.234). (env: COQUITTS_NUMBER_THOUSANDS_SEPARATOR) |
| sentence_strip_chars | string | .,:!? | Characters to remove from the beginning and end of a processed text chunk. (env: COQUITTS_SENTENCE_STRIP_CHARS) |
| text_filter_chars | string | „”‘“’*—#<> | Specific characters to remove from the entire text. (env: COQUITTS_TEXT_FILTER_CHARS) |
| text_filter_regex | string | [\p{Emoji_Presentation}\p{Extended_Pictographic}] | Regex to remove patterns from the entire text. Requires regex pip package. Default filters emojis. (env: COQUITTS_TEXT_FILTER_REGEX) |
| XTTS Tuning | | | |
| temperature | double | 0.2 | Controls randomness. Lower is more deterministic. (env: COQUITTS_TEMPERATURE) |
| length_penalty | double | 1.0 | Factor to penalize longer sequences. (env: COQUITTS_LENGTH_PENALTY) |
| repetition_penalty | double | 2.0 | Penalty for repeating tokens. (env: COQUITTS_REPETITION_PENALTY) |
| top_k | integer | 40 | Samples from the k most likely next tokens. (env: COQUITTS_TOP_K) |
| top_p | double | 0.9 | Samples from tokens with a cumulative probability of p. (env: COQUITTS_TOP_P) |
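
Each parameter lists an environment variable, which presumably supplies the default when the parameter is not set explicitly. One way such a lookup could work is sketched below. The helper env_default, its boolean parsing rules, and the precedence it implies are all assumptions for illustration and are not confirmed by this README.

```python
import os


def env_default(name, fallback, cast=str, environ=None):
    """Return the environment value converted with cast, or fallback if unset.

    Hypothetical helper; the accepted boolean spellings are an assumption.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return fallback
    if cast is bool:
        return raw.strip().lower() in ("1", "true", "yes", "on")
    return cast(raw)


# A fake environment keeps the example independent of the real shell.
fake_env = {
    "COQUITTS_DEVICE": "cuda",
    "COQUITTS_TOP_K": "50",
    "COQUITTS_PLAY_AUDIO": "false",
}
print(env_default("COQUITTS_DEVICE", "cpu", environ=fake_env))            # cuda
print(env_default("COQUITTS_TOP_K", 40, cast=int, environ=fake_env))      # 50
print(env_default("COQUITTS_PLAY_AUDIO", True, cast=bool, environ=fake_env))  # False
print(env_default("COQUITTS_TEMPERATURE", 0.2, cast=float, environ=fake_env))  # 0.2
```

A value resolved this way would then typically be passed as the default when the node declares the parameter, so that a -p override on the command line still takes effect.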