Skip to content

This ROS package provides a node that interfaces with the Coqui TTS library, allowing a ROS system to convert text into speech. It can use standard single-speaker models or advanced multi-speaker, zero-shot voice cloning models like XTTS.

License

Notifications You must be signed in to change notification settings

bob-ros2/bob_coquitts

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ROS Package bob_coquitts

This ROS package provides a robust node that interfaces with the Coqui TTS library, allowing a ROS 2 system to convert text into speech. It intelligently processes incoming text streams, handles voice cloning with models like XTTS, and offers extensive configuration options.

Features

  • Intelligent Text Buffering: Waits for pauses in the incoming text stream before processing to ensure complete thoughts are synthesized.
  • Dual Splitting Modes: Choose between two modes for sentence splitting via the split_sentences parameter:
    1. Manual Mode (Default): Use custom delimiters (sentence_delimiters) to precisely control how text is split into sentences.
    2. Automatic Mode: Let Coqui's powerful internal splitter handle long, unstructured text blocks.
  • Advanced Text Normalization:
    • Automatically filters pictorial emojis and symbols by default using a Unicode-aware regex filter.
    • Removes user-defined characters (e.g., typographical quotes „“).
    • Strips leading and trailing characters (e.g., spaces, punctuation) from sentences before synthesis.
    • Normalizes numbers by removing thousands separators (e.g., 2.500 -> 2500) to ensure correct pronunciation.
  • Real-time Feedback: Publishes the exact text chunk being synthesized to a separate ROS topic (/text_speaking), allowing other nodes to synchronize with the speech output.
  • Wide Model Support & Voice Cloning: Supports a vast range of Coqui models, including zero-shot voice cloning with XTTS.
  • Flexible Output: Optionally plays audio directly or saves it to a WAV file with automatic unique filename generation.
  • Hardware Acceleration: Supports both GPU (cuda) and CPU inference.

Prerequisites

  • ROS 2 (Humble, Iron, or newer).
  • Python 3.8+
  • NVIDIA GPU with CUDA installed for GPU acceleration (optional but recommended for XTTS).
  • An audio output device.
  • System dependencies for sounddevice and libsndfile.
# For Debian/Ubuntu-based systems
sudo apt-get update
sudo apt-get install libportaudio2 libasound-dev libsndfile1

Installation

  1. Clone the Package: Clone this repository into your ROS 2 workspace's src directory.

  2. Install Python Dependencies: This node requires the regex library for full Unicode support (e.g., filtering emojis). It is recommended to use a Python virtual environment.

    cd ~/ros2_ws
    # If using a virtual environment, activate it first
    pip install -r src/bob_coquitts/requirements.txt

    The requirements.txt file should contain:

    TTS
    sounddevice
    numpy
    soundfile
    regex
    

Building

Source your ROS 2 installation and build the package using colcon.

cd ~/ros2_ws
source /opt/ros/humble/setup.bash
colcon build --packages-select bob_coquitts

Usage

After building, source the workspace's setup.bash file. For detailed troubleshooting, launch the node with --log-level DEBUG.

source ~/ros2_ws/install/setup.bash
ros2 run bob_coquitts tts

Example 1: XTTS Voice Cloning with Text Cleaning

This example uses the powerful XTTS v2 model. We override the sentence_strip_chars parameter to only remove colons, which can cause unnatural-sounding audio.

ros2 run bob_coquitts tts --ros-args \
-p model_name:='tts_models/multilingual/multi-dataset/xtts_v2' \
-p reference_wav:='/path/to/your/voice.wav' \
-p language:='en' \
-p device:='cuda' \
-p sentence_strip_chars:="':'"

# In another terminal, publish text with a colon
ros2 topic pub --once /text std_msgs/msg/String "data: 'Here is my statement:'"

# In a third terminal, listen to the cleaned text being spoken
ros2 topic echo /text_speaking
# Output will be: data: Here is my statement

Example 2: Using Coqui's Internal Splitter for Long Text

If you are feeding a large, unstructured block of text, it's best to let Coqui handle the splitting.

ros2 run bob_coquitts tts --ros-args -p split_sentences:=True

# Publish a long paragraph
ros2 topic pub --once /text std_msgs/msg/String "data: 'This is the first sentence. This is the second sentence which is much longer and might exceed the character limit if not handled properly. Coquis splitter will take care of it.'"

ROS Interface

Subscribed Topics

Topic Name Message Type Description
/text std_msgs/msg/String The text to be synthesized. The node buffers incoming text and processes it after a pause.

Published Topics

Topic Name Message Type Description
/text_speaking std_msgs/msg/String Publishes the cleaned, normalized sentence or chunk of text exactly as it is being sent to the TTS model.

Parameters

Parameter Name Type Default Value Description
General
model_name string tts_models/en/ljspeech/vits The Coqui TTS model to use. (env: COQUITTS_MODEL_NAME)
language string '' Language code for multi-lingual models (e.g., en, de). (env: COQUITTS_LANGUAGE)
device string cpu Compute device for inference (cuda or cpu). (env: COQUITTS_DEVICE)
reference_wav string '' Path to a reference WAV file for voice cloning. (env: COQUITTS_REFERENCE_WAV)
Audio Output
sample_rate integer 24000 Audio sample rate for playback. Must match the model's native rate. (env: COQUITTS_SAMPLE_RATE)
play_audio boolean True If true, plays the generated audio directly. (env: COQUITTS_PLAY_AUDIO)
output_wav_path string '' Path to save the output WAV file. (env: COQUITTS_OUTPUT_WAV_PATH)
Text Processing
split_sentences boolean False Mode switch for splitting. If True, Coqui handles splitting. If False (default), the node uses manual splitting below. (env: COQUITTS_SPLIT_SENTENCES)
sentence_delimiters string .!?\n Characters used for manual splitting (only when split_sentences is False). (env: COQUITTS_SENTENCE_DELIMITERS)
sentences_max integer 1 Max number of sentences to process at once in manual mode. (env: COQUITTS_SENTENCES_MAX)
min_char_length_for_synthesis integer 3 If a text chunk is shorter than this, a period is appended to stabilize TTS synthesis. Set to 0 to disable. (env: COQUITTS_MIN_CHAR_LENGTH)
number_thousands_separator string . Character to remove from between digits (e.g., . in 1.234). (env: COQUITTS_NUMBER_THOUSANDS_SEPARATOR)
sentence_strip_chars string .,:!? Characters to remove from the beginning and end of a processed text chunk. (env: COQUITTS_SENTENCE_STRIP_CHARS)
text_filter_chars string „”‘“’*—#<> Specific characters to remove from the entire text. (env: COQUITTS_TEXT_FILTER_CHARS)
text_filter_regex string [\p{Emoji_Presentation}\p{Extended_Pictographic}] Regex to remove patterns from the entire text. Requires regex pip package. Default filters emojis. (env: COQUITTS_TEXT_FILTER_REGEX)
XTTS Tuning
temperature double 0.2 Controls randomness. Lower is more deterministic. (env: COQUITTS_TEMPERATURE)
length_penalty double 1.0 Factor to penalize longer sequences. (env: COQUITTS_LENGTH_PENALTY)
repetition_penalty double 2.0 Penalty for repeating tokens. (env: COQUITTS_REPETITION_PENALTY)
top_k integer 40 Samples from the k most likely next tokens. (env: COQUITTS_TOP_K)
top_p double 0.9 Samples from tokens with a cumulative probability of p. (env: COQUITTS_TOP_P)

About

This ROS package provides a node that interfaces with the Coqui TTS library, allowing a ROS system to convert text into speech. It can use standard single-speaker models or advanced multi-speaker, zero-shot voice cloning models like XTTS.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published