|
35 | 35 | "source": [ |
36 | 36 | "## Basics: Generating Speech with Riva TTS APIs\n", |
37 | 37 | "\n", |
38 | | - "The Riva TTS service is based on a two-stage pipeline: Riva generates a mel spectrogram using the first model, then uses the mel spectrogram to generate speech using the second model. This pipeline forms a text-to-speech system that enables you to synthesize natural sounding speech from raw transcripts without any additional information such as patterns or rhythms of speech.\n", |
| 38 | + "The Riva TTS service is based on a two-stage pipeline: Riva models like FastPitch and RadTTS++ first generates a mel-spectrogram, and then generates\n", |
| 39 | + "speech using the HifiGAN model while MagpieTTS Multilingual generates tokens and then generates speech using the Audio Codec model. This pipeline forms a text-to-speech system that enables you to synthesize natural sounding speech from raw transcripts without any additional information such as patterns or rhythms of speech.\n", |
39 | 40 | "\n", |
40 | 41 | "Riva provides two state-of-the-art voices (one male and one female) for English, that can easily be deployed with the Riva Quick Start scripts. Riva also supports easy customization of TTS in various ways, to meet your specific needs. \n", |
41 | 42 | "Subsequent Riva releases will include features such as model registration to support multiple languages/voices with the same API and support for resampling to alternative sampling rates. \n", |
|
114 | 115 | "source": [ |
115 | 116 | "### TTS modes\n", |
116 | 117 | "\n", |
117 | | - "Riva TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text is generated and can achieve higher throughput. But when making a streaming request, audio chunks are returned as soon as they are generated, significantly reducing the latency (as measured by time to first audio) for large requests. <br> \n", |
| 118 | + "Riva TTS supports both streaming and offline inference modes. In offline mode, audio is not returned until the full audio sequence for the requested text is generated and can achieve higher throughput. But when making a streaming request, audio chunks are returned as soon as they are generated, significantly reducing the latency (as measured by time to first audio) for large requests. <br> \n", |
118 | 119 | "\n", |
119 | 120 | "\n", |
120 | 121 | "\n", |
|
153 | 154 | "- ``language_code`` - Language of the generated audio. ``en-US`` represents English (US) and is currently the only language supported OOTB.\n", |
154 | 155 | "- ``encoding`` - Type of audio encoding to generate. ``LINEAR_PCM`` and ``OGGOPUS`` encodings are supported.\n", |
155 | 156 | "- ``sample_rate_hz`` - Sample rate of the generated audio. Depends on the microphone and is usually ``22khz`` or ``44khz``.\n", |
156 | | - "- ``voice_name`` - Voice used to synthesize the audio. Currently, Riva offers two OOTB voices (``English-US.Female-1``, ``English-US.Male-1``)." |
| 157 | + "- ``voice_name`` - Voice used to synthesize the audio. Currently, Riva offers two OOTB voices (``English-US.Female-1``, ``English-US.Male-1``).\n", |
| 158 | + "- ``custom_pronunciation`` - Dictionary of words and their custom pronunciations. For ease of use, the python API accepts a dictionary of words and their custom pronunciations. While the gRPC API accepts a string of comma seperated entries of words and their custom pronunciations with the format ``word1 pronunciation1,word2 pronunciation2``." |
157 | 159 | ] |
158 | 160 | }, |
159 | 161 | { |
|
227 | 229 | "Let's look at customization of Riva TTS with these SSML tags in some detail." |
228 | 230 | ] |
229 | 231 | }, |
| 232 | + { |
| 233 | + "cell_type": "markdown", |
| 234 | + "metadata": {}, |
| 235 | + "source": [ |
| 236 | + "\n", |
| 237 | + "##### Note\n", |
| 238 | + "Magpie TTS Multilingual supports only ``phoneme`` tag." |
| 239 | + ] |
| 240 | + }, |
230 | 241 | { |
231 | 242 | "attachments": {}, |
232 | 243 | "cell_type": "markdown", |
|
332 | 343 | "<audio controls src=\"https://raw.githubusercontent.com/nvidia-riva/tutorials/stable/audio_samples/tts_samples/ssml_sample_0.wav\" type=\"audio/ogg\"></audio>\n", |
333 | 344 | "\n", |
334 | 345 | "#### Note\n", |
335 | | - "If the audio controls are not seen throughout notebook. Open the notebook in github dev or view it in the [riva docs](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/tts-python-basics-and-customization-with-ssml.html)\n" |
| 346 | + "If the audio controls are not seen throughout notebook. Open the notebook in github dev or view it in the [riva docs](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/tts-basics-customize-ssml.html)\n" |
336 | 347 | ] |
337 | 348 | }, |
338 | 349 | { |
|
457 | 468 | "#### Arpabet\n", |
458 | 469 | "The full list of phonemes in the CMUdict can be found at [cmudict.phone](https://github.com/cmusphinx/cmudict/blob/master/cmudict.phones). The list of supported symbols with stress can be found at [cmudict.symbols](https://github.com/cmusphinx/cmudict/blob/master/cmudict.symbols). For a mapping of these phones to English sounds, refer to the [ARPABET Wikipedia page](https://en.wikipedia.org/wiki/ARPABET).\n", |
459 | 470 | "\n", |
| 471 | + "#### Custom pronunciations\n", |
| 472 | + "\n", |
| 473 | + "We also support passing custom pronunciations for words with the request which will override the default pronunciation for the word for the request. For ease of use, the python API accepts a dictionary of words and their custom pronunciations. While the gRPC API accepts a string of comma seperated entries of words and their custom pronunciations with the format ``word1 pronunciation1,word2 pronunciation2``.\n", |
| 474 | + "\n", |
460 | 475 | "Let's look at an example showing this custom pronunciation for Riva TTS:" |
461 | 476 | ] |
462 | 477 | }, |
|
481 | 496 | "ssml_text = '<speak>You say <phoneme alphabet=\"ipa\" ph=\"təˈmeɪˌtoʊ\">tomato</phoneme>, I say <phoneme alphabet=\"ipa\" ph=\"təˈmɑˌtoʊ\">tomato</phoneme>.</speak>'\n", |
482 | 497 | "# Older arpabet version\n", |
483 | 498 | "# ssml_text = '<speak>You say <phoneme alphabet=\"x-arpabet\" ph=\"{@T}{@AH0}{@M}{@EY1}{@T}{@OW2}\">tomato</phoneme>, I say <phoneme alphabet=\"x-arpabet\" ph=\"{@T}{@AH0}{@M}{@AA1}{@T}{@OW2}\">tomato</phoneme>.</speak>'\n", |
| 499 | + "custom_pronunciation = {\n", |
| 500 | + " \"tomato\": \"təˈmeɪˌtoʊ\"\n", |
| 501 | + "}\n", |
| 502 | + "print(\"Raw Text: \", raw_text)\n", |
| 503 | + "print(\"SSML Text: \", ssml_text)\n", |
| 504 | + "\n", |
| 505 | + "req[\"text\"] = ssml_text\n", |
| 506 | + "# Request to Riva TTS to synthesize audio\n", |
| 507 | + "resp = riva_tts.synthesize(**req)\n", |
| 508 | + "\n", |
| 509 | + "# Playing the generated audio from Riva TTS request\n", |
| 510 | + "audio_samples = np.frombuffer(resp.audio, dtype=np.int16)\n", |
| 511 | + "ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))\n", |
| 512 | + "\n", |
| 513 | + "# Passing custom pronunciation dictionary\n", |
| 514 | + "ssml_text = '<speak>You say tomato, I say <phoneme alphabet=\"ipa\" ph=\"təˈmɑˌtoʊ\">tomato</phoneme>.</speak>'\n", |
484 | 515 | "\n", |
485 | 516 | "print(\"Raw Text: \", raw_text)\n", |
486 | 517 | "print(\"SSML Text: \", ssml_text)\n", |
487 | 518 | "\n", |
488 | 519 | "req[\"text\"] = ssml_text\n", |
| 520 | + "req[\"custom_pronunciation\"] = custom_pronunciation\n", |
489 | 521 | "# Request to Riva TTS to synthesize audio\n", |
490 | 522 | "resp = riva_tts.synthesize(**req)\n", |
491 | 523 | "\n", |
|
500 | 532 | "source": [ |
501 | 533 | "#### Expected results if you run the tutorial:\n", |
502 | 534 | "`You say <phoneme alphabet=\"ipa\" ph=\"təˈmeɪˌtoʊ\">tomato</phoneme>, I say <phoneme alphabet=\"ipa\" ph=\"təˈmɑˌtoʊ\">tomato</phoneme>.` \n", |
| 535 | + "\n", |
| 536 | + "<audio controls src=\"https://raw.githubusercontent.com/nvidia-riva/tutorials/stable/audio_samples/tts_samples/ssml_sample_9.wav\" type=\"audio/wav\"></audio> \n", |
| 537 | + "\n", |
| 538 | + "`You say tomato, I say <phoneme alphabet=\"ipa\" ph=\"təˈmɑˌtoʊ\">tomato</phoneme>.`\n", |
| 539 | + "\n", |
503 | 540 | "<audio controls src=\"https://raw.githubusercontent.com/nvidia-riva/tutorials/stable/audio_samples/tts_samples/ssml_sample_9.wav\" type=\"audio/wav\"></audio> \n" |
504 | 541 | ] |
505 | 542 | }, |
|