[Feature request] Token start/end times in the audio stream


**🚀 Feature Description**

When generating an audio stream, is it possible to also return the word (or token) for the current chunk? More precisely, is it possible to know the start and end time in the stream for each token?
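
To illustrate the kind of timing information meant here: chunk-level start and end times can already be derived on the caller side by counting samples. Below is a minimal sketch, assuming each streamed chunk is a 1-D sequence of samples and the output sample rate is 24000 Hz (an assumption; use the model's configured rate). `chunk_times` is a hypothetical helper, not part of the library.

```python
from typing import Iterable, Iterator, Sequence, Tuple


def chunk_times(
    chunks: Iterable[Sequence[float]],
    sample_rate: int = 24000,  # assumed output rate; adjust to the model config
) -> Iterator[Tuple[float, float]]:
    """Yield (start_seconds, end_seconds) for each audio chunk, in stream order."""
    samples_seen = 0
    for chunk in chunks:
        start = samples_seen / sample_rate
        samples_seen += len(chunk)  # number of samples in this chunk
        yield start, samples_seen / sample_rate


# Usage with two dummy chunks (0.5 s and 1.0 s of silence at 24 kHz)
dummy_chunks = [[0.0] * 12000, [0.0] * 24000]
for start, end in chunk_times(dummy_chunks):
    print(f"chunk from {start:.2f}s to {end:.2f}s")
```

This only gives timing per chunk. What it cannot provide is which words or tokens each chunk covers, which is the part this request is about.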



**Additional context**

The relevant source code is [here](https://github.com/idiap/coqui-ai-TTS/blob/4c593c620854d9cd2e177382abf48082f7c9f2ae/TTS/tts/models/xtts.py#L654C1-L679C36).