The applications of discrete speech tokens for robust and context-aware text-to-speech synthesis

If you have a question about this talk, please contact Simon Webster McKnight.

A conventional neural text-to-speech (TTS) pipeline typically has two stages: an acoustic model first predicts a mel-spectrogram from text, and a vocoder then generates the waveform from that mel-spectrogram. However, such systems often suffer from suboptimal synthesis quality and are sensitive to the quality of the training data. We propose, for the first time, to use discrete speech tokens from self-supervised models as the intermediate representation of the TTS pipeline, leading to a significant improvement in robustness. Building on this pipeline, we extend its applications to context-aware TTS tasks, where coherence between the generated speech and its context is taken into account during generation.
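As a rough illustration of what "discrete speech tokens" means in this setting, the sketch below quantises continuous self-supervised features to token ids by nearest-neighbour lookup in a codebook, in the spirit of k-means-clustered HuBERT units. The function name, feature dimensions, and codebook size are illustrative assumptions, not details taken from the talk.

import numpy as np

def quantize_ssl_features(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map continuous self-supervised features (frames, dim) to discrete
    token ids (frames,) by nearest-neighbour lookup in a k-means codebook."""
    # Squared Euclidean distance from every frame to every codebook entry.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=-1)

# Toy usage: 100 frames of 768-dim features (HuBERT-like), 500 token types.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 768))
codebook = rng.normal(size=(500, 768))
tokens = quantize_ssl_features(features, codebook)
print(tokens.shape, tokens[:10])  # (100,) and the first ten token ids

In a token-based pipeline of the kind the abstract describes, the acoustic model would predict such token sequences from text instead of mel-spectrograms, and a token-conditioned vocoder would map them back to waveforms.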

This talk is part of the CUED Speech Group Seminars series.
