
Unsupervised Speech Disentanglement for Speech Style Transfer


If you have a question about this talk, please contact Dr Jie Pu.

This talk will be held on Zoom.

Abstract: Speech information can be roughly decomposed into four components: linguistic content, timbre, pitch, and rhythm. Obtaining disentangled representations of these components is useful in speech analysis and generation applications. Among these applications, non-parallel many-to-many voice conversion, which converts between many speakers without training on parallel data, is the most challenging speech style transfer paradigm. We address its challenges progressively in a series of three works. First, we proposed AutoVC, the first zero-shot non-parallel timbre conversion framework. Using a simple autoencoder with a carefully designed bottleneck, it solves both the over-smoothing problem of VAE-based methods and the unstable training problem of GAN-based methods. The second work proposed SpeechSplit, which can blindly decompose speech into its four components by introducing three carefully designed bottlenecks. SpeechSplit is among the first algorithms to perform style transfer on timbre, pitch, and rhythm separately without text transcriptions. The third work proposed AutoPST, an Autoencoder-based Prosody Style Transfer framework that can disentangle global prosody style from speech without relying on any text transcriptions. It uses a thorough rhythm removal module guided by self-expressive representation learning. AutoPST is among the first algorithms to effectively convert prosody style in an unsupervised manner.

Bio: Kaizhi Qian is currently doing research at the MIT-IBM Watson AI Lab. He received his Ph.D. in Electrical and Computer Engineering from UIUC under the supervision of Prof. Mark Hasegawa-Johnson. His work focuses on applications of deep generative models for speech and time-series processing. He has recently been working on unsupervised speech disentanglement for low-resource language processing.

This talk is part of the CUED Speech Group Seminars series.


© 2006-2024 Talks.cam, University of Cambridge.