Unsupervised Speech Disentanglement for Speech Style Transfer
If you have a question about this talk, please contact Dr Jie Pu. This talk will be on Zoom.

Abstract: Speech information can be roughly decomposed into four components: linguistic content, timbre, pitch, and rhythm. Obtaining disentangled representations of these components is useful in speech analysis and generation applications. Among these applications, non-parallel many-to-many voice conversion, which converts between many speakers without training on parallel data, is the most challenging speech style transfer paradigm. We carried out a series of three works that address these challenges progressively. First, we proposed AutoVC, the first zero-shot non-parallel timbre conversion framework; using a simple autoencoder with a carefully designed bottleneck, it avoids both the over-smoothing problem of VAE-based methods and the unstable training of GAN-based methods (a minimal sketch of this bottleneck idea appears at the end of this page). The second work proposed SpeechSplit, which can blindly decompose speech into its four components by introducing three carefully designed bottlenecks. SpeechSplit is among the first algorithms to separately perform style transfer on timbre, pitch, and rhythm without text transcriptions. The third work proposed AutoPST, an autoencoder-based prosody style transfer framework that disentangles global prosody style from speech without relying on any text transcriptions, using a thorough rhythm removal module guided by self-expressive representation learning. AutoPST is among the first algorithms to effectively convert prosody style in an unsupervised manner.

Bio: Kaizhi Qian is currently doing research at the MIT-IBM Watson AI Lab. He received his Ph.D. in Electrical and Computer Engineering from UIUC under the supervision of Prof. Mark Hasegawa-Johnson. His work focuses on applications of deep generative models to speech and time-series processing. He has recently been working on unsupervised speech disentanglement for low-resource language processing.

This talk is part of the CUED Speech Group Seminars series.
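The abstract does not include code, but the bottleneck-autoencoder idea behind AutoVC can be sketched in a few lines. The sketch below is illustrative only and is not the authors' implementation: the class name BottleneckVC, the layer sizes, and the plain MLP encoder/decoder are hypothetical placeholders, whereas the actual AutoVC model uses recurrent content encoders, temporal downsampling of the bottleneck codes, and a spectrogram decoder paired with a vocoder.

```python
# Minimal sketch (PyTorch) of an AutoVC-style bottleneck autoencoder.
# Hypothetical sizes and modules throughout; see the AutoVC paper for
# the real architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckVC(nn.Module):
    def __init__(self, n_mels=80, spk_dim=256, dim_neck=32):
        super().__init__()
        # Content encoder: squeezes mel frames through a narrow bottleneck
        # so that speaker identity no longer fits through it.
        self.encoder = nn.Sequential(
            nn.Linear(n_mels + spk_dim, 512), nn.ReLU(),
            nn.Linear(512, dim_neck),  # the carefully sized bottleneck
        )
        # Decoder: reconstructs mels from content codes plus a
        # (possibly different) speaker embedding.
        self.decoder = nn.Sequential(
            nn.Linear(dim_neck + spk_dim, 512), nn.ReLU(),
            nn.Linear(512, n_mels),
        )

    def forward(self, mels, src_spk, tgt_spk):
        # mels: (batch, frames, n_mels); speaker embeddings: (batch, spk_dim)
        T = mels.size(1)
        src = src_spk.unsqueeze(1).expand(-1, T, -1)
        tgt = tgt_spk.unsqueeze(1).expand(-1, T, -1)
        content = self.encoder(torch.cat([mels, src], dim=-1))
        return self.decoder(torch.cat([content, tgt], dim=-1))

# Training uses plain self-reconstruction (source speaker == target
# speaker); conversion at test time simply swaps in the target
# speaker's embedding.
model = BottleneckVC()
mels = torch.randn(4, 100, 80)
spk_a, spk_b = torch.randn(4, 256), torch.randn(4, 256)
recon = model(mels, spk_a, spk_a)      # training path
loss = F.mse_loss(recon, mels)
converted = model(mels, spk_a, spk_b)  # zero-shot conversion path
```

The design point the abstract emphasizes is the bottleneck width: wide enough to let linguistic content pass, but narrow enough that speaker identity is squeezed out and must be re-supplied by the speaker embedding at decoding time.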