SoundStream: An End-to-End Neural Audio Codec
If you have a question about this talk, please contact Dr Jie Pu.

This talk will be on Zoom.

Abstract: Audio codecs (MP3, Opus) are compression algorithms used whenever one needs to transmit audio, whether for streaming a song or for a conference call. In this talk, I will present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to quantizer layers, a single model can operate across variable bitrates from 3 kbps to 18 kbps, with a negligible quality loss when compared with models trained at fixed bitrates. In addition, the model is amenable to a low-latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at a 24 kHz sampling rate, SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.

Bio: Neil Zeghidour is a Senior Research Scientist at Google Brain in Paris, and teaches automatic speech processing at Ecole Normale Supérieure. He previously graduated with a PhD in Machine Learning from Ecole Normale Supérieure in Paris, jointly with Facebook AI Research. His main research interest is to integrate signal processing and deep learning into fully learnable architectures for audio understanding and generation.
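To make the abstract's residual vector quantizer and quantizer dropout more concrete, here is a minimal NumPy sketch. It is not the speaker's implementation: the codebooks are random rather than learned, the zero codeword in each codebook is an assumption added so that an extra layer can never increase the error, and the names `rvq_encode`, `rvq_decode` and `bitrate_bps` are hypothetical. The codebook size (1024) and frame rate (75 frames per second) used in the bitrate arithmetic are illustrative values and are not stated in the abstract.

```python
import numpy as np


def rvq_encode(x, codebooks, n_layers=None):
    """Greedy residual VQ: each layer quantizes what the previous layers left over.

    x: array of shape (N, D); codebooks: list of arrays of shape (K, D).
    Returns integer indices of shape (N, n_layers).
    """
    n_layers = len(codebooks) if n_layers is None else n_layers
    residual = x.copy()
    indices = []
    for cb in codebooks[:n_layers]:
        # Squared distance from every residual vector to every codeword.
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)
        indices.append(idx)
        residual = residual - cb[idx]
    return np.stack(indices, axis=1)


def rvq_decode(indices, codebooks):
    """Sum the selected codewords of the layers that were actually used."""
    out = np.zeros((indices.shape[0], codebooks[0].shape[1]))
    for layer, cb in enumerate(codebooks[: indices.shape[1]]):
        out += cb[indices[:, layer]]
    return out


def bitrate_bps(n_layers, codebook_size, frames_per_second):
    """Bits per second when each frame sends one index per active layer."""
    return n_layers * np.log2(codebook_size) * frames_per_second


rng = np.random.default_rng(0)
dim, codebook_size, max_layers = 8, 16, 6

# Random stand-ins for learned codebooks, with a shrinking scale per layer.
codebooks = []
for layer in range(max_layers):
    cb = rng.normal(size=(codebook_size, dim)) * 0.5 ** layer
    cb[0] = 0.0  # assumption: a zero codeword, so a layer can always "do nothing"
    codebooks.append(cb)

x = rng.normal(size=(100, dim))

# Using more layers costs more bits and leaves a smaller reconstruction error.
errors = []
for n in range(1, max_layers + 1):
    x_hat = rvq_decode(rvq_encode(x, codebooks, n), codebooks)
    errors.append(np.linalg.norm(x - x_hat, axis=1).mean())

# Quantizer dropout, sketched: at each training step keep only the first
# n_train layers, with n_train drawn uniformly from 1..max_layers.
n_train = int(rng.integers(1, max_layers + 1))
train_indices = rvq_encode(x, codebooks, n_train)
```

Because every layer refines the residual left by the earlier ones, a decoder can reconstruct from any prefix of the layers, which is what lets one model serve several bitrates. Under the illustrative values above, `bitrate_bps(4, 1024, 75)` gives 3000 bits per second and `bitrate_bps(24, 1024, 75)` gives 18000, matching the two ends of the range quoted in the abstract.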
This talk is part of the CUED Speech Group Seminars series.