University of Cambridge > > CUED Speech Group Seminars > SoundStream: An End-to-End Neural Audio Codec

SoundStream: An End-to-End Neural Audio Codec

Add to your list(s) Download to your calendar using vCal

If you have a question about this talk, please contact Dr Jie Pu.

This talk will be on zoom

Abstract: Audio codecs (mp3, Opus), are compression algorithms used whenever one needs to transmit audio, whether when streaming a song or during a conference call. In this talk, I will present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed by a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to quantizer layers, a single model can operate across variable bitrates from 3kbps to 18kbps, with a negligible quality loss when compared with models trained at fixed bitrates. In addition, the model is amenable to a low latency implementation, which supports streamable inference and runs in real time on a smartphone CPU . In subjective evaluations using audio at 24kHz sampling rate, SoundStream at 3kbps outperforms Opus at 12kbps and approaches EVS at 9.6kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.

Bio: Neil Zeghidour is a Senior Research Scientist at Google Brain in Paris, and teaches automatic speech processing at Ecole Normale Supérieure. He previously graduated with a PhD in Machine Learning from Ecole Normale Superieure in Paris, jointly with Facebook AI Research. His main research interest is to integrate signal processing and deep learning into fully learnable architectures for audio understanding and generation.

This talk is part of the CUED Speech Group Seminars series.

Tell a friend about this talk:

This talk is included in these lists:

Note that ex-directory lists are not shown.


© 2006-2024, University of Cambridge. Contact Us | Help and Documentation | Privacy and Publicity