
Tackling Multispeaker Conversation Processing based on Speaker Diarization and Multispeaker Speech Recognition


If you have a question about this talk, please contact Dr Kate Knill.

Seminar on Zoom

Abstract: Recently, speech recognition and understanding research has shifted its focus from single-speaker automatic speech recognition (ASR) in controlled scenarios to the more challenging and realistic analysis of multispeaker conversations, combining ASR and speaker diarization. The CHiME speech separation and recognition challenge is one attempt to tackle this new paradigm. This talk first introduces the latest CHiME-6 challenge, which focuses on recognizing multispeaker conversations in a dinner-party scenario, and summarizes its results. The second part of the talk addresses this problem with emerging end-to-end neural architectures. We first introduce an end-to-end single-microphone multispeaker ASR technique based on recurrent neural networks and transformers and show its effectiveness. Second, we extend this approach to leverage multi-microphone input, realizing simultaneous speech separation and recognition within a single neural network trained only with the ASR objective. Finally, we introduce our recent work on speaker diarization with end-to-end neural architectures, covering basic concepts, online extensions, and handling an unknown number of speakers.
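A key ingredient of end-to-end multispeaker training of the kind described above is a permutation-free (permutation-invariant) objective: with several overlapping speakers, the assignment between model outputs and reference labels is unknown, so the loss is taken as the minimum over all assignments. The sketch below illustrates this idea only; the function names and the toy mean-squared-error pair loss are illustrative assumptions, not details from the talk.

```python
import itertools
import numpy as np

def pit_loss(predictions, references, pair_loss):
    """Permutation-invariant training (PIT) loss (illustrative sketch).

    With S overlapping speakers, the correspondence between the S model
    outputs and the S reference labels is unknown, so the loss is
    evaluated under every permutation of the references and the minimum
    over permutations is returned.
    """
    n_speakers = len(predictions)
    best = float("inf")
    for perm in itertools.permutations(range(n_speakers)):
        total = sum(
            pair_loss(predictions[s], references[perm[s]])
            for s in range(n_speakers)
        )
        best = min(best, total / n_speakers)
    return best

# Toy pair loss: mean-squared error between feature vectors.
def mse(a, b):
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

preds = [[1.0, 1.0], [0.0, 0.0]]
refs = [[0.0, 0.0], [1.0, 1.0]]   # same targets, given in swapped order
print(pit_loss(preds, refs, mse))  # -> 0.0: the swapped assignment matches
```

In real systems the pair loss would be a per-speaker ASR or diarization loss rather than MSE, and the minimum-cost assignment is often found with the Hungarian algorithm instead of brute-force enumeration when the number of speakers grows.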

Bio: Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at the Georgia Institute of Technology, Atlanta, GA, in 2009, a Senior Principal Research Scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, from 2012 to 2017, and an associate research professor at Johns Hopkins University, Baltimore, MD, from 2017 to 2020. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published more than 200 papers in peer-reviewed journals and conferences and has received several awards, including the best paper award at IEEE ASRU 2019. He served as an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing, and he was or has been a member of several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), and the Machine Learning for Signal Processing Technical Committee (MLSP).

This talk is made possible through the ISCA International Virtual Seminars.

This talk is part of the CUED Speech Group Seminars series.

