
ICASSP presentations


If you have a question about this talk, please contact Rogier van Dalen.

The ICASSP conference will be on 4–9 May. These people from the speech group will present their ICASSP papers:


  • Pierre Lanchantin et al., Multiple-Average-Voice-based Speech Synthesis
  • Jingzhou Yang et al., Infinite Structured Support Vector Machines for Speech Recognition
  • Chao Zhang et al., Standalone Training of Context-Dependent Deep Neural Network Acoustic Models
  • Xie Chen et al., Impact Of Single-Microphone Dereverberation On DNN-Based Meeting Transcription Systems
  • Pirros Tsiakoulis et al., Dialogue Context Sensitive HMM-Based Speech Synthesis
  • Anton Ragni et al., Investigation Of Unsupervised Adaptation of DNN Acoustic Models With Filter Bank Input


Pierre Lanchantin, Mark Gales, Simon King, Junichi Yamagishi

Multiple-Average-Voice-based Speech Synthesis

This paper describes a novel approach for the speaker adaptation of statistical parametric speech synthesis systems based on the interpolation of a set of average voice models (AVM). Recent results have shown that the quality/naturalness of adapted voices depends on the distance from the average voice model used for speaker adaptation. This suggests the use of several AVMs trained on carefully chosen speaker clusters from which a more suitable AVM can be selected/interpolated during the adaptation. In the proposed approach a set of AVMs, a multiple-AVM, is trained on distinct clusters of speakers which are iteratively re-assigned during the estimation process initialised according to metadata. During adaptation, each AVM from the multiple-AVM is first adapted towards the target speaker. The adapted means from the AVMs are then interpolated to yield the final speaker adapted mean for synthesis. It is shown, performing speaker adaptation on a corpus of British speakers with various regional accents, that the quality/naturalness of synthetic speech of adapted voices is significantly higher than when considering a single factor-independent AVM selected according to the target speaker characteristics.
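The final interpolation step described above can be illustrated with a minimal sketch. It assumes the interpolation is a convex combination (non-negative weights normalised to sum to one); the abstract does not say how the weights are chosen or constrained, so that choice, the function name `interpolate_means`, and the toy numbers are all hypothetical.

```python
def interpolate_means(adapted_means, weights):
    """Combine per-AVM adapted mean vectors into one speaker-adapted mean.

    adapted_means: one mean vector (list of floats) per AVM, each already
                   adapted towards the target speaker.
    weights:       one non-negative weight per AVM (ASSUMPTION: a convex
                   combination; the abstract does not specify the weights).
    """
    if not adapted_means:
        raise ValueError("need at least one adapted mean")
    if len(adapted_means) != len(weights):
        raise ValueError("need exactly one weight per adapted mean")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    dim = len(adapted_means[0])
    if any(len(mean) != dim for mean in adapted_means):
        raise ValueError("all mean vectors must have the same dimension")

    # Normalise the weights so they sum to one, then take the weighted sum
    # of the adapted means, dimension by dimension.
    normalised = [w / total for w in weights]
    return [
        sum(w * mean[d] for w, mean in zip(normalised, adapted_means))
        for d in range(dim)
    ]


# Toy example: two AVMs with 2-dimensional adapted means (made-up values).
final_mean = interpolate_means([[0.0, 2.0], [4.0, 6.0]], [1.0, 3.0])
```

With weights 1 and 3, the second AVM contributes three quarters of the final mean, so the result lies closer to its adapted mean.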

Jingzhou Yang, Rogier van Dalen, Shi-Xiong Zhang and Mark Gales

Infinite Structured Support Vector Machines for Speech Recognition

Discriminative models, like support vector machines (SVMs), have been successfully applied to speech recognition and have improved performance. A Bayesian non-parametric version of the SVM, the infinite SVM, improves on the SVM by allowing more flexible decision boundaries. However, like SVMs, infinite SVMs model each class separately, which restricts them to classifying one word at a time. A generalisation of the SVM is the structured SVM, whose classes can be sequences of words that share parameters. This paper studies a combination of Bayesian non-parametrics and structured models. One specific instance, called the infinite structured SVM, is discussed in detail; it brings the advantages of the infinite SVM to continuous speech recognition.

Chao Zhang, Phil Woodland

Standalone Training of Context-Dependent Deep Neural Network Acoustic Models

Recently, context-dependent (CD) deep neural network (DNN) hidden Markov models (HMMs) have been widely used as acoustic models for speech recognition. However, the standard method to build such models requires target training labels from a system using HMMs with Gaussian mixture model output distributions (GMM-HMMs). In this paper, we introduce a method for training state-of-the-art CD-DNN-HMMs without relying on such a pre-existing system. We achieve this in two steps: build a context-independent (CI) DNN iteratively with word transcriptions, and then cluster the equivalent output distributions of the untied CD-DNN HMM states using the decision tree based state tying approach. Experiments have been performed on the Wall Street Journal corpus and the resulting system gave comparable word error rates (WER) to CD-DNNs built based on GMM-HMM alignments and state-clustering.

Takuya Yoshioka, Xie Chen, and Mark Gales

Impact Of Single-Microphone Dereverberation On DNN-Based Meeting Transcription Systems

Over the past few decades, a range of front-end techniques have been proposed to improve the robustness of automatic speech recognition systems against environmental distortion. While these techniques are effective for small tasks consisting of carefully designed data sets, especially when used with a classical acoustic model, there has been limited evidence that they are useful for a state-of-the-art system with large scale realistic data. This paper focuses on reverberation as a type of distortion and investigates the degree to which dereverberation processing can improve the performance of various forms of acoustic models based on deep neural networks (DNNs) in a challenging meeting transcription task using a single distant microphone. Experimental results show that dereverberation improves the recognition performance regardless of the acoustic model structure and the type of the feature vectors input into the neural networks, providing additional relative improvements of 4.7% and 4.1% to our best configured speaker-independent and speaker adaptive DNN-based systems, respectively.

Pirros Tsiakoulis, Catherine Breslin, Milica Gasic, Matthew Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Steve Young

Dialogue Context Sensitive HMM-Based Speech Synthesis

The focus of this work is speech synthesis tailored to the needs of spoken dialogue systems. More specifically, the framework of HMM-based speech synthesis is utilized to train an emphatic voice that also considers dialogue context for decision tree state clustering. To achieve this, we designed and recorded a speech corpus comprising system prompts from human-computer interaction, as well as additional prompts for slot-level emphasis. This corpus, combined with a general purpose text-to-speech one, was used to train voices using a) baseline context features, b) additional emphasis features, and c) additional dialogue context features. Both emphasis and dialogue context features are extracted from the dialogue act semantic representation. The voices were evaluated in pairs for dialogue appropriateness using a preference listening test. The results show that the emphatic voice is preferred to the baseline when emphasis markup is present, while the dialogue context-sensitive voice is preferred to the plain emphatic one when no emphasis markup is present and preferable to the baseline in both cases. This demonstrates that including dialogue context features for decision tree state clustering significantly improves the quality of the synthetic voice for dialogue.

Takuya Yoshioka, Anton Ragni, Mark J. F. Gales

Investigation Of Unsupervised Adaptation of DNN Acoustic Models With Filter Bank Input

Adaptation to speaker variations is an essential component of speech recognition systems. One common approach to adapting deep neural network (DNN) acoustic models is to perform global constrained maximum likelihood linear regression (CMLLR) at some point in the system. Using CMLLR (or more generally, generative approaches) is advantageous especially in unsupervised adaptation scenarios with high baseline error rates. On the other hand, as DNNs are less sensitive than GMMs to an increase in the input dimensionality, it is becoming more popular to use rich speech representations, such as log mel-filter bank channel outputs, instead of conventional low-dimensional feature vectors, such as MFCCs and PLP coefficients. This work discusses and compares three different configurations of DNN acoustic models that allow CMLLR-based speaker adaptive training (SAT) to be performed in systems with filter bank inputs. Results of unsupervised adaptation experiments conducted on three different data sets are presented, demonstrating that, by choosing an appropriate configuration, SAT with CMLLR can improve the performance of a well-trained filter bank-based speaker independent DNN system by 10.6% relative in a challenging task with a baseline error rate above 40%. It is also shown that the filter bank features are more advantageous than the conventional features even when they are used with SAT models. Some other insights are also presented, including the effects of block diagonal transforms and system combination.
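As background for the abstract above, global CMLLR amounts to applying a single speaker-specific affine transform, x' = Ax + b, to every feature vector of that speaker. The sketch below shows only this application step. Estimating A and b (by maximum likelihood under a generative model) is not shown, and the function name `apply_cmllr` and the toy values are hypothetical; they are not taken from the paper.

```python
def apply_cmllr(frames, A, b):
    """Apply one global affine transform x' = A x + b to every frame.

    frames: list of feature vectors (lists of floats) for one speaker.
    A:      square matrix as a list of rows (ASSUMPTION: already estimated
            elsewhere; the estimation step is not part of this sketch).
    b:      bias vector with one entry per row of A.
    """
    dim = len(b)
    if len(A) != dim or any(len(row) != dim for row in A):
        raise ValueError("A must be a square matrix matching the size of b")

    transformed = []
    for x in frames:
        if len(x) != dim:
            raise ValueError("frame dimension does not match the transform")
        # Each output component is one row of A dotted with x, plus the bias.
        transformed.append([
            sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)
        ])
    return transformed


# Toy 2-dimensional example with made-up numbers.
A = [[2.0, 0.0],
     [0.0, 0.5]]
b = [1.0, -1.0]
adapted = apply_cmllr([[1.0, 2.0], [3.0, 4.0]], A, b)
```

A block diagonal transform, as mentioned in the abstract, would correspond to an A whose off-block entries are zero, which reduces the number of parameters to estimate per speaker.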

This talk is part of the CUED Speech Group Seminars series.


