No Cover Image

E-Thesis 300 views 108 downloads

Objective assessment of speech intelligibility. / Wei Ming Liu

Swansea University Author: Wei Ming Liu

Abstract

This thesis addresses the topic of objective speech intelligibility assessment. Speech intelligibility is becoming an important issue due most possibly to the rapid growth in digital communication systems in recent decades; as well as the increasing demand for security-based applications where intel...

Full description

Published: 2008
Institution: Swansea University
Degree level: Doctoral
Degree name: Ph.D
URI: https://cronfa.swan.ac.uk/Record/cronfa42738
Tags: Add Tag
No Tags, Be the first to tag this record!
Abstract: This thesis addresses the topic of objective speech intelligibility assessment. Speech intelligibility is becoming an important issue due most possibly to the rapid growth in digital communication systems in recent decades; as well as the increasing demand for security-based applications where intelligibility, rather than the overall quality, is the priority. Afterall, the loss of intelligibility means that communication does not exist. This research sets out to investigate the potential of automatic speech recognition (ASR) in intelligibility assessment, the motivation being the obvious link between word recognition and intelligibility. As a pre-cursor, quality measures are first considered since intelligibility is an attribute encompassed in overall quality. Here, 9 prominent quality measures including the state-of-the-art Perceptual Evaluation of Speech Quality (PESQ) are assessed. A large range of degradations are considered including additive noise and those introduced by coding and enhancement schemes. Experimental results show that apart from Weighted Spectral Slope (WSS), generally the quality scores from all other quality measures considered here correlate poorly with intelligibility. Poor correlations are observed especially when dealing with speech-like noises and degradations introduced by enhancement processes. ASR is then considered where various word recognition statistics, namely word accuracy, percentage correct, deletion, substitution and insertion are assessed as potential intelligibility measure. One critical contribution is the observation that there are links between different ASR statistics and different forms of degradation. Such links enable suitable statistics to be chosen for intelligibility assessment in different applications. In overall word accuracy from an ASR system trained on clean signals has the highest correlation with intelligibility. However, as is the case with quality measures, none of the ASR scores correlate well in the context of enhancement schemes since such processes are known to improve machine-based scores without necessarily improving intelligibility. This demonstrates the limitation of ASR in intelligibility assessment. As an extension to word modelling in ASR, one major contribution of this work relates to the novel use of a data-driven (DD) classifier in this context. The classifier is trained on intelligibility information and its output scores relate directly to intelligibility rather than indirectly through quality or ASR scores as in earlier attempts. A critical obstacle with the development of such a DD classifier is establishing the large amount of ground truth necessary for training. This leads to the next significant contribution, namely the proposal of a convenient strategy to generate potentially unlimited amounts of synthetic ground truth based on a well-supported hypothesis that speech processings rarely improve intelligibility. Subsequent contributions include the search for good features that could enhance classification accuracy. Scores given by quality measures and ASR are indicative of intelligibility hence could serve as potential features for the data-driven intelligibility classifier. Both are in investigated in this research and results show ASR-based features to be superior. A final contribution is a novel feature set based on the concept of anchor models where each anchor represents a chosen degradation. Signal intelligibility is characterised by the similarity between the degradation under test and a cohort of degradation anchors. The anchoring feature set leads to an average classification accuracy of 88% with synthetic ground truth and 82% with human ground truth evaluation sets. The latter compares favourably with 69% achieved by WSS (the best quality measure) and 68% by word accuracy from a clean-trained ASR (the best ASR-based measure) which are assessed on identical test sets.
Keywords: Computer science.
College: Faculty of Science and Engineering