Audio Segmentation for Meetings Speech Processing
Boakye, Kofi Agyeman
Technical Report Identifier: EECS-2008-170
December 18, 2008
Abstract: Perhaps more than any other domain, meetings represent a rich source of content for spoken language research and technology. Two common (and complementary) forms of meeting speech processing are automatic speech recognition (ASR), which seeks to determine what was said, and speaker diarization, which seeks to determine who spoke when. Because of the complexity of meetings, however, such forms of processing present a number of challenges. In the case of speech recognition, crosstalk speech is often the primary source of errors for audio from the personal microphones worn by participants in the various meetings. This crosstalk typically produces insertion errors in the recognizer, which mistakenly processes this non-local speech audio. With speaker diarization, overlapped speech generates a significant number of errors for most state-of-the-art systems, which are generally unequipped to deal with this phenomenon. These errors appear in the form of missed speech, where overlap segments are not identified, and increased speaker error from speaker models negatively affected by the overlapped speech data.
This thesis sought to address these issues by appropriately employing audio segmentation as a first step to both automatic speech recognition and speaker diarization in meetings. For ASR, the segmentation of nonspeech and local speech was the objective, while for speaker diarization, nonspeech, single-speaker speech, and overlapped speech were the audio classes to be segmented. A major focus was the identification of features suited to segmenting these audio classes: for crosstalk, cross-channel features were explored, while for monaural overlapped speech, energy, harmonic, and spectral features were examined. Using feature subset selection, the best combination of auxiliary features to baseline MFCCs in the former scenario consisted of normalized maximum cross-channel correlation and log-energy difference; for the latter scenario, RMS energy, harmonic energy ratio, and modulation spectrogram features were determined to be the most useful in the realistic multi-site farfield audio condition. For ASR, word error rate improved over the baseline by 13.4% relative on development data and by 9.2% relative on validation data. For speaker diarization, results proved less consistent, with a relative diarization error rate (DER) improvement of 23.25% on development data but no significant change on a randomly selected validation set. Closer inspection revealed performance variability at the meeting level, with some meetings improving substantially and others degrading. Further analysis over a large set of meetings confirmed this variability, but also showed many meetings benefiting significantly from the proposed technique.
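As an illustration of the two cross-channel features named above, the following is a minimal Python sketch of how a normalized maximum cross-channel correlation and a log-energy difference might be computed for one frame of a target (personal) microphone channel against one other channel. It is not the implementation from the report. The function names, the frame length, the `eps` floor, and in particular the normalization by the geometric mean of the two frame energies are assumptions made for this example; the report may define these quantities differently.

```python
import numpy as np


def log_energy(frame, eps=1e-10):
    """Log of the mean squared amplitude of one frame (eps avoids log(0))."""
    return float(np.log(np.mean(frame ** 2) + eps))


def log_energy_difference(target, other):
    """Log-energy of the target channel minus that of another channel.

    Large positive values suggest the speech is local to the target microphone.
    """
    return log_energy(target) - log_energy(other)


def normalized_max_xcorr(target, other, eps=1e-10):
    """Peak absolute cross-correlation over all lags, scaled to roughly [0, 1].

    Assumption: normalization by the geometric mean of the two frame
    energies. This is one common choice, not necessarily the report's.
    """
    xcorr = np.correlate(target, other, mode="full")
    norm = np.sqrt(np.sum(target ** 2) * np.sum(other ** 2)) + eps
    return float(np.max(np.abs(xcorr)) / norm)


if __name__ == "__main__":
    # Synthetic frames: 'crosstalk' is an attenuated, delayed copy of 'local'.
    rng = np.random.default_rng(0)
    local = rng.standard_normal(400)
    crosstalk = 0.1 * np.roll(local, 5)
    print(normalized_max_xcorr(local, crosstalk))
    print(log_energy_difference(local, crosstalk))
```

In this sketch, a high correlation together with a large positive log-energy difference would point to speech that is local to the target channel and merely leaking into the other one. Such per-frame values would be appended to the MFCC vector as auxiliary features before segmentation.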