Finding Difficult Speakers in Automatic Speaker Recognition
Stoll, Lara Lynn
Technical Report Identifier: EECS-2011-152
December 16, 2011
Abstract: The task of automatic speaker recognition, wherein a system verifies or determines a speaker's identity using a sample of speech, has been studied for a few decades. In that time, a great deal of progress has been made in improving the accuracy of the system's decisions. In general, errors can be expected to have one or more causes, involving both intrinsic and extrinsic factors. Extrinsic factors correspond to external influences, including reverberation, noise, and channel or microphone effects. Intrinsic factors relate inherently to the speaker himself, and include sex, age, dialect, accent, emotion, speaking style, and other voice characteristics. I investigate the phenomenon that some speakers within a given population have a tendency to cause a large proportion of errors, and explore ways of finding such speakers.
First, I establish the dependence of system performance on speaker characteristics by exploring two data sets: one that is an older collection of telephone channel conversational speech, and one that is a more recent collection of conversational speech recorded on a variety of channels, including the telephone, as well as various types of microphones. In addition to considering a traditional speaker recognition system approach, for the second data set I utilize the outputs of a more contemporary approach that is better able to handle variations in channel. The results of such analysis repeatedly show variations in behavior across speakers, both for true speaker and impostor speaker cases. Variation occurs both at the level of speech utterances, wherein a given speaker's performance can depend on which of his speech utterances is used, as well as on the speaker level, wherein some speakers have overall tendencies to cause false rejection or false alarm errors. Lamb-ish speaker behavior (where the speaker tends to produce false alarms as the target) is correlated with wolf-ish behavior (where the speaker tends to produce false alarms as the impostor). On the more recent data set, 50% of the false rejection and false alarm errors are caused by only 15-25% of the speakers.
Next, I investigate an approach to predict speakers that will be difficult for a system. I use a variety of features to calculate feature statistics that are used to compute a measure of similarity between speaker pairs. By ranking these similarity measures for a set of impostor speaker pairs, I determine those speaker pairs that are easy- or difficult-to-distinguish. A variety of these simple distance measures could successfully select both easy- and difficult-to-distinguish speaker pairs, as evaluated by differences in detection cost and false alarm probability across a large number of systems. Of the performance measures tested, the best was the Euclidean distance between vectors of the mean first, second, and third formant frequencies. Even greater success was attained by the Kullback-Liebler (KL) divergence between pairs of speaker-specific GMMs. An examination of the smallest and biggest distances (as computed by the KL divergence) revealed individual speaker tendencies to consistently fall among the most (or least) difficult-to-distinguish speaker pairs.
I then develop an approach for finding those individual speakers who will be difficult for the system, using a set of feature statistics calculated over regions of speech. A support vector machine (SVM) classifier is trained to distinguish between difficult and easy speaker examples, in order to produce an overall measure of speaker difficulty as a target or impostor. The resulting precision and recall measures were over 0.8 for difficult impostor speaker detection, and over 0.7 for difficult target speaker detection.