Spectral Features for Automatic Text-Independent Speaker Recognition
Tomi Kinnunen
Research seminar, 27.2.2004
Department of Computer Science, University of Joensuu

Based on a True Story …
T. Kinnunen: Spectral Features for Automatic Text-Independent Speaker Recognition, Ph.Lic. thesis, 144 pages, Department of Computer Science, University of Joensuu, 2004.
Downloadable in PDF from: http://cs.joensuu.fi/pages/tkinnu/research/index.html

Introduction

Why Study Feature Extraction?
• Feature extraction is the first component in the recognition chain, so its selection strongly determines the classification accuracy

Why Study Feature Extraction? (cont.)
• Typical feature extraction methods are directly “loaned” from the speech recognition task
  Quite contradictory, considering the “opposite” nature of the two tasks
• In general, it seems that at best we are currently guessing what might be individual in our speech!
• Because it is interesting & challenging!

Principle of Feature Extraction

Studied Features
1. FFT-implemented filterbanks (subband processing)
2. FFT-cepstrum
3. LPC-derived features
4. Dynamic spectral features (delta features)

Speech Material & Evaluation Protocol
• Each test file is split into segments of T=350 vectors (about 3.5 seconds of speech)
• Each segment is classified by vector quantization
• Speaker models are constructed from the training data by the RLS clustering algorithm
• Performance measure = classification error rate (%)

1. Subband Features

Computation of Subband Features
Windowed speech frame
→ Magnitude spectrum by FFT
→ Smoothing by a filterbank
→ Nonlinear mapping of the filter outputs
→ Compressed filter outputs f = (f1, f2, … , fM)^T

Parameters of the filterbank:
• Number of subbands
• Filter shapes & bandwidths
• Type of frequency warping
• Filter output nonlinearity

Frequency Warping… What’s That?!
• The “real” frequency axis (Hz) is stretched and compressed locally according to a (bijective) warping function
[Figure: Bark scale; axes: Frequency [kHz], Frequency [Bark]]
[Figure: A 24-channel Bark-warped filterbank (shape: triangular, warping: Bark); axes: Frequency [Hz], Gain]

Discrimination of Individual Subbands (F-ratio)
(Fixed parameters: 30 linearly spaced triangular filters)
[Figures: F-ratio as a function of frequency; panels: Helsinki, TIMIT]
The low end (~0-200 Hz) and the mid/high frequencies (~2-4 kHz) are important; the region ~200-2000 Hz is less important. (However, not consistently!)

Subband Features: The Effect of the Filter Output Nonlinearity
1. Linear: f(x) = x
2. Logarithmic: f(x) = log(1 + x)
3. Cubic: f(x) = x^(1/3)
Fixed parameters: 30 linearly spaced triangular filters
[Result plots: Helsinki, TIMIT]
Consistent ordering of error rates (!): cubic < log < linear

Subband Features: The Effect of the Filter Shape
1. Rectangular
2. Triangular
3. Hanning
Fixed parameters: 30 linearly spaced filters, log-compression
[Result plots: Helsinki, TIMIT]
The differences are small and there is no consistent ordering; the filter shape is probably not as crucial as the other parameters

Subband Features: The Number of Subbands (1)
Experiment 1: From 5 to 50
Fixed parameters: linearly spaced / triangular-shaped filters, log-compression
[Result plots: Helsinki, TIMIT]
Observation: error rates decrease monotonically with an increasing number of subbands (in most cases) …

Subband Features: The Number of Subbands (2)
Experiment 2: From 50 to 250
Fixed parameters: linearly spaced / triangular-shaped filters, log-compression
Helsinki: (Almost) monotonic decrease in errors with an increasing number of subbands
TIMIT: Optimum number of bands is in the range 50..100
Differences between the corpora are (partly) explained by the discrimination curves

Discussion of the Subband Features
• The (typically used) log-compression should be replaced with cubic compression or some better nonlinearity
• Number of subbands should be relatively
high (at least 50 based on these experiments)
• The shape of the filter does not seem to be important
• Discriminative information is not evenly spaced along the frequency axis
• The relative discriminatory powers of the subbands depend on the selected speaker population/language/speech content…

2. FFT-Cepstral Features

Computation of FFT-Cepstrum
Windowed speech frame
→ Magnitude spectrum by FFT
→ Smoothing by a filterbank
→ Nonlinear mapping of the filter outputs
→ Decorrelation by DCT
→ Coefficient selection
→ Cepstrum vector c = (c1, …, cM)^T
Processing is very similar to “raw” subband processing (common steps)

FFT-Cepstrum: Type of Frequency Warping
1. Linear warping
2. Mel-warping
3. Bark-warping
4. ERB-warping
Fixed parameters: 30 triangular filters, log-compression, DCT-transformed filter outputs, 15 lowest cepstral coefficients excluding c0
[Result plots: Helsinki, TIMIT]
Helsinki: Mel-frequency warped cepstrum gives the best results on average
TIMIT: Linearly warped cepstrum gives the best results on average
Same explanation as before: discrimination curves

FFT-Cepstrum: Number of Cepstral Coefficients
(Fixed parameters: mel-frequency warped triangular filters, log-compression, DCT-transformed filter outputs, 15 lowest cepstral coefficients excluding c0, codebook size = 64)
[Result plots: Helsinki, TIMIT]
Minimum number of coefficients is around 10, rather independent of the number of filters

Discussion About the FFT-Cepstrum
• Same performance as with the subband features, but with a smaller number of features
  For computational and modeling reasons, cepstrum is the preferred method of these two in automatic recognition
• The commonly used mel-warped filterbank is not the best choice in the general case!
  There is no reason to assume that it would be, since mel-cepstrum is based on modeling of human hearing and was originally meant for speech recognition purposes
• I prefer / recommend linear frequency warping, since:
  It is easier to control the amount of resolution on desired subbands (e.g. by linear weighting).
  In nonlinear warping, the relationship between the “real” and “warped” frequency axes is more complicated

3. LPC-Derived Features

What Is Linear Predictive Coding (LPC)?
• In the time domain, the current sample is approximated as a linear combination of the past p samples: s[n] ≈ a[1] s[n-1] + a[2] s[n-2] + … + a[p] s[n-p]
• The objective is to determine the LPC coefficients a[k], k=1,…,p, such that the squared prediction error is minimized
• In the frequency domain, the LPC coefficients define an all-pole IIR filter whose poles correspond to local maxima of the magnitude spectrum
[Figure: an LPC pole]

Computation of LPC and LPC-Based Features
Windowed speech frame
→ Autocorrelation computation
→ Solving of the Yule-Walker AR equations (Levinson-Durbin algorithm)
→ LPC coefficients (LPC) and reflection coefficients (REFL)
Derived feature sets:
• Complex polynomial expansion, root-finding algorithm: line spectral frequencies (LSF)
• LPC pole finding: formants (FMT)
• Atal’s recursion: linear predictive cepstral coefficients (LPCC)
• LAR conversion: log area ratios (LAR)
• asin(.): arcus sine coefficients (ARCSIN)

Linear Prediction (LPC): Number of LPC Coefficients
[Result plots: Helsinki, TIMIT]
• Minimum number is around 15 coefficients (not consistent, however)
• Error rates are surprisingly small in general!
• LPC coefficients were used directly in a Euclidean-distance-based classifier. In the literature there is usually a warning of the following form: “Do not ever use LPC’s directly, at least with the Euclidean metric.”

Comparison of the LPC-Derived Features
Fixed parameters: LPC predictor order p = 15
[Result plots: Helsinki, TIMIT]
• Overall performance is very good
• Raw LPC coefficients give the worst performance on average (A programming bug???)
• Differences between the feature sets are rather small
Other factors to be considered:
• Computational complexity
• Ease of implementation

LPC-Derived Formants
Fixed parameters: Codebook size = 64
[Result plots: Helsinki, TIMIT]
• Formants give comparable, and surprisingly good, results!
• Why “surprisingly good”?
1. The analysis procedure was very simple (produces spurious formants)
2.
Subband processing, LPC, cepstrum, etc. describe the spectrum continuously; formants, on the other hand, pick only a discrete (and small!) number of peak amplitudes from the spectrum

Discussion About the LPC-Derived Features
• In general, the results are promising, even for the raw LPC coefficients
• The differences between the feature sets were small
  – From the implementation and efficiency viewpoint, the following are the most attractive: LPCC, LAR and ARCSIN
• Formants also give (surprisingly) good results, which indicates indirectly:
  – The regions of the spectrum with high amplitude might be important for speaker recognition
An idea for future study: how about selecting subbands around local maxima?
[Figure: magnitude spectrum; axes: Frequency [Hz], Magnitude [dB]]

4. Dynamic Features

Dynamic Spectral Features
• Dynamic feature: an estimate of the time derivative of the feature
• Can be applied to any feature
[Figure: time trajectory of the original feature; estimate of the 1st time derivative (Δ-feature); estimate of the 2nd time derivative (ΔΔ-feature)]
• Two widely used estimation methods are the differentiator and the linear regression method (M = number of neighboring frames, typically M = 1..3)
• Typical phrase: “Don’t use differentiator, it emphasizes noise”

Delta Features: Comparison of the Two Estimation Methods
Corpora: Helsinki, TIMIT. Methods: Differentiator, Regression.
Best results per panel, in the order they appear on the slide:
• Best: Δ-ARCSIN (8.1 %), M=4
• Best: Δ-LSF (7.0 %), M=1
• Best: Δ-LSF (10.6 %), M=2
• Best: Δ-ARCSIN (8.8 %), M=1

Delta Features: Comparison with the Static Features

Discussion About the Delta Features:
• The optimum order is small (in most cases M=1,2 neighboring frames)
• The differentiator method is better in most cases (surprising result, again!)
• Delta features are worse than static features but might provide uncorrelated extra information (for multiparameter recognition)
• The commonly used delta-cepstrum gives quite poor results!

Towards Concluding Remarks ...
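As a brief aside on Section 4, the two delta estimators compared there can be sketched as follows. The formulas themselves did not survive in the slide text, so this sketch assumes the commonly used definitions: the differentiator as d[t] = f[t+M] - f[t-M], and the regression estimate as the least-squares slope over 2M+1 frames. The function names and the edge handling (repeating the first and last frame) are illustrative assumptions, not taken from the thesis.

```python
from typing import List, Sequence

Frames = Sequence[Sequence[float]]


def _frame(frames: Frames, t: int) -> Sequence[float]:
    """Return frame t, repeating the first/last frame outside the valid range
    (an assumed edge-handling choice)."""
    return frames[min(max(t, 0), len(frames) - 1)]


def delta_differentiator(frames: Frames, M: int = 1) -> List[List[float]]:
    """Differentiator (assumed form): d[t] = f[t+M] - f[t-M], no normalization."""
    return [
        [hi - lo for hi, lo in zip(_frame(frames, t + M), _frame(frames, t - M))]
        for t in range(len(frames))
    ]


def delta_regression(frames: Frames, M: int = 1) -> List[List[float]]:
    """Regression (assumed form): d[t] = sum_m m*f[t+m] / sum_m m^2, m = -M..M."""
    norm = float(sum(m * m for m in range(-M, M + 1)))
    deltas = []
    for t in range(len(frames)):
        acc = [0.0] * len(frames[0])
        for m in range(-M, M + 1):
            for i, value in enumerate(_frame(frames, t + m)):
                acc[i] += m * value
        deltas.append([x / norm for x in acc])
    return deltas


# A linear ramp: each coefficient changes by a constant amount per frame.
ramp = [[2.0 * t, -1.0 * t] for t in range(6)]
diff = delta_differentiator(ramp, M=1)
reg = delta_regression(ramp, M=2)
```

On the ramp, the regression estimate recovers the per-frame slope away from the edges, while the unnormalized differentiator returns the change over 2M frames. Applying either estimator to its own output gives a ΔΔ-feature estimate.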
FFT-Cepstrum Revisited:
Question: Is Log-Compression / Mel-Cepstrum Best?
Please note: the segment length is now reduced to T=100 vectors, which is why the absolute recognition rates are worse than before (ran out of time for the thesis…)
[Result plots: Helsinki, TIMIT]
Answer: NO!

FFT- vs. LPC-Cepstrum:
Question: Is it really true that “FFT-cepstrum is more accurate”?
[Result plots: Helsinki, TIMIT]
Answer: NO! (TIMIT shows this quite clearly)

The Essential Difference Between the FFT- and LPC-Cepstra?
• FFT-cepstrum approximates the spectrum by a linear combination of cosine functions (non-parametric model)
• LPC makes a least-squares fit of the all-pole filter to the spectrum (parametric model)
• FFT-cepstrum first smooths the original spectrum by a filterbank, whereas the LPC filter is fitted directly to the original spectrum
  LPC captures more “details”
  FFT-cepstrum represents a “smooth” spectrum
However, one might argue that we could drop the filterbank from the FFT-cepstrum ...

General Summary and Discussion
• Number of subbands should be high (30-50 for these corpora)
• Number of cepstral coefficients (LPC/FFT-based) should be high (15)
• In particular, the number of subbands, the number of coefficients, and the LPC order are clearly higher than in speech recognition generally
• Formants give (surprisingly) good performance
• Number of formants should be high (8)
• In most cases, the differentiator method outperforms the regression method in delta-feature computation
All of these indicate indirectly the importance of spectral details and rapid spectral changes

“Philosophical Discussion”
• The current knowledge of speaker individuality is far from perfect:
  – Engineers concentrate on tuning complex feature compensation methods but don’t (necessarily) understand what’s individual in speech
  – Phoneticians try to find the “individual code” in the speech signal, but they don’t (necessarily) know how to apply engineers’ methods
• Why do we believe that speech would be any less individual than e.g. fingerprints?
• Compare the history of “fingerprint” and “voiceprint”:
  – Fingerprints have been studied systematically since the 17th century (1684)
  – The spectrograph wasn’t invented until 1946! How could we possibly claim that we know what speech is with less than 60 years of research?
• Why do we believe that human beings are optimal speaker discriminators? Our ear can already be fooled (e.g. by MP3 encoding).

That’s All, Folks!
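Supplementary sketch for Section 3: the first steps of the LPC chain (autocorrelation, Levinson-Durbin solution of the Yule-Walker equations, then LAR and arcus sine conversion of the reflection coefficients) might look roughly like this in plain Python. It is a minimal illustration, not the thesis implementation. The function names are hypothetical, the predictor convention s[n] ≈ sum_k a[k] s[n-k] and the LAR sign convention log((1+k)/(1-k)) are assumptions (both vary in the literature), and windowing is left to the caller.

```python
import math
from typing import List, Sequence, Tuple


def autocorrelation(frame: Sequence[float], p: int) -> List[float]:
    """r[k] = sum_n frame[n] * frame[n+k] for lags k = 0..p."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k)) for k in range(p + 1)]


def levinson_durbin(r: Sequence[float], p: int) -> Tuple[List[float], List[float], float]:
    """Solve the Yule-Walker equations recursively for predictor order p.

    Returns (a, refl, err): predictor coefficients a[1..p], reflection
    coefficients k[1..p], and the final prediction error energy.
    Assumed convention: s[n] ~ sum_k a[k] * s[n-k].
    """
    a = [0.0] * (p + 1)  # a[0] is unused
    refl = []
    err = r[0]
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        refl.append(k)
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], refl, err


def log_area_ratios(refl: Sequence[float]) -> List[float]:
    """LAR (assumed sign convention): log((1 + k) / (1 - k)); requires |k| < 1."""
    return [math.log((1.0 + k) / (1.0 - k)) for k in refl]


def arcsin_coefficients(refl: Sequence[float]) -> List[float]:
    """ARCSIN: asin(k) applied to each reflection coefficient."""
    return [math.asin(k) for k in refl]


# Toy autocorrelation sequence (not real speech), order p = 2.
a, refl, err = levinson_durbin([2.0, 1.0, 0.0], p=2)
lar = log_area_ratios(refl)
arcsin = arcsin_coefficients(refl)
```

For a real frame, one would window the samples first, call `autocorrelation(frame, p)` with p = 15 as in the experiments above, and pass the result to `levinson_durbin`.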