Representing Acoustics with Mel Frequency Cepstral Coefficients
Lecture 7
Spoken Language Processing
Prof. Andrew Rosenberg

Representing Acoustic Information
• 16-bit samples, 44.1kHz sampling rate
  – ~86kB/sec
  – ~5MB/min
• Waves repeat, so much of this data is redundant.
• A good representation of speech (for recognition)
  – Keeps all of the information needed to discriminate between phones
  – Is compact, i.e. gets rid of everything else

Frame Based analysis
• Using a short window of analysis, analyze the waveform every 10ms (or at another analysis rate).
• Usually performed with overlapping windows.
• e.g. FFT and spectrogram

Overlapping frames
• Spectrograms allow for visual inspection of spectral information.
• We are looking for a compact, numerical representation.
[Figure: overlapping analysis frames spaced 10ms apart]

Example Spectrogram
[Figure: example spectrogram]

Standard Representation in the field
• Mel Frequency Cepstral Coefficients (MFCC)
[Figure: MFCC pipeline. Stages: pre-emphasis, window, FFT, Mel filter bank, log, FFT⁻¹, deltas. Outputs: 12 MFCC, 12 ∆ MFCC, 12 ∆∆ MFCC, 1 energy, 1 ∆ energy, 1 ∆∆ energy]

Pre-emphasis
• In the spectrum of voiced segments, there is more energy at lower frequencies than at higher frequencies.
• Boosting the high frequencies makes the high-frequency information more available.
  – A first-order high-pass filter is used for pre-emphasis.

Windowing
• Overlapping windows allow analysis centered at a frame point, while using more information.

Hamming Windowing
• Discontinuities at the edges of the window can cause problems for the FFT.
• A Hamming window smooths out the edges.

Discrete Fourier Transform
• The Fast Fourier Transform (FFT) is an efficient algorithm for calculating the Discrete Fourier Transform (DFT).
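
The front-end steps covered so far (pre-emphasis, overlapping frames, Hamming windowing, DFT) can be sketched in plain Python. This is a minimal illustration, not the lecture's implementation: the function names, the pre-emphasis coefficient 0.97, and the 25ms window length are assumptions chosen for the example; only the 10ms frame step comes from the slides. The DFT is written as the naive O(N²) sum for readability; an FFT would return the same values faster.

```python
import cmath
import math


def pre_emphasize(samples, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
    alpha=0.97 is an assumed, commonly used value."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


def hamming(length):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def frames(samples, frame_len, step):
    """Cut the signal into overlapping frames, one every `step` samples."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, step)]


def dft_magnitudes(frame):
    """Naive DFT magnitudes for bins 0..N/2 (what an FFT computes faster)."""
    n_pts = len(frame)
    mags = []
    for k in range(n_pts // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_pts)
                  for n, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def frame_spectra(samples, sample_rate, window_ms=25, step_ms=10, alpha=0.97):
    """Pre-emphasize, frame, window, and transform; one spectrum per frame."""
    frame_len = int(sample_rate * window_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    window = hamming(frame_len)
    emphasized = pre_emphasize(samples, alpha)
    return [dft_magnitudes([x * w for x, w in zip(frame, window)])
            for frame in frames(emphasized, frame_len, step)]
```

For example, 100ms of a 1kHz sine sampled at 8kHz gives 200-sample frames with 40Hz bin spacing, so the spectral peak of every frame should fall in bin 25.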
[Figure: Australian male /i:/ from “heed”, FFT analysis window 12.8ms. Source: http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html]

Mel Filter Bank and Log
• Human hearing is not equally sensitive at all frequency regions.
• Modeling human hearing sensitivity helps phone recognition.
• MFCC approach: warp frequencies from Hz to the Mel frequency scale.
• Mel: pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels.

Mel frequency Filter bank
• Create a bank of filters, each collecting energy from one frequency band: 10 filters linearly spaced below 1000Hz, with the remaining filters logarithmically spaced above 1000Hz.

Cepstrum
• Separation of source and filter.
• Source differences are speaker dependent.
• Filter differences are phone dependent.
• Cepstrum is the “Spectrum of the Log of the Spectrum”
  – inverse DFT of the log magnitude of the DFT of the signal

Cepstrum Visualization
• Peak at 120 samples represents the glottal pulse, corresponding to the F0.
• Large values closer to zero correspond to the vocal tract filter (tongue position, jaw opening, etc.).
• It is common to take the first 12 coefficients.

Deltas and Energy
• Energy within a frame is just the sum of the power of the samples.
• The spectrum of some phones changes over time, e.g. from stop closure to stop burst, or the slope of a formant.
• Taking the delta (velocity) and double delta (acceleration) incorporates this information.

Summary: MFCC
• Commonly MFCCs have 39 features:
  12  Cepstral Coefficients
  12  Delta Cepstral Coefficients
  12  Delta Delta Cepstral Coefficients
   1  Energy Coefficient
   1  Delta Energy Coefficient
   1  Delta Delta Energy Coefficient
  39  MFCC Features

Next Class
• Introduction to Statistical Modeling and Classification
• Reading: J&M 9.4, optional 6.6
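
As a recap of the Mel filter bank, log, cepstrum, and delta slides, the sketch below turns per-frame magnitude spectra into 39-dimensional feature vectors. It makes several assumptions that are not in the slides: the Hz-to-mel formula 1127 ln(1 + f/700), 26 triangular filters, a DCT-II in place of the inverse DFT (a common substitution for a real, symmetric log spectrum), and a simple two-point delta with the edge frames repeated. All function names are hypothetical.

```python
import math


def hz_to_mel(hz):
    """Assumed mel warping: mel = 1127 * ln(1 + hz / 700)."""
    return 1127.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters whose centers are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge points: filter m spans edges m-1, m, m+1.
    edges_hz = [mel_to_hz(top * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bin_hz = (sample_rate / 2.0) / (n_bins - 1)
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        weights = []
        for k in range(n_bins):
            f = k * bin_hz
            if left < f <= center:
                weights.append((f - left) / (center - left))
            elif center < f < right:
                weights.append((right - f) / (right - center))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def cepstrum(magnitudes, bank, n_ceps=12):
    """Log mel-filter energies, then DCT-II, keeping coefficients 1..n_ceps."""
    log_e = [math.log(sum(w * mag * mag for w, mag in zip(weights, magnitudes))
                      + 1e-10)  # small floor avoids log(0)
             for weights in bank]
    n = len(log_e)
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n)
                for j, e in enumerate(log_e))
            for c in range(1, n_ceps + 1)]


def deltas(rows):
    """Velocity: d[t] = (c[t+1] - c[t-1]) / 2, repeating the edge frames."""
    last = len(rows) - 1
    return [[(nxt - prv) / 2.0
             for nxt, prv in zip(rows[min(t + 1, last)], rows[max(t - 1, 0)])]
            for t in range(len(rows))]


def mfcc_features(spectra, energies, sample_rate, n_filters=26):
    """12 cepstra + 1 energy per frame, plus deltas and delta-deltas (39)."""
    bank = mel_filterbank(n_filters, len(spectra[0]), sample_rate)
    static = [cepstrum(s, bank) + [e] for s, e in zip(spectra, energies)]
    d1 = deltas(static)   # velocity
    d2 = deltas(d1)       # acceleration
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

Here `spectra` is a list of per-frame magnitude spectra (bins 0 to N/2) and `energies` holds the sum of squared samples for each frame. Each returned vector is laid out as 13 static values, 13 deltas, and 13 delta-deltas, which matches the 39-feature count in the summary slide.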