Speech Representations - MFCC

Representing Acoustics
with Mel Frequency
Cepstral Coefficients
Lecture 7
Spoken Language Processing
Prof. Andrew Rosenberg
Representing Acoustic Information
• 16-bit samples at a 44.1 kHz sampling rate
– ~86kB/sec
– ~5MB/min
• Waves repeat, so much of this data is redundant.
• A good representation of speech (for recognition)
– Keeps all of the information needed to discriminate between phones
– Is compact, i.e., gets rid of everything else
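The data-rate figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a single (mono) channel and binary units (1 kB = 1024 bytes), neither of which the slide states explicitly.

```python
# Back-of-the-envelope data rate for uncompressed PCM audio.
BITS_PER_SAMPLE = 16
SAMPLE_RATE_HZ = 44_100
CHANNELS = 1  # assumption: the slide's figures match a single channel

bytes_per_second = SAMPLE_RATE_HZ * CHANNELS * BITS_PER_SAMPLE // 8
bytes_per_minute = bytes_per_second * 60

# Assumption: "kB" and "MB" on the slide are binary units.
kib_per_second = bytes_per_second / 1024
mib_per_minute = bytes_per_minute / 1024**2

print(f"{bytes_per_second} bytes/s = {kib_per_second:.1f} KiB/s")
print(f"{bytes_per_minute} bytes/min = {mib_per_minute:.2f} MiB/min")
```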
Frame Based analysis
• Using a short analysis window, analyze the waveform every 10 ms (or at another analysis rate)
• Usually performed with overlapping
windows.
• e.g. FFT and Spectrogram
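Frame-based analysis with overlapping windows might be sketched as follows. The 10 ms hop comes from the slide; the 25 ms window length, the 16 kHz sampling rate, and the helper name `frame_signal` are assumptions chosen for illustration.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    hop_len = int(round(sample_rate * hop_ms / 1000))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i holds samples [i * hop_len, i * hop_len + frame_len).
    starts = hop_len * np.arange(n_frames)[:, None]
    offsets = np.arange(frame_len)[None, :]
    return np.asarray(signal)[starts + offsets]

sample_rate = 16_000                           # assumed sampling rate
signal = np.arange(sample_rate, dtype=float)   # 1 s dummy signal
frames = frame_signal(signal, sample_rate)
print(frames.shape)
```

Because the window (25 ms) is longer than the hop (10 ms), consecutive frames share 15 ms of samples.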
Overlapping frames
• Spectrograms allow for visual inspection of
spectral information.
• We are looking for a compact, numerical
representation
[Figure: overlapping analysis frames, each offset by 10 ms]
Example Spectrogram
Standard Representation in the field
• Mel Frequency Cepstral Coefficients
– MFCC
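[Figure: MFCC extraction block diagram]
• Main path: Pre-emphasis → window → FFT → Mel filter bank → log → inverse FFT → 12 MFCC → deltas
• Energy path: window → energy → deltas
• Output per frame:
– 12 MFCC, 12 ∆ MFCC, 12 ∆∆ MFCC
– 1 energy, 1 ∆ energy, 1 ∆∆ energy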
Pre-emphasis
• Looking at the spectrum of voiced segments, there is more energy at the lower frequencies than at the higher frequencies.
• Boosting high frequencies helps make the
high frequency information more available.
– First-order high-pass filter for pre-emphasis.
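A first-order high-pass pre-emphasis filter is commonly written as y[n] = x[n] - a * x[n-1]. The sketch below assumes a = 0.97, a typical choice that the slide does not specify, and the function name `pre_emphasize` is hypothetical.

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    signal = np.asarray(signal, dtype=float)
    # The first sample has no predecessor, so it is passed through unchanged.
    return np.concatenate(([signal[0]], signal[1:] - alpha * signal[:-1]))

low_freq = np.ones(6)                               # constant (0 Hz) signal
high_freq = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])  # fastest alternation

print(pre_emphasize(low_freq))    # slow content is strongly attenuated
print(pre_emphasize(high_freq))   # fast content is boosted
```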
Windowing
• Overlapping windows allow analysis
centered at a frame point, while using
more information.
Hamming Windowing
• Discontinuities at the edge of the window can
cause problems for the FFT
• Hamming window smooths out the edges.
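The effect of the Hamming window on frame edges can be seen in a small sketch. The window formula is the standard Hamming definition; the frame length (400 samples, i.e. 25 ms at 16 kHz) and the test sinusoid are assumptions for illustration.

```python
import numpy as np

frame_len = 400  # assumed: 25 ms at 16 kHz
n = np.arange(frame_len)
# Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))

# A sinusoid that does not complete a whole number of cycles in the frame,
# so the raw frame starts and ends at very different values.
frame = np.cos(2 * np.pi * 3.5 * n / frame_len)
windowed = frame * window

print("raw edges:     ", frame[0], frame[-1])
print("windowed edges:", windowed[0], windowed[-1])
```

The window tapers both ends of the frame to 8% of their original amplitude, which reduces the edge discontinuity seen by the FFT.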
Discrete Fourier Transform
• The Fast Fourier Transform (FFT) is an efficient algorithm for computing the Discrete Fourier Transform (DFT).
Australian male /i:/ from “heed” FFT analysis window 12.8ms
http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html
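The relationship between the DFT and the FFT can be illustrated by comparing a direct O(N^2) DFT with NumPy's FFT on one windowed frame. The sampling rate, FFT size, and 1 kHz test tone are assumptions, and `naive_dft` is a hypothetical helper written only for the comparison.

```python
import numpy as np

def naive_dft(x):
    """Direct DFT: X[k] = sum_m x[m] * exp(-2j*pi*k*m/N); O(N^2) work."""
    n = len(x)
    k = np.arange(n)
    dft_matrix = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return dft_matrix @ x

sample_rate = 16_000   # assumed
n_fft = 512
t = np.arange(n_fft) / sample_rate
tone_hz = 1000.0       # test tone; falls exactly on bin 32 with these settings
frame = np.sin(2 * np.pi * tone_hz * t) * np.hamming(n_fft)

# The FFT gives the same result as the direct DFT, only faster.
same_result = np.allclose(naive_dft(frame), np.fft.fft(frame))

# For real input, rfft returns only the non-negative frequency bins.
power = np.abs(np.fft.rfft(frame)) ** 2
freqs = np.fft.rfftfreq(n_fft, d=1 / sample_rate)
peak_hz = freqs[np.argmax(power)]
print(same_result, power.shape, peak_hz)
```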
Mel Filter Bank and Log
• Human hearing is not equally sensitive at
all frequency regions.
• Modeling human hearing sensitivity helps
phone recognition.
• MFCC approach: Warp frequencies from
Hz to Mel frequency scale.
• Mel: a unit of pitch defined so that pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels.
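One widely used closed-form approximation of the Hz-to-mel warping is mel = 2595 * log10(1 + f / 700). The slides do not say which formula is intended, so the sketch below is an assumption; the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Approximate mel value for a frequency in Hz (assumed formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 1000 Hz step spans fewer mels at high frequencies than at low ones.
low_step = hz_to_mel(1000.0) - hz_to_mel(0.0)
high_step = hz_to_mel(8000.0) - hz_to_mel(7000.0)
print(low_step, high_step)
```

With this formula 1000 Hz maps to roughly 1000 mels, and the scale is close to linear below that point and close to logarithmic above it.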
Mel frequency Filter bank
• Create a bank of filters collecting energy from each frequency band: 10 filters linearly spaced below 1000 Hz, and the remaining filters logarithmically spaced above 1000 Hz.
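A filter bank of this kind might be sketched as below. Note the swap: instead of the explicit "10 linear filters, then logarithmic" layout described above, this sketch spaces triangular filters evenly on the mel scale using the assumed formula mel = 2595 * log10(1 + f / 700), which gives a similar near-linear then near-logarithmic spacing. The filter count (26), FFT size, sampling rate, and all function names are assumptions.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Return an (n_filters, n_fft // 2 + 1) matrix of triangular filters."""
    # n_filters + 2 edge points, equally spaced in mels from 0 Hz to Nyquist.
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

    bank = np.zeros((n_filters, len(bin_freqs)))
    for i in range(n_filters):
        left, center, right = hz_edges[i:i + 3]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        # Triangle: rises from left to center, falls from center to right,
        # and is zero outside [left, right].
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

bank = mel_filter_bank(n_filters=26, n_fft=512, sample_rate=16_000)
power = np.ones(257)                           # dummy flat power spectrum
log_mel_energies = np.log(bank @ power + 1e-10)  # small offset avoids log(0)
print(bank.shape, log_mel_energies.shape)
```

Each row of `bank` collects energy from one band; multiplying by a frame's power spectrum and taking the log gives the log mel energies used in the next step.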
Cepstrum
• Separation of source and filter.
• Source differences are speaker
dependent
• Filter differences are phone dependent.
• Cepstrum is the “Spectrum of the Log of the Spectrum”: the inverse DFT of the log magnitude of the DFT of the signal.
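The definition above translates almost directly into code. The sketch below also shows why the log matters: multiplying the signal by a gain becomes an addition in the cepstrum and only affects the zeroth coefficient. The small `eps` offset and the random test frame are assumptions added for numerical safety and illustration.

```python
import numpy as np

def real_cepstrum(frame, eps=1e-10):
    """Inverse DFT of the log magnitude of the DFT of `frame`."""
    spectrum = np.fft.fft(frame)
    log_magnitude = np.log(np.abs(spectrum) + eps)  # eps avoids log(0)
    # The log magnitude of a real signal is symmetric, so the inverse DFT is
    # real up to rounding error; drop the tiny imaginary part.
    return np.fft.ifft(log_magnitude).real

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)
gain = 4.0

difference = real_cepstrum(gain * frame) - real_cepstrum(frame)
print(difference[0], np.log(gain))        # gain shows up only in c[0]
print(np.max(np.abs(difference[1:])))     # all other coefficients unchanged
```

The same mechanism turns the product of source and filter spectra into a sum, which is what lets the two be separated.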
Cepstrum Visualization
• The peak at 120 samples represents the glottal pulse period, corresponding to F0
• Large values at low quefrency (closer to zero) correspond to the vocal tract filter (tongue position, jaw opening, etc.)
• It is common to take the first 12 coefficients
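The pitch peak can be reproduced on a synthetic signal. This sketch builds an impulse train with a 120-sample period (the value from the slide), passes it through a toy decaying filter standing in for the vocal tract, and locates the cepstral peak. The 16 kHz sampling rate, the toy filter, the search range, and the choice to skip c[0] when keeping 12 coefficients are all assumptions.

```python
import numpy as np

sample_rate = 16_000   # assumed; the slide gives the peak only in samples
period = 120           # glottal pulse spacing in samples, as on the slide
n_samples = 1200       # ten pitch periods

# Source: an impulse train. Filter: a toy decaying impulse response that
# stands in for the vocal tract.
source = np.zeros(n_samples)
source[::period] = 1.0
vocal_tract = 0.9 ** np.arange(period)
signal = np.convolve(source, vocal_tract)[:n_samples]

log_magnitude = np.log(np.abs(np.fft.fft(signal)) + 1e-10)
cepstrum = np.fft.ifft(log_magnitude).real

# Search for the pitch peak away from the low-quefrency (vocal tract) region.
search = np.arange(50, 200)
peak_quefrency = search[np.argmax(cepstrum[search])]
f0_hz = sample_rate / peak_quefrency

# Low-quefrency coefficients describe the filter; keep 12 of them.
# Assumption: c[0] is skipped here because energy is handled separately.
envelope_coeffs = cepstrum[1:13]
print(peak_quefrency, f0_hz, envelope_coeffs.shape)
```

Truncating to the low-quefrency coefficients discards the pitch peak, which is the speaker-dependent source information the representation aims to remove.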
Deltas and Energy
• Energy within a frame is just the sum of the squared sample values.
• The spectrum of some phones changes over time, e.g., from stop closure to stop burst, or the slope of a formant.
• Taking the delta (velocity) and the double delta (acceleration) incorporates this information.
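Frame energy and delta features might be computed as in the sketch below. The central-difference delta is one simple choice (regression over a wider window is also common); the slides do not specify which is used, so treat the formula, the edge handling, and the function names as assumptions.

```python
import numpy as np

def frame_energy(frames):
    """Sum of squared sample values in each frame (one frame per row)."""
    return np.sum(np.asarray(frames, dtype=float) ** 2, axis=1)

def deltas(features):
    """Central-difference slope over time: d[t] = (c[t+1] - c[t-1]) / 2."""
    features = np.asarray(features, dtype=float)
    # Repeat the first and last frames so every frame has two neighbours.
    padded = np.concatenate([features[:1], features, features[-1:]], axis=0)
    return (padded[2:] - padded[:-2]) / 2.0

# Toy static features: 5 frames x 13 values (12 cepstra + 1 energy).
static = np.arange(5, dtype=float)[:, None] ** 2 * np.ones((1, 13))
delta = deltas(static)            # velocity
delta_delta = deltas(delta)       # acceleration
features = np.hstack([static, delta, delta_delta])

print(frame_energy([[1.0, 2.0], [3.0, 4.0]]))
print(features.shape)
```

Stacking 13 static values with their deltas and double deltas gives 39 values per frame.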
Summary: MFCC
• Commonly MFCCs have 39 features:
– 12 cepstral coefficients
– 12 delta cepstral coefficients
– 12 delta-delta cepstral coefficients
– 1 energy coefficient
– 1 delta energy coefficient
– 1 delta-delta energy coefficient
Next Class
• Introduction to Statistical Modeling and
Classification
• Reading: J&M 9.4, optional 6.6