S`(f) = Y(f)

Background Noise
• Definition: an unwanted sound or an unwanted
perturbation to a wanted signal
• Examples:
Clicks from microphone synchronization
Ambient noise level: background noise
Roadway noise
Additional speakers
Background activities: TV, Radio, dog barks, etc.
– Classifications
• Stationary: doesn’t change with time (i.e. fan)
• Non-stationary: changes with time (i.e. door closing, TV)
Noise Spectrums
Power measured relative to frequency f
• White Noise: constant over range of f
• Pink Noise: Decreases by 3db per octave; perceived equal across f
but actually proportional to 1/f
• Brown(ian): Decreases proportional to 1/f2 per octave
• Red: Decreases with f (either pink or brown)
• Blue: increases proportional to f
• Violet: increases proportional to f2
• Gray: proportional to a psycho-acoustical curve
• Orange: bands of 0 around musical notes
• Green: noise of the world; pink, with a bump near 500 HZ
• Black: 0 everywhere except 1/fβ where β>2 in spikes
• Colored: Any noise that is not white
Audio samples: http://en.wikipedia.org/wiki/Colors_of_noise
Signal Processing Information Base: http://spib.rice.edu/spib.html
• ASR:
– Prevent significant degradation in noisy environments
– Goal: Minimize recognition degradation with noise present
• Sound Editing and Archival:
– Improve intelligibility of audio recordings
– Goals: Eliminate noise that is perceptible; recover audio from
old wax recordings
• Mobile Telephony:
– Transmission of audio in high noise environments
– Goal: Reduce transmission requirements
• Comparing audio signals
– A variety of digital signal processing applications
– Goal: Normalize audio signals for ease of comparison
Signal to Noise Ratio (SNR)
• Definition: Power ratio between a signal and noise
that interferes.
• Standard Equation in decibels:
SNRdb = 10 log(A Signal/ANoise)2 N= 20 log(Asignal/Anoise)
• For digitized speech
SNRf = P(signal)/P(noise) = 10 log(∑n=0,N-1sf(n)2/nf(x)2)
where sf is an array holding samples from
frame, f; and nf is an array of noise samples.
• Note: if sf(n) = nf(x), SNRf = 0
Stationary Noise Suppression
• Requirements
– low residual noise
– low signal distortion
– low complexity (efficient calculation)
• Problems
– Tradeoff between removing noise and distorting the signal
– More noise removal normally increases the signal distortion
• Popular approaches
– Time domain: Moving average filter (distorts frequency domain)
– Frequency domain: Spectral Subtraction
– Time domain: Weiner filter (autoregressive)
Auto regression
• Definition: An autoregressive process is one where
a value can be determined by a linear combination of
previous values
• Formula: Xt = c + ∑0,P-1ai Xt-i + nt
• This is linear prediction; noise is the residue
– Convolute the signal with the linear coefficient
coefficients to create a new signal
– Disadvantage: The fricative sounds, especially
those that are unvoiced, are distorted by the
Spectral Subtraction
• Noisy signal: yt = st + nt where st is the clean signal and nt is
additive noise
• Therefore: yt = xt – nt and estimated y’t = xt – n’t
• Algorithm (Estimate Noise from segments without speech)
Compute FFT to compute X(f)
IF not speech THEN
Adaptively adjust the previous noise spectrum estimate N(f)
ELSE FOR EACH frequency bin: Y’(f)2 = (|Y(f)|a – |N(f)|a)1/a
Perform an inverse FFT to produce a filtered signal
• Note: (|Y(f)|a – |N’(f)|a)1/a is a generalization of (|Y(f)|2 – |N’(f)|2)½
S. F. Boll, “Suppression of acoustic noise in speech using
spectral subtraction," IEEE Trans. Acoustics, Speech, Signal
Processing, vol. ASSP-27, Apr. 1979.
Spectral Subtraction Block Diagram
Note: Gain refers to the factor to apply to the frequency bins
• Noise is relatively stationary
– within each segment of speech
– The estimate in non-speech segments is a valid predictor
• The phase differences between the noise signal and
the speech signal can be ignored
• The noise is a linear signal
• There is no correlation between the noise and
speech signals
• There is no correlation between noise in the current
sample with noise in previous samples
Implementation Issues
1. Question: How do we estimate the noise?
Answer: Use the frequency distribution during times when no voice is
2. Question: How do we know when voice is present?
Answer: Use Voice Activity Detection algorithms (VAD)
3. Question: Even if we know the noise amplitudes, what about phase
differences between the clean and noisy signals?
Answer: Since human hearing largely ignores phase differences, assume
the phase of the noisy signal.
4. Question: Is the noise independent of the signal?
Answer: We assume that it is.
5. Question: Are noise distributions really stationary?
Answer: We assume yes.
Phase Distortions
• Problem: We don’t know how much of the phase in an FFT
is from noise and from speech
• Assumption: The algorithm assumes the phase of both are
the same (that of the noisy signal)
• Result: When SNR approaches 0db the noise filtered audio
has an hoarse sounding voice
• Why: The phase assumption means that the expected noise
magnitude is incorrectly calculated
• Conclusion: There is a limit to spectral subtraction utility
when SNR is close to zero
• The signal is typically framed with a 50% overlap
• Rectangular windows lead to significant echoes in the filtered
noise reduced signal
• Solution: Overlapping windows by 50% using Bartlet (triangles), Hanning,
Hamming, or Blackman windows reduces this effect
• Algorithm
– Extract frame of a signal and apply window
– Perform FFT, spectral subtraction, and inverse FFT
– Add inverse FFT time domain to the reconstructed signal
• Note: Hanning tends to work best for this
application because with 50% overlap,
Hanning windows do not alter the power
of the original signal power on reconstruction
Musical noise
Definition: Random isolated tone bursts across the frequency.
Why? Subtraction could cause some bins to have negative power
Solution: Most implementations set frequency bin magnitudes to zero if
noise reduction would cause them to become negative
Green dashes: noisy signal, Solid line: noise estimate
Black dots: projected clean signal
• Advantages: Easy to understand and implement
• Disadvantages
– The noise estimate is not exact
• When too high, speech portions will be lost
• When too low, some noise remains
• When a noise frequency exceeds the noisy sound
frequency, a negative frequency results
– Incorrect assumptions: Negligible with large SNR
values; significant impact with small SNR values.
Ad hoc Enhancements
• Eliminate negative frequencies:
– S’(f) = Y(f)( max{1 – (|N’(f)|/Y(f))a )1/a, t}
– Result: minimize the source of musical noise
• Reduce the noise estimate
– S’(f) = Y(f)( max{1 – b(|N’(f)|/Y(f))a )1/a, t}
– Apply different constants for a, b, t in different frequency bands
• Turn to psycho-acoustical methods: Don’t attempt to adjust
masked frequencies
• Maximum likeliood: S’(f) = Y(f)( max{½–½(|N’(f)|/Y(f))a )1/a,t}
• Smooth spectral subtractions over adjacent time periods:
GS(p) = λFGS(p-1)+(1-λF)G(p)
• Exponentially average noise estimate over frames
|W (m,p)|2 = λN|W(m,p-1)|2 + (1-λN)|X(m,p)2, m = 0,…,M-
Acoustic Noise Suppression
• Take advantage of the properties of human
hearing related to masking
• Preserve only the relevant portions of the
speech signal
• Don’t attempt to remove all noise, only that
which is audible
• Utilize: Mel or Bark Scales
• Perhaps utilize overlapping filter banks in the
time domain
Acoustical Effects
• Characteristic Frequency (CF): The frequency that causes
maximum response at a point of the Basilar Membrane
• Saturation: Neuron exhibit a maximum response for 20 ms and
then decrease to a steady state, recovering a short time after
the stimulus is removed
• Masking effects: can be simultaneous or temporal
– Simultaneous: one signal drowns out another
– Temporal: One signal masks the ones that follow
– Forward: still audible after masker removed (5ms–150ms)
– Back: weak signal masked from a strong one following (5ms)
Threshold of Hearing
• The limit of the internal noise of the auditory system
• Tq(f) = 3.64(f/1000)-0.8 – 6.5e-0.6(f/1000-3:3)^2 + 10-3(f/1000)4 (dB SPL)
Non Stationary Noise
• Example: A door slamming, a clap
– Characterized by sudden rapid changes in: Time Domain
signal, Energy, or in the Frequency domain
– Large amplitudes outside the normal frequency range
– Short duration in time
– Possible solutions: compare to energy, correlation,
frequency of previous frames and delete frames
considered to contain non-stationary noise
• Example: cocktail party (background voices)
– What would likely happen to happen in the frequency
domain? How about in the time domain? How to minimize
the impact? Any Ideas?

similar documents