### Document

```Recognizer Issues
• Problem: Our recognizer translates the audio
to a possible string of text. How do we know
the translation is correct.
• Problem: How do we handle a string of text
containing words that are not in the
dictionary.
• Problem: How do we handle strings with valid
words, but which do not form sentences with
semantics that makes sense
Correcting Recognizer Ambiguities
• Problem: Resolving words not in the dictionary
• Question: How different is a recognized word from
those that are in the dictionary?
• Solution: Count the single step transformations
necessary to convert one word into another.
• Example: caat  cat with removal of one letter
• Example: flpc  fireplace requires adding the
letters ire after f and a before c and e at the end
Spelling Error Types
• Levenshtein distance: the smallest number of insertion,
deletion, or substitution operations that transforms one string
into another
• Examples: differences from the word, “cat”
– Insertion: catt
– Deletion: ct
– Substitution: car
– Transposition: cta
Note: There are similar well-defined rules for pronunciation variations
Spell Check Algorithm
FOR unrecognized words or those out of context
Generate a list of candidates
– Those that differ by a one step transformation
– Those that exist in the lexicon
Order possibilities using language-based statistics
RETURN most likely word
Note: We could extend the algorithm to consider multiple
transformations when we cannot find a single step solution
Example
Mispelled word: acress
Context
Context * P(c)
Candidates – with probabilities of use and use within context
Which correction is most likely?
Misspelled word: accress
• Word frequency percentage is not enough
– We need p(typo|candidate) * p(candidate)
• How likely is the particular error?
–
–
–
–
–
–
–
Deletion of a t after a c and before an r
Insertion of an a at the beginning
Transpose a c and an a
Substitute a c for an r
Substitute an o for an e
Insert an s before the last s, or after the last s
Context of the word within a sentence or paragraph
• What if there is more than one error per word?
– Possible Solution: Compute minimum edit distance
and allow more than one transformation
• Some words in the lexicon may not appear in the
frequency corpus that we use
– Possible Solution: Smoothing algorithms
• Is P(typo|candidate) accurate?
– Possible Solution: Train the algorithm
• Choice often depends on context
– Possible Solution: Context sensitive algorithms
• Names and places likely do not exist in the lexicon
– This is a difficult issue
Dynamic Programming
Definition: Nesting small decision problems inside larger
decisions (Richard Bellman)
Two approaches:
1. Top Down: Start out with a method call to compute a final
solution. The method proceeds to solve sub-problems of the
same kind. Eventually the algorithm reaches and solves the
“base case.” The algorithm can avoid repetitive calculations by
storing the answers to the sub-problems for later access.
2. Bottom Up: Start with the base cases, and work upwards, using
previous results to reach the final answer.
The examples on the slides that follow use the bottom up approach
8
Dynamic Programming Algorithms
Implement recursive algorithms using arrays
• Recursion is popular for solving ‘divide and conquer’ problems
– Divide problem into smaller problems of the same kind
– Define the base case
– Works from the original problem down to the base case
• Dynamic programming is another approach that uses arrays
– Initialize a variable or table
– Use loops fill in entries the table. The algorithm always fills in
entries in the table before they are needed
– Works top-down or bottom-up towards the final solution
– Avoids repetitive calculations because the technique stores
results in a table for later use
– Eliminates recursion activation record overhead
Dynamic programming is a widely used algorithmic technique
Fibonacci Sequence
{0 1 1 2 3 5 8 13 21 …} where F(n)=F(n-1)+F(n-2)
Bottom Up
int fibo(n)
{
if (n <=1) return n;
int fMinusTwo = 0;
int fMinusOne = 1;
for (int i = 2; i <= n; i++)
{
f = fAMinusTwo + fMinusOne;
fMinusTwo = fMinusOne;
fAtNMinusOne = f;
}
return(f)
}
Top Down
int[] fibs = new int[MAX];
fibs[0] = 0;
fibs[1] = 1;
int max = 1;
int fibo(n)
{
if (n<=max) return fibs[n];
else fibs[n] = fibo(n-1)+fibo(n-2);
max = n;
return fibs[n];
}
Example: Knapsack Problem
A thief enters a house with a knapsack of a particular size. Which
items of different values does he choose to fill his knapsack.
• Values[v] containing
values of the N items
• Weights[v] contains the
weights of the N items
• K = knapsack capacity
• dynTable[v][w] contains
the dynamic algorithm
table
KnapSack Algorithm
knapSack(int[] w, int[] v, int K)
{
int[][] dynTable[v.length+1][K];
for (int w = 0 to K) a[0][w] = 0;
for (int v=1; v<=v.length; v++)
{
for (int w=0; w<W; w++)
{
dyn[v][w] = dynTable[v-1][w]; // Copy up
if (w[v] <= w && )
dynTable[v][w] = // Try for better choice
Math.max(dynTable[v][w],
values[v]+dynTable[v-1,w-weights[v]]);
}
}
return dynTable[N,W];
}
Note: Row for each item to consider; column for integer weights
Knapsack Example
Goal: Maximum profit with a 15kg knapsack?
Array cells represent value stuffed into knapsack of that weight
0
1
2
3
4
5
Weights
?
2
3
4
5
8
Values
?
3
2
6
10
9
Weight
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14 15
0
I
T 1
E 2
M 3
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
3
3
3
3
3
3
3
3
3
3
3
3
3
3
0
0
3
3
3
5
5
5
5
5
5
5
5
5
5
5
0
0
3
3
6
6
9
9
9
11 11 11 11 11 11 11
4
0
0
3
3
6
10 10 13 13 16 16 19 19 19 21 21
5
0
0
3
3
6
10 10 13 13 16 16 19 19 19 21 22
Example: Minimum Edit Distance
A useful dynamic programming algorithm
• Problem: How can we measure how different one word is
from another word (ie spell checker)?
– How many operations will transform one word into another?
– Examples: caat --> cat, fplc --> fireplace
• Definition:
– Levenshtein distance: smallest number of insertion, deletion,
or substitution operations to transform one string into another
– Each insertion, deletion, or substitution is one operation, with
a cost of 1.
• Requires a two dimension array
– Rows: source word positions, Columns: spelled word positions
– Cells: distance[r][c] is the distance (cost) up to that point
Pseudo Code (minDistance(target, source))
n = character in source
m = characters in target
Create array, distance, with dimensions n+1, m+1
FOR r=0 TO n distance[r,0] = r
FOR c=0 TO m distance[0,c] = c
FOR each row r
FOR each column c
IF source[r]=target[c]
cost = 0
ELSE cost = 1
distance[r,c]=minimum of
distance[r-1,c] + 1,
//insertion
distance[r, c-1] + 1,
//deletion
and distance[r-1,c-1] + cost) //substitution
Result is in distance[n,m]
Example
0
G
1
A
2
M
3
B
4
O
5
L
6
G
U
M
B O
1
2
3
4 5
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Initialization
Example
G
U
M
B O
0
1
2
3
4 5
G
1
0
A
2
1
M
3
2
B
4
3
O
5
4
L
6
5
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 1
Example
G
U
M
B O
0
1
2
3
4 5
G
1
0
1
A
2
1
1
M
3
2
2
B
4
3
3
O
5
4
4
L
6
5
5
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 2
Example
G
U
M
B O
0
1
2
3
4 5
G
1
0
1
2
A
2
1
1
2
M
3
2
2
1
B
4
3
3
2
O
5
4
4
3
L
6
5
5
4
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 3
Example
G
U
M
B O
0
1
2
3
4 5
G
1
0
1
2
3
A
2
1
1
2
3
M
3
2
2
1
2
B
4
3
3
2
1
O
5
4
4
3
2
L
6
5
5
4
3
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 4
Example
G
U
M
B O
0
1
2
3
4 5
G
1
0
1
2
3 4
A
2
1
1
2
3 4
M
3
2
2
1
2 3
B
4
3
3
2
1 2
O
5
4
4
3
2 1
L
6
5
5
4
3 2
• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 5
• Result: Distance equals 2
Another Example
E
X
E
C
U
T
I
O
N
0
1
2
3
4
5
6
7
8
9
I
1
1
2
3
4
5
6
6
7
8
N
2
2
2
3
4
5
6
7
7
7
T
3
3
3
3
4
5
5
6
7
8
E
4
3
4
3
4
5
6
6
7
8
N
5
4
4
4
4
5
6
7
7
7
T
6
5
5
5
5
5
5
6
7
8
I
7
6
6
6
6
6
6
5
6
7
O
8
7
7
7
7
7
7
6
5
6
N
9
8
8
8
8
8
8
7
6
5
Comparing Audio Frames
• Patterns: Database of audio samples. Match to
the ones that are closest
• Templates: Database of features extracted from
audio samples
• Training: Use a training set to create a vector of
patterns or templates
• Distortion Measure: algorithm to measure how
far a pair of templates or patterns are apart
Will Minimum Edit Distance Work?
• Maybe: Distances to higher array indices are functions of the costs of
smaller indices; a dynamic programming approach may work
• Issues
–
–
–
–
–
–
–
The algorithm may be too slow
A binary equal or not equal comparison does not work
A distance metric is needed
Speaking rates change, even with a single speaker
Do we compare the raw data or frame-based features?
Do we assign cost to adjacent cells or to those further away?
Other issues: Phase, energy, pitch misalignment, Presence of noise,
length of vowels, Phoneme pronunciation variances, etc.
Incorrect comparisons occur when the algorithm isn’t carefully designed
Dynamic Time Warping
Goal: Find “best” alignment between pairs of audio frames
(A)
The matrix to the right shows the
optimal alignment path (warping)
between frames from utterance A
with those of utterance B
time (frame) of (B)
(B)
time (frame) of (A)
Dynamic Time Warping (DTW) Overview
•
Computes the “distance” between 2 frames of speech
o Measures frame by frame distances to compute
dissimilarities between speech
o Allows the comparison to warp the comparison to account
for differences in speaking rates
o Requires a cost function to account for different paths
through frames of speech
o Uses dynamic programming algorithm to find best warping.
o Computes a total “distortion score” for best warped path.
•
Assumptions
o Constrain begin and end times to be (1,1) and (TA,TB)
o Allow only monotonically increasing time
o Don’t allow too many frames to be skipped
o Can express results in terms of “paths” with “slope
weights”
26
Assumptions
• Does not require that both patterns have the same length
• One speech pattern is the “input” and the other speech pattern is
the “template” to compare against
• Divide speech signal into equally-spaced frames (10-30ms) with
approximately 50% overlap and compute a frame-based feature
set
• The local distance measure (d) is the distance between features
at a pair of frames (one from A, one from B).
• The Global distortion from beginning of utterance until current
pair of frames called G.
27
Algorithm Efficiency
• The algorithm complexity is O(m*n) where m
and n are the respective number of frames
between the two utterances. If m=n, the
algorithm is O(n2). Why?: Count the number
of cells that need to be filled in.
• O(n2) may be too slow. Alternate solutions
have been devised.
– Don’t fill in all of the cells.
– Use a multi-level approach
Don’t Fill in all of the Cells
Disadvantage: The algorithm may miss the optimal path
The Multilevel Approach
Concept
1.
2.
3.
4.
5.
Coarsen the array of features
Run the algorithm
Refine the array
Repeat steps 3-4 till the original array of
features is restored
Notes
• The multilevel approach is a common
technique for increasing many algorithms’
complexity from O(n2) to O(n lg n)
• Example: partitioning a graph to balance work
Which Audio Features?
• Cepstrals: They are statistically independent
and phase differences are removed
• ΔCepstrals, or ΔΔCepstrals: Reflects how the
signal is changing from one frame to the next
• Energy: Distinguish the frames that are voiced
verses those that are unvoiced
• Normalized LPC Coefficients: Represents the
shape of the vocal track normalized by vocal
tract length for different speakers.
These are some of the popular speech recognition features
Distance Metric Requirements
Definition: Measure similarity of two frames of speech.
The vector xt , yt contain the features from frames of two signals
A distance measure should have the following properties:
0  d(xt,yt)  
(positive definiteness)
0 = d(xt,yt) iff xt = yt
d(xt,yt) = d(xt,yt)
(symmetry)
d(xt,yt)  d(xt,zt) + d(zt,yt)
(triangle inequality)
A speech distance metric should correlate with perceived distance.
Perceptually-warped spectral features work well in practice
32
Which Distance Metric?
• General Formula:
array[i,j] = distance(i,j) + min{array[i-1,j], array[i-1,j-1],array[i,j-1)}
• Assumption : There is no cost assessed for duplicate or
eliminated frames.
• Distance Formula:
– Euclidian: sum the square of one metric minus another squared
– Linear: sum the absolute value of the distance between features
• Example of a distance metric using linear distance
∑ wi |(fa[i] – fb[i])| where f[i] is a particular audio feature for
signals a and b. w[i] is that feature’s weight
Which Local Distance Measure?
where xt(f), and yt(f) are
frame frequency values at
time t; f is a feature index
• Euclidean distance:
• Mahalanobis distance:
F 1
 x ( f )  y ( f )
2
t
t
f 0
Other distance measures
Itakura-Saito distortion, COSH, likelihood ratio, etc…
Note: we can weight the features by multiplying differences by
weighting factors to emphasize/deemphasize certain features
Dynamic Time Warping Termination Step
Divide the total computed cost by a normalizing factor
The normalizing factor is necessary to compare results between input
speech and various templates to which it is compared
1. One quick and effective normalizing method divides by the number
of frames in the template.
2. Another method is divide the result by the length of the path taken,
where we adjust the length by the slope weights at each transition.
This requires backtracking to sum the slope values, but can
sometimes be more accurate.
35
Frame transition cost heuristics
½ P1=(1,1)(1,0)
P2 ½
P2=(1,1)
P3=(1,1)(0,1)
½
P1
P1
½
1
P2
P3
P1=(1,0)
P2=(1,1)
P3=(1,2)
P3
Heuristic 1
Heuristic 2
• Path P and slope weight m determined heuristically
• Paths considered backward from target frame
• Larger weight values for less preferable paths
• Optimal paths always go up, right (monotonically forward in time)
• Only evaluate P if all frames have meaningful values (e.g.
don’t evaluate a path if one frame is at time 1, because there
36
is no data for time 1).
Dynamic Time Warping (DTW) Algorithm
1. Initialization (time 1 is first time frame)
D(1,1) = d(1,1)
min
2. Recursion D( x, y) 
D( x, y)   ((x, y), ( x, y))
( x, y)
L
 ((x, y), ( x, y))   d ( xl , yl )m(l )
l 1
l  index along valid pathfrom( x' , y' ) to ( x, y), from1 to L
where (=zeta) is a function of previous distances and slopes
3. Termination
result
D ( xT , yT )
M sometimes defined as Tx, or Tx+Ty,
T
2+ T 2)½
or
(T
x
y
M  m ( k )
k 1
A convenient value for M is the length of the template.
37
Dynamic Time Warping (DTW) Example
2
3
2
2
2
2
3
2
3
1
1
1
1
2
2
2
1
2
2
2
2
1
2
2
1
3
3
1
1
1
1
2
3
3
1
1
3
3
3
3
3
6
7
8
D(1,1) =
D(2,1) =
D(3,1) =
D(4,1) =
6
7
D(1,2) =
D(2,2) =
D(3,2) =
D(4,2) =
7
9
D(2,3) =
D(3,3) =
D(4,3) =
D(5,3) =
5 5
8
11
D(3,4) =
D(4,4) =
D(5,4) =
D(6,4) =
6
9
D(4,5) =
D(5,5) =
D(6,5) =
D(7,5) =
D(4,6) =
D(5,6) =
D(6,6) =
D(7,6) =
6
5 5
5
4
4
2
4
2 3
1
2
6
4
5
heuristic paths:
3
8
11 14
12
1
1
1
P1=(1,0)
P2=(1,1)
P3=(1,2)
begin at (1,1), end at (7,6)
17
normalized distortion = 8/6 = 1.33
normalized distortion = 8/7 = 1.14
38
Dynamic Time Warping (DTW) Example
3
2
3
2
2
2
3
2
3
1
1
1
2
2
2
1
2
2
2
8
9
2
1
3
2
1
8
1
2
3
1
2
3
3
3
3
13 11 12 12 12 13
10
heuristic paths:
1
1
1 P1=(1,0)
P2=(1,1)
P3=(0,1)
begin at (1,1), end at (6,6)
D(4,1) = 9 …
D(1,1) = 1
D(2,1) = 3
D(3,1) = 6
11 11
D(1,2) = 3
D(2,2) = 2
D(3,2) = 10 D(4,2) = 7 …
9 10 10 10
D(1,3) = 5
D(2,3) = 10 D(3,3) = 11 D(4,3) = 9 …
D(2,4) = 7
9 10 10
7
7
5
10 11
9
8 11
D(1,4) = 7
3
2
10
7
9 12
D(1,5) = 10 D(2,5) = 9
1
3
6
9
12 15
D(3,4) = 9
D(4,4) = 10 …
D(3,5) = 10 D(4,5) = 10 …
D(1,6) = 13 D(2,6) = 11 D(3,6) = 12 D(4,6) = 12 …
normalized distortion = 13/6 = 2.17
39
Dynamic Time Warping (DTW) Example
9
8
5
3
1
2
7
7
4
1
3
4
8
6
1
2
3
5
7
5
2
1
3
4
5
3
2
2
4
6
4
2
3
1
3
7
heuristic paths:
½
½
1
½ P1=(1,1)(1,0)
½
P2=(1,1)
P3=(1,1)(0,1)
begin at (1,1), end at (6,6)
D(1,1) =
D(2,1) =
D(3,1) =
D(4,1) =
D(1,2) =
D(2,2) =
D(3,2) =
D(4,2) =
D(2,3) =
D(3,3) =
D(4,3) =
D(5,3) =
D(3,4) =
D(4,4) =
D(5,4) =
D(6,4) =
D(3,5) =
D(4,5) =
D(5,5) =
D(6,5) =
D(3,6) =
D(4,6) =
D(5,6) =
D(6,6) =
40
Singularities
• Assumption
– The minimum distance comparing two signals only
depends on the previous adjacent entries
– The cost function accounts for the varied length of a
particular phoneme, which causes the cost in particular
array indices to no longer be well-defined
• Problem: The algorithm can compute incorrectly due
to mismatched alignments
• Possible solutions:
– Compare based on the change of feature values between
windows instead of the values themselves
– Pre-process to eliminate the causes of the mismatches
Possible Preprocessing
• Normalize the energy of voiced audio:
– Compute the energy of both signals
– Multiply the larger by the percentage difference
• Brick Wall Normalize the peaks and valleys:
– Find the average peak and valley value
– Set values larger than the average equal to the average
• Normalize the pitch: Use PSOLA to align the pitch of the two signals
• Remove duplicate frames: Auto correlate frames at pitch points
• Implement a noise removal algorithm
• Normalize the speaking rate
Dynamic Time Warping/Hidden Markov Models
•
Dynamic Time Warping (DTW)
o Comparing speech with a number of templates
o The algorithm selects the template with the lowest
normalized distortion to determine the recognized
word.
• Hidden Markov Models (HMMs)
o Refines the DTW technology.
o HMMs compare speech against “probabilistic
templates”
o HMMs compute most likely paths using
probabilities
43
Phoneme Marking
• Goal: Mark the start and end of phoneme boundaries
• Research
– Unsupervised text (language) independent algorithms have
been proposed
– Accuracy: 75% to 80%, which is 5-10% lower than supervised
algorithms that make assumptions about the language
• If successful, a database of phonemes can be used in
conjunction with dynamic time warping to simplify the
speech recognition problem
```