Simulation, learning, and optimization techniques in Watson’s game strategies
Presented by Christina Ortiz
Why were advances in question-answering (QA) technology necessary?
IBM Watson needed rapid-fire answers to challenging
natural-language questions
Broad general knowledge
High precision
Accurate confidence estimates
This was achieved over 4 years with 24 IBM researchers
Four Strategies of Jeopardy!
Wagering on a Daily Double
Wagering during Final Jeopardy!
Selecting the next square when in control of the board
Deciding whether to attempt to answer (“buzz in”)
Different Circumstances
-a player in control must select squares with an eye toward finding the
Daily Double
-if a player’s score is just below half the leader’s score, they
may make a “desperation buzz” to avoid a sure loss
-if a player’s score is just above half the leader’s score, the
best strategy might be not to buzz in
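The two score situations above can be sketched in code. This is only an illustration, assuming the last clue before Final Jeopardy! and a leader whose score does not change; the function names `is_locked_out` and `endgame_buzz_hint` are hypothetical and not taken from the presentation.

```python
def is_locked_out(score, leader_score):
    """True if the leader has more than double this player's score."""
    return leader_score > 2 * score


def endgame_buzz_hint(score, leader_score, clue_value):
    """Classify a trailing player's situation on the last clue (illustrative)."""
    if is_locked_out(score, leader_score):
        # Only a correct answer can help, and only if it is large enough.
        if not is_locked_out(score + clue_value, leader_score):
            return "desperation buzz"
        return "no escape"
    if is_locked_out(score - clue_value, leader_score):
        # A wrong answer would hand the leader a guaranteed win.
        return "consider not buzzing"
    return "normal"
```

For example, with a leader at $10,000 and a $400 clue, a player at $4,900 falls in the first case and a player at $5,100 in the second.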
Jeopardy! Simulation Model
IBM researchers first needed to build a simulator of contests
between Watson and human contestants
This was achieved through extensive analysis of past episodes, with a
focus on:
-Statistical performance profiles of human contestants
-Tendencies in wagering
-Tendencies in square selection
Jeopardy! Simulation Model
Four methods for designing, learning, and optimizing
1) Properties of the game environment (rules of play,
DD placement possibilities, etc.)
2) Performance profiles of human contestants,
including tendencies in wagering and square selection
3) Performance profiles of Watson, along with
Watson’s actual strategy algorithms
4) Buzz-in thresholds in endgames using Approximate
Dynamic Programming, and estimates of players’ relative
“buzzability”
Properties of the Game
3,000 past episodes, going back as early as the mid-1990s
Recorded the order in which the questions were
played, right and wrong contestant answers, Daily
Double and Final Jeopardy! wagers, and Daily Double locations
Three human models: Average Contestant, Champion,
and Grand Champion
Daily Double Placement
Found that DDs tend to be found in the lower rows, where the
cash reward is higher
DDs basically never appear in the top row
The two second-round DDs never appear in the same column
Row location appears to be set independently of the column
location and of the rows of other DDs in the game
Round 2 column-pair statistics are mostly consistent with
independent placement
The simulator assigns DD location in Round 1, and the first DD
location in Round 2 according to respective row-column
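The placement findings above can be turned into a small sampler. This is a minimal sketch, not the researchers' simulator: the row weights in `ROW_WEIGHTS` are invented for illustration (the slides give no numbers), columns are drawn uniformly for simplicity, and `sample_round2_dds` is a hypothetical name.

```python
import random

# Hypothetical weights for rows 1-5 (top to bottom); illustrative only.
# The top row gets zero weight because DDs essentially never appear there.
ROW_WEIGHTS = [0.0, 0.1, 0.25, 0.4, 0.25]
NUM_COLUMNS = 6


def sample_round2_dds(rng):
    """Return two (row, column) DD locations that never share a column."""
    # Rows are drawn independently of each other and of the columns.
    rows = rng.choices(range(1, 6), weights=ROW_WEIGHTS, k=2)
    # Sampling without replacement keeps the two DDs in different columns.
    cols = rng.sample(range(1, NUM_COLUMNS + 1), k=2)
    return list(zip(rows, cols))
```

Calling `sample_round2_dds(random.Random(0))` returns a list of two `(row, column)` pairs.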
Daily Double Accuracy/Betting Model
Accuracy in Average Contestant model: 64%
Champion model: 75%
Grand Champion model: 80.5%
Players in the lead tend to choose more
conservative bets to maintain their lead
Final Jeopardy! Accuracy/Betting Model
Figure: Average Round 2 DD bets of human contestants in (A) first place, (B) second place, and (C) third place, as a function of clues played in the round
Bets of First Place Player (A)
-High-density line corresponding to the well-known strategy of betting to cover
in case B’s score doubles to 2B
Bets of Second Place Player (B)
-Two high-density lines: one where B bets everything and one
where B bets just enough to overtake A
-Notice the apparent randomization apart from
deterministic wagering principles
Betting Strategies
Wagering models for A, B, and C
(B > 3/4 A, B > 2C): bet “bankroll” (nearly everything), 26% probability
(Just below B - 2C): “keep out C”, 27% probability
(Slightly above A - B): “overtake A”, 15% probability
(Just below 3B - 2A): “two-thirds limit”, 8% probability
(Random bets): 24% probability
These betting models track actual human win rates
Comparison of actual human win rates with model
win rates by historical replacement in 2,092 non-locked
(the leader has no guaranteed win) Final Jeopardy! situations
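The second-place wagering mixture listed above can be sketched as a sampler. The probabilities come from the slide; everything else is an assumption made for illustration: the $100 `MARGIN` standing in for "nearly", "just below", and "slightly above", the uniform draw for random bets, and the name `sample_b_bet`.

```python
import random

BET_TYPES = ["bankroll", "keepout_c", "overtake_a", "two_thirds", "random"]
WEIGHTS = [0.26, 0.27, 0.15, 0.08, 0.24]  # probabilities from the slide
MARGIN = 100  # hypothetical margin, not from the source


def sample_b_bet(a, b, c, rng):
    """Sample a Final Jeopardy! bet for second-place player B (illustrative)."""
    kind = rng.choices(BET_TYPES, weights=WEIGHTS, k=1)[0]
    if kind == "bankroll":
        bet = b - MARGIN            # nearly everything
    elif kind == "keepout_c":
        bet = b - 2 * c - MARGIN    # just below B - 2C
    elif kind == "overtake_a":
        bet = a - b + MARGIN        # slightly above A - B
    elif kind == "two_thirds":
        bet = 3 * b - 2 * a - MARGIN  # just below 3B - 2A
    else:
        bet = rng.randint(0, b)     # random bet, assumed uniform
    return kind, max(0, min(b, bet))  # clamp to a legal wager
```

Drawing many samples with scores such as A = 20,000, B = 16,000, C = 5,000 reproduces the mixture of deterministic lines plus random scatter that the human data shows.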
Regular Question Model
-Researchers estimated how correlated players are in attempting to buzz in and
in having a correct answer
-mean buzz attempt rate: b
-buzz correlation: ρb
-mean precision: p
-right/wrong correlation: ρp
b = 0.61, ρb = 0.2, p = 0.87, ρp = 0.2
The right/wrong correlation combines two effects:
-a positive knowledge correlation of 0.3
-a “tip-off” effect of -0.1, because one player giving the wrong answer
significantly helps the rebound player deduce the correct answer
The net positive correlation is 0.2
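One way to simulate two players whose buzz attempts have a given mean rate and pairwise correlation is sketched below. The presentation does not say how the simulator generates correlated outcomes, so the shared-draw construction and the name `correlated_bernoulli_pair` are assumptions chosen for illustration.

```python
import math
import random


def correlated_bernoulli_pair(mean, rho, rng):
    """Draw two 0/1 outcomes with the given mean and pairwise correlation rho.

    Each player copies a shared draw with probability sqrt(rho) and otherwise
    draws independently, which gives a correlation of rho between the two.
    """
    shared = rng.random() < mean
    w = math.sqrt(rho)
    return tuple(
        shared if rng.random() < w else (rng.random() < mean)
        for _ in range(2)
    )
```

With `mean=0.61` and `rho=0.2` this matches the Average Contestant buzz parameters; the same function could be reused for right/wrong outcomes with precision 0.87.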
Regular Question Model
Champion model:
Substantial increase in attempt rate (b = 0.61 to b = 0.8)
Slight increase in precision (p = 0.87 to p = 0.89)
Grand Champion model:
b = 0.855, p = 0.915
Square Selection Model
Strongest human tendency is to select squares in top-to-bottom order within a given category, or to stay in
the same category
Weaker tendency to select categories moving left-to-right across the board
90% probability that the contestant chooses within the
same category
Champion and Grand Champion models add DD
seeking based on DD placement
Multigame Wagering
Researchers had to take into account the different
wagering strategies for Games 1 and 2
Found that competitors Jennings and Rutter would be
very aggressive in their Daily Double and Final Jeopardy!
wagers in Game 1
Betting in Game 2 would most likely follow this pattern:
If B can “keep out” C, B bets a “bankroll” with 35%
probability, a small random amount satisfying the two-thirds and keep-out
limits with 45% probability, or a bet satisfying all limits with 22% probability
Optimizing Watson
Testing Watson’s performance against two simulated
human opponents gave estimates of its “buzzability”
(likelihood of winning the buzz) against humans of all
ability levels
Watson’s buzzability: 80% against Average Contestants,
73% against Champions, and 70% against Grand Champions
Computation speed vital factor because wagering,
square selection and buzz-in decisions occur in just
Daily Double Wagering
DD betting is based on estimating Watson’s
likelihood of answering the DD question correctly
and how a given bet will impact Watson’s overall
chances of winning
Based on an “in-category DD confidence” model, which
estimates DD accuracy given the number of questions already seen
in the category and whether Watson got them right or wrong
Reinforcement learning and the GSE (game-state evaluator)
Daily Double Wagering
Combination of GSE with in-category confidence:
E(bet): the “equity”, the expected winning chances of a bet,
according to
E(bet) = PDD × V(Sw + bet, …) + (1 – PDD) × V(Sw – bet, …)
PDD: in-category confidence
Sw: Watson’s current score
V(): game-state evaluation after the score increases or decreases
by bet and the DD has been removed from the board
Using this equation, the optimal risk-neutral
bet is the bet with the highest equity
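The equity calculation above can be sketched as a search over candidate bets. Watson's real V() is a learned game-state evaluator; the logistic `toy_value` below is a hypothetical stand-in that looks only at Watson's score against a single fixed opponent score, and the $100 bet grid and cap at the current score are simplifying assumptions.

```python
import math


def toy_value(watson_score, opponent_score=12000, scale=4000.0):
    """Hypothetical stand-in for V(): win chance rises with Watson's lead."""
    return 1.0 / (1.0 + math.exp(-(watson_score - opponent_score) / scale))


def equity(bet, p_dd, s_w, value=toy_value):
    """E(bet) = PDD * V(Sw + bet) + (1 - PDD) * V(Sw - bet)."""
    return p_dd * value(s_w + bet) + (1.0 - p_dd) * value(s_w - bet)


def best_bet(p_dd, s_w, step=100, min_bet=5, value=toy_value):
    """Return the candidate bet with the highest equity (risk-neutral)."""
    candidates = [min_bet] + list(range(step, s_w + 1, step))
    return max(candidates, key=lambda bet: equity(bet, p_dd, s_w, value))
```

With this toy evaluator and Watson ahead, a low confidence such as 0.45 selects the minimum bet, while higher confidence pushes the chosen bet upward, the same qualitative pattern as the equity curves.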
Equity estimates getting the DD right
-Bet equity curves at five different
in-category confidence levels, from
45% (optimal bet $5) to 85% (optimal bet $11,000)
-Black dots: the optimal risk-neutral bet, which
increases with confidence
-Risk mitigation: lowered Watson’s
equity by 0.2% but reduced
downside risk by more than 10% if
the DD was answered incorrectly
Multigame DD
Able to estimate the expected probabilities of
Watson ending Game 2 in first, second, or third place
based on any combination of Game 1 final scores
One issue was how to assign relative utilities to
finishing in first, second, and third place
Ultimately settled on full credit for first, half credit for
second, and zero for third, which puts equal emphasis
on finishing first and on avoiding finishing third
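As a worked illustration of the utility assignment above, the expected utility of a set of finishing probabilities is a weighted sum. The names `UTILITY` and `expected_utility` are hypothetical; only the 1 / 0.5 / 0 weights come from the slide.

```python
# Utilities from the slide: full credit for first, half for second, zero for third.
UTILITY = {"first": 1.0, "second": 0.5, "third": 0.0}


def expected_utility(place_probs):
    """Weighted sum of utilities over the probabilities of each finishing place."""
    return sum(UTILITY[place] * prob for place, prob in place_probs.items())
```

Under these weights a certain second place (`{"second": 1.0}`) is worth the same as an even gamble between first and third, which is what equal emphasis on winning and on avoiding third means in practice.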
Computation of Best Response Strategy
-estimate Watson’s confidence given the category
-combine Watson’s confidence with the human accuracy and
correlation parameters stated earlier
-use on the order of 10,000 Monte Carlo samples of
human bets to find the bet with the highest equity
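The steps above can be sketched as a small Monte Carlo search. This is a simplified illustration, not Watson's algorithm: it treats right/wrong outcomes as independent (the source uses correlation parameters), counts only outright first place as a win, and takes a caller-supplied `sample_bet` function as a stand-in for the human betting models. All names here are hypothetical.

```python
import random


def best_response_bet(watson, humans, p_watson, p_human, sample_bet,
                      n_samples=10_000, step=500, seed=0):
    """Return (bet, equity) maximizing Watson's estimated chance of finishing first."""
    rng = random.Random(seed)
    # Pre-sample the best human final score per scenario so that every
    # candidate bet is scored against the same scenarios.
    top_scores = []
    for _ in range(n_samples):
        finals = []
        for score in humans:
            bet = sample_bet(score, rng)
            right = rng.random() < p_human
            finals.append(score + bet if right else score - bet)
        top_scores.append(max(finals))
    best_bet, best_equity = 0, -1.0
    for bet in range(0, watson + 1, step):
        wins_right = sum(watson + bet > top for top in top_scores)
        wins_wrong = sum(watson - bet > top for top in top_scores)
        eq = (p_watson * wins_right + (1 - p_watson) * wins_wrong) / n_samples
        if eq > best_equity:
            best_bet, best_equity = bet, eq
    return best_bet, best_equity
```

A usage example under these assumptions: `best_response_bet(20000, [5000, 3000], 0.7, 0.5, lambda s, r: r.randint(0, s))` evaluates every $500 increment and returns the first bet that reaches the highest estimated equity.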
Final Jeopardy!
Best Response algorithms for Game 1 and Game 2 are based on
Monte Carlo samples of human betting models
Game 1: outcomes cannot be evaluated directly
because there is a second game to play; interpolation
over lookup tables is used instead
Game 2: Best Response yields logical betting rules. As A, the rules guarantee a win if Watson
answers correctly. As B, the rules finish ahead of A only if doing so
does not decrease Watson’s chances of finishing ahead of C
This comes from assigning half credit for second place
As C, no human-interpretable rules could be derived
from the Best Response output
Square Selection
Factors: finding the DDs, retaining control of the board if
a DD is not found, and learning the essence of the category
Against Champion and Grand Champion models,
where DD seeking is very aggressive, Watson’s win
rate is attributed first to finding DDs and
second to retaining control
Learning the essence of the category has some effect
only after all DDs have been found
Square Selection
While DDs remain unrevealed, a square i* is selected that maximizes
Pdd(i) + a * Prc(i), where Pdd(i) is the probability
that square i contains a DD and Prc(i) is the probability that
Watson will retain control if i does not contain a DD
After all DDs are found, the algorithm switches to
selecting the lowest dollar value in a category to learn
about the category
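The selection rule for the DD-seeking phase can be written in a few lines. This is a minimal sketch assuming the two probabilities are already available as dictionaries keyed by square; how Watson actually estimates Pdd(i) and Prc(i) is not covered here, and `select_square` is a hypothetical name.

```python
def select_square(p_dd, p_rc, a=0.1):
    """Return the square i maximizing Pdd(i) + a * Prc(i).

    p_dd and p_rc map each available square to its DD probability and
    to Watson's probability of retaining control, respectively.
    """
    return max(p_dd, key=lambda i: p_dd[i] + a * p_rc[i])
```

For example, with `p_dd = {(3, 1): 0.10, (4, 1): 0.12, (5, 2): 0.11}` and `p_rc = {(3, 1): 0.5, (4, 1): 0.2, (5, 2): 0.6}`, the retain-control term tips the choice from square (4, 1) to square (5, 2).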
Square Selection
Simulation win rates vs. Grand Champions using various square selection strategies
Baseline: Watson’s win rate against simulated Grand Champions using the strategy of
selecting columns with the highest estimated accuracy in top-to-bottom order
Three factors improve on the baseline:
-6.6% improvement from switching to Bayesian DD seeking (while DDs are available)
-0.3% improvement from the strategy used once no DDs remain
-0.1% improvement from including a * Prc(i), with a = 0.1
Confidence Threshold
Watson attempts to buzz if confidence in its top-rated answer exceeds a threshold value
The threshold is usually set to maximize expected
earnings, but it may vary
throughout the game
Four states in which Watson may buzz: the initial
state, the first rebound where human #1 answered
incorrectly, the first rebound where human #2
answered incorrectly, and the second rebound where
both humans answered incorrectly
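To see where a default threshold comes from, consider the expected earnings of one buzz in isolation. The sketch below assumes Watson wins the buzz and ignores rebounds and game-state effects, so it illustrates only the expected-earnings case; the function names are hypothetical.

```python
def expected_gain(confidence, dollar_value):
    """Expected dollars from answering: +value if right, -value if wrong."""
    return confidence * dollar_value - (1.0 - confidence) * dollar_value


def should_buzz(confidence, threshold=0.5):
    """Buzz when confidence in the top-rated answer exceeds the threshold."""
    return confidence > threshold
```

Because the expected gain equals the dollar value times (2 × confidence - 1), it is positive exactly when confidence exceeds 0.5, which is why 0.5 is the natural threshold under these simplifying assumptions.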
Confidence Threshold
-Recursion relation between the value of the current game state with K questions
remaining before FJ and the values of the possible successor states with K-1 questions remaining
P(c): probability density of Watson’s confidence
P(Dj): probability that the next square selected will be in row j, with dollar value
Dj = 400 * j
P(δ|B,c): probability of the various unit score-change combinations δ, given buzz decision B and confidence c
S’: the possible successor states after the Dj square has been played
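The recursion itself is not shown in the text. A form consistent with the symbols defined above would be the following. This is a reconstruction under that assumption, where δ denotes a unit score-change combination, B the buzz decision, and the integral runs over Watson's confidence c:

```latex
V_K(S) = \int P(c) \sum_{j=1}^{5} P(D_j) \, \max_{B} \sum_{\delta}
         P(\delta \mid B, c) \, V_{K-1}\bigl(S'(\delta, D_j)\bigr) \, dc
```

Read from the inside out: for each confidence level and dollar value, the best buzz decision is the one that maximizes the expected value of the successor state, and these maxima are then averaged over rows and confidence.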
Confidence Threshold
Approximate Dynamic Programming (DP) techniques are needed
because exact DP is too slow
Approximate DP is used to evaluate Vk in terms of Vk-1,
where Vk-1 values are based on plain Monte Carlo trials
Because the exact calculation is so slow, exact values
could not be computed for K greater than 5
However, approximate DP gave threshold estimates within
5% of the exact value
Buzz-in algorithm: in a “desperation buzz”, Watson
must buzz in and answer correctly to avoid a lockout, so
attempting to buzz is a risk-free chance to win the game
