### Maximum likelihood

```
The Estimation Problem
How would we select parameters in the limiting case
where we had ALL the data?
The probability a_{k→l} of transitioning from state k to state l:
a_{k→l} = |k→l| / Σ_{l'} |k→l'|    (counts from all the data)
where |k→l| is the count of k to l transitions, and the denominator is the
count of k to l' transitions summed over all possible states l'
Intuitively, the actual frequencies of all the transitions
would best describe the parameters we seek
The Estimation Problem
What about when we only have a sample? Consider:
X = “S--+++”
Before we collected the data, the probability of this
sequence is a function of θ, our set of unknown parameters:
P(X|θ) = P(“S--+++”|θ)
P(X|θ) = a_{S→-} · a_{-→-} · a_{-→+} · a_{+→+} · a_{+→+}
However, our data is fixed. We have already collected it.
The parameters are also fixed, but unknown.
We can therefore imagine values for the parameters, and
treat the probability of the observed data as a function of θ
The Estimation Problem
The Likelihood Function
When we treat the probability of the observed data as a function
of the parameters, we call this the likelihood function
L(|X) = P(“S--+++”|)
L(|X) = as→-a-→-a-→+a+→+a+→+
A few things to notice:
• The probability of any particular sample we get is generally going to
be pretty low regardless of the true values of θ
• The likelihood here still tells us some valuable information! We
know, for instance, that a_{-→+} is not zero, etc.
Caution! As a function of θ, the likelihood does not define a probability
distribution or density: it need not sum or integrate to 1 over θ
Maximum Likelihood Estimation
Maximum Likelihood Estimation seeks the solution
that “best” explains the observed dataset
θ_ML = argmax_θ P(X|θ)
Or
θ_ML = argmax_θ log P(X|θ)
Translation: “select as our maximum likelihood parameters
those parameters that resulted in a maximization of the
probability of the observation given those parameters”.
i.e. we seek to maximize P(X|θ) over all possible θ
This is sometimes called the maximum likelihood criterion
Maximum Likelihood Estimation
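The log likelihood is often very handy, as we would otherwise
need to deal with a long product of terms. Because log is monotonically
increasing, taking the log does not change which θ is the maximizer:
θ_ML = argmax_θ log Π_{i=1}^{k} P(x_i|θ)
     = argmax_θ Σ_{i=1}^{k} log P(x_i|θ)
The product often comes about because the data consist of multiple
outcomes x_1, …, x_k whose probabilities multiply (for example,
independent observations)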
Maximum Likelihood Estimation
Sometimes proving some parameter choice maximizes
the likelihood function is the “tricky bit”
In the general case, this is often done by finding the zeros of the
derivative of the likelihood function, or by some other trick such
as forcing the function into some particular form and relying on
an inequality to prove it must be a maximum
Let’s skip the gory details, and try to motivate this intuitively…
The Estimation Problem
Maybe it’s enough to convince ourselves that…
a_{k→l} = |k→l| / Σ_{l'} |k→l'|    (counts from the sample data)
will approach…
P(k→l|All the data)
as the amount of sample data increases to the
limit where we finally have all the data….
Let’s see how this plays out with a simple simulation…
Maximum Likelihood Estimation
Typical plot of a single sample of 10 nucleotides
The underlying distribution this was sampled from was uniform
(pA = 0.25, pC = 0.25, pG = 0.25, pT = 0.25)
The MLE is prone to overfitting the data in the case where the sample is small
Maximum Likelihood Estimation
Typical plot of 10 samples of 10 nucleotides
The underlying distribution this was sampled from was uniform
(pA = 0.25, pC = 0.25, pG = 0.25, pT = 0.25)
Maximum Likelihood Estimation
Typical plot of 100 samples of 10 nucleotides
The underlying distribution this was sampled from was uniform
(pA = 0.25, pC = 0.25, pG = 0.25, pT = 0.25)
Maximum Likelihood Estimation
Typical plot of 1000 samples of 10 nucleotides
The underlying distribution this was sampled from was uniform
(pA = 0.25, pC = 0.25, pG = 0.25, pT = 0.25)
```
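The count-based estimate, the log likelihood, and the sampling experiment described in the slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the original plots. The function names (`mle_transitions`, `log_likelihood`, `mle_frequencies`) are hypothetical. The sketch also assumes that "n samples of 10 nucleotides" means pooling all n × 10 draws before counting.

```python
import math
import random
from collections import Counter, defaultdict


def mle_transitions(seq):
    """Estimate a_{k->l} = |k->l| / sum over l' of |k->l'| from one state sequence."""
    counts = defaultdict(Counter)
    for k, l in zip(seq, seq[1:]):
        counts[k][l] += 1
    return {
        k: {l: n / sum(row.values()) for l, n in row.items()}
        for k, row in counts.items()
    }


def log_likelihood(seq, a):
    """log L(theta|X): sum of log transition probabilities along the sequence."""
    total = 0.0
    for k, l in zip(seq, seq[1:]):
        p = a.get(k, {}).get(l, 0.0)
        if p == 0.0:
            return float("-inf")  # an observed transition was given probability 0
        total += math.log(p)
    return total


def mle_frequencies(n_samples, sample_len=10, alphabet="ACGT", rng=None):
    """Pool n_samples uniform draws of sample_len nucleotides; return count / total."""
    rng = rng or random.Random()
    total = n_samples * sample_len
    counts = Counter(rng.choice(alphabet) for _ in range(total))
    return {c: counts[c] / total for c in alphabet}


if __name__ == "__main__":
    a_hat = mle_transitions("S--+++")
    print("transition estimates:", a_hat)
    print("log likelihood at the estimate:", log_likelihood("S--+++", a_hat))

    rng = random.Random(0)  # fixed seed only so that reruns are repeatable
    for n in (1, 10, 100, 1000):
        print(n, "samples:", mle_frequencies(n, rng=rng))
```

Running the script prints the estimated transition probabilities for “S--+++” and then the estimated nucleotide frequencies for 1, 10, 100, and 1000 samples. With a single sample of 10 nucleotides the estimates are usually visibly far from 0.25, and they tighten around 0.25 as the sample count grows, which is the behavior the plots are meant to show.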