Motif discovery - University of Washington

Report
Motif discovery
Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
[email protected]
Outline
• One-minute response
• Revision
• Motifs
– Gibbs sampling
– Expectation maximization
• Python
One-minute responses
• Are we able to read from the smooth curve that
we learned to create today what is a “good
enough p-value”?
– No, the curve gives you the p-value, but deciding what
is “good enough” requires that you know the costs
associated with false positives and false negatives.
• The concepts were clear (for today).
• Keep doing more explanation on the board.
• Can you please explain the last part of converting
scores to p-values?
• Continue with revision every day.
Other questions and comments
• We would prefer to know how we did on the
first assessment before you give us the second
assessment.
Converting scores to p-values
0 1 2 3 4 …
A
10 67 59 44
C
60 39 49 29
G
0 71 50 54
400
T 100 43 13 64
• Say that your motif has N rows. Create a matrix that
has N rows and 100N columns.
• The entry in row i, column j is the number of
different sequences of length i that can have a score
of j.
Converting scores to p-values
A
10 67 59 44
C
60 39 49 29
G
0 71 50 54
0 1 2 3 4 …
10
60
100
1
1
1
1
400
T 100 43 13 64
• For each value in the first column of your motif, put a
1 in the corresponding entry in the first row of the
matrix.
• There are only 4 possible sequences of length 1.
Converting scores to p-values
A
10 67 59 44
C
60 39 49 29
G
0 71 50 54
0 1 2 3 4 …
10
1
1
60 77
1
100
400
1
1
T 100 43 13 64
• For each value x in the second column of your
motif, consider each value y in the zth column
of the first row of the matrix.
• Add y to the x+zth column of the matrix.
Converting scores to p-values
A
10 67 59 44
C
60 39 49 29
G
0 71 50 54
0 1 2 3 4 …
10
1
1
60 77
1
100
400
1
1
T 100 43 13 64
• For each value x in the second column of your motif, consider
each value y in the zth column of the first row of the matrix.
• Add y to the x+zth column of the matrix.
• What values will go in row 2?
– 10+67, 10+39, 10+71, 10+43, 60+67, …, 100+43
• These 16 values correspond to all 16 strings of length 2.
Converting scores to p-values
A
10 67 59 44
C
60 39 49 29
G
0 71 50 54
0 1 2 3 4 …
10
1
1
60 77
1
100
400
1
1
T 100 43 13 64
• In the end, the bottom row contains the
scores for all possible sequences of length N.
• Use these scores to compute a p-value.
Sample problem
• List the scores of all 4 length-1 DNA sequences relative to this
motif:
A 10 15 100 80
C 100 30
7 21
G
0 30 22 14
T 10 35
9 51
– A=10, C=100, G=0, T=10
• List the scores of all 16 length-2 DNA sequences relative to the
same motif.
– AA=15, AC=40, AG=40, AT=45, CA=115, CC=130, CG=130, CT=135,
GA=15, GC=30, GG=30, GT=35, TA=25, TC=40, TG=40, TT=45
• Draw the dynamic programming matrix for this motif and indicate
where the 20 scores you computed would go.
• 15x2, 25, 30x2, 40x4
Sample problem
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135
1
2
1
2
1 2 1 4 1
1
2
1
• Draw the dynamic programming matrix for this motif and
indicate where the 20 scores you computed would go.
– AA=15, GA=15, TA=25, GC=30, GG=30, GT=35, AC=40, AG=40,
TC=40, TG=40, AT=45, TT=45, CA=115, CC=130, CG=130, CT=135
• How many distinct scores do you observe for sequences of
length 2?
– 9
• How many calculations will you need to perform to
compute scores for sequences of length 3?
– 9 x 4 = 36
Revision
• How many distinct amino acid sequences of
length 10 exist?
– 2010=1.024 x 1012
• Say that you use dynamic programming to
compute the distribution of scores for a motif of
width 10. You then observe a sequence with a
score of 28. Describe how you would compute
the p-value of 28 from the output of the dynamic
programming.
– Compute the sum S of the counts for scores ≥28.
– The p-value is S/2010.
1 ¥
p(x = 28) = 20 å si
10 i=28
Motif discovery problem
• Given sequences
seq. 1
seq. 2
seq. 3
• Find motif
IGRGGFGEVY at position 515
LGEGCFGQVV at position 430
VGSGGFGQVY at position 682
seq. 1
seq. 2
seq. 3
Motif discovery problem
• Given:
– a sequence or family of sequences.
• Find:
– the number of motifs
– the width of each motif
– the locations of motif occurrences
Why is this hard?
• Input sequences are long
(thousands or millions of
residues).
• Motif may be subtle
– Instances are short.
– Instances are only slightly
similar.
?
?
HAHU
HAOR
HADK
HBHU
HBOR
HBDK
MYHU
MYOR
IGLOB
GPUGNI
GPYL
GGZLB
xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx
V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD
A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE
HAHU
HAOR
HADK
HBHU
HBOR
HBDK
MYHU
MYOR
IGLOB
GPUGNI
GPYL
GGZLB
xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx
QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL
QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL
QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL
AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL
AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL
AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL
EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII
EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII
T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR
NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG
NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE
Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK
HAHU
HAOR
HADK
HBHU
HBOR
HBDK
MYHU
MYOR
IGLOB
GPUGNI
GPYL
GGZLB
xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x
VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R
VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R
VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R
VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H
IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H
IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H
QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G
HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G
TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L
ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E
AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA
EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E
Globin motifs
A Concrete Example:
Transcription Factor Binding Sites
• Transcription factor
proteins bind to DNA
and regulate gene
expression.
• The promoter is a
region near the start of
the gene where
transcription factors
bind.
A Concrete Example:
Transcription Factor Binding Sites
We are given a set of promoters from
co-regulated genes.
TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
…HIS7
ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
…ARO4
CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
…ILV6
TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
…THR4
ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
…ARO1
ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
…HOM2
GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
…PRO3
A Concrete Example:
Transcription Factor Binding Sites
An unknown transcription factor binds to positions
unknown to us, on either DNA strand.
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
A Concrete Example:
Transcription Factor Binding Sites
The DNA binding motif of the
transcription factor can be
described by a position-specific
scoring matrix (PSSM).
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
A Concrete Example:
Transcription Factor Binding Sites
The sequence motif discovery problem is to discover the
sites (or the motif) given just the sequences.
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
Gibbs sampling
Alternating approach
1. Guess an initial weight matrix
2. Use weight matrix to predict instances in the
input sequences
3. Use instances to predict a weight matrix
4. Repeat 2 & 3 until satisfied.
Initialization
• Randomly guess an instance si from each of t
input sequences {S1, ..., St}.
sequence 1
sequence 2
sequence 3
sequence 4
sequence 5
ACAGTGT
TTAGACC
GTGACCA
ACCCAGG
CAGGTTT
Gibbs sampler
• Initially: randomly guess an instance si from each of t
input sequences {S1, ..., St}.
• Steps 2 & 3 (search):
– Throw away an instance si: remaining (t - 1) instances
define weight matrix.
– Weight matrix defines instance probability at each position
of input string Si
– Pick new si according to probability distribution
• Return highest-scoring motif seen
Sampler step illustration:
ACAGTGT
TAGGCGT
ACACCGT
???????
CAGGTTT
ACAGTGT
TAGGCGT
ACACCGT
ACGCCGT
CAGGTTT
A
C
G
T
.45 .45 .45 .05 .05 .05 .05
.25 .45 .05 .25 .45 .05 .05
.05 .05 .45 .65 .05 .65 .05
.25 .05 .05 .05 .45 .25 .85
sequence 4
ACGCCGT:20%
11%
ACGGCGT:52%
The MEME Algorithm
MEME uses expectation maximization (EM)
to discover sequence motifs.
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
Slides courtesy of Tim Bailey
The MEME Algorithm
The positions (and strands) of the sites are
the missing information variables, Z =
{Zi,j}.
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
The MEME Algorithm
If we knew the Z = {Zi,j}, we could
estimate the motif PSSM using
maximum likelihood.
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
The MEME Algorithm
Alignment
i
1 AAAAGAGTCA
2 AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
N AAATGAGTCA
12 …
w
j
Count Matrix
A
C
G
T
Frequency Matrix
A
C
G
T
12 …
w
12 …
w
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
The MEME Algorithm
MEME starts Expectation
Maximization from an initial
estimate of θ based on a
subsequence in the data.
θ(1)
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
The MEME Algorithm
EM maximizes the expected
joint likelihood of the
sequences (X) and missing
information (Z) under a
probabilistic model.
θ(1)
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
The MEME Algorithm
The current estimates of the
parameters of the data model
are φ(t), and include θ(t).
θ(1)
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
The MEME Algorithm
EM iteratively refines φ(t).
The MEME Algorithm
E-step: Use φ(t) to estimate the probabilities of each
position in the data being a site.
The MEME Algorithm
M-step: Use Z(t) to find the value of φ that maximizes the
expectation of the joint conditional probability.
Alternating approach
1. Guess an initial weight matrix
2. Use weight matrix to predict instances in the
input sequences
3. Use instances to predict a weight matrix
4. Repeat 2 & 3 until satisfied.
Expectation-maximization
foreach subsequence of width W
convert subsequence to a matrix
do {
re-estimate motif occurrences from matrix
EM
re-estimate matrix model from motif occurrences
} until (matrix model stops changing)
end
select matrix with highest score
Comparison
• Both EM and Gibbs sampling involve iterating over
two steps
• Convergence:
– EM converges when the PSSM stops changing.
– Gibbs sampling runs until you ask it to stop.
• Solution:
– EM may not find the motif with the highest score.
– Gibbs sampling will provably find the motif with the
highest score, if you let it run long enough.
Sample problem #1
• You are implementing a Gibbs sampler for motif discovery. Your
code has just scanned a sequence S with a motif PSSM to produce a
list of scores x1, …, xn. You will write the code to randomly select
the new motif occurrence on the basis of the score list.
• Given:
– a list of scores x1, …, xn
• Return:
– Print to standard output an integer i in the range 1, …, n and the
corresponding score xi, where i is selected with probability
proportional to xi.
> ./randomly-select-4.py scores.txt
Read 10000 scores from scores.txt.
Sum of scores = 20042.7
Random value = 3298.62
1655 1.41589
Sample problem #2
• Given:
– a list of DNA sequences,
– the motif width,
– the integer index of the (currently estimated) motif
occurrence within each sequence,
– the index j of the sequence for which to update the
motif occurrence, and
– the list of scores for sequence j.
• Return:
– a new alignment with the motif occurrence for
sequence j randomly replaced.
>./do-gibbs-step.py crp0.txt 10 crp0-offsets.txt 1 crp0-scores.txt > foo
Read 18 sequences from crp0.txt.
Read 95 values from crp0-scores.txt.
Read 18 values from crp0-indices.txt.
Sum of scores = 189.341
Random value = 69.6017
Selected 36 with score 2.81402.
> cat foo
TTTGTGGCAT
CAGAAAAGTC
ACTGTGAGCA
TATGCAAAGG
GGTGTTAAAT
ATTGTGATGT
AATTTATTCC
AACGTGATCA
AATGTGAGTT
TCTGTAACAG
TCTGTGAACT
ATTGTGACAC
TGCCTGACGG
ATTGTGATTC
GTTGTGATGT
GGTGTGAAAT
AATGAGACGT
TTTGTGAGTG

similar documents