Flowers, Bees, and Algorithms: Adventures in Cophylogenetics

Report
Ran Libeskind-Hadas
Department of Computer Science
Harvey Mudd College
Joint work with:
Mike Charleston (Univ. of Sydney)
Chris Conow (USC)
Ben Cousins (Clemson)
Daniel Fielder (HMC)
John Peebles (HMC)
Tselil Schramm (HMC)
Anak Yodpinyanee (HMC)
Integrated CS/Bio Course
Send e-mail to: [email protected]
Overview
• A 75-minute “research lecture” to first-year
students in our CS/Bio intro course
• Show first-year students that what they’ve
learned is relevant to current research
• Showcase research done with senior students
• What have they have done so far?
– Biology: Genes, alignment, phylogenetic trees,
RNA folding
– CS: Programming, recursion, “memoization”
Specifically…
• Pairwise global alignment and RNA folding
– Why you should care
– Designed and implemented recursive solutions
– Why are they slow?
– How do we make them faster?
– “Memoization” idea
– Wow, that’s fast! (but no actual analysis yet)
– Designed and implemented “memoized” versions
– Used their implementations to investigate questions
Around 10 lines of Python code!
Specifically…
• Phylogenetic trees
– Why you should care
– Implemented simple algorithm (e.g. UPGMA)
– Used their implementation to answer questions…
– Existence and relative merits of other algorithms
(mention maximum likelihood… but it’s slow!)
A 75-minute lecture in 30 minutes
(or less)
Actual 75-minute lecture starts here! (Also a chapter in new B4B)
Cophylogenetics
“ I can understand how a flower and a bee might slowly
become, either simultaneously or one after the other,
modified and adapted in the most perfect manner to
each other, by the continued preservation of
individuals presenting mutual and slightly favourable
deviations of structure.”
Charles Darwin, The Origin of Species
Obligate Mutualism of
Figs and Fig Wasps
ovipostor
From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004
The Cophylogeny Problem
From Hafner MS and Nadler SA, Phylogenetic trees support the coevolution of
parasites and their hosts. Nature 1988, 332:258-259
Indigobirds and Finches
• High level of host specificity (e.g. mouth markings)
www.indigobirds.com
The Question…
Given a host tree, parasite tree, and tip mapping, what is the most plausible
mapping between the trees and is it suggestive of coevolution?
This seems to
be a “hard” problem!
Measuring the “Hardness” of
Computational Problems
There are three kinds of problems…
1. Easy
2. Hard
3. Impossible!
“Easy” Problems
Sorting a list of n numbers: [42, 3, 17, 26, … , 100]
Multiplying two n x n matrices:
n
(
3
1
2
9
5
6
4
3
2 7
8 9
6 10
2 12
n
)(
1 5 5
5 12 8
7 6 1
9 23 5
n
4
6
5
8
) (
)
=
n
n
Global Alignment is “easy”!
• Reminder of 2n running time of alignment
• Informally motivate n2 running time of
memoized version
“Hard” Problems
Snowplows of Northern Minnesota
Burrsburg
Tundratown
Freezeapolis
Frostbite City
Shiversville
“Hard” Problems
Snowplows of Northern Minnesota
Burrsburg
Frostbite City
Tundratown
Freezeapolis
Shiversville
Brute-force? Greed?
2
n
versus
n
2
The Ran-O-Matic performs 109 operations/sec
n = 10
n = 30
n2
100
< 1 sec
900
< 1 sec
2n
1024
< 1 sec
109
1 sec
n = 50
n = 70
2500
< 1 sec
4900
< 1 sec
2
n
versus
n
2
The Ran-O-Matic performs 109 operations/sec
n = 10
n = 30
n = 50
n = 70
4900
< 1 sec
n2
100
< 1 sec
900
< 1 sec
2500
< 1 sec
2n
1024
< 1 sec
109
1 sec
1015
13 days
2
n
versus
n
2
The Ran-O-Matic performs 109 operations/sec
n = 10
n = 30
n = 50
n = 70
n2
100
< 1 sec
900
< 1 sec
2500
< 1 sec
4900
< 1 sec
2n
1024
< 1 sec
109
1 sec
1015
13 days
1021
37 trillion
years
2
n
versus
n
2
The Ran-O-Matic performs 109 operations/sec
n = 10
n = 30
n = 50
n = 70
n2
100
< 1 sec
900
< 1 sec
2500
< 1 sec
4900
< 1 sec
2n
1024
< 1 sec
109
1 sec
1015
13 days
1021
37 trillion
years
Computers double in speed every
2 years. Let’s just wait 10 years!
37 trillion years ->
2
n
versus
n
2
The Ran-O-Matic performs 109 operations/sec
n = 10
n = 30
n = 50
n = 70
n2
100
< 1 sec
900
< 1 sec
2500
< 1 sec
4900
< 1 sec
2n
1024
< 1 sec
109
1 sec
1015
13 days
1021
37 trillion
years
Computers double in speed every
2 years. Let’s just wait 10 years!
37 trillion years ->
37 billion years!
Snowplows and Travelling
Salesperson Revisited!
Tens of thousands of other known problems go in this
cloud!!
Snowplow Problem
Travelling
Salesperson
Problem
Protein Folding
Phylogenetic trees
by maximum
likelihood
Multiple sequence
alignment
NP-complete
problems
“I can’t find an efficient algorithm. I guess I’m
too dumb.”
Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson
“I can’t find an efficient algorithm because no such
algorithm is possible!”
Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson
“I can’t find an efficient algorithm, but neither
can all these famous people.”
Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson
$1 million
Vinay Deolalikar
Coping with NP-completeness…
• Brute force
• Ad hoc Heuristics
• Meta heuristics
• Approximation algorithms
Obligate Mutualism of
Figs and Fig Wasps
ovipostor
From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004
The Cophylogeny Problem…
Host tree
Parasite tree
e
d
a
b
c
The Cophylogeny Problem
Host tree
Parasite tree
e
d
a
Tips associations
b
c
Possible Solutions
Input
e
e
d
d
a
b
c
a
b c
Event Cost Model
cospeciation
e
e
d
cospeciation
cospeciation
d
a
b
c
a
b c
Event Cost Model
duplication
e
duplication
e
d
d
a
b
c
a
b c
Event Cost Model
host-switch
e
e
d
host-switch
d
a
b
c
a
b c
Event Cost Model
loss
e
loss
loss
e
d
loss
loss
d
a
b
c
a
b c
Event Cost Model
Cost = cospeciation
+ host-switch + loss
Cost = duplication
+ cospeciation + 3 * loss
e
e
d
cospeciation
loss
loss
duplication
cospeciation
loss
loss
host-switch
d
a
b
c
a
b c
Some typical costs
Cost = 8
Cost = 5
e
duplication
+2
d
cospeciation
+0
loss
+2
loss
+2
a
e
loss
+2
loss
+2
b
c
a
cospeciation
+0
host-switch
+3
d
b c
This problem is hard!
• How hard? NP-complete!
• The host-switches are the culprits
(Joint work with Charleston, Ovadia, Conow, Fielder)
h
e
f
g
Existing Methods
TreeMap
Tarzan/CoRe-PA
Technique
Brute force
Ignore timing incompatibilities
Solution
Optimal
Can be BETTER than optimal!
Running Time
Exponential
Polynomial,
Very fast
Tree Builder
No
Yes
Solution Viewer
Yes
Yes
A Metaheuristic Approach
• Fix a timing
• We can solve the problem optimally for a
given timing using
Dynamic Programming
Compute Cost[a,su,2]
parasite
a
r
t=0
b
s
t=1
c
a
t
t=2
t=3
t=4
u
v
w
x
y
Dynamic Programming
Compute Cost[a,su,2]
r
t=0
b
s
t=1
c
a
t
t=2
b
Cost[b,tw,3]
t=3
t=4
parasite
a
v
w
x
u
y
c
Cost[c,y,4]
Dynamic Programming
Compute Cost[a,su,2]
r
t=0
b
s
t=1
c
a host-switch
losst
t=2
b
Cost[b,tw,3]
t=3
t=4
parasite
a
v
w
x
u
loss
y
c
Cost[c,y,4]
Dynamic Programming
Candidate for Cost[a,su,2]:
Cost[b, tw, 3] + Cost[c, uy, 4] + 2 * loss + host-switch
r
t=0
s
t=1
a host-switch
losst
t=2
b
Cost[b,tw,3]
t=3
t=4
v
w
x
u
loss
y
c
Cost[c,y,4]
Dynamic Programming
Running Time
• O(n3) cells to fill in
• O(n2) positions for first child
• O(n2) positions for second child
• O(n) to count #losses from each child, but this is precomputable
O(n3 x (n2 x n2)) = O(n7) total
Dynamic Programming
Running Time
• O(n3) cells to fill in
• O(n2) positions for first child
• O(n2) positions for second child
• O(n) to count #losses from each child, but this is precomputable
O(n3 x (n2 x n2)) = O(n7) total
Can be improved to O(n3)
Genetic Algorithm
Existing Software
TreeMap
Tarzan/CoRe-PA
Jane 2
Technique
Brute force
DP, Ignore timing
incompatibilities
Genetic algorithm
DP
Solution
Optimal
Can be BETTER than
optimal!
Sometimes suboptimal
Running Time
Exponential
Polynomial,
Very fast
Polynomial, a lot faster!
Can control running time
Tree Builder
No
Yes
No, but Jane 2 can read
CoRe-PA’s trees
Solution Viewer
Yes
Yes
Yes
Also Interactive
The Fig/Wasp Challenge
Results
The Fig/Wasp Dataset…
Randomly
Generated Problem
Instances
Original
Problem Instance
Paper recently completed…
30 Coauthors
18 Institutes
10 Countries
Results
Results
Demo
Future Work…
•
•
•
•
One parasite, many hosts (“failure to diverge”)
Reticulate phylogenies
Multifurcations
Suggestions?
Questions/Comments

similar documents