Report

Ran Libeskind-Hadas Department of Computer Science Harvey Mudd College Joint work with: Mike Charleston (Univ. of Sydney) Chris Conow (USC) Ben Cousins (Clemson) Daniel Fielder (HMC) John Peebles (HMC) Tselil Schramm (HMC) Anak Yodpinyanee (HMC) Integrated CS/Bio Course Send e-mail to: [email protected] Overview • A 75-minute “research lecture” to first-year students in our CS/Bio intro course • Show first-year students that what they’ve learned is relevant to current research • Showcase research done with senior students • What have they have done so far? – Biology: Genes, alignment, phylogenetic trees, RNA folding – CS: Programming, recursion, “memoization” Specifically… • Pairwise global alignment and RNA folding – Why you should care – Designed and implemented recursive solutions – Why are they slow? – How do we make them faster? – “Memoization” idea – Wow, that’s fast! (but no actual analysis yet) – Designed and implemented “memoized” versions – Used their implementations to investigate questions Around 10 lines of Python code! Specifically… • Phylogenetic trees – Why you should care – Implemented simple algorithm (e.g. UPGMA) – Used their implementation to answer questions… – Existence and relative merits of other algorithms (mention maximum likelihood… but it’s slow!) A 75-minute lecture in 30 minutes (or less) Actual 75-minute lecture starts here! (Also a chapter in new B4B) Cophylogenetics “ I can understand how a flower and a bee might slowly become, either simultaneously or one after the other, modified and adapted in the most perfect manner to each other, by the continued preservation of individuals presenting mutual and slightly favourable deviations of structure.” Charles Darwin, The Origin of Species Obligate Mutualism of Figs and Fig Wasps ovipostor From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004 The Cophylogeny Problem From Hafner MS and Nadler SA, Phylogenetic trees support the coevolution of parasites and their hosts. Nature 1988, 332:258-259 Indigobirds and Finches • High level of host specificity (e.g. mouth markings) www.indigobirds.com The Question… Given a host tree, parasite tree, and tip mapping, what is the most plausible mapping between the trees and is it suggestive of coevolution? This seems to be a “hard” problem! Measuring the “Hardness” of Computational Problems There are three kinds of problems… 1. Easy 2. Hard 3. Impossible! “Easy” Problems Sorting a list of n numbers: [42, 3, 17, 26, … , 100] Multiplying two n x n matrices: n ( 3 1 2 9 5 6 4 3 2 7 8 9 6 10 2 12 n )( 1 5 5 5 12 8 7 6 1 9 23 5 n 4 6 5 8 ) ( ) = n n Global Alignment is “easy”! • Reminder of 2n running time of alignment • Informally motivate n2 running time of memoized version “Hard” Problems Snowplows of Northern Minnesota Burrsburg Tundratown Freezeapolis Frostbite City Shiversville “Hard” Problems Snowplows of Northern Minnesota Burrsburg Frostbite City Tundratown Freezeapolis Shiversville Brute-force? Greed? 2 n versus n 2 The Ran-O-Matic performs 109 operations/sec n = 10 n = 30 n2 100 < 1 sec 900 < 1 sec 2n 1024 < 1 sec 109 1 sec n = 50 n = 70 2500 < 1 sec 4900 < 1 sec 2 n versus n 2 The Ran-O-Matic performs 109 operations/sec n = 10 n = 30 n = 50 n = 70 4900 < 1 sec n2 100 < 1 sec 900 < 1 sec 2500 < 1 sec 2n 1024 < 1 sec 109 1 sec 1015 13 days 2 n versus n 2 The Ran-O-Matic performs 109 operations/sec n = 10 n = 30 n = 50 n = 70 n2 100 < 1 sec 900 < 1 sec 2500 < 1 sec 4900 < 1 sec 2n 1024 < 1 sec 109 1 sec 1015 13 days 1021 37 trillion years 2 n versus n 2 The Ran-O-Matic performs 109 operations/sec n = 10 n = 30 n = 50 n = 70 n2 100 < 1 sec 900 < 1 sec 2500 < 1 sec 4900 < 1 sec 2n 1024 < 1 sec 109 1 sec 1015 13 days 1021 37 trillion years Computers double in speed every 2 years. Let’s just wait 10 years! 37 trillion years -> 2 n versus n 2 The Ran-O-Matic performs 109 operations/sec n = 10 n = 30 n = 50 n = 70 n2 100 < 1 sec 900 < 1 sec 2500 < 1 sec 4900 < 1 sec 2n 1024 < 1 sec 109 1 sec 1015 13 days 1021 37 trillion years Computers double in speed every 2 years. Let’s just wait 10 years! 37 trillion years -> 37 billion years! Snowplows and Travelling Salesperson Revisited! Tens of thousands of other known problems go in this cloud!! Snowplow Problem Travelling Salesperson Problem Protein Folding Phylogenetic trees by maximum likelihood Multiple sequence alignment NP-complete problems “I can’t find an efficient algorithm. I guess I’m too dumb.” Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson “I can’t find an efficient algorithm because no such algorithm is possible!” Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson “I can’t find an efficient algorithm, but neither can all these famous people.” Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson $1 million Vinay Deolalikar Coping with NP-completeness… • Brute force • Ad hoc Heuristics • Meta heuristics • Approximation algorithms Obligate Mutualism of Figs and Fig Wasps ovipostor From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004 The Cophylogeny Problem… Host tree Parasite tree e d a b c The Cophylogeny Problem Host tree Parasite tree e d a Tips associations b c Possible Solutions Input e e d d a b c a b c Event Cost Model cospeciation e e d cospeciation cospeciation d a b c a b c Event Cost Model duplication e duplication e d d a b c a b c Event Cost Model host-switch e e d host-switch d a b c a b c Event Cost Model loss e loss loss e d loss loss d a b c a b c Event Cost Model Cost = cospeciation + host-switch + loss Cost = duplication + cospeciation + 3 * loss e e d cospeciation loss loss duplication cospeciation loss loss host-switch d a b c a b c Some typical costs Cost = 8 Cost = 5 e duplication +2 d cospeciation +0 loss +2 loss +2 a e loss +2 loss +2 b c a cospeciation +0 host-switch +3 d b c This problem is hard! • How hard? NP-complete! • The host-switches are the culprits (Joint work with Charleston, Ovadia, Conow, Fielder) h e f g Existing Methods TreeMap Tarzan/CoRe-PA Technique Brute force Ignore timing incompatibilities Solution Optimal Can be BETTER than optimal! Running Time Exponential Polynomial, Very fast Tree Builder No Yes Solution Viewer Yes Yes A Metaheuristic Approach • Fix a timing • We can solve the problem optimally for a given timing using Dynamic Programming Compute Cost[a,su,2] parasite a r t=0 b s t=1 c a t t=2 t=3 t=4 u v w x y Dynamic Programming Compute Cost[a,su,2] r t=0 b s t=1 c a t t=2 b Cost[b,tw,3] t=3 t=4 parasite a v w x u y c Cost[c,y,4] Dynamic Programming Compute Cost[a,su,2] r t=0 b s t=1 c a host-switch losst t=2 b Cost[b,tw,3] t=3 t=4 parasite a v w x u loss y c Cost[c,y,4] Dynamic Programming Candidate for Cost[a,su,2]: Cost[b, tw, 3] + Cost[c, uy, 4] + 2 * loss + host-switch r t=0 s t=1 a host-switch losst t=2 b Cost[b,tw,3] t=3 t=4 v w x u loss y c Cost[c,y,4] Dynamic Programming Running Time • O(n3) cells to fill in • O(n2) positions for first child • O(n2) positions for second child • O(n) to count #losses from each child, but this is precomputable O(n3 x (n2 x n2)) = O(n7) total Dynamic Programming Running Time • O(n3) cells to fill in • O(n2) positions for first child • O(n2) positions for second child • O(n) to count #losses from each child, but this is precomputable O(n3 x (n2 x n2)) = O(n7) total Can be improved to O(n3) Genetic Algorithm Existing Software TreeMap Tarzan/CoRe-PA Jane 2 Technique Brute force DP, Ignore timing incompatibilities Genetic algorithm DP Solution Optimal Can be BETTER than optimal! Sometimes suboptimal Running Time Exponential Polynomial, Very fast Polynomial, a lot faster! Can control running time Tree Builder No Yes No, but Jane 2 can read CoRe-PA’s trees Solution Viewer Yes Yes Yes Also Interactive The Fig/Wasp Challenge Results The Fig/Wasp Dataset… Randomly Generated Problem Instances Original Problem Instance Paper recently completed… 30 Coauthors 18 Institutes 10 Countries Results Results Demo Future Work… • • • • One parasite, many hosts (“failure to diverge”) Reticulate phylogenies Multifurcations Suggestions? Questions/Comments