Report

A Simplified View of DCJ-Indel Distance
Phillip Compeau
University of California-San Diego, Department of Mathematics

Abstract
• Braga et al., 2010: Solved the problem of DCJ-indel sorting in linear time.
• Goals:
  1. “Hardwire” DCJ sorting into DCJ-indel sorting.
  2. Characterize the solution space for DCJ-indel sorting.
• The DCJ solution space is known (Braga and Stoye, 2010).

Section 1: Preliminaries
1. Preliminaries
2. Encoding Indels as DCJs
3. DCJ-Indel Sorting
4. The Solution Space of DCJ-Indel Sorting
5. Conclusion

The Discrete Genome
• Genome (Π): formed of two matchings
  • genes g(Π): each numbered gene has a head and a tail.
  • adjacencies a(Π): a blue matching on V(g(Π))
• Chromosome: component of Π (alternating path or cycle)
• Linear or circular depending on whether the component is a path or a cycle of Π
• Telomere: path endpoint of Π; has null adjacency {v, Ø}
[Figure: example genomes Π and Γ]

The Double-Cut-and-Join Operation
• Double-cut-and-join operation (DCJ; Yancopoulos et al., 2005): “cuts” the genome in two places and rejoins adjacencies. A DCJ on Π uses one or two adjacencies of Π via one of the following four operations to produce a new genome:
  1. {v, w}, {x, y} → {v, x}, {w, y}
  2. {v, w}, {x, Ø} → {v, x}, {w, Ø}
  3. {v, Ø}, {w, Ø} → {v, w}
  4. {v, w} → {v, Ø}, {w, Ø}
• DCJ Distance (d_DCJ(Π, Γ)): minimum # of DCJs required to transform Π into Γ (having the same genes).

The DCJ Incorporates Many Operations
[Figure, panels (a)-(c): rearrangements realized by DCJs, including reversals, affix reversals, translocations, fusions (#3), fissions (#4), excisions, integrations, circularizations (#3), and linearizations (#4); the numbers refer to the DCJ operations listed above.]

The Breakpoint Graph
• B(Π, Γ) is formed from the adjacencies of Π (blue) and Γ (red).
• B(Π, Γ) also comprises (alternating) red-blue paths and cycles.
• The length of a component of B(Π, Γ) is its number of edges; an isolated vertex is a path of length 0.

DCJ Distance Formula
• Bergeron et al., 2006: If Π and Γ share the same genes, then the DCJ distance is given by the following formula:

$$d_{DCJ}(\Pi, \Gamma) = N - c(\Pi, \Gamma) - \frac{p_{even}(\Pi, \Gamma)}{2}$$

• N = # of genes
• c(Π, Γ) = # of cycles in B(Π, Γ)
• p_even(Π, Γ) = # of even paths in B(Π, Γ)

Indels and the DCJ-Indel Distance
• Indel: The insertion or deletion of a chromosome or chromosomal interval (consecutive genes).
• Assumption: we can’t remove a gene common to Π and Γ.
[Figure: examples of indels]
• DCJ-Indel Distance (d^ind_DCJ(Π, Γ)): minimum # of DCJs and indels required to transform Π into Γ.
• Braga et al., 2010: Solve DCJ-indel sorting in linear time.
• Lots of cases…can we simplify it?

Section 2: Encoding Indels as DCJs

Deletion as a DCJ Creating a Circular Chromosome
• Ma et al., 2009: View deletion as formation and removal of a circular chromosome.
[Figure: deletions viewed as a DCJ that forms a circular chromosome, which is then removed]
• Idea: Indel = DCJ creating circular chromosome
• Wait…what about the deletion of circular chromosomes?

Apparent Exceptions
• Apparent Exception #1: Two deleted circular chromosomes are created from a single DCJ.
[Figure: two circular chromosomes created by a single DCJ and then deleted (3 operations) vs. a single deletion (1 operation)]
• Apparent Exception #2: A deleted circular chromosome is never involved in a DCJ.
• Circular singleton of Π: A circular chromosome of Π that shares no genes with Γ.
• Question: Can we delete all circular singletons first? YES!

Handling Circular Singletons
• Proposition: When transforming Π into Γ via a minimum collection of DCJs and indels, no gene belonging to a circular singleton of Π can ever appear in the same chromosome as a gene of Γ.
• Corollary 1: If Π* is formed from Π by removing a circular singleton from Π, then d^ind_DCJ(Π*, Γ) = d^ind_DCJ(Π, Γ) – 1.
• Let sing(Π, Γ) = # of circular singletons of Π and Γ.
• Corollary 2: If Π0 and Γ0 are formed by removing all circular singletons from Π and Γ, then d^ind_DCJ(Π, Γ) = d^ind_DCJ(Π0, Γ0) + sing(Π, Γ).

A Novel View of DCJ-Indel Distance
• WLOG we may henceforth assume sing(Π, Γ) = 0.
• A completion of Π is a genome Π’ such that:
  • g(Π’) = g(Π) ∪ g(Γ)
  • a(Π’) = a(Π) ∪ perfect matching on V(Π’) – V(Π)
• New chromosomes of Π’ are circular: the indels of Π’.
• A completion of the pair (Π, Γ) is a pair (Π’, Γ’) in which Π’ and Γ’ are completions of Π and Γ, respectively.
• Theorem:

$$d^{ind}_{DCJ}(\Pi, \Gamma) = \min_{(\Pi', \Gamma')} d_{DCJ}(\Pi', \Gamma') = N - \max_{(\Pi', \Gamma')} \left[ c(\Pi', \Gamma') + \frac{p_{even}(\Pi', \Gamma')}{2} \right]$$

  where the minimum and maximum are taken over all completions (Π’, Γ’) of (Π, Γ).
• An optimal completion (Π*, Γ*) is one that attains this optimum.
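Once a completion (Π’, Γ’) is fixed, both genomes have the same genes, so the closed formula N − c − p_even/2 applies to it directly. The following is a minimal sketch of that computation, not the author's implementation. The function name `dcj_distance` and the input encoding are assumptions made for illustration: each gene extremity is a tuple `(gene, "t")` or `(gene, "h")`, each genome is a list of adjacencies (pairs of extremities), and telomeres are simply the extremities that appear in no adjacency.

```python
from collections import deque


def dcj_distance(genes, adj_pi, adj_gamma):
    """Evaluate N - c - p_even/2 on the breakpoint graph of two genomes.

    Hypothetical encoding: `genes` lists the shared gene labels; `adj_pi`
    and `adj_gamma` are lists of adjacencies, each a pair of extremities
    such as ((1, "h"), (2, "t")).
    """
    vertices = [(g, end) for g in genes for end in ("t", "h")]

    # blue = adjacencies of Pi, red = adjacencies of Gamma (both matchings).
    blue, red = {}, {}
    for matching, adjacencies in ((blue, adj_pi), (red, adj_gamma)):
        for v, w in adjacencies:
            matching[v] = w
            matching[w] = v

    seen = set()
    cycles = even_paths = 0
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        n_vertices = degree_sum = 0
        while queue:
            v = queue.popleft()
            n_vertices += 1
            for matching in (blue, red):
                w = matching.get(v)
                if w is not None:
                    degree_sum += 1
                    if w not in seen:
                        seen.add(w)
                        queue.append(w)
        n_edges = degree_sum // 2
        if n_edges == n_vertices:
            cycles += 1          # every vertex has a blue and a red edge
        elif n_edges % 2 == 0:
            even_paths += 1      # includes isolated vertices (length 0)

    # p_even is even when the genomes share the same genes.
    return len(genes) - cycles - even_paths // 2


# Linear chromosome (1 2) versus (1 -2): a single reversal.
pi = [((1, "h"), (2, "t"))]
gamma = [((1, "h"), (2, "h"))]
print(dcj_distance([1, 2], pi, gamma))  # prints 1
```

A component is counted as a cycle when it has as many edges as vertices, which also covers the two-edge cycle formed when Π and Γ share an adjacency.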
Section 3: DCJ-Indel Sorting

Open Vertices
• π-open vertex: vertex not found in Π (must be matched in Π’)
• A path endpoint in B(Π, Γ) must be π-open/γ-open or a telomere (or both).
• Define {π, π}-paths, {π, γ}-paths, and π-paths in B(Π, Γ) by their endpoints: a {π, π}-path has two π-open endpoints, a {π, γ}-path has one π-open and one γ-open endpoint, and a π-path has one π-open endpoint and one telomere (likewise for γ).
• Idea: Construct B(Π*, Γ*) from B(Π, Γ) by matching vertices.

Necessary Conditions for B(Π*, Γ*)
• Lemma 1: If (Π*, Γ*) is an optimal completion of (Π, Γ), then every {π, π}-path ({γ, γ}-path) of length 2k – 1 in B(Π, Γ) embeds into a cycle of length 2k in B(Π*, Γ*).
[Figure: two {π, π}-paths linked into one cycle in B(Π’, Γ’) vs. each closed into its own cycle in B(Π’’, Γ’); d_DCJ(Π’’, Γ’) < d_DCJ(Π’, Γ’)]
• Remaining components of B(Π*, Γ*):
  • bracelet: cycle linking {π, γ}-paths
  • chain: path linking π-paths/γ-paths via intermediate {π, γ}-paths
[Figure: a 3-chain, a 2-bracelet, and a 2-chain]
• Lemma 2: B(Π*, Γ*) can contain only 2-bracelets, 2-chains, and 3-chains.
[Figure: a longer component in B(Π’, Γ’) through paths P1 and P2 vs. a shorter component plus a cycle in B(Π’’, Γ’); d_DCJ(Π’’, Γ’) < d_DCJ(Π’, Γ’)]
• Lemma 3: B(Π*, Γ*) cannot have one 2-chain joining two odd π-paths and another 2-chain joining two even π-paths. The same holds for γ-paths.
[Figure: in B(Π’, Γ’), P1 and P2 (odd) form one 2-chain and P3 and P4 (even) another; relinking them in B(Π’’, Γ’) yields two even paths; d_DCJ(Π’’, Γ’) < d_DCJ(Π’, Γ’)]

Sorting Algorithm
1. Remove all circular singletons of Π and Γ.
2. (Lemma 1) Close every {π, π}-path ({γ, γ}-path) into a cycle by adding a single new adjacency to Π* (Γ*).
3. Form a maximum set of 2-bracelets (only chains remaining).
4. Form a maximum set of even 2-chains by linking pairs of π-paths (γ-paths) having opposite parity (Lemma 3).
5. If p^{π,γ} is odd, then link the remaining {π, γ}-path with any remaining π-path and γ-path.
6. Arbitrarily link pairs of remaining π-paths, all of which have the same parity. Do the same for any γ-paths remaining.

DCJ-Indel Distance
• Theorem: The preceding algorithm solves DCJ-indel sorting in linear time, and it implies a DCJ-indel distance formula. For pairs (Π, Γ) having sing(Π, Γ) = 0:

$$d^{ind}_{DCJ}(\Pi, \Gamma) = N - \left[ c + p^{\pi,\pi} + p^{\gamma,\gamma} + \left\lfloor \frac{p^{\pi,\gamma}}{2} \right\rfloor + \frac{1}{2}\left( p^{0}_{even} + \min\{p^{\pi}_{odd}, p^{\pi}_{even}\} + \min\{p^{\gamma}_{odd}, p^{\gamma}_{even}\} + \delta \right) \right]$$

  where δ = 1 only if p^{π,γ} is odd and either:
  1. p^π_odd > p^π_even, p^γ_odd > p^γ_even; or
  2. p^π_odd < p^π_even, p^γ_odd < p^γ_even.
  Otherwise, δ = 0.
• Here c counts the cycles of B(Π, Γ); p^{π,π}, p^{γ,γ}, and p^{π,γ} count the paths of each type; p^π_odd and p^π_even count the odd and even π-paths (likewise for γ); and p^0_even counts the even paths with no open endpoint.
• Proof idea: construct an optimal completion (Π*, Γ*) having

$$c(\Pi^*, \Gamma^*) = c + p^{\pi,\pi} + p^{\gamma,\gamma} + \left\lfloor \frac{p^{\pi,\gamma}}{2} \right\rfloor$$

$$p_{even}(\Pi^*, \Gamma^*) = p^{0}_{even} + \min\{p^{\pi}_{odd}, p^{\pi}_{even}\} + \min\{p^{\gamma}_{odd}, p^{\gamma}_{even}\} + \delta$$

Section 4: The Solution Space of DCJ-Indel Sorting
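Looking back at the Section 3 distance theorem, the formula can be evaluated directly once the path statistics of B(Π, Γ) are known. The sketch below only transcribes that formula; it is not the author's code. The function name and parameter names are hypothetical, the statistics are assumed to have been counted already (with circular singletons removed), and the numbers in the example call are made up to exercise the δ = 1 case rather than taken from real genomes.

```python
def dcj_indel_distance(n_genes, cycles, p_pipi, p_gg, p_pg,
                       p0_even, pi_odd, pi_even, g_odd, g_even):
    """Evaluate the DCJ-indel distance formula from path statistics.

    Hypothetical parameter names:
      cycles           number of cycles in B(Pi, Gamma)
      p_pipi, p_gg     numbers of {pi, pi}-paths and {gamma, gamma}-paths
      p_pg             number of {pi, gamma}-paths
      p0_even          number of even paths with no open endpoint
      pi_odd, pi_even  numbers of odd and even pi-paths
      g_odd, g_even    numbers of odd and even gamma-paths
    Assumes sing(Pi, Gamma) = 0.
    """
    # delta = 1 only if p_pg is odd and the pi- and gamma-paths are
    # skewed toward the same parity.
    delta = 0
    if p_pg % 2 == 1:
        both_mostly_odd = pi_odd > pi_even and g_odd > g_even
        both_mostly_even = pi_odd < pi_even and g_odd < g_even
        if both_mostly_odd or both_mostly_even:
            delta = 1

    # Cycles and even paths of the optimal completion's breakpoint graph.
    c_star = cycles + p_pipi + p_gg + p_pg // 2
    p_even_star = (p0_even + min(pi_odd, pi_even)
                   + min(g_odd, g_even) + delta)

    return n_genes - c_star - p_even_star / 2


# Made-up statistics: p_pg is odd and both path families are mostly odd,
# so delta = 1, c* = 5, p_even* = 4.
print(dcj_indel_distance(10, cycles=2, p_pipi=1, p_gg=1, p_pg=3,
                         p0_even=1, pi_odd=2, pi_even=1,
                         g_odd=3, g_even=1))  # prints 3.0
```

Keeping c* and p_even* as separate intermediate values mirrors the two quantities that the proof sketch constructs for the optimal completion.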
Encompassing all Possible Cases
• The solution space is known for DCJ sorting (Braga and Stoye, 2010).
• Thus, we only need to find all optimal completions, and the specific operations will fall out in the wash.

Handling Circular Singletons
• The circular singletons of Π must be removed in sing(Π) steps. We have two options:
  1. Delete all the circular singletons of Π.
  2. Perform k “fusion” DCJs followed by sing(Π) – k chromosome deletions.
• This poses a straightforward (yet tedious) counting problem.

Adding Necessary Conditions on B(Π*, Γ*)
• Proposition 1: Every π-path embedding into a 3-chain of an optimal completion must have the same parity.
• Proposition 2: If p^{π,γ} is even, then B(Π*, Γ*) must contain a maximum collection of even 2-chains.
• Proofs are slightly more involved…

Finishing the Job
• Four cases, depending on path statistics.
  1. p^{π,γ} is odd:
     a) p^π_odd > p^π_even, p^γ_odd > p^γ_even (or vice-versa); δ = 1
     b) p^π_odd > p^π_even, p^γ_odd < p^γ_even (or vice-versa); δ = 0
  2. p^{π,γ} is even:
     a) p^π_odd > p^π_even, p^γ_odd > p^γ_even (or vice-versa); δ = 0
     b) p^π_odd > p^π_even, p^γ_odd < p^γ_even (or vice-versa); δ = 0
• These cases are tedious but straightforward and can be handled similarly.

Section 5: Conclusion

Future Work
• Correspondence with Braga et al., 2010?
• Varying the indel cost?
  • Charge indel cost ≤ DCJ cost, take minimum total cost.
  • Most of the simplifying sorting lemmas hold, but actually computing the minimum cost appears difficult in this model.
• The problem is solved! (under framework of Braga et al., 2010)

Questions?

Shameless Plug
• www.rosalind.info
• A novel education website that teaches bioinformatics through programming exercises.
• Has a “professor” environment for assigning programming exercises to your bioinformatics classes.