Report

Computing Reversed Lempel-Ziv Factorization Online Shiho Sugimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda Kyushu University, Japan Everything is String. HABATAKITAI Laboratory Outline • Reversed LZ factorization without self-references (RLZ) • Online RLZ algorithm by Kolpakov and Kucherov • New online RLZ algorithm using O(n log σ) bits of space • Reversed LZ factorization with self-references (RLZS) • New online RLZS algorithm using O(n log n) bits of space • New online RLZS algorithm using O(n log σ) bits of space n : the length of input string σ : the alphabet size Everything is String. HABATAKITAI Laboratory Background • LZ factorization was proposed in 1977 [Ziv & Lempel, 1977]. – data compression etc. • Reversed LZ factorization (RLZ in short) was proposed in 2009 [Kolpakov & Kucherov, 2009]. – finding gapped palindromes etc. Everything is String. HABATAKITAI Laboratory LZ factorization without self-references [Ziv & Lempel, 1977] LZ factorization without self-references of string w of length n is a factorization s1,s2,...,sm such that • w = s1 s2…sm • si is the longest non-empty prefix of w[|s1…si−1|+1..n] that is also a substring of w[1.. | s1…si−1|] if such exists • si = w[|s1…si−1|+1] otherwise Everything is String. HABATAKITAI Laboratory LZ factorization without self-references [Ziv & Lempel, 1977] LZ factorization without self-references of string w of length n is a factorization s1,s2,...,sm such that • w = s1 s2…sm • si is the longest non-empty prefix of w[|s1…si−1|+1..n] that is also a substring of w[1.. | s1…si−1|] if such exists • si = w[|s1…si−1|+1] otherwise s1 s2 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory LZ factorization without self-references [Ziv & Lempel, 1977] LZ factorization without self-references of string w of length n is a factorization s1,s2,...,sm such that • w = s1 s2…sm • si is the longest non-empty prefix of w[|s1…si−1|+1..n] that is also a substring of w[1.. | s1…si−1|] if such exists • si = w[|s1…si−1|+1] otherwise s1 s2 s3 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory LZ factorization without self-references [Ziv & Lempel, 1977] LZ factorization without self-references of string w of length n is a factorization s1,s2,...,sm such that • w = s1 s2…sm • si is the longest non-empty prefix of w[|s1…si−1|+1..n] that is also a substring of w[1.. | s1…si−1|] if such exists • si = w[|s1…si−1|+1] otherwise s1 s2 s3 s4 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory LZ factorization without self-references [Ziv & Lempel, 1977] LZ factorization without self-references of string w of length n is a factorization s1,s2,...,sm such that • w = s1 s2…sm • si is the longest non-empty prefix of w[|s1…si−1|+1..n] that is also a substring of w[1.. | s1…si−1|] if such exists • si = w[|s1…si−1|+1] otherwise s1 s2 s3 s4 s5 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory LZ factorization without self-references [Ziv & Lempel, 1977] LZ factorization without self-references of string w of length n is a factorization s1,s2,...,sm such that • w = s1 s2…sm • si is the longest non-empty prefix of w[|s1…si−1|+1..n] that is also a substring of w[1.. | s1…si−1|] if such exists • si = w[|s1…si−1|+1] otherwise s1 s2 s3 s4 s5 s6 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory LZ factorization without self-references [Ziv & Lempel, 1977] LZ factorization without self-references of string w of length n is a factorization s1,s2,...,sm such that • w = s1 s2…sm • si is the longest non-empty prefix of w[|s1…si−1|+1..n] that is also a substring of w[1.. | s1…si−1|] if such exists • si = w[|s1…si−1|+1] otherwise s1 s2 s3 s4 s5 s6 s7 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory LZ factorization without self-references [Ziv & Lempel, 1977] LZ factorization without self-references of string w of length n is a factorization s1,s2,...,sm such that • w = s1 s2…sm • si is the longest non-empty prefix of w[|s1…si−1|+1..n] that is also a substring of w[1.. | s1…si−1|] if such exists • si = w[|s1…si−1|+1] otherwise s1 s2 s3 s4 s5 s6 s7 s8 s 9 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory Reversed LZ factorization without self-references (RLZ) [Kolpakov & Kucherov, 2009] RLZ without self-references of string w of length n is a factorization f1,f2,...,fm such that • w = f1 f2…fm • fi is the longest non-empty prefix of w[|f1...fi−1|+1..n] that is also a substring of w[1.. | f1...fi−1|]R if such exists reversed • fi = w[|f1...fi−1|+1] otherwise Everything is String. HABATAKITAI Laboratory Reversed LZ factorization without self-references (RLZ) [Kolpakov & Kucherov, 2009] RLZ without self-references of string w of length n is a factorization f1,f2,...,fm such that • w = f1 f2…fm • fi is the longest non-empty prefix of w[|f1...fi−1|+1..n] that is also a substring of w[1.. | f1...fi−1|]R if such exists reversed • fi = w[|f1...fi−1|+1] otherwise f1 f2 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory Reversed LZ factorization without self-references (RLZ) [Kolpakov & Kucherov, 2009] RLZ without self-references of string w of length n is a factorization f1,f2,...,fm such that • w = f1 f2…fm • fi is the longest non-empty prefix of w[|f1...fi−1|+1..n] that is also a substring of w[1.. | f1...fi−1|]R if such exists reversed • fi = w[|f1...fi−1|+1] otherwise f1 f2 f3 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory Reversed LZ factorization without self-references (RLZ) [Kolpakov & Kucherov, 2009] RLZ without self-references of string w of length n is a factorization f1,f2,...,fm such that • w = f1 f2…fm • fi is the longest non-empty prefix of w[|f1...fi−1|+1..n] that is also a substring of w[1.. | f1...fi−1|]R if such exists reversed • fi = w[|f1...fi−1|+1] otherwise f1 f2 f3 f4 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory Reversed LZ factorization without self-references (RLZ) [Kolpakov & Kucherov, 2009] RLZ without self-references of string w of length n is a factorization f1,f2,...,fm such that • w = f1 f2…fm • fi is the longest non-empty prefix of w[|f1...fi−1|+1..n] that is also a substring of w[1.. | f1...fi−1|]R if such exists reversed • fi = w[|f1...fi−1|+1] otherwise f1 f2 f3 f4 f5 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory Reversed LZ factorization without self-references (RLZ) [Kolpakov & Kucherov, 2009] RLZ without self-references of string w of length n is a factorization f1,f2,...,fm such that • w = f1 f2…fm • fi is the longest non-empty prefix of w[|f1...fi−1|+1..n] that is also a substring of w[1.. | f1...fi−1|]R if such exists reversed • fi = w[|f1...fi−1|+1] otherwise f1 f2 f3 f4 f5 f6 f7 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory KK algorithm [Kolpakov & Kucherov, 2009] • Computes RLZ in an online manner • Works in O(n log n) bits of space and O(n log σ) time (on a word RAM model). – Constructs suffix tree for reversed prefixes online. – Computes RLZ factors from suffix tree. – Blumer’s version of Weiner’s algorithm achieves above complexity [Blumer et al, 1985] [Weiner, 1973]. Everything is String. HABATAKITAI Laboratory KK algorithm f1 [Kolpakov & Kucherov, 2009] Ex）w = a b b a a a a b b b a c Stree(ε) Everything is String. HABATAKITAI Laboratory KK algorithm f1 f2 [Kolpakov & Kucherov, 2009] Ex）w = a b b a a a a b b b a c Stree(aR) a Everything is String. HABATAKITAI Laboratory KK algorithm f1 f2 [Kolpakov & Kucherov, 2009] Ex）w = a b b a a a a b b b a c Stree((ab)R) a Everything is String. b a HABATAKITAI Laboratory KK algorithm f1 f2 f3 [Kolpakov & Kucherov, 2009] Ex）w = a b b a a a a b b b a c Stree((ab)R) a Everything is String. b a HABATAKITAI Laboratory KK algorithm f1 f2 f3 f4 [Kolpakov & Kucherov, 2009] Ex）w = a b b a a a a b b b a c Stree((abba)R) a Everything is String. b b a b a b a HABATAKITAI Laboratory KK algorithm f1 f2 f3 f4 f5 [Kolpakov & Kucherov, 2009] Ex）w = a b b a a a a b b b a c Stree((aabba)R) a a Everything is String. a b b b b a b a b a HABATAKITAI Laboratory KK algorithm f1 f2 f3 f4 f5 [Kolpakov & Kucherov, 2009] Ex）w = a b b a a a a b b b a c Stree((aabba)R) This suffix tree requires O(n log n) bits of space a a a b b b b a b a b a We propose a new online RLZ algorithm which uses only O(n log σ) bits of space. (σ≦n is the alphabet size) Everything is String. HABATAKITAI Laboratory For O(n log σ) bits of space • We utilize the idea of Starikovskaya’s algorithm. – It computes LZ factorization online in O(n log σ) bits of space and O(n log2n) time [Starikovskaya, 2012]. • We divide input string into blocks of length r = O(logσn). – Each block is replaced by a meta-character. Everything is String. HABATAKITAI Laboratory For O(n log σ) bits of space • We utilize the idea of Starikovskaya’s algorithm. – It computes LZ factorization online in O(n log σ) bits of space and O(n log2n) time [Starikovskaya, 2012]. • We divide input string into blocks of length r = O(logσn). – Each block is replaced by a meta-character. Ex）w = a b b a a a a b b b a c ……… r=3 B Everything is String. A B C ……… HABATAKITAI Laboratory For O(n log σ) bits of space • We utilize the idea of Starikovskaya’s algorithm. – It computes LZ factorization online in O(n log σ) bits of space and O(n log2n) time [Starikovskaya, 2012]. • We divide input string into blocks of length r = O(logσn). – Each block is replaced by a meta-character. Ex）w = a b b a a a a b b b a c ……… r=3 B Everything is String. A B C ……… HABATAKITAI Laboratory Our online RLZ algorithm • For fi of length shorter than r, we use suffix trie of reversed subwords of length 2r. – can find fi in o(n) bits of space and O(|fi| log σ) time. • For fi of length at least r, we use suffix tree of reversed blocks (meta-characters). – can find fi in O(n log σ) bits of space and O(|fi| log2n) time. Everything is String. HABATAKITAI Laboratory Our online RLZ algorithm • For fi of length shorter than r, we use suffix trie of reversed subwords of length 2r. – can find fi in o(n) bits of space and O(|fi| log σ) time. • For fi of length at least r, we use suffix tree of reversed blocks (meta-characters). – can find fi in O(n log σ) bits of space and O(|fi| log2n) time. Theorem We can compute RLZ without self-references online in O(n log σ) bits of space and O(n log2n) time. Everything is String. HABATAKITAI Laboratory Outline • Reversed LZ factorization without self-references (RLZ) • Online RLZ algorithm by Kolpakov and Kucherov • New online RLZ algorithm using O(n log σ) bits of space • Reversed LZ factorization with self-references (RLZS) • New online RLZS algorithm using O(n log n) bits of space • New online RLZS algorithm using O(n log σ) bits of space n : the length of input string σ : the alphabet size Everything is String. HABATAKITAI Laboratory LZ factorization with self-references [Ziv & Lempel, 1977] LZ factorization with self-references of string w of length n is a factorization t1,t2,...,tm such that • w = t1 t2…tm • ti is the longest non-empty prefix of w[|t1…ti−1|+1..n] that is also a substring of w[1.. | t1…ti|-1] if such exists self-reference • ti = w[|t1…ti−1|+1] otherwise. Everything is String. HABATAKITAI Laboratory LZ factorization with self-references [Ziv & Lempel, 1977] LZ factorization with self-references of string w of length n is a factorization t1,t2,...,tm such that • w = t1 t2…tm • ti is the longest non-empty prefix of w[|t1…ti−1|+1..n] that is also a substring of w[1.. | t1…ti|-1] if such exists self-reference • ti = w[|t1…ti−1|+1] otherwise. t1 t2 t3 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory LZ factorization with self-references [Ziv & Lempel, 1977] LZ factorization with self-references of string w of length n is a factorization t1,t2,...,tm such that • w = t1 t2…tm • ti is the longest non-empty prefix of w[|t1…ti−1|+1..n] that is also a substring of w[1.. | t1…ti|-1] if such exists self-reference • ti = w[|t1…ti−1|+1] otherwise. t1 t2 t3 t4 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory LZ factorization with self-references [Ziv & Lempel, 1977] LZ factorization with self-references of string w of length n is a factorization t1,t2,...,tm such that • w = t1 t2…tm • ti is the longest non-empty prefix of w[|t1…ti−1|+1..n] that is also a substring of w[1.. | t1…ti|-1] if such exists self-reference • ti = w[|t1…ti−1|+1] otherwise. t1 t2 t3 t4 t5 t6 t7 t8 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory Reversed LZ factorization with self-references RLZ with self-references (RLZS) of string w of length n is a factorization g1,g2,...,gm such that • w = g1 g2…gm • gi is the longest non-empty prefix of w[|g1...gi−1|+1..n] that is also a substring of w[1.. | g1…gi|-1]R if such exists self-reference • gi = w[|g1…gi−1|+1] otherwise. Everything is String. HABATAKITAI Laboratory Reversed LZ factorization with self-references RLZ with self-references (RLZS) of string w of length n is a factorization g1,g2,...,gm such that • w = g1 g2…gm • gi is the longest non-empty prefix of w[|g1...gi−1|+1..n] that is also a substring of w[1.. | g1…gi|-1]R if such exists self-reference • gi = w[|g1…gi−1|+1] otherwise. g1 g2 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory Reversed LZ factorization with self-references RLZ with self-references (RLZS) of string w of length n is a factorization g1,g2,...,gm such that • w = g1 g2…gm • gi is the longest non-empty prefix of w[|g1...gi−1|+1..n] that is also a substring of w[1.. | g1…gi|-1]R if such exists self-reference • gi = w[|g1…gi−1|+1] otherwise. g1 g2 g3 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory Reversed LZ factorization with self-references RLZ with self-references (RLZS) of string w of length n is a factorization g1,g2,...,gm such that • w = g1 g2…gm • gi is the longest non-empty prefix of w[|g1...gi−1|+1..n] that is also a substring of w[1.. | g1…gi|-1]R if such exists self-reference • gi = w[|g1…gi−1|+1] otherwise. g1 g2 g3 g4 g5 Ex）w = a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory online computation of RLZS Ex）w = a b b a a a a b b b a c w[1..1] = a Everything is String. HABATAKITAI Laboratory online computation of RLZS Ex）w = a b b a a a a b b b a c w[1..1] = a w[1..2] = a b Everything is String. HABATAKITAI Laboratory online computation of RLZS Ex）w = a b b a a a a b b b a c w[1..1] = a w[1..2] = a b w[1..3] = a b b Everything is String. HABATAKITAI Laboratory online computation of RLZS Ex）w = a w[1..1] = a w[1..2] = a w[1..3] = a w[1..4] = a Everything is String. b b a a a a b b b a c b b b b b a HABATAKITAI Laboratory online computation of RLZS Ex）w = a w[1..1] = a w[1..2] = a w[1..3] = a w[1..4] = a Everything is String. b b a a a a b b b a c b b b b b a HABATAKITAI Laboratory online computation of RLZS Ex）w = a w[1..1] = a w[1..2] = a w[1..3] = a w[1..4] = a w[1..5] = a w[1..6] = a w[1..7] = a w[1..8] = a w[1..9] = a w[1..10] = a w[1..11] = a w[1..12] = a Everything is String. b b a a a a b b b a c b b b b b b b b b b b b b b b b b b b b b a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a b b b b b b b b b b a b b a c HABATAKITAI Laboratory Reversed LZ factorization with self-references Every self-referencing factor is a suffix of a palindrome. g1 g2 g3 g4 g5 Ex）w = a b b a a a a b b b a c palindrome Everything is String. HABATAKITAI Laboratory Reversed LZ factorization with self-references Every self-referencing factor is a suffix of a palindrome. g1 g2 g3 g4 g5 Ex）w = a b b a a a a b b b a c palindrome Everything is String. HABATAKITAI Laboratory online RLZS in O(nlogn) bits of space We can compute each RLZS factor gi by • using KK algorithm, and – In a total of O(n log n) bits of space and O(n log σ) time. • computing the longest palindrome which ends at each position, online – In a total of O(n log n) bits of space and O(n) time, by modifying Manachar’s algorithm [Manacher, 1975]. Theorem We can compute RLZS online in O(n log n) bits of space and O(n log σ) time. Everything is String. HABATAKITAI Laboratory Outline • Reversed LZ factorization without self-references (RLZ) • Online RLZ algorithm by Kolpakov and Kucherov • New online RLZ algorithm using O(n log σ) bits of space • Reversed LZ factorization with self-references (RLZS) • New online RLZS algorithm using O(n log n) bits of space • New online RLZS algorithm using O(n log σ) bits of space n : the length of input string σ : the alphabet size Everything is String. HABATAKITAI Laboratory Suffix palindromes • All suffix palindromes of a string of length n can be presented by O(log n) arithmetic progressions [Apostolico,1995]. Everything is String. HABATAKITAI Laboratory Suffix palindromes • All suffix palindromes of a string of length n can be presented by O(log n) arithmetic progressions [Apostolico,1995]. Ex) w = a b a b a c a b a b a d a b a b a c a b a b a Everything is String. HABATAKITAI Laboratory Suffix palindromes • All suffix palindromes of a string of length n can be presented by O(log n) arithmetic progressions [Apostolico,1995]. Ex) w = a b a b a c a b a b a d a b a b a c a b a b a Everything is String. HABATAKITAI Laboratory online computation of suffix palindromes • What happens to the suffix palindromes when a new character is appended? Ex) w = a b a b a b a b wa = a b a b a b a b a wc = a b a b a b a b c Everything is String. HABATAKITAI Laboratory online computation of suffix palindromes • What happens to the suffix palindromes when a new character is appended? w Everything is String. a x HABATAKITAI Laboratory online computation of suffix palindromes • What happens to the suffix palindromes when a new character is appended? w a x if x = a w Everything is String. a a HABATAKITAI Laboratory online computation of suffix palindromes • What happens to the suffix palindromes when a new character is appended? w b b b b x if x = b w Everything is String. b b b b b HABATAKITAI Laboratory Computing RLZS in O(n log σ) bits of space We can compute each RLZS factor gi by • using our RLZS algorithm, and – In a total of O(n log σ) bits of space and O(n log2n) time. • computing the longest palindrome which ends at each position, online – In a total of O(log2n) bits of space and O(n log n) time. Theorem We can compute RLZS online in O(n log σ) bits of space and O(n log2n) time. Everything is String. HABATAKITAI Laboratory The problems of RLZS • RLZS was too difficult for us to factorize Proof There is a mistake in proceedings of the PSC. Everything is String. HABATAKITAI Laboratory The problems of RLZS • RLZS was too difficult for us to factorize Proof There is a mistake in proceedings of the PSC. p114 a b b a a a a b b b a c a b b a a a a b b b a c Everything is String. HABATAKITAI Laboratory The problems of RLZS • RLZS was too difficult for us to factorize Proof There is a mistake in proceedings of the PSC. Everything is String. HABATAKITAI Laboratory The problems of RLZS • RLZS was too difficult for us to factorize Proof There is a mistake in proceedings of the PSC. • No idea for using RLZS Everything is String. HABATAKITAI Laboratory Conclusion RLZ online algorithms space self -references O(n log n) bits O(n log σ) bits without O(n log σ) time O(n log2n) time [Kolpakov & Kucherov, 2009] with O(n log σ) time O(n log2n) time n : the length of input string σ : the alphabet size Everything is String. HABATAKITAI Laboratory Conclusion RLZ online algorithms space self -references O(n log n) bits O(n log σ) bits without O(n log σ) time O(n log2n) time [Kolpakov & Kucherov, 2009] with O(n log σ) time n : the length of input string σ : the alphabet size Everything is String. O(n log2n) time This work HABATAKITAI Laboratory