Applications of Suffix Trees
Exact matching (m = |T|, n = |P|, k = number of occurrences)
P and T are both known at the same time: Boyer-Moore or suffix trees, O(n+m).
T is known and kept fixed, P varies: suffix trees, O(m) in preprocessing, O(n+k) in searching.
P is known and kept fixed, T varies: Boyer-Moore, O(n) in preprocessing, O(m) in searching.
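Exact set matching (patterns p_i of length n_i with k_i occurrences each; n and k are the totals)
Aho-Corasick: O(m+n+k).
Suffix trees: O(m) for building the suffix tree, O(n_i + k_i) for searching for each p_i, so O(m + Σ n_i + Σ k_i) for all of P, i.e. O(m+n+k).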
Longest common substring of S1 and S2
Build a generalized suffix tree for S1 and S2.
If a leaf is from S1, then mark all its ancestors with 1. If a leaf is from S2, then mark all its ancestors with 2.
The path-label of any node that is marked with both 1 and 2 is a common substring of S1 and S2.
Find the node that is marked with both 1 and 2 and has the greatest string-depth (the number of characters on the path to it).
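The marking idea can be tried out on a small example. The sketch below is only an illustration: it uses an uncompressed suffix trie instead of a real generalized suffix tree, so it takes quadratic space and does not reach the linear bound. The function name `longest_common_substring` and the node layout are assumptions made for this example.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node marked with both 1 and 2 (naive O(m^2) suffix trie)."""
    root = {"children": {}, "marks": set()}
    # Insert every suffix of each string. Marking each node on the way down
    # is the same as marking all ancestors of the leaf for that suffix.
    for ident, s in ((1, s1), (2, s2)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "marks": set()})
                node["marks"].add(ident)
    # Walk only through nodes carrying both marks and keep the deepest one.
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["children"].items():
            if child["marks"] == {1, 2}:
                path = label + ch
                if len(path) > len(best):
                    best = path
                stack.append((child, path))
    return best


print(longest_common_substring("superiorcalifornialives", "sealiver"))
```

In the trie the string-depth of a node is just its depth, so no separate depth pass is needed; in a compressed suffix tree the string-depth would be accumulated from the edge-label lengths.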
O(m) for building the generalized suffix tree.
O(m) for calculating the string-depth of each node (e.g. breadth-first).
O(m) for marking nodes with 1 or 2 (e.g. depth-first).
O(m) for finding the longest.
Contamination sources: human, bacteria.
DNA from dinosaur bone: more similar to human DNA than to bird and crocodilian DNA.
Build a generalized suffix tree for the k strings, giving each string a unique end marker, so that each leaf belongs to only one string.
For a node v, let c(v) be the number of distinct string identifiers that appear in the subtree below it.
V is a vector with V(i) denoting the length of the longest substring that occurs in exactly i strings (and a pointer to the node).
From V(i) compute l(i), the length of the longest substring common to at least i strings: l(k) = V(k), and for i = k-1 down to 2, if V(i) < l(i+1) then l(i) = l(i+1), else l(i) = V(i).
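The last step, turning the "exactly i strings" lengths into "at least i strings" lengths, is a single backward pass. The sketch below assumes V is given as a dictionary from i to V(i), with missing entries treated as 0; the function name `at_least_lengths` and the sample values are hypothetical.

```python
def at_least_lengths(V, k):
    """l[i] = max(V[i], l[i+1]) for i = k-1 down to 2, starting from l[k] = V[k]."""
    l = {k: V.get(k, 0)}
    for i in range(k - 1, 1, -1):
        # A substring common to at least i+1 strings is also common to at least i.
        l[i] = max(V.get(i, 0), l[i + 1])
    return l


# Hypothetical V for k = 4 strings.
print(at_least_lengths({2: 3, 3: 4, 4: 2}, 4))
```

The pass carries the running maximum l(i+1) rather than the raw V(i+1), so a long substring shared by many strings propagates all the way down to smaller i.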
For each node keep a C vector of k bits, with one bit corresponding to each string. The ith bit is set to 1 if a leaf that belongs to the ith string appears below the node.
The C vector of a parent is obtained by ORing the C vectors of its children.
With n nodes, calculating c(v) takes O(kn).
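The bit-vector bookkeeping can be sketched with Python integers used as bit sets. As an assumption for brevity, this example again uses a naive suffix trie rather than a compressed generalized suffix tree, and sets a string's bit at the node where one of its suffixes ends instead of using end markers. The names `Node`, `build_trie`, `fill_bits` and `exact_count_lengths` are made up for the illustration.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.bits = 0  # bit i is set if a suffix of string i ends at or below this node


def build_trie(strings):
    root = Node()
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
            node.bits |= 1 << idx  # this suffix of string idx ends here
    return root


def fill_bits(node, depth, V):
    """Post-order pass: OR the children's bits into the parent, then update V."""
    for child in node.children.values():
        node.bits |= fill_bits(child, depth + 1, V)
    c = bin(node.bits).count("1")  # c(v): number of distinct strings below v
    if depth > 0:
        V[c] = max(V.get(c, 0), depth)
    return node.bits


def exact_count_lengths(strings):
    V = {}
    fill_bits(build_trie(strings), 0, V)
    return V


print(exact_count_lengths(["abc", "abd", "xb"]))
```

For these three strings, "b" occurs in all three, "ab" in exactly two, and "abc" in exactly one, so the resulting V maps 3 to 1, 2 to 2, and 1 to 3.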
If the subtrees under p and q are isomorphic (except for leaf labels) and stringdepth(p) > stringdepth(q), then merge p into q by adding a directed edge from parent(p) to q.
Associate the directed edge with d = stringdepth(q) - stringdepth(p), which is negative.
When searching for P in the text S, let i be a leaf below the path labeled with P. If the directed edge is traversed, then P occurs at i+d; otherwise P occurs at i.
Ukkonen's Algorithm
Suffix links
Let xα denote an arbitrary string, where x denotes a single character and α denotes a (possibly empty) substring.
For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link, denoted (v, s(v)).
The root has no suffix link from it. If α is empty, then the suffix link points to the root.
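To see suffix links on a concrete string, the brute-force sketch below lists the path-labels of the internal nodes of the suffix tree of S$ and maps each one to its suffix-link target. It relies on the fact that a substring is the path-label of an internal node exactly when it is followed by at least two different characters somewhere in S$. This is not how Ukkonen's algorithm builds the links; the function names are assumptions, and S is assumed not to contain "$".

```python
def internal_node_labels(s):
    """Path-labels of internal nodes (root = "") of the suffix tree of s + "$"."""
    text = s + "$"
    followers = {}
    for i in range(len(text)):
        for j in range(i, len(text)):
            # text[i:j] is followed by the character text[j]
            followers.setdefault(text[i:j], set()).add(text[j])
    return {label for label, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map each non-root internal node label x+alpha to the label alpha."""
    labels = internal_node_labels(s)
    links = {}
    for label in labels:
        if label:  # the root has no suffix link from it
            target = label[1:]
            assert target in labels  # the target node always exists
            links[label] = target
    return links


print(suffix_links("xabxac"))
```

For "xabxac" the internal nodes are the root, "a" and "xa"; the link from "xa" goes to "a", and the link from "a" goes to the root because α is empty.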
Either α is a proper suffix of γ, or γ is a proper suffix of α.
There is a directed path of suffix links from one node to the other.
A simple accelerant
L and R are the left and right boundaries of the current search interval. The query is made at position M = (L+R)/2 of Pos.
l: the length of the longest prefix of the suffix at Pos(L) that matches a prefix of P.
r: the length of the longest prefix of the suffix at Pos(R) that matches a prefix of P.
lmr = min{l, r}.
Compare P and the suffix at Pos(M) starting from position lmr+1 of the two strings.
Worst case: O(n log m).
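The simple accelerant can be sketched as a binary search that remembers l and r and skips the first min(l, r) characters at each probe. This is a minimal illustration with 0-based indices: Pos is built by naive sorting (far from linear time), and the names `build_pos`, `lcp_from`, `lower_bound` and `find_all` are assumptions made for the example.

```python
def build_pos(text):
    """Naive suffix array: start positions of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_from(p, text, start, k):
    """Extend the match of p against text[start:], resuming at offset k."""
    while k < len(p) and start + k < len(text) and p[k] == text[start + k]:
        k += 1
    return k


def p_not_greater(p, text, start, k):
    """True if p <= text[start:], given that they agree on the first k chars."""
    return k == len(p) or (start + k < len(text) and p[k] < text[start + k])


def lower_bound(p, text, pos):
    """Smallest index i in pos with text[pos[i]:] >= p (len(pos) if none)."""
    if not pos:
        return 0
    l = lcp_from(p, text, pos[0], 0)
    if p_not_greater(p, text, pos[0], l):
        return 0
    r = lcp_from(p, text, pos[-1], 0)
    if not p_not_greater(p, text, pos[-1], r):
        return len(pos)
    L, R = 0, len(pos) - 1  # invariant: suffix at L < p <= suffix at R
    while R - L > 1:
        M = (L + R) // 2
        # Every suffix between L and R shares min(l, r) characters with p.
        k = lcp_from(p, text, pos[M], min(l, r))
        if p_not_greater(p, text, pos[M], k):
            R, r = M, k
        else:
            L, l = M, k
    return R


def find_all(p, text):
    pos = build_pos(text)
    i = lower_bound(p, text, pos)
    hits = []
    while i < len(pos) and text.startswith(p, pos[i]):
        hits.append(pos[i])
        i += 1
    return sorted(hits)


print(find_all("ana", "banana"))
```

Skipping min(l, r) characters is safe because Pos is sorted: any suffix lying between two suffixes that both agree with P on a prefix must agree with P on that prefix too.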
A super accelerant
Lcp(i,j): the length of the longest common prefix of the suffixes at Pos(i) and Pos(j).
Use Lcp(L,M) and Lcp(M,R).
Suppose l > r:
If Lcp(L,M) > l, then set L = M; l and r are unchanged.
If Lcp(L,M) < l, then set R = M and r = Lcp(L,M).
If Lcp(L,M) = l, then compare P and the suffix at Pos(M) starting at position l+1.
O(n + log m)
For any i < j, Lcp(i,j) is the smallest value of Lcp(k,k+1) for k = i to j-1.
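This property can be checked by brute force on a small text. The sketch below computes the adjacent Lcp values from a naively sorted Pos (0-based indices) and derives Lcp(i,j) as a range minimum; the function names are assumptions made for the example.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def adjacent_lcps(text):
    """Return (pos, adj) where adj[k] = Lcp(k, k+1) over the sorted suffixes."""
    pos = sorted(range(len(text)), key=lambda i: text[i:])
    adj = [lcp_len(text[pos[k]:], text[pos[k + 1]:])
           for k in range(len(pos) - 1)]
    return pos, adj


def lcp_via_min(adj, i, j):
    """Lcp(i, j) for i < j as the minimum of Lcp(k, k+1), k = i .. j-1."""
    return min(adj[i:j])


text = "mississippi"
pos, adj = adjacent_lcps(text)
print(lcp_via_min(adj, 1, 3), lcp_len(text[pos[1]:], text[pos[3]:]))
```

The two printed numbers agree, and the same holds for every pair i < j, since a suffix sorted between two others cannot share less with its neighbours than the outer pair share with each other.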