Tries and Suffix Tries

The document discusses tries and suffix tries. A trie is a tree that represents a collection of strings, with one node per common prefix. The document then describes how a suffix trie is built by adding all suffixes of a text string (plus a terminator $) to a trie, so that each path from the root to a leaf represents a suffix. Checking whether a string is a substring of the text is done by following the string's characters along a path from the root; the string is a suffix if the final node also has an outgoing edge labeled $.


Tries and suffix tries

Ben Langmead

You are free to use these slides. If you do, please sign the
guestbook (www.langmead-lab.org/teaching-materials), or email
me ([email protected]) and tell me briefly how you're
using them. For original Keynote files, email me.

Tries
A trie (pronounced "try") is a tree representing a collection of strings with
one node per common prefix
Smallest tree such that:
  Each edge is labeled with a character c ∈ Σ
  A node has at most one outgoing edge labeled c, for c ∈ Σ
  Each key is spelled out along some path starting at the root

Natural way to represent either a set or a map where keys are strings

Tries: example
Represent this map with a trie:

Keys: instant, internal, internet (each key maps to a value)

[Figure: trie for the three keys. The shared prefixes "in" and "intern" each appear once; the value for a key is stored at the node where the key's path ends.]

Tries: example
Checking for presence of a key P, where n = |P|, is O(n) time

If total length of all keys is N, trie has O(N) nodes

What about |Σ|?
Depends how we represent outgoing edges. If we don't assume |Σ| is a
small constant, it shows up in one or both bounds.
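A minimal sketch of how the edge representation affects the bounds. The class names DictNode and ArrayNode, the helper functions, and the four-letter alphabet are assumptions made for illustration; they are not from the slides. With a hash map per node, space stays proportional to the edges that exist; with a fixed array per node, lookup is a direct index but every node reserves |Σ| slots, so space becomes O(N |Σ|).

```python
# Two hypothetical ways to store a trie node's outgoing edges.
# Neither class comes from the slides; both are illustrative sketches.

ALPHABET = "acgt"                                  # assumed small, fixed alphabet
INDEX = {c: i for i, c in enumerate(ALPHABET)}     # character -> array slot


class DictNode:
    """Edges in a hash map: space grows only with the edges actually present."""

    def __init__(self):
        self.edges = {}

    def child(self, c):
        return self.edges.get(c)                   # expected O(1) lookup

    def add_child(self, c):
        return self.edges.setdefault(c, DictNode())


class ArrayNode:
    """Edges in a fixed array: direct indexing, but |alphabet| slots per node."""

    def __init__(self):
        self.edges = [None] * len(ALPHABET)

    def child(self, c):
        i = INDEX.get(c)
        return None if i is None else self.edges[i]

    def add_child(self, c):
        i = INDEX[c]
        if self.edges[i] is None:
            self.edges[i] = ArrayNode()
        return self.edges[i]


def insert(root, key):
    """Spell key along a path from root, creating nodes as needed."""
    node = root
    for c in key:
        node = node.add_child(c)
    return node


def contains_path(root, key):
    """Return True if key is spelled out along some path from root."""
    node = root
    for c in key:
        node = node.child(c)
        if node is None:
            return False
    return True
```

Both layouts answer the same queries; they differ only in where |Σ| shows up. A third common option, a sorted list of edges searched by binary search, would put a log |Σ| factor into the query time instead.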

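Tries: another example
We can index T with a trie. The trie maps substrings to offsets where they occur

Substring   Offset
ac          4
ag          8
at          14
cc          12
cc          2
ct          6
gt          18
gt          0
ta          10
tt          16

[Figure: trie over these substrings, with root edges a, c, g, t. Leaves, left to right: ac: 4; ag: 8; at: 14; cc: 12, 2; ct: 6; gt: 18, 0; ta: 10; tt: 16]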
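A sketch of how such an index could be built. The slide text does not show T, so the string below is an assumption: it is what you get by concatenating the table's length-2 substrings in offset order. The function names and the 'offsets' key are also illustrative choices.

```python
def build_kmer_trie(t, k, step):
    """Map each length-k substring starting at offsets 0, step, 2*step, ...
    to the list of offsets where it was extracted."""
    root = {}
    for i in range(0, len(t) - k + 1, step):
        cur = root
        for c in t[i:i + k]:
            cur = cur.setdefault(c, {})            # follow or create the edge for c
        cur.setdefault('offsets', []).append(i)    # record the offset at the end node
    return root


def lookup(root, p):
    """Return the offsets recorded for substring p, or [] if p is not indexed."""
    cur = root
    for c in p:
        if c not in cur:
            return []
        cur = cur[c]
    return cur.get('offsets', [])


# Assumed text: the table's 2-mers joined in offset order (not given in the slides).
t = 'gtccacctagtaccatttgt'
index = build_kmer_trie(t, k=2, step=2)
print(lookup(index, 'cc'))   # offsets are stored in increasing order
print(lookup(index, 'gt'))
print(lookup(index, 'aa'))   # never indexed, so the result is empty
```

Storing a list at the end node is what lets one substring, such as cc or gt, map to more than one offset.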

Tries: implementation

class TrieMap(object):
    """ Trie implementation of a map.  Associating keys (strings or other
        sequence type) with values.  Values can be any type. """

    def __init__(self, kvs):
        self.root = {}
        # For each key (string)/value pair
        for (k, v) in kvs:
            self.add(k, v)

    def add(self, k, v):
        """ Add a key-value pair """
        cur = self.root
        for c in k:  # for each character in the string
            if c not in cur:
                cur[c] = {}  # if not there, make new edge on character c
            cur = cur[c]
        cur['value'] = v  # at the end of the path, add the value

    def query(self, k):
        """ Given key, return associated value or None """
        cur = self.root
        for c in k:
            if c not in cur:
                return None  # key wasn't in the trie
            cur = cur[c]
        # get value, or None if there's no value associated with this node
        return cur.get('value')

Python example:
http://nbviewer.ipython.org/6603619
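A short usage sketch for the map from the earlier example. The class is restated compactly so the block runs on its own, and the values 1, 2, 3 are assumed for illustration.

```python
class TrieMap:
    """Compact restatement of the slide's TrieMap so this block is self-contained."""

    def __init__(self, kvs):
        self.root = {}
        for k, v in kvs:
            self.add(k, v)

    def add(self, k, v):
        cur = self.root
        for c in k:
            cur = cur.setdefault(c, {})   # follow the edge for c, creating it if absent
        cur['value'] = v                  # store the value at the end of the path

    def query(self, k):
        cur = self.root
        for c in k:
            if c not in cur:
                return None               # fell off the trie
            cur = cur[c]
        return cur.get('value')           # None if the node carries no value


# The values 1, 2, 3 are assumptions for this example.
tm = TrieMap([('instant', 1), ('internal', 2), ('internet', 3)])
print(tm.query('internal'))   # 2
print(tm.query('inter'))      # None: a prefix of two keys, but not itself a key
print(tm.query('insect'))     # None: no edge labeled 'e' after "ins"
```

Note that reaching a node is not enough for a hit: the node must also carry a value, which is why the prefix "inter" is reported as absent.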

Tries: alternatives
Tries aren't the only tree structure that can encode sets or maps with string
keys. E.g. binary or ternary search trees.

[Figure: ternary search tree for as, at, be, by, he, in, is, it, of, on, or, to]
Example from: Bentley, Jon L., and Robert Sedgewick. "Fast algorithms for sorting and searching
strings." Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for
Industrial and Applied Mathematics, 1997.
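For comparison, a minimal sketch of a ternary search tree storing the same twelve words as a set. The names TSTNode, tst_insert and tst_contains are hypothetical, keys are assumed non-empty, and the insertion order below is arbitrary, so the resulting shape need not match the figure.

```python
class TSTNode:
    """One character per node, with three children: lo (<), eq (=), hi (>)."""

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_key = False               # True if a key ends at this node


def tst_insert(node, key, i=0):
    """Insert key[i:] below node and return the (possibly new) subtree root.
    Assumes key is non-empty."""
    ch = key[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, key, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, key, i)
    elif i + 1 < len(key):
        node.eq = tst_insert(node.eq, key, i + 1)   # matched ch, move to next char
    else:
        node.is_key = True
    return node


def tst_contains(node, key):
    """Return True if the non-empty key was inserted."""
    i = 0
    while node is not None:
        ch = key[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        elif i + 1 < len(key):
            node, i = node.eq, i + 1
        else:
            return node.is_key
    return False


words = "as at be by he in is it of on or to".split()
root = None
for w in ["in", "be", "of", "as", "he", "to", "is", "by", "on", "at", "it", "or"]:
    root = tst_insert(root, w)

print(all(tst_contains(root, w) for w in words))
print(tst_contains(root, "ax"))
```

Each node stores a single character and at most three pointers, so the per-node cost does not depend on |Σ|; the price is extra comparisons along the lo and hi links.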

Indexing with suffixes
Until now, our indexes have been based on extracting substrings from T
A very different approach is to extract suffixes from T. This will lead us to
some interesting and practical index data structures:

Suffix Trie, Suffix Tree, Suffix Array, FM Index

[Figure: the four structures, illustrated for T = BANANA$]

Suffix Array (offset, suffix):
  6  $
  5  A$
  3  ANA$
  1  ANANA$
  0  BANANA$
  4  NA$
  2  NANA$

FM Index (sorted rotations):
  $BANANA
  A$BANAN
  ANA$BAN
  ANANA$B
  BANANA$
  NA$BANA
  NANA$BA

Suffix trie
Build a trie containing all suffixes of a text T
T:

GTTATAGCTGATCGCGGCGTAGCGG$
GTTATAGCTGATCGCGGCGTAGCGG$
TTATAGCTGATCGCGGCGTAGCGG$
TATAGCTGATCGCGGCGTAGCGG$
ATAGCTGATCGCGGCGTAGCGG$
TAGCTGATCGCGGCGTAGCGG$
AGCTGATCGCGGCGTAGCGG$
GCTGATCGCGGCGTAGCGG$
CTGATCGCGGCGTAGCGG$
TGATCGCGGCGTAGCGG$
GATCGCGGCGTAGCGG$
ATCGCGGCGTAGCGG$
TCGCGGCGTAGCGG$
CGCGGCGTAGCGG$
GCGGCGTAGCGG$
CGGCGTAGCGG$
GGCGTAGCGG$
GCGTAGCGG$
CGTAGCGG$
GTAGCGG$
TAGCGG$
AGCGG$
GCGG$
CGG$
GG$
G$
$

m(m+1)/2 chars in total, for suffixes of lengths 1 through m
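A quick check of the character count, as a sketch. Here m is taken to be the length of the string as listed, terminator included; that convention is an assumption made so the formula lines up with the list.

```python
t = 'GTTATAGCTGATCGCGGCGTAGCGG$'

# Every suffix of t, from the whole string down to the lone terminator.
suffixes = [t[i:] for i in range(len(t))]

m = len(t)                                   # m counts the '$' here (assumption)
total_chars = sum(len(s) for s in suffixes)  # lengths m, m-1, ..., 1

print(len(suffixes))          # 26 suffixes
print(total_chars)            # 351 characters in total
print(m * (m + 1) // 2)       # 351, matching m(m+1)/2
```

The quadratic total is why inserting the suffixes one by one, as the trie construction does, takes quadratic time in the worst case.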

Suffix trie
First add special terminal character $ to the end of T
$ is a character that does not appear elsewhere in T, and we define it
to be less than other characters (for DNA: $ < A < C < G < T)
$ enforces a rule we're all used to using: e.g. "as" comes before "ash" in the
dictionary. $ also guarantees no suffix is a prefix of any other suffix.
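Both properties of $ can be checked directly, as a sketch. It relies on the fact that the character '$' has a smaller code point than any letter, so Python's default string ordering already agrees with the convention $ < A < C < G < T. The helper name is hypothetical.

```python
# '$' sorts before every letter, so "as$" precedes "ash$",
# just as "as" precedes "ash" in the dictionary.
print(sorted(['ash$', 'as$']))      # ['as$', 'ash$']
print('$' < 'A' < 'C' < 'G' < 'T')  # True


def some_suffix_is_prefix_of_another(t):
    """Return True if any suffix of t is a proper prefix of a different suffix."""
    sufs = [t[i:] for i in range(len(t))]
    return any(a != b and b.startswith(a) for a in sufs for b in sufs)


print(some_suffix_is_prefix_of_another('abaaba'))    # True: e.g. "a" and "aba"
print(some_suffix_is_prefix_of_another('abaaba$'))   # False once $ is appended
```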

Tries
Smallest tree such that:
  Each edge is labeled with a character from Σ
  A node has at most one outgoing edge labeled with c, for any c ∈ Σ
  Each key is spelled out along some path starting at the root

Suffix trie
T: abaaba
T$: abaaba$

Each path from root to leaf represents a suffix; each suffix is represented by
some path from root to leaf
Would this still be the case if we hadn't added $?

[Figure: suffix trie of abaaba$, with the shortest (non-empty) suffix and the longest suffix marked]

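Suffix trie
T: abaaba
Each path from root to leaf represents a suffix; each suffix is represented by
some path from root to leaf
Would this still be the case if we hadn't added $? No: without $, a suffix
such as "a" or "ba" is a prefix of a longer suffix, so its path ends at an
internal node rather than at a leaf.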
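The effect of the terminator can be seen by counting leaves, as a sketch. The helper names below are hypothetical, and the trie uses nested dicts in the same way as the slides' implementations.

```python
def build_trie(strings):
    """Build a nested-dict trie holding every string in strings."""
    root = {}
    for s in strings:
        cur = root
        for c in s:
            cur = cur.setdefault(c, {})
    return root


def count_leaves(node):
    """A node with no outgoing edges is a leaf; otherwise sum over children."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def suffixes(t):
    return [t[i:] for i in range(len(t))]


for t in ('abaaba', 'abaaba$'):
    trie = build_trie(suffixes(t))
    print(t, len(suffixes(t)), count_leaves(trie))
# abaaba  has 6 suffixes but only 3 leaves: "a", "ba", "aba" end at internal nodes
# abaaba$ has 7 suffixes and 7 leaves: one leaf per suffix
```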

Suffix trie
We can think of nodes as having labels, where the label spells out characters on the
path from the root to the node

[Figure: suffix trie of abaaba$ with one node's label, baa, shown as an example]

Suffix trie
How do we check whether a string S is a substring of T?
Note: Each of T's substrings is spelled out along a path from the root. I.e., every
substring is a prefix of some suffix of T.

Start at the root and follow the edges labeled with the characters of S
If we fall off the trie -- i.e. there is no outgoing edge for next character of S, then
S is not a substring of T
If we exhaust S without falling off, S is a substring of T

S = baa
Yes, it's a substring

S = abaaba
Yes, it's a substring

S = baabb
No, not a substring (we fall off at the final b)

Suffix trie
How do we check whether a string S is a suffix of T?
Same procedure as for substring, but additionally check whether the final node in
the walk has an outgoing edge labeled $

S = baa
Not a suffix

S = aba
Is a suffix

Suffix trie
How do we count the number of times a string S occurs as a substring of T?
Follow path corresponding to S. Either we fall off, in which case
answer is 0, or we end up at node n and the answer = # of leaf nodes in
the subtree rooted at n.
Leaves can be counted with depth-first traversal.

S = aba
2 occurrences
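The counting procedure can be sketched as follows. The function names are hypothetical, and the recursive leaf count is fine for short texts but could hit Python's recursion limit on long ones.

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$'."""
    t += '$'
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:]:
            cur = cur.setdefault(c, {})
    return root


def count_leaves(node):
    """Depth-first count of the leaves in the subtree rooted at node."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def count_occurrences(root, s):
    """Number of times s occurs as a substring of the indexed text."""
    cur = root
    for c in s:
        if c not in cur:
            return 0              # fell off the trie: no occurrences
        cur = cur[c]
    return count_leaves(cur)      # one leaf per suffix that starts with s


root = build_suffix_trie('abaaba')
print(count_occurrences(root, 'aba'))   # 2
print(count_occurrences(root, 'a'))     # 4
print(count_occurrences(root, 'bb'))    # 0
```

Each leaf below the node corresponds to one suffix that begins with S, and each such suffix marks one occurrence of S in T.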

Suffix trie
How do we find the longest repeated substring of T?
Find the deepest node with more than one child

[Figure: suffix trie of abaaba$; the deepest node with more than one child has label aba]
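A sketch of the search for the deepest branching node. The function names are hypothetical; the traversal keeps each node's label as a string, which is simple but not the most economical choice, and when several repeats tie for longest it returns whichever it meets first.

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$'."""
    t += '$'
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:]:
            cur = cur.setdefault(c, {})
    return root


def longest_repeated_substring(t):
    """Label of the deepest node with more than one child ('' if nothing repeats)."""
    best = ''
    stack = [(build_suffix_trie(t), '')]      # (node, label spelled from the root)
    while stack:
        node, label = stack.pop()
        # More than one child means the label is followed by at least two
        # different characters, so it occurs at least twice in t + '$'.
        if len(node) > 1 and len(label) > len(best):
            best = label
        for c, child in node.items():
            stack.append((child, label + c))
    return best


print(longest_repeated_substring('abaaba'))   # aba
print(longest_repeated_substring('aaaa'))     # aaa
```

The terminator never appears in the answer, since a node reached by a $ edge is always a leaf.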

Suffix trie: implementation

class SuffixTrie(object):

    def __init__(self, t):
        """ Make suffix trie from t """
        t += '$'  # special terminator symbol
        self.root = {}
        for i in range(len(t)):  # for each suffix
            cur = self.root
            for c in t[i:]:  # for each character in i'th suffix
                if c not in cur:
                    cur[c] = {}  # add outgoing edge if necessary
                cur = cur[c]

    def followPath(self, s):
        """ Follow path given by characters of s.  Return node at
            end of path, or None if we fall off. """
        cur = self.root
        for c in s:
            if c not in cur:
                return None
            cur = cur[c]
        return cur

    def hasSubstring(self, s):
        """ Return true iff s appears as a substring of t """
        return self.followPath(s) is not None

    def hasSuffix(self, s):
        """ Return true iff s is a suffix of t """
        node = self.followPath(s)
        return node is not None and '$' in node
Python example:
http://nbviewer.ipython.org/6603756
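A usage sketch that replays the worked queries on T = abaaba. The class is restated compactly, with the slide's method names, so that the block runs on its own under Python 3.

```python
class SuffixTrie:
    """Compact restatement of the slide's SuffixTrie (self-contained copy)."""

    def __init__(self, t):
        t += '$'                                  # special terminator symbol
        self.root = {}
        for i in range(len(t)):                   # for each suffix
            cur = self.root
            for c in t[i:]:
                cur = cur.setdefault(c, {})       # add outgoing edge if necessary

    def followPath(self, s):
        cur = self.root
        for c in s:
            if c not in cur:
                return None                       # fell off the trie
            cur = cur[c]
        return cur

    def hasSubstring(self, s):
        return self.followPath(s) is not None

    def hasSuffix(self, s):
        node = self.followPath(s)
        return node is not None and '$' in node


st = SuffixTrie('abaaba')
for s in ('baa', 'abaaba', 'baabb'):
    print(s, st.hasSubstring(s))   # True, True, False
for s in ('baa', 'aba'):
    print(s, st.hasSuffix(s))      # False, True
```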

Suffix trie
How many nodes does the suffix trie have?

Is there a class of string where the number of suffix trie nodes grows linearly with m?
Yes: e.g. a string of m a's in a row (a^m)

T = aaaa
  1 root
  m nodes with incoming a edge
  m + 1 nodes with incoming $ edge
  2m + 2 nodes

Suffix trie
Is there a class of string where the number of suffix trie nodes grows with m^2?
Yes: a^n b^n

  1 root
  n nodes along "b chain", right
  n nodes along "a chain", middle
  n chains of n "b" nodes hanging off each "a chain" node
  2n + 1 $ leaves (not shown)
  n^2 + 4n + 2 nodes, where m = 2n

Figure & example by Carl Kingsford
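Both node counts can be checked empirically, as a sketch. The helper names are hypothetical, and the node count is done iteratively so that long chains do not run into Python's recursion limit.

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$'."""
    t += '$'
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:]:
            cur = cur.setdefault(c, {})
    return root


def count_nodes(root):
    """Count the root plus every node below it."""
    total, stack = 0, [root]
    while stack:
        node = stack.pop()
        total += 1
        stack.extend(node.values())
    return total


# Linear family: a^m has 2m + 2 nodes.
for m in (1, 4, 10):
    print(m, count_nodes(build_suffix_trie('a' * m)), 2 * m + 2)

# Quadratic family: a^n b^n has n^2 + 4n + 2 nodes, where m = 2n.
for n in (1, 3, 8):
    print(n, count_nodes(build_suffix_trie('a' * n + 'b' * n)), n * n + 4 * n + 2)
```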

Suffix trie: upper bound on size
Could worst-case # nodes be worse than O(m^2)?

Max # nodes from top to bottom (root to deepest leaf)
= length of longest suffix + 1
= m + 1

Max # nodes from left to right
= max # distinct substrings of any length
≤ m

O(m^2) is worst case
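One way to make the counting argument concrete, as a sketch under the assumption that m = |T| without the terminator. Nodes at depth d correspond to distinct length-d substrings of T$; a string of m + 1 characters has at most m + 2 - d substrings of length d, so summing over d = 1, ..., m + 1 and adding the root gives at most 1 + (m + 1)(m + 2)/2 nodes. The helper names below are hypothetical.

```python
def build_suffix_trie(t):
    """Nested-dict suffix trie of t + '$'."""
    t += '$'
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:]:
            cur = cur.setdefault(c, {})
    return root


def count_nodes(root):
    """Count the root plus every node below it."""
    total, stack = 0, [root]
    while stack:
        node = stack.pop()
        total += 1
        stack.extend(node.values())
    return total


def node_bound(m):
    """1 for the root, plus at most (m + 2 - d) nodes at each depth d = 1 .. m + 1."""
    return 1 + (m + 1) * (m + 2) // 2


for t in ('aaaa', 'abaaba', 'aabb', 'banana', 'gattaca'):
    n_nodes = count_nodes(build_suffix_trie(t))
    print(t, n_nodes, node_bound(len(t)))
    assert n_nodes <= node_bound(len(t))
```

The bound is quadratic in m, which agrees with the slide's conclusion that O(m^2) is the worst case.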

Suffix trie: actual growth
Built suffix tries for the first 500 prefixes of the lambda phage virus genome
Black curve shows how # nodes increases with prefix length

[Figure: # suffix trie nodes (0 to 250000) vs. length of prefix over which suffix trie was built (0 to 500); curves labeled m^2, actual, and m]
