Tries and Suffix Tries
Ben Langmead
You are free to use these slides. If you do, please sign the
guestbook (www.langmead-lab.org/teaching-materials), or email
me ([email protected]) and tell me briefly how you're
using them. For original Keynote files, email me.
Tries
A trie (pronounced "try") is a tree representing a collection of strings with
one node per common prefix
Smallest tree such that:
Each edge is labeled with a character c ∈ Σ
A node has at most one outgoing edge labeled c, for each c ∈ Σ
Each key is spelled out along some path starting at the root
Natural way to represent either a set or a map where keys are strings
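As a rough illustration of "one node per common prefix", here is a minimal sketch that stores a set of strings as nested dictionaries and compares the node count with the number of distinct prefixes. The helper names build_trie and count_nodes are made up for this example, and the nested-dict layout is just one convenient representation.

```python
def build_trie(keys):
    """Return a trie for `keys` as nested dicts: one dict per node,
    one entry per outgoing edge (character -> child node)."""
    root = {}
    for key in keys:
        node = root
        for c in key:
            # setdefault reuses an existing edge labeled c, so a node
            # never gets two outgoing edges with the same label
            node = node.setdefault(c, {})
    return root


def count_nodes(node):
    """Count this node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.values())


keys = ["instant", "internal", "internet"]
trie = build_trie(keys)

# Every distinct prefix of a key (including the empty prefix, which is
# the root) should correspond to exactly one node.
prefixes = {key[:i] for key in keys for i in range(len(key) + 1)}
print(count_nodes(trie), len(prefixes))
```

The two printed numbers agree because keys that share a prefix share the path spelling that prefix.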
Tries: example
Represent this map with a trie:
Keys: instant, internal, internet (each key maps to a value)
[Figure: the resulting trie; each value is stored at the node where its key ends]
Tries: example
[Figure: a trie with edges labeled a, c, g, t below a node marked "root:"; the leaves hold integer values, and some hold two, e.g. "12, 2" and "18, 0"]
Tries: implementation
class TrieMap(object):
    """ Trie implementation of a map.  Associating keys (strings or other
        sequence type) with values.  Values can be any type. """

    def __init__(self, kvs):
        self.root = {}
        # For each key (string)/value pair
        for (k, v) in kvs:
            self.add(k, v)

    def add(self, k, v):
        """ Add a key-value pair """
        cur = self.root
        for c in k:  # for each character in the string
            if c not in cur:
                cur[c] = {}  # if not there, make new edge on character c
            cur = cur[c]
        cur['value'] = v  # at the end of the path, add the value

    def query(self, k):
        """ Given key, return associated value or None """
        cur = self.root
        for c in k:
            if c not in cur:
                return None  # key wasn't in the trie
            cur = cur[c]
        # get value, or None if there's no value associated with this node
        return cur.get('value')
Python example:
http://nbviewer.ipython.org/6603619
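A short usage sketch may help show the two ways a lookup can come back empty. It uses the same dict-of-dicts layout as the implementation slide, written as two standalone functions so the snippet runs on its own. The function names trie_add and trie_query and the values 1, 2, 3 are assumptions made for illustration.

```python
def trie_add(root, key, value):
    """Walk or create the path spelling `key`, then store `value` there."""
    node = root
    for c in key:
        node = node.setdefault(c, {})
    node['value'] = value


def trie_query(root, key):
    """Return the value stored for `key`, or None if there is none."""
    node = root
    for c in key:
        if c not in node:
            return None  # fell off the trie: no path spells this key
        node = node[c]
    return node.get('value')  # path exists; a value may or may not be here


root = {}
# The values 1, 2, 3 are placeholders chosen for this example.
for key, value in [("instant", 1), ("internal", 2), ("internet", 3)]:
    trie_add(root, key, value)

print(trie_query(root, "internet"))  # stored key
print(trie_query(root, "inter"))     # path exists, but no value at its end
print(trie_query(root, "infer"))     # no edge labeled 'f' after "in"
```

One design point worth noting: because the value sits under the multi-character string 'value', it cannot collide with a single-character edge label when keys are strings. With other sequence types as keys, that would need a second look.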
Tries: alternatives
Tries aren't the only tree structure that can encode sets or maps with string
keys. E.g. binary or ternary search trees.
[Figure: ternary search tree for as, at, be, by, he, in, is, it, of, on, or, to]
Example from: Bentley, Jon L., and Robert Sedgewick. "Fast algorithms for sorting and searching
strings." Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms. Society for
Industrial and Applied Mathematics, 1997
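For comparison with the trie, here is a minimal sketch of a ternary search tree holding the same twelve words. The class and function names (TSTNode, tst_insert, tst_contains) are made up for this example. It only handles non-empty keys, and the shape of the tree depends on insertion order, so it need not match the figure.

```python
class TSTNode:
    def __init__(self, ch):
        self.ch = ch         # character stored at this node
        self.lo = None       # subtree for characters smaller than ch
        self.eq = None       # subtree for the next character of the key
        self.hi = None       # subtree for characters larger than ch
        self.is_key = False  # True if some key ends at this node


def tst_insert(node, key, i=0):
    """Insert non-empty `key`; return the (possibly new) subtree root."""
    ch = key[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, key, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, key, i)
    elif i + 1 < len(key):
        node.eq = tst_insert(node.eq, key, i + 1)
    else:
        node.is_key = True
    return node


def tst_contains(node, key):
    """Return True if non-empty `key` was inserted."""
    i = 0
    while node is not None:
        ch = key[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        elif i + 1 < len(key):
            node = node.eq  # matched key[i]; move on to the next character
            i += 1
        else:
            return node.is_key
    return False


words = "as at be by he in is it of on or to".split()
root = None
for w in words:
    root = tst_insert(root, w)

print(all(tst_contains(root, w) for w in words))
print(tst_contains(root, "an"), tst_contains(root, "a"))
```

Each node has at most three children regardless of alphabet size, which is the main contrast with a trie node that can have one child per character.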
[Figure: suffix trie]
[Figure: suffix tree]
Suffix array for BANANA$ (offset, suffix):
6  $
5  A$
3  ANA$
1  ANANA$
0  BANANA$
4  NA$
2  NANA$
FM index for BANANA$ (sorted rotations):
$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA
Suffix trie
Build a trie containing all suffixes of a text T
T: GTTATAGCTGATCGCGGCGTAGCGG$
GTTATAGCTGATCGCGGCGTAGCGG$
TTATAGCTGATCGCGGCGTAGCGG$
TATAGCTGATCGCGGCGTAGCGG$
ATAGCTGATCGCGGCGTAGCGG$
TAGCTGATCGCGGCGTAGCGG$
AGCTGATCGCGGCGTAGCGG$
GCTGATCGCGGCGTAGCGG$
CTGATCGCGGCGTAGCGG$
TGATCGCGGCGTAGCGG$
GATCGCGGCGTAGCGG$
ATCGCGGCGTAGCGG$
TCGCGGCGTAGCGG$
CGCGGCGTAGCGG$
GCGGCGTAGCGG$
CGGCGTAGCGG$
GGCGTAGCGG$
GCGTAGCGG$
CGTAGCGG$
GTAGCGG$
TAGCGG$
AGCGG$
GCGG$
CGG$
GG$
G$
$
m(m+1)/2 chars
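The m(m+1)/2 figure can be checked directly. This sketch lists every suffix of the example text and adds up their lengths. It assumes m is the length of the string whose suffixes are listed, here including the terminal $.

```python
t = "GTTATAGCTGATCGCGGCGTAGCGG$"
m = len(t)  # assumption: m counts the terminal $ as well

# One suffix per starting offset, from the whole string down to "$"
suffixes = [t[i:] for i in range(m)]

# The suffix lengths are m, m-1, ..., 1, so they sum to m(m+1)/2
total_chars = sum(len(s) for s in suffixes)
print(m, total_chars, m * (m + 1) // 2)
```

This is the number of characters inserted while building the suffix trie naively, one suffix at a time.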
Suffix trie
First add special terminal character $ to the end of T
$ is a character that does not appear elsewhere in T, and we define it
to be less than other characters (for DNA: $ < A < C < G < T)
$ enforces a rule we're all used to using: e.g. "as" comes before "ash" in the
dictionary. $ also guarantees no suffix is a prefix of any other suffix.
T:
GTTATAGCTGATCGCGGCGTAGCGG$
GTTATAGCTGATCGCGGCGTAGCGG$
TTATAGCTGATCGCGGCGTAGCGG$
TATAGCTGATCGCGGCGTAGCGG$
ATAGCTGATCGCGGCGTAGCGG$
TAGCTGATCGCGGCGTAGCGG$
AGCTGATCGCGGCGTAGCGG$
GCTGATCGCGGCGTAGCGG$
CTGATCGCGGCGTAGCGG$
TGATCGCGGCGTAGCGG$
GATCGCGGCGTAGCGG$
ATCGCGGCGTAGCGG$
TCGCGGCGTAGCGG$
CGCGGCGTAGCGG$
GCGGCGTAGCGG$
CGGCGTAGCGG$
GGCGTAGCGG$
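The two claims about $ can be tried out on a small string. This is a sketch only: the helper name suffix_is_prefix_of_another is made up, the check is a quadratic brute force, and the ordering part relies on the fact that '$' happens to sort before letters in ASCII, which matches the convention $ < A < C < G < T.

```python
def suffix_is_prefix_of_another(t):
    """Return a pair (s, u) of distinct suffixes of t where s is a
    prefix of u, or None if no such pair exists."""
    suffixes = [t[i:] for i in range(len(t))]
    for s in suffixes:
        for u in suffixes:
            if s != u and u.startswith(s):
                return (s, u)
    return None


# Without $, some suffix of abaaba is a prefix of another suffix ...
print(suffix_is_prefix_of_another("abaaba"))
# ... with $ appended, no suffix is a prefix of another.
print(suffix_is_prefix_of_another("abaaba$"))

# In ASCII, '$' sorts before the letters, so plain string comparison
# follows the ordering used on the slide.
print(sorted("TGCA$"))
print("as$" < "ash$")
```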
Tries
Smallest tree such that:
Each edge is labeled with a character from Σ
A node has at most one outgoing edge labeled with c, for any c ∈ Σ
Each key is spelled out along some path starting at the root
Suffix trie
T: abaaba
T$: abaaba$
[Figure: suffix trie of abaaba$, with the shortest (non-empty) suffix and the longest suffix marked]
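Suffix trie
T: abaaba
Each path from root to leaf represents a suffix; each suffix is represented by some path from root to leaf
Would this still be the case if we hadn't added $? No: a suffix that is a prefix of another suffix (e.g. a and aba) would end at an internal node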
Suffix trie
Is S a substring of T? Follow the path labeled S from the root
S = baa
Yes, it's a substring
Suffix trie
S = abaaba
Yes, it's a substring
Suffix trie
S = baabb
No, not a substring (the path fails at the final b)
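The substring queries above can be sketched in a few lines. This assumes the suffix trie is stored as nested dictionaries; the function names build_suffix_trie, follow_path and has_substring are chosen for this example and are not necessarily those used in the linked notebook.

```python
def build_suffix_trie(t):
    """Suffix trie of t + '$' as nested dicts (character -> child node)."""
    t += "$"
    root = {}
    for i in range(len(t)):  # insert each suffix t[i:]
        node = root
        for c in t[i:]:
            node = node.setdefault(c, {})
    return root


def follow_path(root, s):
    """Return the node reached by spelling s from the root, or None
    if we fall off the trie."""
    node = root
    for c in s:
        if c not in node:
            return None
        node = node[c]
    return node


def has_substring(root, s):
    """Every substring of T is a prefix of some suffix of T, so s is a
    substring exactly when the path spelling s exists."""
    return follow_path(root, s) is not None


root = build_suffix_trie("abaaba")
for s in ["baa", "abaaba", "baabb"]:
    print(s, has_substring(root, s))
```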
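Suffix trie
S = baa
Not a suffix (no $ edge at the end of the path)
Suffix trie
S = aba
Is a suffix ($ edge at the end of the path)
Suffix trie
S = aba
2 occurrences (2 leaves below the end of the path)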
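The suffix test and the occurrence count can be sketched the same way: follow the path spelling S, then either look for an outgoing $ edge or count the leaves underneath. The nested-dict representation and the names is_suffix, count_leaves and count_occurrences are assumptions made for this example.

```python
def build_suffix_trie(t):
    """Suffix trie of t + '$' as nested dicts (character -> child node)."""
    t += "$"
    root = {}
    for i in range(len(t)):
        node = root
        for c in t[i:]:
            node = node.setdefault(c, {})
    return root


def follow_path(root, s):
    """Node reached by spelling s from the root, or None."""
    node = root
    for c in s:
        if c not in node:
            return None
        node = node[c]
    return node


def is_suffix(root, s):
    """s is a suffix of T if its path exists and can be extended by $."""
    node = follow_path(root, s)
    return node is not None and "$" in node


def count_leaves(node):
    """A node with no outgoing edges is a leaf (the end of some suffix)."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def count_occurrences(root, s):
    """Each leaf below the path for s is one suffix that starts with s,
    i.e. one occurrence of s in T."""
    node = follow_path(root, s)
    return 0 if node is None else count_leaves(node)


root = build_suffix_trie("abaaba")
print(is_suffix(root, "baa"), is_suffix(root, "aba"))
print(count_occurrences(root, "aba"))
```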
Suffix trie
[Figure: suffix trie of abaaba$ with the path aba marked]
Python example:
http://nbviewer.ipython.org/6603756
Suffix trie
How many nodes does the suffix trie have?
T = aaaa
1 root
m nodes with incoming a edge
m + 1 nodes with incoming $ edge
2m + 2 nodes
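The 2m + 2 count can be checked by building the trie and counting. This is a small sketch with a nested-dict suffix trie; build_suffix_trie and count_nodes are names made up for this example, and m is the length of T without the $.

```python
def build_suffix_trie(t):
    """Suffix trie of t + '$' as nested dicts (character -> child node)."""
    t += "$"
    root = {}
    for i in range(len(t)):
        node = root
        for c in t[i:]:
            node = node.setdefault(c, {})
    return root


def count_nodes(node):
    """Count this node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.values())


# T = a^m: 1 root + m nodes on the a chain + (m + 1) $ leaves
for m in range(1, 9):
    assert count_nodes(build_suffix_trie("a" * m)) == 2 * m + 2

print(count_nodes(build_suffix_trie("aaaa")))
```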
Suffix trie
Is there a class of string where the number of suffix trie nodes grows with m^2?
Yes: a^n b^n
1 root
n nodes along b chain, right
n nodes along a chain, middle
n chains of n b nodes hanging off each a chain node
2n + 1 $ leaves (not shown)
n^2 + 4n + 2 nodes, where m = 2n
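One way to check the n^2 + 4n + 2 count without building the trie: each node of the suffix trie corresponds to exactly one distinct substring of T$ (the root being the empty string), so counting distinct substrings gives the node count. The function name count_suffix_trie_nodes is made up for this sketch, and the brute-force set is only practical for short strings.

```python
def count_suffix_trie_nodes(t):
    """Number of nodes in the suffix trie of t + '$', computed as the
    number of distinct substrings of t + '$', including the empty one."""
    t += "$"
    substrings = {t[i:j]
                  for i in range(len(t))
                  for j in range(i, len(t) + 1)}
    return len(substrings)


# T = a^n b^n, so m = 2n
for n in range(1, 9):
    t = "a" * n + "b" * n
    assert count_suffix_trie_nodes(t) == n * n + 4 * n + 2

print(count_suffix_trie_nodes("aabb"))
```

Since n = m/2, the n^2 term is m^2/4, which is where the quadratic growth comes from.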
Suffix trie
[Figure: outline of a suffix trie with the root and the deepest leaf marked]
Max # nodes from left to right
= max # distinct substrings of any length ≤ m
[Plot: x-axis m from 100 to 500, y-axis from 0 to 250000; curves labeled m^2 and actual]