Current Challenges in Textual Databases
Gonzalo Navarro
Department of Computer Science
University of Chile
Road Map
• Text retrieval versus information retrieval
• Applications of text databases
• Challenges for modern text databases
• Classical solutions
– Suffix tries
– Suffix trees
– Directed acyclic word graphs (DAWGs)
• Towards succinct data structures
– Compact DAWGs
– Sparse suffix trees, cactuses and q-grams
– Ziv-Lempel based indexes
– Suffix arrays
– Compact PAT trees
– Compact Suffix Arrays
• Self-indexing data structures
– Compressed Suffix Arrays
– FM-Index
– LZ-Index
• Secondary memory and dynamic data structures
• Open Challenges
Text Retrieval versus Information Retrieval
• In both cases we handle a text collection.
• In Information Retrieval
– The database is seen as a set of documents.
– The user has a (vague) information need.
– The user expresses that information using words and phrases.
– The system tries to guess what the user really wants.
– There are no correct or incorrect answers, just more or less relevant ones.
– An underlying text retrieval engine is used to compute relevance.
– It works only for “natural language” text, which excludes several human languages.
– We want speed, but especially good precision and recall.
• In Text Retrieval
– The database is seen as a set of strings.
– The user knows exactly what kind of strings he/she wants.
– The user gives patterns to search for in the text.
– The system retrieves exactly the occurrences of those patterns.
– A postprocessing stage may further filter these results, but this is not the task of the text retrieval engine.
– It works for any kind of text.
– We want basically speed, usually subject to correctness.
Applications of Text Databases
• Computational biology
– Genome projects have produced gigabytes of nucleotide and protein data.
– Biologists search that data for known or new genes and proteins, and for homologous regions, so as to predict chemical structure, biochemical function, evolutionary history, etc.
– Global and local similarity, with different concepts of approximate searching, including approximate regular expression searching.
• Information retrieval
– IR systems usually have a text search engine at their kernel.
– Usually it is specific to natural language.
– Approximate searching is useful to cope with spelling, typing and OCR errors.
– Handling from medium to huge texts: jurisprudence databases, linguistic corpora, corporate data, news clipping... and of course the Web.
• Text retrieval in Oriental languages
– Chinese, Korean, (one) Japanese and other Oriental languages have very large alphabets, mixing phonetic and ideographic symbols.
– Word separations must be inferred from the meaning and are difficult to detect automatically.
– They have to treat their texts simply as strings of symbols.
– Typical IR systems do not apply to those languages.
• Multimedia databases
– Music databases, in MIDI format for example, with their own concepts of searching, e.g. transposition invariance.
– Audio databases, where we wish to find patterns independently of volume, etc.
– Video databases, where for example object tracking information is a string of movement directions.
Challenges for Modern Text Databases
• Be small, in order to store large text collections and the extra data structures built on them at a reasonable space cost.
• Be fast, in order to provide access to a large mass of text in reasonable time, possibly for many concurrent users.
• Be flexible, in order to search for complex patterns.
• For fast access we need an index, that is, a data structure built on the text.
• But an index may take much space, and this plays against the goal of a small database.
• For large databases, that index has to be oriented to secondary memory.
• The index should be dynamic, that is, easy to update upon changes in the text collection.
The Simplest Solution: Suffix Tries
• A trie that stores all the suffixes of the text.
• Each leaf represents a unique substring (and its extension up to a full suffix).
• Each internal node represents a repeating substring.
• Every text substring is the prefix of some suffix.
• So any substring of length m can be found in O(m) time.
• Many other search problems can be solved with suffix tries:
– Longest text repetitions.
– Approximate searching in O(n^λ) time (n = text size).
– Regular expression searching in O(n^λ) time.
– ... and many others.
• Let n be the text size, then the trie
– on average is O(n) space and construction time...
– but in the worst case it is O(n^2) space and construction time.
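The points above can be sketched in a few lines of Python — a didactic sketch with dictionary nodes, not a space-conscious implementation:

```python
def suffix_trie(text):
    """Build a trie of all suffixes; O(n^2) time and space in the worst case."""
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:                 # insert suffix text[i:]
            node = node.setdefault(c, {})
    return root

def contains(trie, pattern):
    """Any substring is a prefix of some suffix: follow it in O(m) steps."""
    node = trie
    for c in pattern:
        if c not in node:
            return False
        node = node[c]
    return True

trie = suffix_trie("alabar a la alabarda$")
```

For example, `contains(trie, "bar")` succeeds while `contains(trie, "barla")` fails, each after at most m dictionary lookups.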
[Figure: suffix trie of the example text “alabar a la alabarda$”, with leaves labeled by suffix positions.]
Guaranteeing Linear Space: Suffix Trees
Weiner, McCreight, Ukkonen,
• Works essentially as a suffix trie.
• Unary paths are compressed.
• It has O(n) size and construction time, in the worst case.
• Construction can be done online.
• After the O(m) search, the R occurrences can be collected in O(R) time.
• So it’s simply perfect!
• Where is the trick?
– The suffix tree needs at least 20 times the text size.
– It is hard to build and search in secondary memory.
– It is difficult to update upon changes in the text.
• So, it is problematic with respect to our three goals!
• Very compact implementations achieve 10n bytes.
• Other variants: Patricia trees, level-compressed tries... not better in practice.
[Figure: suffix tree of “alabar a la alabarda$”; the unary paths of the trie are compressed into edges labeled by substrings.]
Minimizing Trie Automata: DAWGs
Blumer et al., Crochemore
• The smallest deterministic automaton recognizing every text suffix.
• It would be the result of minimizing the suffix trie.
• At most 2n nodes and 3n edges.
• It can be built in O(n) time with an online algorithm.
• Searching can also be done in O(m + R) time.
• An interesting alternative to suffix trees.
• In practice, it is too large (30n), static and bad for secondary storage.
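The online construction can be sketched with the classical suffix automaton algorithm — a didactic Python version with dictionary transitions, far from the 30n bytes of actual implementations:

```python
class State:
    def __init__(self):
        self.len = 0        # length of the longest string reaching this state
        self.link = -1      # suffix link
        self.next = {}      # outgoing transitions

def build_dawg(s):
    """Online DAWG (suffix automaton) construction; O(n) states and transitions."""
    st = [State()]
    last = 0
    for c in s:
        cur = len(st)
        st.append(State())
        st[cur].len = st[last].len + 1
        p = last
        while p != -1 and c not in st[p].next:
            st[p].next[c] = cur
            p = st[p].link
        if p == -1:
            st[cur].link = 0
        else:
            q = st[p].next[c]
            if st[p].len + 1 == st[q].len:
                st[cur].link = q
            else:                          # split: clone q to keep lengths consistent
                clone = len(st)
                st.append(State())
                st[clone].len = st[p].len + 1
                st[clone].next = dict(st[q].next)
                st[clone].link = st[q].link
                while p != -1 and st[p].next.get(c) == q:
                    st[p].next[c] = clone
                    p = st[p].link
                st[q].link = clone
                st[cur].link = clone
        last = cur
    return st

def occurs(st, pattern):
    """Follow transitions from the initial state: O(m) substring search."""
    v = 0
    for c in pattern:
        if c not in st[v].next:
            return False
        v = st[v].next[c]
    return True
```

Every substring of the text labels a path from the initial state, so searching is a plain automaton walk.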
[Figure: DAWG of “alabar a la alabarda$”.]
Best from Suffix Trees and DAWGs: Compact DAWGs
Blumer et al., Crochemore & Vérin, Takeda & Shin
• Obtained by compressing unary paths in the DAWG.
• Also obtained by minimizing the suffix tree.
• It is smaller than the three classical structures.
• It can still be built online in O(n) time.
• It can still search in O(m + R) time.
• It can still implement all the flexible searches.
• But it is still too large (15n), static and bad on disk.
[Figure: CDAWG of “alabar a la alabarda$”; unary paths of the DAWG are compressed into edges labeled by substrings.]
Sampling the Data: Sparse Suffix Trees, Cactuses, and q-Grams
Kärkkäinen, Ukkonen, Sutinen, T
• Sparse suffix trees
– Only one out of k suffixes is indexed.
– Hence the suffix tree has fewer nodes and is smaller.
– But searches are more expensive.
– In practice, reasonable search time requires about 4n space.
• Suffix cactuses
– Carry path compression one step further.
– They are a cross between a suffix tree and a suffix array.
– Not very promising either (10n space).
• q-gram indexes
– They index substrings of length q instead of full suffixes.
– Significantly less space, for example 2n extra space.
– But search times are far less attractive if q is not large enough.
– And the index grows exponentially with q, so 4n at least.
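A q-gram index is little more than a dictionary from q-grams to position lists; longer patterns are found by verifying the candidates of one of their q-grams. A minimal sketch, using 1-based positions:

```python
from collections import defaultdict

def qgram_index(text, q):
    """Map each q-gram to the (1-based) list of its text positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i + 1)
    return index

def qgram_search(index, text, pattern, q):
    """Candidates come from the pattern's first q-gram; each must be verified."""
    m = len(pattern)
    return [p for p in index.get(pattern[:q], [])
            if text[p - 1:p - 1 + m] == pattern]
```

The verification step is where search times degrade: a short q yields long, unspecific candidate lists.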
[Figure: 3-gram index of “alabar a la alabarda$”: each 3-gram is stored with the list of its text positions, e.g. ala → 1,13; lab → 2,14; aba → 3,15; bar → 4,16; ard → 17.]
Using the Ziv-Lempel Parsing: Ziv-Lempel Indexes
Kärkkäinen, Sutinen, U
• The text is divided according to a Ziv-Lempel parsing.
• A sparse suffix tree indexes initial positions of blocks.
• Another indexes the reverse prefixes ending at final block positions.
• Occurrences can either be contained inside a block or not.
• Those that are not have a prefix finishing a block and a suffix starting a block.
• Once each prefix-suffix pair is found, a range search data structure permits intersecting the lexicographical ranges.
• Those contained in another block are found by a structure tracking the repetitions.
• Not implemented, but we estimate 3.5n to 5.5n extra space and good search times.
• Although in theory search time is, for example, O(m^2 + m log n + R log^ε n).
• A version capable of searching only for q-grams (q < log n) obtains optimal O(m + R) search time and 4n to 6n extra space.
• No good algorithm for longer patterns can be obtained with such a q.
[Figure: Ziv-Lempel index of “alabar a la alabarda$”: the sparse suffix tree of block beginnings, the trie of reverse block prefixes, and the lexicographical ranges used for intersecting prefix-suffix pairs.]
Just the Tree Leaves: Suffix Arrays
Manber & Myers, Gonnet et al., Kurtz et al., Kärkkäinen, Baeza-Yates
• An array of all the text positions in lexicographical suffix order.
• Much simpler to implement and smaller than the suffix tree.
• Simple searches result in a range of positions.
• It can simulate all the searches with an O(log n) penalty factor.
• It can be built in O(n) time, but construction is not online.
• In practice it takes about 4 times the text size.
• Linear-time construction needs 8n bytes; otherwise construction is O(n log n).
• Paying 6n extra space, we can get rid of the O(log n) penalty.
• Builds in secondary memory in O(n^2 log(M)/M) time (M = RAM size).
• Searching in secondary memory is painful.
• But it can be improved a lot with sampling supraindexes.
• Still dynamism is a problem.
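The structure and its O(m log n)-comparison search can be sketched directly — a didactic version whose construction sorts the suffixes outright, O(n^2 log n) in the worst case, rather than the linear-time algorithms mentioned above:

```python
def suffix_array(text):
    """0-based positions of all suffixes, in lexicographical order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_range(text, sa, p):
    """Binary search for the range of suffixes that start with pattern p."""
    m = len(p)
    lo, hi = 0, len(sa)
    while lo < hi:                              # first suffix >= p
        mid = (lo + hi) // 2
        if text[sa[mid]:] < p:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:                              # first suffix whose m-prefix > p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return left, lo                             # occurrences are sa[left:lo]

text = "alabar a la alabarda$"
sa = suffix_array(text)
l, r = sa_range(text, sa, "bar")
positions = sorted(sa[i] + 1 for i in range(l, r))   # 1-based positions
```

Simple searches indeed come out as a contiguous range of the array, which is what the sampling and compression schemes below exploit.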
[Figure: suffix array of “alabar a la alabarda$”: 21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14 6 18. Searching for “bar” yields a contiguous range of entries.]
Minimizing the Impact of the Tree: Compact PAT Trees
Jacobson, Munro, Raman, Clark
• Separate a suffix tree into the leaves (suffix array) plus the structure.
• The tree structure itself can be represented as a sequence of parentheses.
• This sequence can be traversed with the usual tree operations.
• Basic tools are: ranking of bits and finding closing parentheses.
• The result is functionally similar to a suffix tree.
• Experimental results report 5n to 6n bytes.
• So it is still too large.
• Worse than that, it is built from the plain suffix tree.
• Search in secondary storage is reasonable (2–4 disk accesses), by storing subtrees in single disk pages.
• Dynamism is painful.
• The rank mechanism is very valuable by itself.
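The rank mechanism can be sketched with the usual two-level directory: absolute 1-counts at superblock boundaries plus small relative counts per block, so each query scans only one short block. A didactic sketch — real implementations replace the final scan with table lookups on machine words:

```python
class RankBitVector:
    """Two-level rank directory over a list of 0/1 bits."""

    def __init__(self, bits, block=8, sb_factor=8):
        self.bits = bits
        self.block = block
        self.sb = block * sb_factor
        self.sb_counts = []    # ones before each superblock (absolute)
        self.blk_counts = []   # ones before each block, within its superblock
        total = 0
        for i in range(0, len(bits), block):
            if i % self.sb == 0:
                self.sb_counts.append(total)
            self.blk_counts.append(total - self.sb_counts[-1])
            total += sum(bits[i:i + block])

    def rank1(self, i):
        """Number of 1s in bits[0..i]: two lookups plus one short scan."""
        b = i // self.block
        return (self.sb_counts[i // self.sb] + self.blk_counts[b]
                + sum(self.bits[b * self.block:i + 1]))
```

With block = Θ(log n / 2) bits and superblocks of Θ(log^2 n) bits, the counters take o(n) extra space while rank stays constant time.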
[Figure: suffix tree of “alabar a la alabarda$”, its parentheses sequence (()((()())())(()(()())(()())(()())(()()))(()())()(()(()()))(()())), the corresponding bit sequence, and the two-level rank directory over it; the leaves form the suffix array 21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14.]
Smaller than Suffix Arrays: Compact Suffix Arrays
• Exploits self-repetitions in the suffix array.
• An area in the array can be equal to another area, provided all the values are
• Those repetitions are factored out.
• Search time is O(m log n + R) and fast in practice.
• Extra space is around 1.6n.
• Construction needs O(n) time given the suffix array.
• This is much better than anything else seen so far.
• Still no provisions for updating nor secondary memory.
[Figure: compact suffix array of “alabar a la alabarda$”: self-repeating areas of the suffix array 21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14 are successively replaced by (position, length, offset) triples.]
Towards Self-Indexing
• Up to now, we have focused on compressing the index.
• We have obtained decent extra space, 1.6 times the text size.
• However, we still need the text separately available in plain form.
• Could the text be compressed?
• Moreover, could the compressed text act as an index by itself?
• Self-index: a data structure that acts as an index and comprises the text.
• After the index is built, the text can be deleted.
• The retrieval of text passages is done via a request to the index.
• Hence, retrieving text becomes an essential operation of the index.
• Exciting possibility: the whole thing takes less space than the original text!
Exploiting Repetitions Again: Compressed Suffix Arrays
Grossi & Vitter
• We replace the suffix array by a level-wise data structure.
• Level 0 is the suffix array itself.
• Level k + 1 stores only the even pointers of level k, divided by 2.
• Bit array B_k(i) tells whether the i-th suffix array entry is even.
• Array Ψ_k(i) tells where the pointer to the value SA_k[i] + 1 is.
• Note that level k + 1 needs half the space of level k.
• For large enough k = Θ(log log n), the suffix array is stored explicitly.
• If B_k(i) = 1, SA_k[i] = 2 × SA_{k+1}[rank(B_k, i)].
• If B_k(i) = 0, SA_k[i] = 2 × SA_{k+1}[rank(B_k, Ψ_k(i))] − 1.
• This permits computing SA[i] = SA_0[i] in O(log log n) time.
• Hence searching can be done in O((m + log log n) log n) time.
• We store the B_k and Ψ_k explicitly.
• Function rank(B_k, ·) is computed in constant time as explained.
• The Ψ_k vector is stored with differential plus delta encoding, with an absolute sample every log n entries.
• A two-level structure like that for rank computes Ψ_k in constant time.
• The delta encoding dominates space, and it is good because of the same self-repetition property.
• By dividing the array by 2^c instead of by 2, for c = ε log log n, we get constant time access.
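The level-wise scheme can be sketched directly from the recurrences above — a didactic version with 1-based suffix array values, where rank is computed by scanning B_k instead of using the constant-time directory:

```python
def build_csa(text):
    """Build the levels: each keeps B_k and Psi_k, halving the even values."""
    n = len(text)
    cur = sorted(range(1, n + 1), key=lambda i: text[i - 1:])  # level-0 SA
    levels = []
    while len(cur) > 1:
        B = [v % 2 == 0 for v in cur]
        pos = {v: i for i, v in enumerate(cur)}
        psi = [pos.get(v + 1) for v in cur]   # None only at the maximum value
        levels.append((B, psi))
        cur = [v // 2 for v in cur if v % 2 == 0]
    return levels, cur                        # cur: explicit top-level array

def sa_lookup(levels, top, i, k=0):
    """SA_k[i]: even -> 2*SA_{k+1}[rank]; odd -> follow Psi_k, then subtract 1."""
    if k == len(levels):
        return top[i]
    B, psi = levels[k]
    if B[i]:
        return 2 * sa_lookup(levels, top, sum(B[:i + 1]) - 1, k + 1)
    if psi[i] is None:                        # i holds the maximum (odd) value
        return 2 * sum(B) + 1
    return 2 * sa_lookup(levels, top, sum(B[:psi[i] + 1]) - 1, k + 1) - 1

text = "alabar a la alabarda$"
levels, top = build_csa(text)
recovered = [sa_lookup(levels, top, i) for i in range(len(text))]
```

Each lookup descends one level per step, so with Θ(log log n) levels it matches the stated O(log log n) access time.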
[Figure: hierarchical decomposition for “alabar a la alabarda$”: arrays SA_k, B_k and Ψ_k for levels 0 to 3, together with D = 110010000000010110010 and C = $_abdlr.]
Compressed Suffix Arrays without Text
• Use only Ψ = Ψ_0 and C.
• The text can be discarded: a bit vector D, with a 1 at each change of the first character along the suffix array, gives via rank the first character of the suffix pointed to by SA[i].
• Using Ψ_0 we obtain the successive text characters.
• Overall space is n(H_0 + 8 + 3 log_2 H_0) bits, text included.
• In practice, this index takes 0.6n to 0.7n bytes.
• Performance is good for counting, but showing text contexts is slow.
• Search complexity is O(m log n + R).
• Recently shown how to build in little space.
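The self-indexing principle — Ψ, D and C suffice to recover the text — can be sketched as follows. A didactic version using 0-based arrays, a cyclic Ψ, and a text ending in ‘$’ (the example avoids spaces, which would sort below ‘$’ in ASCII):

```python
from itertools import accumulate

def build_self_index(text):
    """Keep only Psi, D and C; the text itself can then be discarded."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    inv = {p: i for i, p in enumerate(sa)}
    psi = [inv[(sa[i] + 1) % n] for i in range(n)]       # cyclic successor
    firsts = [text[p] for p in sa]
    D = [1 if i == 0 or firsts[i] != firsts[i - 1] else 0 for i in range(n)]
    C = sorted(set(text))                                # distinct characters
    return psi, D, C

def extract_text(psi, D, C):
    """Walk Psi from the '$' suffix, reading first characters via rank on D."""
    rank = list(accumulate(D))
    i = psi[0]               # SA[0] is the '$' suffix; Psi jumps to position 0
    out = []
    for _ in range(len(psi) - 1):
        out.append(C[rank[i] - 1])
        i = psi[i]
    return ''.join(out) + '$'
```

Each character costs one Ψ step and one rank, which is exactly why showing long text contexts is the slow operation.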
Building on the Burrows-Wheeler Transform: FM-Index
Ferragina & Manzini
• Based on the Burrows-Wheeler transform.
• The characters preceding each suffix are collected from the suffix array.
• The result is a more compressible permuted text.
• These are coded with move to front, run-length and δ-coding.
• The transformation is reversible.
• Given a position in the permuted text (last column), we can find the position of the letter preceding it in the original text.
• The trick is that we know which letter is at each position in the sorted array
(first column).
• And letters in the first column follow those in the last column.
• Given a letter in the last column, which is the k-th c, we easily find its position in the first column, and hence the character preceding it.
• Starting at the character “$”, we can obtain the text backwards.
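The transform and its inversion can be sketched as follows — a didactic version assuming the text ends with a unique ‘$’ that sorts smallest (the test uses “alabarda$”, without spaces, since a space sorts below ‘$’ in ASCII):

```python
def bwt(text):
    """Last column: the character preceding each suffix, in suffix-array order."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return ''.join(text[i - 1] for i in sa)      # i = 0 wraps to the final '$'

def inverse_bwt(last):
    """Walk the last-to-first mapping backwards, starting from the '$' row."""
    n = len(last)
    C = {c: sum(1 for d in last if d < c) for c in set(last)}
    seen, lf = {}, []
    for c in last:                               # lf[i] = C[c] + Occ[c, i]
        seen[c] = seen.get(c, 0) + 1
        lf.append(C[c] + seen[c] - 1)
    row, out = 0, []                             # row 0 starts with '$'
    for _ in range(n):
        out.append(last[row])
        row = lf[row]
    text = ''.join(reversed(out))                # '$' + text without its '$'
    return text[1:] + '$'
```

The `lf` array is precisely the last-to-first mapping described above: the k-th occurrence of c in the last column is sent to the k-th c of the first column.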
[Figure: BWT of “alabar a la alabarda$”. The suffix array 21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14 6 18 yields the permuted text araadl ll$ bbaar aaaa; the sorted rotations (first and last columns) illustrate how the k-th occurrence of a character in the last column corresponds to its k-th occurrence in the first column.]
• The index also has a cumulative letter frequency array C.
• As well as Occ[c, i] = the number of occurrences of c before position i in the permuted text.
• If we start at a position i in the permuted text, with character c, the previous character is at position C[c] + Occ[c, i].
• The search for pattern p_1 ... p_m is done backwards, in optimal O(m) time.
• First we take the interval for p_m: [l, r) = [C[p_m], C[p_m + 1]).
• The interval for p_{m−1} p_m is [l, r) = [C[p_{m−1}] + Occ[p_{m−1}, l], C[p_{m−1}] + Occ[p_{m−1}, r)), and so on.
• C is small and Occ is cumulative, so it is easy to store with blocks and superblocks.
• The in-block computation of Occ is done by scanning the permuted text.
• We can show text contexts by walking the permuted text as explained.
• A problem is how to know which text position we are at!
• Some suffix array pointers are stored, we walk the text until we find one.
• Overall space is 5H_k n bits, text included, for any k.
• In practice, 0.3 to 0.8 times the text size, and includes the text.
• Counting the number of occurrences is amazingly fast.
• Reporting their positions and text contexts is very slow, however.
• Search complexity is O(m + R log^ε n).
• Construction needs to start from the suffix array.
• Some (theoretical) provisions for dynamism exist, at the price of a slower search.
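The backward search loop itself is tiny. A didactic sketch where Occ is computed by scanning a prefix of the BWT — the real index answers it from blocks and superblocks over the compressed text:

```python
def fm_count(bwt, pattern):
    """Count the occurrences of pattern via backward search over the BWT."""
    C = {c: sum(1 for d in bwt if d < c) for c in set(bwt)}
    occ = lambda c, i: bwt[:i].count(c)      # O(n) scan; real indexes use blocks
    l, r = 0, len(bwt)
    for c in reversed(pattern):              # extend the match one char backwards
        if c not in C:
            return 0
        l = C[c] + occ(c, l)
        r = C[c] + occ(c, r)
        if l >= r:
            return 0
    return r - l
```

Note the first iteration reduces exactly to [C[p_m], C[p_m + 1]), since occ(c, 0) = 0 and C[c] + occ(c, n) is the start of the next character's range.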
[Figure: FM-index components for “alabar a la alabarda$”.]
Suffix array: 21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14 6 18
BWT: araadl ll$ bbaar aaaa
MTF: 2,6,1,0,5,6,5,1,0,5,2,6,0,5,0,6,3,2,0,0,0
C[$]=0, C[_]=1, C[a]=4, C[b]=13, C[d]=15, C[l]=16, C[r]=19
Occ[$] = 0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1
Occ[_] = 0,0,0,0,0,0,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3
Occ[a] = 1,1,2,3,3,3,3,3,3,3,3,3,3,4,5,5,5,6,7,8,9
Occ[b] = 0,0,0,0,0,0,0,0,0,0,0,1,2,2,2,2,2,2,2,2,2
Occ[d] = 0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
Occ[l] = 0,0,0,0,0,1,1,2,3,3,3,3,3,3,3,3,3,3,3,3,3
Occ[r] = 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2
Using just LZ78 Parsing: LZ-Index
• Inspired by the idea of LZ suffix trees, sticking to LZ78.
• Instead of a sparse suffix tree, the very same LZ78 trie is the index.
• We also need the tree of reverse phrases.
• These are represented using the parentheses technique.
• Occurrences can span 2 blocks (search prefix and suffix).
• They can lie inside one block (easily enumerated thanks to LZ78 properties).
• Or they can span 3 blocks or more (complicated, but very few thanks to LZ78 properties).
• Text can be retrieved by following parent pointers in the tree (can be done over the parentheses representation).
• The index needs 4H_k n bits, but 1.2n to 1.6n bytes in practice, text included.
• It needs about the same space as a suffix array to build, and builds fast.
• Query time is O(m^3 + m^2 log n + R log n), and slow in practice.
• But reporting occurrences is much faster, and this can be dominant.
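The LZ78 parsing underlying the index is easy to sketch: each phrase extends the longest previously seen phrase by one character, so the set of phrases forms exactly a trie. A didactic Python sketch:

```python
def lz78_parse(text):
    """Split text into LZ78 phrases; each is a previous phrase plus one char."""
    trie, phrases = {}, []
    node, start, i = trie, 0, 0
    while i < len(text):
        c = text[i]
        if c in node:                 # keep extending the current phrase
            node = node[c]
            i += 1
        else:                         # close the phrase: trie path so far + c
            node[c] = {}
            phrases.append(text[start:i + 1])
            node, i = trie, i + 1
            start = i
    if start < len(text):             # final phrase may repeat an earlier one
        phrases.append(text[start:])
    return phrases
```

For the running example, `lz78_parse("alabar a la alabarda$")` yields the phrases a, l, ab, ar, ␣, a␣, la, ␣a, lab, ard, a$ — the very trie the index stores.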
[Figure: LZ-index of “alabar a la alabarda$”: the LZ78 trie of phrases and the trie of reverse phrases, both stored in parentheses representation.]
Secondary Memory and Dynamic Data Structures
• Most of the methods we have seen are not suited to secondary memory.
• This refers both to construction and searching.
• Most of them are difficult to modify if the text changes.
• Those that can be built online can easily incorporate more text.
• Recently, efficient construction of suffix trees in secondary memory has been achieved, both in theory and in practice.
• Suffix arrays can perform decently on secondary memory, but modifying them implies rewriting them completely.
• Compact PAT Trees put contiguous subtrees in disk pages, obtaining reasonable performance (2–4 disk accesses, O(k/√m + log_m n)).
• But modifying them is painful.
• Another interesting approach for managing insertions is to have buckets of exponentially increasing sizes, one index per bucket.
• Insertion resembles incrementing a binary number and has good amortized performance.
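The bucket scheme works like a binary counter: bucket i, if present, holds an index over 2^i items, and an insertion merges equal-sized buckets until a free slot appears. A sketch where plain sorted lists stand in for the per-bucket indexes:

```python
from bisect import bisect_left

class LogarithmicIndex:
    """One sorted 'index' per bucket; bucket sizes are distinct powers of two."""

    def __init__(self):
        self.buckets = []          # buckets[i]: sorted list of 2^i items, or None

    def insert(self, item):
        carry = [item]             # like adding 1 to a binary counter
        i = 0
        while i < len(self.buckets) and self.buckets[i] is not None:
            carry = sorted(carry + self.buckets[i])   # merge = rebuild the index
            self.buckets[i] = None
            i += 1
        if i == len(self.buckets):
            self.buckets.append(None)
        self.buckets[i] = carry

    def search(self, item):
        """Query every nonempty bucket; O(log n) buckets at most."""
        for b in self.buckets:
            if b:
                j = bisect_left(b, item)
                if j < len(b) and b[j] == item:
                    return True
        return False
```

Each item participates in O(log n) merges over its lifetime, which is where the good amortized insertion cost comes from; queries pay by probing every bucket.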
Dynamic in Secondary Memory: String B-Tree
Ferragina & Grossi
• A data structure to store a set of strings.
• It is like a B-tree of strings, but each node stores a compressed trie of the set of keys it holds.
• Takes 12n space, but can be reduced to 6n.
• Permits searching substrings of any string.
• Search time is optimal, O((m + R)/B + log_B n).
• In practice it takes 4–6 disk accesses per query.
• It can insert a new text of length n′ in O(n′ log_B(n + n′)) time.
• Same complexity for deleting a text from the set.
• It uses optimal space O(n/B).
• It has been implemented and is a very attractive alternative.
• Yet, it is not succinct.
Open Challenges
• Many goals have been obtained separately:
– Fast searches and presentation of results.
– Succinct space for construction and usage.
– Efficient construction and search in secondary memory.
– Efficient insertions and deletions of texts.
• However, no existing data structure fits all the requirements.
• Several theoretical proposals remain unimplemented.
• We have only considered exact searching for simple patterns.
[Figure: map of the structures discussed — suffix trie, suffix tree, DAWG, CDAWG, sparse suffix tree, q-grams, suffix array, Compact PAT Tree, Compact SA, CSA, FM-index, LZ-index, LZ-stree, String B-tree — against the three goals: succinct, dynamic, and secondary memory.]