Filtering

This document summarizes the PEX filtering algorithm for approximate string matching. The algorithm works by dividing the pattern string into pieces and searching for matches to each piece in the text. Any positions matching a piece are then verified for a full match. To reduce verifications, a hierarchical approach divides pieces into smaller sub-pieces, searching for matches with fewer errors at each level. The algorithm aims to filter out non-matching text regions quickly while minimizing costly verification steps.

Uploaded by

Bohlokoa Nei

Fast filtering algorithms

This exposition is based on the following sources, which are all recommended reading:

1. Flexible Pattern Matching in Strings, Navarro, Raffinot, 2002, chapter 6.5, pages 162ff.

2. Burkhardt et al.: q-gram Based Database Searching Using a Suffix Array (QUASAR), RECOMB 99

We will present the hierarchical filtering approach of Navarro and Baeza-Yates and
the simple QUASAR idea.

Filtering algorithms
The idea behind filtering algorithms is that it might be easier to check that a text
position does not match a pattern string than to verify that it does.

Filtering algorithms filter out portions of the text that cannot possibly contain a
match, and, at the same time, find positions that can possibly match.

These potential match positions then need to be verified with another algorithm,
for example the bit-parallel algorithm of Myers (BPM).

Filtering algorithms (2)

Filtering algorithms are very sensitive to the error level α := k/m since this normally
affects the amount of text that can be discarded from further consideration. (m =
pattern length, k = errors.)

If most of the text has to be verified, the additional filtering steps are an overhead
compared to the strategy of just verifying the pattern in the first place.

On the other hand, if large portions of the text can be discarded quickly, then the
filtering results in a faster search.

Filtering algorithms can improve the average-case performance (sometimes
dramatically), but not the worst-case performance.

PEX

The pigeonhole principle
The idea behind the presented filtering algorithm is simple. Assume that we want
to find all occurrences of a pattern $P = p_1 \dots p_m$ in a text $T = t_1 \dots t_n$ that have an
edit distance of at most k.

If we divide the pattern into k + 1 pieces $P = p^1 \dots p^{k+1}$, then at least one of the
pattern pieces has to match without error, because k errors can affect at most k of
the k + 1 pieces.

The pigeonhole principle (2)

There is a more general version of this principle, first formalized by Myers in 1994:
Lemma 1. Let Occ match P with k errors, let $P = p^1 \dots p^j$ be a concatenation of
subpatterns, and let $a_1, \dots, a_j$ be nonnegative integers such that $A = \sum_{i=1}^{j} a_i$. Then, for
some $i \in \{1, \dots, j\}$, Occ includes a substring that matches $p^i$ with $\lfloor a_i k / A \rfloor$ errors.
Proof: Exercise.
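
As a quick check of how the lemma is applied, here are two instances. The choices of $a_i$ are illustrative assumptions, not part of the lemma: the first recovers the simple pigeonhole statement for k + 1 pieces, the second gives the error bound for a split into two halves.

```latex
% k + 1 pieces with a_i = 1 for all i, hence A = k + 1:
\left\lfloor \frac{a_i k}{A} \right\rfloor
  = \left\lfloor \frac{k}{k+1} \right\rfloor = 0
% two pieces with a_1 = a_2 = 1, hence A = 2:
\left\lfloor \frac{a_i k}{A} \right\rfloor
  = \left\lfloor \frac{k}{2} \right\rfloor
```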

The pigeonhole principle (3)

So the basic procedure is:

1. Divide: Divide the pattern into k + 1 pieces of approximately the same length.

2. Search: Search all the pieces simultaneously with a multi-pattern string matching
algorithm. According to the above lemma, each possible occurrence will
match at least one of the pattern pieces.

3. Verify: For each found pattern piece, check the neighborhood with a verification
algorithm that is able to detect an occurrence of the whole pattern with edit
distance at most k. Since we allow indels, if $p_{i_1} \dots p_{i_2}$ matches the text $t_j \dots t_{j+i_2-i_1}$,
then the verification has to consider the text area $t_{j-(i_1-1)-k} \dots t_{j+(m-i_1)+k}$,
which is of length m + 2k.

An example
Say we want to find the pattern annual in the texts
t1 = any annealing and
t2 = an unusual example with numerous verifications
with at most 2 errors.

An example (2)

1. Divide: We divide the pattern annual into $p^1$ = an, $p^2$ = nu, and $p^3$ = al. One
of these subpatterns has to match with 0 errors.

2. Search: We search for all subpatterns:

1: searching for an: in t_1: find positions 1, 5
                     in t_2: find position 1
2: searching for nu: in t_1: find no positions
                     in t_2: find positions 5, 25
3: searching for al: in t_1: find position 9
                     in t_2: find position 9

3. Verification: We have to verify 3 positions in t1 and 4 positions in t2, and find 3
occurrences at positions (indexed by the last character) 9, 10, 11 in t1 and none
in t2.
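
The three steps, applied to the example above, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from the sources: the names `end_positions` and `pex_basic` are hypothetical, `str.find` stands in for a real multi-pattern search, and a plain dynamic-programming verifier (Sellers' algorithm) replaces BPM.

```python
def end_positions(pattern, text, k):
    """1-based end positions in text of matches of pattern with at most k edits."""
    m = len(pattern)
    col = list(range(m + 1))               # DP column before reading any text
    ends = []
    for j, c in enumerate(text, start=1):
        diag, col[0] = col[0], 0           # an occurrence may start at any text position
        for i in range(1, m + 1):
            best = min(diag + (pattern[i - 1] != c),  # match or substitution
                       col[i] + 1,                     # extra text character
                       col[i - 1] + 1)                 # missing pattern character
            diag, col[i] = col[i], best
        if col[m] <= k:
            ends.append(j)
    return ends


def pex_basic(pattern, text, k):
    """Divide, search, verify; returns sorted 1-based end positions."""
    m = len(pattern)
    plen = m // (k + 1)                    # assumption: m >= k + 1
    found = set()
    for piece_no in range(k + 1):
        s = piece_no * plen                # 0-based start of the piece in the pattern
        e = m if piece_no == k else s + plen   # the last piece takes the remainder
        piece = pattern[s:e]
        pos = text.find(piece)             # stand-in for a multi-pattern search
        while pos != -1:
            lo = max(0, pos - s - k)                  # window of length <= m + 2k
            hi = min(len(text), pos + (m - s) + k)
            for end in end_positions(pattern, text[lo:hi], k):
                found.add(lo + end)        # set membership removes repeated reports
            pos = text.find(piece, pos + 1)
    return sorted(found)


print(pex_basic("annual", "any annealing", 2))   # [9, 10, 11]
```

Collecting the results in a set is the simplest way to cope with several pieces triggering the same verification; it does not avoid the repeated work itself.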

Hierarchical verification
The toy example makes clear that many verifications can be triggered that are
unsuccessful and that many subpatterns can trigger the same verification. Repeated
verifications can be avoided by carefully sorting the occurrences of the pattern
(exercise).

It was shown by Baeza-Yates and Navarro that the running time is dominated by the
multipattern search for error levels $\alpha = k/m$ below $1/(3 \log_{|\Sigma|} m)$. In this region, the
search cost is about $O(kn \log_{|\Sigma|} m / m)$. For higher error levels, the cost for verifications
starts to dominate, and the filter efficiency deteriorates abruptly.

Baeza-Yates and Navarro introduced the idea of hierarchical verification to reduce
the verification costs, which we will explain next. Then we will work out more details
of the three steps.

Hierarchical verification (2)

Navarro and Baeza-Yates use Lemma 1 for a hierarchical verification. The idea
is that, since the verification cost is high, we pay too much for verifying the whole
pattern each time a small piece matches. We could possibly reject the occurrence
with a cheaper test for a shorter pattern.

So, instead of directly dividing the pattern into k + 1 pieces, we do it hierarchically.

We split the pattern first into two pieces and search for each piece with $\lfloor k/2 \rfloor$ errors,
following Lemma 1. The halves are then recursively split and searched until the
error rate reaches zero, i.e. we can search for exact matches.

With hierarchical verification the area of applicability of the filtering algorithm grows
to $\alpha < 1/\log_{|\Sigma|} m$, an error level three times as high as for the naive partitioning and
verification. In practice, the filtering algorithm pays off for $\alpha < 1/3$ for medium long
patterns.

Hierarchical verification (3)

Example. Say we want to find the pattern P = aaabbbcccddd in the text T =
xxxbbbxxxxxx with at most k = 3 differences. The pattern is split into four pieces
$p^1$ = aaa, $p^2$ = bbb, $p^3$ = ccc, $p^4$ = ddd. We search with k = 0 errors in level 2 and
find bbb.

level 0            aaabbbcccddd            with k=3 errors
                   /          \
level 1       aaabbb          cccddd       with k=1 errors
              /    \          /    \
level 2     aaa    bbb      ccc    ddd     with k=0 errors

Hierarchical verification (4)

Now instead of verifying the complete pattern in the complete text (at level 0) with
k = 3 errors, we only have to check a slightly bigger pattern (aaabbb) at level 1 with
one error. This is much cheaper. In this example we can decide that the occurrence
bbb cannot be extended to a match.

level 0            aaabbbcccddd            with k=3 errors
                   /          \
level 1       AAABBB          cccddd       with k=1 errors
              /    \          /    \
level 2     aaa    BBB      ccc    ddd     with k=0 errors

The PEX algorithm
Divide: Split the pattern into k + 1 pieces, such that each piece has equal probability of
occurring in the text. If no other information is available, the uniform distribution is
assumed and hence the pattern is divided into pieces of equal length.

The PEX algorithm (2)

Build Tree: Build a tree of the pattern for the hierarchical verification. If k + 1 is not
a power of 2, we try to keep the binary tree as balanced as possible.

Each node has two members from and to indicating the first and the last position of
the pattern piece represented by it. The member err holds the number of allowed
errors. A pointer myParent leads to its parent in the tree. (There are no child
pointers, since we traverse the tree only from the leaves to the root.) An internal
variable left holds the number of pattern pieces in the left subtree. idx is the next
leaf index to assign. plen is the length of a pattern piece.

Algorithm CreateTree generates a hierarchical verification tree for a single pattern.
(Lines 12 and 14, which compute the error bounds lk and rk, are justified by Lemma 1.)

The PEX algorithm (3)

 1 CreateTree( p = p_i p_{i+1} ... p_j, k, myParent, idx, plen )
 2 // Note: the initial call is: CreateTree( p, k, nil, 0, ⌊m/(k + 1)⌋ )
 3 Create new node node
 4 from(node) = i
 5 to(node) = j
 6 left = ⌈(k + 1)/2⌉
 7 parent(node) = myParent
 8 err(node) = k
 9 if k = 0
10 then leaf_idx = node
11 else
12     lk = ⌊(left · k)/(k + 1)⌋
13     CreateTree( p_i ... p_{i+left·plen−1}, lk, node, idx, plen )
14     rk = ⌊((k + 1 − left) · k)/(k + 1)⌋
15     CreateTree( p_{i+left·plen} ... p_j, rk, node, idx + left, plen )
16 fi

The PEX algorithm (4)

Example: Find the pattern P = annual in the text T = annual CPM anniversary
with at most k = 2 errors. First we build the tree with k + 1 = 3 leaves. Below we
write at each node $n_i$ the variables (from, to, err).

"annual" n4=(1,6,2)
/ \
"annu" n3=(1,4,1) \
/ \ \
"an" n0=(1,2,0) "nu" n1=(3,4,0) "al" n2=(5,6,0)
| | |
leaf 0 leaf 1 leaf 2
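
The construction of this tree can be reproduced with a short Python sketch of CreateTree. The `Node` class, the field name `frm` (since `from` is a Python keyword) and the `leaves` dictionary are assumptions made for this illustration; positions are 1-based as on the slides.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    frm: int                     # first pattern position of the piece (1-based)
    to: int                      # last pattern position of the piece (1-based)
    err: int                     # number of errors allowed for this piece
    parent: Optional["Node"]     # None for the root


def create_tree(i: int, j: int, k: int, parent: Optional[Node],
                idx: int, plen: int, leaves: Dict[int, Node]) -> Node:
    """Build the verification tree for pattern positions i..j with k errors."""
    node = Node(i, j, k, parent)
    if k == 0:
        leaves[idx] = node                       # leaf number idx
        return node
    left = (k + 2) // 2                          # ceil((k + 1) / 2) pieces go left
    lk = (left * k) // (k + 1)                   # errors for the left part (Lemma 1)
    rk = ((k + 1 - left) * k) // (k + 1)         # errors for the right part (Lemma 1)
    create_tree(i, i + left * plen - 1, lk, node, idx, plen, leaves)
    create_tree(i + left * plen, j, rk, node, idx + left, plen, leaves)
    return node


pattern, k = "annual", 2
leaves: Dict[int, Node] = {}
root = create_tree(1, len(pattern), k, None, 0, len(pattern) // (k + 1), leaves)
for idx in sorted(leaves):
    n = leaves[idx]
    print(idx, pattern[n.frm - 1:n.to], (n.frm, n.to, n.err))
```

For annual with k = 2 this yields the leaves an, nu and al with the (from, to, err) triples shown in the tree.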

The PEX algorithm (5)

Search: After constructing the tree, we have k + 1 leaves $leaf_i$. The k + 1 subpatterns

$\{\, p_{from(n)} \dots p_{to(n)} \mid n = leaf_i,\ i \in \{0, \dots, k\} \,\}$

are sent as input to a multi-pattern search algorithm (e.g. Aho-Corasick,
Wu-Manber, or SBOM). This algorithm gives as output a list of pairs (pos, i) where
pos is the text position that matched and i is the number of the piece that matched.

The PEX algorithm performs verifications on its way upward in the tree, checking
the presence of longer and longer pieces of the pattern, as specified by the nodes.

The PEX algorithm (6)

Search phase of algorithm PEX

for (pos, i) ∈ output of multi-pattern search do
    n = leaf_i; in = from(n); n = parent(n);
    cand = true;
    while cand = true and n ≠ nil do
        p1 = pos − (in − from(n)) − err(n);
        p2 = pos + (to(n) − in) + err(n);
        verify text t_{p1} ... t_{p2} for pattern piece p_{from(n)} ... p_{to(n)}
            allowing err(n) errors;
        if pattern piece was not found
            then cand = false;
            else n = parent(n);
        fi
    od
    if cand = true
        then report the positions where the whole p was found;
    fi
od

The PEX algorithm (7)

We search for annual in annual CPM anniversary. We constructed the tree for
annual. A multi-pattern search algorithm finds: (1, 1), (12, 1), (3, 2), (5, 3). (Note
that leaf i corresponds to pattern piece $p^{i+1}$, so the second component is the leaf
index plus one.) For each of these positions we do the hierarchical verification:

Initialization for (1,1);

n=n0; in=1; n=n3; cand=true;
While loop;
a) p1=1-(1-1)-1=0; p2=1+(4-1)+1=5;
verify pattern annu in text annua with 1 error => found !
b) p1=1-(1-1)-2=-1; p2=1+(6-1)+2=8;
verify pattern annual in text annual_C => found !
c) report end positions (6,7,8)

The PEX algorithm (8)

Initialization for (3,2);

n=n1; in=3; n=n3; cand=true;
While loop;
a) p1=3-(3-1)-1=0; p2=3+(4-3)+1=5;
verify pattern annu in text annua with 1 error => found !
b) p1=3-(3-1)-2=-1; p2=3+(6-3)+2=8;
verify pattern annual in text annual_C => found !
c) report end positions (6,7,8)

The PEX algorithm (9)

Initialization for (12,1);

n=n0; in=1; n=n3; cand=true;
While loop;
a) p1=12-(1-1)-1=11; p2=12+(4-1)+1=16;
verify pattern annu in text _anniv with 1 error => found !
b) p1=12-(1-1)-2=10; p2=12+(6-1)+2=19;
verify pattern annual in text M_annivers => NOT found !
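
The upward walk of these verifications can be sketched in Python. This is only an illustration under stated assumptions: the tree for annual is written out by hand, the names `Node`, `occurs` and `survives` are hypothetical, and a plain dynamic-programming check stands in for the verification algorithm. The sketch only decides whether a hit survives up to the root; it does not report end positions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    frm: int                     # first pattern position (1-based)
    to: int                      # last pattern position (1-based)
    err: int                     # allowed errors
    parent: Optional["Node"]


def occurs(piece: str, window: str, k: int) -> bool:
    """True if piece matches some substring of window with at most k edits."""
    col = list(range(len(piece) + 1))
    for c in window:
        diag, col[0] = col[0], 0                 # a match may start anywhere
        for i in range(1, len(piece) + 1):
            diag, col[i] = col[i], min(diag + (piece[i - 1] != c),
                                       col[i] + 1, col[i - 1] + 1)
        if col[-1] <= k:
            return True
    return False


def survives(pattern: str, text: str, pos: int, leaf: Node) -> bool:
    """Climb from the leaf to the root; pos is the 1-based position of the exact hit."""
    start = leaf.frm                             # 'in' in the pseudocode
    n = leaf.parent
    while n is not None:
        p1 = pos - (start - n.frm) - n.err
        p2 = pos + (n.to - start) + n.err
        window = text[max(p1, 1) - 1:p2]         # clip the window to the text
        if not occurs(pattern[n.frm - 1:n.to], window, n.err):
            return False                         # rejected by a cheap test
        n = n.parent
    return True


# Hand-built tree for "annual" with k = 2 (see the example tree).
root = Node(1, 6, 2, None)
n3 = Node(1, 4, 1, root)
n0, n1, n2 = Node(1, 2, 0, n3), Node(3, 4, 0, n3), Node(5, 6, 0, root)

text = "annual CPM anniversary"
for pos, leaf in [(1, n0), (12, n0), (3, n1), (5, n2)]:
    print(pos, survives("annual", text, pos, leaf))
```

The hit at position 12 passes the level-1 test (annu against _anniv with one error) but is rejected at the root, matching the walkthrough.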

Summary

• Filtering algorithms prevent a large portion of the text from being looked at.

• The larger α = k/m, the less efficient filtering algorithms become.

• Filtering algorithms based on the pigeonhole principle need an exact,
multi-pattern search algorithm and a verification-capable approximate string matching
algorithm.

• The PEX algorithm starts verification from short exact matches and considers
longer and longer substrings of the pattern as the verification proceeds upward
in the tree.

QUASAR - q-gram based database searching
This exposition has been developed by Knut Reinert. It is based on the following
sources, which are all recommended reading:

1. Burkhardt et al. (1999) q-gram Based Database Searching Using a Suffix Array
(QUASAR), Proc. RECOMB 99.

2. Burkhardt and Kärkkäinen (2001) Better Filtering with Gapped q-grams, Proc.
CPM 01.

The tool QUASAR aims at aligning a query $S = s_1 \dots s_m$ against a text, also called the
database, $D = d_1 \dots d_n$. It can be seen as an efficient filter that uses exact matches.
In contrast to online filtering algorithms, QUASAR uses a suffix array as an indexing
structure for the database.

Quasar
QUASAR, or “Q-gram Alignment based on Suffix ARrays”, is a filtering
approach. QUASAR finds all local approximate matches of a query sequence S in
a database D = {d, ...}. The verification is performed by other means.

Definition. A sequence d is locally similar to S if there exists at least one pair
$(S_{i,i+w-1}, d')$ of substrings such that:

1. $S_{i,i+w-1}$ is a substring of S of length w and $d'$ is a substring of d, and

2. the substrings $d'$ and $S_{i,i+w-1}$ have edit distance at most k.

We call this the approximate matching problem with k differences and window length
w.

For simplicity, we assume that the database consists of only one sequence, i. e.
D = {d}.

The q-gram lemma
A short substring of length q is called a q-gram. In the following we start by
considering the first w letters of S. The algorithm uses the following lemma:
Lemma 2. Let P and S be strings of length w with at most k differences. Then P
and S share at least $w + 1 - (k + 1)q$ common q-grams.

In our case, this means:

Lemma 3. Let an occurrence of $S_{1,w}$ with at most k differences end at position j
in D. Then at least $w + 1 - (k + 1)q$ of the q-grams in $S_{1,w}$ occur in the substring
$D_{j-w+1,j}$.
Proof: Exercise.

That means that as a necessary condition for an approximate match, at least
$t = w + 1 - (k + 1)q$ of the q-grams contained in $S_{1,w}$ occur in a substring of D with
length w. For example, the strings ACAGCTTA and ACACCTTA differ in one position and
have $8 + 1 - (1 + 1) \cdot 3 = 3$ common 3-grams, namely ACA, CTT and TTA.
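
The example can be checked with a few lines of Python. This is a small sketch; the function names `qgram_threshold` and `common_qgrams` are hypothetical, and common q-grams are counted as a multiset intersection.

```python
from collections import Counter


def qgram_threshold(w: int, q: int, k: int) -> int:
    """Lower bound t = w + 1 - (k + 1) q from the q-gram lemma."""
    return w + 1 - (k + 1) * q


def common_qgrams(a: str, b: str, q: int) -> int:
    """Number of q-grams shared by a and b, counted with multiplicity."""
    grams_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    grams_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum((grams_a & grams_b).values())


print(qgram_threshold(8, 3, 1))                   # 3
print(common_qgrams("ACAGCTTA", "ACACCTTA", 3))   # 3
```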

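The q-gram lemma (2)

[Figure: the window $S_{1,w}$ of the query is compared with substrings of length w in D. If the
number of common q-grams is < t, there is no match; if it is ≥ t, there may be a match.]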

q-gram index
The algorithm builds in a first step an indexing structure as follows:

1. Build a suffix array A over D.

2. Given q, compute for all possible $|\Sigma|^q$ q-grams the start position of the hit list.
This allows looking up a q-gram in constant time.

3. If another q is specified, A is used to recompute the above table.
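
These steps can be illustrated with a small Python sketch. It rests on several assumptions: the suffix array is built naively by sorting suffixes (fine for toy inputs only), the table is a dictionary rather than a packed array, it stores the end of each hit list next to the start, and the names `build_qgram_index` and `hits` are hypothetical. Positions are 0-based.

```python
from bisect import bisect_left, bisect_right
from itertools import product


def build_qgram_index(d: str, q: int, alphabet: str = "ACGT"):
    """Return (suffix array, table); table maps each q-gram to its hit-list range."""
    sa = sorted(range(len(d)), key=lambda i: d[i:])     # naive suffix array
    prefixes = [d[i:i + q] for i in sa]                  # sorted, because sa is sorted
    table = {}
    for letters in product(sorted(alphabet), repeat=q):  # all |alphabet|^q q-grams
        gram = "".join(letters)
        table[gram] = (bisect_left(prefixes, gram), bisect_right(prefixes, gram))
    return sa, table


def hits(index, gram: str):
    """Sorted 0-based start positions of gram in the database."""
    sa, table = index
    lo, hi = table[gram]
    return sorted(sa[lo:hi])


index = build_qgram_index("ACACAC", 2)
print(hits(index, "AC"))   # [0, 2, 4]
print(hits(index, "CA"))   # [1, 3]
```

Because the table is derived from the suffix array alone, calling `build_qgram_index` again with another q corresponds to step 3.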

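q-gram index (2)

[Figure: the q-gram table, with one entry per q-gram from AAAA to TTTT, points to the start of
each hit list in the suffix array. The suffix array lists the start positions of the
lexicographically sorted suffixes of D, e.g. 7 for AAAAACGCTAAGCG..., 87 for AAAAAGGCT...,
32 for AAAAAGGTTCTCCTTAAATC..., 12 for AAAAGAAAGTTCTCCTTAAATC..., up to 3 for
TTTTAAGGCCTTAAATC....]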

Counting q-grams
Now we have to find all approximate matches between S1,w and D, that means we
have to find all substrings in D that share at least t q-grams with S1,w . The algorithm
proceeds in the following basic steps on which we will elaborate:

1. Define two arrays of non-overlapping blocks of size b ≥ 2w. The first array is
shifted by b/2 against the other.

2. Process all q-grams in S1,w and increment the counters of the corresponding
blocks.

3. All blocks containing approximate matches will have a counter of at least t. (The
reverse is not true).

4. Shift the search window by one. Now we consider S2,w+1.

Blocking

B_1 B_3 .............. B_c−3 B_c−1

B_2 B_4 .............. B_c−2

Since we want to count the q-grams that are in common between the query and
the database, we use counters. Ideally we would use a counter of size w for each
substring of this size. Since this uses too much memory, we build larger, non-
overlapping blocks. While this decreases the memory usage, it also decreases the
specificity.

Since the blocks are not overlapping we might miss q-grams that cross the block
boundary. As a remedy, we use a second, shifted array of blocks.

Window Shifting
We started the search for approximate matches of window length w with the first
w-mer in S, namely S1,w . In order to determine the approximate matches for the
next window S2,w+1, we only have to discard the old q-gram S1,q and consider the
new q-gram Sw−q+2,w+1.

To do that we decrement the counters of all blocks that contain S1,q that have not
reached the threshold t. However, if the counter has already reached t it stays at
this value to indicate a match for the extension phase.

For the new q-gram we use the precomputed index and the suffix array to find its
occurrences and increment the corresponding block counters (at most two per
occurrence, one in each block array).
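
The counting step for a single window can be sketched as follows. This is a simplified illustration under stated assumptions: a dictionary of q-gram positions replaces the suffix-array index, a hit is assigned to the block containing its start position, window shifting and the freezing of counters at t are left out, and the names `block_counters` and `candidate_blocks` are hypothetical.

```python
from collections import Counter, defaultdict


def block_counters(db: str, window: str, q: int, b: int):
    """Count q-gram hits of window per block, for two block arrays offset by b/2."""
    occ = defaultdict(list)                      # q-gram -> start positions in db
    for p in range(len(db) - q + 1):
        occ[db[p:p + q]].append(p)
    plain, shifted = Counter(), Counter()
    for i in range(len(window) - q + 1):
        for p in occ.get(window[i:i + q], ()):
            plain[p // b] += 1                   # block array starting at 0
            shifted[(p + b // 2) // b] += 1      # block array shifted by b/2
    return plain, shifted


def candidate_blocks(counter: Counter, t: int):
    """Blocks whose counter reached the threshold t."""
    return sorted(blk for blk, c in counter.items() if c >= t)


db = "T" * 20 + "ACGTAC" + "T" * 6
plain, shifted = block_counters(db, "ACGTAC", q=4, b=16)
t = 6 + 1 - (0 + 1) * 4                          # w = 6, k = 0, q = 4
print(candidate_blocks(plain, t), candidate_blocks(shifted, t))
```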

Alignment
After having computed the list of blocks, QUASAR uses BLAST to actually search
the blocks. Here are some results from the initial implementation. QUASAR was
run with w = 50, q = 11, and t such that windows with at most 6% differences are
found. Reasonable values for the block size are 512 to 4096.

DB size   query   id. res.   filtr. ratio   QUASAR    BLAST
73.5 Mb   368     91.4%      0.24%          0.123 s   3.27 s
280 Mb    393     97.1%      0.17%          0.38 s    13.27 s

“A database in BLAST format is built in main memory which is then passed to the BLAST search
engine. The construction of this database requires a significant amount of time and introduces
unnecessary overhead.”

Gapped q-grams
In order to achieve a high filtration rate, we would like to choose q as large as
possible, since the number of hits decreases exponentially in q. On the other hand,
the threshold $t = w - q - qk + 1$ also decreases with increasing q, thereby reducing
the filtering efficiency. The question is whether we could increase the length of the
q-grams somehow, such that the threshold t stays high.

This can indeed be achieved by using gapped q-grams. For example, the 3-grams
with the shape ##.# in the string ACAGCT are AC.G, CA.C, and AG.T:

ACAGCT
AC G
 CA C
  AG T

Next we define the concept formally.

Gapped q-grams (2)

Definition 4.

• A shape Q is a set of non-negative integers containing 0.

• The size of Q, denoted by |Q|, is the cardinality of the set.

• The span of Q is s(Q) = max Q + 1.

• A shape of size q and span s is called (q, s)-shape.

• For any integer i and shape Q, the positioned shape $Q_i$ is the set $\{i + j \mid j \in Q\}$.

• Let $Q_i = \{i_1, i_2, \dots, i_q\}$, where $i = i_1 < i_2 < i_3 < \dots < i_q$, and let $S = s_1 s_2 \dots s_m$
be a string. For $1 \le i \le m - s(Q) + 1$, the Q-gram at position i in S, denoted
by $S[Q_i]$, is the string $s_{i_1} s_{i_2} \dots s_{i_q}$.

• Two strings P and S have a common Q-gram at position i if $P[Q_i] = S[Q_i]$.

Gapped q-grams (3)

Example 5. Let Q = {0, 1, 3, 6} be a shape. In the graphical representation
it is the shape ##.#..#. Its size is |Q| = 4 and its span is s(Q) = 7. The string
S = ACGGATTAC has three Q-grams: $S[Q_1] = s_1 s_2 s_4 s_7$ = ACGT, $S[Q_2]$ = CGAA, and
$S[Q_3]$ = GGTC.
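
Extracting Q-grams is straightforward to express in code. The following Python sketch uses the hypothetical helper names `shape_from_string` and `q_grams` and represents a shape as a sorted list of offsets.

```python
from typing import List


def shape_from_string(s: str) -> List[int]:
    """Convert a graphical shape such as '##.#' into its offset list [0, 1, 3]."""
    return [i for i, c in enumerate(s) if c == "#"]


def q_grams(s: str, shape: List[int]) -> List[str]:
    """All Q-grams of s for the given shape, in order of position."""
    span = max(shape) + 1
    return ["".join(s[i + j] for j in shape) for i in range(len(s) - span + 1)]


print(q_grams("ACGGATTAC", [0, 1, 3, 6]))            # ['ACGT', 'CGAA', 'GGTC']
print(q_grams("ACAGCT", shape_from_string("##.#")))  # ['ACG', 'CAC', 'AGT']
```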

The q-gram lemma can be extended for gapped q-grams. A generalization gives

t = w − s(Q) − |Q|k + 1.
However it is not tight anymore (we will prove this).

Gapped q-grams (4)

Example 6. Let w = 11 and k = 3 and consider the 3-shapes ### and ##.#. The
above threshold for the two shapes is $11 - 3 \cdot 4 + 1 = 0$ and
$11 - 4 - 3 \cdot 3 + 1 = -1$, respectively. Thus neither shape would be useful for filtering. However,
the real threshold for ##.# is 1. This can be checked by a full enumeration of all
combinations of 3 mismatches.

[Figure: worst-case mismatch positions for the shapes ### and ##.#.]

New threshold
What is the (tight) threshold for arbitrary Q-shapes?

Let $P = p_1 \dots p_w$ and $S = s_1 \dots s_w$ be two strings of length w. Let R(P, S) be
the set of positions where S and P do not match. Then |R(P, S)| is the Hamming
distance of P and S.

To determine the common Q-grams of P and S only the mismatch set is needed: It
holds that
$P[Q_i] = S[Q_i]$ if and only if $Q_i \cap R(P, S) = \emptyset$.

Using this notation we can define the threshold of a shape Q for a pattern of length
w and Hamming distance k as:

$$t(Q, w, k) := \min_{R \subseteq \{1,\dots,w\},\ |R| = k} \bigl|\{\, i \in \{1, \dots, w - s(Q) + 1\} \mid Q_i \cap R = \emptyset \,\}\bigr|$$
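
For small parameters this definition can be evaluated directly by enumerating all mismatch sets. The following Python sketch does exactly that; the name `tight_threshold` is hypothetical, positions are 0-based, and the running time grows with the number of k-subsets of the w positions, so it is only meant for toy values.

```python
from itertools import combinations
from typing import List


def tight_threshold(shape: List[int], w: int, k: int) -> int:
    """Minimum number of positioned shapes left untouched by any k mismatches."""
    span = max(shape) + 1
    placed = [frozenset(i + j for j in shape) for i in range(w - span + 1)]
    best = len(placed)
    for mismatches in combinations(range(w), k):     # every mismatch set R
        r = set(mismatches)
        untouched = sum(1 for q in placed if r.isdisjoint(q))
        best = min(best, untouched)
    return best


print(tight_threshold([0, 1, 2], 11, 3))   # 0  (shape ###)
print(tight_threshold([0, 1, 3], 11, 3))   # 1  (shape ##.#)
```

For w = 11 and k = 3 the enumeration gives 0 for ### and 1 for ##.#, although the simple bound for ##.# is negative.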

New threshold (2)

From the above discussion we get the following tight form of the q-gram lemma for
arbitrary shapes:
Lemma 7. Let Q be a shape. For any two strings P and S of length w with Ham-
ming distance k , the number of common Q-grams of P and S is at least t(Q, w, k).
Furthermore, there exist two strings P and S of length w and Hamming distance k,
for which the number of common Q-grams is exactly t(Q, w, k).

New threshold (3)

It is easy to see that this bound is at least as tight as the lower bound we already
introduced:
Lemma 8.
$t(Q, w, k) \ge \max\{0,\ w - s(Q) - |Q|k + 1\}$
Proof: Let R be the set minimizing the expression in the definition of
t(Q, w, k). For each $j \in R$ there are exactly |Q| integers i such that $j \in Q_i$.
Therefore, at most k|Q| of the positioned shapes $Q_i$, $i \in \{1, \dots, w - s(Q) + 1\}$,
intersect with R, and at least $w - s(Q) - k|Q| + 1$ do not intersect with R.

New threshold (4)

The above lemma gives indeed the exact threshold for ungapped q-grams.
Lemma 9. Let Q be a contiguous shape, i.e., $Q = \{0, \dots, q - 1\}$. Then

$t(Q, w, k) = \max\{0,\ w - s(Q) - |Q|k + 1\} = \max\{0,\ w - q(k + 1) + 1\}$.

Proof: The lower bound is shown by Lemma 8. For the upper bound
we choose $R = \{q, 2q, \dots, kq\}$. Then $Q_i$ intersects with R if and only if
$i \in \{1, \dots, kq\}$, and thus does not intersect with R if $i \in \{kq + 1, \dots, w - q + 1\}$.
Hence for this R we have only $w - q + 1 - kq = w - (k + 1)q + 1$ common
q-grams.

New threshold (5)

The following table gives the exact thresholds for all shapes for w = 50 and k = 5.
One can see that in many cases, especially for higher values of q, best gapped
shapes have higher thresholds than contiguous shapes of the same or even smaller
size.

s↓ : q→   4     5         6         7         8       9       10
5         26    21        −         −         −       −       −
6         25    20        15        −         −       −       −
7         24    19        14        9         −       −       −
8         23    18        13        8         3       −       −
9         22    18 > 17   14 > 12   9 > 7     5 > 2   0       −
10        21    18 > 16   13 > 11   10 > 6    6 > 1   3 > 0   0
11        20    16 > 15   13 > 10   10 > 5    7 > 0   4 > 0   2 > 0
12        19    16 > 14   12 > 9    9 > 4     7 > 0   4 > 0   2 > 0

New threshold (6)

It has to be noted that it does not suffice to put in gaps somewhere; the gaps have
to be chosen carefully. For example, for w = 50, k = 5, and q = 12
there are only two shapes with a positive threshold, namely ###.#..###.#..###.#
and #.#.#...#.....#.#.#...#.....#.#.#...# and their mirror images.

Minimum coverage
The filtering efficiency of a Q-gram clearly depends on the threshold t(Q, w, k).
However there is also another factor that influences it. This factor is called minimum
coverage.

Before we define it formally, let's have a look at an example.

Minimum coverage (2)

Example 10. Let w = 13 and k = 3. Then both shapes ### and ##.# have a
threshold of two. If two strings have four consecutive matching characters, then they
have two common 3-grams of shape ###. In contrast, in order to have two common 3-grams
of shape ##.#, two strings need at least 5 matching characters.

This means that the gapped 3-gram would have a lower count of common q-grams
on strings that have only four consecutively matching characters, although it has the
same threshold.

Minimum coverage (3)

Definition 11. Let Q be a shape and t be a non-negative integer. The minimum
coverage of Q for threshold t is:

$$c(Q, t) = \min_{C \subset \mathbb{N},\ |C| = t} \Bigl| \bigcup_{i \in C} Q_i \Bigr|.$$

Hence the minimum coverage is the minimum number of characters that need to
match between a pattern and a text substring for there to be t matching Q-grams.

Whenever possible, gapped Quasar chooses the shape with the highest minimum coverage,
since this makes it less likely that a random string matches t Q-grams. This improves
the filter efficiency.
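
For small t the minimum coverage can be computed by brute force, as in the following Python sketch. The name `minimum_coverage` is hypothetical. The search fixes the first position at 0 and only considers positions up to (t − 1)(s(Q) − 1); this restriction is an assumption of the sketch, based on the observation that moving a positioned shape closer to the others never increases the size of the union.

```python
from itertools import combinations
from typing import List


def minimum_coverage(shape: List[int], t: int) -> int:
    """Smallest number of characters covered by t distinct positioned shapes."""
    if t == 0:
        return 0
    span = max(shape) + 1
    limit = (t - 1) * (span - 1)                 # assumed bound on the positions
    best = None
    for rest in combinations(range(1, limit + 1), t - 1):
        covered = {i + j for i in (0, *rest) for j in shape}
        if best is None or len(covered) < best:
            best = len(covered)
    return best


print(minimum_coverage([0, 1, 2], 2))   # 4  (shape ###)
print(minimum_coverage([0, 1, 3], 2))   # 5  (shape ##.#)
```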

Minimum coverage (4)

Computational experiments indicate that there is a strong correlation between the
minimum coverage $c(Q, t(Q, w, k))$ and the filter efficiency.

[Figure: correlation between expected and actual number of potential matches.]

Best shapes (5)

The following table shows different shapes for k = 5. The column best shows the
shape with the highest minimum coverage (ties are broken using the threshold).
The column median shows the median shape ordered by minimum coverage. If one
chooses a random shape, the chance is 50% to be better (or worse) than this one.
The last column shows the best one-gapped shape. (The details of the tie breaking
used here can be read in the paper.)

q    best                            median             1-gapped
6    ##......#..#..#.#               #.###.....#.#      #####....#
7    #.##......##..#.#               ##..#..#..##       #####...##
8    ##..#.#...............##..#.#   #.#..####...#.#    ######...##
9    ###..#..#.#...#.##              ######..#.#.#      #######..##
10   ###..#..#.#..###.#              ##.##..#.#.###.#   #######

Index structure
It is not necessary to use a suffix array for ungapped q-grams, and it is not possible
anymore to use a suffix array for the gapped Q-grams. Instead, the database is
scanned twice. The first time the number of occurrences of all Q-grams is counted.

In the second scan, the positions at which a Q-gram starts are recorded in an array
of size n. During that scan, the index points to the start of the respective list.

The details are left as an exercise.
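
One possible way to organize the two scans is sketched below in Python. It is only one reading of the description, not the construction from the paper: dictionaries keyed by the Q-gram string replace a table indexed by Q-gram rank, and the name `build_two_pass_index` is hypothetical. Positions are 0-based.

```python
from typing import Dict, List, Tuple


def build_two_pass_index(d: str, shape: List[int]) -> Tuple[Dict[str, int], Dict[str, int], List[int]]:
    """Return (start, counts, positions): hit list of g is positions[start[g]:start[g]+counts[g]]."""
    span = max(shape) + 1
    n_pos = len(d) - span + 1

    def gram(i: int) -> str:
        return "".join(d[i + j] for j in shape)

    counts: Dict[str, int] = {}
    for i in range(n_pos):                       # first scan: count occurrences
        g = gram(i)
        counts[g] = counts.get(g, 0) + 1

    start: Dict[str, int] = {}
    total = 0
    for g in sorted(counts):                     # prefix sums give the list starts
        start[g] = total
        total += counts[g]

    positions = [0] * total
    fill = dict(start)                           # next free slot of each list
    for i in range(n_pos):                       # second scan: record positions
        g = gram(i)
        positions[fill[g]] = i
        fill[g] += 1
    return start, counts, positions


start, counts, positions = build_two_pass_index("ACACAC", [0, 1])
print(start, positions)
```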

Extension to Levenshtein distance
Note that the q-gram method presented so far can only be used to find local
approximate matches with the Hamming distance.

The q-gram method can be generalized to the Levenshtein distance. Burkhardt and
Kärkkäinen have described an extension that uses ‘one-gapped q-grams’.

The idea is to model insertions and deletions by additional Q-grams. For example,
with the basic shape ##.# applied to the text, we would use ##.#, ##..#, and ### for
the pattern.

The filter then compares all three shapes in the pattern to the q-grams of the basic
shape in the text. Thus matching q-grams are found even in the presence of indels.

Otherwise the algorithm stays essentially unchanged, except that the threshold
computation is slightly different.

Summary

• Filtering based on q-grams using a suffix array with an index is an efficient
filtering method.

• In the gapless case, filtering efficiencies of ≈ 0.2% were observed for genomic
sequences.

• Gapped Q-grams improve the filtering efficiency further (by orders of magnitude).

• The threshold t and the minimum coverage both influence the filter efficiency.

• No closed formula is known for computing t for gapped Q-grams.

Q-gram filters for ε-matches
This exposition was developed by Clemens Gröpl. It is based on:

• Kim R. Rasmussen, Jens Stoye, Eugene W. Myers: Efficient q-Gram Filters for
Finding All ε-Matches over a Given Length, Journal of Computational Biology,
Volume 13, Number 2, 2006, pages 296–308. (Originally presented at GCB
2004 and RECOMB 2005.) [RSM06]

Motivation
Comparison of large genomic sequences can be sped up a lot if filtering
techniques are applied. The key observation is that a local alignment of high sequence
similarity must contain at least a few short exact matches.

The idea of using q-grams for fast filtering is not new. A q-gram is a substring of
length q. Programs like BLAST use q-grams which occur in both sequences as
seeds for a local alignment search.

It has also been observed that combining the idea of seeds with a combinatorial
argument based on some form of the pigeonhole principle can be used to
discard large parts of the input sequences from further consideration, because they
cannot contain a good local alignment.

Motivation (2)

We can distinguish three kinds of algorithms.

When applied for finding highly similar regions, the classical exact algorithms (e.g.
Smith-Waterman) will spend most of the time verifying that there is no match
between a given pair of regions. The running times (typically the product of sequence
lengths) are infeasible for genome size sequences.

Heuristics like BLAST typically employ a q-gram index to locate seeds and perform a
verification for the candidate regions located in this way. However, BLAST might fail
to recognize an existing match, unless the filtering parameters are set very stringently.
Thus one has to trade off sensitivity against speed.

A filter is an algorithm that allows us to discard large parts of the input, but is
guaranteed not to lose any significant match. The trade-off to be considered for filtering
algorithms is thus only whether the additional effort is paid off by the saving of time
spent for verifications.

Motivation (3)

In this lecture, we will consider the problem of finding matches of low error rate ε
and a given minimum length n0.

The cost measure will be the edit distance (Levenshtein distance). That is, the dis-
tance between two strings is the number of insertions, deletions, and substitutions
needed to transform one into the other.

The SWIFT algorithm is an improvement of the QUASAR algorithm by Burkhardt
et al. Note, however, that QUASAR uses an absolute error threshold rather than
an error rate. Using an error rate is more appropriate since the length of a local
alignment is not known in advance.

The filter has been successfully applied for the fragment overlap computation in
sequence assembly and for BLAST-like searching in EST sequences.

Definitions
As usual, let A and B denote strings over a finite alphabet Σ, let |A| be the length of
A, let A[i] be the i-th letter of A, and let A[p..q] be the substring starting at position
p and ending with position q of A, thus A[i..i] consists of the letter A[i]. A substring
of length q > 0 of A is a q-gram of A.

The (unit cost) edit distance between strings A and B is the minimum number of
edit operations (insertion, deletion, substitution) in an alignment of A and B. It is
denoted by dist(A, B).

The edit distance can be computed by the well-known Needleman-Wunsch
algorithm. It computes in O(|A||B|) time an edit matrix E(i, j) := dist(A[1..i], B[1..j]). The
letter A[i] corresponds to the step from row i − 1 to i, so it is natural to visualize the
letters between the rows and columns of the edit matrix.

An ε-match is a local alignment for substrings (α, β) with an error rate of at most ε.
That is, dist(α, β) ≤ ε|β|. (Note the ‘asymmetry’ in the definition of error rate.)

Definitions (2)

The problem can now be formally stated as follows:

Given a target string A and a query string B, a minimum match length $n_0$
and a maximum error rate ε > 0;
Find all ε-matches (α, β) where α and β are substrings of A and B,
respectively, such that

1. $|\beta| \ge n_0$ and

2. $\mathrm{dist}(\alpha, \beta) \le \lfloor \varepsilon |\beta| \rfloor$.

q-gram filters for ε-matches
A q-hit is a pair (i, j) of indices such that A[i..i + q − 1] = B[j..j + q − 1].

The basic idea of the q-gram method is as follows:

1. Find (enumerate) all q-hits between the query and the target strings.

2. Identify regions (in the Cartesian product of the strings) that have “enough” hits.

3. Such candidate regions are then subjected to a closer examination.

The concrete methods differ in the shape and the size of the regions.
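
Step 1 can be sketched in a few lines of Python. The names `q_hits` and `by_diagonal` are hypothetical, a dictionary of the q-grams of A stands in for a q-gram index, and indices are 1-based as in the definition of a q-hit. Grouping the hits by diagonal j − i is a natural first step toward counting hits inside parallelograms.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def q_hits(a: str, b: str, q: int) -> List[Tuple[int, int]]:
    """All pairs (i, j), 1-based, with A[i..i+q-1] = B[j..j+q-1]."""
    index = defaultdict(list)                    # q-gram of A -> start positions
    for i in range(len(a) - q + 1):
        index[a[i:i + q]].append(i + 1)
    return sorted((i, j + 1)
                  for j in range(len(b) - q + 1)
                  for i in index.get(b[j:j + q], ()))


def by_diagonal(hits: List[Tuple[int, int]]) -> Dict[int, List[Tuple[int, int]]]:
    """Group q-hits by their diagonal j - i."""
    groups: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for i, j in hits:
        groups[j - i].append((i, j))
    return dict(groups)


hits = q_hits("ACGTAC", "GTAC", 2)
print(hits)                # [(1, 3), (3, 1), (4, 2), (5, 3)]
print(by_diagonal(hits))
```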

q-gram filters for ε-matches (2)

The following lemma relates ε-matches (α, β) to parallelograms of the edit matrix.
For a moment, we assume that the length of β is known, so that we can work with
an absolute bound on the distance.

An n × e parallelogram of the edit matrix consists of entries from n + 1 consecutive
rows and e + 1 consecutive diagonals.

Lemma 1. Let α and β be substrings of A and B, respectively, and assume that
|β| = n and dist(α, β) ≤ e. Then there exists an n × e parallelogram P such that

1. P contains at least T (n, q, e) := (n + 1) − q(e + 1) q-hits,

2. the B-projection of the parallelogram is pB (P) = β,

3. the A-projection pA(P) of the parallelogram is contained in α.

The A- and B-projections are defined as illustrated below.

q-gram filters for ε-matches (3)

The A-projection $p_A(P)$ of a parallelogram P is defined as the substring of A between
the last column of the first row of P and the first column of the last row of P.

The B-projection $p_B(P)$ of a parallelogram P is defined as the substring of B
between the first and the last row of P.

q-gram filters for ε-matches (4)

(Note: these figures are taken from the RECOMB and GCB version, which uses the
transposed matrix of the JCB article.)

6009
q-gram filters for ε-matches (5)

Clearly, a q-hit (i, j) corresponds to q + 1 consecutive entries of the edit matrix along
the diagonal j − i. A q-hit is contained in a parallelogram if its corresponding matrix
entries are.

The proof of Lemma 1 is straightforward: Consider the path of an optimal alignment
of α and β. At each row except for the last q ones, a q-hit starts unless there
is an edit operation among the next q edges. Each edit operation can ‘destroy’ at
most q q-hits, so at least (n + 1 − q) − qe = (n + 1) − q(e + 1) q-hits remain.
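
The counting argument can be checked on a small example. The sketch below assumes the simplest case, an alignment with substitutions only, so that the optimal path stays on the main diagonal; the strings and the helper names `threshold_T` and `diagonal_q_hits` are made up for illustration.

```python
def threshold_T(n, q, e):
    """T(n, q, e) = (n + 1) - q(e + 1), the bound from Lemma 1."""
    return (n + 1) - q * (e + 1)

def diagonal_q_hits(alpha, beta, q):
    """Count q-hits (i, i) on the main diagonal of two equal-length strings."""
    assert len(alpha) == len(beta)
    return sum(alpha[i:i + q] == beta[i:i + q]
               for i in range(len(beta) - q + 1))

beta = "ACGTACGTACGT"   # n = 12
alpha = "ACGTAGGTACGT"  # one substitution at 0-based position 5, so e = 1

# The single substitution destroys the three 3-grams that cover position 5.
print(diagonal_q_hits(alpha, beta, 3), threshold_T(12, 3, 1))
```

In this example the bound is tight: 10 diagonal 3-grams minus the 3 destroyed ones leaves exactly T(12, 3, 1) = 7 hits.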

So the case where |β| is fixed was easy. Next we consider ε-matches for |β| ≥ n0.
The following lemma is the combinatorial foundation of the SWIFT algorithm.

6010
q-gram filters for ε-matches (6)

Lemma 2. Let α and β be substrings of A and B, respectively, and assume that
|β| ≥ n0 and dist(α, β) ≤ ε|β|. Let U(n, q, ε) := T(n, q, ⌊εn⌋) = (n + 1) − q(⌊εn⌋ + 1)
and assume that the q-gram size q and the threshold τ have been chosen such that

q < ⌈1/ε⌉ and τ ≤ min{ U(n0, q, ε), U(n1, q, ε) },

where n1 := ⌈(⌊εn0⌋ + 1)/ε⌉.

Then there exists a w × e parallelogram P such that:

1. P contains at least τ q-hits whose projections intersect α and β,

2. w = (τ − 1) + q(e + 1),

3. e = ⌊(2τ + q − 3) / (1/ε − q)⌋,

4. if |β| ≤ w, then pB(P) contains β, otherwise β contains pB(P).
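
The formulas of Lemma 2 can be evaluated mechanically. The following sketch makes three assumptions that are not part of the lemma: it always picks the largest admissible threshold τ, it rounds n1 up to an integer, and it uses exact rational arithmetic so that the floors and ceilings are not affected by floating-point error. The function name `swift_parameters` is hypothetical.

```python
import math
from fractions import Fraction

def U(n, q, eps):
    """U(n, q, eps) = (n + 1) - q * (floor(eps * n) + 1)."""
    return (n + 1) - q * (math.floor(eps * n) + 1)

def swift_parameters(n0, eps, q):
    """Return (tau, w, e) for minimum length n0, error rate eps, q-gram size q."""
    eps = Fraction(eps)  # pass eps as a string such as "0.05" or as a Fraction
    if not q < math.ceil(1 / eps):
        raise ValueError("Lemma 2 requires q < ceil(1/eps)")
    # Assumption: n1 is rounded up, i.e. it is the smallest length at which
    # one more error is allowed than at length n0.
    n1 = math.ceil((math.floor(eps * n0) + 1) / eps)
    # Assumption: choose the largest threshold the lemma permits.
    tau = min(U(n0, q, eps), U(n1, q, eps))
    if tau < 1:
        raise ValueError("no positive threshold for these parameters")
    e = math.floor((2 * tau + q - 3) / (1 / eps - q))
    w = (tau - 1) + q * (e + 1)
    return tau, w, e

print(swift_parameters(50, "0.05", 11))
```

For n0 = 50, ε = 0.05 and q = 11 this gives U(50) = 18 and U(60) = 17, hence τ = 17, e = ⌊42/9⌋ = 4 and w = 16 + 55 = 71.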

6011
q-gram filters for ε-matches (7)

The purpose of Lemma 2 is as follows. Given parameters ε and n0, we can choose
suitable values for q, τ , w, and e using Lemma 2. Then we enumerate all parallel-
ograms P with enough hits according to these parameters. All relevant ε-matches
can be found in these regions.

6012
q-gram filters for ε-matches (8)

Proof of Lemma 2. The lemma is proven in three steps:

1. Assuming there is an ε-match (α, β) of length |β| = n ≥ n0, show that there are
at least τ q-hits in the surrounding n × ⌊εn⌋ parallelogram.

2. Argue that there is a w × e parallelogram that contains at least τ q-hits, where
w and e do not depend on n ≥ n0.

3. Determine the dimensions w and e of such a parallelogram.

6013
q-gram filters for ε-matches (9)

. . . details omitted . . .

6014
Algorithm
The SWIFT algorithm relies on the q-gram filter for ε-matches of length n0 or greater.
Using the parameters obtained from Lemma 2, it searches for all w × e parallelograms
which contain a sufficient number of q-hits.

6016
Algorithm (2)

In the preprocessing step, we construct a q-gram index for the target sequence A.
The index consists of two tables:
1. The occurrence table is a concatenation of the lists L(G) := { i | A[i..i + q − 1] =
G } for all q-grams G ∈ Σ^q in A.
2. The lookup table is an array indexed by the natural encoding of G to base |Σ|,
giving the start of each list in the occurrence table.
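
A minimal sketch of such a two-table index follows. It assumes a DNA alphabet, 0-based positions, and one extra sentinel entry at the end of the lookup table so that the end of each list is available as well; characters outside the alphabet are not handled. All names are hypothetical.

```python
def qgram_code(G, rank):
    """Natural encoding of the q-gram G to base |alphabet|."""
    code = 0
    for c in G:
        code = code * len(rank) + rank[c]
    return code

def build_qgram_index(A, q, alphabet="ACGT"):
    """Return (lookup, occurrence, rank) for the target string A."""
    rank = {c: r for r, c in enumerate(alphabet)}
    num_grams = max(len(A) - q + 1, 0)
    # Count each q-gram, shifted by one so that prefix sums give list starts.
    lookup = [0] * (len(alphabet) ** q + 1)
    for i in range(num_grams):
        lookup[qgram_code(A[i:i + q], rank) + 1] += 1
    for g in range(1, len(lookup)):
        lookup[g] += lookup[g - 1]
    # Fill the occurrence table list by list (counting sort by q-gram code).
    occurrence = [0] * num_grams
    next_free = lookup[:-1]  # slicing copies, so lookup itself stays intact
    for i in range(num_grams):
        g = qgram_code(A[i:i + q], rank)
        occurrence[next_free[g]] = i
        next_free[g] += 1
    return lookup, occurrence, rank

def occurrences(G, lookup, occurrence, rank):
    """The list L(G): all start positions of G in the indexed string."""
    g = qgram_code(G, rank)
    return occurrence[lookup[g]:lookup[g + 1]]

lookup, occurrence, rank = build_qgram_index("ACGTACGT", 2)
print(occurrence, occurrences("AC", lookup, occurrence, rank))
```

The lookup table has |Σ|^q + 1 entries regardless of the length of A, which is why q is kept small in practice.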

6017
Algorithm (3)

Once the q-gram index is built, the w × e parallelograms containing τ or more q-hits
can be found using a simple sliding window algorithm.

The idea is to split the (fictitious) edit matrix into overlapping bins of e + 1 diagonals.
For each bin we count the number of q-hits in the w × e parallelogram that is the
intersection of the diagonals of the corresponding bin and the rows of the sliding
window Wj := B[j..j + w − 1].

As the sliding window proceeds to Wj+1, the bin counters are updated to reflect the
changes due to the q-grams leaving and entering the window.

Whenever a bin counter reaches τ , the corresponding parallelogram is reported.

Overlapping parallelograms can be merged on the fly.

The space requirement for the bins is reduced by searching for somewhat larger
parallelograms of size w × (e + ∆). Then each bin counts for e + ∆ + 1 diagonals, and
successive bins overlap by e diagonals. While this will lead to more verifications, it
reduces the number of bins which have to be maintained. In practice, ∆ is set to a
power of 2, and bin indices are computed with fast bit-operations.
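
The sliding-window counting can be illustrated with a deliberately simple sketch. It assumes the basic scheme without the ∆ enlargement: one bin per starting diagonal d, covering the diagonals d, ..., d + e, with the diagonal of a q-hit (i, j) taken as j − i. It also assumes that the window spans w rows of B, uses a dictionary instead of a real q-gram index, and neither merges overlapping parallelograms nor uses bit operations. The function name is hypothetical.

```python
from collections import defaultdict

def candidate_parallelograms(A, B, q, w, e, tau):
    """Return sorted (d, j_start) pairs: bins d whose counter reached tau,
    together with the first row of B of the window in which this happened."""
    index = defaultdict(list)
    for i in range(len(A) - q + 1):
        index[A[i:i + q]].append(i)
    span = w - q + 1  # q-gram start positions of B that fit into one window
    counts = defaultdict(int)
    reported = set()
    for j in range(len(B) - q + 1):
        # The q-gram starting at j - span leaves the window.
        old = j - span
        if old >= 0:
            for i in index.get(B[old:old + q], ()):
                x = old - i
                for d in range(x - e, x + 1):
                    counts[d] -= 1
        # The q-gram starting at j enters the window.
        for i in index.get(B[j:j + q], ()):
            x = j - i  # diagonal of the q-hit (i, j)
            for d in range(x - e, x + 1):  # every bin that covers diagonal x
                counts[d] += 1
                if counts[d] >= tau:
                    reported.add((d, max(j - span + 1, 0)))
    return sorted(reported)

# The query occurs exactly in the target at offset 3, i.e. on diagonal -3.
print(candidate_parallelograms("GGGACGTACGTTTT", "ACGTACGT", 3, 8, 0, 6))
```

Every q-hit updates e + 1 counters here, which is the cost that the larger bins of e + ∆ + 1 diagonals are meant to reduce.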
6018
Algorithm (10)

Each ‘candidate’ parallelogram must be checked for the presence of an ε-match.

This can be done trivially by dynamic programming. Alternatively, one can use the
knowledge about the q-grams in the ε-match to construct an alignment by sparse
dynamic programming.
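
The plain dynamic-programming check might look as follows. This sketch only decides whether a candidate region contains any ε-match. It assumes that A and B are the two projections of the candidate region, runs a semi-global DP (free start and end in A) from every start position in B, and makes no attempt at the sparse variant. The function name is hypothetical.

```python
import math
from fractions import Fraction

def has_eps_match(A, B, n0, eps):
    """True if some substrings alpha of A and beta of B satisfy
    |beta| >= n0 and dist(alpha, beta) <= floor(eps * |beta|)."""
    eps = Fraction(eps)  # pass eps as a string such as "0.125" or as a Fraction
    for s in range(len(B)):
        # row[c]: minimum edit distance between beta = B[s:s+k] and any
        # substring of A that ends at column c. For k = 0 this is 0 everywhere.
        row = [0] * (len(A) + 1)
        for k in range(1, len(B) - s + 1):
            ch = B[s + k - 1]
            new = [k] + [0] * len(A)
            for c in range(1, len(A) + 1):
                new[c] = min(row[c - 1] + (A[c - 1] != ch),  # match or substitution
                             row[c] + 1,                     # ch aligned to a gap
                             new[c - 1] + 1)                 # A[c-1] aligned to a gap
            row = new
            if k >= n0 and min(row) <= math.floor(eps * k):
                return True
    return False

print(has_eps_match("ACGTTCGT", "ACGTACGT", 8, "0.125"))
```

The running time is O(|B|² · |A|) per region, which is acceptable only because the filter is expected to leave few and small candidate regions.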

6025
