
String Matching Algorithms Overview

The document discusses the string matching problem, which involves finding occurrences of a pattern within a text, and highlights its applications in areas like DNA sequencing and internet search engines. It introduces various algorithms for string matching, focusing on the Naive and Rabin-Karp algorithms, detailing their procedures and complexities. The document also explains the concepts of valid and invalid shifts in string matching, along with definitions and notations relevant to the algorithms.

Uploaded by

dayalrattanani

CT-363

Design & Analysis of Algorithms


String Matching
String Matching Problem
• Text-editing programs frequently need to find all
occurrences of a pattern in the text.
• Typically, the text is a document being edited, and
the pattern searched for is a particular word
supplied by the user.
• Efficient algorithms for this problem, called
“string matching”, can greatly aid the
responsiveness of the text-editing program.
String Matching Problem
• In computer science, string-searching algorithms,
sometimes called string-matching algorithms, try to
find the places where one or several strings
(called patterns) occur within a larger string or
text.
Application- String Matching Algorithms
• Particular patterns in DNA Sequence.
• Internet search engines
String Matching Problem
• We assume that the text is an array T [1 .. n] of
length n and that the pattern is an array P[1 .. m]
of length m ≤ n.

• We further assume that the elements of P and T
are characters drawn from a finite alphabet Σ.
– For example, we may have Σ = {0, 1} or
Σ = {a, b, . . . , z}.

• The character arrays P and T are often called
strings of characters.
String Matching Problem
• We say that pattern P occurs with shift s in text T (or,
equivalently, that pattern P occurs beginning at position
s + 1 in text T) if
0 ≤ s ≤ n - m and T[s + 1 .. s + m] = P[1 .. m]
(i.e. T[s + j] = P[j] for 1 ≤ j ≤ m).
– If P occurs with shift s in T, we call s a valid shift;
– otherwise, we call s an invalid shift.
String Matching Problem
• The string-matching problem is “finding all valid shifts
with which a given pattern P occurs in a given text T”.
Example: String Matching Problem

Text T:    a b c a b a a b c a b a c
Pattern P:       a b a a    (s = 3)
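
The definition of a valid shift can be checked directly. The following is a minimal sketch in Python, assuming 0-based strings, so that the slice T[s:s+m] plays the role of T[s + 1 .. s + m]. The function name is_valid_shift is illustrative and does not come from the slides.

```python
def is_valid_shift(T, P, s):
    """Return True if pattern P occurs in text T with shift s."""
    n, m = len(T), len(P)
    # A shift is only meaningful when 0 <= s <= n - m.
    if not 0 <= s <= n - m:
        return False
    # Python slices are 0-based: T[s:s+m] corresponds to T[s+1 .. s+m].
    return T[s:s + m] == P


T = "abcabaabcabac"
P = "abaa"
valid = [s for s in range(len(T) - len(P) + 1) if is_valid_shift(T, P, s)]
print(valid)  # [3]
```

For the example text and pattern above, s = 3 is the only valid shift.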
Definitions and Notations

Notation   Terminology
Σ*         The set of all finite-length strings formed using characters from the alphabet Σ.
ε          The zero-length empty string, which also belongs to Σ*.
|x|        The length of a string x.
xy         The concatenation of two strings x and y, which has length |x| + |y| and consists of the characters from x followed by the characters from y.
w ⊏ x      A string w is a prefix of a string x if x = wy for some string y ∈ Σ*. If w ⊏ x, then |w| ≤ |x|.
w ⊐ x      A string w is a suffix of a string x if x = yw for some string y ∈ Σ*. If w ⊐ x, then |w| ≤ |x|.
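
The prefix and suffix relations map onto built-in string methods. This is a small sketch in Python; the helper names is_prefix and is_suffix and the sample strings are assumptions made for illustration.

```python
def is_prefix(w, x):
    """True if x = w + y for some (possibly empty) string y."""
    return x.startswith(w)


def is_suffix(w, x):
    """True if x = y + w for some (possibly empty) string y."""
    return x.endswith(w)


print(is_prefix("ab", "abcca"))   # True
print(is_suffix("cca", "abcca"))  # True
print(is_prefix("", "abcca"))     # True: the empty string is a prefix of every string
print(is_suffix("ab", "abcca"))   # False
```

In both relations the length of w can never exceed the length of x.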
1. Naive Approach
• The idea is based on Brute Force Approach.

• The naive algorithm finds all valid shifts using a loop that
checks the condition P[1 .. m] = T[s + 1 .. s + m] for each
of the n - m + 1 possible values of s.

• It can be interpreted graphically as sliding a
“template” containing the pattern over the text, noting for
which shifts all of the characters on the template equal
the corresponding characters in the text.
String Matching Algorithms
• There are many string-matching algorithms, including:
The naive string-matching algorithm
The Rabin-Karp algorithm
String matching with finite automata
The Knuth-Morris-Pratt algorithm

• We will discuss only two of them: the naive algorithm
and the Rabin-Karp algorithm.
Naive String Matching Algorithm
• The naive algorithm finds all valid shifts using a loop
that checks the condition P[1 .. m] = T[s + 1 .. s + m] for each
of the n - m + 1 possible values of s.

NAIVE-STRING-MATCHER(T, P)
n ← length[T]
m ← length[P]
for s ← 0 to n - m
    do if P[1 .. m] = T[s + 1 .. s + m]
        then print "Pattern occurs with shift" s
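
The pseudocode above can be sketched in Python as follows. This is an illustrative version, assuming 0-based strings and returning the list of valid shifts instead of printing them; the name naive_string_matcher is chosen here and is not part of the slides.

```python
def naive_string_matcher(T, P):
    """Return every shift s with T[s:s+m] == P, in increasing order."""
    n, m = len(T), len(P)
    shifts = []
    # Try each of the n - m + 1 candidate shifts (none if m > n).
    for s in range(n - m + 1):
        # The slice comparison plays the role of P[1..m] = T[s+1..s+m].
        if T[s:s + m] == P:
            shifts.append(s)
    return shifts


print(naive_string_matcher("acaabc", "aab"))  # [2]
```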
Example: Naive String Matching Algorithm
• Suppose
P = aab
T = acaabc
Find all valid shifts
Example: Naive String Matching Algorithm

n ← length[T] = 6
m ← length[P] = 3
for s ← 0 to n - m (6 - 3 = 3)

s = 0
T: a c a a b c
P: a a b
P[1] = T[s + 1]: P[1] = T[1] (as a = a)
P[2] = T[s + 2]: but P[2] ≠ T[2] (as a ≠ c)
Example: Naive String Matching Algorithm

for s ← 1
s = 1
T: a c a a b c
P:   a a b
P[1] = T[s + 1]: but P[1] ≠ T[2] (as a ≠ c)
Example: Naive String Matching Algorithm

for s ← 2
s = 2
T: a c a a b c
P:     a a b
P[1] = T[s + 1]: P[1] = T[3] (as a = a)
P[2] = T[s + 2]: P[2] = T[4] (as a = a)
P[3] = T[s + 3]: P[3] = T[5] (as b = b)
Example: Naive String Matching Algorithm

for s ← 3
s = 3
T: a c a a b c
P:       a a b
P[1] = T[s + 1]: P[1] = T[4] (as a = a)
P[2] = T[s + 2]: but P[2] ≠ T[5] (as a ≠ b)
Example: Naive String Matching Algorithm
• Result: the only valid shift is s = 2, i.e. P occurs in T
beginning at position 3.
Naive String Matching Algorithm
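• Worst-case running time
– Outer loop: n - m + 1 iterations
– Inner loop (character comparisons per shift): at most m
– Total: Θ((n - m + 1)m)
• Best case: Θ(n - m + 1), when the comparison at every shift
stops at the first character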
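
These bounds can be observed by counting character comparisons. The sketch below is a Python illustration that assumes the inner comparison stops at the first mismatch; the function count_comparisons and the test strings are made up for this example.

```python
def count_comparisons(T, P):
    """Count character comparisons made by the naive matcher with early exit."""
    n, m = len(T), len(P)
    comparisons = 0
    for s in range(n - m + 1):       # outer loop: n - m + 1 shifts
        for j in range(m):           # inner loop: at most m comparisons
            comparisons += 1
            if T[s + j] != P[j]:
                break                # mismatch: move on to the next shift
    return comparisons


print(count_comparisons("a" * 10, "a" * 3))  # 24 = (10 - 3 + 1) * 3, worst case
print(count_comparisons("b" * 10, "a" * 3))  # 8 = 10 - 3 + 1, best case
print(count_comparisons("acaabc", "aab"))    # 8, as in the worked example
```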
Note
• Not an optimal procedure for String Matching problem.
• It has high running time for worst case.
• The naive string-matcher is inefficient because
information gained about the text for one value of s is
entirely ignored in considering other values of s.
2. The Rabin-Karp Algorithm
• It compares the hash values of strings rather than the
strings themselves.
• It performs well in practice and generalizes to other
algorithms for related problems, such as two-dimensional
pattern matching.
2. The Rabin-Karp Algorithm
Special Case
• Given a text T[1 .. n] of length n and a pattern P[1 .. m] of
length m ≤ n, both as arrays.
• Assume that the elements of P and T are characters drawn
from a finite alphabet Σ.
• Here Σ = {0, 1, 2, . . . , 9}, so that each character is a
decimal digit.
• Our objective is “finding all valid shifts with which
a given pattern P occurs in a text T”.
Notations: The Rabin-Karp Algorithm
Let us suppose that
• p denotes the decimal value of a given pattern P[1 .. m]
• t_s denotes the decimal value of the length-m substring
T[s + 1 .. s + m] of the given text T[1 .. n], for s = 0, 1, ..., n - m.
• Clearly, t_s = p if and only if
T[s + 1 .. s + m] = P[1 .. m];
thus, s is a valid shift if and only if t_s = p.

• The question now is how to compute p and t_s efficiently.
• The answer is Horner’s rule.
Horner’s Rule
Example: Horner’s rule
[3, 4, 5] = 5 + 10(4 + 10(3)) = 5 + 10(4 + 30) = 5 + 340 = 345
p = P[3] + 10(P[3 - 1] + 10(P[1]))

Formula
• We can compute p in time Θ(m) using this rule as
p = P[m] + 10(P[m-1] + 10(P[m-2] + … + 10(P[2] + 10 P[1]) … ))

• Similarly, t_0 can be computed from T[1 .. m] in time Θ(m).

• To compute t_1, t_2, . . . , t_{n-m} in time Θ(n - m), it suffices to
observe that t_{s+1} can be computed from t_s in constant time.
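
Horner’s rule amounts to one multiply-and-add per digit. The following is a minimal Python sketch, assuming the digits are given as a list of integers with the high-order digit first; the name horner_value and the radix parameter d are illustrative.

```python
def horner_value(digits, d=10):
    """Evaluate a digit sequence (high-order digit first) in radix d."""
    value = 0
    for x in digits:
        # Shift the running value one position left, then add the next digit.
        value = d * value + x
    return value


print(horner_value([3, 4, 5]))        # 345
print(horner_value([3, 1, 4, 1, 5]))  # 31415
```

The loop runs once per digit, which gives the Θ(m) bound for computing p and the first window value.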
Computing t_{s+1} from t_s in constant time
• Text = [3, 1, 4, 1, 5, 2]; t_0 = 31415
• m = 5; shift s = 0
3 1 4 1 5 2
• Old high-order digit = 3
• New low-order digit = 2
• t_1 = 10(31415 - 10^4 · T[1]) + T[5 + 1]
= 10(31415 - 10^4 · 3) + 2
= 10(1415) + 2 = 14152

• t_{s+1} = 10(t_s - 10^{m-1} T[s + 1]) + T[s + m + 1]

• t_1 = 10(t_0 - 10^4 T[1]) + T[0 + 5 + 1]

• Now t_1, t_2, . . . , t_{n-m} can be computed in time Θ(n - m)


Procedure: Computing t_{s+1} from t_s
1. Subtract 10^{m-1} T[s + 1] from t_s; this removes the high-order digit.
2. Multiply the result by 10; this shifts the number left one position.
3. Add T[s + m + 1]; this brings in the appropriate low-order digit.
t_{s+1} = 10(t_s - 10^{m-1} T[s + 1]) + T[s + m + 1]
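
The three steps of the procedure can be sketched in Python as below. The sketch assumes a list of decimal digits with at least m entries and uses 0-based indexing; the names next_window_value and all_window_values are hypothetical helpers, not taken from the slides.

```python
def next_window_value(t_s, old_high_digit, new_low_digit, m, d=10):
    """Slide an m-digit window one position to the right in constant time."""
    # Step 1: remove the high-order digit.
    without_high = t_s - old_high_digit * d ** (m - 1)
    # Steps 2 and 3: shift left one position and bring in the new low-order digit.
    return d * without_high + new_low_digit


def all_window_values(digits, m, d=10):
    """Return the values of all length-m windows (assumes len(digits) >= m)."""
    t = 0
    for x in digits[:m]:             # first window via Horner's rule
        t = d * t + x
    values = [t]
    for s in range(len(digits) - m):
        t = next_window_value(t, digits[s], digits[s + m], m, d)
        values.append(t)
    return values


print(all_window_values([3, 1, 4, 1, 5, 2], 5))  # [31415, 14152]
```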

Another issue and its treatment


• The only difficulty with the above procedure is that p and
t_s may be too large to work with conveniently.

• Fortunately, there is a simple cure for this problem:
compute p and the t_s modulo a suitable modulus q.
Computing t_{s+1} from t_s Modulo q = 13

Text: 2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1

A window of length 5 (the digits 3 1 4 1 5, positions 7 to 11) is shaded.

The numerical value of the window = 31415

31415 mod 13 = 7
Spurious Hits and their Elimination
• m = 5
• p = 31415
• 31415 ≡ 7 (mod 13)
• 67399 ≡ 7 (mod 13)
• Window beginning at position 7: valid match; s = 6
• Window beginning at position 13: spurious hit; s = 12
• When the values agree modulo 13, an explicit text comparison
is still needed to tell a valid match from a spurious hit

Position:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Text T:     2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1
t_s mod 13 (s = 0, 1, ..., 14):   8 9 3 11 0 1 7 8 4 5 10 11 7 9 11
Valid match at s = 6; spurious hit at s = 12
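
The row of residues and the two hits can be reproduced with a short Python sketch. For clarity it recomputes each window value from scratch instead of updating it incrementally; the variable names are chosen for this illustration only.

```python
text = [2, 3, 5, 9, 0, 2, 3, 1, 4, 1, 5, 2, 6, 7, 3, 9, 9, 2, 1]
pattern = [3, 1, 4, 1, 5]
d, q = 10, 13
m = len(pattern)


def value(digits):
    """Decimal value of a digit list, computed with Horner's rule."""
    v = 0
    for x in digits:
        v = d * v + x
    return v


p = value(pattern) % q
residues = []
hits = []
for s in range(len(text) - m + 1):
    window = text[s:s + m]
    r = value(window) % q
    residues.append(r)
    if r == p:
        # Equal residues only suggest a match; compare the digits to confirm.
        kind = "valid match" if window == pattern else "spurious hit"
        hits.append((s, kind))

print(residues)  # [8, 9, 3, 11, 0, 1, 7, 8, 4, 5, 10, 11, 7, 9, 11]
print(hits)      # [(6, 'valid match'), (12, 'spurious hit')]
```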
2. The Rabin-Karp Algorithm
Generalization
• Given a text T[1 .. n] of length n and a pattern P[1 .. m] of
length m ≤ n, both as arrays.
• Assume that the elements of P and T are characters drawn
from a finite alphabet Σ = {0, 1, 2, . . . , d-1}.
• Our objective is “finding all valid shifts with which
a given pattern P occurs in a text T”.

Note
• t_{s+1} = (d(t_s - T[s + 1]h) + T[s + m + 1]) mod q
where h ≡ d^{m-1} (mod q) is the value of the digit “1” in the
high-order position of an m-digit text window.
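
The modular update can be checked on the earlier window 31415 followed by the digit 2, with d = 10, q = 13 and m = 5. This is a small Python sketch; the function name rolling_hash_update is an assumption made for illustration.

```python
def rolling_hash_update(t_s, old_high_digit, new_low_digit, h, d, q):
    """Compute t_{s+1} mod q from t_s mod q in constant time."""
    # Python's % always returns a value in [0, q), even if the left side is negative.
    return (d * (t_s - old_high_digit * h) + new_low_digit) % q


d, q, m = 10, 13, 5
h = pow(d, m - 1, q)   # 10^4 mod 13
t0 = 31415 % q         # residue of the first window
t1 = rolling_hash_update(t0, 3, 2, h, d, q)

print(h, t0, t1)       # 3 7 8
print(14152 % q)       # 8, the same residue computed directly
```

Working modulo q keeps every intermediate value small, at the cost of possible spurious hits.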
Sequence of Steps Designing Algorithm
1. Compute the lengths of pattern P and text T.
2. Compute p and t_0 modulo q using Horner’s rule.
3. Any shift s for which t_s ≡ p (mod q) must be tested
further to see whether s is really a valid shift or a spurious hit.
4. This testing can be done by checking the condition
P[1 .. m] = T[s + 1 .. s + m]. If these strings are equal, s
is a valid shift; otherwise it is a spurious hit.
5. If t_s ≡ p (mod q) is false, s is an invalid shift. In either case,
if s < n - m, compute t_{s+1} from t_s and repeat step 3 for the next shift.

Note
• Since t_s ≡ p (mod q) does not imply that t_s = p, a text
comparison is required to confirm a valid shift.
2. The Rabin-Karp Algorithm
RABIN-KARP-MATCHER(T, P, d, q)
n ← length[T]
m ← length[P]
h ← d^{m-1} mod q
p ← 0
t_0 ← 0
for i ← 1 to m                ▹ Preprocessing.
    do p ← (dp + P[i]) mod q
       t_0 ← (d t_0 + T[i]) mod q
for s ← 0 to n - m            ▹ Matching.
    do if p = t_s
          then if P[1 .. m] = T[s + 1 .. s + m]
                  then print "Pattern occurs with shift" s
       if s < n - m
          then t_{s+1} ← (d(t_s - T[s + 1]h) + T[s + m + 1]) mod q
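
A Python sketch of the matcher above follows. It makes several assumptions that are not in the slides: strings are 0-based, characters are mapped to numbers with ord, the defaults d = 256 and q = 101 are arbitrary illustrative choices, and an empty pattern returns no shifts.

```python
def rabin_karp_matcher(T, P, d=256, q=101):
    """Return all valid shifts of P in T using a rolling hash modulo q."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []                    # assumption: no shifts reported in these cases
    h = pow(d, m - 1, q)             # value of a "1" in the high-order position
    p = t = 0
    for i in range(m):               # preprocessing with Horner's rule
        p = (d * p + ord(P[i])) % q
        t = (d * t + ord(T[i])) % q
    shifts = []
    for s in range(n - m + 1):       # matching
        # Equal hashes may be a spurious hit, so confirm with a text comparison.
        if p == t and T[s:s + m] == P:
            shifts.append(s)
        if s < n - m:
            # Drop T[s], shift left, bring in T[s + m]; % keeps t in [0, q).
            t = (d * (t - ord(T[s]) * h) + ord(T[s + m])) % q
    return shifts


print(rabin_karp_matcher("2359023141526739921", "31415", d=10, q=13))  # [6]
print(rabin_karp_matcher("acaabc", "aab"))                             # [2]
```

Because every hash hit is verified against the text, the result does not depend on the choice of d and q; those values only affect how many spurious hits occur.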
Analysis: The Rabin-Karp Algorithm
• Worst-case running time
– Preprocessing time: Θ(m)
– Matching time: Θ((n - m + 1)m)

• If P = a^m and T = a^n, the verifications take time Θ((n - m + 1)m),
since each of the n - m + 1 possible shifts is valid.

• In applications with few valid shifts (say a constant number c),
the expected matching time of the algorithm is only
O((n - m + 1) + cm) = O(n + m), plus
the time required to process spurious hits.
Summary
Algorithm            Preprocessing Time   Matching Time
Naive                0                    O((n-m+1)m)
Rabin-Karp           Θ(m)                 O((n-m+1)m)
Finite Automaton     O(m|Σ|)              Θ(n)
Knuth-Morris-Pratt   Θ(m)                 Θ(n)
