Approximate Matching

The document summarizes approximate string matching techniques. It discusses allowing mismatches or edits between strings when doing string matching. It introduces concepts like Hamming distance and edit distance to quantify the differences between strings. It then describes how to adapt exact string matching algorithms like naive, Boyer-Moore, and index-assisted techniques to do approximate matching by splitting the pattern string into substrings and requiring at least one substring to match exactly. The document provides Python pseudocode for an approximate Boyer-Moore algorithm using this approach.

Approximate matching

Ben Langmead

Department of Computer Science

You are free to use these slides. If you do, please sign the
guestbook (www.langmead-lab.org/teaching-materials), or email
me ([email protected]) and tell me briefly how you're
using them. For original Keynote files, email me.

Read alignment requires approximate matching

Read
CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC

Reference
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT
CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC
GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT
ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA
ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA
AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA
ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC
TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT
CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA
CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA
GCAATACACTGACCCGCTCAAACTCCTGGATTTTGGATCCACCCAGCGCCTTGGCCTAAA
CTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGT
TCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTC
AAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA
ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGC
GGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCC
TCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGAC
TACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGA
TACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAA
CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGG
AGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA
CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAG
ACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAG
AAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAG
AGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTC
AAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGT
CGTAACCTCAAACTCCTGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAAG
AAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA
GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAA
AGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATG
AAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAA
TTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCT

Sequence differences occur because of...
1. Sequencing error
2. Natural variation

Approximate string matching

Looking for places where a pattern P matches a text T with up to a certain
number of mismatches or edits. Each such place is an approximate match.

A mismatch is a single-character substitution:

T: GGAAAAAGAGGTAGCGGCGTTTAACAGTAG
             ||| |||||
P:           GTAACGGCG

An edit is a single-character substitution or gap (insertion or deletion):

Substitution:
T: GGAAAAAGAGGTAGCGGCGTTTAACAGTAG
             ||| |||||
P:           GTAACGGCG

Gap in T:
T: GGAAAAAGAGGTAGC-GCGTTTAACAGTAG
             ||||| |||
P:           GTAGCGGCG

Gap in P:
T: GGAAAAAGAGGTAGCGGCGTTTAACAGTAG
             || ||||||
P:           GT-GCGGCG

Hamming and edit distance

For two same-length strings X and Y, Hamming distance is the minimum
number of single-character substitutions needed to turn one into the other:

X: GAGGTAGCGGCGTTTAAC
   | |||| ||| |||||||
Y: GTGGTAACGGGGTTTAAC

Hamming distance = 3
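
The definition above translates almost directly into code. The following is a minimal sketch, not code from the slides; the function name `hamming_distance` is chosen here for illustration, and the two strings are the X and Y from the example.

```python
def hamming_distance(x, y):
    """Number of positions at which same-length strings x and y differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming_distance("GAGGTAGCGGCGTTTAAC", "GTGGTAACGGGGTTTAAC"))  # 3
```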

Edit distance (Levenshtein distance): minimum number of edits required to
turn one into the other:

X: TGGCCGCGCAAAAACAGC
   || |||||||||| ||||
Y: TGACCGCGCAAAA-CAGC

Edit distance = 2

X: GCGTATGCGGCTA-ACGC
   || |||||||||| ||||
Y: GC-TATGCGGCTATACGC

Edit distance = 2
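
The slides define edit distance but do not show how to compute it. One standard approach is dynamic programming over prefixes; the sketch below is an illustration under that assumption rather than the author's code, and `edit_distance` is a name chosen here. The test strings are the first example pair as read from the slide, with the gap character removed from Y.

```python
def edit_distance(x, y):
    """Levenshtein distance between x and y via dynamic programming."""
    # prev[j] = distance between the previous prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i in range(1, len(x) + 1):
        cur = [i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = prev[j - 1] + (x[i - 1] != y[j - 1])      # match or substitution
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)  # or a gap in either string
        prev = cur
    return prev[len(y)]

print(edit_distance("TGGCCGCGCAAAAACAGC", "TGACCGCGCAAAACAGC"))  # 2
```

Only two rows of the table are kept at a time, so memory is proportional to the length of `y`.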

Approximate string matching

Adapting the naive algorithm to do approximate string matching within
configurable Hamming distance:

def naiveApproximate(p, t, maxHammingDistance=1):
    occurrences = []
    for i in range(0, len(t) - len(p) + 1):  # for all alignments
        nmm = 0
        for j in range(0, len(p)):  # for all characters
            if t[i+j] != p[j]:  # does it match?
                nmm += 1  # mismatch
                if nmm > maxHammingDistance:
                    break  # exceeded maximum distance
        if nmm <= maxHammingDistance:
            # approximate match; return pair where first element is the
            # offset of the match and second is the Hamming distance
            occurrences.append((i, nmm))
    return occurrences

Instead of stopping upon first mismatch, stop when maximum distance is exceeded

Python example: http://bit.ly/CG_NaiveApprox

Approximate string matching

How to make Boyer-Moore and index-assisted exact matching approximate?
Helpful fact: Split P into non-empty non-overlapping substrings u and v. If P
occurs in T with 1 edit, either u or v must match exactly.

[Figure: P split into u and v. Either the edit goes in u, or it goes in v. It can't go anywhere else!]

More generally: Let p1, p2, ..., pk+1 be a partitioning of P into k+1
non-overlapping non-empty substrings. If P occurs in T with up to k edits, then
at least one of p1, p2, ..., pk+1 must match exactly.

[Figure: P divided into p1, ..., pk+1]

k edits can affect as many as k of these, but not all
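
To make the pigeonhole argument concrete, here is a small sketch that splits a pattern into k + 1 pieces and checks which pieces survive one substitution. The helper `partition` and its near-equal piece lengths are assumptions made for illustration; the slides do not say how the pieces are sized. P and the matching stretch of T are taken from the earlier mismatch example.

```python
def partition(p, pieces):
    """Split p into `pieces` non-empty, non-overlapping substrings.
    Hypothetical helper: earlier pieces absorb any remainder."""
    if pieces < 1 or len(p) < pieces:
        raise ValueError("need 1 <= pieces <= len(p)")
    base, extra = divmod(len(p), pieces)
    parts, start = [], 0
    for idx in range(pieces):
        length = base + (1 if idx < extra else 0)
        parts.append(p[start:start + length])
        start += length
    return parts

p = "GTAACGGCG"
window = "GTAGCGGCG"     # stretch of T aligned to p, 1 mismatch (k = 1)
parts = partition(p, 2)  # k + 1 = 2 pieces
offsets = [sum(len(x) for x in parts[:i]) for i in range(len(parts))]
exact = [pi for pi, off in zip(parts, offsets)
         if window[off:off + len(pi)] == pi]
print(parts)  # ['GTAAC', 'GGCG']
print(exact)  # ['GGCG']
```

The single mismatch lands in the first piece, so the second piece still matches exactly, as the principle predicts.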

Approximate string matching

These rules provide a bridge between the exact-matching methods we've
studied so far and approximate string matching.

[Figure: P divided into p1, ..., pk+1]

k edits can overlap as many as k of these, but not all

Use an exact matching algorithm to find exact matches for p1, p2, ..., pk+1.
Look for a longer approximate match in the vicinity of the exact match.

[Figure: an exact match of p4 in T; the stretches of T to its left and right are checked against the rest of P]

Approximate string matching

def bmApproximate(p, t, k, alph="ACGT"):
    """ Use the pigeonhole principle together with Boyer-Moore to find
        approximate matches with up to a specified number of mismatches. """
    if len(p) < k+1:
        raise RuntimeError("Pattern too short (%d) for given k (%d)" % (len(p), k))
    ps = partition(p, k+1)  # split p into list of k+1 non-empty, non-overlapping substrings
    off = 0  # offset into p of current partition
    occurrences = set()  # note we might see the same occurrence >1 time
    for pi in ps:  # for each partition
        bm_prep = BMPreprocessing(pi, alph=alph)  # BM preprocess the partition
        for hit in bm_prep.match(t)[0]:
            if hit - off < 0: continue  # pattern falls off left end of T?
            if hit + len(p) - off > len(t): continue  # falls off right end?
            # Count mismatches to left and right of the matching partition
            nmm = 0
            for i in list(range(0, off)) + list(range(off+len(pi), len(p))):
                if t[hit-off+i] != p[i]:
                    nmm += 1
                    if nmm > k: break  # exceeded maximum # mismatches
            if nmm <= k:
                occurrences.add(hit-off)  # approximate match
        off += len(pi)  # Update offset of current partition
    return sorted(list(occurrences))

Full example: http://bit.ly/CG_BoyerMooreApprox
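
The function above relies on `partition` and `BMPreprocessing`, which are defined in the linked full example rather than on the slide. As a self-contained sketch of the same pigeonhole idea, the version below swaps in Python's built-in `str.find` as the exact matcher in place of Boyer-Moore. The names `partition`, `exact_occurrences` and `pigeonhole_approximate` are hypothetical stand-ins, not the author's implementations.

```python
def partition(p, pieces):
    """Hypothetical helper: split p into `pieces` contiguous, non-empty parts."""
    base, extra = divmod(len(p), pieces)
    parts, start = [], 0
    for idx in range(pieces):
        end = start + base + (1 if idx < extra else 0)
        parts.append(p[start:end])
        start = end
    return parts

def exact_occurrences(piece, t):
    """Stand-in exact matcher (str.find), playing the role of Boyer-Moore."""
    hits, i = [], t.find(piece)
    while i != -1:
        hits.append(i)
        i = t.find(piece, i + 1)
    return hits

def pigeonhole_approximate(p, t, k):
    """Offsets where p occurs in t with at most k mismatches."""
    if len(p) < k + 1:
        raise ValueError("pattern too short for given k")
    occurrences = set()  # the same offset may be found via several pieces
    off = 0              # offset into p of the current piece
    for piece in partition(p, k + 1):
        for hit in exact_occurrences(piece, t):
            start = hit - off
            if start < 0 or start + len(p) > len(t):
                continue  # candidate falls off either end of t
            # verify by counting mismatches over the whole candidate window
            nmm = sum(1 for a, b in zip(p, t[start:start + len(p)]) if a != b)
            if nmm <= k:
                occurrences.add(start)
        off += len(piece)
    return sorted(occurrences)

t = "GGAAAAAGAGGTAGCGGCGTTTAACAGTAG"
print(pigeonhole_approximate("GTAACGGCG", t, 1))  # [10]
```

For simplicity this sketch recounts mismatches across the entire window, including the piece already known to match; the slide's version skips that piece.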

Approximate Boyer-Moore performance

P: "tomorrow"
T: Shakespeare's complete works

Method                                      # character comparisons   Wall clock time   # matches
Boyer-Moore, exact                          786 K                     1.91 s            (missing)
Boyer-Moore, 1 mismatch with pigeonhole     3.05 M                    7.73 s            (missing)
Boyer-Moore, 2 mismatches with pigeonhole   6.98 M                    16.83 s           382

P: 50 nt string from Alu repeat*
T: Human reference (hg19) chromosome 1

Method                                      # character comparisons   Wall clock time   # matches
Boyer-Moore, exact                          32.5 M                    67.21 s           336
Boyer-Moore, 1 mismatch with pigeonhole     107 M                     209 s             1,045
Boyer-Moore, 2 mismatches with pigeonhole   171 M                     328 s             2,798

* GCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGG

Approximate string matching: more principles

Let p1, p2, ..., pk+1 be a partitioning of P into k+1 non-overlapping non-empty
substrings. If P occurs in T with up to k edits, then at least one of p1, p2, ..., pk+1
must match exactly.

[Figure: P divided into p1, ..., pk+1]

New principle:
Let p1, p2, ..., pj be a partitioning of P into j non-overlapping non-empty
substrings. If P occurs with up to k edits, then at least one of p1, p2, ..., pj must
occur with at most floor(k / j) edits.
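
The slides state the new principle without proof. One way to see it is a short counting argument, sketched below. The symbol e_i (the number of edits attributed to partition p_i, with each edit counted in exactly one partition) is introduced here for illustration and does not appear in the slides.

```latex
\text{Suppose every partition needs } e_i \ge \lfloor k/j \rfloor + 1 \text{ edits. Then}
\qquad
\sum_{i=1}^{j} e_i \;\ge\; j\left(\left\lfloor \tfrac{k}{j} \right\rfloor + 1\right) \;>\; j \cdot \frac{k}{j} \;=\; k ,
```

which contradicts the assumption that P occurs with at most k edits in total. For instance, with k = 5 and j = 3, floor(k / j) = 1: if all three partitions needed at least 2 edits, the total would be at least 6 > 5.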

Review: approximate matching principles

Non-overlapping substrings

General: Pigeonhole principle
p1, p2, ..., pj is a partitioning of P. If P occurs
with k edits, at least one partition
matches with at most floor(k / j) edits.

Let j = k + 1

Specific: Pigeonhole principle with j = k + 1
p1, p2, ..., pk+1 is a partitioning of P. If P occurs
in T with k edits, at least one partition
matches exactly.

Why j = k + 1?
  Smallest value s.t. floor(k / j) = 0
Why make floor(k / j) = 0?
  So we can use exact matching
Why is smaller j good?
  Yields fewer, longer partitions
Why are long partitions good?
  Makes exact-matching filter more
  specific, minimizing # candidates

Approximate string matching: more principles

We partitioned P into non-overlapping substrings p1, ..., pk+1

Consider overlapping substrings

[Figure: P covered by overlapping substrings P1, P2, P3, P4, P5, ..., Pn-l-1, Pn-l, Pn-l+1]

Approximate string matching: more principles

[Figure: P of length n with its n - q + 1 overlapping length-q substrings; an edit (x) falls inside several of them]

Say substrings are length q. There are n - q + 1 such substrings.
Worst case: 1 edit to P changes up to q substrings
Minimum # of length-q substrings unedited after k edits? n - q + 1 - kq
q-gram lemma: if P occurs in T with up to k edits, alignment
must contain t exact matches of length q, where t ≥ n - q + 1 - kq
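
The counting behind the q-gram lemma can be checked on a small case. The sketch below is an illustration only: `qgrams` and `qgram_lower_bound` are names chosen here, and comparing q-grams at equal offsets is valid only because the example edit is a substitution (no gaps). P and the matching stretch of T come from the earlier mismatch example.

```python
def qgrams(s, q):
    """All overlapping length-q substrings of s, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_lower_bound(n, q, k):
    """t from the q-gram lemma: n - q + 1 - kq (a value <= 0 means no guarantee)."""
    return n - q + 1 - k * q

p = "GTAACGGCG"        # n = 9
window = "GTAGCGGCG"   # stretch of T aligned to p, 1 substitution
q, k = 3, 1
# q-grams at the same offset that survive the edit (substitutions only)
shared = sum(1 for a, b in zip(qgrams(p, q), qgrams(window, q)) if a == b)
print(len(qgrams(p, q)), qgram_lower_bound(len(p), q, k), shared)  # 7 4 4
```

Here the one substitution sits in the middle of P and destroys the full q = 3 overlapping q-grams, so the bound of 4 is met exactly.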

Approximate string matching: more principles

If P occurs in T with up to k edits, alignment contains an exact match of
length q, where q ≤ floor(n / (k + 1))

Derived by solving this for q:

n - q + 1 - kq ≥ 1

Exact matching filter: find matches of length floor(n / (k + 1)) between
T and any substring of P. Check vicinity for full match.
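
Solving n - q + 1 - kq ≥ 1 gives n ≥ q(k + 1), hence q ≤ floor(n / (k + 1)). The filter described above might be sketched as follows. This is an assumption-laden illustration, not the author's code: it handles mismatches (substitutions) only, uses a plain dictionary of T's q-grams as the index, and the name `qgram_filter_matches` is hypothetical.

```python
from collections import defaultdict

def qgram_filter_matches(p, t, k):
    """Offsets where p occurs in t with at most k mismatches (substitutions only)."""
    n = len(p)
    q = n // (k + 1)  # largest q with n - q + 1 - k*q >= 1
    if q < 1:
        raise ValueError("pattern too short for given k")
    # index every length-q substring of t by its offset
    index = defaultdict(list)
    for i in range(len(t) - q + 1):
        index[t[i:i + q]].append(i)
    candidates = set()
    for off in range(n - q + 1):  # every length-q substring of p
        for hit in index.get(p[off:off + q], []):
            start = hit - off
            if 0 <= start <= len(t) - n:
                candidates.add(start)
    # check the vicinity: here, count mismatches over the whole window
    return sorted(s for s in candidates
                  if sum(a != b for a, b in zip(p, t[s:s + n])) <= k)

t = "GGAAAAAGAGGTAGCGGCGTTTAACAGTAG"
print(qgram_filter_matches("GTAACGGCG", t, 1))  # [10]
```

Supporting gaps as well would need an edit-distance check of the neighborhood around each candidate instead of the simple mismatch count.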

Approximate matching principles

Non-overlapping substrings

General: Pigeonhole principle
p1, p2, ..., pj is a partitioning of P. If P occurs
with k edits, at least one partition matches
with at most floor(k / j) edits.

Specific: Pigeonhole principle with j = k + 1
p1, p2, ..., pk+1 is a partitioning of P. If P occurs
in T with k edits, at least one partition
matches exactly.

Overlapping substrings

General: q-gram lemma
If P occurs with k edits, alignment
contains t exact matches of length q,
where t ≥ n - q + 1 - kq

Specific: q-gram lemma with t = 1
If P occurs with k edits, alignment
contains an exact match of length q
where q ≤ floor(n / (k + 1))

Sensitivity
Sensitivity = fraction of true approximate matches discovered by the algorithm
Lossless algorithm finds all of them, lossy algorithm doesn't necessarily
We've seen lossless algorithms. Most everyday tools are lossy. Lossy algorithms
are often much speedier & still acceptably sensitive (e.g. BLAST, BLAT, Bowtie).

Example lossy algorithm: pick q > floor(n / (k + 1))
