Lectures 5-8

The document outlines a course on Computational Biophysics at IIT Kharagpur, focusing on algorithms and their applications in protein analysis, including sequence alignment and protein folding. It discusses various computational methods such as the longest common substring and subsequence problems, highlighting algorithms like Rabin-Karp and Knuth-Morris-Pratt. Additionally, it covers dynamic programming techniques for sequence alignment and provides examples of pairwise and multiple sequence alignment methods.

Uploaded by

azhagar_ss
Computational Biophysics@CSE, IITKGP Spring 2018

Computational Biophysics:
Algorithms to Applications
(CS61060)

Instructor: Pralay Mitra


Email: [email protected]

Lecture 05-06


Computational Methods in Proteins


[Diagram: levels of protein structure (primary, secondary, tertiary, quaternary) and the computational methods that connect them: sequence alignment, sequence to secondary structure, protein folding, protein design, protein docking.]

Aligning / Matching


Matching / Alignment
• Longest common substring problem
Let Σ be an alphabet, i.e. a finite set of symbols (for DNA, Σ = {A, C, G, T}). Search for a
pattern in a text, where both the pattern and the text are arrays of elements
of Σ.
– Example:
• Pattern: ins, india, iit, iit kharagpur
• String: indian institute of technology kharagpur
• Longest common subsequence problem
Let Σ be an alphabet, i.e. a finite set of symbols (for DNA, Σ = {A, C, G, T}). Find the longest
subsequence common to all sequences in a set of sequences (often just two
sequences) constructed over the alphabet Σ.
– Example:
• Pattern: ins, india, iit, iit kharagpur
• String: indian institute of technology kharagpur
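The difference between the two problems can be checked on the example patterns. The following is a minimal Python sketch; the helper names `is_substring` and `is_subsequence` are illustrative and not part of the lecture. A substring must occur as one contiguous block, while a subsequence only has to keep the left-to-right order.

```python
def is_substring(pat, txt):
    """True if pat occurs in txt as one contiguous block."""
    return pat in txt


def is_subsequence(pat, txt):
    """True if the characters of pat appear in txt in order, gaps allowed."""
    remaining = iter(txt)
    # 'ch in remaining' consumes the iterator up to the first hit,
    # so later characters can only match further to the right.
    return all(ch in remaining for ch in pat)


text = "indian institute of technology kharagpur"
for pattern in ("ins", "india", "iit", "iit kharagpur"):
    print(pattern, is_substring(pattern, text), is_subsequence(pattern, text))
```

Under these definitions, "ins" and "india" are substrings of the text, whereas "iit" and "iit kharagpur" occur only as subsequences.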

Complexity
• Time Complexity
• Space Complexity
N       N^2         N^3             N^4                 N^6                         N^3 (log2 N)^3
1       1           1               1                   1                           0
10      100         1,000           10,000              1,000,000                   36,658
100     10,000      1,000,000       100,000,000         1,000,000,000,000           293,265,294
1000    1,000,000   1,000,000,000   1,000,000,000,000   1,000,000,000,000,000,000   989,770,366,797

Our interest is the computational time, which depends on the
computational complexity as well as on the per-step
computation time.
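The growth figures in the table can be recomputed with a few lines of Python. This is only a sketch: the function name `growth_row` is made up for illustration, and the last column is assumed to be N^3 (log2 N)^3, a reading inferred from the tabulated values rather than stated on the slide.

```python
import math


def growth_row(n):
    """Return [N, N^2, N^3, N^4, N^6, N^3 * (log2 N)^3] for one value of N."""
    log_cubed = math.log2(n) ** 3  # assumption: base-2 logarithm, cubed
    return [n, n**2, n**3, n**4, n**6, round(n**3 * log_cubed)]


for n in (1, 10, 100, 1000):
    print(growth_row(n))
```

Multiplying any entry by the per-step computation time gives a rough estimate of the running time for that input size.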


Longest common substring problem


Algorithm                    Preprocessing Time   Searching Time                          Space Required
Naïve string search          None                 O(mn)                                   None
Rabin–Karp string search     O(m)                 O(n+m) on average, O((n-m)m) at worst   Constant
Knuth–Morris–Pratt           O(m)                 O(n)                                    O(m)

Length of a string: n
Length of the pattern: m
Cardinality of character set: k

Longest common substring problem


• Naïve string search algorithm

for each starting position in txt
    for every character in pat
        break the search if the txt character at the matching offset differs from the pat character.
    If all the characters of pat match, then output a match at this starting position.
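The pseudocode above can be written out as a short Python sketch. The function name `naive_search` and the choice to return a list of 0-based start positions are illustrative assumptions, not part of the slide.

```python
def naive_search(pat, txt):
    """Return every 0-based position where pat occurs in txt (O(mn) time)."""
    m, n = len(pat), len(txt)
    matches = []
    for start in range(n - m + 1):          # each candidate window in txt
        for offset in range(m):             # compare character by character
            if txt[start + offset] != pat[offset]:
                break                       # mismatch: abandon this window
        else:
            matches.append(start)           # inner loop finished: full match
    return matches


print(naive_search("ins", "indian institute of technology kharagpur"))
```

No preprocessing is done and no extra memory beyond the result list is needed, which matches the first row of the comparison table.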


Longest common substring problem


• Rabin–Karp string search algorithm
– Hash function

/* pat ← pattern; M ← length of pattern
   txt ← text; N ← length of text/string
   d ← number of characters in the alphabet (radix of the hash)
   q ← a prime number
*/
Implement for genomics data. What are the advantages?

Steps:
1. Precompute h ← pow(d, M-1) % q, the weight of the leading character of a window (repeat h = (h*d) % q for M-1 iterations).
2. Calculate the hash values of pat and of the first window of txt: p = (d*p + pat[i]) % q; t = (d*t + txt[i]) % q.
3. Slide the pattern over txt one position at a time.
   Compare the hash values of the current window of txt and of pat. Only if the hash values match, check the characters one by one.
   If all the characters of pat match, then output a match.
   Calculate the hash value for the next window of txt: remove the leading digit, add the trailing digit.
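The steps above can be sketched in Python as follows. The defaults d = 256 and q = 101 are assumptions chosen for illustration (for genomics data a radix of 4 with a mapping of A, C, G, T to 0..3 would be a natural alternative), and the function name `rabin_karp` is not from the lecture.

```python
def rabin_karp(pat, txt, d=256, q=101):
    """Return all 0-based positions of pat in txt using a rolling hash."""
    m, n = len(pat), len(txt)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                    # weight of the leading character
    p = t = 0
    for i in range(m):                      # hashes of pat and first window
        p = (d * p + ord(pat[i])) % q
        t = (d * t + ord(txt[i])) % q
    matches = []
    for s in range(n - m + 1):
        # Compare characters only when the hash values agree.
        if p == t and txt[s:s + m] == pat:
            matches.append(s)
        if s < n - m:
            # Roll the hash: drop txt[s], shift, append txt[s + m].
            t = (d * (t - ord(txt[s]) * h) + ord(txt[s + m])) % q
    return matches


print(rabin_karp("GAT", "GGATCGATT"))
```

The character comparison after a hash match is what keeps the result correct when two different windows collide on the same hash value.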

Longest common substring problem


• Knuth–Morris–Pratt (KMP) algorithm

Pattern: nano
Text: banananobano
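A minimal Python sketch of KMP on this pattern and text follows. The function names `failure_table` and `kmp_search` are illustrative; the table stores, for each prefix of the pattern, the length of its longest proper prefix that is also a suffix, which is what allows the search to avoid re-reading text characters.

```python
def failure_table(pat):
    """fail[i] = length of the longest proper prefix of pat[:i+1] that is also its suffix."""
    fail = [0] * len(pat)
    k = 0
    for i in range(1, len(pat)):
        while k > 0 and pat[i] != pat[k]:
            k = fail[k - 1]                 # fall back to a shorter border
        if pat[i] == pat[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(pat, txt):
    """Return all 0-based positions of pat in txt in O(n + m) time."""
    if not pat:
        return []
    fail = failure_table(pat)
    matches = []
    k = 0                                   # number of pattern characters matched
    for i, ch in enumerate(txt):
        while k > 0 and ch != pat[k]:
            k = fail[k - 1]
        if ch == pat[k]:
            k += 1
        if k == len(pat):
            matches.append(i - k + 1)
            k = fail[k - 1]                 # keep going for overlapping matches
    return matches


print(failure_table("nano"), kmp_search("nano", "banananobano"))
```

For the pattern nano the table is [0, 0, 1, 0], and the only occurrence in banananobano starts at 0-based position 4.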


Lecture 07-08

Longest common subsequence problem


Problem Definition: We are given two strings: string S of length n and
string T of length m. Our goal is to produce their longest common
subsequence: the longest sequence of characters that appear left-to-right
(but not necessarily in a contiguous block) in both strings.
Example:
S = ABAZDC
T = BACBAD

– Dynamic Programming
1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)


Longest common subsequence problem


Dynamic Programming
Example: Initialization
S = ABAZDC
T = BACBAD

        B  A  C  B  A  D
     0  0  0  0  0  0  0
  A  0
  B  0
  A  0
  Z  0
  D  0
  C  0

Longest common subsequence problem


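Dynamic Programming
Example: Matrix fill (scoring)
$M_{i,j} = \max[\, M_{i-1,j-1} + S_{i,j},\; M_{i-1,j},\; M_{i,j-1} \,]$
where the score $S_{i,j}$ is 1 if the i-th character of S equals the j-th character of T, and 0 otherwise.
S = ABAZDC
T = BACBAD

        B  A  C  B  A  D
     0  0  0  0  0  0  0     (initialization)
  A  0  0  1  1  1  1  1     (scoring)
  B  0  1  1  1  2  2  2
  A  0  1  2  2  2  3  3
  Z  0  1  2  2  2  3  3
  D  0  1  2  2  2  3  4
  C  0  1  2  3  3  3  4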
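The initialization and matrix-fill steps for this example can be sketched in Python. The function name `lcs_matrix` is illustrative, and the recurrence assumed here is the plain LCS one: a match adds 1 to the diagonal neighbour, and gaps cost nothing.

```python
def lcs_matrix(s, t):
    """Fill the (len(s)+1) x (len(t)+1) LCS score matrix."""
    # Initialization: row 0 and column 0 stay at zero.
    M = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            match = 1 if s[i - 1] == t[j - 1] else 0
            M[i][j] = max(M[i - 1][j - 1] + match,  # diagonal
                          M[i - 1][j],              # from above
                          M[i][j - 1])              # from the left
    return M


for row in lcs_matrix("ABAZDC", "BACBAD"):
    print(row)
```

The bottom-right entry is the length of the longest common subsequence, 4 for this pair of strings.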


Longest common subsequence problem


Dynamic Programming
Example: Traceback (alignment)
S = ABAZDC
T = BACBAD

Solution? Trace back through the same scoring matrix, starting from the bottom-right cell (score 4), to read off the alignment. Multiple possibilities exist, because several traceback paths reach the same score.
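The traceback step can be sketched in Python as below. The function name `lcs` and the tie-breaking rule (prefer moving up when the cell above is at least as large as the cell to the left) are assumptions; a different rule would follow a different one of the possible paths.

```python
def lcs(s, t):
    """Return one longest common subsequence of s and t."""
    M = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            match = 1 if s[i - 1] == t[j - 1] else 0
            M[i][j] = max(M[i - 1][j - 1] + match, M[i - 1][j], M[i][j - 1])

    # Traceback from the bottom-right corner towards the origin.
    i, j, out = len(s), len(t), []
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])            # matched pair: take the diagonal
            i, j = i - 1, j - 1
        elif M[i - 1][j] >= M[i][j - 1]:
            i -= 1                          # tie-break: move up first
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("ABAZDC", "BACBAD"))
```

With this tie-breaking rule the sketch returns ABAD for the example strings.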

Longest common subsequence problem


Dynamic Programming
Example:
S = ABAZDC
T = BACBAD

Implement the initialization, matrix fill, and traceback steps for this example.


Longest common subsequence problem


Dynamic Programming
Example:
S = ABAZDC
T = BACBAD

Space complexity: O(N^2)
Time complexity: O(N^2)
(for two strings of length N; O(nm) for lengths n and m)

Sequence Alignment
• Pairwise
– DOT matrix
– Dynamic programming
– Word method (efficient heuristic method; e.g., BLAST)

• Multiple
– Dynamic programming
– Progressive method (e.g., CLUSTAL, T-Coffee)
– Iterative
– Motif finding


Dynamic Programming
S = BACBAD
T = ABAZDC
Gap penalty: opening, extension, combined

1. Initialization
2. Matrix fill, with score function
$M_{i,j} = \max[\, M_{i-1,j-1} + S_{i,j};\; M_{i-1,j};\; M_{i,j-1} \,]$
3. Traceback
$Tb_{i,j}$ = { ; ; }

Alignment Score Function


$w(k) = c_{open} + c_{length} \cdot k$

$M_{i,j} = \max[\, M_{i-1,j-1} + S_{i,j},\; M_{i-1,j-k} + w(k)\ (k = 1, \dots, j-1),\; M_{i-k,j-1} + w(k)\ (k = 1, \dots, i-1) \,]$

$M_{i,j} = \max[\, 0,\; M_{i-1,j-1} + S(a_i, b_j),\; M_{i-1,j} + w(a_i, -),\; M_{i,j-1} + w(-, b_j) \,]$
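The last recurrence, with 0 as one of the candidates, can be sketched in Python as follows. All numeric values are assumptions made for illustration (match +2, mismatch -1, a constant gap score of -1 per gapped character, and the gap coefficients c_open = -2, c_length = -1); the slides do not fix any of them, and the names `affine_gap` and `local_alignment_score` are hypothetical.

```python
def affine_gap(k, c_open=-2, c_length=-1):
    """Gap score w(k) = c_open + c_length * k for a gap of length k (assumed values)."""
    return c_open + c_length * k


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score using the recurrence with a floor at 0."""
    M = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(0,                    # restart the alignment here
                          M[i - 1][j - 1] + s,  # align a_i with b_j
                          M[i - 1][j] + gap,    # a_i against a gap
                          M[i][j - 1] + gap)    # b_j against a gap
            best = max(best, M[i][j])
    return best


print(affine_gap(3), local_alignment_score("xxACGTyy", "zzACGTww"))
```

Because no cell can drop below 0, a poorly matching region cannot drag down the score of a well matching region elsewhere in the two sequences.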


Pairwise Sequence Alignment


• Problem Statement
– Input: Two sequences.
>Seq1|Dummy|SEQUENCE
GAATTCAGTTA

>Seq2|Dummy|SEQUENCE
GGATCGA
– Output: Their alignment with the optimum alignment
score.
G _ A A T T C A G T T A
|     |   | |   |     |
G G _ A _ T C _ G _ _ A

Pairwise Sequence Alignment


GAATTCAGTTA
GGATCGA

G _ A A T T C A G T T A
|     |   | |   |     |
G G _ A _ T C _ G _ _ A

$M_{i,j} = \max[\, M_{i-1,j-1} + S_{i,j},\; M_{i,j-1} + w,\; M_{i-1,j} + w \,]$
The three terms correspond to: match/mismatch in the diagonal; gap in sequence #1; gap in sequence #2.
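A global alignment built on this recurrence can be sketched in Python. The scoring values are assumptions, since the slide does not state them: match = 1, mismatch = 0 and gap w = 0. The function name `global_align` is hypothetical, sequence #1 indexes the rows, and gaps are written as "_" as on the slide.

```python
def global_align(a, b, match=1, mismatch=0, gap=0):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * gap
    for j in range(1, m + 1):
        M[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # diagonal
                          M[i][j - 1] + gap,     # gap in sequence #1
                          M[i - 1][j] + gap)     # gap in sequence #2

    # Traceback: follow any predecessor that explains the current score.
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("_"); i -= 1
        else:
            top.append("_"); bottom.append(b[j - 1]); j -= 1
    return M[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row1, row2 = global_align("GAATTCAGTTA", "GGATCGA")
print(score)
print(row1)
print(row2)
```

Under these assumed scores the optimum is 6, the number of matched columns in the alignment shown on the slide. The printed alignment may differ from the slide's, since several alignments reach the same score.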

Source: http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
