Computational Biophysics@CSE, IITKGP Spring 2018
Computational Biophysics:
Algorithms to Applications
(CS61060)
Instructor: Pralay Mitra
Email: [email protected]
Lecture 05-06
Computational Methods in Proteins
[Figure: overview diagram. Structure levels: Sequence (Primary), Secondary, Tertiary, Quaternary. Methods: Sequence Alignment, Sequence to Secondary Structure, Protein Folding, Protein Design, Protein Docking.]
Aligning / Matching
Matching / Alignment
• Longest common substring problem
Let Σ be an alphabet (a finite set; for DNA, Σ = {A, C, G, T}). Search for a
pattern in a text, where both the pattern and the text are arrays of elements
of Σ.
– Example:
• Pattern: ins, india, iit, iit kharagpur
• String: indian institute of technology kharagpur
• Longest common subsequence problem
Let Σ be an alphabet (a finite set; for DNA, Σ = {A, C, G, T}). Find the longest
subsequence common to all sequences in a set of sequences (often just two
sequences) constructed over Σ.
– Example:
• Pattern: ins, india, iit, iit kharagpur
• String: indian institute of technology kharagpur
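The difference between the two problems can be seen on the example above. The following is a minimal Python sketch; the helper names `is_substring` and `is_subsequence` are illustrative and not part of the lecture. A substring must occur as one contiguous block, while a subsequence only has to keep the left-to-right order.

```python
def is_substring(pattern: str, text: str) -> bool:
    """True if pattern occurs in text as one contiguous block."""
    return pattern in text


def is_subsequence(pattern: str, text: str) -> bool:
    """True if the characters of pattern occur in text left to right, gaps allowed."""
    remaining = iter(text)
    # `ch in remaining` consumes the iterator up to and including the first ch
    return all(ch in remaining for ch in pattern)


text = "indian institute of technology kharagpur"
for pattern in ["ins", "india", "iit", "iit kharagpur"]:
    print(f"{pattern!r}: substring={is_substring(pattern, text)}, "
          f"subsequence={is_subsequence(pattern, text)}")
```

In this sketch "ins" and "india" are found as substrings, whereas "iit" and "iit kharagpur" occur only as subsequences.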
Complexity
• Time Complexity
• Space Complexity
N      N^2         N^3             N^4                 N^6                         N^3 log2^3(N)
1      1           1               1                   1                           0
10     100         1,000           10,000              1,000,000                   36,658
100    10,000      1,000,000       100,000,000         1,000,000,000,000           293,265,294
1000   1,000,000   1,000,000,000   1,000,000,000,000   1,000,000,000,000,000,000   989,770,366,797
Our interest is the computational time, which depends on the
computational complexity as well as on the per-step
computation time.
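The growth rates in the table can be reproduced with a few lines of Python. This is only a sketch: the function names are illustrative, the logarithm is taken to base 2 (which matches the tabulated values), and the per-step time of one nanosecond is a hypothetical figure chosen to show how step counts turn into running time.

```python
import math


def step_counts(n: int) -> dict[str, int]:
    """Number of steps for each growth rate in the table, for input size n."""
    return {
        "N^2": n ** 2,
        "N^3": n ** 3,
        "N^4": n ** 4,
        "N^6": n ** 6,
        "N^3 log2^3(N)": round(n ** 3 * math.log2(n) ** 3),
    }


def estimated_seconds(steps: int, seconds_per_step: float = 1e-9) -> float:
    """Running time = number of steps * time per step (hypothetical 1 ns default)."""
    return steps * seconds_per_step


for n in (1, 10, 100, 1000):
    counts = step_counts(n)
    print(n, counts, f"N^6 takes about {estimated_seconds(counts['N^6']):.3g} s")
```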
Longest common substring problem
Algorithm                    Preprocessing Time   Searching Time                        Space Required
Naïve string search          None                 O(mn)                                 None
Rabin–Karp string search     O(m)                 O(n+m) average, O((n-m)m) worst case  Constant
Knuth–Morris–Pratt           O(m)                 O(n)                                  O(m)

Length of the string: n
Length of the pattern: m
Cardinality of the character set: k
Longest common substring problem
• Naïve string search algorithm
for each starting position in txt
    for every character in pat
        break the search if the txt character does not match the pat character
    if all the characters of pat match, output a match
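The pseudocode above can be turned into a short Python sketch. The function name `naive_search` and the choice to return all match positions as a list are assumptions made for illustration.

```python
def naive_search(pat: str, txt: str) -> list[int]:
    """Return every index of txt at which pat starts, in O(mn) time."""
    matches = []
    for start in range(len(txt) - len(pat) + 1):
        for offset, ch in enumerate(pat):
            if txt[start + offset] != ch:
                break  # mismatch: give up on this starting position
        else:
            # the inner loop finished without a break, so all of pat matched
            matches.append(start)
    return matches


txt = "indian institute of technology kharagpur"
for pat in ["ins", "india", "iit"]:
    print(pat, naive_search(pat, txt))
```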
Longest common substring problem
• Rabin–Karp string search algorithm
– Hash function
/* pat: pattern; M: length of pattern
   txt: text; N: length of text/string
   d: number of characters in the alphabet (radix of the hash)
   q: a prime number
*/
Steps:
1. Precompute h = pow(d, M-1) % q, the weight of the leading character of a window, by repeating h = (h*d) % q.
2. Calculate the hash values of pat and of the first window of txt: p = (d*p + pat[i]) % q; t = (d*t + txt[i]) % q.
3. Slide the pattern over txt one position at a time:
   – Compare the hash values of the current window of txt and of pat. Only if the hash values match, compare the characters one by one.
   – If all the characters of pat match, output a match.
   – Calculate the hash value for the next window of txt: remove the leading digit, add the trailing digit.
Exercise: implement for genomics data. What are the advantages?
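The steps above can be sketched in Python as follows. The defaults d = 256 and q = 101 are assumptions commonly used in textbook versions, not values given in the lecture, and the DNA strings are made up for illustration.

```python
def rabin_karp(pat: str, txt: str, d: int = 256, q: int = 101) -> list[int]:
    """Return every index of txt at which pat starts, using a rolling hash."""
    M, N = len(pat), len(txt)
    if M == 0 or M > N:
        return []
    h = pow(d, M - 1, q)  # weight of the leading character of a window
    p = t = 0
    for i in range(M):  # hashes of pat and of the first window of txt
        p = (d * p + ord(pat[i])) % q
        t = (d * t + ord(txt[i])) % q
    matches = []
    for s in range(N - M + 1):
        # compare characters only when the hash values agree
        if p == t and txt[s:s + M] == pat:
            matches.append(s)
        if s < N - M:
            # remove the leading character, add the trailing character
            t = (d * (t - ord(txt[s]) * h) + ord(txt[s + M])) % q
    return matches


print(rabin_karp("GATTA", "ACGATTACAGATTA"))
```

Because the characters are compared whenever the hashes agree, a hash collision costs extra time but never produces a false match.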
Longest common substring problem
• Knuth–Morris–Pratt (KMP) algorithm
Pattern: nano
Text: banananobano
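A minimal Python sketch of KMP on this example is given below. The function names `failure_table` and `kmp_search` are illustrative. The table stores, for each prefix of the pattern, the length of the longest proper prefix that is also a suffix, so that the text pointer never moves backwards.

```python
def failure_table(pat: str) -> list[int]:
    """table[i] = length of the longest proper prefix of pat[:i+1] that is also its suffix."""
    table = [0] * len(pat)
    k = 0
    for i in range(1, len(pat)):
        while k > 0 and pat[i] != pat[k]:
            k = table[k - 1]
        if pat[i] == pat[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(pat: str, txt: str) -> list[int]:
    """Return every index of txt at which pat starts, in O(n + m) time."""
    if not pat:
        return []
    table = failure_table(pat)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(txt):
        while k > 0 and ch != pat[k]:
            k = table[k - 1]  # fall back without re-reading txt
        if ch == pat[k]:
            k += 1
        if k == len(pat):
            matches.append(i - k + 1)
            k = table[k - 1]
    return matches


print(failure_table("nano"))               # [0, 0, 1, 0]
print(kmp_search("nano", "banananobano"))  # [4]
```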
Lecture 07-08
Longest common subsequence problem
Problem Definition: We are given two strings: string S of length n, and
string T of length m. Our goal is to produce their longest common
subsequence: the longest sequence of characters that appear left-to-right
(but not necessarily in a contiguous block) in both strings.
Example:
S = ABAZDC
T = BACBAD
– Dynamic Programming
1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)
Longest common subsequence problem
Dynamic Programming
Example, Initialization
S = ABAZDC
T = BACBAD
      B  A  C  B  A  D
   0  0  0  0  0  0  0
A  0
B  0
A  0
Z  0
D  0
C  0
Longest common subsequence problem
Dynamic Programming
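Example, Scoring (after initialization)
$M_{i,j} = \max\,[\, M_{i-1,j-1} + S_{i,j},\; M_{i-1,j},\; M_{i,j-1} \,]$, where $S_{i,j} = 1$ if the $i$-th character of S equals the $j$-th character of T, and 0 otherwise
S = ABAZDC
T = BACBAD
      B  A  C  B  A  D
   0  0  0  0  0  0  0
A  0  0  1  1  1  1  1
B  0  1  1  1  2  2  2
A  0  1  2  2  2  3  3
Z  0  1  2  2  2  3  3
D  0  1  2  2  2  3  4
C  0  1  2  3  3  3  4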
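The initialization and scoring steps can be sketched in Python. This sketch assumes that a match scores 1, a mismatch scores 0, and that the two non-diagonal terms are the cells directly above and directly to the left; these assumptions reproduce the matrix shown for S = ABAZDC and T = BACBAD. The function name `lcs_matrix` is illustrative.

```python
def lcs_matrix(s: str, t: str) -> list[list[int]]:
    """Fill the (len(s)+1) x (len(t)+1) LCS scoring matrix."""
    # initialization: first row and first column are zero
    m = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            match = 1 if s[i - 1] == t[j - 1] else 0
            m[i][j] = max(m[i - 1][j - 1] + match,  # diagonal
                          m[i - 1][j],              # cell above
                          m[i][j - 1])              # cell to the left
    return m


for row in lcs_matrix("ABAZDC", "BACBAD"):
    print(row)
```

The bottom-right entry is the length of the longest common subsequence.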
Longest common subsequence problem
Dynamic Programming
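Example, Alignment (traceback, after initialization and scoring)
$M_{i,j} = \max\,[\, M_{i-1,j-1} + S_{i,j},\; M_{i-1,j},\; M_{i,j-1} \,]$, where $S_{i,j} = 1$ if the $i$-th character of S equals the $j$-th character of T, and 0 otherwise
S = ABAZDC
T = BACBAD
      B  A  C  B  A  D
   0  0  0  0  0  0  0
A  0  0  1  1  1  1  1
B  0  1  1  1  2  2  2
A  0  1  2  2  2  3  3
Z  0  1  2  2  2  3  3
D  0  1  2  2  2  3  4
C  0  1  2  3  3  3  4
Solution? Trace back from the bottom-right cell towards the top-left corner. Multiple traceback paths are possible.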
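One possible traceback can be sketched in Python as follows. The sketch assumes match = 1, mismatch = 0 and no gap penalty, and it breaks ties by moving up before moving left, which is an arbitrary choice; other tie-breaking rules follow other paths through the matrix. The function name `lcs` is illustrative.

```python
def lcs(s: str, t: str) -> str:
    """Return one longest common subsequence of s and t."""
    # initialization and scoring
    m = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            match = 1 if s[i - 1] == t[j - 1] else 0
            m[i][j] = max(m[i - 1][j - 1] + match, m[i - 1][j], m[i][j - 1])
    # traceback from the bottom-right cell
    i, j = len(s), len(t)
    picked = []
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            picked.append(s[i - 1])  # diagonal move: character is in the LCS
            i -= 1
            j -= 1
        elif m[i - 1][j] >= m[i][j - 1]:
            i -= 1                   # move up (tie-break: up before left)
        else:
            j -= 1                   # move left
    return "".join(reversed(picked))


print(lcs("ABAZDC", "BACBAD"))  # ABAD
```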
Longest common subsequence problem
Dynamic Programming
Exercise: implement initialization, scoring and traceback for the example S = ABAZDC, T = BACBAD, using the recurrence and matrix shown above.
Longest common subsequence problem
Dynamic Programming
Example, S = ABAZDC, T = BACBAD
Space complexity: O(N^2)
Time complexity: O(N^2)
For two sequences of length N, the matrix has (N+1) × (N+1) cells and each cell is filled in constant time.
Sequence Alignment
• Pairwise
– DOT matrix
– Dynamic programming
– Word method (efficient heuristic method; e.g., BLAST)
• Multiple
– Dynamic programming
– Progressive method (e.g., CLUSTAL, T-Coffee)
– Iterative
– Motif finding
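As an illustration of the first pairwise method in the list, a dot matrix marks every cell where the character of one sequence equals the character of the other; runs of marks along a diagonal indicate shared stretches. The following Python sketch uses an illustrative function name and the characters `*` and `.` as arbitrary symbols.

```python
def dot_matrix(s: str, t: str) -> list[str]:
    """One text row per character of s; '*' where it equals the character of t."""
    return ["".join("*" if a == b else "." for b in t) for a in s]


s, t = "ABAZDC", "BACBAD"
print("  " + t)
for label, row in zip(s, dot_matrix(s, t)):
    print(label + " " + row)
```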
Dynamic Programming
S = BACBAD
T = ABAZDC
Gap Penalty:
– Opening
– Extension
– Combined
Initialization
Matrix Fill
Score function
$M_{i,j} = \max\,[\, M_{i-1,j-1} + S_{i,j};\; M_{i-1,j};\; M_{i,j-1} \,]$
Traceback
$Tb_{i,j} \in \{\text{diagonal};\ \text{up};\ \text{left}\}$, one direction for each term of the score function
Alignment Score Function
$w(k) = c_{open} + c_{length} \cdot k$
$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + S_{i,j} \\ M_{i-1,j-k} + w(k) & (k = 1, \dots, j-1) \\ M_{i-k,j-1} + w(k) & (k = 1, \dots, i-1) \end{cases}$$
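The gap function w(k) combines an opening cost with an extension cost proportional to the gap length k. The Python sketch below uses hypothetical values c_open = -10 and c_length = -1; they are negative because the recurrence adds w(k) to the score, and they are not taken from the lecture.

```python
def gap_penalty(k: int, c_open: float = -10.0, c_length: float = -1.0) -> float:
    """w(k) = c_open + c_length * k for a gap spanning k positions."""
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return c_open + c_length * k


for k in (1, 2, 5):
    print(k, gap_penalty(k))
```

With these values one gap of length 5 costs -15, while five separate gaps of length 1 cost -55, so a single long gap is preferred over many short ones.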
$$M_{i,j} = \max \begin{cases} 0 \\ M_{i-1,j-1} + S(a_i, b_j) \\ M_{i-1,j} + w(a_i, -) \\ M_{i,j-1} + w(-, b_j) \end{cases}$$
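The recurrence with the extra 0 term never lets a score drop below zero, which is the form used for local alignment (Smith–Waterman). The Python sketch below computes only the best local score. It assumes a constant gap cost instead of the character-dependent w(a_i, -) and w(-, b_j), and the values match = 2, mismatch = -1, gap = -1 are hypothetical.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score of a and b under a constant gap cost."""
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            m[i][j] = max(0,                    # restart the alignment here
                          m[i - 1][j - 1] + s,  # align a_i with b_j
                          m[i - 1][j] + gap,    # a_i against a gap
                          m[i][j - 1] + gap)    # b_j against a gap
            best = max(best, m[i][j])
    return best


print(local_alignment_score("TTACGTT", "GGACGGG"))  # 6, from the shared ACG
```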
Pairwise Sequence Alignment
• Problem Statement
– Input: Two sequences.
>Seq1|Dummy|SEQUENCE
GAATTCAGTTA
>Seq2|Dummy|SEQUENCE
GGATCGA
– Output: Their alignment with the optimum alignment score.
G _ A A T T C A G T T A
|     |   | |   |     |
G G _ A _ T C _ G _ _ A
Pairwise Sequence Alignment
Seq1: GAATTCAGTTA
Seq2: GGATCGA
G _ A A T T C A G T T A
|     |   | |   |     |
G G _ A _ T C _ G _ _ A
$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch in the diagonal)} \\ M_{i,j-1} + w & \text{(gap in sequence \#1)} \\ M_{i-1,j} + w & \text{(gap in sequence \#2)} \end{cases}$$
Source: http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
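The global alignment recurrence can be sketched in Python with initialization, matrix fill and traceback. The scoring values match = 1, mismatch = 0 and w = 0 are assumptions chosen so that the score simply counts matched columns; the function name `global_alignment` and the tie-breaking order in the traceback are also illustrative, so the printed alignment may differ from the one shown while having the same score.

```python
def global_alignment(a: str, b: str, match: int = 1,
                     mismatch: int = 0, gap: int = 0) -> tuple[int, str, str]:
    """Return (score, aligned a, aligned b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    M = [[0] * cols for _ in range(rows)]
    # initialization: leading gaps in one of the sequences
    for i in range(1, rows):
        M[i][0] = M[i - 1][0] + gap
    for j in range(1, cols):
        M[0][j] = M[0][j - 1] + gap
    # matrix fill
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,  # match/mismatch in the diagonal
                          M[i][j - 1] + gap,    # gap in sequence #1
                          M[i - 1][j] + gap)    # gap in sequence #2
    # traceback from the bottom-right cell
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("_"); i -= 1
        else:
            top.append("_"); bottom.append(b[j - 1]); j -= 1
    return M[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_alignment("GAATTCAGTTA", "GGATCGA")
print(score)
print(top)
print(bottom)
```

Under these assumed scores the optimum is 6, the number of matched columns in the alignment shown for Seq1 and Seq2.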