FASTA-05042018

The document explains the difference between homology and similarity in sequence comparison, where homology indicates a shared ancestral origin while similarity is a broader, quantitative measure. It details the FASTA software package for DNA and protein sequence alignment, outlining its four-step algorithm for identifying and scoring sequence similarities. Additionally, it contrasts Divide and Conquer with Dynamic Programming, emphasizing the efficiency of the latter in solving overlapping sub-problems.

Uploaded by

shubhang.aryamsc.bio24

FASTA

Homology vs Similarity
When two sequences or structures are compared, the terms similarity and
homology are often used interchangeably to indicate a "close" relationship
between the objects being compared.

However, the two terms refer to different aspects of the comparison:

Homology is a qualitative property of the comparison: sequences are
homologous if they derive from a common ancestral gene. The term therefore
indicates that the two entities share a common phylogenetic origin, from
which they have evolved and diverged.

Similarity is a quantitative property of the comparison. The term has a more
general meaning and indicates resemblance regardless of what caused it.
Similarity is often due to homology, but it can also arise by chance or
through adaptive convergence, at both the morphological and the molecular
level.
FASTA
FASTA is a DNA and protein sequence alignment software package first
described (as FASTP) by David J. Lipman and William R. Pearson in
1985.[1]

FASTA is pronounced "fast A", and stands for "FAST-All", because it works
with any alphabet; it is an extension of "FAST-P" (protein) and "FAST-N"
(nucleotide) alignment.

The current FASTA package contains programs for protein:protein, DNA:DNA,
protein:translated DNA (with frameshifts), and ordered or unordered peptide
searches.

Recent versions of the FASTA package include special translated search
algorithms that correctly handle frameshift errors (which
six-frame-translated searches do not handle very well) when comparing
nucleotide to protein sequence data.
[1] Lipman, DJ; Pearson, WR (1985). "Rapid and sensitive protein similarity searches". Science.
227 (4693): 1435–41.
In addition to rapid heuristic search methods, the FASTA package provides
SSEARCH, an implementation of the optimal Smith-Waterman algorithm.

A major focus of the package is the calculation of accurate similarity
statistics, so that biologists can judge whether an alignment is likely to
have occurred by chance, or whether it can be used to infer homology.

The FASTA program, in its classic version, is a heuristic program that
searches for global sequence similarity. Two variants of it, LFASTA and
PLFASTA, search for local sequence similarity.

The FASTA package is available from fasta.bioch.virginia.edu.

FASTA is a four-step algorithm.

First Step: OFFSET Definition

Initially, a table is created that lists all the positions of each type of
amino acid (or nucleotide) within each of the sequences in the database. For
example, if the database contains Sequence A below, the following table is
created:

Position    1 2 3 4 5 6 7
Sequence A  F L W R T W S

F  1
L  2
W  3, 6
R  4
T  5
S  7
For a second sequence, Sequence B, the table is:

Position    1 2 3 4 5 6
Sequence B  S W R T W T

S  1
W  2, 5
R  3
T  4, 6

The positional table can be built from individual amino acids (ktup = 1, as
above) or from pairs of consecutive amino acids (ktup = 2). The second mode
speeds up the process at the expense of the accuracy of the final result.
The approximation remains valid, however, because similarity between two
sequences can be assumed to be meaningful only where pairs of amino acids,
rather than single amino acids, match. For nucleotide sequences ktup is 4
or 6.
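The positional table described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used by the FASTA package; the function name build_position_table is an assumption made for this example, and positions are counted from 1 as in the slides.

```python
from collections import defaultdict


def build_position_table(sequence, ktup=1):
    """Map every word of length ktup to the 1-based positions where it starts.

    Hypothetical helper for illustration; not part of the FASTA package.
    """
    table = defaultdict(list)
    for i in range(len(sequence) - ktup + 1):
        word = sequence[i:i + ktup]
        table[word].append(i + 1)  # the slides count positions from 1
    return dict(table)


table_a = build_position_table("FLWRTWS")          # ktup = 1, Sequence A
table_b = build_position_table("SWRTWT")           # ktup = 1, Sequence B
pairs_a = build_position_table("FLWRTWS", ktup=2)  # words FL, LW, WR, RT, TW, WS
```

With ktup = 2 the table has fewer, more specific entries, which is why the later comparison is faster but can miss isolated single-residue matches.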
Next, the difference is calculated between the positional values of amino
acids of the same type in the sequence being compared (the query sequence)
and in each database sequence. This difference is called the OFFSET.
In the example below, the amino acids common to the two sequences are
S, W (present in two positions), R and T.
Comparing the positional values of these amino acids in the two sequences
gives rise to these values of OFFSET:

Sequence A    Sequence B
F  1          S  1
L  2          W  2, 5
W  3, 6       R  3
R  4          T  4, 6
T  5
S  7

AA   DELTA (A - B)   OFFSET
S    7 - 1            6
W    3 - 2            1
W    3 - 5           -2
W    6 - 2            4
W    6 - 5            1
R    4 - 3            1
T    5 - 4            1
T    5 - 6           -1
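The offset calculation can be reproduced with a short Python sketch. The function names position_table and offsets are assumptions made for this example, and Sequence A is treated as the query; the sketch simply pairs every residue of one sequence with the identical residues of the other and counts how often each offset occurs.

```python
from collections import Counter, defaultdict


def position_table(sequence):
    """1-based positions of every residue (ktup = 1)."""
    table = defaultdict(list)
    for i, residue in enumerate(sequence, start=1):
        table[residue].append(i)
    return table


def offsets(query, subject):
    """Return (residue, query_pos, subject_pos, offset) for each shared residue."""
    subject_table = position_table(subject)
    hits = []
    for q_pos, residue in enumerate(query, start=1):
        for s_pos in subject_table.get(residue, []):
            hits.append((residue, q_pos, s_pos, q_pos - s_pos))
    return hits


hits = offsets("FLWRTWS", "SWRTWT")
offset_counts = Counter(offset for _, _, _, offset in hits)
best_offset, matches = offset_counts.most_common(1)[0]
print(best_offset, matches)  # 1 4
```

Each offset corresponds to one diagonal of the comparison matrix, so counting identical offsets is the same as counting identities along a diagonal.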
In the example, the best offset is the one with value 1, which allows 4
amino acids to be aligned. The other offsets allow the alignment of a single
amino acid or none. The score of an offset is increased for each identity
and reduced for each mismatch. Mismatches are allowed, whereas insertions
and deletions are not.

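Dot matrix of Sequence A (columns) against Sequence B (rows); X marks an
identical residue:

        1  2  3  4  5  6  7
        F  L  W  R  T  W  S
1  S    .  .  .  .  .  .  X
2  W    .  .  X  .  .  X  .
3  R    .  .  .  X  .  .  .
4  T    .  .  .  .  X  .  .
5  W    .  .  X  .  .  X  .
6  T    .  .  .  .  X  .  .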
Local regions of similarity between the two sequences can now be defined:
they are the ones lying on the offsets with the highest scores. The 10 best
regions of similarity are stored for further analysis.

Second Step: Evaluation of Amino Acid Substitutions

This step evaluates the substitutions that occurred between amino acids in
the 10 best regions of similarity selected in the first step, using
empirical scoring matrices.

Several kinds of matrix can be used, the best known and most popular being
PAM and BLOSUM.

Of these, PAM250 is the most widely used.

Such matrices define a score for each possible amino acid substitution,
based on the frequency of the mutations that have occurred over time.
Each of the 10 best regions of similarity is then assigned a score based on
the selected scoring matrix, and within each region the residues that
contribute to the maximum score are identified.

The sub-region defined in this way is called the initial region, and its
score is called the initial score or INIT1 (in FASTP it is called INITN).

This score is used as a parameter to define the similarity between two
sequences (similarity score).

The initial score can be calculated with ktup = 1 or 2.
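As a rough illustration of how an initial region might be found, the sketch below rescans one diagonal and keeps the ungapped stretch with the highest running score. It rests on assumptions: toy_score is a hypothetical stand-in for a real substitution matrix such as PAM250 (its values +5 and -4 are invented for the example), and the function name init1 is chosen only to echo the score it approximates.

```python
def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix such as PAM250."""
    return 5 if a == b else -4


def init1(query, subject, offset, score=toy_score):
    """Best-scoring ungapped stretch on one diagonal.

    offset = query position - subject position (1-based positions).
    Returns (best_score, query_start, query_end), positions inclusive.
    """
    best = (0, 0, 0)
    running, start = 0, None
    for q_pos in range(1, len(query) + 1):
        s_pos = q_pos - offset
        if not 1 <= s_pos <= len(subject):
            continue  # this query position has no partner on the diagonal
        if running <= 0:
            running, start = 0, q_pos  # restart the candidate region here
        running += score(query[q_pos - 1], subject[s_pos - 1])
        if running > best[0]:
            best = (running, start, q_pos)
    return best


print(init1("FLWRTWS", "SWRTWT", offset=1))  # (20, 3, 6)
```

On the example sequences the best region on offset 1 covers positions 3 to 6 of Sequence A (W R T W), the four identities found in the first step.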

The method that uses ktup = 2 is much faster (about 5 times) than the one
that uses ktup = 1, but it is also less precise. The accuracy remains
acceptable for long protein and nucleotide sequences, but it is unacceptable
for oligonucleotides or oligopeptides.

•Oligonucleotides are short (oligo) nucleotide sequences (RNA or DNA),
typically with 20 or fewer base pairs.
•An oligopeptide is a molecule formed by the condensation of a number of
amino acids that, by convention, varies from 10 to 50. The amino acids that
form an oligopeptide are linked together by a series of peptide bonds.

The initial region with the highest score is used to rank the comparison
sequences found in the database, in order to determine which of them are
most similar to the input sequence.
Third Step: Join several initial regions

This step is new in FASTA with respect to FASTP/FASTN.

FASTA evaluates the possible ways to connect (join) different initial
regions. The constraints on a join are:

-The joined regions must not overlap
-The score must be higher than a "threshold value"
-A penalty of -16 is applied for each gap (the gap penalty).

Given the location of the initial regions and their respective scores, FASTA
assigns a score penalty to the intermediate regions without similarity.
An algorithm assesses whether the introduced penalty lowers the score
below a certain "threshold value".

If it does not, FASTA calculates the optimal alignment of the initial
regions. This alignment is defined by joining the subset of initial regions
that gives the highest score.

Consequently the initial score INIT1 is recalculated on the basis of the
joins created; the new score is called INITN.

The rank of the comparison sequences in the database is also recalculated.

This step increases the sensitivity of the method at the expense of
specificity. The reduction in specificity is partly compensated by
including in the join only the initial regions whose score is higher than a
threshold value (OPTCUT). OPTCUT is about one standard deviation above the
mean score expected for unrelated sequences in the database.
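The joining step can be pictured as choosing the best chain of non-overlapping regions, paying a penalty for every join. The sketch below is a simplified illustration under stated assumptions, not FASTA's actual procedure: the region tuples are invented example data, the function name initn is chosen for readability, and the threshold test (OPTCUT) is left out.

```python
def initn(regions, join_penalty=16):
    """Best total score of a chain of non-overlapping regions.

    Each region is (q_start, q_end, s_start, s_end, score), with inclusive
    positions in the query (q) and database sequence (s). A region may follow
    another only if it starts after it in both sequences. Every join costs
    join_penalty. Simplified sketch: no OPTCUT threshold is applied.
    """
    ordered = sorted(regions)  # by query start, so predecessors come first
    best = []                  # best[i]: best chain score ending at ordered[i]
    for i, (q_start, _, s_start, _, score) in enumerate(ordered):
        chain = score
        for j in range(i):
            _, prev_q_end, _, prev_s_end, _ = ordered[j]
            if prev_q_end < q_start and prev_s_end < s_start:
                chain = max(chain, best[j] + score - join_penalty)
        best.append(chain)
    return max(best, default=0)


# Hypothetical regions: the first two can be joined, the third overlaps both.
regions = [(1, 10, 1, 10, 40), (15, 30, 18, 33, 50), (5, 12, 40, 47, 20)]
print(initn(regions))  # 74, i.e. 40 + 50 - 16
```

When the penalty outweighs the gain, the chain is not extended: for regions scoring 40 and 10 the result stays at 40, because 40 + 10 - 16 would be lower.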
Fourth Step: Evaluation of deletions and insertions

The sequences showing the greatest similarity are aligned to the input
sequence using a method based on a modified version of the Needleman and
Wunsch algorithm.

The Needleman and Wunsch algorithm is based on dynamic programming. It
takes into account possible deletions or insertions of amino acids (or
nucleotides).

• The variant proposed by Lipman and Pearson does not take into account
the sequence regions that are not similar. This allows an optimized score
(OPT) to be calculated.
The final comparison is made among

• all possible alignments for the input sequence that are found during the
second phase and
• the similar sequences extracted from the database that fall within a
range of 32 amino acids, centred around the initial region that has the
highest score.

In this phase, a gap penalty of -12 is used for the first missing residue
and -4 for each additional residue.
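The gap penalty quoted above is an affine penalty: opening a gap costs more than extending it. The sketch below works through the arithmetic and shows one possible reading of the 32-residue range as a band of diagonals. Both function names are assumptions made for this example, and treating the range as 16 diagonals on either side of the best initial region is an interpretation, not something the slides state.

```python
def gap_penalty(length, first=-12, extend=-4):
    """Penalty for a gap of `length` residues: -12 for the first, -4 per extra."""
    if length <= 0:
        return 0
    return first + extend * (length - 1)


def in_band(q_pos, s_pos, centre_offset, width=32):
    """True if cell (q_pos, s_pos) lies inside the band around centre_offset.

    Assumption: the 32-residue range is read as width // 2 diagonals on each
    side of the diagonal of the best initial region.
    """
    return abs((q_pos - s_pos) - centre_offset) <= width // 2


print(gap_penalty(1))  # -12
print(gap_penalty(3))  # -20, i.e. -12 - 4 - 4
```

Restricting the dynamic programming to such a band is what keeps this step fast: cells far from the best diagonal are never evaluated.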
Therefore FASTA defines three different scores:

INIT1: the first score calculated, obtained in the second phase after
applying the scoring matrix
INITN: calculated in the third phase, after the joins are introduced
OPT: the optimized score, calculated in the fourth phase after evaluating
insertions and deletions
Summarizing

The final score is OPT.
Dynamic Programming
Divide et Impera vs Dynamic Programming
• Algorithms based on the Divide and Conquer (Divide et Impera, D&I)
technique solve a given algorithmic problem in three basic steps:
1. Divide the problem into sub-problems of smaller size
2. Solve the smaller sub-problems recursively
3. Combine the solutions of the sub-problems into a solution for the
original problem
• Typically the sub-problems obtained in step 1 are different from one
another, so each of them is solved individually by the recursive call of
step 2.
• In many situations, however, the sub-problems obtained in step 1 are
similar, or even identical. In that case a D&I algorithm solves the same
problem several times, doing unnecessary work!
• In such situations (i.e. when the sub-problems of a given algorithmic
problem tend to be "similar", or even "equal"), it is useful to employ the
technique of Dynamic Programming.
• This technique is essentially similar to D&I, with the additional
characteristic that each sub-problem is solved only once.
• Hence, when sub-problems repeat, algorithms based on Dynamic Programming
are more efficient.
• Basic idea behind Dynamic Programming:
– It calculates the solution to each distinct sub-problem only once
– It stores these solutions in a table, so that they can be reused later
if necessary.
Example: Fibonacci numbers
• The sequence F0, F1, F2, F3, . . . of Fibonacci numbers is defined by
the following recurrence:
– F0 = F1 = 1
– Fn = Fn-1 + Fn-2 for n >= 2

• 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, . . .

• The D&I algorithm Fib(n) that follows directly from the definition of the
Fibonacci numbers returns 1 when n is 0 or 1, and otherwise returns
Fib(n-1) + Fib(n-2).
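A minimal Python sketch of this recursive algorithm, using the convention F0 = F1 = 1 from the definition above (the lowercase name fib is just this example's spelling of Fib):

```python
def fib(n):
    """Divide-and-conquer Fibonacci with F0 = F1 = 1."""
    if n < 2:
        return 1
    # Two recursive calls on smaller sub-problems, combined by addition.
    return fib(n - 1) + fib(n - 2)


print([fib(i) for i in range(8)])  # [1, 1, 2, 3, 5, 8, 13, 21]
```

The code is short, but the two recursive calls recompute the same values many times over.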

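The complexity T(n) of Fib(n) is

T(0) = T(1) = 1
T(n) = T(n-1) + T(n-2) + 1, n >= 2
By defining T'(n) = T(n) + 1, we obtain
T'(0) = T'(1) = 2
T'(n) = T'(n-1) + T'(n-2), n >= 2
T'(n) obeys the Fibonacci recurrence with doubled initial values, so
T'(n) = 2 Fn and T(n) = 2 Fn - 1: the running time of Fib(n) grows as fast
as the Fibonacci numbers themselves, that is, exponentially in n.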
Let us run Fib(5):

Fib(1) is calculated 5 times, Fib(0) and Fib(2) three times, and Fib(3)
twice!
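These counts can be checked by instrumenting the recursive algorithm. The sketch below is an illustration only; the name fib_counting and the use of a Counter to record calls are choices made for this example.

```python
from collections import Counter


def fib_counting(n, calls):
    """Recursive Fibonacci (F0 = F1 = 1) that records every call it makes."""
    calls[n] += 1
    if n < 2:
        return 1
    return fib_counting(n - 1, calls) + fib_counting(n - 2, calls)


calls = Counter()
result = fib_counting(5, calls)
print(result)                 # 8
print(sorted(calls.items()))  # [(0, 3), (1, 5), (2, 3), (3, 2), (4, 1), (5, 1)]
```

The 15 calls in total agree with T(5) = 2 F5 - 1 = 15.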
Dynamic Programming

Algorithm 1: use an array of n+1 elements; O(n) time complexity (counting
each addition as one step) and O(n) space complexity

Algorithm 2: use an array of 2 elements; O(n) time complexity (counting
each addition as one step) and O(1) space complexity
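The two dynamic programming variants might look as follows in Python. The original slides show their own code, which is not reproduced in this text, so the functions below (fib_table and fib_two_cells, both names invented here) are a sketch of the idea rather than the authors' algorithms.

```python
def fib_table(n):
    """Algorithm 1: fill a table of n + 1 entries, each computed once."""
    table = [1] * (n + 1)  # F0 = F1 = 1
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]


def fib_two_cells(n):
    """Algorithm 2: keep only the last two values in a 2-element array."""
    cells = [1, 1]  # cells[i % 2] holds Fi for the two most recent i
    for i in range(2, n + 1):
        cells[i % 2] = cells[0] + cells[1]
    return cells[n % 2]


print(fib_table(12), fib_two_cells(12))  # 233 233
```

Both functions perform n - 1 additions; the second needs only constant extra space because each new value depends on just the two before it.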
