FASTA-05042018
Homology vs Similarity
When comparing two sequences or structures, the terms similarity and
homology are often used interchangeably to indicate that there is a "close"
relationship between the objects being compared.
Initially, a table is created that lists the positions of each type of
amino acid (or nucleotide) within each of the sequences present in the
database. For example, if the database contains a sequence like the one
below, the following table is created:
Position    1 2 3 4 5 6 7
Sequence A  F L W R T W S

Residue  Positions
F        1
L        2
W        3, 6
R        4
T        5
S        7
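The positional table above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `position_table` is an illustrative choice, and positions are counted from 1 to match the table.

```python
from collections import defaultdict

def position_table(sequence):
    """Map each residue to the list of 1-based positions where it occurs."""
    table = defaultdict(list)
    for position, residue in enumerate(sequence, start=1):
        table[residue].append(position)
    return dict(table)

table_a = position_table("FLWRTWS")
print(table_a)
# {'F': [1], 'L': [2], 'W': [3, 6], 'R': [4], 'T': [5], 'S': [7]}
```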
FASTA
The positional table can be built from the positions of individual amino
acids (ktup = 1, as in the example above) or of pairs of amino acids
(ktup = 2). The second mode speeds up the process at the expense of the
accuracy of the final result.
The approximation is still valid, however, because the homology between
two sequences can be assumed to be meaningful only when pairs of amino
acids, rather than individual amino acids, match. For nucleotide
sequences, ktup is 4 or 6.
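The effect of ktup can be illustrated by indexing words of length ktup instead of single residues. The sketch below is an assumption about how such an index might be built (the name `ktup_table` is hypothetical); with ktup = 2 the example sequence produces six overlapping pairs.

```python
from collections import defaultdict

def ktup_table(sequence, ktup):
    """Map each word of length ktup to the 1-based positions where it starts."""
    table = defaultdict(list)
    for start in range(len(sequence) - ktup + 1):
        word = sequence[start:start + ktup]
        table[word].append(start + 1)
    return dict(table)

pairs = ktup_table("FLWRTWS", ktup=2)
print(pairs)
# {'FL': [1], 'LW': [2], 'WR': [3], 'RT': [4], 'TW': [5], 'WS': [6]}
```

With longer words, fewer chance matches survive, which is why the search gets faster and also why short, isolated matches are missed.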
First Step: OFFSET Definition
At this point, compute the difference between the positional values of
amino acids of the same type in the sequence being compared (the query
sequence) and in the sequences in the database. This difference is
called the OFFSET.
In the example below, the amino acids common to the two sequences are
S, W (present in two positions), R and T.
The comparison of the positional values of these amino acids in the two
sequences gives rise to these values of OFFSET:

Sequence A      Second sequence
F  1            S  1
L  2            W  2, 5
W  3, 6         R  3
R  4            T  4, 6
T  5
S  7

AA   DELTA   OFFSET
S    7-1      6
W    3-2      1
W    3-5     -2
W    6-2      4
W    6-5      1
R    4-3      1
T    5-4      1
T    5-6     -1
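The OFFSET values listed above can be recomputed as follows. This is a sketch under the assumption that the second sequence is S W R T W T (read from its positional table); the function names are illustrative.

```python
from collections import Counter, defaultdict

def position_table(sequence):
    """Map each residue to its 1-based positions."""
    table = defaultdict(list)
    for position, residue in enumerate(sequence, start=1):
        table[residue].append(position)
    return table

def offsets(seq_a, seq_b):
    """List (residue, pos_a, pos_b, offset) for every pair of matching residues."""
    table_b = position_table(seq_b)
    hits = []
    for pos_a, residue in enumerate(seq_a, start=1):
        for pos_b in table_b.get(residue, []):
            hits.append((residue, pos_a, pos_b, pos_a - pos_b))
    return hits

hits = offsets("FLWRTWS", "SWRTWT")
offset_counts = Counter(hit[3] for hit in hits)
print(offset_counts.most_common(1))
# [(1, 4)]
```

Four of the eight matches share the offset 1, which is what marks that diagonal as the most promising one.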
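[Figure: dot matrix with Sequence A (F L W R T W S, positions 1-7) along the columns and the second sequence (S W R T W T, positions 1-6) along the rows.]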
Local regions of similarity between the two sequences can now be
defined: they are the regions whose offset has the highest score. The 10
best regions of similarity are stored for further analysis.
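Selecting the best offsets can be sketched as below. Note the simplifying assumption: each offset is scored simply by the number of matching words it contains, which is cruder than the scoring FASTA actually applies. The function name and the `keep` parameter are hypothetical.

```python
from collections import Counter

def best_diagonals(seq_a, seq_b, ktup=1, keep=10):
    """Return up to `keep` (offset, hit_count) pairs, best first."""
    # Index the words of seq_b by their start position.
    index = {}
    for j in range(len(seq_b) - ktup + 1):
        index.setdefault(seq_b[j:j + ktup], []).append(j)
    # Count the matching words on each diagonal (offset = pos_a - pos_b).
    counts = Counter()
    for i in range(len(seq_a) - ktup + 1):
        for j in index.get(seq_a[i:i + ktup], []):
            counts[i - j] += 1
    return counts.most_common(keep)

best = best_diagonals("FLWRTWS", "SWRTWT")
print(best[0])
# (1, 4)
```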
Second Step: Evaluation of Amino Acid Substitutions
This step evaluates the substitutions that occurred between amino acids
in the 10 best regions of similarity selected in the first step, using
empirical scoring matrices.
Several kinds of matrices can be used; the best known and most widely
used are PAM and BLOSUM.
These matrices define a score for each possible amino acid substitution,
based on the frequency of the mutations that have occurred over time.
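As an illustration of rescoring a region with a substitution matrix, the sketch below scans one diagonal and keeps the best-scoring ungapped run. The scores in `TOY_MATRIX` are made-up values for demonstration only (they are not PAM or BLOSUM entries), and the function names are hypothetical.

```python
# Toy scores, NOT real PAM/BLOSUM values: identities +5, S/T +1, anything else -2.
TOY_MATRIX = {(r, r): 5 for r in "FLWRTS"}
TOY_MATRIX[("S", "T")] = 1

def substitution_score(a, b, matrix, mismatch=-2):
    """Look up a symmetric score, falling back to a flat mismatch value."""
    return matrix.get((a, b), matrix.get((b, a), mismatch))

def best_region_on_diagonal(seq_a, seq_b, offset, matrix):
    """Best ungapped run on the diagonal pos_a - pos_b == offset (1-based).

    Returns (score, (first_pos_a, last_pos_a)).
    """
    best_score, best_span = 0, None
    running, run_start = 0, None
    for pos_b in range(1, len(seq_b) + 1):
        pos_a = pos_b + offset
        if not 1 <= pos_a <= len(seq_a):
            continue
        if running <= 0:  # a non-positive prefix never helps: restart here
            running, run_start = 0, pos_a
        running += substitution_score(seq_a[pos_a - 1], seq_b[pos_b - 1], matrix)
        if running > best_score:
            best_score, best_span = running, (run_start, pos_a)
    return best_score, best_span

region = best_region_on_diagonal("FLWRTWS", "SWRTWT", 1, TOY_MATRIX)
print(region)
# (21, (3, 7))
```

With these toy scores, the run W R T W S aligned against W R T W T scores 5 + 5 + 5 + 5 + 1 = 21; the leading L/S mismatch is excluded because it would lower the total.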
The sub-region defined in this way is called the initial region, and its
score is called the initial score, or INIT1 (in FASTP it is called INITN).
The method that uses ktup = 2 is much faster (about 5 times) than the one
that uses ktup = 1, but it is also less precise. The accuracy remains
acceptable for long protein and nucleotide sequences, but it is
unacceptable for oligonucleotides or oligopeptides.
The initial region with the highest score is used to rank the matching
sequences found in the database, in order to determine which of them
are most similar to the input sequence.
Third Step: Join several initial regions
Given the locations of the initial regions and their respective scores,
FASTA assigns a score penalty to the intervening regions that show no
similarity.
An algorithm assesses whether the introduced penalty lowers the score
below a certain threshold value.
If it does not, FASTA calculates the optimal alignment of the initial
regions. This alignment is defined by joining the subset of initial
regions that gives the highest score.
• The variant proposed by Lipman and Pearson does not take into
account the sequence regions that are not similar. This allows an
optimized score (OPT) to be computed.
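One way to picture the joining step is as a search for the best chain of compatible initial regions, where every join costs a fixed penalty. The sketch below is an assumption about how this could be done, not FASTA's actual code: the region tuples, the penalty values and the function name are all hypothetical.

```python
def best_chain_score(regions, join_penalty):
    """Best total score of a chain of regions, minus a penalty per join.

    Each region is (a_start, a_end, b_start, b_end, score). A region may
    follow another only if it starts after it in BOTH sequences.
    """
    ordered = sorted(regions, key=lambda r: (r[0], r[2]))
    best_ending_at = []
    for j, (a_start, _a_end, b_start, _b_end, score) in enumerate(ordered):
        best = score  # the region on its own
        for i in range(j):
            prev = ordered[i]
            if prev[1] < a_start and prev[3] < b_start:
                best = max(best, best_ending_at[i] + score - join_penalty)
        best_ending_at.append(best)
    return max(best_ending_at, default=0)

regions = [
    (1, 10, 1, 10, 50),
    (15, 25, 18, 28, 40),
    (12, 20, 5, 9, 30),   # overlaps the first region in the second sequence
]
print(best_chain_score(regions, join_penalty=20))  # 70: regions 1 and 2 joined
print(best_chain_score(regions, join_penalty=45))  # 50: joining no longer pays
```

The second call shows the threshold effect described above: once the penalty outweighs the gain, the best single region is kept instead of the joined one.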
Fourth Step: Evaluation of Deletions and Insertions
In this phase, a gap penalty of -12 is applied for the first missing
residue and -4 for each additional residue.
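A gap penalty of -12 for the first missing residue and -4 for each additional one can be written as a function of the gap length. A minimal sketch follows; the function and parameter names are illustrative.

```python
def gap_penalty(length, first=-12, additional=-4):
    """Penalty for a gap of `length` residues: -12 for the first, -4 for each further one."""
    if length <= 0:
        return 0
    return first + additional * (length - 1)

for n in (1, 2, 3):
    print(n, gap_penalty(n))
# 1 -12
# 2 -16
# 3 -20
```

Because opening a gap costs more than extending it, one long gap is penalized less than several short ones of the same total length.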
Summarizing
Algorithm 1: uses an array of n+1 elements; O(n^2) time complexity and O(n) space complexity
Algorithm 2: uses an array of 2 elements; O(n^2) time complexity