IBB.MB.501 Database search and sequence alignment

Genome Science

Uploaded by

Muhammad Shahzad

Genome Science (IBB.MB.501)
Database search and sequence alignment

Introduction
– Over the past five decades the use of computers has had a profound
effect on research in the biological sciences
– The amount of information available to researchers in databases is
increasing almost exponentially, and biologists and computer scientists
are working together to provide bioinformatics tools that help extract
useful information from these databases
– The aim of these sessions is to introduce you to the use of some of the
information and software resources available in the public domain

Searching sequence databases
– Sequence databases exist for nucleic acids, proteins and complex
carbohydrates
– For nucleic acids and proteins the chemical structure is represented as
a string of characters, such as ACCGTA for nucleic acids or DFGIMCR
for proteins
– Database entries include much more information, or annotation, which
contains the biological, bibliographic and administrative context for
the sequence.

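The character-string representation above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name classify_sequence is hypothetical, and it assumes plain residue letters only (no IUPAC ambiguity codes, no gaps).

```python
# Alphabets assumed for this sketch: unambiguous DNA/RNA bases and the
# 20 standard amino acids. Real database entries may use more symbols.
NUCLEOTIDES = set("ACGTU")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def classify_sequence(seq):
    """Guess whether a residue string is a nucleic acid or a protein."""
    residues = set(seq.upper())
    if not residues:
        raise ValueError("empty sequence")
    # Nucleotides are tested first because A, C, G and T are also
    # valid amino acid letters.
    if residues <= NUCLEOTIDES:
        return "nucleic acid"
    if residues <= AMINO_ACIDS:
        return "protein"
    return "unknown"


print(classify_sequence("ACCGTA"))   # nucleic acid
print(classify_sequence("DFGIMCR"))  # protein
```

A short peptide made only of A, C, G and T would be reported as nucleic acid, so the guess is only as good as the input.
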
NA databases

– For nucleic acids, there are three major public domain databases:
– European Nucleotide Archive, from EMBL-EBI
– GenBank, from NCBI (USA)
– DDBJ (DNA DataBank of Japan)
– All three exchange data daily, so their contents are essentially identical

Sequence Alignment

Job Dispatcher Bioinformatics Tools
❏ 50+ tools (nucleotide and protein analysis)
❏ Recently added:
❏ R2DT
❏ SSEARCH2SEQ
❏ GGSEARCH2SEQ

Tool Categories
▪ Sequence Format Conversion (sfc)
▪ Protein Function Analysis (pfa)
▪ Sequence Operation (so)
▪ Sequence Statistics (seqstats)
▪ Sequence Translation (st)
▪ RNA Analysis (rna)
▪ Phylogeny (phylogeny)
▪ Pairwise Sequence Alignment (psa)
▪ Multiple Sequence Alignment (msa)
▪ Sequence Similarity Search (sss)
▪ EMBOSS Tools (emboss)

Sequence Format Conversion (sfc)

❏Convert one sequence format to another.

❏EMBOSS Seqret, MView

Advantages

• No local installation
• Workflows

Typical Bioinformatics Setup

How to access tools?
– EBI Services page:
– https://www.ebi.ac.uk/services

– Job Dispatcher tool category page:
– https://www.ebi.ac.uk/Tools/<category>
– E.g. https://www.ebi.ac.uk/Tools/msa (Multiple Sequence Alignment)

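The category page pattern can be turned into a small lookup. A minimal sketch, assuming the URL pattern and category codes exactly as listed in these slides (the live site may have changed since); the names TOOL_CATEGORIES and category_url are hypothetical.

```python
# Category codes as listed on the "Tool Categories" slide.
TOOL_CATEGORIES = {
    "sfc": "Sequence Format Conversion",
    "pfa": "Protein Function Analysis",
    "so": "Sequence Operation",
    "seqstats": "Sequence Statistics",
    "st": "Sequence Translation",
    "rna": "RNA Analysis",
    "phylogeny": "Phylogeny",
    "psa": "Pairwise Sequence Alignment",
    "msa": "Multiple Sequence Alignment",
    "sss": "Sequence Similarity Search",
    "emboss": "EMBOSS Tools",
}

# Base URL taken from the slides; treat it as an assumption.
BASE_URL = "https://www.ebi.ac.uk/Tools"


def category_url(code):
    """Build the category page URL for a known category code."""
    if code not in TOOL_CATEGORIES:
        raise KeyError(f"unknown tool category: {code}")
    return f"{BASE_URL}/{code}"


print(category_url("msa"))
```
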
https://www.ebi.ac.uk/services

Sequence Alignment

– Identify regions of similarity

Sequence Alignment

– Match, Mismatch, Gap

– Gap extension penalty
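
The match, mismatch and gap terms can be combined into a score for an alignment that already exists. A minimal sketch, assuming simple match/mismatch values instead of a substitution matrix, and assuming the convention that the first position of a gap costs gap_open and every further position costs gap_extend (tools differ on this). The function name and default values are illustrative only.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-10, gap_extend=-1):
    """Score two aligned rows of equal length; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_in_a = gap_in_b = False  # is a gap currently open in row a / row b?
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("column with two gaps")
        if x == "-":
            # Opening a gap is expensive, extending it is cheap.
            score += gap_extend if gap_in_a else gap_open
            gap_in_a, gap_in_b = True, False
        elif y == "-":
            score += gap_extend if gap_in_b else gap_open
            gap_in_a, gap_in_b = False, True
        else:
            score += match if x == y else mismatch
            gap_in_a = gap_in_b = False
    return score


print(score_alignment("ACGTA", "A---A"))    # one gap of length 3
print(score_alignment("ACGT-A", "AC-TTA"))  # two gaps of length 1
```

With these values one long gap (-10 -1 -1) is cheaper than two separate gaps (-10 -10), which is the purpose of a separate gap extension penalty.
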

Sequence alignment

– Similarity and Identity

– Substitution matrix

– https://github.com/kimrutherford/EMBOSS/tree/master/emboss/data

– Alignment score: calculated from the match/mismatch of residues using the substitution matrices

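Identity and similarity can be told apart with a short calculation: identical positions carry the same residue, similar positions also include substitutions that score above zero. A minimal sketch, assuming a toy scoring table with made-up values (a real run would load a full matrix such as one of the EMBOSS data files) and assuming percentages are taken over the full alignment length. All names are hypothetical.

```python
# Toy substitution scores for a few conservative pairs (illustrative values,
# not a real matrix). Any other mismatch gets the default mismatch score.
TOY_SCORES = {("L", "I"): 2, ("K", "R"): 2, ("D", "E"): 2}


def sub_score(x, y, match=4, mismatch=-2):
    """Look up a pair score in the toy table, in either order."""
    if x == y:
        return match
    return TOY_SCORES.get((x, y), TOY_SCORES.get((y, x), mismatch))


def identity_similarity(a, b):
    """Return (% identity, % similarity) over the alignment length."""
    if len(a) != len(b) or not a:
        raise ValueError("aligned rows must be non-empty and of equal length")
    identical = similar = 0
    for x, y in zip(a, b):
        if "-" in (x, y):
            continue  # gap columns count towards the length only
        if x == y:
            identical += 1
            similar += 1
        elif sub_score(x, y) > 0:
            similar += 1
    return 100 * identical / len(a), 100 * similar / len(a)


print(identity_similarity("LKDA-", "IKEAW"))
```
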
Sequence Alignment Types

– Pairwise
– Multiple

Pairwise Sequence Alignment

– Involves aligning two sequences using a scoring matrix
– Basis of database similarity search
– Dynamic programming for global alignment: Needleman-Wunsch algorithm
– Dynamic programming for local alignment: Smith-Waterman algorithm

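The global dynamic programming recurrence can be sketched in a few lines. This is a simplified Needleman-Wunsch scorer, assuming fixed match/mismatch values and a linear gap penalty (one cost per gapped position) rather than a substitution matrix with open/extend penalties; it returns only the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (linear gap penalty)."""
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # column 0: a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap       # gap in b
            left = cur[j - 1] + gap  # gap in a
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

Only two rows of the table are kept, so memory grows with the shorter dimension chosen for the inner loop; recovering the alignment would need the full table or a traceback structure.
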
Local and Global Alignment

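– Global: aligns the two sequences end to end, over their full length
– Local: aligns only the most similar region(s) of the two sequences
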
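The local case differs from the global one in two places: cell scores are never allowed to drop below zero, and the result is the best cell anywhere in the table. A minimal Smith-Waterman sketch under the same simplifying assumptions as before (fixed match/mismatch values, linear gap penalty, score only); the default values are illustrative.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between any substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # first row is all zeros: free start anywhere
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets a poor alignment be dropped and restarted.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


# The shared core GATTACA is found despite unrelated flanking sequence.
print(smith_waterman_score("TTTTGATTACATTTT", "CCCCGATTACACCCC"))
```
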
Pairwise Alignment Tools
https://www.ebi.ac.uk/Tools/psa/

– Needle
– Stretcher
– GGSEARCH2SEQ
– Water
– Matcher
– LALIGN
– SSEARCH2SEQ
– GeneWise

Which tool to use?
– Global alignment: Needle, Stretcher (big sequences), GGSEARCH2SEQ
– Local alignment: Water, LALIGN, Matcher (big sequences), SSEARCH2SEQ

https://www.ebi.ac.uk/Tools/psa/emboss_needle/

– Sequence input
– Parameters
– Submit!

Things to remember

– Selection of tool
– Choose local/global based on your requirement
– Selection of matrix
– BLOSUM{n}: a higher n focuses on more closely related proteins.
– PAM{n}: a higher n focuses on more distantly related proteins.

What is BLAST?

– Basic BLAST search


– What is BLAST?
– The framework of BLAST
– Different BLAST programs
– BLAST databases you can search
– Where can I run BLAST?

What is BLAST?

• BLAST stands for Basic Local Alignment Search Tool
• Why is BLAST popular?
- Good balance of sensitivity and speed
- Reliable
- Flexible
• Produces local alignments: short, significant stretches of similarity,
irrespective of where they are in the sequence

BLAST Programs
The most common BLAST searches use five programs:

Program   Database (Subject)   Query
BLASTN    Nucleotide           Nucleotide
BLASTP    Protein              Protein
BLASTX    Protein              Nt. ➔ Protein
TBLASTN   Nt. ➔ Protein        Protein
TBLASTX   Nt. ➔ Protein        Nt. ➔ Protein

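The table of programs can be read as a lookup on query type and database type. A minimal sketch of that lookup; the function name and the translated flag (used to pick TBLASTX over BLASTN for a nucleotide-against-nucleotide search) are hypothetical and not part of any BLAST interface.

```python
def choose_blast_program(query_type, db_type, translated=False):
    """Pick a BLAST program from 'nucleotide'/'protein' query and database types."""
    valid = {"nucleotide", "protein"}
    if query_type not in valid or db_type not in valid:
        raise ValueError("types must be 'nucleotide' or 'protein'")
    if query_type == "nucleotide" and db_type == "nucleotide":
        # TBLASTX translates both sides; BLASTN compares DNA directly.
        return "TBLASTX" if translated else "BLASTN"
    if query_type == "protein" and db_type == "protein":
        return "BLASTP"
    if query_type == "nucleotide":
        return "BLASTX"   # translated query against a protein database
    return "TBLASTN"      # protein query against a translated database


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("protein", "nucleotide"))
```
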
BLASTN

– BLASTN
– The query is a nucleotide sequence
– The database is a nucleotide database
– No conversion is done on the query or database
– DNA :: DNA homology
– Mapping oligos to a genome
– Annotating genomic DNA with transcriptome data from ESTs and RNA-Seq
– Annotating untranslated regions

BLASTP

– BLASTP
– The query is an amino acid sequence
– The database is an amino acid database
– No conversion is done on the query or database
– Protein :: Protein homology
– Protein function exploration
– Novel gene ➔ make parameters more sensitive

BLASTX

– BLASTX
– The query is a nucleotide sequence
– The database is an amino acid database
– The query is translated in all six reading frames, and each translation
is used to search the database
– Coding nucleotide seq :: Protein homology
– Gene finding in genomic DNA
– Annotating ESTs and transcripts assembled from RNA-Seq data

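The six reading frames are three offsets on the forward strand plus three on the reverse complement. A minimal sketch of the translation step, assuming the standard genetic code, uppercase A/C/G/T input, and "X" for any codon it cannot resolve; this illustrates the idea only and is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, written in the usual T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict of frame label ('+1'..'+3', '-1'..'-3') to protein string."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```
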
TBLASTN

– TBLASTN
– The query is an amino acid sequence
– The database is a nucleotide database
– The database is translated in all six reading frames and searched
with the protein sequence
– Protein :: Coding nucleotide DB homology
– Mapping a protein to a genome
– Mining ESTs and RNA-Seq data for protein similarities

TBLASTX

– TBLASTX
– The query is a nucleotide sequence
– The database is a nucleotide database
– Both the query and the database are translated in all six reading frames
– Coding :: Coding homology
– Searching distantly-related species
– Sensitive but expensive

BLAST output

1. List of sequences with scores
– Raw score
– Higher is better
– Depends on aligned length
– Expect value (E-value)
– Smaller is better
– Takes query length and database size into account
2. List of alignments

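The relation between raw score, bit score and E-value can be written out using the Karlin-Altschul formula E = K·m·n·e^(−λS), where m is the query length and n the database length. A minimal sketch; the values of λ and K depend on the scoring system, and the numbers used below are illustrative placeholders, not values reported by any particular search.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw score S into bits: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_raw(raw_score, m, n, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """Same quantity from the bit score: E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative parameters (assumed): query of 300 residues, database of 1e8.
lam, k = 0.267, 0.041
m, n = 300, 100_000_000
raw = 120
bits = bit_score(raw, lam, k)
print(round(bits, 1), e_value_from_raw(raw, m, n, lam, k))
print(e_value_from_bits(bits, m, n))  # same E-value via the bit score
```

Doubling the database size doubles the E-value for the same raw score, which is why an E-value should be read together with the database that was searched.
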
Multiple Sequence Alignment
– Multiple Sequence Alignment (MSA) can be seen as a generalization of
a Pairwise Sequence Alignment (PSA). Instead of aligning just two
sequences, three or more sequences are aligned simultaneously.

– MSA is used for:


– Detection of conserved domains in a group of genes or proteins
(conservation analysis)
– Construction of a phylogenetic tree
– Prediction of a protein function/structure

– Determination of a consensus sequence

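Determining a consensus sequence from an alignment amounts to a column-by-column vote. A minimal sketch, assuming a simple plurality rule that ignores gaps and breaks ties by first occurrence; tools such as EMBOSS Cons apply their own thresholds, so this is not their method.

```python
from collections import Counter


def consensus(aligned, gap="-"):
    """Most common non-gap residue in each column of an alignment."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("need at least one row, all of the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != gap)
        # A column made only of gaps stays a gap in the consensus.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


alignment = [
    "ACG-T",
    "ACGAT",
    "TCGAT",
]
print(consensus(alignment))
```
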
https://www.ebi.ac.uk/Tools/msa/
https://www.ebi.ac.uk/Tools/<tool category>

Multiple Sequence Alignment Tools
https://www.ebi.ac.uk/Tools/msa/

– Clustal Omega
– Kalign
– MAFFT
– MUSCLE
– T-Coffee
– EMBOSS Cons

Multiple Sequence Alignment Tools
– Use heuristics
– Progressive alignment
– E.g. Clustal Omega
– Iterative alignment
– E.g. MAFFT, MUSCLE, Clustal Omega
– Consistency-based alignment
– E.g. T-Coffee
– Profile (HMM-based) alignment

– E.g. Clustal Omega

https://www.ebi.ac.uk/Tools/msa/clustalo

Consensus Symbols

– An * (asterisk) indicates positions which have a single, fully conserved
residue.
– A : (colon) indicates conservation between groups of amino acids
with strongly similar properties.
– A . (period) indicates conservation between groups of amino acids
with weakly similar properties.

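The three consensus symbols can be reproduced for a protein alignment with a column-wise check. A minimal sketch; the strong and weak residue groups below are the ones commonly quoted for Clustal and are included here as an assumption that should be checked against the tool's documentation. Columns containing a gap are left blank.

```python
# Residue groups as commonly quoted for Clustal (assumed, verify before use).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_line(aligned):
    """Return one symbol per column: '*', ':', '.' or ' '."""
    symbols = []
    for column in zip(*aligned):
        residues = set(column)
        if "-" in residues:
            symbols.append(" ")
        elif len(residues) == 1:
            symbols.append("*")  # single, fully conserved residue
        elif any(residues <= set(group) for group in STRONG_GROUPS):
            symbols.append(":")  # all residues within one strong group
        elif any(residues <= set(group) for group in WEAK_GROUPS):
            symbols.append(".")  # all residues within one weak group
        else:
            symbols.append(" ")
    return "".join(symbols)


alignment = ["MKLS-", "MRIAW", "MKVGW"]
for row in alignment:
    print(row)
print(conservation_line(alignment))
```
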
Things to remember

– Check the input size limit (depends on tool)

– Tool errors (e.g. the input is not in a proper file format, or only a single sequence is provided)

Things to remember

– Input format
– Try using FASTA format
– Unique sequence identifiers
– First 30 characters in identifier should be unique
– Include sequence!

– Job can’t be found/other error
– Results deleted after 7 days
– Some sequence/program combinations run out of memory
– Use a different program

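The input checks listed above (FASTA format, unique identifiers in the first 30 characters, a sequence for every header, more than one sequence) can be run before submitting a job. A minimal sketch; the function name check_fasta and the wording of the messages are hypothetical, and the rules are only those stated on the slides.

```python
def check_fasta(text, id_prefix_len=30):
    """Return a list of problems found in FASTA-formatted text."""
    problems = []
    records = []  # (header, sequence) pairs in input order
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            problems.append("sequence data before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    if len(records) < 2:
        problems.append("fewer than two sequences")
    seen = set()
    for name, seq in records:
        prefix = name[:id_prefix_len]
        if prefix in seen:
            problems.append(f"duplicate identifier prefix: {prefix}")
        seen.add(prefix)
        if not seq:
            problems.append(f"no sequence for: {name}")
    return problems


print(check_fasta(">seq1\nACGT\n>seq2\nACGA\n"))  # no problems
print(check_fasta(">seq1\nACGT\n>seq1\n"))        # duplicate id, missing sequence
```
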
Which tool should I use?
– 3-100 sequences of typical protein length
– MUSCLE, T-Coffee, MAFFT, Clustal Omega

– 100-500 sequences
– Clustal Omega, MUSCLE, MAFFT

– >500 sequences
– Clustal Omega, Kalign

Which tool should I use?

– Small number of unusually long sequences
– Kalign, MAFFT (fast)
– DNA
– MAFFT, Kalign, MUSCLE

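The rules of thumb on the two "Which tool should I use?" slides can be written as a small helper. A minimal sketch; the function name is hypothetical, the boundary values 100 and 500 are assigned to the smaller bracket, and the order in which the DNA and long-sequence cases are checked is an assumption, since the slides do not rank them.

```python
def suggest_msa_tools(n_sequences, dna=False, unusually_long=False):
    """Suggest MSA tools following the rules of thumb from the slides."""
    if n_sequences < 3:
        raise ValueError("the slides give recommendations for 3 or more sequences")
    if dna:
        return ["MAFFT", "Kalign", "MUSCLE"]
    if unusually_long:
        return ["Kalign", "MAFFT"]
    if n_sequences <= 100:
        return ["MUSCLE", "T-Coffee", "MAFFT", "Clustal Omega"]
    if n_sequences <= 500:
        return ["Clustal Omega", "MUSCLE", "MAFFT"]
    return ["Clustal Omega", "Kalign"]


print(suggest_msa_tools(40))
print(suggest_msa_tools(2000))
print(suggest_msa_tools(10, dna=True))
```
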
Final remarks

– Don’t assume a single tool will cater for all your needs

– Change the parameters of the tools

– Remember where the tool excels and what its limitations are

– A tool intended for a specific task A can also be used for task B (and
may be better than the tool intended for task B specifically!)

– Crazy input will always give crazy results!