IBB.MB.501 Database search and sequence alignment
Introduction
– Over the past five decades the use of computers has had a profound
effect on research in the biological sciences
– The amount of information available to researchers in databases is
increasing almost exponentially, and biologists and computer scientists
are working together to provide bioinformatics tools that help extract
useful information from these databases
– The aim of these sessions is to introduce you to the use of some of the
information and software resources available in the public domain
Searching sequence databases
– Sequence databases exist for nucleic acids, proteins and complex
carbohydrates
– For nucleic acids and proteins the chemical structure is represented as
a string of characters, such as ACCGTA for nucleic acids or DFGIMCR
for proteins
– Database entries include much more information, or annotation, which
contains the biological, bibliographic and administrative context for
the sequence.
NA databases
– For nucleic acids, there are three major public domain databases:
– European Nucleotide Archive, from EMBL-EBI
– GenBank, from NCBI (USA)
– DDBJ (DNA Data Bank of Japan)
– All exchange information daily, so that they are essentially identical
Sequence Alignment
Job Dispatcher
Bioinformatics Tools (nucleotide and protein analysis)
❏ 50+ tools
❏ Recently added:
❏ R2DT
❏ SSEARCH2SEQ
❏ GGSEARCH2SEQ
Tool Categories
▪ Sequence Format Conversion (sfc)
▪ Protein Function Analysis (pfa)
▪ Sequence Operation (so)
▪ Sequence Statistics (seqstats)
▪ Sequence Translation (st)
▪ RNA Analysis (rna)
▪ Phylogeny (phylogeny)
▪ Pairwise Sequence Alignment (psa)
▪ Multiple Sequence Alignment (msa)
▪ Sequence Similarity Search (sss)
▪ EMBOSS Tools (emboss)
Sequence Format Conversion (sfc)
Advantages
• No local installation
• Workflows
Typical Bioinformatics Setup
How to access tools?
– EBI Services page: https://www.ebi.ac.uk/services
Sequence alignment
– Substitution matrix
– https://github.com/kimrutherford/EMBOSS/tree/master/emboss/data
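A substitution matrix gives a score for every pair of aligned residues, and the score of an alignment is the sum over its columns. The sketch below illustrates this with a toy nucleotide matrix. The values (+5 match, -4 mismatch), the flat gap penalty and the names `TOY_MATRIX` and `score_alignment` are assumptions made for illustration; they are not read from the EMBOSS data files.

```python
# Toy nucleotide substitution matrix: +5 for a match, -4 for a mismatch.
# These values are illustrative assumptions, not copied from any EMBOSS file.
BASES = "ACGT"
TOY_MATRIX = {(a, b): (5 if a == b else -4) for a in BASES for b in BASES}
GAP_PENALTY = -10  # hypothetical flat penalty per gap column


def score_alignment(row_a: str, row_b: str) -> int:
    """Sum the substitution scores over the columns of a gapped alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += GAP_PENALTY       # a gap in either row costs the penalty
        else:
            total += TOY_MATRIX[(x, y)]  # look up the residue pair
    return total


if __name__ == "__main__":
    # Columns: A/A, C/C, C/G, G/-, T/T, A/A
    print(score_alignment("ACCGTA", "ACG-TA"))
```

Protein matrices such as BLOSUM or PAM work the same way, only with a 20 x 20 table of residue pairs.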
Sequence Alignment Types
– Pairwise
– Multiple
Pairwise Sequence Alignment algorithm
– Dynamic programming for local alignment: Smith-Waterman algorithm
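A minimal sketch of Smith-Waterman local alignment by dynamic programming. It assumes a simple scoring scheme (match +2, mismatch -1, linear gap penalty -2) in place of a substitution matrix and affine gaps; the function name and parameter values are illustrative and do not reflect the defaults of any EBI tool.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a new local alignment here
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGTGG"))  # (8, 'ACGT', 'ACGT')
```

The `0` option in the maximum is what makes the alignment local: a poorly scoring prefix is dropped instead of being carried along.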
Local and Global Alignment
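– Global: aligns the two sequences end to end, over their entire length
– Local: aligns only the most similar regions within the two sequences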
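To make the global case concrete, here is a hedged sketch of the Needleman-Wunsch score computation. The scoring values (match +1, mismatch -1, linear gap -2) and the function name are assumptions for illustration only. Unlike a local alignment, every residue of both sequences must be accounted for, so leading and trailing gaps are penalised.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best end-to-end (global) alignment of a and b."""
    # Row 0: aligning an empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] aligned against nothing
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            ))
        prev = curr
    return prev[-1]  # bottom-right cell covers both full sequences


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "ACGT"))  # 4
    print(needleman_wunsch_score("ACGT", "AGT"))   # 1 (three matches, one gap)
```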
Pairwise Alignment Tools
https://www.ebi.ac.uk/Tools/psa/
– Needle
– Stretcher
– GGSEARCH2SEQ
– Water
– Matcher
– LALIGN
– SSEARCH2SEQ
– GeneWise
Which tool to use?
– Global Alignment: Needle, Stretcher, GGSEARCH2SEQ
– Local Alignment: Water, Matcher, LALIGN, SSEARCH2SEQ
https://www.ebi.ac.uk/Tools/psa/emboss_needle/
– Sequence input
– Parameters
– Submit!
Things to remember
– Selection of tool
– Selection of matrix
– BLOSUM{n}: a higher n focuses on more closely related proteins.
– PAM{n}: a higher n focuses on more distantly related proteins.
What is BLAST?
– BLAST: Basic Local Alignment Search Tool
– Finds regions of local similarity between sequences, irrespective of
where they are in the sequence
BLAST Programs
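The most common BLAST searches use five programs:
Program   Query            Database
BLASTN    Nt.              Nt.
BLASTP    Protein          Protein
BLASTX    Nt. ➔ Protein    Protein
TBLASTN   Protein          Nt. ➔ Protein
TBLASTX   Nt. ➔ Protein    Nt. ➔ Protein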
BLASTN
– BLASTN
– The query is a nucleotide sequence
– The database is a nucleotide database
– No conversion is done on the query or database
– DNA :: DNA homology
– Mapping oligos to a genome
– Annotating genomic DNA with transcriptome data from ESTs
and RNA-Seq
– Annotating untranslated regions
BLASTP
– BLASTP
– The query is an amino acid sequence
– The database is an amino acid database
– No conversion is done on the query or database
– Protein :: Protein homology
– Protein function exploration
– Novel gene ➔ make parameters more sensitive
BLASTX
– BLASTX
– The query is a nucleotide sequence
– The database is an amino acid database
– The query is translated in all six reading frames and each translation
is used to search the database
– Coding nucleotide seq :: Protein homology
– Gene finding in genomic DNA
– Annotating ESTs and transcripts assembled from RNA-Seq
data
TBLASTN
– TBLASTN
– The query is an amino acid sequence
– The database is a nucleotide database
– The database is translated in all six reading frames and searched
with the protein sequence
– Protein :: Coding nucleotide DB homology
– Mapping a protein to a genome
– Mining ESTs and RNA-Seq data for protein similarities
TBLASTX
– TBLASTX
– The query is a nucleotide sequence
– The database is a nucleotide database
– Both the query and the database are translated in all six reading
frames
– Coding :: Coding homology
– Searching distantly-related species
– Sensitive but expensive
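BLASTX, TBLASTN and TBLASTX all rely on six-frame translation: three reading frames on the forward strand and three on the reverse complement. The sketch below shows the idea using the standard genetic code. The function names and the frame labels (`+1` to `-3`) are assumptions for illustration, and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

TBLASTX does this for the query and for every database sequence, which is why it is sensitive but computationally expensive.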
BLAST output
– Independent of length and database size
2. List of alignments
Multiple Sequence Alignment
– Multiple Sequence Alignment (MSA) can be seen as a generalization of
a Pairwise Sequence Alignment (PSA). Instead of aligning just two
sequences, three or more sequences are aligned simultaneously.
– Determination of a consensus sequence
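As an illustration of deriving a consensus from an MSA, the sketch below takes the most frequent non-gap residue in each column. This simple majority rule and the function name `consensus` are assumptions made for the example; dedicated tools such as EMBOSS Cons apply their own, more elaborate rules.

```python
from collections import Counter


def consensus(aligned_rows, gap="-"):
    """Most frequent non-gap character per column of an alignment."""
    if not aligned_rows:
        raise ValueError("need at least one aligned sequence")
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*aligned_rows):
        counts = Counter(ch for ch in column if ch != gap)
        # A column made only of gaps stays a gap in the consensus.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


if __name__ == "__main__":
    msa = ["ACGT-A",
           "ACGTTA",
           "AGGT-A"]
    print(consensus(msa))  # ACGTTA
```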
https://www.ebi.ac.uk/Tools/msa/
https://www.ebi.ac.uk/Tools/<tool category>
Multiple Sequence Alignment Tools
https://www.ebi.ac.uk/Tools/msa/
– Clustal Omega
– Kalign
– MAFFT
– MUSCLE
– T-Coffee
– EMBOSS Cons
Multiple Sequence Alignment Tools
– Use heuristics
– Progressive alignment
– E.g. Clustal Omega
– Iterative alignment
– E.g. MAFFT, MUSCLE, Clustal Omega
– Consistency-based alignment
– E.g. T-Coffee
– Profile (HMM-based) alignment
– E.g. Clustal Omega
https://www.ebi.ac.uk/Tools/msa/clustalo
Consensus Symbols
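– * (asterisk): position with a single, fully conserved residue
– : (colon): conservation between groups of residues with strongly
similar properties
– . (period): conservation between groups of residues with weakly
similar properties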
Things to remember
– Tool errors (e.g. the input is not in a proper file format, or you
provide only a single sequence)
Things to remember
– Input format
– Try using FASTA format
– Unique sequence identifiers
– First 30 characters in identifier should be unique
– Include sequence!
– Results deleted after 7 days
– Some sequence/program combinations run out of memory
– Use a different program
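The input checks above can be run locally before submitting a job. The following is a minimal sketch, assuming plain FASTA input; the function names, the message wording and the `min_sequences=2` default are hypothetical choices for this example, not rules taken from the EBI tools.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_msa_input(text, id_length=30, min_sequences=2):
    """Return a list of human-readable problems; empty if none were found."""
    records = parse_fasta(text)
    problems = []
    if len(records) < min_sequences:
        problems.append(
            f"only {len(records)} sequence(s); need at least {min_sequences}")
    seen = set()
    for header, sequence in records:
        tokens = header.split()
        # The identifier is the first word of the header, cut to id_length.
        key = tokens[0][:id_length] if tokens else ""
        if key in seen:
            problems.append(
                f"identifier not unique in first {id_length} characters: {key!r}")
        seen.add(key)
        if not sequence:
            problems.append(f"no sequence given for {key!r}")
    return problems


if __name__ == "__main__":
    print(check_msa_input(">seq1\nACGT\n>seq1\n"))
```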
Which tool should I use?
– 3-100 sequences of typical protein length
– MUSCLE, T-Coffee, MAFFT, Clustal Omega
– 100-500 sequences
– Clustal Omega, MUSCLE, MAFFT
– >500 sequences
– Clustal Omega, Kalign
Which tool should I use?
– DNA
– MAFFT, Kalign, MUSCLE
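The rules of thumb for choosing an MSA tool can be written down as a small lookup. This is only a sketch of the guidance on these slides: the function name is hypothetical, and because the slides give overlapping ranges (3-100 and 100-500), assigning exactly 100 sequences to the first group is an assumption.

```python
def suggest_msa_tools(n_sequences, dna=False):
    """Suggest MSA tools from the number of sequences and the sequence type."""
    if n_sequences < 3:
        raise ValueError("the guidance starts at three sequences")
    if dna:
        return ["MAFFT", "Kalign", "MUSCLE"]
    if n_sequences <= 100:   # boundary choice is an assumption
        return ["MUSCLE", "T-Coffee", "MAFFT", "Clustal Omega"]
    if n_sequences <= 500:
        return ["Clustal Omega", "MUSCLE", "MAFFT"]
    return ["Clustal Omega", "Kalign"]


if __name__ == "__main__":
    print(suggest_msa_tools(50))
    print(suggest_msa_tools(2000))
    print(suggest_msa_tools(20, dna=True))
```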
Final remarks
– Don’t assume a single tool will cater for all your needs
– Remember where the tool excels and what its limitations are
– A tool intended for specific task A can also be used for task B (and
may be better than the tool intended for task B specifically!)