IBB.MB.501 Database search and sequence alignment

Genome Science

Uploaded by

Muhammad Shahzad

Genome Science (IBB.MB.501)
Database search and sequence alignment

Introduction
– Over the past five decades the use of computers has had a profound
effect on research in the biological sciences
– The amount of information available to researchers in databases is
increasing almost exponentially, and biologists and computer scientists
are coming together to provide bioinformatics tools that help extract
useful information from these databases
– The aim of these sessions is to introduce you to the use of some of the
information and software resources available in the public domain

Searching sequence databases
– Sequence databases exist for nucleic acids, proteins and complex
carbohydrates
– For nucleic acids and proteins the chemical structure is represented as
a string of characters, such as ACCGTA for nucleic acids or DFGIMCR
for proteins
– Database entries include much more information, or annotation, which
contains the biological, bibliographic and administrative context for
the sequence.
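
As a small illustration of sequences as character strings, the following Python sketch guesses whether a string is a nucleic acid or a protein from its alphabet. The function name and the simplified alphabets (no IUPAC ambiguity codes) are assumptions made for this example, not part of any database's tooling.

```python
# Simplified alphabets; real entries may also use ambiguity codes (an assumption here).
NUCLEOTIDES = set("ACGTU")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def guess_sequence_type(seq: str) -> str:
    """Return 'nucleic acid', 'protein' or 'unknown' for a sequence string."""
    letters = set(seq.upper())
    if not letters:
        return "unknown"
    if letters <= NUCLEOTIDES:
        return "nucleic acid"
    if letters <= AMINO_ACIDS:
        return "protein"
    return "unknown"


print(guess_sequence_type("ACCGTA"))   # example string from the slide
print(guess_sequence_type("DFGIMCR"))  # example string from the slide
```

A short peptide made up only of A, C, G and T would be classified as a nucleic acid, which is one reason database entries carry annotation rather than relying on the string alone.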

NA databases

– For nucleic acids, there are three major public domain databases:
– European Nucleotide Archive (ENA), from EMBL-EBI
– GenBank, from NCBI (USA)
– DDBJ (DNA Data Bank of Japan)
– All three exchange data daily (as the International Nucleotide Sequence
Database Collaboration, INSDC), so their contents are essentially identical

Sequence Alignment

Job Dispatcher Bioinformatics Tools

❏50+ tools (nucleotide and protein analysis)

❏Recently added:

❏R2DT

❏SSEARCH2SEQ

❏GGSEARCH2SEQ

Tool Categories
▪ Sequence Format Conversion (sfc)
▪ Protein Function Analysis (pfa)
▪ Sequence Operation (so)
▪ Sequence Statistics (seqstats)
▪ Sequence Translation (st)
▪ RNA Analysis (rna)
▪ Phylogeny (phylogeny)
▪ Pairwise Sequence Alignment (psa)
▪ Multiple Sequence Alignment (msa)
▪ Sequence Similarity Search (sss)
▪ EMBOSS Tools (emboss)

Sequence Format Conversion (sfc)

❏Convert one sequence format to another.

❏EMBOSS Seqret, MView

Advantages

• No local installation
• Workflows

Typical Bioinformatics Setup

How to access tools?
– EBI Service page:
– https://www.ebi.ac.uk/services

– Job Dispatcher tool category page:

– https://www.ebi.ac.uk/Tools/<category>
– E.g. https://www.ebi.ac.uk/Tools/msa (Multiple Sequence Alignment)

https://www.ebi.ac.uk/services

Sequence Alignment

– Identify regions of similarity

Sequence Alignment

– Match, Mismatch, Gap

– Gap extension penalty
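
To make the scoring terms concrete, here is a minimal Python sketch that scores an already aligned pair of rows using match, mismatch, gap opening and gap extension values. The function name, the default values, and the convention that the first gap position costs `gap_open` and each further position costs `gap_extend` are assumptions for illustration; tools differ in their exact conventions.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two aligned rows ('-' marks a gap) with affine gap penalties."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap_a = in_gap_b = False  # currently inside a gap run in row_a / row_b?
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-":
            score += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif y == "-":
            score += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            score += match if x == y else mismatch
            in_gap_a = in_gap_b = False
    return score


# Four matches (+8), one gap opening (-5) and one gap extension (-1)
print(score_alignment("ACGTTA", "AC--TA"))
```

Because extending a gap is cheaper than opening a new one, this kind of scoring favours a few long gaps over many short ones.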

Sequence alignment

– Similarity and Identity

– Substitution matrix

– https://github.com/kimrutherford/EMBOSS/tree/master/emboss/data

– Alignment score: calculated from the matches and mismatches of residues
using the substitution matrix
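
The difference between identity and similarity can be sketched in Python as follows. In real tools, "similar" residues are those with a positive score in the chosen substitution matrix; the small `SIMILAR_GROUPS` list below is a hand-picked stand-in for that, and both it and the choice of the alignment length as denominator are assumptions for this example only.

```python
# Hypothetical groups of residues treated as "similar" (illustration only;
# a real tool would consult a substitution matrix such as BLOSUM62).
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "NQ", "ST"]


def is_similar(x, y):
    """True if the residues are identical or share a group."""
    return x == y or any(x in g and y in g for g in SIMILAR_GROUPS)


def identity_and_similarity(row_a, row_b):
    """Return (% identity, % similarity) over the alignment length."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    identical = similar = 0
    for x, y in zip(row_a, row_b):
        if "-" in (x, y):
            continue  # gap columns count towards the length only
        if x == y:
            identical += 1
        if is_similar(x, y):
            similar += 1
    length = len(row_a)
    return 100 * identical / length, 100 * similar / length


ident, sim = identity_and_similarity("MKT-LV", "MRSALI")
print(f"identity {ident:.1f}%, similarity {sim:.1f}%")
```

Similarity is never lower than identity, since every identical pair also counts as similar.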

Sequence Alignment Types

PAIRWISE MULTIPLE

Pairwise Sequence Alignment

– Involves aligning two sequences using a scoring matrix

– Basis of database similarity searches

– Dynamic programming for global alignment: Needleman-Wunsch algorithm

– Dynamic programming for local alignment: Smith-Waterman algorithm

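The dynamic programming idea behind global alignment can be sketched in a few lines of Python. This is a teaching version of Needleman-Wunsch with a simple match/mismatch score and a linear gap penalty (all default values are assumptions); tools such as Needle use substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

The table has one cell per pair of prefixes, so time and memory grow with the product of the two sequence lengths.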
Local and Global Alignment

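– Global: aligns the two sequences end to end, over their entire length

– Local: aligns only the most similar regions of the two sequences
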
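For comparison with global alignment, here is a minimal Python sketch of local alignment in the style of Smith-Waterman. Scores are never allowed to drop below zero and the traceback starts from the highest-scoring cell, so only the best-matching region is reported. The scoring values and the linear gap penalty are assumptions for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_a, aligned_b) for the best region."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new local alignment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTTT", "GGACGTGG"))
```

Here the flanking residues differ, so only the shared ACGT core is aligned, whereas a global alignment would be forced to include the flanks.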
Pairwise Alignment Tools
https://www.ebi.ac.uk/Tools/psa/

– Needle
– Stretcher
– GGSEARCH2SEQ
– Water
– Matcher
– LALIGN
– SSEARCH2SEQ
– GeneWise

Which tool to use?
– Global Alignment
– Needle
– Stretcher (big sequences)
– GGSEARCH2SEQ

– Local Alignment
– Water
– LALIGN
– Matcher (big sequences)
– SSEARCH2SEQ

https://www.ebi.ac.uk/Tools/psa/emboss_needle/

– Sequence input
– Parameters
– Submit!

Things to remember

– Selection of tool

– Choose local/global based on your requirement

– Selection of Matrix

– BLOSUM{n}: a higher n is suited to more closely related proteins
– PAM{n}: a higher n is suited to more distantly related proteins

What is BLAST?

– Basic BLAST search

– What is BLAST?
– The framework of BLAST
– Different BLAST programs
– BLAST databases you can search
– Where can I run BLAST?

What is BLAST?

• BLAST stands for Basic Local Alignment Search Tool
• Why is BLAST popular?
- Good balance of sensitivity and speed
- Reliable
- Flexible
• Produces local alignments: short, significant stretches of
similarity, irrespective of where they are in the sequence

BLAST Programs
The most common BLAST searches use five programs:

Program   Database (Subject)   Query
BLASTN    Nucleotide           Nucleotide
BLASTP    Protein              Protein
BLASTX    Protein              Nt. ➔ Protein
TBLASTN   Nt. ➔ Protein        Protein
TBLASTX   Nt. ➔ Protein        Nt. ➔ Protein

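The table of programs can be read as a simple lookup from query and database type to program name. The Python sketch below encodes it; the function name and the `translate` flag (used to tell BLASTN from TBLASTX when both sides are nucleotide) are hypothetical and exist only for this illustration.

```python
def choose_blast_program(query_type, database_type, translate=False):
    """Pick a BLAST program from the query and database types.

    Types are 'nucleotide' or 'protein'. `translate` only matters when both
    are nucleotide: True compares six-frame translations (TBLASTX).
    """
    key = (query_type, database_type)
    if key == ("nucleotide", "nucleotide"):
        return "TBLASTX" if translate else "BLASTN"
    programs = {
        ("protein", "protein"): "BLASTP",
        ("nucleotide", "protein"): "BLASTX",   # query is translated
        ("protein", "nucleotide"): "TBLASTN",  # database is translated
    }
    if key not in programs:
        raise ValueError(f"unknown combination: {key}")
    return programs[key]


print(choose_blast_program("nucleotide", "protein"))
```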
BLASTN

– BLASTN
– The query is a nucleotide sequence
– The database is a nucleotide database
– No conversion is done on the query or database
– DNA :: DNA homology
– Mapping oligos to a genome
– Annotating genomic DNA with transcriptome data from ESTs
and RNA-Seq
– Annotating untranslated regions

BLASTP

– BLASTP
– The query is an amino acid sequence
– The database is an amino acid database
– No conversion is done on the query or database
– Protein :: Protein homology
– Protein function exploration
– Novel gene ➔ make parameters more sensitive

BLASTX

– BLASTX
– The query is a nucleotide sequence
– The database is an amino acid database
– All six reading frames are translated on the query and used
to search the database
– Coding nucleotide seq :: Protein homology
– Gene finding in genomic DNA
– Annotating ESTs and transcripts assembled from RNA-Seq
data
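
Six-frame translation, as used on the query here, can be sketched in plain Python. The example assumes the standard genetic code and names the frames +1 to +3 (forward strand) and -1 to -3 (reverse complement); frame-naming conventions vary between tools, and the function names are illustrative only.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame name to its protein translation."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six translations is then compared with the protein database, which is why translated searches cost more than BLASTN or BLASTP.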

TBLASTN

– TBLASTN
– The query is an amino acid sequence
– The database is a nucleotide database
– All six frames are translated in the database and searched
with the protein sequence
– Protein :: Coding nucleotide DB homology
– Mapping a protein to a genome
– Mining ESTs and RNA-Seq data for protein similarities

TBLASTX

– TBLASTX
– The query is a nucleotide sequence
– The database is a nucleotide database
– All six frames are translated on the query and on the
database
– Coding :: Coding homology
– Searching distantly-related species
– Sensitive but expensive

BLAST output

1. List of sequences with scores

– Raw score
– Higher is better
– Depends on aligned length
– Expect Value (E-value)
– Smaller is better
– Takes query length and database size into account (a given score is
less significant in a larger database)
2. List of alignments
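
The relationship between raw score, bit score and E-value can be sketched with the Karlin-Altschul formula E = K·m·n·e^(−λS), where m and n are the query and database lengths. The default values of `lam` and `k` below are illustrative assumptions only; real values depend on the scoring matrix and gap penalties, and BLAST itself uses corrected "effective" lengths, which this sketch ignores.

```python
import math


def bit_score(raw_score, lam=0.318, k=0.134):
    """Normalised score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


small_db = expect_value(50, query_len=300, db_len=1_000_000)
large_db = expect_value(50, query_len=300, db_len=2_000_000)
print(small_db, large_db)  # the same raw score gives twice the E-value
print(bit_score(50))
```

Doubling the database size doubles the E-value for the same raw score, while a higher score lowers it exponentially.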

Multiple Sequence Alignment
– Multiple Sequence Alignment (MSA) can be seen as a generalization of
a Pairwise Sequence Alignment (PSA). Instead of aligning just two
sequences, three or more sequences are aligned simultaneously.

– MSA is used for:

– Detection of conserved domains in a group of genes or proteins
(conservation analysis)
– Construction of a phylogenetic tree
– Prediction of protein function/structure
– Determination of a consensus sequence

https://www.ebi.ac.uk/Tools/msa/

https://www.ebi.ac.uk/Tools/<tool category>

Multiple Sequence Alignment Tools
https://www.ebi.ac.uk/Tools/msa/

– Clustal Omega

– Kalign

– MAFFT

– MUSCLE

– T-Coffee

– EMBOSS Cons

Multiple Sequence Alignment Tools
– Use heuristics
– Progressive alignment
– E.g. Clustal Omega
– Iterative alignment
– E.g. MAFFT, MUSCLE, Clustal Omega
– Consistency-based alignment
– E.g. T-Coffee
– Profile (HMM-based) alignment
– E.g. Clustal Omega

https://www.ebi.ac.uk/Tools/msa/clustalo

Consensus Symbols

– An * (asterisk) indicates positions which have a single, fully conserved
residue.

– A : (colon) indicates conservation between groups of amino acids
with strongly similar properties

– A . (period) indicates conservation between groups of amino acids
with weakly similar properties
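
The three symbols can be reproduced for a small protein alignment with the Python sketch below. The "strong" and "weak" residue groups are written out here as an assumption, following the groups commonly documented for Clustal; check them against the tool's own documentation before relying on them. Columns containing a gap get a blank in this sketch.

```python
# Residue groups assumed for illustration (as commonly documented for Clustal).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
        "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"  # single, fully conserved residue
    if any(residues <= set(group) for group in STRONG):
        return ":"  # all residues fall within one strong group
    if any(residues <= set(group) for group in WEAK):
        return "."  # all residues fall within one weak group
    return " "


def consensus_line(rows):
    """Build the consensus symbol line for equally long aligned rows."""
    if len(set(map(len, rows))) != 1:
        raise ValueError("rows must be non-empty and of equal length")
    return "".join(column_symbol("".join(col)) for col in zip(*rows))


rows = ["MKSCAG", "MRTSW-", "MKSCAG"]
for row in rows:
    print(row)
print(consensus_line(rows))
```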

Things to remember

– Check the input size limit (depends on tool)

– Tool errors (e.g. the input is not in a proper file format, or only a single sequence is provided)

Things to remember

– Input format
– Try using FASTA format
– Unique sequence identifiers
– First 30 characters in identifier should be unique
– Include sequence!

– Job can’t be found/other error
– Results deleted after 7 days
– Some sequence/program combinations run out of memory
– Use a different program
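
The input checks listed above (FASTA format, unique identifiers within the first 30 characters, a sequence present for every entry, more than one sequence) can be run locally before submitting a job. The Python sketch below is one way to do that; the function names are hypothetical and the parser handles only plain FASTA text.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_msa_input(text, id_length=30):
    """Return a list of problems found in FASTA input for an alignment job."""
    records = parse_fasta(text)
    problems = []
    if len(records) < 2:
        problems.append("an alignment needs more than one sequence")
    seen = set()
    for header, seq in records:
        key = header[:id_length]
        if key in seen:
            problems.append(f"identifier not unique in first {id_length} characters: {key!r}")
        seen.add(key)
        if not seq:
            problems.append(f"no sequence for {header!r}")
    return problems


print(check_msa_input(">seq1\nACGT\n>seq1\n"))
```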
Which tool should I use?
– 3-100 sequences of typical protein length
– MUSCLE, T-Coffee, MAFFT, Clustal Omega

– 100-500 sequences
– Clustal Omega, MUSCLE, MAFFT

– >500 sequences
– Clustal Omega, Kalign

Which tool should I use?

– Small number of unusually long sequences

– Kalign, MAFFT (fast)

– DNA
– MAFFT, Kalign, MUSCLE

Final remarks

– Don’t assume a single tool will cater for all your needs

– Change the parameters of the tools

– Remember where the tool excels and what its limitations are

– A tool intended for specific task A can also be used for task B (and
may be better than the tool intended for task B specifically!)

– Crazy input will always give crazy results!
