BioPerl Tutorial
TUTORIAL (ABREV.)
The second argument of write_sequence, 'fasta', is the sequence format. You can choose among all the formats supported by SeqIO. Try writing the sequence file in 'genbank' format.
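As a minimal sketch of the point above, the following script writes the same sequence in two formats by changing only the format argument. It assumes Bio::Perl is installed; the sequence string, name, accession, and file names are made-up illustration values, and new_sequence is used so that no network access is needed.

```perl
use strict;
use warnings;
use Bio::Perl;

# Build a small sequence locally (made-up data) so no database lookup is needed.
my $seq = new_sequence('ATGGCGTCAACTGGTACC', 'example_seq', 'EX000001');

# The second argument selects the output format; nothing else changes.
write_sequence('>example_seq.fasta', 'fasta',   $seq);
write_sequence('>example_seq.gb',    'genbank', $seq);

print "Wrote example_seq.fasta and example_seq.gb\n";
```

Comparing the two output files shows how much more structure the GenBank format carries than FASTA, even for a sequence without annotation.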
2) Another example is the ability to BLAST a sequence using the facilities at NCBI. Please be careful not to abuse the resources that NCBI provides, and use this only for individual searches. If you want to do a large number of BLAST searches, please download the BLAST package and install it locally.
use Bio::Perl;
$seq = get_sequence('genbank',"ROA1_HUMAN");
# uses the default database - nr in this case
$blast_result = blast_sequence($seq);
write_blast(">roa1.blast",$blast_result);
Sequence objects

Seq
The central sequence object in bioperl. When in doubt, this is probably the object that you want to use to describe a DNA, RNA or protein sequence in bioperl. Most common sequence manipulations can be performed with Seq. (Bio::Seq manpage).

Seq objects may be created for you automatically when you read in a file containing sequence data using the SeqIO object (see below). In addition to storing its identification labels and the sequence itself, a Seq object can store multiple annotations and associated ``sequence features'', such as those contained in most Genbank and EMBL sequence files. This capability can be very useful, especially in development of automated genome annotation systems.

PrimarySeq
A stripped-down version of Seq. It contains just the sequence data itself and a few identifying labels (id, accession number, alphabet = dna, rna, or protein), and no features. For applications with hundreds or thousands of sequences, using PrimarySeq objects can significantly speed up program execution and decrease the amount of RAM the program requires. (Bio::PrimarySeq manpage).

RichSeq
Stores additional annotations beyond those used by standard Seq objects. If you are using sources with very rich sequence annotation, you may want to consider using these objects. RichSeq objects are created automatically when Genbank, EMBL, or Swissprot format files are read by SeqIO. (Bio::Seq::RichSeqI manpage).

LargeSeq
Used for handling very long sequences (e.g. > 100 MB). (Bio::Seq::LargeSeq manpage).

LocatableSeq

SeqWithQuality
Used to manipulate sequences with quality data, like those produced by phred. (Bio::Seq::SeqWithQuality manpage).

RelSegment
Useful when you want to be able to manipulate the origin of the genomic coordinate system. This situation may occur when looking at a sub-sequence (e.g. an exon) which is located on a longer underlying sequence such as a chromosome or a contig. Such manipulations may be important, for example, when designing a graphical genome browser. If your code may need such a capability, look at the documentation in the Bio::DB::GFF::RelSegment manpage, which describes this feature in detail.

LiveSeq
Addresses the problem of features whose location on a sequence changes over time. This can happen, for example, when sequence feature objects are used to store gene locations on newly sequenced genomes - locations which can change as higher quality sequencing data becomes available. Although a LiveSeq object is not implemented in the same way as a Seq object, LiveSeq does implement the SeqI interface (see below). Consequently, most methods available for Seq objects will work fine with LiveSeq objects. Section III.7.4 and the Bio::LiveSeq manpage contain further discussion of LiveSeq objects.

SeqI
Seq ``interface objects''. They are used to ensure bioperl's compatibility with other software packages. SeqI and other interface objects are not likely to be relevant to the casual Bioperl user. (Bio::SeqI manpage).

Example: create a simple Seq object. Can you make it print the accession number, alphabet, and sequence each on a separate line?

$seq = Bio::Seq->new(-seq              => 'actgtggcgtcaact',
                     -description      => 'Sample Bio::Seq object',
                     -display_id       => 'something',
                     -accession_number => 'BIOL_5310',
                     -alphabet         => 'dna' );
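
One possible way to approach the exercise is sketched below. It is not the only solution; it simply calls the accessor methods of the Seq object and prints each value followed by a newline. The constructor values are the ones from the example, passed here with the -desc spelling of the description argument.

```perl
use strict;
use warnings;
use Bio::Seq;

# Same values as in the example.
my $seq = Bio::Seq->new(-seq              => 'actgtggcgtcaact',
                        -desc             => 'Sample Bio::Seq object',
                        -display_id       => 'something',
                        -accession_number => 'BIOL_5310',
                        -alphabet         => 'dna');

# Each accessor returns a plain string, so print can be used directly.
print $seq->accession_number, "\n";
print $seq->alphabet,         "\n";
print $seq->seq,              "\n";
```

The same accessors accept an argument when you want to change a value, for example $seq->display_id('new_name').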

In most cases, you will probably be accessing sequence data from some online data file or database.
Example: The following code example will get 3 different sequences from GenBank.
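
The code for this example is not included in this copy of the tutorial. The sketch below is therefore an assumption about what such a script could look like, not the original listing. It uses Bio::DB::GenBank, needs network access to NCBI, and wraps the retrieval in a hypothetical helper named fetch_genbank that only runs when accession numbers are given on the command line.

```perl
use strict;
use warnings;
use Bio::DB::GenBank;

# Hypothetical helper: fetch one Seq object per accession number.
# Bio::DB::GenBank also offers get_Seq_by_id and get_Stream_by_id.
sub fetch_genbank {
    my @accessions = @_;
    my $gb = Bio::DB::GenBank->new();
    my @seqs;
    for my $acc (@accessions) {
        my $seq = $gb->get_Seq_by_acc($acc);
        if ($seq) {
            push @seqs, $seq;
        } else {
            warn "No record returned for $acc\n";
        }
    }
    return @seqs;
}

# Only contact NCBI when the script is run with arguments.
if (@ARGV) {
    for my $seq (fetch_genbank(@ARGV)) {
        printf "%s\t%d residues\n", $seq->display_id, $seq->length;
    }
}
```

A call such as "perl fetch_genbank.pl AJ312413" would retrieve a single nucleotide record. Protein accessions are normally fetched through Bio::DB::GenPept, which has the same interface.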

In addition, the Perl ``tied filehandle'' syntax is available to SeqIO, allowing you to use the standard <> and print operations to read and write sequence objects, e.g.:
$in
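
The listing above breaks off in this copy. As a hedged illustration of the tied filehandle syntax, the sketch below uses the newFh method of Bio::SeqIO to read a FASTA file with <> and write EMBL with print. The file names and the two sequences are made up so that the script is self-contained.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Create a small FASTA input file (made-up data).
open(my $fh, '>', 'tied_in.fasta') or die "Cannot write tied_in.fasta: $!";
print $fh ">seq1 first example\nACGTACGTAC\n>seq2 second example\nTTGGCCAATT\n";
close($fh);

# newFh returns a tied filehandle instead of a SeqIO object.
my $in  = Bio::SeqIO->newFh(-file => 'tied_in.fasta',  -format => 'fasta');
my $out = Bio::SeqIO->newFh(-file => '>tied_out.embl', -format => 'embl');

my $count = 0;
while (my $seq = <$in>) {   # <> returns the next Seq object
    print $out $seq;        # print writes it in the output format
    $count++;
}
undef $out;                 # release the handle so the output file is closed
print "Converted $count sequences\n";
```

The loop body is the whole format conversion; the regular next_seq and write_seq methods do the same work when a SeqIO object is preferred over a filehandle.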

If the ``-format'' argument isn't used then Bioperl will try to determine the format based on the file's suffix, in a case-insensitive manner. If there's no suffix available then SeqIO will attempt to guess the format based on actual content. If it can't determine the format then it will assume ``fasta''. A complete list of formats and suffixes can be found in the SeqIO HOWTO (http://www.bioperl.org/wiki/HOWTO:SeqIO).

Practice: Get the sequences AJ312413, NP_001073624, and XM_001807534 by accession number from GenBank and write them to a file in fasta format.

Data files storing multiple sequence alignments also appear in varied formats. AlignIO is the Bioperl object for conversion of alignment files. AlignIO is patterned on the SeqIO object and its commands have many of the same names as the commands in SeqIO. Just as in SeqIO, the AlignIO object can be created with ``-file'' and ``-format'' options:
use Bio::AlignIO;
my $io = Bio::AlignIO->new(-file   => "example.aln",
                           -format => "clustalw" );

If the ``-format'' argument isn't used then Bioperl will try to determine the format based on the file's suffix, in a case-insensitive manner. Here is the current set of suffixes:

   Format      Suffixes                      Comment
   bl2seq
   clustalw    aln
   emboss*     water|needle
   fasta       fasta|fast|seq|fa|fsa|nt|aa
   maf         maf
   mase                                      Seaview
   mega        meg|mega
   meme        meme
   metafasta
   msf         msf|pileup|gcg                GCG
   nexus       nexus|nex
   pfam        pfam|pfm
   phylip      phylip|phlp|phyl|phy|ph       interleaved
   po                                        POA
   prodom
   psi         psi                           PSI-BLAST
   selex       selex|slx|selx|slex|sx        HMMER
   stockholm

   *water, needle, matcher, stretcher, merger, and supermatcher.

Unlike SeqIO, AlignIO cannot create output files in every format. AlignIO currently supports output in these 7 formats: fasta, mase, selex, clustalw, msf/gcg, phylip (interleaved), and po.

Another significant difference between AlignIO and SeqIO is that AlignIO handles IO for only a single alignment at a time but SeqIO.pm handles IO for multiple sequences in a single stream. Syntax for AlignIO is almost identical to that of SeqIO:
use Bio::AlignIO;
$in = Bio::AlignIO->new(-file => "inputfilename" ,
-format => 'clustalw');
$out = Bio::AlignIO->new(-file => ">outputfilename",
-format => 'fasta');
while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }

The only difference is that the returned object reference, $aln, is to a SimpleAlign object rather than to a Seq object. AlignIO also supports the tied filehandle syntax described above for SeqIO. See the Bio::AlignIO manpage and the Bio::SimpleAlign manpage for more information.
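
To make the conversion concrete, here is a small self-contained sketch, assuming only Bio::AlignIO. It writes a tiny aligned FASTA file (made-up data, all rows the same length), reads it as an alignment, and writes it back out in clustalw format. The file names are placeholders.

```perl
use strict;
use warnings;
use Bio::AlignIO;

# Write a tiny aligned-FASTA file; every row has 8 columns.
open(my $fh, '>', 'demo_aln.fasta') or die "Cannot write demo_aln.fasta: $!";
print $fh ">seq1\nACGT-CGT\n>seq2\nACGTACGT\n>seq3\nAC-TACGT\n";
close($fh);

my $in  = Bio::AlignIO->new(-file => 'demo_aln.fasta', -format => 'fasta');
my $out = Bio::AlignIO->new(-file => '>demo_aln.aln',  -format => 'clustalw');

my ($count, $last_length) = (0, 0);
while (my $aln = $in->next_aln()) {
    # $aln is a Bio::SimpleAlign object.
    $last_length = $aln->length;
    print "Alignment with $last_length columns:\n";
    print "  ", $_->display_id, "\n" for $aln->each_seq;
    $out->write_aln($aln);
    $count++;
}
$out->close();
```

The loop has the same shape as a SeqIO conversion; only the object type and the method names (next_aln, write_aln) differ.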

Seq provides multiple methods for performing many common (and some not-so-common) tasks of sequence manipulation and data retrieval. Here are some of the most useful:

These methods return strings or may be used to set values:
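$seqobj->display_id();        # the human-readable id of the sequence
$seqobj->seq();               # the sequence as a string
$seqobj->subseq(5,10);        # part of the sequence as a string
$seqobj->accession_number();  # the accession number, when there is one
$seqobj->alphabet();          # one of 'dna', 'rna', 'protein'
$seqobj->primary_id();        # a unique id for this sequence, independent
                              # of its display_id or accession number
$seqobj->description();       # a description of the sequence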

For a comment annotation, you can use:

use Bio::Annotation::Comment;
$seq->annotation->add_Annotation('comment',
    Bio::Annotation::Comment->new(-text => 'some description'));

use Bio::Annotation::Reference;
$seq->annotation->add_Annotation('reference',
    Bio::Annotation::Reference->new(-authors  => 'author1,author2',
                                    -title    => 'title line',
                                    -location => 'location line',
                                    -medline  => 998122 ));

The following methods return new sequence objects, but do not transfer the features from the starting object to the resulting object:
$seqobj->trunc(5,10);
$seqobj->revcom;
$seqobj->translate;

**Note that some methods return strings, some return arrays and some return objects. See the Bio::Seq manpage for more information.

Many of these methods are self-explanatory. However, the flexible translate() method needs some explanation. Translation in bioinformatics can mean two slightly different things:

1. Translating a nucleotide sequence from start to end.
2. Translating the actual coding regions in mRNAs or cDNAs.

The Bioperl implementation of sequence translation does the first of these tasks easily. Any sequence object which is not of alphabet 'protein' can be translated by simply calling the method, which returns a protein sequence object:
$prot_obj = $my_seq_object->translate;

All codons will be translated, including those before and after any initiation and termination codons. For example, ttttttatgccctaggggg will be translated to FFMP*G.

However, the translate() method can also be passed several optional parameters to modify its behavior. For example, you can tell translate() to modify the characters used to represent terminator (default '*') and unknown amino acids (default 'X').
$prot_obj = $my_seq_object->translate(-terminator => '-');
$prot_obj = $my_seq_object->translate(-unknown => '_');

You can also determine the frame of the translation. The default frame starts at the first nucleotide (frame 0). To get translation in the next frame, we would write:
$prot_obj = $my_seq_object->translate(-frame => 1);

If we want to translate full coding regions (CDS) the way major nucleotide databanks EMBL, GenBank and DDBJ do it, the translate() method has to perform more checks. Specifically, translate() needs to confirm that the sequence has appropriate start and terminator codons at the very beginning and the very end of the sequence and that there are no terminator codons present within the sequence in frame 0. In addition, if the genetic code being used has an atypical (non-ATG) start codon, the translate() method needs to convert the initial amino acid to methionine. These checks and conversions are triggered by setting ``complete'' to 1:
$prot_obj = $my_seq_object->translate(-complete => 1);

If ``complete'' is set to true and the criteria for a proper CDS are not met, the method, by default, issues a warning. By setting ``throw'' to 1, one can instead instruct the program to die if an improper CDS is found, e.g.
$prot_obj = $my_seq_object->translate(-complete => 1,
-throw => 1);

You can also create a custom codon table and pass this object to translate:
$prot_obj = $my_seq_object->translate(-codontable => $table_obj);

translate() can also find the open reading frame (ORF) starting at the 1st initiation codon in the nucleotide sequence, regardless of its frame, and translate that:
$prot_obj = $my_seq_object->translate(-orf => 1);

Most of the codon tables used by translate() have initiation codons in addition to ATG, including the default codon table, NCBI ``Standard''. To tell translate() to use only ATG, or atg, as the initiation codon set -start to ``atg'':
$prot_obj = $my_seq_object->translate(-orf   => 1,
                                      -start => "atg");
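
The options described above can be compared side by side on a toy sequence. The sketch below assumes only Bio::Seq; the 12-base coding sequence (start codon, two sense codons, stop codon) is made up for illustration.

```perl
use strict;
use warnings;
use Bio::Seq;

# Made-up CDS: atg gcc cat taa
my $cds = Bio::Seq->new(-seq        => 'atggcccattaa',
                        -alphabet   => 'dna',
                        -display_id => 'demo_cds');

my $plain    = $cds->translate->seq;                      # MAH*
my $dash     = $cds->translate(-terminator => '-')->seq;  # MAH-
my $complete = $cds->translate(-complete => 1)->seq;      # MAH (stop removed)

# Prepend one base so the coding frame becomes frame 1.
my $shifted = Bio::Seq->new(-seq => 'c' . $cds->seq, -alphabet => 'dna');
my $frame1  = $shifted->translate(-frame => 1)->seq;      # MAH*

# Let translate() locate the ORF itself, using only atg as start codon.
my $orf = $shifted->translate(-orf => 1, -start => 'atg')->seq;

print "plain:    $plain\n";
print "dash:     $dash\n";
print "complete: $complete\n";
print "frame 1:  $frame1\n";
print "orf:      $orf\n";
```

Because translate() returns a sequence object, ->seq is needed to get the protein string. Without it, print would show an object reference.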

In addition to the methods directly available in the Seq object, bioperl provides various helper objects to determine additional information about a sequence. For example, the SeqStats object provides methods for obtaining the molecular weight of the sequence as well as the number of occurrences of each of the component residues (bases for a nucleic acid or amino acids for a protein). For nucleic acids, SeqStats also returns counts of the number of codons used. For example:
use Bio::Tools::SeqStats;
$seq_stats = Bio::Tools::SeqStats->new($seqobj);
$weight = $seq_stats->get_mol_wt();
$monomer_ref = $seq_stats->count_monomers();
$codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence

Note: sometimes sequences will contain ambiguous codes. For this reason, get_mol_wt() returns a reference to a two element array containing a greatest lower bound and a least upper bound of the molecular weight.
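
Since these methods return references rather than plain values, a short sketch of how to unpack them may help. It assumes Bio::Seq and Bio::Tools::SeqStats; the sequence is made up and includes two ambiguous N bases so that the two weight bounds can differ.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::Tools::SeqStats;

# Made-up DNA sequence with two ambiguous bases (N).
my $seqobj = Bio::Seq->new(-seq => 'ACGTACGTNN', -alphabet => 'dna');
my $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);

# get_mol_wt returns a reference to a two-element array.
my $weight = $seq_stats->get_mol_wt();
my ($lower, $upper) = @$weight;
print "Molecular weight between $lower and $upper\n";

# count_monomers returns a reference to a hash keyed by residue.
my $monomers = $seq_stats->count_monomers();
for my $base (sort keys %$monomers) {
    print "$base: $monomers->{$base}\n";
}
```

For an unambiguous sequence the two bounds are expected to coincide, so either element can be used as the weight.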

The SeqWords object is similar to SeqStats and provides methods for calculating frequencies of ``words'' (e.g. tetramers or hexamers) within the sequence. See the Bio::Tools::SeqStats manpage and the Bio::Tools::SeqWords manpage for more information.

Another common sequence manipulation task for nucleic acid sequences is locating restriction enzyme cutting sites. Bioperl provides the Bio::Restriction::Enzyme, Bio::Restriction::EnzymeCollection, and Bio::Restriction::Analysis objects for this purpose. A new collection of enzyme objects would be defined like this:
use Bio::Restriction::EnzymeCollection;
my $all_collection = Bio::Restriction::EnzymeCollection->new();

There are other methods that can be used to select sets of enzyme objects, such as unique_cutters() and blunt_enzymes(). You can also select an Enzyme object by name, like so:
my $ecori_enzyme = $all_collection->get_enzyme('EcoRI');

Once an appropriate enzyme has been selected, the sites for that enzyme on a given nucleic acid sequence can be obtained using the fragments() method. The syntax for performing this task is:
use Bio::Restriction::Analysis;
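
The listing above stops after the use line in this copy. The following is a hedged sketch of a complete digest, assuming the Bio::Restriction modules named in the text. The DNA string is made up and contains a single EcoRI site (GAATTC).

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::Restriction::EnzymeCollection;
use Bio::Restriction::Analysis;

# Made-up linear DNA with one EcoRI recognition site.
my $dna = 'AAAAGAATTCTTTTGGGG';
my $seq = Bio::Seq->new(-seq => $dna, -alphabet => 'dna');

# Pick the enzyme from the default collection.
my $all_collection = Bio::Restriction::EnzymeCollection->new();
my $ecori_enzyme   = $all_collection->get_enzyme('EcoRI');

# The analysis object holds the sequence; fragments() performs the cut.
my $analysis  = Bio::Restriction::Analysis->new(-seq => $seq);
my @fragments = $analysis->fragments($ecori_enzyme);

print scalar(@fragments), " fragments:\n";
print "  $_\n" for @fragments;
```

fragments() returns the pieces as strings. The Bio::Restriction::Analysis manpage describes related methods that also report cut positions.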

For more information, including creating your own RE database (REBASE), see the Bio::Restriction::Enzyme manpage, the Bio::Restriction::EnzymeCollection manpage, the Bio::Restriction::Analysis manpage, and the Bio::Restriction::IO manpage.

Identifying amino acid cleavage sites (Sigcleave)
Predicts amino acid cleavage sites. Please see the Bio::Tools::Sigcleave manpage for details.

OddCodes: produces a listing of an amino acid sequence showing where amino acids with particular functional properties are located, for example where the positively charged ones are. See the documentation in the Bio::Tools::OddCodes manpage for further details.

SeqPattern: used to manipulate sequences using Perl regular expressions. More detail can be found in the Bio::Tools::SeqPattern manpage.

You may want to change some parameter of the remote job and this example shows how to change the matrix:
$Bio::Tools::Run::RemoteBlast::HEADER{'MATRIX_NAME'} = 'BLOSUM25';
http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html

Note that the script has to be broken into two parts: the actual Blast submission and the subsequent retrieval of the results. At times when the NCBI Blast is being heavily used, the interval between when a Blast submission is made and when the results are available can be substantial.

The object $rc would contain the blast report that could then be parsed with Bio::Tools::BPlite or Bio::SearchIO. The default object returned is SearchIO after version 1.0. The object type can be changed using the -readmethod parameter, but bear in mind that the favored Blast parser is Bio::SearchIO; others won't be supported in later versions.

**Note that to make this script actually useful, one should add details such as checking return codes from the Blast to see if it succeeded and a ``sleep'' loop to wait between consecutive requests to the NCBI server. See the Bio::Tools::Run::RemoteBlast manpage for details.
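
The script these paragraphs refer to is not included in this copy, so the following is only a sketch of the submit-then-poll pattern, modelled on the usage shown in the Bio::Tools::Run::RemoteBlast manpage. The helper name run_remote_blast, the database, the e-value cutoff, and the 5 second wait are assumptions for illustration. The code contacts NCBI only when a FASTA file name is given on the command line.

```perl
use strict;
use warnings;
use Bio::SeqIO;
use Bio::Tools::Run::RemoteBlast;

# Hypothetical helper: submit one query, then poll until the report arrives.
sub run_remote_blast {
    my ($seq, $wait_seconds) = @_;
    my $factory = Bio::Tools::Run::RemoteBlast->new(
        -prog       => 'blastp',
        -data       => 'swissprot',
        -expect     => '1e-10',
        -readmethod => 'SearchIO');

    # Part 1: submission.
    $factory->submit_blast($seq);

    # Part 2: retrieval, with a sleep between requests.
    my @results;
    while (my @rids = $factory->each_rid) {
        for my $rid (@rids) {
            my $rc = $factory->retrieve_blast($rid);
            if (!ref $rc) {
                # Not a report object yet: a negative code means failure.
                $factory->remove_rid($rid) if $rc < 0;
                sleep $wait_seconds;
                next;
            }
            push @results, $rc->next_result;   # $rc is a SearchIO object
            $factory->remove_rid($rid);
        }
    }
    return @results;
}

if (@ARGV) {
    my $query = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta')->next_seq;
    for my $result (run_remote_blast($query, 5)) {
        print "Query: ", $result->query_name, "\n";
    }
}
```

Keeping submission and retrieval as separate steps is what allows the script to wait politely while the NCBI server is busy.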

It should also be noted that the syntax for creating a remote blast factory is slightly different from that used in creating StandAloneBlast, Clustalw, and T-Coffee factories. Specifically RemoteBlast requires parameters to be passed with a leading hyphen, as in '-prog' => 'blastp', while the other programs do not pass parameters with a leading hyphen.

No matter how Blast searches are run (locally or remotely, with or without a Perl interface), they return large quantities of data that are tedious to sift through. Bioperl offers several different objects for parsing Blast reports: Search.pm/SearchIO.pm, and BPlite.pm (along with its minor modifications, BPpsilite and BPbl2seq). Search and SearchIO, which are the principal Bioperl interfaces for Blast and FASTA report parsing, are described in this section. The older BPlite is described in section III.4.3. We recommend you use SearchIO; it's certain to be supported in future releases.

The Search and SearchIO modules provide a uniform interface for parsing sequence-similarity-search reports generated by BLAST (in standard and BLAST XML formats), PSI-BLAST, RPS-BLAST, bl2seq and FASTA. The SearchIO modules also provide a parser for HMMER reports and in the future, it is envisioned that the Search/SearchIO syntax will be extended to provide a uniform interface to an even wider range of report parsers including parsers for Genscan.

Parsing sequence-similarity reports with Search and SearchIO is straightforward. Initially a SearchIO object specifies a file containing the report(s). The method next_result() reads the next report into a Search object in just the same way that the next_seq() method of SeqIO reads in the next sequence in a file into a Seq object.

Once a report (i.e. a Search object) has been read in and is available to the script, the report's overall attributes (e.g. the query) can be determined and its individual hits can be accessed with the next_hit() method. Individual high-scoring segment pairs for each hit can then be accessed with the next_hsp() method. Except for the additional syntax required to enable the reading of multiple reports in a single file, the remainder of the Search/SearchIO parsing syntax is very similar to that of the BPlite object it is intended to replace. Sample code to read a BLAST report might look like this:
use Bio::SearchIO;
# Get the report
$searchio = Bio::SearchIO->new(-format => 'blast',
                               -file   => $blast_report);
$result = $searchio->next_result;
# Get info about the entire report
$database_name = $result->database_name;
$algorithm_type = $result->algorithm;
# get info about the first hit
$hit = $result->next_hit;
$hit_name = $hit->name;
# get info about the first hsp of the first hit
$hsp = $hit->next_hsp;
$hsp_start = $hsp->query->start;
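
The sample above looks only at the first hit and its first HSP. A sketch of walking through every result, hit, and HSP in a report file follows. The helper name summarize_blast and the tab-separated output layout are assumptions for illustration; the report file name is taken from the command line.

```perl
use strict;
use warnings;
use Bio::SearchIO;

# Hypothetical helper: one output row per HSP in a BLAST report file.
sub summarize_blast {
    my ($report_file) = @_;
    my $searchio = Bio::SearchIO->new(-format => 'blast',
                                      -file   => $report_file);
    my @rows;
    while (my $result = $searchio->next_result) {    # one per query
        while (my $hit = $result->next_hit) {        # one per database match
            while (my $hsp = $hit->next_hsp) {       # one per aligned segment
                push @rows, join("\t",
                    $result->query_name,
                    $hit->name,
                    $hsp->evalue,
                    $hsp->query->start,
                    $hsp->query->end);
            }
        }
    }
    return @rows;
}

if (@ARGV) {
    print "$_\n" for summarize_blast($ARGV[0]);
}
```

The three nested loops mirror the structure of the report itself: results contain hits, and hits contain HSPs.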

Bioperl's older BLAST report parsers - BPlite, BPpsilite, BPbl2seq and Blast.pm - are no longer supported but since legacy Bioperl scripts have been written which use these objects, they are likely to remain within Bioperl for some time. A complete description of the module can be found in the Bio::Tools::BPlite manpage.

RefSeq Accession abbreviation key: http://www.ncbi.nlm.nih.gov/projects/RefSeq/key.html - accessions