Bio Tools Booklet
Prodigal
Prodigal is a gene-finding tool for bacteria and archaea. A special mode can be used
for certain bacteria which have non-standard genetic codes. Prodigal outputs the
coordinates of the genes found and their translation into protein. To also get the
nucleotide sequences of the genes, Prodigal can be run from the command line, for example:
$ prodigal -i input_fasta_file -o coordinates_file -a protein_fasta_file -d nucleotide_fasta_file
GenScan: http://hollywood.mit.edu/GENSCAN.html
GenScan is a web-based tool for finding genes and exons in nucleotide sequences. It
is meant for vertebrates and certain plants. If the sequences to be scanned are too
large, it is possible to download GenScan and run it from the command line.
BLAST: Basic Local Alignment Search Tool
BLAST is a tool for comparing sequences to each other. It can be used simply to
compare two sequences, but its standard usage is to compare a sequence of interest
against a very large database. The BLAST suite includes many different tools. The
main ones are:
Nucleotide blast is used to compare nucleotide sequences against each other.
Megablast is optimized for very similar sequences, while less similar sequences can
be found using blastn.
Protein blast is used to compare amino acid sequences against each other.
The standard way of doing this is through blastp. The other protein blast tools are
designed to find more distantly related proteins (e.g. PSI-blast) by considering the
conservation pattern of amino acids and protein domains.
Other BLAST tools are used to compare nucleotide sequences against protein
sequences. This is done by translating the nucleotides into proteins in all three
possible frames for each DNA strand (six frames in total). Blastx is used for
comparing a nucleotide sequence to a protein database, while tblastn is used for
comparing a protein sequence against a nucleotide database. Tblastx compares
nucleotide sequences against each other, after translating both into protein. This is
useful for identifying proteins with similar functions in distantly related organisms.
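The six-frame translation that the translated BLAST tools rely on can be illustrated with a small Python sketch. It is not how BLAST itself is implemented; it assumes the standard genetic code (NCBI translation table 1), and the function names and the frame labels (+1 to +3, -1 to -3) are chosen here for illustration only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. containing N) become X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Translate three reading frames on each strand (labels are illustrative)."""
    frames = {}
    for strand, strand_seq in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Stop codons are written as `*`, so an open reading frame shows up as a stretch of letters without `*` in one of the six frames.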
Blast can be run online or locally from the command line. In the latter case,
you can build your own database of relevant reference sequences.
BLAST+
BLAST+ is a set of command line tools that have the same functionality as online
BLAST but use custom, locally built databases.
To format a BLAST database, use the command makeblastdb as follows:
$ makeblastdb -dbtype type -in input_fasta_file -out database_name
where type is either prot (for a protein database) or nucl (for a nucleotide database).
Three files will be created with the name database_name plus an extension.
To run a nucleotide BLAST search, type:
$ blastn -query query_fasta_file -db database_name -evalue evalue_threshold -outfmt output_format > output_file
For output_format, use 7 (tabular output with comment lines).
The commands blastp, blastx, tblastn, and tblastx can be used analogously. To
see all the options available for a given tool, type the name of the desired tool
followed by the flag -help. Note that the default e-value threshold (10) is very high
and will give many false positives. It is usually a good idea to use a much lower
value, such as 10e-10.
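Output written with -outfmt 7 is a tab-separated table in which comment lines start with #. As a rough sketch, such a file can be read and filtered by e-value in Python as shown below. The sketch assumes the default twelve columns (query, subject, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score); the Hit type, the parse_tabular function and the example hits are hypothetical and not part of BLAST+.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """A subset of the default tabular columns (hypothetical helper type)."""
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float


def parse_tabular(lines, max_evalue=1e-10):
    """Yield hits from -outfmt 7 (or 6) lines whose e-value is at most max_evalue."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment lines written by -outfmt 7
        fields = line.split("\t")
        hit = Hit(
            query=fields[0],
            subject=fields[1],
            percent_identity=float(fields[2]),
            alignment_length=int(fields[3]),
            evalue=float(fields[10]),
            bit_score=float(fields[11]),
        )
        if hit.evalue <= max_evalue:
            yield hit


if __name__ == "__main__":
    # Made-up example lines, for illustration only.
    example = [
        "# Fields: query acc.ver, subject acc.ver, % identity, ...",
        "q1\ts1\t98.5\t200\t3\t0\t1\t200\t10\t209\t2e-50\t360",
        "q1\ts2\t80.0\t50\t10\t0\t1\t50\t5\t54\t0.5\t30.1",
    ]
    for hit in parse_tabular(example):
        print(hit.query, hit.subject, hit.evalue)
```

With a real result file, the lines can be passed in directly, for example `parse_tabular(open("output_file"))`.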
BLAST+ is available on Bioconda. To install it, run
$ conda install -c bioconda blast
tRNAscan: http://lowelab.ucsc.edu/tRNAscan-SE/
tRNAscan is a tool for identifying transfer RNA in nucleotide sequences. It can be
run online or downloaded to be run locally.
tRNAscan is available on Bioconda. To install it, run
$ conda install -c bioconda trnascan-se
Barrnap
Barrnap is a tool for finding ribosomal RNA in nucleotide sequences. It can take
bacterial, archaeal and eukaryotic sequences.
Weblogo: http://weblogo.berkeley.edu/logo.cgi
Weblogo is a tool for producing logos of conserved sequences based on short
multiple alignments. The fasta or clustalw sequences are pasted or uploaded, and an
image is generated of the chosen format and size.
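In a sequence logo, the total height of each column reflects how conserved that position is, measured in bits. The sketch below shows the basic calculation for a DNA alignment, assuming equal-length sequences without gaps. It leaves out the small-sample correction that logo programs usually apply, and the function names are illustrative rather than part of Weblogo.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content in bits: log2(alphabet size) minus the Shannon entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def logo_heights(alignment, alphabet_size=4):
    """Return the information content of every column of an alignment."""
    return [column_information(column, alphabet_size) for column in zip(*alignment)]


if __name__ == "__main__":
    # Made-up four-sequence alignment, for illustration only.
    print(logo_heights(["AAC", "ACC", "AGC", "ATG"]))
```

A fully conserved DNA column reaches the maximum of 2 bits, while a column in which all four bases are equally frequent has 0 bits. For protein alignments, alphabet_size would be 20.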
Pfam: http://pfam.xfam.org/
Pfam is a large database of protein families, many of which have extensive
annotation. You can search through it by providing an accession number (provided
by e.g. online blast), keywords or an amino acid sequence.
UniProtKB: http://www.uniprot.org/
UniProtKB is a high-quality annotated protein database. The annotation is either
done manually (collected in the SwissProt database) or automatically (TrEMBL
database).
TMHMM: http://www.cbs.dtu.dk/services/TMHMM-2.0/
TMHMM is a tool for predicting transmembrane domains by inputting amino acid
sequences in fasta format. The output is a list of partitions of your protein sequence
into regions inside/outside the cell and regions inside the membrane, together with a
plot showing the probability for each amino acid to be placed in each type of region.
Philius: http://www.yeastrc.org/philius
Philius is a tool for predicting transmembrane domains and signal peptides based on
an amino acid sequence (fasta format is supported only by submitting it through an
e-mail form). The output is a confidence measure of the sequence being
transmembrane and a partitioning of your protein sequence into regions
inside/outside the cell and regions inside the membrane, together with a confidence
measure for each region (press the "show list" link next to "Predicted protein
segments" to view these statistics).
Galaxy: https://usegalaxy.org/
Galaxy is an open source, web-based platform for data intensive biomedical
research. The interface is divided into three panels: Tools (left), Display (center) and
History (right). You use the Tools panel to upload data and select tools to run. Every
time you upload data or run a tool, a new item appears in the History panel. From the
History panel you can choose to view your raw data and/or the results from the tools you
have used, which will then be displayed in the Display panel. Some files are in binary
format (for example BAM files) and cannot be viewed. If you choose to view
them, they will be downloaded to your computer instead.
When you need to execute the same tool on a number of datasets, there is an
option available to run them all at once in parallel (as shown in the figure below).
Most, if not all, of the tools available in Galaxy are also available as open source
software to be run from the command line. While that may be the ‘standard’ way to
run these tools, the Galaxy environment is a great platform for getting familiar with the
programs, data files and results.
Galaxy 101: https://usegalaxy.org/u/aun1/p/galaxy101
Following is a short list of the tools in Galaxy, some of which you will be using
through Galaxy in the labs:
FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FastQC is a practical tool that allows you to check various quality aspects of your
sequencing data prior to any downstream analysis. The input format for FastQC is
sequencing data in the fastq format.
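To give a feel for the fastq format and the kind of quality information FastQC summarises, here is a minimal Python sketch that reads fastq records and computes the mean quality score of each read. It assumes four lines per record and Phred+33 quality encoding; it is not how FastQC works internally, and the function names are made up for this example.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, quality string) from four-line fastq records."""
    lines = iter(handle)
    while True:
        record = [line.rstrip("\n") for line in islice(lines, 4)]
        if len(record) < 4:
            return  # end of input (or an incomplete final record)
        header, sequence, _separator, quality = record
        yield header[1:], sequence, quality  # drop the leading "@" of the header


def mean_quality(quality, offset=33):
    """Mean Phred score of a read; offset=33 is an assumption (Phred+33 encoding)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


if __name__ == "__main__":
    # Made-up reads, for illustration only.
    example = ["@read1\n", "ACGT\n", "+\n", "IIII\n",
               "@read2\n", "ACGT\n", "+\n", "!!II\n"]
    for name, sequence, quality in read_fastq(example):
        print(name, sequence, mean_quality(quality))
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so a score of 40 means roughly one expected error in 10,000 base calls.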
STAR: https://code.google.com/p/rna-star/
STAR is an ultrafast RNA sequencing aligner. It takes in sequencing data in fastq
format and aligns the sequences to a reference genome. The output is a list of
aligned sequences in SAM/BAM format.
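SAM is a tab-separated text format (BAM is its binary counterpart), so the alignments can be inspected with a few lines of Python. The sketch below reads some of the mandatory SAM columns and counts how many alignment records are mapped. The function names and example lines are hypothetical, and the sketch counts records rather than reads, so a read with several reported alignments is counted more than once.

```python
def parse_sam_line(line):
    """Extract some of the mandatory SAM columns from one alignment line."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "read": fields[0],
        "reference": fields[2],
        "position": int(fields[3]),        # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "mapped": not flag & 0x4,          # flag bit 0x4 marks an unmapped read
        "reverse_strand": bool(flag & 0x10),
    }


def count_mapped(lines):
    """Return (mapped records, total records), skipping '@' header lines."""
    mapped = total = 0
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue
        total += 1
        if parse_sam_line(line)["mapped"]:
            mapped += 1
    return mapped, total


if __name__ == "__main__":
    # Made-up alignment lines, for illustration only.
    example = [
        "@SQ\tSN:chr1\tLN:1000",
        "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    print(count_mapped(example))
```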
Cufflinks: http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/
Cufflinks is a transcript assembler; that is, it assembles aligned reads into transcripts
(i.e. their exons and introns). It also calculates FPKM values (fragments per kilobase
of transcript per million mapped fragments) for transcripts, both novel and known
(annotated) ones. Furthermore, Cufflinks includes a module called cuffdiff that
calculates differential expression between two (or more) groups.
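The FPKM unit itself is easy to compute by hand: the number of fragments assigned to a transcript is divided by the transcript length in kilobases and by the total number of mapped fragments in millions. The sketch below shows only this basic definition. Cufflinks estimates abundances with a statistical model, so its reported values will generally differ from this simple calculation; the function name and the numbers are made up for illustration.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return 1e9 * fragments / (transcript_length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # Made-up numbers: 500 fragments on a 2,000 bp transcript, 10 million mapped fragments.
    print(fpkm(500, 2000, 10_000_000))
```

Normalising by transcript length and sequencing depth makes values comparable between long and short transcripts and between samples sequenced to different depths.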