
Retrieve GenBank Sequences with R


Getting Sequences from GenBank using R-packages

• Open R and select the working directory where you want to output sequence files

Misc>Change Working Directory>select a folder (e.g., R_class_winter_2015)

Alternatively:

setwd("/Users/jcsantos/Desktop/R_class_winter_2015/1_getting_sequences_from_GenBank")

# I open a terminal and drag the folder into it to get the path, then copy and paste it here.

• We need to install and load the following packages:

install.packages("ape")
install.packages("seqinr")

library(ape) # a general R package for phylogenetics and comparative methods
library(seqinr) # a specialized package for nucleotide sequence management

• Let’s check that our packages have been loaded correctly


sessionInfo()

• Let's use 'ape' to read sequences from GenBank with the function read.GenBank() (see ?read.GenBank)

• This function connects to the GenBank database, and reads nucleotide sequences using
accession numbers given as arguments.

• Usage (do not run)


read.GenBank(access.nb, seq.names = access.nb, as.character = FALSE)

# access.nb: a vector of mode character giving the accession numbers.
# seq.names: the names to give to each sequence; by default the accession numbers.
# as.character: a logical; if FALSE (the default) the sequences are returned as an
# object of class "DNAbin", if TRUE they are returned as character vectors.

• Let's read the casque-headed lizard (Basiliscus basiliscus) RAG1 sequence JF806202

seq_1_DNAbin <- read.GenBank("JF806202") # save as a DNAbin object
attr(seq_1_DNAbin, "species") # get the species name of the sequence
seq_1_DNAbin$JF806202
str(seq_1_DNAbin) # show the structure of the object

# save as a character object:

seq_1_character <- read.GenBank("JF806202", as.character = TRUE)
seq_1_character # this is not a very nice format
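
The character output is hard to read because each base is stored as a separate element. A minimal sketch of collapsing such vectors into one string per sequence is given here. It assumes the output is a named list with one vector of single letters per accession; `toy_seq` is a made-up stand-in (not real GenBank data) so the example runs without a network connection.

```r
# Hypothetical stand-in for the output of read.GenBank(..., as.character = TRUE):
# a named list holding one vector of single bases per accession.
toy_seq <- list(TOY00001 = c("a", "t", "g", "c", "c", "a"))

# Collapse each vector of single bases into one string per sequence.
toy_strings <- vapply(toy_seq, paste, character(1), collapse = "")

toy_strings
nchar(toy_strings) # sequence length in bases
```

The same call could be applied to seq_1_character, provided it has the structure assumed above.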

Read sequences using accession numbers
• Create a vector of GenBank accession numbers that we want

# create a vector of GenBank accession numbers
lizards_accession_numbers <- c("JF806202", "HM161150", "FJ356743", "JF806205",
                               "JQ073190", "GU457971", "FJ356741", "JF806207",
                               "JF806210", "AY662592", "AY662591", "FJ356748",
                               "JN112660", "AY662594", "JN112661", "HQ876437",
                               "HQ876434", "AY662590", "FJ356740", "JF806214",
                               "JQ073188", "FJ356749", "JQ073189", "JF806216",
                               "AY662598", "JN112653", "JF806204", "FJ356747",
                               "FJ356744", "HQ876440", "JN112651", "JF806215",
                               "JF806209")

• Get those sequences and save them in a single DNAbin object:

lizards_sequences <- read.GenBank(lizards_accession_numbers) # read sequences and place them in a DNAbin object

lizards_sequences # a brief summary of what is in the object, including base composition

str(lizards_sequences) # a list of the DNAbin elements with the length of each sequence
# notice that one of the attributes is the species names

Read sequences and create a fasta file format

• Let's explore the DNAbin object further:

attributes(lizards_sequences) # see the list of attributes and their contents

names(lizards_sequences) # the accession numbers

attr(lizards_sequences, "species") # the species list; note that attr() is a
# slightly different function from attributes()

• However, it is hard to remember which accession number corresponds to which species.
So we can use the previous information to first create a vector of more informative names

lizards_sequences_GenBank_IDs <- paste(attr(lizards_sequences, "species"), names(lizards_sequences), sep = "_RAG1_")

## build a character vector with the species, gene name, and GenBank accession numbers
## "RAG1" is the common abbreviation of recombination activating gene 1
## notice the use of the paste function: paste(textA, textB, sep = textC)
## results in: textAtextCtextB

lizards_sequences_GenBank_IDs # a more informative vector of names for our sequences
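
To make the behavior of paste() with a `sep` argument concrete, here is a small sketch that runs without GenBank access. The species names and accession numbers are invented for illustration only.

```r
# Hypothetical species names and accession numbers (not fetched from GenBank).
toy_species <- c("Genus_one", "Genus_two")
toy_accessions <- c("AB000001", "AB000002")

# paste() joins its arguments element by element, placing `sep` between them.
toy_ids <- paste(toy_species, toy_accessions, sep = "_RAG1_")
toy_ids
# "Genus_one_RAG1_AB000001" "Genus_two_RAG1_AB000002"
```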

Write a fasta file format
• Let’s write the sequences to a text file in fasta format using write.dna(). Note that only
the accession numbers are included as sequence names.

?write.dna # This function writes a list of DNA sequences to a file in sequential,
# interleaved, or FASTA format.

### we are going to write in fasta format

write.dna(lizards_sequences, file = "lizard_fasta_1.fasta", format = "fasta", append = FALSE, nbcol = 6, colsep = " ", colw = 10)
!
########### Some relevant arguments for write.dna()

# x: a list or a matrix of DNA sequences.

# file: the name of the file that will contain our sequences.

# format: three choices are possible: "interleaved", "sequential", or "fasta", or any
# unambiguous abbreviation of these.

# append: a logical; if TRUE the data are appended to the file without erasing the data
# possibly existing in the file, otherwise the file is overwritten (FALSE, the default).

# nbcol: a numeric specifying the number of columns per row (6 by default).

# colsep: a character used to separate the columns (a single space by default).

# colw: a numeric specifying the number of nucleotides per column (10 by default).
###########
• Let's explore our recently created file ‘lizard_fasta_1.fasta’. Drag and drop this file into a
text editor

• This file has our sequences, but we only have the accession numbers
Rewrite a fasta file format with more information
• Read our fasta file using the seqinr package

lizard_seq_seqinr_format <- read.fasta(file = "lizard_fasta_1.fasta", seqtype = "DNA", as.string = TRUE, forceDNAtolower = FALSE)

lizard_seq_seqinr_format # this shows a different way to display the same sequence information

• Rewrite our fasta file using the name vector that we created previously

write.fasta(sequences = lizard_seq_seqinr_format, names = lizards_sequences_GenBank_IDs, nbchar = 10, file.out = "lizard_seq_seqinr_format.fasta")

# Suggestion: do not rearrange, delete, or add sequences in the fasta file, because the
# function assigns names by position: the first name goes to the first sequence, and so on.
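
Because names are assigned by position, a quick check before writing can catch a mismatch. The sketch below is one possible sanity check, not part of the original workflow. `toy_sequences` and `toy_new_names` are hypothetical stand-ins for lizard_seq_seqinr_format and lizards_sequences_GenBank_IDs, and it assumes each new name ends with the accession number of its sequence.

```r
# Hypothetical stand-ins for the sequence list and the name vector.
toy_sequences <- list(AB000001 = "atgc", AB000002 = "ggca")
toy_new_names <- c("Genus_one_RAG1_AB000001", "Genus_two_RAG1_AB000002")

# The two objects must have the same length ...
stopifnot(length(toy_sequences) == length(toy_new_names))

# ... and each new name should end with the accession number of the
# sequence in the same position.
names_match <- endsWith(toy_new_names, names(toy_sequences))
all(names_match)
```

If all(names_match) is FALSE, the order of the file and the name vector no longer agree and the names should be rebuilt before calling write.fasta().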
• Let’s check our new fasta file ‘lizard_seq_seqinr_format.fasta’
Get sequences without using accession numbers

• We can use a package that uses an API (application programming interface) to interact
with the NCBI website.

More info in: [Link]

install.packages("rentrez")
library(rentrez)

• Let’s get some lizard sequences

lizard <- "Basiliscus basiliscus[Organism]" # we want a character vector

# db = "nuccore" is the nucleotide database; retmax sets the maximum number of IDs returned
lizard_search <- entrez_search(db = "nuccore", term = lizard, retmax = 40)
lizard_search
lizard_search$ids # gives you the NCBI IDs

# gets your sequences as a character vector
lizard_seqs <- entrez_fetch(db = "nuccore", id = lizard_search$ids, rettype = "fasta")
lizard_seqs


• Let's get our Basiliscus basiliscus RAG1 sequence

Bbasiliscus_RAG1 <- "Basiliscus basiliscus[Organism] AND RAG1[Gene]"

Bbasiliscus_RAG1_search <- entrez_search(db = "nuccore", term = Bbasiliscus_RAG1, retmax = 10)
# nucleotide database (nuccore); retmax means that no more than 10 IDs are returned

Bbasiliscus_RAG1_search$ids # gives you the NCBI IDs

Bbasiliscus_RAG1_seqs <- entrez_fetch(db = "nuccore", id = Bbasiliscus_RAG1_search$ids, rettype = "fasta")

Bbasiliscus_RAG1_seqs # notice the \n (new line) delimiter. Other common delimiters are \r
# (carriage return) and \t (tab).

write(Bbasiliscus_RAG1_seqs, "Bbasiliscus_RAG1.fasta", sep = "\n") # writes the sequences to a file

• We can read our fasta file using the seqinr package

Bbasiliscus_RAG1_seqinr_format <- read.fasta(file = "Bbasiliscus_RAG1.fasta", seqtype = "DNA", as.string = TRUE, forceDNAtolower = FALSE)

Bbasiliscus_RAG1_seqinr_format # you can also check the .fasta file in the working folder
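
The \n delimiter also makes it easy to inspect the fetched text before writing it to a file. The sketch below counts the FASTA records in a string. It assumes the fetched text is one long string with newline-separated records; `toy_fasta` is an invented example, not real NCBI output.

```r
# Hypothetical FASTA text: one long string with records separated by
# newline characters.
toy_fasta <- ">TOY00001.1 first toy record\nATGCCA\nGGT\n\n>TOY00002.1 second toy record\nTTGACA\n\n"

# Split the string into lines at each "\n".
toy_lines <- strsplit(toy_fasta, "\n", fixed = TRUE)[[1]]

# Header lines start with ">"; everything else is sequence or blank.
toy_headers <- toy_lines[startsWith(toy_lines, ">")]
toy_headers
length(toy_headers) # number of records
```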
Example: Accessing Cytochrome B Sequences

• We can use the ‘rentrez’ package to get lots of sequences using taxonomic
classifications for specific markers

Liolaemus_CYTB <- "Liolaemus[Organism] AND CYTB[Gene]"

# This is a well-studied gene from this genus of South American lizards

Liolaemus_CYTB_search <- entrez_search(db = "nuccore", term = Liolaemus_CYTB, retmax = 100)

Liolaemus_CYTB_search # There are 2539 sequences that match this query

• Let’s adjust the search to request the IDs of all 2539 matching sequences and then try to
fetch them

Liolaemus_CYTB_search_2 <- entrez_search(db = "nuccore", term = Liolaemus_CYTB, retmax = 2539)

Liolaemus_CYTB_search_2$ids # gives you the NCBI IDs

Liolaemus_CYTB_seqs <- entrez_fetch(db = "nuccore", id = Liolaemus_CYTB_search_2$ids, rettype = "fasta")

# we get the error "client error: (414) Request-URI Too Long": we are asking for too many
# sequences in a single request


• Let's adjust the search and fetch in smaller chunks so we can get the first 1500
sequences

Liolaemus_CYTB_seqs_part_1 <- entrez_fetch(db = "nuccore", id = Liolaemus_CYTB_search_2$ids[1:500], rettype = "fasta")

Liolaemus_CYTB_seqs_part_2 <- entrez_fetch(db = "nuccore", id = Liolaemus_CYTB_search_2$ids[501:1000], rettype = "fasta")

Liolaemus_CYTB_seqs_part_3 <- entrez_fetch(db = "nuccore", id = Liolaemus_CYTB_search_2$ids[1001:1500], rettype = "fasta")

• Let's write a single file by appending all 3 chunks of sequences

write(Liolaemus_CYTB_seqs_part_1, "Liolaemus_CYTB_seqs.fasta", sep = "\n")

write(Liolaemus_CYTB_seqs_part_2, "Liolaemus_CYTB_seqs.fasta", sep = "\n", append = TRUE)
# this adds the sequences to the same file by changing the logical argument append from
# the default FALSE to TRUE (T also works as shorthand for TRUE, but spelling it out is
# safer because T can be reassigned)

write(Liolaemus_CYTB_seqs_part_3, "Liolaemus_CYTB_seqs.fasta", sep = "\n", append = TRUE)
# you will get a 1.3 MB file with all 1500 sequences
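
The manual chunking above can be generalized so that any number of IDs is split into groups of at most 500. The sketch below shows only the splitting step, using an invented ID vector (`toy_ids`) in place of Liolaemus_CYTB_search_2$ids so that it runs offline. The fetch-and-append loop is left as a comment and is an untested outline, not a verified recipe.

```r
# Hypothetical ID vector standing in for Liolaemus_CYTB_search_2$ids.
toy_ids <- as.character(seq_len(1234))
chunk_size <- 500

# Assign each ID to a chunk number (1, 1, ..., 2, 2, ...) and split on it.
chunk_index <- ceiling(seq_along(toy_ids) / chunk_size)
id_chunks <- split(toy_ids, chunk_index)

lengths(id_chunks) # chunk sizes: 500, 500, 234

# Each chunk could then be fetched and appended in turn, for example:
# for (i in seq_along(id_chunks)) {
#   seqs <- entrez_fetch(db = "nuccore", id = id_chunks[[i]], rettype = "fasta")
#   write(seqs, "Liolaemus_CYTB_seqs.fasta", sep = "\n", append = i > 1)
# }
```

Using append = i > 1 overwrites the file on the first chunk and appends on the later ones, which mirrors the three write() calls above.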


• We can read our fasta file using the seqinr package and rename the sequences

Liolaemus_CYTB_seqs_seqinr_format <- read.fasta(file = "Liolaemus_CYTB_seqs.fasta", seqtype = "DNA", as.string = TRUE, forceDNAtolower = FALSE)

Liolaemus_CYTB_seqs_seqinr_format

Liolaemnus_CYTB_names <- names(Liolaemus_CYTB_seqs_seqinr_format) # the sequence names taken from the fasta headers

Liolaemnus_CYTB_names <- gsub("\\..*", "", Liolaemnus_CYTB_names)

# eliminate the first "." and all characters after it using ?gsub (Pattern Matching and Replacement)

Liolaemnus_CYTB_names <- gsub("^.*\\|", "", Liolaemnus_CYTB_names)

# eliminate all characters up to and including the last "|" using ?gsub

Liolaemnus_CYTB_names
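
To see what the two gsub() calls do, here is a small sketch on invented header IDs. The "gi|...|gb|ACCESSION.VERSION|" layout used here is an assumption about how the names looked when the file was downloaded; headers from a current download may look different, in which case the second step simply leaves the names unchanged.

```r
# Hypothetical FASTA header IDs in a "gi|...|gb|ACCESSION.VERSION|" style.
toy_names <- c("gi|111111111|gb|AB000001.1|", "gi|222222222|gb|AB000002.2|")

# Step 1: drop the first "." and everything after it (the version suffix).
step_1 <- gsub("\\..*", "", toy_names)
step_1
# "gi|111111111|gb|AB000001" "gi|222222222|gb|AB000002"

# Step 2: drop everything up to and including the last "|".
step_2 <- gsub("^.*\\|", "", step_1)
step_2
# "AB000001" "AB000002"
```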

• We can pass the cleaned accession numbers to read.GenBank() from the ape package to get
the species names

Liolaemus_CYTB_seqs_ape_format <- read.GenBank(Liolaemnus_CYTB_names)

attr(Liolaemus_CYTB_seqs_ape_format, "species")
# to get the species names of the sequences

names(Liolaemus_CYTB_seqs_ape_format)

Liolaemus_CYTB_seqs_GenBank_IDs <- paste(attr(Liolaemus_CYTB_seqs_ape_format, "species"), names(Liolaemus_CYTB_seqs_ape_format), sep = "_CYTB_")
## build a character vector with the species, gene name, and GenBank accession numbers

Liolaemus_CYTB_seqs_GenBank_IDs # vector of names to add to the sequences

# Read our fasta file 'Liolaemus_CYTB_seqs.fasta' using the seqinr package

Liolaemus_CYTB_seqs_seqinr_format <- read.fasta(file = "Liolaemus_CYTB_seqs.fasta", seqtype = "DNA", as.string = TRUE, forceDNAtolower = FALSE)

# Rewrite our fasta file using the name vector that we created previously

write.fasta(sequences = Liolaemus_CYTB_seqs_seqinr_format, names = Liolaemus_CYTB_seqs_GenBank_IDs, nbchar = 10, file.out = "Liolaemus_CYTB_seqs_seqinr_format.fasta")

Alignment and Simultaneous Tree Estimation

• We are going to use SATe-2 (SATé - Simultaneous Alignment and Tree Estimation)
URL: [Link]

• From the Developers’ webpage (University of Kansas: Jiaye Yu, Mark Holder, Jeet
Sukumaran, Siavash Mirarab, and Jamie Oaks):

SATé is a software package for inferring a sequence alignment and phylogenetic tree.
The iterative algorithm involves repeated alignment and tree searching operations. The
original data set is divided into smaller subproblems by a tree-based decomposition.
These sub-problems are aligned and further merged for phylogenetic tree inference.

Currently, the following tools are supported, and are bundled with the SATé distribution:

ClustalW 2.0.12 (sequence alignment program)
MAFFT 6.717 (sequence alignment program)
MUSCLE 3.7 (sequence alignment program)
OPAL 1.0.3 (sequence alignment program)
PRANK 100311 (phylogeny-aware alignment program)
RAxML 7.2.6 (phylogeny estimator program)
FastTree 2.1.4 (phylogeny estimator program)

SATe-2 needs Python 2.7 (Upgrade Python Instructions)
• MAC OS: Open Terminal (go to HD>Applications>Utilities>Terminal)

• MAC OS: Check your version of Python

python --version

• MAC OS: If necessary, upgrade Python to 2.7, as required by SATe-II

[Link]

Install SATe-2
• Download SATe-II precompiled from UT-Austin website:

[Link]

*** For the more adventurous, the command-line version of SATe-II can be downloaded from:

[Link]

download: [Link]

Follow the instructions in the main webpage

Preparing FASTA files for SATe-2
• Download the FASTA files from the course website

Liolaemus_CYTB.fasta
Lizard_RAG1.fasta

• Create two output folders for the alignment results on your desktop and place each fasta
file in its corresponding folder

folder: Liolaemus_CYTB
folder: Lizards_RAG1

Running SATe-2
• Open the SATe-II GUI by clicking on the executable in the program folder:

• Explore the console and the options in the SATe-II GUI version:


External Tools:
Aligner: [ClustalW2, MAFFT, PRANK, OPAL]
Merger: [MUSCLE, OPAL]
Tree Estimator: [RAXML, FASTTREE]
Model: [RAxML-options: GTRCAT, GTRGAMMA, GTRGAMMAI;
FASTTREE-options: GTR+G20, GTR+CAT, JC+G20, JC+CAT ]

Sequences and Tree:
Sequence file ...: [This is the folder where our fasta file resides]
Multi-locus Data [option]
Data Type: [DNA, RNA, Protein]
Initial Alignment [option]
Tree file (optional): [Provide if you have an initial phylogeny associated with the sequences]

Workflow Settings:
Algorithm [option] Two-Phase (not SATe-II)
Post-Processing [option] Extra RAxML Search


Job Settings:
Job Name: [give a name for the job]
Output Dir.: [Select the corresponding directory for the output alignment]
CPU(s) Available: [It will depend on your computer]
Max. Memory (MB): [It will depend on your computer]

SATe-II Settings
Quick Set: [Presets: SATe-II_fast, SATe-II_ML, SATe-II_simple, custom]
Max. Subproblem:
Percentage [default 50]
Size [default 10]
Decomposition:
Centroid (fast) or Longest (slow)
Apply Stop Rule: [options]
Stopping Rule: Blind Mode Enabled
Time Limit (hr) [default 24 hours]
Iteration limit [default 1 iteration]
Return: [Default are Final or Best alignment]

Running SATe-2: Select the Following Options
External Tools:
Aligner: [MAFFT]
Merger: [MUSCLE]
Tree Estimator: [RAXML]
Model: [GTRGAMMAI]

Sequences and Tree:
Sequence file ...: [Liolaemus_CYTB.fasta]
Data Type: [DNA]
Tree file (optional): [None]

Workflow Settings:
Algorithm: [None] Two-Phase (not SATe-II)
Post-Processing: [None] Extra RAxML Search

Job Settings:
Job Name: [Liolaemus_CYTB_alignment]
Output Dir.: [Liolaemus_CYTB] Select the corresponding directory for the output alignment
CPU(s) Available: [2] It will depend on your computer
Max. Memory (MB): [1000] It will depend on your computer

SATe-II Settings
Quick Set: [SATe-II_fast]
Iteration Limit: [3]
Leave other options unchanged

• Explore the output in a text editor. The alignment is stored in fasta format in the .aln file:

satejob.marker001.Liolaemus_CYTB.aln

Repeat the same process with the Lizard_RAG1.fasta file

Mesquite: Visually explore the alignments

• Download Mesquite:

[Link]

[Link]
