RNA Seq R - Final Decode
with R
HANDS-ON
SHIVANGI AGARWAL
What will you learn?
✔ BASIC INTRODUCTION TO RNA-SEQUENCING
✔ RNA-SEQ EXPERIMENT AND BENCH WORKFLOW
✔ INFORMATICS STEPS IN RNA SEQ DATA ANALYSIS
✔ HANDS-ON ANALYSIS ON AN EXAMPLE RNA SEQUENCING DATA USING R
✔ INSTALLATION OF DIFFERENT R PACKAGES
✔ DOWNLOADING AND QUALITY CHECK USING FASTQC TOOLKIT
✔ DOWNLOADING AND INDEXING OF REFERENCE GENOME
✔ FILE FORMATS (FASTA, GTF, BAM, SAM)
✔ ALIGNMENT
✔ GENERATION OF COUNT MATRIX (GENE ID VS COUNT VALUE FOR EACH SAMPLE)
✔ DESeq2 ALGORITHM
✔ DIFFERENTIAL GENE EXPRESSION USING DESeq2 R PACKAGE
✔ PLOTTING OF HEATMAP AND VOLCANO PLOTS IN R
BASIC INTRODUCTION
● Ribonucleic acids (RNAs) are polymeric molecules consisting of
nucleotides, essential in the coding, decoding, regulation, and
expression of genes.
● These include mRNA, ribosomal RNA (rRNA), transfer RNA
(tRNA), long ncRNA (lncRNA; transcripts longer than 200
nucleotides not translated into protein), and many smaller
ncRNAs such as microRNA (miRNA).
● mRNA accounts for only 1–4% of the total RNA population.
● Most prevalent is rRNA, which typically accounts for 80–95% of
the total RNA population. The remainder of ncRNAs are present
in much smaller amounts.
mRNA
CENTRAL DOGMA OF MOLECULAR BIOLOGY
INTRODUCTION TO RNA-SEQ
● RNA-Seq (named as an abbreviation of RNA sequencing) is a sequencing technique which uses next-generation
sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment, analyzing the
continuously changing cellular transcriptome.
● Specifically, RNA-Seq makes it possible to look at alternatively spliced gene transcripts, post-transcriptional
modifications, gene fusions, mutations/SNPs and changes in gene expression over time, or differences in gene
expression in different groups or treatments. In addition to mRNA transcripts, RNA-Seq can look at different populations
of RNA, including total RNA, small RNA such as miRNA and tRNA, and ribosome profiling.
● RNA-Seq can also be used to determine exon/intron boundaries and verify or amend previously annotated 5’ and 3’ gene
boundaries. Recent advances in RNA-Seq include single-cell sequencing, in situ sequencing of fixed tissue, and native
RNA molecule sequencing with single-molecule real-time sequencing.
● RNA-Seq allows researchers to detect both known and novel features in a single assay, enabling the identification of transcript
isoforms, gene fusions, single nucleotide variants, and other features without the limitation of prior knowledge.
BENEFITS OF RNA-SEQUENCING
RNA-Seq with next-generation sequencing (NGS) is increasingly the method of choice for scientists studying the transcriptome.
• Captures both known and novel features; does not require predesigned probes
Two main approaches for RNA selection
Well defined genomes
Popular genome assemblies
RNA-SEQ EXPERIMENT AND WORKFLOW
GENERAL RNA SEQUENCING METHODS
RNA Seq library protocols
Bulk Seq vs Single-Cell Seq
Library strategies: single end vs paired end
Paired-end sequencing and alignment. Paired-end sequencing enables both ends of the DNA fragment to
be sequenced. Because the distance between each paired read is known, alignment algorithms can use
this information to map the reads over repetitive regions more precisely. Image courtesy of Illumina, Inc.
APPLICATIONS OF mRNA SEQUENCING
● mRNA-seq is a powerful tool to analyze the cell transcriptome profile. Novogene’s
professional services support research goals in a wide range of applications,
including:
● Quantitative profiling of transcripts in different tissues or samples, under various
conditions and treatments
● Discovery of novel transcripts, alternative splicing (AS), and transcript
variations
● Research of developmental mechanisms and drug resistance through tissue-specific
transcripts or time-course gene expression
● Biomarker discovery based on novel transcripts/isoforms, SNP/InDel
identification, and fusion gene analysis
● Omics analysis in combination with the transcriptome
● Investigation of pathogenic mechanisms and clinical subtypes in clinical diagnosis
STEPS IN ANALYSIS OF RNA SEQ DATA
FASTQ DOWNLOAD AND SPLITTING
fastq-dump SRR20074028
Read 21980257 spots for SRR20074028
Written 21980257 spots for SRR20074028
https://www.ncbi.nlm.nih.gov/sra/SRR20074028
FASTQ FILE FORMAT
PHRED SCORE
PHRED SCORE DENOTED AS CHARACTERS
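A Phred score Q is related to the base-call error probability P by Q = -10 log10(P), and a FASTQ file stores each score as a single ASCII character. The sketch below shows one way to decode such a quality string in R. It assumes the Phred+33 offset (Sanger / Illumina 1.8+ encoding); the quality string and the helper name decode_phred are made up for illustration.

```r
# Decode a FASTQ quality string into Phred scores and error probabilities.
# Assumption: Phred+33 encoding (ASCII code minus 33 gives the score).
decode_phred <- function(qual_string, offset = 33L) {
  scores <- utf8ToInt(qual_string) - offset        # ASCII code minus the offset
  data.frame(
    char       = strsplit(qual_string, "")[[1]],
    phred      = scores,
    error_prob = 10^(-scores / 10)                 # invert Q = -10 * log10(P)
  )
}

qual <- "II5+!"        # hypothetical quality string
decoded <- decode_phred(qual)
print(decoded)
```

Here 'I' decodes to Q40 (1 error in 10,000 calls) and '!' to Q0, the lowest possible score.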
QUALITY CONTROL CHECK PARAMETERS
● Per base sequence quality
● Per sequence quality scores
● Per base sequence content
● Per sequence GC content
● Per base N content
● Overrepresented Sequences
● Adapter Content
FASTQC DOWNLOAD
UBUNTU USERS
sudo apt install fastqc
fastqc
or download the zip file: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip it using the command
unzip fastqc_v0.11.9.zip
cd FastQC
chmod +x fastqc
./fastqc
WINDOWS USERS
Download the zip file from the link below
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip
unzip it
double-click to open the “fastqc” file
FASTQC REPORT
● Basic Statistics: simple information about the input FASTQ file: its name, type of quality score encoding, total
number of reads, read length and GC content.
● Per base sequence quality
A box-and-whisker plot showing aggregated quality score statistics at each position along all reads in the file.
The blue line is the mean quality score at each base position/window. The red line within each yellow box
represents the median quality score at that position/window. The yellow box is the interquartile range
(25th to 75th percentile). The lower and upper whiskers represent the 10th and 90th percentile scores.
A primer on sequencing quality scores has been prepared by Illumina.
What to look for: The distribution of average read quality should be fairly tight in the upper range of the
plot.
Per sequence quality scores
● A plot of the total number of reads vs the average quality score over full
length of that read.
● What to look for: The distribution of average read quality should be
fairly tight in the upper range of the plot.
Per base sequence content
● This plot reports the percent of bases called for each of the four nucleotides at each position across all
reads in the file. Again, the X-axis is non-uniform as described for Per base sequence quality.
● What to look for: For whole genome shotgun DNA sequencing the proportion of each of the four bases
should remain relatively constant over the length of the read with %A=%T and %G=%C. With most
RNA-Seq library preparation protocols there is clear non-uniform distribution of bases for the first
10-15 nucleotides; this is normal and expected depending on the type of library kit used (e.g. TruSeq RNA
Library Preparation). RNA-Seq data showing this non-uniform base composition will always be classified
as Failed by FastQC for this module even though the sequence is perfectly good.
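To make the per base sequence content idea concrete, the sketch below computes the percentage of A, C, G and T at each read position in base R. The reads are made-up, equal-length toy sequences; this is only an illustration of the statistic, not FastQC's own implementation.

```r
# Percent of each nucleotide at each position across a set of reads.
# Assumption: all reads have the same length (true for these toy reads).
reads <- c("ACGTAC", "AGGTTC", "ACGTTA", "TCGAAC")    # hypothetical reads

base_matrix <- do.call(rbind, strsplit(reads, ""))    # rows = reads, columns = positions

percent_by_position <- sapply(seq_len(ncol(base_matrix)), function(pos) {
  counts <- table(factor(base_matrix[, pos], levels = c("A", "C", "G", "T")))
  100 * as.numeric(counts) / nrow(base_matrix)
})
rownames(percent_by_position) <- c("A", "C", "G", "T")
colnames(percent_by_position) <- paste0("pos", seq_len(ncol(base_matrix)))

print(percent_by_position)
```

Each column sums to 100; plotting the four rows against position gives the kind of curve shown in the FastQC module.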
Per sequence GC content
● Plot of the number of reads vs. GC% per read. The
displayed Theoretical Distribution assumes a uniform
GC content for all reads.
● What to look for: For whole genome shotgun sequencing
the expectation is that the GC content of all reads should
form a normal distribution with the peak of the curve at
the mean GC content for the organism sequenced. If the
observed distribution deviates too far from the
theoretical, FastQC will call a Fail. There are many
situations in which this may occur which are expected so
the assignment can be ignored. For example, in RNA
sequencing there may be a greater or lesser
distribution of mean GC content among transcripts
causing the observed plot to be wider or narrower
than an idealized normal distribution. The plot below
is from some very high quality RNA-Seq data yet FastQC
still assigned a Warn flag to it because the observed
distribution was narrower than the theoretical.
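The per sequence GC content plot is built from one number per read: the percentage of G and C bases. A minimal base R sketch of that calculation follows; the reads and the helper name gc_percent are hypothetical.

```r
# GC percentage of a single read.
gc_percent <- function(read) {
  bases <- strsplit(toupper(read), "")[[1]]
  100 * sum(bases %in% c("G", "C")) / length(bases)
}

reads <- c("ACGTAC", "GGGCCC", "ATATAT", "GCGCAT")    # hypothetical reads
gc <- vapply(reads, gc_percent, numeric(1))
print(gc)

# A histogram of these values over all reads approximates the FastQC plot.
hist(gc, breaks = seq(0, 100, by = 10), main = "GC content per read", xlab = "GC %")
```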
Per base N content
● Percent of bases at each position or bin with no base call, i.e. ‘N’. What to expect: You should
never see any point where this curve rises noticeably above zero. If it does this indicates a
problem occurred during the sequencing run. The example below is a case where an error
caused the instrument to be unable to call a base for approximately 20% of the reads at
position 29.
Duplication Levels
● Percentage of reads of a given sequence in the file which are present a given number of times in the file. (This is the
blue line. The red line is more difficult to interpret.) There are generally two sources of duplicate reads: PCR
duplication in which library fragments have been over represented due to biased PCR enrichment or truly over
represented sequences such as very abundant transcripts in an RNA-Seq library. The former is a concern because
PCR duplicates misrepresent the true proportion of sequences in your starting material. The latter is an expected
case and not of concern because it does faithfully represent your input.
What to expect: When sequencing RNA
there will be some very highly abundant
transcripts and some lowly abundant. It is
expected that duplicate reads will be
observed for high abundance transcripts.
The RNA-Seq data below was flagged as
Failed by FastQC even though the
duplication is expected in this case.
Overrepresented Sequences
● List of sequences which appear more than expected in the file. Only the first 50bp are
considered. A sequence is considered overrepresented if it accounts for ≥ 0.1% of the total
reads.
● What to expect: In DNA-Seq data no single sequence should be present at a high enough
frequency to be listed, though it is not unusual to see a small percentage of adapter reads. For
RNA-Seq data it is possible that there may be some transcripts that are so abundant that they
register as overrepresented sequence.
Adapter Content
● Cumulative plot of the fraction of reads where the sequence library adapter sequence is
identified at the indicated base position. Only adapters specific to the library type are searched.
● The example below is for a high quality RNA-Seq library with a small percentage of the library
having inserts smaller than 150bp.
Kmer Content
● Measures the count of each short nucleotide sequence of
length k (default = 7) starting at each position
along the read. Any given Kmer should be
evenly represented across the length of the
read. A list of Kmers which appear at specific
positions with greater than expected
frequency is reported.
● RNA-seq libraries may have highly
represented Kmers that are derived from
highly expressed sequences.
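The Kmer statistic above can be illustrated by counting every Kmer at every start position. The sketch below does this for a few made-up reads in base R; the function name count_kmers_by_position is hypothetical, and FastQC's actual test for positional enrichment is not reproduced here.

```r
# Count each k-mer at each start position across reads.
count_kmers_by_position <- function(reads, k = 7) {
  starts <- seq_len(min(nchar(reads)) - k + 1)         # valid start positions
  kmers <- unlist(lapply(reads, function(r) substring(r, starts, starts + k - 1)))
  positions <- rep(starts, times = length(reads))      # same order as kmers
  table(kmer = kmers, position = positions)
}

reads <- c("ACGTACGTAC", "ACGTACGTTT", "TTGTACGTAC")   # hypothetical reads
tab <- count_kmers_by_position(reads, k = 7)
print(tab)
```

A Kmer whose counts are concentrated in one column of this table is positionally biased in the sense described above.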
FASTA FORMAT
● The FASTA format is a text-based format for representing either nucleotide
sequences or amino acid (protein) sequences. The format originates from the
FASTA software package, but has now become a near universal standard in the
field of bioinformatics.
● The description line (defline) or header/identifier line, which begins with '>',
gives a name and/or a unique identifier for the sequence, and may also contain
additional information. In a deprecated practice, the header line sometimes
contained more than one header, separated by a ^A (Control-A) character. In the
original Pearson FASTA format, one or more comments, distinguished by a
semi-colon at the beginning of the line, may occur after the header. Some
databases and bioinformatics applications do not recognize these comments and
follow the NCBI FASTA specification. An example of a multiple sequence FASTA
file follows:
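As an illustration of the layout described above, the sketch below holds a small multi-sequence FASTA file as a character vector and parses it in base R. The sequence names and bases are invented, and the parser assumes every header is followed by at least one sequence line.

```r
# A made-up two-record FASTA file, one element per line.
fasta_lines <- c(
  ">seq1 hypothetical example sequence",
  "ACGTACGTAC",
  "GGTTAA",
  ">seq2",
  "TTTTCCCC"
)

is_header <- startsWith(fasta_lines, ">")
record_id <- cumsum(is_header)                      # record number of each line

# Join the sequence lines that belong to the same record.
sequences <- vapply(
  split(fasta_lines[!is_header], record_id[!is_header]),
  paste, character(1), collapse = ""
)

# Use the first word of each header (without '>') as the sequence name.
headers <- sub("^>", "", fasta_lines[is_header])
names(sequences) <- sub(" .*", "", headers)

print(sequences)
print(nchar(sequences))
```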
GTF (Gene Transfer Format)
• Contains 9 tab-separated columns of data, plus optional track definition lines.
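The nine columns can be seen by splitting one GTF record on tabs. The record below is invented for illustration (the source name, coordinates and IDs are not from a real annotation); the column names follow the usual GTF convention.

```r
# One hypothetical GTF record, built with real tab separators.
gtf_line <- paste(
  "chr1", "example_source", "exon", "11869", "12227", ".", "+", ".",
  'gene_id "GENE0001"; transcript_id "TX0001";',
  sep = "\t"
)

gtf_columns <- c("seqname", "source", "feature", "start", "end",
                 "score", "strand", "frame", "attribute")

fields <- strsplit(gtf_line, "\t", fixed = TRUE)[[1]]
record <- setNames(as.list(fields), gtf_columns)

# GTF coordinates are 1-based and inclusive, hence the + 1.
feature_length <- as.integer(record$end) - as.integer(record$start) + 1L

# Pull the gene_id out of the attribute column.
gene_id <- sub('.*gene_id "([^"]+)".*', "\\1", record$attribute)

print(record$feature)
print(feature_length)
print(gene_id)
```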
GTF FILE
ALIGNERS
RNA-seq reads should be mapped with a “splice-aware” aligner that can recognize the difference between a read aligning across an exon–intron
boundary and a read with a short insertion.
A splice-aware aligner does not try to force RNA-seq reads onto introns; instead, it identifies possible downstream
exons and aligns the rest of the read to those, skipping the intron.
STAR ALIGNER ALGORITHM
STAR is shown to have high accuracy and outperforms other aligners by more than a factor of 50 in mapping speed, but it is memory intensive.
The algorithm achieves this highly efficient mapping by performing a two-step process:
1. Seed searching
2. Clustering, stitching, and scoring
Seed searching
• For every read that STAR aligns, STAR will search for the longest sequence (MMP) that exactly matches one or more locations on the
reference genome. These longest matching sequences are called the Maximal Mappable Prefixes (MMPs).
• The different parts of the read that are mapped separately are called ‘seeds’. So the first MMP that is mapped to the genome is
called seed1.
• STAR will then search again for only the unmapped portion of the read to find the next longest sequence that exactly matches the reference
genome, or the next MMP, which will be seed2.
• STAR uses an uncompressed suffix array (SA) to efficiently search for the MMPs, which allows for quick searching against even the largest
reference genomes.
• If STAR does not find an exact matching sequence for each part of the read due to mismatches or indels, the previous MMPs will
be extended.
• If extension does not give a good alignment, then the poor quality or adapter sequence (or other contaminating sequence) will be
soft clipped.
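The seed search can be illustrated with a toy version of the Maximal Mappable Prefix idea. The sketch below is deliberately naive: it tries shorter and shorter prefixes with a plain string search instead of STAR's suffix array, and it ignores mismatches, extension and soft clipping. The genome, the read and the function names find_mmp and seed_search are all made up.

```r
# Toy genome: exon1 (1-10), intron (11-24), exon2 (25-34).
genome <- paste0("TTGACCATGC", "GTAAGTATTTCCAG", "CGTTCAAGGA")
# Spliced read: last 6 bases of exon1 followed by first 6 bases of exon2.
read <- "CCATGCCGTTCA"

# Longest prefix of the read that occurs exactly in the genome.
find_mmp <- function(read, genome) {
  for (len in nchar(read):1) {
    prefix <- substr(read, 1, len)
    hit <- regexpr(prefix, genome, fixed = TRUE)
    if (hit > 0) {
      return(list(seed = prefix, length = len, genome_pos = as.integer(hit)))
    }
  }
  NULL
}

# Repeat the search on the unmapped remainder of the read.
seed_search <- function(read, genome) {
  seeds <- list()
  while (nchar(read) > 0) {
    mmp <- find_mmp(read, genome)
    if (is.null(mmp)) break                 # nothing maps; STAR would extend or clip here
    seeds[[length(seeds) + 1]] <- mmp
    read <- substr(read, mmp$length + 1, nchar(read))
  }
  seeds
}

seeds <- seed_search(read, genome)
for (s in seeds) cat(s$seed, "maps at genome position", s$genome_pos, "\n")
```

The two seeds land on either side of the toy intron, which is the situation the stitching step later resolves into a spliced alignment.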
STAR ALIGNER ALGORITHM [Seed Searching]
STAR ALIGNER ALGORITHM (Stitching)
Clustering, stitching, and scoring
• The separate seeds are stitched together to create a complete read by first clustering the seeds together based on proximity to a set of ‘anchor’ seeds, or seeds that are not multi-mapping.
• Then the seeds are stitched together based on the best alignment for the read (scoring based on mismatches, indels, gaps, etc.).
SAM FORMAT
● SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header
section, which is optional, and an alignment section.
● Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for
essential alignment information such as mapping position, and variable number of optional fields for
flexible or aligner specific information.
● The header section: each header line begins with the character ‘@’ followed by one of the two-letter
header record type codes defined in this section. In the header, each line is TAB-delimited and, apart from
@CO lines, each data field follows a format ‘TAG:VALUE’ where TAG is a two-character string that defines
the format and content of VALUE. Thus header lines match
/^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/.
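The two header patterns quoted above can be tried directly in R. The sketch below checks a few made-up header lines against them; the helper name is_valid_header and the example lines are illustrative only.

```r
# The two patterns from the SAM header description, written as R strings
# ("\t" inside an R string is a literal TAB character).
header_pattern  <- "^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$"
comment_pattern <- "^@CO\t.*"

is_valid_header <- function(line) {
  grepl(header_pattern, line, perl = TRUE) || grepl(comment_pattern, line, perl = TRUE)
}

header_lines <- c(
  "@HD\tVN:1.6\tSO:coordinate",      # valid
  "@SQ\tSN:chr1\tLN:248956422",      # valid
  "@CO\tfree text comment",          # valid comment line
  "@XX\tVN:1.6",                     # unknown record type
  "@HD VN:1.6"                       # space instead of TAB
)

valid <- vapply(header_lines, is_valid_header, logical(1), USE.NAMES = FALSE)
print(data.frame(line = header_lines, valid = valid))
```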
● The alignment section (mandatory fields): in the SAM format, each alignment line typically represents the
linear alignment of a segment. Each line consists of 11 or more TAB-separated fields.
● The first eleven fields are always present and in the order shown
below; if the information represented by any of these fields is unavailable, that field’s
value will be a placeholder, either ‘0’ or ‘*’ as determined by the field’s type.
● QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come
from the same template. A QNAME ‘*’ indicates the information is unavailable. In a SAM file, a
read may occupy multiple alignment lines, when its alignment is chimeric or when multiple
mappings are given.
● FLAG: Combination of bitwise FLAGs.
● RNAME: Reference sequence NAME of the alignment. If @SQ header lines are present, RNAME (if not ‘*’) must be
present in one of the SQ-SN tags. An unmapped segment without coordinate has a ‘*’ at this field.
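Because FLAG is a sum of bit values, it can be decoded with bitwise AND. The sketch below uses the bit values from the SAM specification; the short labels and the helper name decode_flag are chosen here for readability and are not official names.

```r
# Bit values from the SAM specification; the labels are informal.
sam_flag_bits <- c(
  paired = 0x1, proper_pair = 0x2, unmapped = 0x4, mate_unmapped = 0x8,
  reverse_strand = 0x10, mate_reverse_strand = 0x20,
  first_in_pair = 0x40, second_in_pair = 0x80,
  secondary = 0x100, qc_fail = 0x200, duplicate = 0x400, supplementary = 0x800
)

# Return the labels of all bits that are set in a FLAG value.
decode_flag <- function(flag) {
  is_set <- vapply(
    sam_flag_bits,
    function(bit) bitwAnd(as.integer(flag), as.integer(bit)) != 0L,
    logical(1)
  )
  names(sam_flag_bits)[is_set]
}

print(decode_flag(99))    # a typical first mate of a properly paired read
print(decode_flag(147))   # a typical second mate on the reverse strand
print(decode_flag(4))     # an unmapped read
```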
● POS: 1-based leftmost mapping POSition of the first CIGAR operation that
“consumes” a reference base (see table below). The first base in a reference
sequence has coordinate 1. POS is set as 0 for an unmapped read without
coordinate. If POS is 0, no assumptions can be made about RNAME and
CIGAR.
● MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position
is wrong}, rounded to the nearest integer. A value 255 indicates that the
mapping quality is not available.
● CIGAR: CIGAR string. The CIGAR
operations are given in the following table (set ‘*’ if unavailable).
● RNEXT: Reference sequence name of the primary alignment of the NEXT read in the template. For the last read, the
next read is the first read in the template. If @SQ header lines are present, RNEXT (if not ‘*’ or ‘=’) must be present in
one of the SQ-SN tags.
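Two of these fields lend themselves to a small worked example: MAPQ follows directly from its formula, and a CIGAR string can be split into operations to find how many reference and read bases it covers. The sketch below relies on the SAM specification's rules that M, D, N, = and X consume the reference while M, I, S, = and X consume the read; the CIGAR string and function names are made up.

```r
# MAPQ from the probability that the mapping position is wrong.
mapq <- function(p_wrong) round(-10 * log10(p_wrong))

# Split a CIGAR string into operations and their lengths.
parse_cigar <- function(cigar) {
  lengths <- as.integer(regmatches(cigar, gregexpr("[0-9]+", cigar))[[1]])
  ops     <- regmatches(cigar, gregexpr("[MIDNSHP=X]", cigar))[[1]]
  data.frame(op = ops, length = lengths)
}

# Number of reference bases spanned by the alignment.
reference_span <- function(cigar) {
  x <- parse_cigar(cigar)
  sum(x$length[x$op %in% c("M", "D", "N", "=", "X")])
}

# Number of read bases described by the CIGAR (must equal the length of SEQ).
query_length <- function(cigar) {
  x <- parse_cigar(cigar)
  sum(x$length[x$op %in% c("M", "I", "S", "=", "X")])
}

cigar <- "5S30M1000N20M2I13M"   # hypothetical spliced alignment
print(parse_cigar(cigar))
print(reference_span(cigar))
print(query_length(cigar))
print(mapq(0.001))
```

The large N operation is how a splice-aware aligner records an intron skipped by the read.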
● PNEXT: 1-based Position of the primary alignment of the NEXT read in the template. Set
as 0 when the information is unavailable. This field equals POS at the primary line of
the next read. If PNEXT is 0, no assumptions can be made on RNEXT and bit 0x20.
● TLEN: signed observed Template LENgth. For primary reads where the primary
alignments of all reads in the template are mapped to the same reference sequence, the
absolute value of TLEN equals the distance between the mapped end of the template
and the mapped start of the template, inclusively (i.e., end − start + 1).
● SEQ: segment SEQuence. This field can be a ‘*’ when the sequence is not stored. If not a
‘*’, the length of the sequence must equal the sum of lengths of M/I/S/=/X operations in
CIGAR. An ‘=’ denotes the base is identical to the reference base.
● QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ
format). A base quality is the phred-scaled base error probability which equals −10
log10 Pr{base is wrong}. This field can be a ‘*’ when quality is not stored. If not a ‘*’, SEQ
must not be a ‘*’ and the length of the quality string ought to equal the length of SEQ.
BAM FORMAT
● BAM is compressed in the BGZF format. All multi-byte numbers in BAM are little-endian, regardless of
the machine endianness. The format is formally described in the following table where values in brackets
are the default when the corresponding information is not available; an underlined word in uppercase
denotes a field in the SAM format.
● A BAM file (*.bam) is the compressed binary version of a SAM file that is used to represent aligned
sequences up to 128 Mb.
● BAM files contain a header section and an alignment section:
● Header—Contains information about the entire file, such as sample name, sample length, and alignment
method. Alignments in the alignments section are associated with specific information in the header
section.
● Alignments—Contains read name, read sequence, read quality, alignment information, and custom tags.
The read name includes the chromosome, start coordinate, alignment quality, and the match descriptor
string.
● The alignments section includes the following information for each read or read pair:
❖ RG: Read group, which indicates the number of reads for a specific sample.
❖ BC: Barcode tag, which indicates the demultiplexed sample ID associated with the read.
❖ SM: Single-end alignment quality.
❖ AS: Paired-end alignment quality.
❖ NM: Edit distance tag, which records the Levenshtein distance between the read and the reference.
❖ XN: Amplicon name tag, which records the amplicon tile ID associated with the read.
BAI FORMAT
QUANTIFICATION TOOLS
COUNTING RULES
• Count reads, not base pairs.
• Discard a read if:
❖ it cannot be uniquely mapped
❖ its alignment overlaps with several genes
❖ the alignment quality score is bad
❖ (for paired-end reads) the mates do not map to the same gene
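The counting rules can be sketched on a toy table of alignments. Everything below is hypothetical: the column names, the MAPQ threshold of 10 and the convention of joining overlapping genes with ";" are assumptions for illustration, and the paired-end rule is left out to keep the example short.

```r
# Toy alignment summary: one row per read.
alignments <- data.frame(
  read_id = c("r1", "r2", "r3", "r4", "r5", "r6"),
  n_hits  = c(1, 1, 3, 1, 1, 1),                       # number of mapping locations
  mapq    = c(60, 60, 3, 5, 60, 60),
  genes   = c("geneA", "geneA", "geneB", "geneB", "geneA;geneB", "geneB"),
  stringsAsFactors = FALSE
)

min_mapq <- 10   # assumed quality threshold

keep <- alignments$n_hits == 1 &                         # uniquely mapped
  alignments$mapq >= min_mapq &                          # acceptable alignment quality
  !grepl(";", alignments$genes, fixed = TRUE)            # overlaps exactly one gene

gene_counts <- table(alignments$genes[keep])
print(gene_counts)
```

Reads r3 (multi-mapped), r4 (low quality) and r5 (ambiguous between two genes) are discarded, so each remaining read contributes one count to one gene.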
Gene expression table
DIFFERENTIAL GENE EXPRESSION
• The advent of gene expression measurement with RNA sequencing (RNA-seq) technology has
affected the number of microarray studies being undertaken
• Differential gene expression (DGE) analysis requires that gene expression values be compared
between sample group types.
• RPKM and FPKM normalize for the most important factor when comparing samples: sequencing depth.
• Two of the most commonly used programs, DESeq and edgeR, perform similarly in ranking
differentially expressed genes, but DESeq has been shown to have relatively conservative false
discovery rates (FDRs).
• Manipulating gene expression to treat disease.
• Identify gene expression patterns that will improve patient survival by providing information about
tumor prognosis and sensitivity to specific therapies.
NEGATIVE BINOMIAL ALGORITHM
Briefly, the steps performed by the DESeq function are:
1. estimation of size factors by estimateSizeFactors
2. estimation of dispersion by estimateDispersions
3. negative binomial GLM fitting and Wald statistics by nbinomWaldTest
The differential expression analysis in DESeq2 uses a generalized linear model of the form:
$$K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i)$$
$$\mu_{ij} = s_j q_{ij}$$
$$\log_2(q_{ij}) = x_{j.} \beta_i$$
Counts $K_{ij}$ for gene i, sample j are modeled using a negative binomial distribution (NB) with fitted mean $\mu_{ij}$
and a gene-specific dispersion parameter $\alpha_i$. The fitted mean is composed of a sample-specific size
factor $s_j$ and a parameter $q_{ij}$ proportional to the expected true concentration of fragments for sample j.
The coefficients $\beta_i$ give the log2 fold changes for gene i for each column of the model matrix, where $x_{j.}$ is row j of that matrix.
If sample A has been sampled deeper than sample B, we expect its counts to be higher.
ESTIMATE SIZE FACTOR (median of ratios method)
RATIO OF ACTUAL COUNT AND SIZE FACTOR
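The median of ratios method and the division of counts by size factors can be worked through by hand in base R. The sketch below follows the usual recipe (geometric mean per gene, ratio of each sample to it, median ratio per sample) on a made-up count matrix; in a real analysis estimateSizeFactors does this step, and the numbers here are chosen only so the result is easy to check.

```r
# Toy count matrix: 4 genes x 3 samples. Sample B is sequenced twice as
# deeply as A, and sample C half as deeply.
counts <- matrix(
  c(100, 200,  50,  0,     # sample A
    200, 400, 100, 10,     # sample B
     50, 100,  25,  5),    # sample C
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), c("A", "B", "C"))
)

# Log of the geometric mean of each gene across samples.
# Genes with a zero count get -Inf and are skipped below.
log_geo_means <- rowMeans(log(counts))

# Size factor per sample: median ratio of its counts to the geometric means.
size_factors <- apply(counts, 2, function(sample_counts) {
  keep <- is.finite(log_geo_means) & sample_counts > 0
  exp(median((log(sample_counts) - log_geo_means)[keep]))
})
print(size_factors)

# Normalized counts: actual count divided by the sample's size factor.
normalized <- sweep(counts, 2, size_factors, "/")
print(normalized)
```

After normalization, gene1 to gene3 have identical values in all three samples, which is what we expect when the only difference between samples is sequencing depth.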
INPUT DATA FORMAT
DESeq2 commands (DESeq2-Nextgen.R)
setwd("/Users/apple/Documents/nextgen")
# Read the count matrix (genes x samples) and the sample annotation table
Count_data = read.table(file = "TCGA-count-data.csv", header = TRUE, sep = ",", row.names = 1, check.names = FALSE)
Col_data = read.table(file = "TCGA-column-data.csv", header = TRUE, sep = ",", row.names = 1)
rownames(Col_data)
colnames(Count_data)
# Sample names must match and be in the same order in both tables
all(rownames(Col_data) == colnames(Count_data))
boxplot(Count_data)
hist(Count_data[,1]) # Plotting only the first sample (column 1)
# DESeq2 is a Bioconductor package, so it is installed with BiocManager, not install.packages("DESeq2")
#install.packages("BiocManager"); BiocManager::install("DESeq2")
library(DESeq2) # load the DESeq2 package
# locate and count NA values in the matrix
which(is.na(Count_data), arr.ind = TRUE)
sum(is.na(Count_data))
# replace missing values with the mean of that gene (row) across samples
library(zoo)
Count_data[] <- t(na.aggregate(t(Count_data)))
Count_data
dds = DESeqDataSetFromMatrix(countData = round(Count_data), colData = Col_data, design = ~ condition) # we're testing for the different conditions
dds$condition <- relevel(dds$condition, ref = "normal") # use "normal" as the reference level
dds
dds <- DESeq(dds)
res1 <- results(dds)
summary(res1)
### keep only significant upregulated genes: padj < 0.05 and log2FoldChange > 1
resSigUp <- subset(res1, padj < 0.05 & log2FoldChange > 1)
write.csv(as.data.frame(resSigUp), "Upregulated.csv")
### keep only significant downregulated genes: padj < 0.05 and log2FoldChange < -1
resSigDown <- subset(res1, padj < 0.05 & log2FoldChange < -1)
write.csv(as.data.frame(resSigDown), "Downregulated.csv")
### keep up- and downregulated genes in one file with padj < 0.05
resSig <- subset(res1, padj < 0.05 & abs(log2FoldChange) > 1)
write.csv(as.data.frame(resSig), "DE.csv")
POINTS TO TAKE CARE OF WHILE RUNNING DESeq2
• The column names of count data should be in the same order as the row
names of the column data.
all(rownames(Col_data)==colnames(Count_data))
• Always define the reference sample using “relevel” command before actually
running DESeq2 in R.
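Both points can be checked and fixed before DESeq2 is run. The sketch below uses a tiny made-up count table and sample table (toy_counts and toy_coldata are hypothetical stand-ins for Count_data and Col_data) to show one way of reordering the count columns and setting the reference level.

```r
# Toy tables with the samples deliberately in a different order.
toy_counts <- data.frame(
  s2 = c(5, 0), s1 = c(10, 3),
  row.names = c("geneA", "geneB")
)
toy_coldata <- data.frame(
  condition = c("normal", "tumor"),
  row.names = c("s1", "s2")
)

# Point 1: the order check fails here ...
same_order <- all(rownames(toy_coldata) == colnames(toy_counts))
print(same_order)

# ... so make sure the same samples are present, then reorder the columns.
stopifnot(setequal(rownames(toy_coldata), colnames(toy_counts)))
toy_counts <- toy_counts[, rownames(toy_coldata), drop = FALSE]
print(all(rownames(toy_coldata) == colnames(toy_counts)))

# Point 2: make "normal" the reference level of the condition factor.
toy_coldata$condition <- relevel(factor(toy_coldata$condition), ref = "normal")
print(levels(toy_coldata$condition))
```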
Volcano Plot (DESeq2-Nextgen.R)
https://bioconductor.org/packages/devel/bioc/vignettes/EnhancedVolcano/inst/doc/EnhancedVolcano.html
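A volcano plot shows log2 fold change on the x-axis against -log10 of the adjusted p-value on the y-axis. The sketch below draws one with base R graphics only, so it runs without extra packages; the results table is simulated (not DESeq2 output), and the cutoffs of padj < 0.05 and |log2FoldChange| > 1 are the ones used elsewhere in this deck.

```r
set.seed(1)

# Simulated results table standing in for as.data.frame(res1).
n <- 500
res <- data.frame(
  log2FoldChange = rnorm(n, mean = 0, sd = 1.5),
  pvalue         = runif(n)^3            # skewed towards small p-values
)
res$padj <- p.adjust(res$pvalue, method = "BH")   # Benjamini-Hochberg adjustment

# Classify each gene with the same cutoffs used for filtering.
res$status <- "not significant"
res$status[res$padj < 0.05 & res$log2FoldChange >  1] <- "up"
res$status[res$padj < 0.05 & res$log2FoldChange < -1] <- "down"

status_colors <- c("not significant" = "grey60", up = "firebrick3", down = "navy")

plot(
  res$log2FoldChange, -log10(res$padj),
  col = status_colors[res$status], pch = 16,
  xlab = "log2 fold change", ylab = "-log10(adjusted p-value)",
  main = "Volcano plot (simulated data)"
)
abline(v = c(-1, 1), h = -log10(0.05), lty = 2)   # cutoff lines
legend("topright", legend = names(status_colors), col = status_colors, pch = 16)

print(table(res$status))
```

With real data, the simulated table would be replaced by the DESeq2 results after dropping rows whose padj is NA.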
Heatmap
pheatmap.R
install.packages("pheatmap") # run once
library(pheatmap)
getwd()
setwd("C:/Users/shivangi/Desktop/plots/Heatmaps/")
##write.csv(test, "heatmap_data2.csv")
pheat <- read.csv("heatmap_data2.csv", header = TRUE, row.names = 1, check.names = FALSE)
pheat <- as.matrix(pheat)
pheatmap(pheat) # default: cluster both rows and columns
pheatmap(pheat, cluster_rows = FALSE, cluster_cols = FALSE)
pheatmap(pheat, cluster_cols = FALSE)
pheatmap(pheat, cluster_rows = FALSE)
pheatmap(pheat, kmeans_k = 4) # aggregate rows into 4 k-means clusters before plotting
pheatmap(pheat, scale = "row", clustering_distance_rows = "correlation")
pheatmap(pheat, color = colorRampPalette(c("navy", "white", "firebrick3"))(50))
pheatmap(pheat, legend = FALSE)
pheatmap(pheat, display_numbers = TRUE)
pheatmap(pheat, display_numbers = TRUE, number_format = "%.1e")
pheatmap(pheat, cluster_rows = FALSE, legend_breaks = -1:4, legend_labels = c("0", "1e-4", "1e-3", "1e-2", "1e-1", "1"))
pheatmap(pheat, cellwidth = 15, cellheight = 12, main = "Example heatmap")
pheatmap(pheat, cellwidth = 15, cellheight = 12, fontsize = 8)
Standard measures of RNA Quantification
• These three metrics (RPKM, FPKM, and TPM) attempt to normalize for sequencing depth and gene length.
• FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was
sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one
read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM
takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
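The normalizations can be worked through on a small example. The sketch below computes RPKM and TPM in base R from a made-up count matrix and made-up gene lengths; FPKM uses the same formula as RPKM with fragments counted instead of reads. TPM is included here as the commonly used third metric alongside RPKM and FPKM.

```r
# Toy data: 3 genes x 2 samples; sample2 is sequenced twice as deeply.
counts <- matrix(
  c(10, 20,  5,     # sample1
    20, 40, 10),    # sample2
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)
gene_length_bp <- c(geneA = 2000, geneB = 4000, geneC = 1000)   # hypothetical lengths

# RPKM: normalize for depth first (reads per million), then for gene length (per kb).
per_million <- colSums(counts) / 1e6
rpm  <- sweep(counts, 2, per_million, "/")
rpkm <- rpm / (gene_length_bp / 1000)

# TPM: normalize for gene length first (reads per kb), then scale each sample to 1 million.
rpk <- counts / (gene_length_bp / 1000)
tpm <- sweep(rpk, 2, colSums(rpk), "/") * 1e6

print(rpkm)
print(tpm)
print(colSums(tpm))
```

Both samples give the same values once depth is accounted for, and the TPM columns always sum to one million, which makes TPM proportions directly comparable between samples.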
Downstream Analysis
• Gene set enrichment analysis
• Pathway analysis
• Differential gene expression
• Principal Component Analysis
Thank You
LinkedIn: https://www.linkedin.com/in/shivangi-agarwal-4a6a3467/
READ MATERIALS
● https://www.technologynetworks.com/genomics/articles/rna-seq-basics-applications-and-protocol-299461
● https://www.nature.com/articles/s41576-019-0150-2