RNA Seq R - Final Decode

Uploaded by

Komal Kumar Sahu

RNA SEQUENCING DATA ANALYSIS

with R

HANDS-ON

SHIVANGI AGARWAL
What will you learn?
✔ BASIC INTRODUCTION TO RNA-SEQUENCING
✔ RNA-SEQ EXPERIMENT AND BENCH WORKFLOW
✔ INFORMATICS STEPS IN RNA SEQ DATA ANALYSIS
✔ HANDS-ON ANALYSIS ON AN EXAMPLE RNA SEQUENCING DATA USING R
✔ INSTALLATION OF DIFFERENT R PACKAGES
✔ DOWNLOADING AND QUALITY CHECK USING FASTQC TOOLKIT
✔ DOWNLOADING AND INDEXING OF REFERENCE GENOME
✔ FILE FORMATS (FASTA, GTF, BAM, SAM)
✔ ALIGNMENT
✔ GENERATION OF COUNT MATRIX (GENE ID VS COUNT VALUE FOR EACH SAMPLE)
✔ DESeq2 ALGORITHM
✔ DIFFERENTIAL GENE EXPRESSION USING DESeq2 R PACKAGE
✔ PLOTTING OF HEATMAP AND VOLCANO PLOTS IN R

2
BASIC INTRODUCTION
● Ribonucleic acids (RNAs) are polymeric molecules consisting of
nucleotides, essential in the coding, decoding, regulation, and
expression of genes.
● These include mRNA, ribosomal RNA (rRNA), transfer RNA
(tRNA), long ncRNA (lncRNA; transcripts longer than 200
nucleotides not translated into protein), and many smaller
ncRNAs such as microRNA (miRNA).
● mRNA accounts for only 1-4% of total RNA in a population
● Most prevalent is rRNA, which typically accounts for 80–95% of
the total RNA population. The remainder of ncRNAs are present
in much smaller amounts.

3
mRNA

4
CENTRAL DOGMA OF MOLECULAR BIOLOGY

5
INTRODUCTION TO RNA-SEQ
● RNA-Seq (named as an abbreviation of RNA sequencing) is a sequencing technique which uses next-generation
sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment, analyzing the
continuously changing cellular transcriptome.
● Specifically, RNA-Seq facilitates the ability to look at alternative gene spliced transcripts, post-transcriptional
modifications, gene fusions, mutations/SNPs and changes in gene expression over time, or differences in gene
expression in different groups or treatments. In addition to mRNA transcripts, RNA-Seq can look at different populations
of RNA to include total RNA, small RNA, such as miRNA, tRNA and ribosomal profiling.
● RNA-Seq can also be used to determine exon/intron boundaries and verify or amend previously annotated 5’ and 3’ gene
boundaries. Recent advances in RNA-Seq include single-cell sequencing, in situ sequencing of fixed tissue, and native
RNA molecule sequencing with single-molecule real-time sequencing.
● RNA-Seq allows researchers to detect both known and novel features in a single assay, enabling the identification of transcript
isoforms, gene fusions, single nucleotide variants, and other features without the limitation of prior knowledge.

6
INTRODUCTION TO RNA-SEQ

7
BENEFITS OF RNA-SEQUENCING
RNA-Seq with next-generation sequencing (NGS) is increasingly the method of choice for scientists studying the transcriptome.

• Covers an extremely broad dynamic range

• Provides sensitive, accurate measurement of gene expression

• Captures both known and novel features; does not require predesigned probes

• Generates both qualitative and quantitative data

• Reveals the full transcriptome, not just a few selected transcripts

• Can be applied to any species, even if a reference sequence is not available

8
Two main approaches for RNA selection

9
Well defined genomes

10
Popular genome assemblies

11
RNA-SEQ EXPERIMENT AND WORKFLOW

12
GENERAL RNA SEQUENCING METHODS
RNA Seq library protocols

14
Bulk Seq vs Single-Cell Seq

15
Bulk Seq vs Single-Cell Seq cont..

16
Library strategies: single end vs paired end

Paired-end sequencing and alignment. Paired-end sequencing enables both ends of the DNA fragment to
be sequenced. Because the distance between each paired read is known, alignment algorithms can use
this information to map the reads over repetitive regions more precisely. Image courtesy of Illumina, Inc.

17
APPLICATIONS OF mRNA SEQUENCING
● mRNA-seq is a powerful tool to analyze the cell transcriptome profile. Novogene’s
professional services support research goals in a wide range of applications,
including:
● Quantitative profiling of transcripts in different tissues or samples, under various
conditions and treatments
● Discovery of novel transcripts, alternative splicing (AS), and transcript
variations
● Research of developmental mechanisms and drug resistance through tissue-specific
transcripts or time-course gene expression
● Biomarker discovery based on novel transcripts/isoforms, SNP/InDel
identification, and fusion gene analysis
● Omics analysis in combination with the transcriptome
● Investigation of pathogenic mechanisms and clinical subtypes in clinical diagnosis

20
STEPS IN ANALYSIS OF RNA SEQ DATA

Splicing-aware mapping: reads map across splice junctions. This requires a
“splicing-aware” aligner that can recognize the difference between a read
aligning across an exon–intron boundary and a read with a short insertion.

21
FASTQ DOWNLOAD AND SPLITTING

fastq-dump SRR20074028
Read 21980257 spots for SRR20074028
Written 21980257 spots for SRR20074028

fastq-dump SRR20074028 --split-files

Read 21980257 spots for SRR20074028
Written 21980257 spots for SRR20074028

https://www.ncbi.nlm.nih.gov/sra/SRR20074028

25
FASTQ FILE FORMAT

26
FASTQ FILE FORMAT

27
PHRED SCORE

28
PHRED SCORE DENOTED AS CHARACTERS

29
QUALITY CONTROL CHECK PARAMETERS
● Per base sequence quality
● Per sequence quality scores
● Per base sequence content
● Per sequence GC content
● Per base N content
● Overrepresented Sequences
● Adapter Content

30
FASTQC DOWNLOAD
FASTQC

UBUNTU USERS
sudo apt install fastqc
fastqc
or download the zip file https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
and unzip it using the commands
unzip fastqc_v0.11.9.zip
cd FastQC
chmod +x fastqc
./fastqc

WINDOWS USERS
Download the zip file from the link below
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip
unzip it
double click to open the “fastqc” file

31
FASTQC REPORT

32
FASTQC REPORT
● Basic Statistics: Simple information about the input FASTQ file: its name, type of quality score encoding, total
number of reads, read length and GC content.
● Per base sequence quality
A box-and-whisker plot showing aggregated quality score statistics at each position along all reads in the file.
The blue line is the mean quality score at each base position/window. A primer on sequencing quality
scores has been prepared by Illumina. The red line within each yellow box represents the median quality
score at that position/window. The yellow box is the interquartile range (25th to 75th percentile). The lower
and upper whiskers represent the 10th and 90th percentile scores.
What to look for: The distribution of average read quality should be fairly tight in the upper range of the
plot.
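The per-position statistics in this plot can be sketched in base R. This is only an illustration, assuming Phred+33 encoded quality strings of equal length; the three quality strings and the helper name per_base_quality are hypothetical, and FastQC itself additionally bins positions for long reads.

```r
# Hypothetical helper: summarise Phred+33 quality strings position by position.
per_base_quality <- function(qual_strings) {
  # Convert each string to Phred scores (ASCII code minus 33);
  # after t(), rows are reads and columns are base positions.
  scores <- t(sapply(qual_strings, function(q) utf8ToInt(q) - 33L,
                     USE.NAMES = FALSE))
  data.frame(
    position = seq_len(ncol(scores)),
    mean     = colMeans(scores),                                        # blue line
    median   = apply(scores, 2, median),                                # red line
    q25      = apply(scores, 2, quantile, probs = 0.25, names = FALSE), # box bottom
    q75      = apply(scores, 2, quantile, probs = 0.75, names = FALSE), # box top
    p10      = apply(scores, 2, quantile, probs = 0.10, names = FALSE), # lower whisker
    p90      = apply(scores, 2, quantile, probs = 0.90, names = FALSE)  # upper whisker
  )
}

# Made-up quality strings for three reads of length 5.
quals <- c("IIIII", "II5I#", "IIII5")
qc <- per_base_quality(quals)
print(qc)
```

A drop in the mean or median towards the 3' end of the reads is the typical pattern to watch for in this table.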

33
FASTQC REPORT
Per sequence quality scores

● A plot of the total number of reads vs the average quality score over full
length of that read.
● What to look for: The distribution of average read quality should be
fairly tight in the upper range of the plot.

34
FASTQC REPORT
Per base sequence content
● This plot reports the percent of bases called for each of the four nucleotides at each position across all
reads in the file. Again, the X-axis is non-uniform as described for Per base sequence quality.
● What to look for: For whole genome shotgun DNA sequencing the proportion of each of the four bases
should remain relatively constant over the length of the read with %A=%T and %G=%C. With most
RNA-Seq library preparation protocols there is clear non-uniform distribution of bases for the first
10-15 nucleotides; this is normal and expected depending on the type of library kit used (e.g. TruSeq RNA
Library Preparation). RNA-Seq data showing this non-uniform base composition will always be classified
as Failed by FastQC for this module even though the sequence is perfectly good.

35
FASTQC REPORT
Per sequence GC content
● Plot of the number of reads vs. GC% per read. The
displayed Theoretical Distribution assumes a uniform
GC content for all reads.
● What to look for: For whole genome shotgun sequencing
the expectation is that the GC content of all reads should
form a normal distribution with the peak of the curve at
the mean GC content for the organism sequenced. If the
observed distribution deviates too far from the
theoretical, FastQC will call a Fail. There are many
situations in which this may occur which are expected so
the assignment can be ignored. For example, in RNA
sequencing there may be a greater or lesser
distribution of mean GC content among transcripts
causing the observed plot to be wider or narrower
than an idealized normal distribution. The plot below
is from some very high quality RNA-Seq data yet FastQC
still assigned a Warn flag to it because the observed
distribution was narrower than the theoretical.
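The quantity behind this plot is simply the GC percentage of each read. A minimal sketch in base R, assuming reads are plain character strings; the four reads and the helper name gc_percent are made up for illustration.

```r
# Hypothetical helper: GC percentage of each read.
gc_percent <- function(reads) {
  bases <- strsplit(toupper(reads), "")
  sapply(bases, function(b) 100 * sum(b %in% c("G", "C")) / length(b))
}

# Made-up reads.
reads <- c("GGCC", "ATAT", "ATGC", "GCGA")
gc <- gc_percent(reads)
print(gc)
```

Calling hist(gc) on a real set of reads gives the observed distribution that FastQC compares against its theoretical curve.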

36
FASTQC REPORT
Per base N content
● Percent of bases at each position or bin with no base call, i.e. ‘N’. What to expect: You should
never see any point where this curve rises noticeably above zero. If it does this indicates a
problem occurred during the sequencing run. The example below is a case where an error
caused the instrument to be unable to call a base for approximately 20% of the reads at
position 29.

37
FASTQC REPORT
Duplication Levels
● Percentage of reads of a given sequence in the file which are present a given number of times in the file. (This is the
blue line. The red line is more difficult to interpret.) There are generally two sources of duplicate reads: PCR
duplication in which library fragments have been over represented due to biased PCR enrichment or truly over
represented sequences such as very abundant transcripts in an RNA-Seq library. The former is a concern because
PCR duplicates misrepresent the true proportion of sequences in your starting material. The latter is an expected
case and not of concern because it does faithfully represent your input.
What to expect: When sequencing RNA
there will be some very highly abundant
transcripts and some lowly abundant. It is
expected that duplicate reads will be
observed for high abundance transcripts.
The RNA-Seq data below was flagged as
Failed by FastQC even though the
duplication is expected in this case.

38
FASTQC REPORT
Overrepresented Sequences
● List of sequences which appear more than expected in the file. Only the first 50bp are
considered. A sequence is considered overrepresented if it accounts for ≥ 0.1% of the total
reads.
● What to expect: In DNA-Seq data no single sequence should be present at a high enough
frequency to be listed, though it is not unusual to see a small percentage of adapter reads. For
RNA-Seq data it is possible that there may be some transcripts that are so abundant that they
register as overrepresented sequence.
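The rule stated above (first 50 bp only, flagged at 0.1% or more of all reads) can be sketched in base R. The read set below is artificial, and the function name overrepresented is hypothetical; FastQC's own implementation differs in details such as how many reads it tracks.

```r
# Hypothetical helper: sequences whose first `prefix_len` bases account for
# at least `threshold` (as a fraction) of all reads.
overrepresented <- function(reads, threshold = 0.001, prefix_len = 50) {
  prefixes <- substr(reads, 1, prefix_len)
  freq <- table(prefixes) / length(reads)
  sort(freq[freq >= threshold], decreasing = TRUE)
}

# Artificial data: all 1024 distinct 5-mers once, plus 5 copies of one sequence.
kmers <- do.call(paste0, expand.grid(rep(list(c("A", "C", "G", "T")), 5),
                                     stringsAsFactors = FALSE))
reads <- c(kmers, rep("ACGTACGTAC", 5))
hits <- overrepresented(reads)
print(hits)
```

In this toy set each 5-mer makes up 1/1029 of the reads, just under the threshold, so only the repeated sequence is reported.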
Adapter Content
● Cumulative plot of the fraction of reads where the sequence library adapter sequence is
identified at the indicated base position. Only adapters specific to the library type are searched.
● The example below is for a high quality RNA-Seq library with a small percentage of the library
having inserts smaller than 150bp.

39
FASTQC REPORT
Kmer Content
● Measures the count of each short nucleotide sequence of
length k (default = 7) starting at each position
along the read. Any given Kmer should be
evenly represented across the length of the
read. A list of Kmers which appear at specific
positions with greater than expected
frequency is reported.
● RNA-seq libraries may have highly
represented Kmers that are derived from
highly expressed sequences.

40
FASTA FORMAT
● The FASTA format is a text-based format for representing either nucleotide
sequences or amino acid (protein) sequences. The format originates from the
FASTA software package, but has now become a near universal standard in the
field of bioinformatics.
● The description line (defline) or header/identifier line, which begins with '>',
gives a name and/or a unique identifier for the sequence, and may also contain
additional information. In a deprecated practice, the header line sometimes
contained more than one header, separated by a ^A (Control-A) character. In the
original Pearson FASTA format, one or more comments, distinguished by a
semi-colon at the beginning of the line, may occur after the header. Some
databases and bioinformatics applications do not recognize these comments and
follow the NCBI FASTA specification. An example of a multiple sequence FASTA
file follows:
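As an illustration of the format, here is a small multi-sequence FASTA example together with a base R sketch that reads it. The two records are invented, and the helper name read_fasta_lines is hypothetical; it assumes every header is followed by at least one sequence line and ignores the older ';' comment lines.

```r
# Invented two-record FASTA content, one element per line of the file.
fasta_lines <- c(
  ">seq1 hypothetical first sequence",
  "ACGTACGT",
  "GGCC",
  ">seq2 hypothetical second sequence",
  "TTGACA"
)

# Hypothetical helper: named character vector of sequences, keyed by identifier.
read_fasta_lines <- function(lines) {
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)            # which record each line belongs to
  headers <- sub("^>", "", lines[is_header])
  seqs <- tapply(lines[!is_header], record_id[!is_header],
                 paste, collapse = "")      # join wrapped sequence lines
  setNames(as.character(seqs), sub(" .*$", "", headers))
}

seqs <- read_fasta_lines(fasta_lines)
print(seqs)
print(nchar(seqs))
```

For real files, readLines("file.fasta") would supply the lines; dedicated packages handle the edge cases this sketch skips.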

42
GTF (Gene Transfer Format)
• Contains 9 tab-separated columns of data, plus optional track definition lines.
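The nine columns can be read into R as a plain tab-separated table. The sketch below uses two invented annotation lines; the column names follow the usual GTF convention (seqname, source, feature, start, end, score, strand, frame, attribute), and the gene identifiers are hypothetical.

```r
# Two invented GTF lines (tab-separated, 9 columns).
gtf_text <- c(
  'chr1\texample\tgene\t1000\t5000\t.\t+\t.\tgene_id "G1"; gene_name "GeneOne";',
  'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";'
)

gtf <- read.delim(
  text = gtf_text, header = FALSE, sep = "\t", quote = "",
  comment.char = "#", stringsAsFactors = FALSE,
  col.names = c("seqname", "source", "feature", "start", "end",
                "score", "strand", "frame", "attribute")
)

# Pull the gene_id out of the free-text attribute column.
gene_id <- sub('.*gene_id "([^"]+)".*', "\\1", gtf$attribute)

# Feature length in bases (coordinates are 1-based and inclusive).
feature_length <- gtf$end - gtf$start + 1
print(data.frame(gtf$feature, gene_id, feature_length))
```

For a real annotation, replacing text = gtf_text with the file path reads the whole file the same way.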

43
GTF FILE

44
ALIGNERS

A “splicing-aware” aligner can recognize the difference between a read aligning across an exon–intron
boundary and a read with a short insertion.
A splice-aware aligner knows not to align RNA-seq reads to introns; instead, it identifies possible downstream
exons and tries to align to those, ignoring the introns altogether.

45
STAR ALIGNER ALGORITHM
STAR is shown to have high accuracy and outperforms other aligners by more than a factor of 50 in mapping speed, but it is memory intensive.
The algorithm achieves this highly efficient mapping by performing a two-step process:
1. Seed searching
2. Clustering, stitching, and scoring

Seed searching
• For every read that STAR aligns, STAR will search for the longest sequence (MMP) that exactly matches one or more locations on the
reference genome. These longest matching sequences are called the Maximal Mappable Prefixes (MMPs).
• The different parts of the read that are mapped separately are called ‘seeds’. So the first MMP that is mapped to the genome is
called seed1.
• STAR will then search again for only the unmapped portion of the read to find the next longest sequence that exactly matches the reference
genome, or the next MMP, which will be seed2.
• STAR uses an uncompressed suffix array (SA) to efficiently search for the MMPs. This allows for quick searching against even the largest
reference genomes.
• If STAR does not find an exact matching sequence for each part of the read due to mismatches or indels, the previous MMPs will
be extended.
• If extension does not give a good alignment, then the poor quality or adapter sequence (or other contaminating sequence) will be
soft clipped.
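The sequential search for MMPs can be illustrated with a toy example in base R. This is not STAR's implementation: it uses a naive string search instead of a suffix array, handles exact matches only, and the genome, read and function names are all invented for illustration.

```r
# Toy search for the longest prefix of `read` that occurs exactly in `genome`.
longest_prefix_match <- function(read, genome) {
  for (len in seq(nchar(read), 1)) {
    prefix <- substr(read, 1, len)
    pos <- regexpr(prefix, genome, fixed = TRUE)
    if (pos > 0) {
      return(list(seed = prefix, genome_pos = as.integer(pos), length = len))
    }
  }
  list(seed = "", genome_pos = NA_integer_, length = 0L)
}

# Repeat the search on the unmapped remainder of the read: seed1, seed2, ...
find_seeds <- function(read, genome) {
  seeds <- list()
  offset <- 0
  while (offset < nchar(read)) {
    hit <- longest_prefix_match(substr(read, offset + 1, nchar(read)), genome)
    if (hit$length == 0) break  # no exact match: STAR would extend or soft clip
    seeds[[length(seeds) + 1]] <- hit
    offset <- offset + hit$length
  }
  seeds
}

# Invented genome: exon 1, a 10-base intron, exon 2.
genome <- paste0("AAACCCGGG", "TTTTTTTTTT", "ACGTACGT")
read   <- "CCCGGGACGTAC"   # spans the exon 1 / exon 2 junction
seeds  <- find_seeds(read, genome)
for (s in seeds) cat(s$seed, "maps at genome position", s$genome_pos, "\n")
```

The two seeds land on either side of the intron, which is the information the later clustering and stitching step uses to build a spliced alignment.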
46
STAR ALIGNER ALGORITHM [Seed Searching]

Searching of MMPs: seed2

Searching of MMPs: seed1

Reads with mismatching/indels

Soft clipped

47
STAR ALIGNER ALGORITHM (Stitching)
Clustering, stitching, and scoring

• The separate seeds are stitched together to create a complete read by first clustering the seeds together based on proximity to a set of

‘anchor’ seeds, or seeds that are not multi-mapping.

• Then the seeds are stitched together based on the best alignment for the read (scoring based on mismatches, indels, gaps, etc.).

48
51
SAM FORMAT
● SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header
section, which is optional, and an alignment section.
● Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for
essential alignment information such as mapping position, and variable number of optional fields for
flexible or aligner specific information.
● The header section: Each header line begins with the character ‘@’ followed by one of the two-letter
header record type codes defined in this section. In the header, each line is TAB-delimited and, apart from
@CO lines, each data field follows a format ‘TAG:VALUE’ where TAG is a two-character string that defines
the format and content of VALUE. Thus header lines match
/^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/.
● The alignment section: mandatory fields. In the SAM format, each alignment line typically represents the
linear alignment of a segment. Each line consists of 11 or more TAB-separated fields.

53
SAM FORMAT

54
SAM FORMAT

55
SAM FORMAT
● The alignment section: mandatory fields. In the SAM format, each alignment line
typically represents the linear alignment of a segment. Each line consists of 11 or more
TAB-separated fields. The first eleven fields are always present and in the order shown
below; if the information represented by any of these fields is unavailable, that field’s
value will be a placeholder, either ‘0’ or ‘*’ as determined by the field’s type.
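Splitting an alignment line into its eleven mandatory fields can be sketched in base R. The alignment line below is invented, and parse_sam_line is a hypothetical helper; the field names are those of the SAM specification.

```r
# The 11 mandatory SAM fields, in order.
sam_fields <- c("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")

# Hypothetical helper: split one alignment line into a named list.
parse_sam_line <- function(line) {
  parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
  stopifnot(length(parts) >= 11)
  rec <- as.list(setNames(parts[1:11], sam_fields))
  # These five fields are integers in the specification.
  for (f in c("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")) {
    rec[[f]] <- as.integer(rec[[f]])
  }
  rec$optional <- parts[-(1:11)]   # any TAG:TYPE:VALUE fields that follow
  rec
}

# Invented alignment line for illustration.
line <- paste("read1", 99, "chr1", 100, 60, "8M", "=", 150, 58,
              "ACGTACGT", "IIIIIIII", "NM:i:0", sep = "\t")
rec <- parse_sam_line(line)
str(rec)
```

Header lines (starting with '@') would need to be filtered out before applying this to a real SAM file.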

56
SAM FORMAT
● QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come
from the same template. A QNAME ‘*’ indicates the information is unavailable. In a SAM file, a
read may occupy multiple alignment lines, when its alignment is chimeric or when multiple
mappings are given.
● FLAG: Combination of bitwise FLAGs.
● RNAME: Reference sequence NAME of the alignment. If @SQ header lines are present, RNAME (if not ‘*’) must be
present in one of the SQ-SN tags. An unmapped segment without coordinate has a ‘*’ at this field.

57
SAM FORMAT
● POS: 1-based leftmost mapping POSition of the first CIGAR operation that
“consumes” a reference base (see table below). The first base in a reference
sequence has coordinate 1. POS is set as 0 for an unmapped read without
coordinate. If POS is 0, no assumptions can be made about RNAME and
CIGAR.
● MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position
is wrong}, rounded to the nearest integer. A value 255 indicates that the
mapping quality is not available.
● CIGAR: CIGAR string. The CIGAR
operations are given in the following table (set ‘*’ if unavailable).
● RNEXT: Reference sequence name of the primary alignment of the NEXT read in the template. For the last read, the
next read is the first read in the template. If @SQ header lines are present, RNEXT (if not ‘*’ or ‘=’) must be present in
one of the SQ-SN tags.
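A CIGAR string can be split into its operations to work out how much of the reference and of the read an alignment covers. In the SAM specification, M, D, N, = and X consume reference bases, while M, I, S, = and X consume read bases. The sketch below is base R with hypothetical function names and an invented CIGAR string.

```r
# Hypothetical helper: split a CIGAR string into operations and lengths.
parse_cigar <- function(cigar) {
  lens <- as.integer(regmatches(cigar, gregexpr("[0-9]+", cigar))[[1]])
  ops  <- regmatches(cigar, gregexpr("[MIDNSHP=X]", cigar))[[1]]
  data.frame(op = ops, len = lens, stringsAsFactors = FALSE)
}

# Number of reference bases covered by the alignment.
reference_span <- function(cigar) {
  x <- parse_cigar(cigar)
  sum(x$len[x$op %in% c("M", "D", "N", "=", "X")])
}

# Number of read bases described; should equal the length of SEQ.
query_length <- function(cigar) {
  x <- parse_cigar(cigar)
  sum(x$len[x$op %in% c("M", "I", "S", "=", "X")])
}

# Invented spliced alignment: 5 soft-clipped bases, a 1000-base intron (N),
# and a 2-base insertion.
cigar <- "5S20M1000N30M2I10M"
print(parse_cigar(cigar))
print(reference_span(cigar))
print(query_length(cigar))
```

The rightmost aligned reference position is then POS + reference_span(cigar) - 1.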

58
SAM FORMAT
● PNEXT: 1-based Position of the primary alignment of the NEXT read in the template. Set
as 0 when the information is unavailable. This field equals POS at the primary line of
the next read. If PNEXT is 0, no assumptions can be made on RNEXT and bit 0x20.
● TLEN: signed observed Template LENgth. For primary reads where the primary
alignments of all reads in the template are mapped to the same reference sequence, the
absolute value of TLEN equals the distance between the mapped end of the template
and the mapped start of the template, inclusively (i.e., end − start + 1).
● SEQ: segment SEQuence. This field can be a ‘*’ when the sequence is not stored. If not a
‘*’, the length of the sequence must equal the sum of lengths of M/I/S/=/X operations in
CIGAR. An ‘=’ denotes the base is identical to the reference base.
● QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ
format). A base quality is the phred-scaled base error probability which equals −10
log10 Pr{base is wrong}. This field can be a ‘*’ when quality is not stored. If not a ‘*’, SEQ
must not be a ‘*’ and the length of the quality string ought to equal the length of SEQ.
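The relationship between the QUAL characters, Phred scores and error probabilities can be checked in a few lines of base R. The function names and the example quality string are invented for illustration; the offset of 33 is the one stated above.

```r
# Phred score of each character: ASCII code minus the offset (33 for SAM/Sanger).
phred_from_ascii <- function(qual, offset = 33L) utf8ToInt(qual) - offset

# Q = -10 * log10(p)  <=>  p = 10^(-Q / 10)
error_prob <- function(q) 10^(-q / 10)

q <- phred_from_ascii("I5+#")   # made-up quality string
p <- error_prob(q)
print(data.frame(char = strsplit("I5+#", "")[[1]], phred = q, error_prob = p))
```

A score of 40 corresponds to one expected error in 10,000 base calls, and a score of 20 to one in 100.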

59
SAM FORMAT

60
BAM FORMAT
● BAM is compressed in the BGZF format. All multi-byte numbers in BAM are little-endian, regardless of
the machine endianness. The format is formally described in the following table where values in brackets
are the default when the corresponding information is not available; an underlined word in uppercase
denotes a field in the SAM format.
● A BAM file (*.bam) is the compressed binary version of a SAM file that is used to represent aligned
sequences up to 128 Mb.
● BAM files contain a header section and an alignment section:
● Header—Contains information about the entire file, such as sample name, sample length, and alignment
method. Alignments in the alignments section are associated with specific information in the header
section.
● Alignments—Contains read name, read sequence, read quality, alignment information, and custom tags.
The read name includes the chromosome, start coordinate, alignment quality, and the match descriptor
string.
● The alignments section includes the following information for each read or read pair:
❖ RG: Read group, which indicates the number of reads for a specific sample.
❖ BC: Barcode tag, which indicates the demultiplexed sample ID associated with the read.
❖ SM: Single-end alignment quality.
❖ AS: Paired-end alignment quality.
❖ NM: Edit distance tag, which records the Levenshtein distance between the read and the reference.
❖ XN: Amplicon name tag, which records the amplicon tile ID associated with the read.

61
BAM FORMAT

62
BAI FORMAT

63
QUANTIFICATION TOOLS

64
COUNTING RULES
Count reads, not base-pairs.

Count each read at most once.

Discard a read if
• it cannot be uniquely mapped
• its alignment overlaps with several genes
• the alignment quality score is bad
• (for paired-end reads) the mates do not map to the same gene
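These rules can be illustrated on a toy table of alignments in base R. Everything in the sketch is an assumption made for illustration: the column names, the six reads, and the mapping-quality cutoff of 10; real counting tools work on BAM files and annotation, not on a table like this.

```r
# Invented alignments: number of mapping locations, mapping quality,
# and the gene(s) each alignment overlaps (";"-separated).
alignments <- data.frame(
  read  = c("r1", "r2", "r3", "r4", "r5", "r6"),
  nhits = c(1, 1, 3, 1, 1, 1),
  mapq  = c(60, 60, 1, 5, 60, 60),
  genes = c("geneA", "geneA", "geneB", "geneB", "geneA;geneB", "geneB"),
  stringsAsFactors = FALSE
)

# Hypothetical helper applying the discard rules, then counting reads per gene.
count_reads <- function(aln, min_mapq = 10) {
  n_genes <- lengths(strsplit(aln$genes, ";", fixed = TRUE))
  keep <- aln$nhits == 1 &        # uniquely mapped
          aln$mapq >= min_mapq &  # acceptable alignment quality
          n_genes == 1            # overlaps exactly one gene
  table(aln$genes[keep])
}

counts <- count_reads(alignments)
print(counts)
```

Here r3 is dropped as multi-mapping, r4 for low quality, and r5 because it overlaps two genes, so each remaining read is counted exactly once.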

65
Gene expression table

67
DIFFERENTIAL GENE EXPRESSION
• The advent of gene expression measurement with RNA sequencing (RNA-seq) technology has
affected the number of microarray studies being undertaken

• Differential gene expression (DGE) analysis requires that gene expression values be compared
between sample group types.
• RPKM and FPKM normalize for the most important factor when comparing samples: sequencing depth.
• Two of the most commonly used programs, DESeq and edgeR, perform similarly in ranking
differentially expressed genes, but DESeq has been shown to have relatively conservative false
discovery rates (FDRs).
• Manipulating Gene Expression to Treat Disease.
• Identify gene expression patterns that will improve patient survival by providing information about
tumor prognosis and sensitivity to specific therapies.

69
NEGATIVE BINOMIAL ALGORITHM
The steps performed by the DESeq function are, briefly:
1. estimation of size factors by estimateSizeFactors
2. estimation of dispersion by estimateDispersions
3. negative binomial GLM fitting and Wald statistics by nbinomWaldTest
The differential expression analysis in DESeq2 uses a generalized linear model of the form:

$$K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i)$$
$$\mu_{ij} = s_j \, q_{ij}$$
$$\log_2(q_{ij}) = x_{j.} \, \beta_i$$

Counts $K_{ij}$ for gene i, sample j are modeled using a negative binomial distribution (NB) with fitted mean $\mu_{ij}$
and a gene-specific dispersion parameter $\alpha_i$. The fitted mean is composed of a sample-specific size
factor $s_j$ and a parameter $q_{ij}$ proportional to the expected true concentration of fragments for sample j.
70
NEGATIVE BINOMIAL ALGORITHM
The coefficients $\beta_i$ give the log2 fold changes for gene i for each column of the model matrix.

If sample A has been sampled deeper than sample B, we expect counts to be higher

71
ESTIMATE SIZE FACTOR (median of ratios method)

72
RATIO OF ACTUAL COUNT AND SIZE FACTOR
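The median of ratios method and the division of counts by the size factor can be sketched in base R. This is a simplified illustration on an invented 4-gene, 2-sample matrix, with a hypothetical function name; in practice DESeq2's estimateSizeFactors does this step.

```r
# Hypothetical helper: size factors by the median of ratios method.
estimate_size_factors <- function(counts) {
  log_counts <- log(counts)
  log_geo_means <- rowMeans(log_counts)   # log geometric mean of each gene
  usable <- is.finite(log_geo_means)      # skip genes with a zero in any sample
  apply(log_counts[usable, , drop = FALSE], 2, function(col) {
    exp(median(col - log_geo_means[usable]))  # median ratio to the reference
  })
}

# Invented counts: sample B is sequenced twice as deeply as sample A.
counts <- matrix(c(10, 20,
                   100, 200,
                   50, 100,
                   0, 5),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), c("A", "B")))

size_factors <- estimate_size_factors(counts)
print(size_factors)

# Normalized counts: each column divided by its size factor.
normalized <- sweep(counts, 2, size_factors, "/")
print(normalized)
```

After normalization, genes 1 to 3 have the same value in both samples, which is what is expected when the only difference between A and B is sequencing depth.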

73
INPUT DATA FORMAT

COUNT DATA MATRIX COLUMN DATA

74
DESeq2 commands (DESeq2-Nextgen.R)
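setwd("/Users/apple/Documents/nextgen")
# Read the count matrix (genes x samples) and the sample annotation table
Count_data <- read.table(file = "TCGA-count-data.csv", header = TRUE, sep = ",", row.names = 1, check.names = FALSE)
Col_data <- read.table(file = "TCGA-column-data.csv", header = TRUE, sep = ",", row.names = 1)
rownames(Col_data)
colnames(Count_data)
# Sample names must match and be in the same order in both tables
all(rownames(Col_data) == colnames(Count_data))
boxplot(Count_data)
hist(Count_data[, 1]) # Plotting only the first sample (column 1)
# DESeq2 is a Bioconductor package, so install.packages("DESeq2") does not work; use BiocManager instead:
# install.packages("BiocManager"); BiocManager::install("DESeq2")
library(DESeq2) # load the DESeq2 package
# Locate and count NA values in the matrix
which(is.na(Count_data), arr.ind = TRUE)
sum(is.na(Count_data))
# Replace missing values with the mean of the same gene (row) across samples
library(zoo)
Count_data[] <- t(na.aggregate(t(Count_data)))
head(Count_data)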

75
DESeq2 commands (DESeq2-Nextgen.R)
dds <- DESeqDataSetFromMatrix(countData = round(Count_data), colData = Col_data, design = ~ condition) # we're testing for the different conditions
# Set "normal" as the reference level so fold changes are reported relative to it
dds$condition <- relevel(dds$condition, ref = "normal")
dds
dds <- DESeq(dds)
res1 <- results(dds)
summary(res1)
### keep only significantly upregulated genes: padj < 0.05 and log2FoldChange > 1
resSigUp <- subset(res1, padj < 0.05 & log2FoldChange > 1)
write.csv(as.data.frame(resSigUp), "Upregulated.csv")
### keep only significantly downregulated genes: padj < 0.05 and log2FoldChange < -1
resSigDown <- subset(res1, padj < 0.05 & log2FoldChange < -1)
write.csv(as.data.frame(resSigDown), "Downregulated.csv")
### keep UP and Down in one file with padj < 0.05
resSig <- subset(res1, padj < 0.05 & abs(log2FoldChange) > 1)
write.csv(as.data.frame(resSig), "DE.csv")

76
POINTS TO TAKE CARE OF WHILE RUNNING DESeq2
• The column names of count data should be in the same order as the row
names of the column data.

all(rownames(Col_data)==colnames(Count_data))

• Always define the reference sample using “relevel” command before actually
running DESeq2 in R.

dds$condition <- relevel(dds$condition, ref = "normal")
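When the order check fails, the count matrix can be reordered to match the column data before building the DESeq2 object. The sketch below is base R on invented data; it reuses the names Count_data and Col_data from the commands above, but the sample names and values are hypothetical.

```r
# Invented count matrix whose columns are NOT in the same order as Col_data.
Count_data <- matrix(1:12, nrow = 3,
                     dimnames = list(paste0("gene", 1:3),
                                     c("S3", "S1", "S4", "S2")))
Col_data <- data.frame(condition = c("normal", "tumor", "normal", "tumor"),
                       row.names = c("S1", "S2", "S3", "S4"))

# Both tables must describe the same set of samples.
stopifnot(setequal(rownames(Col_data), colnames(Count_data)))

# Reorder the count columns to follow the rows of the column data.
Count_data <- Count_data[, rownames(Col_data), drop = FALSE]
order_ok <- all(rownames(Col_data) == colnames(Count_data))
print(order_ok)

# Make the reference level explicit; otherwise R uses alphabetical order.
Col_data$condition <- relevel(factor(Col_data$condition), ref = "normal")
print(levels(Col_data$condition))
```

Indexing by name rather than by position avoids silently pairing counts with the wrong sample annotation.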

77
Volcano Plot(DESeq2-Nextgen.R)

78
Volcano Plot
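A volcano plot shows log2 fold change on the x-axis against -log10 adjusted p-value on the y-axis. The sketch below draws one with base R graphics on simulated values; it is not the EnhancedVolcano workflow, the data are random numbers rather than real results, and the cutoffs (|log2FC| > 1, padj < 0.05) mirror those used in the DESeq2 commands above.

```r
# Hypothetical helper: label each gene as Up, Down or not significant (NS).
classify_de <- function(res, lfc = 1, alpha = 0.05) {
  status <- rep("NS", nrow(res))
  sig <- !is.na(res$padj) & res$padj < alpha
  status[sig & res$log2FoldChange >  lfc] <- "Up"
  status[sig & res$log2FoldChange < -lfc] <- "Down"
  factor(status, levels = c("Down", "NS", "Up"))
}

# Simulated results table (random values, for illustration only).
set.seed(1)
res <- data.frame(log2FoldChange = rnorm(500, sd = 2), padj = runif(500))
res$status <- classify_de(res)

plot(res$log2FoldChange, -log10(res$padj),
     col = c("blue", "grey60", "red")[res$status], pch = 20,
     xlab = "log2 fold change", ylab = "-log10 adjusted p-value",
     main = "Volcano plot (simulated data)")
abline(v = c(-1, 1), h = -log10(0.05), lty = 2)  # cutoff lines
legend("topright", legend = levels(res$status),
       col = c("blue", "grey60", "red"), pch = 20)
```

With real data, as.data.frame(res1) from the DESeq2 commands would take the place of the simulated table.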

79
https://bioconductor.org/packages/devel/bioc/vignettes/EnhancedVolcano/inst/doc/EnhancedVolcano.html
Heatmap

80
Heatmap

81
pheatmap.R
# install.packages("pheatmap") # run once if the package is not yet installed
library(pheatmap)
getwd()
setwd("C:/Users/shivangi/Desktop/plots/Heatmaps/")
## write.csv(test, "heatmap_data2.csv")
pheat <- read.csv("heatmap_data2.csv", header = TRUE, row.names = 1, check.names = FALSE)
pheat <- as.matrix(pheat)
# Default heatmap: rows and columns are both clustered
pheatmap(pheat)
# Switch clustering off for rows, columns, or both
pheatmap(pheat, cluster_rows = FALSE, cluster_cols = FALSE)
pheatmap(pheat, cluster_cols = FALSE)
pheatmap(pheat, cluster_rows = FALSE)
# Aggregate rows into 4 k-means clusters before plotting
pheatmap(pheat, kmeans_k = 4)
# Scale each row and cluster rows by correlation distance
pheatmap(pheat, scale = "row", clustering_distance_rows = "correlation")
# Custom colour palette, legend and cell labels
pheatmap(pheat, color = colorRampPalette(c("navy", "white", "firebrick3"))(50))
pheatmap(pheat, legend = FALSE)
pheatmap(pheat, display_numbers = TRUE)
pheatmap(pheat, display_numbers = TRUE, number_format = "%.1e")
pheatmap(pheat, cluster_rows = FALSE, legend_breaks = -1:4, legend_labels = c("0", "1e-4", "1e-3", "1e-2", "1e-1", "1"))
# Fixed cell size, title and font size
pheatmap(pheat, cellwidth = 15, cellheight = 12, main = "Example heatmap")
pheatmap(pheat, cellwidth = 15, cellheight = 12, fontsize = 8)

82
Standard measures of RNA Quantification
• These three metrics (RPKM, FPKM and TPM) attempt to normalize for sequencing depth and gene length.

• RPKM (Reads Per Kilobase of transcript per Million mapped reads)

• Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
• Divide the read counts by the “per million” scaling factor.
• This normalizes for sequencing depth, giving you reads per million (RPM)
• Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.

• FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was
sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one
read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM
takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).

• TPM (Transcripts Per Million)

TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here’s how you calculate TPM:
• Divide the read counts by the length of each gene in kilobases.

• This gives you reads per kilobase (RPK).

• Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
• Divide the RPK values by the “per million” scaling factor. This gives you TPM.
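The two calculations differ only in the order of the steps, which is easiest to see side by side. The sketch below is base R on an invented count matrix with hypothetical gene lengths; the function names rpkm and tpm are chosen here for illustration.

```r
# Invented counts (genes x samples) and gene lengths in kilobases.
counts <- matrix(c(10, 20, 30,
                   20, 40, 10),
                 ncol = 2,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("S1", "S2")))
gene_length_kb <- c(geneA = 2, geneB = 4, geneC = 1)

# RPKM: normalize for sequencing depth first, then for gene length.
rpkm <- function(counts, length_kb) {
  per_million <- colSums(counts) / 1e6
  rpm <- sweep(counts, 2, per_million, "/")   # reads per million
  rpm / length_kb                             # each row divided by its gene length
}

# TPM: normalize for gene length first, then for sequencing depth.
tpm <- function(counts, length_kb) {
  rpk <- counts / length_kb                   # reads per kilobase
  per_million <- colSums(rpk) / 1e6
  sweep(rpk, 2, per_million, "/")
}

rpkm_values <- rpkm(counts, gene_length_kb)
tpm_values  <- tpm(counts, gene_length_kb)
print(colSums(rpkm_values))  # differs between samples
print(colSums(tpm_values))   # the same total in every sample
```

Because every sample's TPM values sum to the same total, a TPM value can be read as a proportion of the sample, which is not true of RPKM or FPKM.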

83
Downstream Analysis
• Gene set enrichment analysis
• Pathway analysis
• Differential gene expression
• Principal Component Analysis

84
Thank You

Email: [email protected]

LinkedIn: https://www.linkedin.com/in/shivangi-agarwal-4a6a3467/

85
READ MATERIALS

● https://www.technologynetworks.com/genomics/articles/rna-seq-basics-applications-and-protocol-299461
● https://www.nature.com/articles/s41576-019-0150-2

No ratings yet
Investigation of Diversity and Dominance of Fungal Biota in Stored Wheat Grains From Governmental Warehouses in West Bengal, India
12 pages
3D Hologram Technology in Learning Environment
No ratings yet
3D Hologram Technology in Learning Environment
12 pages
Gus 1
No ratings yet
Gus 1
1 page
Histochemical GUS Assay (Exp 6, CSS451)
No ratings yet
Histochemical GUS Assay (Exp 6, CSS451)
2 pages
Mammon_Goes_to_Silent
No ratings yet
Mammon_Goes_to_Silent
268 pages
Incident_Response_Simulation_Based_On_500_Use_Cases_1733765527
100% (1)
Incident_Response_Simulation_Based_On_500_Use_Cases_1733765527
45 pages
Solution Manual for Chemistry Principles and Reactions 8th Edition by Masterton Hurley ISBN 130507937X 9781305079373 - PDF Format Is Available With All Chapters
100% (18)
Solution Manual for Chemistry Principles and Reactions 8th Edition by Masterton Hurley ISBN 130507937X 9781305079373 - PDF Format Is Available With All Chapters
53 pages
Surabhi
No ratings yet
Surabhi
11 pages
cascade design
No ratings yet
cascade design
10 pages
Chapter 11 Dual Nature Of Radiation And Matter
No ratings yet
Chapter 11 Dual Nature Of Radiation And Matter
15 pages
Geography Chapter 1 - Resources and Development
No ratings yet
Geography Chapter 1 - Resources and Development
21 pages
EOHSMS-02-F06 Hot Work Permit
No ratings yet
EOHSMS-02-F06 Hot Work Permit
2 pages
b2 Mastery Booklet Organisation v3
No ratings yet
b2 Mastery Booklet Organisation v3
30 pages
Creating Augmented and Virtual Realities Erin Pangilinan & Steve Lukas & Vasanth Mohan download
100% (2)
Creating Augmented and Virtual Realities Erin Pangilinan & Steve Lukas & Vasanth Mohan download
37 pages
Cbse Sample Papers For Class 9 Social Science With Answers 2014 Paper 1
No ratings yet
Cbse Sample Papers For Class 9 Social Science With Answers 2014 Paper 1
11 pages
Gastroesophagial Reflux Disease Seminar
0% (1)
Gastroesophagial Reflux Disease Seminar
117 pages
Taser7 Product Card PDF
No ratings yet
Taser7 Product Card PDF
2 pages
Schneider Electric - Electronic Trip Circuit Breaker Basics
100% (3)
Schneider Electric - Electronic Trip Circuit Breaker Basics
20 pages
PNG-EGPP-MEC-SPE-00007-Tech. Spec. for Submersible Firewater Borehole Pumps_D01
No ratings yet
PNG-EGPP-MEC-SPE-00007-Tech. Spec. for Submersible Firewater Borehole Pumps_D01
25 pages
Bioaccumulation of Nickel by Five Wild Plant Species On Nickel-Contaminated Soil
No ratings yet
Bioaccumulation of Nickel by Five Wild Plant Species On Nickel-Contaminated Soil
6 pages
Research Proposal
No ratings yet
Research Proposal
14 pages
Challenges in Commercial Pig Production in Botswan PDF
No ratings yet
Challenges in Commercial Pig Production in Botswan PDF
11 pages
Influence of Geological Structures in Aiding Landslide Initiation in Chimanimani, Zimbabwe
No ratings yet
Influence of Geological Structures in Aiding Landslide Initiation in Chimanimani, Zimbabwe
10 pages
FOIA To A Judge and Clerk
100% (18)
FOIA To A Judge and Clerk
3 pages
InstallationManualAlphaESS - SMILE BAT 10.1P (AUS) 20220708
No ratings yet
InstallationManualAlphaESS - SMILE BAT 10.1P (AUS) 20220708
28 pages
Acids, Bases and Salts Notes
50% (2)
Acids, Bases and Salts Notes
4 pages
2 0 2 0 Rigging Study & Lifting Study: September
No ratings yet
2 0 2 0 Rigging Study & Lifting Study: September
57 pages
Anaphylaxis
100% (1)
Anaphylaxis
14 pages
Cold Chain 101 The First Steps: Andrew Gibson
No ratings yet
Cold Chain 101 The First Steps: Andrew Gibson
20 pages
Jetsun Milarepa
No ratings yet
Jetsun Milarepa
19 pages
ARCH 75 - AI - Lec Module 3 - LIGHTING SYSTEMS - 2019
No ratings yet
ARCH 75 - AI - Lec Module 3 - LIGHTING SYSTEMS - 2019
27 pages
MSDS A100 (Topsol) 2023
No ratings yet
MSDS A100 (Topsol) 2023
12 pages

RNA Seq R - Final Decode

Uploaded by

RNA Seq R - Final Decode

Uploaded by

RNA SEQUENCING DATA ANALYSIS

• Covers an extremely broad dynamic range

• Provides sensitive, accurate measurement of gene expression

• Generates both qualitative and quantitative data

• Reveals the full transcriptome, not just a few selected transcripts

• Can be applied to any species, even if a reference sequence is not available

Splice-aware mapping: Reads

fastq-dump SRR5924196 --split-files

Searching of MMPs: seed2

Reads with mismatching/indels

‘anchor’ seeds, or seeds that are not multi-mapping.

Count each read at most once.

COUNT DATA MATRIX COLUMN DATA

dds$condition <- relevel(dds$condition, ref = "normal")
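The relevel call above sets which condition is treated as the reference level, so that fold changes are reported relative to it. A minimal base-R sketch of what relevel does to a factor is shown below; the labels "cancer" and "normal" and the standalone `condition` vector are hypothetical stand-ins for the tutorial's column data, not the actual sample sheet.

```r
# Hypothetical sample labels; in the tutorial these would come from the column data.
condition <- factor(c("cancer", "normal", "cancer", "normal"))

# By default R orders factor levels alphabetically, so "cancer" comes first
# and would be used as the reference level.
levels_before <- levels(condition)
print(levels_before)

# relevel() moves the chosen level to the front, making it the reference
# without changing the sample labels themselves.
condition <- relevel(condition, ref = "normal")
levels_after <- levels(condition)
print(levels_after)
```

Setting the reference explicitly avoids depending on alphabetical order, which may not match the intended baseline group.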

• RPKM (Reads Per Kilobase Million)

• TPM (Transcripts Per Million)

• Dividing the read counts by each gene's length in kilobases gives you reads per kilobase (RPK).
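The difference between RPKM and TPM is the order of the two scaling steps, which can be sketched in base R as follows. The count matrix, gene names, sample names, and gene lengths are made-up illustration values, not data from this tutorial.

```r
# Hypothetical raw counts: 3 genes (rows) x 2 samples (columns).
counts <- matrix(c(10, 20, 30,
                   12, 25, 60),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# Hypothetical gene lengths in kilobases, in the same order as the rows.
gene_length_kb <- c(geneA = 2, geneB = 4, geneC = 1)

# RPKM: scale by sequencing depth first (reads per million), then by gene length.
per_million <- colSums(counts) / 1e6
rpkm <- sweep(counts, 2, per_million, "/") / gene_length_kb

# TPM: scale by gene length first (RPK), then by the per-sample sum of RPK.
rpk <- counts / gene_length_kb
tpm <- sweep(rpk, 2, colSums(rpk) / 1e6, "/")

# TPM columns share the same total; RPKM columns generally do not.
print(colSums(tpm))
print(colSums(rpkm))
```

Because every TPM column sums to the same total, the share of a gene within one sample can be compared directly with its share in another sample, which is not guaranteed for RPKM.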
