HMCW NGS Data Format

ngs

Uploaded by

pooja pandey

Basic NGS data format concepts

• Understanding Fastq
• Understanding Quality scores
• Quality control and read cleaning
• De novo genome assembly
• Assembly Quality evaluation
The Illumina HiSeq
Libraries, lanes, and flowcells
• Each reaction produces a unique library of DNA fragments for sequencing.
• Each NGS machine processes a single flowcell containing several independent lanes during a single sequencing run.
[Figure: an Illumina flowcell and its lanes]
Understanding FASTQ and FASTA file format
FASTQ format is a text-based format for storing a biological sequence and its corresponding quality scores. It has become the standard for storing the output of high-throughput sequencing instruments.

Each FASTQ record has four lines:
1. Line 1 begins with a '@' character and is followed by a sequence identifier
2. Line 2 is the raw sequence letters
3. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier
4. Line 4 encodes the quality values for the sequence in line 2 and must contain the same number of symbols as letters in the sequence

FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.

1. The first line in a FASTA file starts with a ">" (greater-than) symbol and is used for a unique description of the sequence
2. The following lines are the actual sequence in the standard one-letter code. The sequence may be on a single line or on multiple lines
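The two layouts described above can be illustrated with a minimal Python sketch. It assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines) and plain, uncompressed text; the names `FastqRecord`, `read_fastq`, and `read_fasta` are hypothetical helpers for illustration, not part of any library.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO, Tuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (an empty line also stops parsing)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("line 1 must start with '@'")
        if not separator.startswith("+"):
            raise ValueError("line 3 must start with '+'")
        if len(sequence) != len(quality):
            raise ValueError("lines 2 and 4 must have the same length")
        yield FastqRecord(header[1:], sequence, quality)


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs; sequences may span several lines."""
    description = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


fastq_records = list(read_fastq(io.StringIO("@read1\nACGT\n+\nIIII\n")))
fasta_records = list(read_fasta(io.StringIO(">seq1 demo\nACG\nTTA\n")))
```

For real data, an established parser (for example from Biopython) is usually a safer choice than hand-written code.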
Understanding Quality scores
FASTQ contains quality information.
Phred quality scores Q are defined as a property which is logarithmically related to the base-calling error probability P:

$Q = -10 \log_{10} P$

where P is the probability that the corresponding base call is incorrect.

Error (%)   Error probability P     Accuracy (%)   Accuracy (1 - P)   Phred quality score Q
10          (10/100) = 0.1          90             0.9                10
1           (1/100) = 0.01          99             0.99               20
0.1         (0.1/100) = 0.001       99.9           0.999              30
0.01        (0.01/100) = 0.0001     99.99          0.9999             40
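The relationship between Q and P in the table can be checked with a short Python sketch. The function names are hypothetical and chosen only for this illustration.

```python
import math


def phred_from_error(p: float) -> float:
    """Phred quality score Q for a base-calling error probability P."""
    return -10 * math.log10(p)


def error_from_phred(q: float) -> float:
    """Base-calling error probability P for a Phred quality score Q."""
    return 10 ** (-q / 10)


# Reproduce the rows of the table: error probability -> (Q, accuracy in %).
table = {
    p: (round(phred_from_error(p)), round((1 - p) * 100, 2))
    for p in (0.1, 0.01, 0.001, 0.0001)
}
```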
Understanding Quality scores
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@

The fourth line contains the base qualities, where BQ + 33 = ASCII value of the character shown in the base quality string
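As a small illustration of the offset of 33, the quality string from the record above can be decoded in Python. The helper name `decode_phred33` is hypothetical.

```python
from typing import List


def decode_phred33(quality: str) -> List[int]:
    """Convert a Phred+33 encoded quality string into base quality values."""
    return [ord(symbol) - 33 for symbol in quality]


quality_line = "BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@"
scores = decode_phred33(quality_line)
lowest = min(scores)  # the '7' symbol has the lowest quality in this read
```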
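The sequence ID @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG means:
EAS139    unique instrument name
136       run ID
FC706VJ   flowcell ID
2         flowcell lane
5         tile number within the lane
1000      x-coordinate of the cluster within the tile
12850     y-coordinate of the cluster within the tile
1         member of a pair (1 or 2)
Y         Y if the read is filtered (did not pass), N otherwise
18        control bits (0 when none are on)
ATCACG    index sequence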
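The colon-separated fields of such an identifier can be split programmatically. The sketch below assumes the Illumina CASAVA 1.8 style header used in this example (seven fields, a space, then four fields); the class and function names are hypothetical.

```python
from typing import NamedTuple


class IlluminaId(NamedTuple):
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int
    filtered: bool
    control: int
    index: str


def parse_illumina_id(header: str) -> IlluminaId:
    """Split a CASAVA 1.8 style FASTQ header into its named fields."""
    name, _, description = header.lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read, filtered, control, index = description.split(":")
    return IlluminaId(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y),
        int(read), filtered == "Y", int(control), index,
    )


parsed = parse_illumina_id("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG")
```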
Quality control and read cleaning
• Before alignment there is sometimes the need to preprocess/manipulate the
FASTA/FASTQ files to produce better mapping results. It is important to do quality
control checks to understand whether your data has any problems of which you should
be aware before doing any further analysis

• FastQC: quality control checks on raw sequence data coming from high throughput
sequencing pipelines (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/).
• Trimmomatic: A flexible read trimming tool for Illumina NGS data
(http://www.usadellab.org/cms/?page=trimmomatic)
Trimmomatic:
A flexible read trimming tool for Illumina NGS data
Paired End:
With most new data sets you can use gentle quality trimming and adapter clipping.

java -jar trimmomatic-0.39.jar PE \
    input_forward.fq.gz input_reverse.fq.gz \
    output_forward_paired.fq.gz output_forward_unpaired.fq.gz \
    output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36

Single End:
java -jar trimmomatic-0.35.jar SE -phred33 input.fq.gz output.fq.gz ILLUMINACLIP:TruSeq3-SE.fa:2:30:10

The trimming steps perform the following:

• Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
• Remove leading low quality or N bases (below quality 3) (LEADING:3)
• Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
• Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15
(SLIDINGWINDOW:4:15; add this step to the commands above to use it)
• Drop reads shorter than 36 bases (MINLEN:36)
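The quality-based steps listed above can be sketched in Python to show what each threshold does. This is a simplified illustration and not Trimmomatic's implementation: it works on a list of already decoded base qualities, skips adapter clipping, treats N bases like any other base, and cuts at the start of the first low-quality window. The function name `trim_read` is hypothetical.

```python
from typing import List, Optional, Tuple


def trim_read(
    quals: List[int],
    leading: int = 3,
    trailing: int = 3,
    window: int = 4,
    min_avg: float = 15,
    min_len: int = 36,
) -> Optional[Tuple[int, int]]:
    """Return the (start, end) slice of the read to keep, or None to drop it."""
    start, end = 0, len(quals)
    # LEADING: remove low-quality bases from the start of the read.
    while start < end and quals[start] < leading:
        start += 1
    # TRAILING: remove low-quality bases from the end of the read.
    while end > start and quals[end - 1] < trailing:
        end -= 1
    # SLIDINGWINDOW: cut where the window average first drops below min_avg.
    for pos in range(start, end - window + 1):
        if sum(quals[pos:pos + window]) / window < min_avg:
            end = pos
            break
    # MINLEN: drop reads that became too short.
    if end - start < min_len:
        return None
    return start, end


example_quals = [2, 2] + [30] * 40 + [10] * 6 + [2]
kept = trim_read(example_quals)
too_short = trim_read([30] * 20)
```

In this example the two leading bases and the final base are removed by the leading and trailing thresholds, and the run of quality-10 bases is cut by the sliding window.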
FastQC: quality controls of sequencing files
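[Figure: FastQC per base sequence quality plot. For each read position the box plot marks the 10th percentile, median, mean, and 90th percentile of the data; the background bands indicate very good, reasonable, and poor quality. The example read ACCTGGGATCAAACATTCAGGACATATAGCACAATAGGAC is labeled a very good quality sequence.]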


Genome Assembly
De novo assembly method:
De novo assembler software: SPAdes
SPAdes (St. Petersburg genome assembler) is a genome
assembly algorithm that was designed for single-cell
and multi-cell bacterial data sets
• K-mer creation
K-mer collection at 21, 33, 55, 77, 99 and 127
• K-mer joining (De Bruijn Graph)
Node and Edge joining
• Contig Generation
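The k-mer creation and joining steps above can be illustrated with a toy Python sketch. It is not how SPAdes is implemented (SPAdes combines several k values and performs error correction and graph simplification); it only shows the basic idea that each k-mer becomes an edge between its (k-1)-base prefix and suffix. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def kmers(sequence: str, k: int) -> List[str]:
    """Return every overlapping substring of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer is an edge."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].append(kmer[1:])  # prefix node -> suffix node
    return dict(graph)


toy_kmers = kmers("ATGGCGT", 3)
toy_graph = de_bruijn_graph(["ATGGCGT"], 3)
```

Walking the edges of `toy_graph` from "AT" spells the original sequence back out, which is the intuition behind contig generation.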

SPAdes running command

spades.py --careful -1 pair_end_R1 -2 pair_end_R2 -o outputfolder
Assembly Quality evaluation
Marker gene based (CheckM) (https://ecogenomics.github.io/CheckM/)

marker gene building


Command:
checkm lineage_wf -t 8 -x fa genomefolder checkmoutputfolder

genome qc check
Donovan H. Parks et al. Genome Res. 2015;25:1043-1055
Trimmomatic: for QC preprocessing
java -jar trimmomatic-0.39.jar PE \
    QCtestWadpt_L001_R1_001.fastq.gz QCtestWadpt_L001_R2_001.fastq.gz \
    Output_R1_P.fastq.gz Output_R1_UP.fastq.gz \
    Output_R2_P.fastq.gz Output_R2_UP.fastq.gz \
    ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:2:True LEADING:20 TRAILING:10 MINLEN:100
SPAdes for de novo assembly

spades.py --careful -t 2 -1 Output_R1_P.fastq.gz -2 Output_R2_P.fastq.gz -o wgsAssemby
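The two steps of this worked example can also be chained from a script. The Python sketch below only assembles the argument lists shown in this section and runs them with `subprocess`; it assumes that `java`, the Trimmomatic jar, the adapter file, and `spades.py` are available on the machine. The function names are hypothetical.

```python
import subprocess
from typing import List, Sequence


def trimmomatic_pe_command(
    jar: str, r1: str, r2: str, prefix: str, adapters: str, steps: Sequence[str]
) -> List[str]:
    """Build a paired-end Trimmomatic command as an argument list."""
    return [
        "java", "-jar", jar, "PE", r1, r2,
        f"{prefix}_R1_P.fastq.gz", f"{prefix}_R1_UP.fastq.gz",
        f"{prefix}_R2_P.fastq.gz", f"{prefix}_R2_UP.fastq.gz",
        f"ILLUMINACLIP:{adapters}:2:30:10:2:True", *steps,
    ]


def spades_command(r1: str, r2: str, outdir: str, threads: int = 2) -> List[str]:
    """Build a SPAdes command for one paired-end library."""
    return ["spades.py", "--careful", "-t", str(threads),
            "-1", r1, "-2", r2, "-o", outdir]


def run_pipeline() -> None:
    """Run trimming, then assembly; stop if either tool reports an error."""
    trim = trimmomatic_pe_command(
        "trimmomatic-0.39.jar",
        "QCtestWadpt_L001_R1_001.fastq.gz", "QCtestWadpt_L001_R2_001.fastq.gz",
        "Output", "NexteraPE-PE.fa",
        ["LEADING:20", "TRAILING:10", "MINLEN:100"],
    )
    assemble = spades_command(
        "Output_R1_P.fastq.gz", "Output_R2_P.fastq.gz", "wgsAssemby"
    )
    for command in (trim, assemble):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline()
```

Only the paired output files (`_P`) are passed on to the assembler, matching the commands above.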
