HMCW NGS Data Format
• Understanding Fastq
• Understanding Quality scores
• Quality control and read cleaning
• De novo Genome Assembling
• Assembly Quality evaluation
The Illumina HiSeq
Libraries, lanes, and flowcells
[Figure: an Illumina flowcell and its lanes]
Accuracy
Error (%)   Error probability     Accuracy (%)   Accuracy (1 - Error)   Phred Quality Score
10          10/100 = 0.1          90             0.9                    10
1           1/100 = 0.01          99             0.99                   20
0.1         0.1/100 = 0.001       99.9           0.999                  30
0.01        0.01/100 = 0.0001     99.99          0.9999                 40
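The rows of the accuracy table follow the Phred definition Q = -10 × log10(P), where P is the probability that a base call is wrong. A minimal Python sketch of that conversion is shown here; the function names are illustrative and do not come from any library.

```python
import math


def phred_from_error(p_error: float) -> float:
    """Phred quality score for a base-call error probability."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Error probability implied by a Phred quality score."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Reproduce the four rows of the accuracy table.
    for p in (0.1, 0.01, 0.001, 0.0001):
        q = phred_from_error(p)
        print(f"error={p:<7} accuracy={(1 - p) * 100:.2f}%  Q={q:.0f}")
```

Each factor of ten in accuracy adds 10 to the score, which is why Q30 (one error in 1000 bases) is a common rule-of-thumb threshold.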
Understanding Quality scores
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@
The fourth line contains the base qualities, where BQ + 33 = the ASCII value of the character shown in the base quality string.
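To make the offset concrete, the quality line of the record above can be decoded by subtracting 33 from each character's ASCII code. This is a minimal sketch assuming the phred33 encoding described here; `decode_qualities` is an illustrative name, not a library function.

```python
from typing import List

PHRED_OFFSET = 33  # phred33 encoding, as stated for this record


def decode_qualities(quality_string: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


quality_line = "BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@"
scores = decode_qualities(quality_line)
print(scores[:4])   # scores of the first four bases
print(min(scores))  # lowest-quality base in the read
```

For example, the character `B` has ASCII code 66, so it encodes a base quality of 66 - 33 = 33.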
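The sequence ID @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG means:
• EAS139: instrument name
• 136: run ID
• FC706VJ: flowcell ID
• 2: flowcell lane
• 5: tile number within the lane
• 1000: x-coordinate of the cluster within the tile
• 12850: y-coordinate of the cluster within the tile
• 1: member of a pair (1 or 2)
• Y: the read is filtered, i.e. it did not pass the filter (N otherwise)
• 18: control bits (0 when none are set)
• ATCACG: index sequence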
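Illumina headers in this style (Casava 1.8 and later) are colon-separated fields, with a space between the cluster location and the read information. The sketch below splits such an ID into named parts; the dictionary keys are illustrative labels rather than official field names, and other header styles (older Illumina or SRA-renamed reads) would need different handling.

```python
from typing import Dict


def parse_illumina_header(header: str) -> Dict[str, str]:
    """Split a Casava 1.8+ style FASTQ header into named fields."""
    if not header.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    # Part before the space locates the cluster; part after describes the read.
    location, read_info = header[1:].split(" ", 1)
    instrument, run, flowcell, lane, tile, x_pos, y_pos = location.split(":")
    read, is_filtered, control, index = read_info.split(":")
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": lane,
        "tile": tile,
        "x": x_pos,
        "y": y_pos,
        "read": read,                # 1 or 2 for paired-end data
        "is_filtered": is_filtered,  # Y = failed the filter, N = passed
        "control": control,
        "index": index,
    }


fields = parse_illumina_header("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG")
for name, value in fields.items():
    print(f"{name:12s} {value}")
```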
Quality control and read cleaning
• Before alignment there is sometimes a need to preprocess or manipulate the
FASTA/FASTQ files to produce better mapping results. It is important to do quality
control checks to understand whether your data has any problems of which you should
be aware before doing any further analysis.
• FastQC: quality control checks on raw sequence data coming from high throughput
sequencing pipelines (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/).
• Trimmomatic: A flexible read trimming tool for Illumina NGS data
(http://www.usadellab.org/cms/?page=trimmomatic)
Trimmomatic:
A flexible read trimming tool for Illumina NGS data
Paired End:
With most new data sets you can use gentle quality trimming and adapter clipping.
Single End:
java -jar trimmomatic-0.35.jar SE -phred33 input.fq.gz output.fq.gz ILLUMINACLIP:TruSeq3-SE:2:30:10
[Figure: FastQC "Per base sequence quality" plots, one showing reasonable quality and one showing poor quality; the median and mean are marked at each read position]
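A per base sequence quality plot summarizes the quality scores observed at each read position across all reads. As a rough illustration of the mean and median it reports, the sketch below computes both per position from a handful of quality strings. It assumes phred33 encoding, the toy reads are hypothetical, and it is not how FastQC itself is implemented.

```python
from statistics import mean, median
from typing import Iterable, List, Tuple


def per_base_quality(quality_strings: Iterable[str],
                     offset: int = 33) -> List[Tuple[float, float]]:
    """Return (mean, median) Phred score for each read position."""
    columns: List[List[int]] = []
    for qual in quality_strings:
        for pos, ch in enumerate(qual):
            if pos == len(columns):
                columns.append([])  # first read long enough to reach this position
            columns[pos].append(ord(ch) - offset)
    return [(mean(col), median(col)) for col in columns]


# Hypothetical toy quality strings, not real data.
toy_qualities = ["IIII5", "IIH5", "II?5+"]
for position, (avg, med) in enumerate(per_base_quality(toy_qualities), start=1):
    print(f"position {position}: mean={avg:.1f} median={med}")
```

Quality that drops toward the end of the reads, as in the toy data, is the usual pattern and is what read trimming is meant to address.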
genome qc check
Donovan H. Parks et al. Genome Res. 2015;25:1043-1055
Trimmomatic: for QC preprocessing
java -jar trimmomatic-0.39.jar PE \
  QCtestWadpt_L001_R1_001.fastq.gz QCtestWadpt_L001_R2_001.fastq.gz \
  Output_R1_P.fastq.gz Output_R1_UP.fastq.gz \
  Output_R2_P.fastq.gz Output_R2_UP.fastq.gz \
  ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:2:True LEADING:20 TRAILING:10 \
  MINLEN:100
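The quality steps in this command can be pictured on a single read: LEADING:20 cuts bases from the start while their quality is below 20, TRAILING:10 cuts bases from the end while their quality is below 10, and MINLEN:100 drops the read if fewer than 100 bases remain. The sketch below mimics only those three steps under the assumption of phred33 encoding. It is not Trimmomatic's code, `trim_read` is a hypothetical helper, and adapter clipping (ILLUMINACLIP) and read pairing are left out.

```python
from typing import Optional, Tuple


def trim_read(seq: str, qual: str, leading: int = 20, trailing: int = 10,
              minlen: int = 100, offset: int = 33) -> Optional[Tuple[str, str]]:
    """Apply LEADING, TRAILING and MINLEN style trimming to one read.

    Returns the trimmed (sequence, quality) pair, or None if the read
    ends up shorter than minlen.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality string differ in length")
    scores = [ord(ch) - offset for ch in qual]

    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1  # LEADING: drop low-quality bases at the 5' end

    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1    # TRAILING: drop low-quality bases at the 3' end

    if end - start < minlen:
        return None  # MINLEN: read too short after trimming
    return seq[start:end], qual[start:end]


# Hypothetical toy read; minlen is lowered so the short example survives.
print(trim_read("ACGTACGT", "#IIIIII#", minlen=4))
```

With the default minlen of 100, the same toy read would be discarded, which is the behavior MINLEN:100 asks for in the command.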
SPAdes for de novo assembly