Lecture 03
Amit Ghosh
IIT Kharagpur
http://www.energy.iitkgp.ac.in/~amitghosh/
ATGCAGAGAGTCGA…….
Library CAGAGGCTACGGATGC…….
AGTGATAGCTATGACA…….
Single-end sequencing
Paired-end sequencing
Reads
P(Error) = 10^(-Q/10)
Q = 40, P(Error) = 10^-4
Q = 10, P(Error) = 10^-1
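The Phred relation above can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the ASCII offset of 33 (the common Phred+33 encoding) is an assumption that is not stated on the slide.

```python
import math

def phred_to_error_prob(q):
    """P(Error) = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Inverse relation: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality(qual, offset=33):
    """Convert a quality string to Phred scores (offset 33 is assumed)."""
    return [ord(ch) - offset for ch in qual]

for q in (40, 10):
    print(q, phred_to_error_prob(q))
```

Under the assumed offset, the characters `I`, `5` and `+` decode to Q = 40, 20 and 10.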
• FASTQ format (four lines per read)
1. Sequence ID (line starts with @)
2. Sequence
3. Quality ID (line starts with +)
4. Quality Score (one character per base)
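The four-line record layout listed above can be read with a small generator. This is a sketch under simplifying assumptions: every record occupies exactly four lines (no wrapped sequences), and the read name and quality string in the example are made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], seq, qual

# Hypothetical single-read example
example = io.StringIO("@read1\nATGCAGAGAGTCGA\n+\nIIIIIIIIIIIIII\n")
records = list(read_fastq(example))
print(records)
```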
FastQC
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
[Figures: FastQC example reports, Good Data vs. Bad Data and Good Data vs. Not so good]
Algorithms
Trivial search (slowest)
Blast etc.
Hash-table based
Suffix-tree based
-large memory requirement
If one mapping takes 0.1 sec, mapping 100 million reads will take
0.1 x 100 x 10^6 = 10^7 seconds ≈ 116 days
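The running-time estimate for the trivial search can be recomputed from its stated assumptions (0.1 sec per read, 100 million reads). A minimal sketch; the function name is illustrative.

```python
def total_mapping_time(seconds_per_read, n_reads):
    """Return (total seconds, total days) for mapping reads one at a time."""
    total_seconds = seconds_per_read * n_reads
    return total_seconds, total_seconds / 86400  # 86400 seconds per day

seconds, days = total_mapping_time(0.1, 100 * 10**6)
print(f"{seconds:.0f} seconds = {days:.1f} days")
```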
School of Energy Science & Engineering
Burrows-Wheeler Aligners
Most widely used tools:
bwa: http://bio-bwa.sourceforge.net/
Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
BWA
Bowtie
BWT(T) =
Burrows, M; Wheeler, DJ. A block sorting lossless data compression algorithm, Digital Equipment Corporation, Palo Alto, CA, Technical Report 124, 1994
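A minimal sketch of the Burrows-Wheeler transform: sort all rotations of the text and take the last column. The example text `acaacg` and the `$` terminator are assumptions chosen for illustration, not taken from the slide. Sorting full rotations is fine for a toy string, but real aligners build the transform from a suffix array instead.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (toy implementation)."""
    s = text + terminator  # terminator sorts before a, c, g, t
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)  # last column

print(bwt("acaacg"))
```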
Last-First (LF) mapping
Query Q = aac
• If range becomes empty the query does not occur in the text
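The LF-mapping search can be sketched as a backward search over the BWT. The query is processed from its last character to its first while a range of rows in the sorted rotation matrix is narrowed, and an empty range means the query does not occur. The text `acaacg` is an assumed example; `C` and `occ` are the usual helper tables, computed naively here rather than precomputed as an aligner would.

```python
from collections import Counter

def bwt(text, terminator="$"):
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def backward_search(bwt_str, query):
    """Count occurrences of query in the original text using LF mapping."""
    counts = Counter(bwt_str)
    # C[ch] = number of characters in the text that sort before ch
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # occurrences of ch in bwt_str[:i] (naive; real tools precompute this)
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)  # half-open row range [lo, hi)
    for ch in reversed(query):
        if ch not in C:
            return 0
        lo = C[ch] + occ(ch, lo)
        hi = C[ch] + occ(ch, hi)
        if lo >= hi:  # range became empty: query is absent
            return 0
    return hi - lo

b = bwt("acaacg")
print(backward_search(b, "aac"), backward_search(b, "gg"))
```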
• Read Name
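• Map (FLAG field): 0 mapped (forward strand), 4 unmapped, 16 mapped (reverse strand)
• Sequence, quality score
• MD: mismatch info, e.g. MD:Z:3C30T3 means 3 matches, then C in the reference, 30 matches, then T in the reference, 3 matches
• NM: number of mismatches (edit distance to the reference)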
• BAM: binary compressed SAM format
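The flag values and the MD string described above can be decoded with a few lines of Python. This is a sketch with illustrative function names: it checks only the two flag bits mentioned here (4 = unmapped, 16 = reverse strand), not the full set of SAM flag bits.

```python
import re

def decode_flag(flag):
    """Interpret the unmapped (4) and reverse-strand (16) bits of a SAM flag."""
    return {
        "unmapped": bool(flag & 4),
        "reverse_strand": bool(flag & 16),
    }

def parse_md(md):
    """Split an MD string into match-run lengths, mismatched reference bases
    and deletions (a deletion is written as ^ followed by the deleted bases)."""
    tokens = re.findall(r"\d+|\^[A-Za-z]+|[A-Za-z]", md)
    return [int(t) if t.isdigit() else t for t in tokens]

md_fields = parse_md("3C30T3")
n_mismatches = sum(1 for t in md_fields if isinstance(t, str) and not t.startswith("^"))
print(md_fields, n_mismatches)
print(decode_flag(16))
```

For real BAM files a library such as pysam is the usual route; the hand-rolled parser here is only meant to show what the fields encode.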
Higher confidence in the SNP call and reduced number of false positives
Diseased vs healthy
Cancer vs Normal