Lecture_28_Unit6_1

Uploaded by

idadetu

Read mapping

Lecture 28
Unit 6
CSE 815 Bioinformatics
Hamid D. Ismail, Ph.D.
Read mapping/aligning
• Read mapping or sequence alignment is the process of aligning sequenced
DNA/RNA to a reference sequence.
• This reference could be a complete genome, a set of genes, or any other
DNA sequence.
• Most NGS applications depend on read mapping:
- Genome assembly
- Variant calling
- Gene expression analysis
- Epigenetic analysis
- Metagenomics analysis


Understanding read mapping
• Mapping is the process of finding the original location of a DNA read in a
reference sequence, typically a reference genome.

[Figure: short reads aligned to a reference sequence. Labels: reference sequence, aligned short reads, coverage depth.]


Some factors affecting mapping quality
• Quality of reads: High-quality reads (i.e., those with few errors) are more
likely to map correctly to the reference sequence.
• Read length: Longer reads generally provide more reliable mapping
because they offer more sequence context to match against the reference
genome.
• Coverage depth refers to the number of times a particular base in a genome
is sequenced. This metric is essential for assessing the quality and reliability
of sequencing data.
$$\text{Coverage depth} = \frac{\text{Total number of bases in reads}}{\text{Genome length}}$$
If the total number of bases is 30 million and the genome length is 3 million
bases, the average coverage depth would be 30/3 = 10X, i.e., each base is
sequenced, on average, 10 times.
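
The coverage calculation can be sketched in a few lines of Python. This is only an illustration of the ratio described on this slide; the function name `coverage_depth` and the read count and read length in the usage example are assumptions chosen to reproduce the 30 million / 3 million case.

```python
def coverage_depth(total_bases: int, genome_length: int) -> float:
    """Average coverage depth: total sequenced bases divided by genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be a positive number of bases")
    return total_bases / genome_length


if __name__ == "__main__":
    # Hypothetical run: 200,000 reads of 150 bases each = 30 million bases.
    total = 200_000 * 150
    depth = coverage_depth(total, 3_000_000)
    print(f"Average coverage depth: {depth:.0f}X")
```

The result is a plain ratio, so it is reported as a fold value (10X) and not as a percentage.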


The steps of read mapping
1- Acquiring the sequencing reads in FASTQ format (see unit 4). For example,
to download human WGS reads:
fasterq-dump --progress SRR26329589
2- Quality control of the sequencing reads, so that any quality problems are
found and fixed (see unit 4). For example:
fastqc SRR26329589_1.fastq SRR26329589_2.fastq
3- Downloading the sequence of the reference genome of the organism.
4- Indexing the reference genome with samtools for fast random access to its
sequences.
5- Indexing the reference sequence with the chosen read aligner/mapper and
mapping the reads to it. Different tools adopt different indexing methods.
6- Post-alignment processing (sorting, indexing, removing duplicates).
7- Quality assessment of the alignment (coverage).
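
The steps above can be laid out as an ordered list of commands. The sketch below only builds and prints the commands; it does not run them. The choice of BWA-MEM as the aligner, the output file names, and the helper name `build_mapping_commands` are assumptions made for illustration. Downloading the reference (step 3) and duplicate removal (part of step 6) are left out to keep the sketch short.

```python
import shlex
from typing import List


def build_mapping_commands(accession: str, reference: str) -> List[List[str]]:
    """Return the mapping steps as argument lists, in the order they would run."""
    r1 = f"{accession}_1.fastq"          # paired-end read files written by fasterq-dump
    r2 = f"{accession}_2.fastq"
    sam = f"{accession}.sam"             # hypothetical output names
    bam = f"{accession}.sorted.bam"
    return [
        ["fasterq-dump", "--progress", accession],        # 1: acquire the reads
        ["fastqc", r1, r2],                               # 2: quality control
        ["samtools", "faidx", reference],                 # 4: FASTA index for random access
        ["bwa", "index", reference],                      # 5a: aligner-specific index
        ["bwa", "mem", "-o", sam, reference, r1, r2],     # 5b: map the reads
        ["samtools", "sort", "-o", bam, sam],             # 6: sort the alignments
        ["samtools", "index", bam],                       # 6: index the sorted BAM
        ["samtools", "depth", bam],                       # 7: per-base coverage
    ]


if __name__ == "__main__":
    for cmd in build_mapping_commands("SRR26329589", "hg38.fa"):
        print(shlex.join(cmd))
```

Each entry could be passed to `subprocess.run(cmd, check=True)` once the tools are installed; keeping the commands as data makes the order of the steps easy to inspect first.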


Downloading a reference genome
• Choose the database where you can download the reference genomes of the
organism under study.
• Most sequences of reference genomes are available in:
- NCBI genome database: https://www.ncbi.nlm.nih.gov/genome/
- Ensembl genome browser: https://useast.ensembl.org/index.html
- UCSC Genome Browser: https://genome.ucsc.edu/


Downloading a reference genome
• There are different ways to download a reference genome from the NCBI
database:
1- NCBI datasets command line tool:
Download the reference genome sequence of:
"Severe acute respiratory syndrome coronavirus 2"

• This will download a compressed folder including the fasta sequence and
annotation files.


Downloading a reference genome
2- NCBI E-utilities or E-Direct if you know the accession number.

efetch -db nuccore \
-id "NC_000001" \
-format fasta -mode text \
> NC_000001.fasta


Downloading a reference genome
3- Using BioPython

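from Bio import Entrez

# NCBI asks for a contact address; this is a placeholder, use your own.
Entrez.email = "your.name@example.com"
# Fetch the record as FASTA text from the nucleotide database.
handle = Entrez.efetch(db="nucleotide",
                       id="NC_000001",
                       rettype="fasta",
                       retmode="text")
data = handle.read()
handle.close()
# Save the sequence to a local FASTA file.
with open("NC_000001.fasta", "w") as file:
    file.write(data)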


Downloading a reference genome
4- Using “wget” to download a reference genome from UCSC.
• The genomes are available at: http://hgdownload.soe.ucsc.edu/goldenPath/

wget http://hgdownload.soe.ucsc.edu/goldenPath/<GenomePath>

• Example: Downloading the human reference genome from UCSC.

• To stay organized, create a directory for the reference genome (e.g., refHuman).
• Run the following commands:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz \
--no-check-certificate
gunzip hg38.fa.gz


Indexing a reference genome with samtools
• samtools is a collection of tools for sequence indexing and for
manipulating mapped reads.
• The “faidx” tool in samtools indexes a FASTA formatted sequence file and
creates an index file that allows for fast random access to the sequence data.
• The syntax for indexing a reference sequence:
samtools faidx your_reference_genome.fasta
• The output is a file with “.fai” extension, which includes indices of the fasta
file.
• The faidx index file is a text file consisting of lines, each with five TAB-delimited
columns:
▪ NAME: the name of this reference sequence
▪ LENGTH: the total length of this reference sequence, in bases
▪ OFFSET: the byte offset, within the FASTA file, of this sequence's first base
▪ LINEBASES: the number of bases on each line
▪ LINEWIDTH: the number of bytes in each line, including the newline
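
To see why these five columns are enough for fast random access, the sketch below parses one `.fai` line and computes where a given base sits in the FASTA file. It is an illustration only, not the samtools implementation; the names `FaiRecord`, `parse_fai_line` and `byte_position`, and the tiny in-memory FASTA, are assumptions made for the example.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    name: str        # NAME
    length: int      # LENGTH
    offset: int      # OFFSET
    linebases: int   # LINEBASES
    linewidth: int   # LINEWIDTH


def parse_fai_line(line: str) -> FaiRecord:
    """Parse one TAB-delimited line of a .fai index file."""
    name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
    return FaiRecord(name, int(length), int(offset), int(linebases), int(linewidth))


def byte_position(rec: FaiRecord, pos: int) -> int:
    """Byte offset in the FASTA file of the base at 0-based position `pos`."""
    if not 0 <= pos < rec.length:
        raise IndexError("position is outside the sequence")
    full_lines, within_line = divmod(pos, rec.linebases)
    return rec.offset + full_lines * rec.linewidth + within_line


if __name__ == "__main__":
    # Toy FASTA: a 6-byte header line, then 4 bases per line (5 bytes with newline).
    fasta = b">chr1\nACGT\nACGT\nAC\n"
    rec = parse_fai_line("chr1\t10\t6\t4\t5")
    where = byte_position(rec, 5)
    print(where, chr(fasta[where]))
```

A reader can seek straight to the computed byte instead of scanning the file from the start, which matters for genome-sized FASTA files.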


Indexing a reference genome with samtools
• For indexing the human reference genome:
samtools faidx hg38.fa

• The index file will be hg38.fa.fai


Read mapping with aligners
• Read aligners are tools used to align sequencing reads to a reference
genome.
• The reference genome is usually indexed by the aligner before it is used in
the alignment. Thus, mapping is performed in two steps: (i) indexing and (ii) mapping.
• Different aligners use different data structures for indexing:
- Suffix array: used by aligners like STAR, which searches an uncompressed
suffix array; BWA also stores a sampled suffix array next to its BWT.
- Burrows-Wheeler Transform (BWT): used alongside suffix arrays to
compress the genome sequence while still allowing efficient search operations.
- FM-index: based on the BWT; used by Bowtie, Bowtie2, and BWA.
- Hash tables: used by aligners like MAQ and SOAP for quick lookup operations.
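
The relationship between the suffix array, the BWT and FM-index style search can be shown on a toy sequence. The sketch below is a deliberately naive teaching version, assuming a short text held in memory: real aligners build these structures with far more efficient algorithms and store compressed occurrence tables. All function names are chosen for illustration.

```python
from collections import Counter
from typing import List, Tuple


def suffix_array(text: str) -> List[int]:
    """Start positions of all suffixes of `text`, in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text: str, sa: List[int]) -> str:
    """BWT: the character preceding each sorted suffix (wraps around to '$')."""
    return "".join(text[i - 1] for i in sa)


def backward_search(pattern: str, bwt: str) -> Tuple[int, int]:
    """Return the half-open suffix array interval [lo, hi) of suffixes that start with pattern."""
    counts = Counter(bwt)
    smaller, total = {}, 0           # smaller[c]: number of characters in the text < c
    for ch in sorted(counts):
        smaller[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):     # extend the match one character to the left
        if ch not in smaller:
            return 0, 0
        lo = smaller[ch] + bwt[:lo].count(ch)
        hi = smaller[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_all(pattern: str, reference: str) -> List[int]:
    """0-based positions where `pattern` occurs exactly in `reference`."""
    text = reference + "$"           # sentinel that sorts before A, C, G, T
    sa = suffix_array(text)
    lo, hi = backward_search(pattern, bwt_from_sa(text, sa))
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    print(find_all("ACG", "ACGTACGA"))
```

The search loop touches the pattern once, character by character, and never scans the reference itself; that property is what makes BWT-based indexes attractive for mapping millions of short reads.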


Read aligner data structures
[Figure: four index data structures used by read aligners: suffix array, BWT, suffix tree, hash table.]


Popular read aligners
• BWA (Burrows-Wheeler Aligner):
- BWA consists of three algorithms: BWA-backtrack, BWA-SW, and BWA-MEM.
- The BWA-MEM algorithm is particularly popular for high-quality alignments of
reads longer than 70bp and is efficient with high error rates.
• Bowtie2:
- Bowtie2 is a fast and memory-efficient tool for aligning sequencing reads to
long reference sequences.
- It is particularly good for aligning reads of about 50 up to 100s of characters to
relatively large genomes.
• STAR (Spliced Transcripts Alignment to a Reference):
- STAR is an aligner specifically designed for RNA-seq data, which can
efficiently handle spliced alignments of RNA-seq reads.
- It works well with reads of a wide range of lengths (very short to very long).


Read mapping with BWA
• BWA (Burrows-Wheeler Aligner):
- BWA consists of three algorithms: BWA-backtrack, BWA-SW, and BWA-MEM.
- The BWA-MEM algorithm is particularly popular for high-quality alignments of
reads longer than 70bp and is efficient with high error rates.

