Read mapping

Lecture 28
Unit 6
CSE 815 Bioinformatics
Hamid D. Ismail, Ph.D.
Read mapping/aligning
• Read mapping, or sequence alignment, is the process of aligning sequenced
DNA/RNA reads to a reference sequence.
• The reference can be a complete genome, a set of genes, or any other
DNA sequence.
• Most NGS applications depend on read mapping:
- Genome assembly
- Variant calling
- Gene expression
- Epigenetic analysis
- Metagenomics analysis



Understanding read mapping
• Mapping is the process of finding the original location of a DNA read in a
reference sequence, typically a reference genome.

[Figure: aligned short reads stacked along the reference sequence, with the
coverage depth indicated]



Some factors affecting mapping quality
• Quality of reads: High-quality reads (i.e., those with few errors) are more
likely to map correctly to the reference sequence.
• Read length: Longer reads generally provide more reliable mapping
because they offer more sequence context to match against the reference
genome.
• Coverage depth refers to the number of times a particular base in a genome
is sequenced. This metric is essential for assessing the quality and reliability
of sequencing data.
\[ \text{Coverage depth} = \frac{\text{Total number of bases in reads}}{\text{Genome length}} \]
Coverage depth is a fold value (written with an "X"), not a percentage. For
example, if the reads contain 30 million bases in total and the genome length
is 3 million bases, the average coverage depth is 30/3 = 10X, i.e., each base
is sequenced 10 times on average.
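
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names `coverage_depth` and `coverage_from_reads` are made up for this example, and the second helper assumes that all reads have the same length.

```python
def coverage_depth(total_bases_in_reads, genome_length):
    """Average coverage depth (fold coverage, usually written as e.g. 10X)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_bases_in_reads / genome_length


def coverage_from_reads(num_reads, read_length, genome_length):
    """Same metric, assuming every read has the same length (an assumption)."""
    return coverage_depth(num_reads * read_length, genome_length)


# The example from the slide: 30 million bases over a 3-million-base genome.
print(f"{coverage_depth(30_000_000, 3_000_000):.0f}X")  # prints "10X"
```

The same 10X figure is obtained from, for instance, 200,000 reads of 150 bases each over the same genome, since 200,000 × 150 = 30 million bases.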



The steps of read mapping
1- Acquire the sequencing reads in FASTQ format (see Unit 4). For example, to
download human WGS reads:
fasterq-dump --progress SRR26329589
2- Run quality control on the sequencing reads and fix any quality problems
(see Unit 4). For example:
fastqc SRR26329589_1.fastq SRR26329589_2.fastq
3- Download the sequence of the reference genome of the organism.
4- Index the reference genome with samtools for fast random access to its
sequence.
5- Index the reference sequence with one of the read aligners/mappers and map
the reads. Different tools adopt different indexing methods.
6- Post-alignment processing (sorting, indexing, removing duplicates).
7- Quality assessment of the alignment (coverage).
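
To see how steps 2 to 7 chain together, the Python sketch below only assembles the command lines for a basic paired-end run; it does not execute anything. The choice of BWA-MEM as the aligner, the helper name `mapping_commands`, and the output file names are assumptions made for illustration, and duplicate removal is left out to keep the sketch short.

```python
import shlex


def mapping_commands(reference, reads_1, reads_2, prefix):
    """Return shell command lines for a basic paired-end mapping run.

    Hypothetical helper: BWA-MEM is just one possible aligner, and the
    output names derived from `prefix` are arbitrary.
    """
    q = shlex.quote  # protect file names that contain spaces
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        f"fastqc {q(reads_1)} {q(reads_2)}",          # step 2: quality control
        f"samtools faidx {q(reference)}",             # step 4: FASTA index
        f"bwa index {q(reference)}",                  # step 5: aligner index
        f"bwa mem {q(reference)} {q(reads_1)} {q(reads_2)} > {q(sam)}",
        f"samtools sort -o {q(bam)} {q(sam)}",        # step 6: sort alignments
        f"samtools index {q(bam)}",                   # step 6: index the BAM
        f"samtools depth {q(bam)}",                   # step 7: per-base depth
    ]


for command in mapping_commands(
    "hg38.fa", "SRR26329589_1.fastq", "SRR26329589_2.fastq", "sample"
):
    print(command)
```

Printing the commands first makes it easy to review them before running each one in a shell.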



Downloading a reference genome
• Choose the database where you can download the reference genomes of the
organism under study.
• Most sequences of reference genomes are available in:
- NCBI genome database: https://www.ncbi.nlm.nih.gov/genome/
- Ensembl genome browser: https://useast.ensembl.org/index.html
- UCSC Genome Browser: https://genome.ucsc.edu/



Downloading a reference genome
• There are different ways to download a reference genome from the NCBI
database:
1- NCBI datasets command-line tool:
Download the reference genome sequence of:
"Severe acute respiratory syndrome coronavirus 2"

• This will download a compressed folder including the fasta sequence and
annotation files.



Downloading a reference genome
2- NCBI E-utilities (EDirect), if you know the accession number:

efetch -db nuccore \
    -id "NC_000001" \
    -format fasta -mode text \
    > NC_000001.fasta



Downloading a reference genome
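3- Using BioPython

from Bio import Entrez

# NCBI asks for a contact address; replace this placeholder with your own.
Entrez.email = "your.name@example.com"
# Fetch the record as plain-text FASTA from the nucleotide database.
handle = Entrez.efetch(db="nucleotide",
                       id="NC_000001",
                       rettype="fasta",
                       retmode="text")
data = handle.read()
handle.close()
# Save the sequence to a local FASTA file.
with open("NC_000001.fasta", "w") as file:
    file.write(data)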



Downloading a reference genome
4- Using “wget” to download a reference genome from UCSC.
• The genomes are available at: http://hgdownload.soe.ucsc.edu/goldenPath/

wget http://hgdownload.soe.ucsc.edu/goldenPath/<GenomePath>

• Example: Downloading the human reference genome from UCSC.
• Stay organized: create a directory for the reference genome (e.g., refHuman).
• Run the following commands:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz \
    --no-check-certificate
gunzip hg38.fa.gz



Indexing a reference genome with samtools
• samtools is a collection of tools for indexing sequences and manipulating
mapped reads.
• The “faidx” tool in samtools indexes a FASTA-formatted sequence file and
creates an index file that allows fast random access to the sequence data.
• The syntax for indexing a reference sequence:
samtools faidx your_reference_genome.fasta
• The output is a file with the “.fai” extension, which holds the index of the
FASTA file.
• The faidx index file is a text file with one line per reference sequence,
each with five TAB-delimited columns:
▪ NAME: name of the reference sequence
▪ LENGTH: total length of the sequence, in bases
▪ OFFSET: byte offset of the sequence's first base within the FASTA file
▪ LINEBASES: the number of bases on each line
▪ LINEWIDTH: the number of bytes on each line, including the newline
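
The five columns are enough to jump straight to any base without reading the whole file. The Python sketch below recomputes them for a FASTA file and uses them to fetch a subsequence. It is a simplified illustration, not the samtools implementation: the function names `build_fai` and `fetch` are made up, and the code assumes that every sequence line except the last has the same length, as faidx itself requires.

```python
def build_fai(fasta_path):
    """Return (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) per sequence."""
    records = []
    name = None
    length = offset = linebases = linewidth = 0
    pos = 0  # byte position of the start of the current line
    with open(fasta_path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                if name is not None:
                    records.append((name, length, offset, linebases, linewidth))
                name = line[1:].split()[0].decode()
                length = linebases = linewidth = 0
                offset = pos + len(line)  # first base follows the header line
            else:
                bases = len(line.rstrip(b"\r\n"))
                if linebases == 0:  # take the layout from the first line
                    linebases, linewidth = bases, len(line)
                length += bases
            pos += len(line)
    if name is not None:
        records.append((name, length, offset, linebases, linewidth))
    return records


def fetch(fasta_path, record, start, end):
    """Return bases [start, end) (0-based) using only the index columns."""
    _, _, offset, linebases, linewidth = record
    # Full lines before the position cost LINEWIDTH bytes each.
    first = offset + (start // linebases) * linewidth + start % linebases
    last = offset + (end // linebases) * linewidth + end % linebases
    with open(fasta_path, "rb") as fh:
        fh.seek(first)
        chunk = fh.read(last - first)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

Calling `fetch` with a record from `build_fai` reads only the bytes between the two computed offsets, which is why indexed access stays fast even on a genome-sized file.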



Indexing a reference genome with samtools
• For indexing the human reference genome:
samtools faidx hg38.fa

• The index file will be hg38.fa.fai



Read mapping with aligners
• Read aligners are tools used to align sequencing reads to a reference
genome.
• The reference genome is usually indexed by the aligner before it is used in
the alignment. Thus, mapping is performed in two steps: (i) indexing and
(ii) mapping.
• Different aligners use different data structures for indexing:
- Suffix array: a sorted list of the start positions of all suffixes of the
genome. STAR searches an uncompressed suffix array, and BWA keeps a sampled one
inside its index.
- Burrows-Wheeler Transform (BWT): derived from the suffix array; it rearranges
the genome sequence so that it compresses well and allows efficient search
operations.
- FM-index: based on the BWT; used by Bowtie, Bowtie2, and BWA.
- Hash tables: map short subsequences (k-mers) to their genome positions for
quick lookup; used by aligners such as SOAP and minimap2.
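
As a rough illustration of how the first three structures relate, the Python sketch below builds a suffix array naively, derives the BWT from it, and locates a pattern with FM-index style backward search. The function names and the toy sequence GATTACA are assumptions for illustration; real aligners use compressed, sampled versions of these structures rather than this naive code.

```python
from collections import Counter


def suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """The BWT holds the character that precedes each sorted suffix."""
    return "".join(text[i - 1] for i in sa)


def backward_search(bwt, sa, pattern):
    """Return the sorted text positions where pattern occurs."""
    # first[c]: number of characters in the text that sort before c
    counts = Counter(bwt)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt)  # current suffix array interval [lo, hi)
    for c in reversed(pattern):  # extend the match one character leftwards
        if c not in first:
            return []
        # bwt[:i].count(c) is the rank query; a real FM-index precomputes it
        lo = first[c] + bwt[:lo].count(c)
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


text = "GATTACA$"  # "$" marks the end and sorts before A, C, G, T
sa = suffix_array(text)
bwt = bwt_from_suffix_array(text, sa)
print(sa)                               # [7, 6, 4, 1, 5, 0, 3, 2]
print(bwt)                              # ACTGA$TA
print(backward_search(bwt, sa, "TA"))   # [3]
```

The search never scans the text itself: each pattern character only narrows an interval of the suffix array, which is what makes BWT-based indexes fast on large genomes.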



Read aligner data structures
[Figure: four index data structures: suffix array, BWT, suffix tree, and hash
table]
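
The hash-table approach can be sketched in Python as a k-mer lookup followed by a simple vote on the read's start position. This is a toy seed-lookup illustration under assumed names (`build_kmer_index`, `candidate_positions`) and an arbitrary k; it does not reproduce how any particular aligner works.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash table mapping every k-mer to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_positions(read, index, k):
    """Rank candidate start positions of the read by number of k-mer hits."""
    votes = defaultdict(int)
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], ()):
            votes[pos - j] += 1  # a hit at offset j implies start pos - j
    return sorted(votes.items(), key=lambda item: -item[1])


reference = "GATTACAGATTTCA"
index = build_kmer_index(reference, k=3)
print(candidate_positions("TACAG", index, k=3))  # [(3, 3)]
```

Each lookup is a constant-time dictionary access; the candidates with the most votes would then be checked with a full alignment.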



Popular read aligners
• BWA (Burrows-Wheeler Aligner):
- BWA consists of three algorithms: BWA-backtrack, BWA-SW, and BWA-MEM.
- The BWA-MEM algorithm is particularly popular for high-quality alignments of
reads longer than about 70 bp, and it copes well with reads that have
relatively high error rates.
• Bowtie2:
- Bowtie2 is a fast and memory-efficient tool for aligning sequencing reads to
long reference sequences.
- It is particularly good at aligning reads of about 50 up to hundreds of
characters to relatively large genomes.
• STAR (Spliced Transcripts Alignment to a Reference):
- STAR is an aligner specifically designed for RNA-seq data, which can
efficiently handle spliced alignments of RNA-seq reads.
- It works well with reads across a wide range of lengths (very short to very
long).



Read mapping with BWA