Efficient Read Alignment Using Burrows Wheeler Transform and Wavelet Tree
Abstract: In the genome sequence alignment problem, a reference string and a number of query strings, referred to as short reads, are given; the goal is to find the occurrences of these query strings in the reference string. The huge number of reads generated by new sequencing technologies (Illumina/Solexa) calls for the development of an efficient algorithm that requires both less memory and less computational time. There are a number of indexing and string matching techniques to align short reads to a reference string (genome). The size of the index of the reference string in each of the existing techniques is large. In this paper, a new self-compressed index technique (BWT-WT) is proposed. The BWT-WT scheme is based on the Burrows Wheeler Transform (BWT) and the Wavelet Tree (WT). BWT-WT also supports exact alignment of DNA sequence reads. The performance of BWT-WT is compared with that of other BWT-based short read alignment tools. Experiments show that the BWT-WT based program achieves more compression and also faster searching in comparison to other existing tools.

Keywords: Burrows-Wheeler Transform, FM Index, Full Text Index, Wavelet Tree and Sequence Analysis.

I. INTRODUCTION

The Illumina/Solexa next generation sequencing machine generates millions of short DNA sequence reads in a single run. These reads must be mapped to one or more reference genomes. The orientation of a read relative to the genome is not known. The main problem in matching these reads is how to align them to the reference genome, accounting for exact matching, within a reasonable amount of time and memory space. There are a number of applications in which short read alignments are used. Examples include assembling reads into a genome, aligning reads to one reference genome for the analysis of genomic variation, and aligning a microbiome to a set of reference genomes for species or functional analysis.

Searching biological sequences in genomes and proteins is important for understanding the genetic blueprint of living organisms. This has resulted in the fast development of new technologies that generate vast amounts of sequence data to be analyzed [7]. For this reason, the focus today has shifted from data acquisition to efficient data storage and processing methods. To regain the original ordering of the reads, they are often aligned to a reference genome, where the massive number of sequences that need to be processed requires efficient search schemes and data structures.

A lot of effort has been made to develop methods that are both memory efficient and fast. One approach to derive suitable data structures is the Burrows-Wheeler Transform (BWT), which can be understood as a rearrangement of the characters in a sequence. It has therefore been integrated in several bioinformatics tools for read mapping, e.g. SOAP [9], BWA [5], BWA-SW [13] and Bowtie [11].

Four categories of alignment programs are currently used to map short read sequences. The first category is based on hashing of the read sequences, such as RMAP (Smith et al., 2008), MAQ (Li et al., 2008(1)) and ZOOM (Li et al., 2008(2)). These programs have a flexible memory footprint but do not support gapped alignment or multithreading. In the second category, programs are based on hashing of the reference genome, such as SOAP (Li et al., 2008(3)) and BFAST [14]. Programs of this category support multithreading for the alignment of reads, but the size of the reference genome index is very large. Programs of the third category are based on merge sorting of the reference genome as well as of the read sequences, such as Malhis (Malhis et al., 2009), but these programs are not very popular because they do not support paired-end mapping. The fourth category of programs is based on the Burrows Wheeler Transform (BWT, 1994), which is efficient in both memory footprint and speed. Some of the software programs of this category are BWA [5], Bowtie [11] and SOAP [9].

Programs of the fourth category have a relatively small memory footprint, are efficient in searching, and support exact matching as well as inexact matching with a bounded number of allowed differences. Exact matching with these programs takes a few seconds to align the reads, but aligning inexact reads takes much longer because all similar substrings have to be found. In DNA profiling, where multiple reference genomes are used for the analysis and identification of gene behaviour, the size of the index again becomes an issue, so a reduction of the index size is required. As a result, the development of an efficient program requiring less memory and computation time is needed today.

BWT [2] based algorithms use a number of additional techniques, such as move-to-front encoding (which is used to rearrange the characters in a similar order), run-length coding and variable-length coding, to compress the reference sequence.

The Wavelet Tree [4] was invented in 2003 by Grossi, Gupta and Vitter as a data structure to represent a sequence and answer some queries on it. It is a milestone in compressed full-text indexing that adapts very well, in many ways, to the compressibility of the data. Two key approaches to achieve this are using specific coding (entropy coding) on the bitmaps and modifying the tree shape.

In this paper, a new self-compressed index supporting exact alignment of DNA sequence reads is proposed, which is based on the BWT and the Wavelet Tree. The advantage of BWT-WT is that it provides an index of optimal size and supports a number of
Authorized licensed use limited to: UNIVERSITI TEKNOLOGI MARA. Downloaded on April 05,2022 at 07:43:16 UTC from IEEE Xplore. Restrictions apply.
queries such as access, rank and select in constant time. The performance of BWT-WT for DNA sequence alignment is compared with that of other BWT-based tools. Experiments show that the BWT-WT based program achieves more compression and more efficient searching in comparison to other techniques.

This paper is organized as follows. Sec. II describes the related concepts. Sec. III presents the proposed compression and indexing techniques based on the Burrows-Wheeler Transform and the Wavelet Tree. Sec. IV presents the experimental setup and the analysis of the results. Finally, Sec. V concludes the paper.

II. RELATED CONCEPTS

A. Suffix Tree and Suffix Array

The suffix tree [1] has been used as an important data structure in string processing. This data structure plays a prominent role in algorithms but is not as prevalent in actual implementations of software tools. There are two major reasons for this. The first reason is space consumption: the suffix tree requires quite a large amount of space, even though its space and time bounds are asymptotically linear. The second reason is that the suffix tree exhibits poor locality of memory reference, which causes a significant loss of efficiency on processor architectures with caches.

The suffix array was introduced by Manber and Myers [6] as a simple, space-efficient indexing alternative to suffix trees. It is a key data structure for solving a number of problems in data compression and information retrieval for biological sequence analysis and pattern discovery. It is defined as the permutation of index numbers giving the starting positions of the suffixes of a given string in alphabetical order. Table I shows the suffix array for the string "AGCAGT$".

TABLE I. SUFFIX ARRAY FOR TEXT T=AGCAGT$

    Suffixes            Ordered Suffixes
    i   S[i]            i   S[i]   Ssuf
    0   AGCAGT$         0   6      $
    1   GCAGT$          1   0      AGCAGT$
    2   CAGT$           2   3      AGT$
    3   AGT$            3   2      CAGT$
    4   GT$             4   1      GCAGT$
    5   T$              5   4      GT$
    6   $               6   5      T$

B. Burrows Wheeler Transform

DNA sequencing algorithms based on the Burrows Wheeler Transform (BWT) [2] are widely used in genome sequencing analysis. The main concept of the BWT is to sort all rotations of a given string in lexical order, forming the BWM (Burrows Wheeler Matrix), and then return the last column as the result. This last column, i.e., the BWT string, can be easily compressed because it tends to contain runs of repeated characters. The BWT also allows fast string matching on compressed text. It is implemented by the following steps:

1. Derive a conceptual matrix M whose rows are the n cyclic shifts of the text T, n being the length of the text.
2. Lexicographically sort the rows of the resulting matrix, which is called the BWM.
3. Construct the transformed text Tbwt by taking the last column of the BWM.

The transformed text Tbwt in the last column is also denoted as L (last). The first column of the BWM, F (first), is obtained by lexicographically sorting the characters of T. Fig. 1 shows the construction of the BWT.

    Cyclic Shifting         BWT Matrix (after sorting)
    Index  Shift            Index  Row
    0      AGCAGT$          0      $AGCAGT
    1      GCAGT$A          1      AGCAGT$
    2      CAGT$AG          2      AGT$AGC
    3      AGT$AGC          3      CAGT$AG
    4      GT$AGCA          4      GCAGT$A
    5      T$AGCAG          5      GT$AGCA
    6      $AGCAGT          6      T$AGCAG

Fig. 1. Construction of Burrows Wheeler Transform Matrix for Text T=AGCAGT$. TBWT=T$CGAAG

C. FM-Index

In 2000, six years after the BWT appeared, Paolo Ferragina and Giovanni Manzini [3] published a paper describing how the BWT, together with some small auxiliary data structures, can be used as a space-efficient index of a reference string T. They named it the FM Index. Just as the Last to First Mapping [3, 8] is the key to understanding how the BWT is reversible, it is also the key to how the BWT can be used as an index.

D. Wavelet Tree

A wavelet tree [4] is a binary tree of bit strings that represents a given text T. For an alphabet Σ and a text of length n, the tree needs O(n log2 |Σ|) bits of storage and supports the determination of the character at a specified position in O(log |Σ|) time. In addition, it allows the number of occurrences of a given character up to a specified position to be obtained in O(log |Σ|) time. Fig. 2 shows the wavelet tree for AGCAGT$.

E. Existing Technique

There are a number of techniques for short read alignment to a reference genome, such as MAQ, BWA, Bowtie and SOAP. The Burrows Wheeler Aligner (BWA) [5] performs short read alignment and is based on the Burrows Wheeler Transform (BWT) and FM Indexes [3]. In BWA alignment, an index based on the BWT and the suffix array is created. To search efficiently, BWA uses the FM Index [3, 7], which is based on the backward search method. The FM Index uses a number of auxiliary data structures, such as the count table and the occurrence table, to perform the search operation. The count table stores, for each character c in the string, the number of characters in the string that are smaller than c; the occurrence table stores the rank of each character. The suffix array and the occurrence table are too large to store in full, so only sampled values are stored and the other values are calculated on demand. To perform exact matching, the count and locate functions are used. The count function returns the number of occurrences of pattern P in text T, whereas the locate function returns the locations of pattern P in text T.
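The suffix array of Table I, the transform of Fig. 1 and the Last to First mapping of Sec. II-C can be reproduced with a short Python sketch. The function names are illustrative assumptions, not part of any tool discussed here, and the naive sorting is only suitable for small strings such as T=AGCAGT$.

```python
def suffix_array(text):
    """Start positions of the suffixes of text in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_rotations(text):
    """Last column of the sorted matrix of cyclic shifts (steps 1 to 3)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def inverse_bwt(last):
    """Rebuild the '$'-terminated text from L using the Last to First mapping."""
    # F is L sorted; the k-th occurrence of c in L corresponds to the
    # k-th occurrence of c in F.
    first_pos = {}
    for idx, c in enumerate(sorted(last)):
        first_pos.setdefault(c, idx)
    seen = {}
    lf = []
    for c in last:
        lf.append(first_pos[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    # Row 0 starts with '$', so following LF from row 0 spells the text backwards.
    row, out = 0, []
    for _ in range(len(last) - 1):
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out)) + "$"


print(suffix_array("AGCAGT$"))        # [6, 0, 3, 2, 1, 4, 5]
print(bwt_from_rotations("AGCAGT$"))  # T$CGAAG
print(inverse_bwt("T$CGAAG"))         # AGCAGT$
```

The sketch relies on '$' being smaller than every other character, which holds for ASCII comparison of '$' against the letters A, C, G and T.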
Fig. 2. Wavelet Tree for text T= AGCAGT$
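A minimal Python sketch of a pointer-based wavelet tree with the access and rank queries is given here for illustration. The class name, the way the alphabet is halved at each node and the 0-based positions are assumptions of this sketch, so the bitmaps it builds may differ from those drawn in Fig. 2. Bit counting is done naively; practical implementations replace it with constant-time rank structures.

```python
class WaveletTree:
    """Balanced wavelet tree over the sorted alphabet of a text."""

    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.left = self.right = None
        self.bits = []
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            left_syms = set(self.alphabet[:mid])
            # Bit 0: symbol is in the left half of the alphabet, bit 1: right half.
            self.bits = [0 if c in left_syms else 1 for c in text]
            self.left = WaveletTree([c for c in text if c in left_syms],
                                    self.alphabet[:mid])
            self.right = WaveletTree([c for c in text if c not in left_syms],
                                     self.alphabet[mid:])

    def access(self, i):
        """Character at position i (0-based)."""
        node = self
        while node.left is not None:
            bit = node.bits[i]
            # Position of this symbol inside the chosen child.
            i = node.bits[:i].count(bit)
            node = node.left if bit == 0 else node.right
        return node.alphabet[0]

    def rank(self, c, i):
        """Number of occurrences of c among the first i characters."""
        if c not in self.alphabet:
            return 0
        node = self
        while node.left is not None and i > 0:
            mid = len(node.alphabet) // 2
            bit = 0 if c in node.alphabet[:mid] else 1
            i = node.bits[:i].count(bit)
            node = node.left if bit == 0 else node.right
        return i


wt = WaveletTree("AGCAGT$")
print(wt.access(2))     # C
print(wt.rank("G", 5))  # 2
```

Each query descends one level per halving of the alphabet, which is where the O(log |Σ|) bound for access and rank comes from.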
Fig. 4. Wavelet Tree Index Based on BWT

(3)     if c is in the left sub-tree of v then
(4)         r ← rank0(Bv, r)
(5)         v ← leftchild(v)
(6)     else
(7)         r ← rank1(Bv, r)
(8)         v ← rightchild(v)
(9) return r

WTBWT-select(c, i)
    // selectc(T, i): the position of the ith occurrence of c in text T.
(1) v ← leaf representing c; r ← i
(2) while v is not root do
(3)     p ← parent(v)
(4)     if v is the left child of p then
(5)         r ← select0(Bp, r)
(6)     else
(7)         r ← select1(Bp, r)
(8)     v ← p
(9) return r
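The pseudocode relies on the bitmap primitives rank0, rank1, select0 and select1 of the node bitmaps. A minimal Python sketch of these primitives is given here, assuming a plain prefix-count table instead of a succinct structure; the class name Bitmap and the convention that rank takes a prefix length while select returns a 1-based position are assumptions of this sketch.

```python
from bisect import bisect_left
from itertools import accumulate


class Bitmap:
    """Bit vector answering rank and select from prefix-count tables."""

    def __init__(self, bits):
        self.bits = list(bits)
        # ones[i] / zeros[i] = number of 1 / 0 bits among the first i bits.
        self.ones = [0] + list(accumulate(self.bits))
        self.zeros = [i - o for i, o in enumerate(self.ones)]

    def rank1(self, i):
        return self.ones[i]

    def rank0(self, i):
        return self.zeros[i]

    def select1(self, k):
        """1-based position of the k-th 1 bit."""
        if not 1 <= k <= self.ones[-1]:
            raise ValueError("bitmap has fewer than k one bits")
        # Smallest prefix length whose count of ones reaches k.
        return bisect_left(self.ones, k)

    def select0(self, k):
        """1-based position of the k-th 0 bit."""
        if not 1 <= k <= self.zeros[-1]:
            raise ValueError("bitmap has fewer than k zero bits")
        return bisect_left(self.zeros, k)


b = Bitmap([0, 1, 1, 0, 1, 1, 0])
print(b.rank1(5), b.rank0(5))      # 3 2
print(b.select1(3), b.select0(2))  # 5 4
```

With this convention select is the inverse of rank, i.e. rank1(select1(k)) = k, which is the property the bottom-up traversal in WTBWT-select depends on.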
TABLE II. INDEX FOR STRING T = AGCAGCAGACT$
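As an illustration of how an index for this string can be searched, the following Python sketch builds the suffix array, the BWT, the count table C and a full occurrence table for T=AGCAGCAGACT$ and runs a backward search for the pattern GCA. All function names, the half-open 0-based intervals and the unsampled occurrence table are assumptions made for readability; they are not taken from TABLE II. In a wavelet-tree-based index the occurrence lookups would presumably be answered by rank queries instead.

```python
def build_index(text):
    """Suffix array, BWT, C table and Occ table for a '$'-terminated text."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is '$' when i == 0
    symbols = sorted(set(text))
    # C[c] = number of characters in the text that are smaller than c.
    C, total = {}, 0
    for c in symbols:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] for c in symbols}
    for ch in bwt:
        for c in symbols:
            occ[c].append(occ[c][-1] + (ch == c))
    return sa, bwt, C, occ


def backward_search(pattern, C, occ, n):
    """Half-open suffix-array interval [lo, hi) of suffixes starting with pattern."""
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in C:
            return 0, 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(pattern, C, occ, n):
    lo, hi = backward_search(pattern, C, occ, n)
    return hi - lo


def locate(pattern, sa, C, occ):
    lo, hi = backward_search(pattern, C, occ, len(sa))
    return sorted(sa[lo:hi])


text = "AGCAGCAGACT$"
sa, bwt, C, occ = build_index(text)
print(bwt)                                      # TGCC$GGAAAAC
print(backward_search("GCA", C, occ, len(sa)))  # (9, 11)
print(locate("GCA", sa, C, occ))                # [1, 4]
```

The 0-based start positions 1 and 4 correspond to the 2nd and 5th characters of T when counting from one.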
So the indices corresponding to the suffix interval of pattern P are [2,5] from TABLE II. Hence the pattern GCA occurs twice in the string T=AGCAGCAGACT$, and its occurrences start at the 2nd and 5th positions of T.

IV. EXPERIMENTAL SETUP & RESULTS ANALYSIS

The experiments were conducted on an HP Pavilion g series machine with a 2.8 GHz four-core Intel® Core™ i3-860 chip with 4 MB L3 cache, but no parallelism was used. The machine runs the 64-bit Ubuntu 12.04 operating system and has 4 GB of internal memory and one 500 GB Serial ATA hard drive (7,200 RPM). The following real-world biological and non-biological data were used to test the efficiency and usability of the proposed method:

1. The human genome sequences from NCBI.
2. Protein data from the Pizza & Chili Corpus.
3. English texts from the Wikipedia dump.
4. Simulated data of a DNA sequence (Arabidopsis thaliana) and its short read archives (http://plants.ensembl.org/), used to compare the CPU time depicted in TABLE V.

G++ 4.7.3 is used to build all the source code for the experiments, together with the Succinct Data Structure Library (SDSL).

TABLE III shows the space required for the index prepared for use in BWT-WT. The index size of the proposed approach BWT-WT is compared with that of the other tools BWA [5], Soap [9] and Bowtie [11] in TABLE IV and Figure 5. The CPU time of the proposed scheme is compared with that of the others in TABLE V.
TABLE IV. INDEX SIZE COMPARISON OF PROPOSED BWT-WT INDEX WITH OTHERS
    Program   Read Length, single pair end read (bp)   CPU Time (s)
    Bowtie    36                                       375
    Soap      36                                       249
    BWA       36                                       289
    BWT-WT    36                                       284
[Figure: axis label "Index Size (MB)"]
V. CONCLUSION

This paper has shown how to extend the BWT based approach to a WT based data structure for compressed indexes. BWT-WT is a simple and fast scheme for short read alignment. Experiments show that the BWT-WT based program achieves more compression and also more efficient searching in comparison to the BWT based approach. As future work, one can consider approximate matches (insertions, deletions, gaps).

REFERENCES

[1] D. Adjeroh, T. Bell, and A. Mukherjee. The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. Springer, 1 edition, 2008.
[2] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Systems Research, Research R(124): pp. 1–24, 1994.
[3] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms, 3, 2007.
[4] R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, SODA '03, pp. 841–850, Philadelphia, PA, USA, 2003. Society for Industrial and Applied Mathematics.
[5] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14): pp. 1754–1760, 2009.
[6] U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. In Proceedings of the first annual ACM-SIAM symposium on Discrete algorithms, SODA '90, pp. 319–327, Philadelphia, PA, USA, 1990. Society for Industrial and Applied Mathematics.
[7] D. Zhang and Q. Liu. Compression and Indexing based on BWT: A Survey. Web Information System and Application Conference, 2013.
[8] M. Schindler. A fast block-sorting algorithm for lossless data compression. In Proceedings of the Conference on Data Compression (Vol. 469). IEEE Computer Society, March 1997.
[9] R. Li, Y. Li, K. Kristiansen, and J. Wang. SOAP: short oligonucleotide alignment program. Bioinformatics, 24(5): pp. 713–714, 2008.
[10] B. Langmead, C. Trapnell, M. Pop, and S. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, Vol. 10, Issue 3, Article R25, 2009.
[11] Succinct Data Structure Library: https://github.com/simongog/sdsl-lite
[12] H. Li and R. Durbin. Fast and accurate long read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5): pp. 589–595, 2010.
[13] R. Raman, V. Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In SODA, pp. 233–242, 2002.