
A case study on Scientific Analysis of Cancer data using Hadoop

Sankalp Jain, Amit Singh, Anju Singh, Bhagyashri Pathak, Anil Gupta, Lakshmi Panat,
Ishan Batra, Anuradha Tomar, Sarita Narwal and Amit Saxena ∗

∗ SIG Initiative, Centre for Development of Advanced Computing, Pune, INDIA - 411007, [email protected], http://cdac.in

Abstract

New-age research in the health domain is supported to a great extent by information technology and scientific equipment. Scientific instruments generate enormous amounts of data from real patients in order to understand the genetic implications of disease, and next-generation technologies are the major contributors to this data generation. Many databases of cancer genomics data, such as TCGA, UCSC and the Cancer Gene Census, are available for public access and can be used by researchers together with the MapReduce technique for data analysis. Bioinformatics tools such as Crossbow and CloudBurst can use Hadoop to analyse large datasets more efficiently than traditional methods. In this paper we describe a methodology for analyzing open-access cancer data using bioinformatics tools such as CloudBurst and Crossbow with Hadoop on an inexpensive commodity platform.

1. INTRODUCTION

Bioinformatics faces a great challenge in processing, storing and analyzing the data generated by Next-Generation Sequencing (NGS) [1] labs. The cost of data production has decreased with new technologies, but analysis of the generated data remains a challenge. The DNA sequence is required for comparative genomics studies, in which the genomes of various organisms are compared in order to understand their functions by finding similarities and differences. Sequencing is becoming a general-purpose tool to identify functional sequences and characterize genomes [2]. It is also helpful when we have a reference genome and can compare healthy samples with mutated ones. Next-generation sequencing techniques also promise cost-effective gene sequencing; indeed, the high demand for low-cost sequencing is what paved the way for these new methods. With the advent of next-generation sequencing and the use of high-throughput methods, there has been an enormous change in genetics research. From human genome sequencing to personal genome sequencing, there has been substantial growth in the technologies used in the process. DNA sequencing labs can produce terabytes of data in a week, so dealing with such volumes requires cyber-infrastructure and high-end software tools. Hadoop [3] is a good solution for big data analytics: it is a cost-effective, flexible, scalable, fault-tolerant and easy-to-use platform. The use of Hadoop-based tools in bioinformatics can revolutionize our understanding of biology, health and the natural world. Several bioinformatics applications run on Hadoop to help researchers, and tools such as CloudBurst [4] and Crossbow [5] were developed specifically for it. CloudBurst is a parallel read-mapping algorithm optimized for mapping NGS data to the human genome and other reference genomes. Crossbow is a software pipeline for whole-genome resequencing analysis that uses Hadoop to compress many hours of computation into only a few. In this paper we summarize our implementation of a Hadoop cluster and the analysis of open cancer data by running bioinformatics tools such as CloudBurst and Crossbow on it.

2. Hadoop framework

The Apache Hadoop software [3] is an open-source batch-processing framework for data-intensive distributed processing of large data sets across clusters of computers using simple programming models. Other frameworks such as Phoenix, Disco and Mars also implement the MapReduce model, but Hadoop is the most widely used open-source implementation [6]. Hadoop was created by Doug Cutting and Michael J. Cafarella in 2005, initially inspired by the papers on MapReduce [7] and the Google File System [8] published by Google; Doug named it after his son's stuffed toy elephant. It was originally developed to support distribution for the Nutch search engine project [9].
Apache Hadoop is written in Java and offers a new way of storing and processing large data. It runs on large clusters of commodity hardware and is scalable and highly fault tolerant. Hadoop follows a master-slave architecture in which the NameNode and JobTracker act as masters while the DataNodes and TaskTrackers act as slaves. Hadoop is used by organizations such as Facebook, Amazon, Yahoo and Twitter [10].

3. Hadoop and Bioinformatics

Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community [11], especially in the field of next-generation sequencing analysis, and such use is increasing. This is due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters and to the ease of use of the MapReduce method for parallelizing many data analysis algorithms. The initial delay in the adoption of Hadoop for big data was mostly due to a lack of information and inertia within the community. Hadoop began to be used in bioinformatics in May 2009, and it is used mostly in next-generation sequencing because that is where most of the big data is generated [12].

3.1. MPI and map-reduce

MPI, the Message Passing Interface [13], has no notion of data locality: data is sent to another node to be computed on, so MPI depends on network speed for good performance. MapReduce with HDFS, by contrast, duplicates data so that computation can run against local storage; MapReduce thus takes advantage of local storage to avoid the network bottleneck when working with large data. MPI is good for communication-intensive tasks, whereas MapReduce is best for distributed batch processing. MPI typically runs on clusters of dedicated high-performance servers, while Hadoop can run on commodity hardware. Hadoop is the better framework for dealing with large unstructured data: MPI suits problems dominated by computation, and Hadoop suits problems dominated by data [14]. We have therefore adopted the MapReduce model, which is suitable for cancer data analysis.

4. Hadoop implementation on testbed architecture

Hadoop is a framework of open-source tools maintained by Apache. It was not designed with traditional architectures in mind; rather, it was designed around a simple cluster architecture suitable for commodity platforms. A cluster can have thousands of simple components called nodes. Every node has its own computing cores, memory and disks for storage. Nodes are integrated to form a rack, and a group of racks forms a cluster; all nodes are connected by a high-speed network. This commodity hardware is used to process massive data in a fraction of the time. Hadoop was designed for distributed applications that extract the most out of such a cluster architecture: data is laid out evenly across the cluster so that the advantage of data locality is best used. Hadoop can be divided into two major components: HDFS and MapReduce.

4.1. HDFS

HDFS is the distributed file system used by Hadoop [15]. Each Hadoop cluster consists of one NameNode and multiple DataNodes; the NameNode stores the metadata while the DataNodes store the actual data. HDFS stores large data on the various machines of the cluster by splitting large files into fixed-size blocks (64 MB or 128 MB) and storing them with a replication factor that is three by default. HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data. Like any other file system, HDFS is based on a very simple design: a file is split into equal-size blocks, and the file block size is used as the unit for distributing the parts of a file across disks. If a disk in any node fails, the same file blocks remain available on other nodes across the cluster, since the number of copies is by default set to three. Being a distributed file system, HDFS manages storage across a network; files can be distributed and managed in the same or different racks of a cluster. It splits, scatters, replicates and manages data across the nodes of the cluster.
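To make this concrete, the snippet below is a minimal sketch of the kind of configuration a Hadoop 1.x-era cluster like ours (one NameNode/JobTracker master, several DataNode/TaskTracker slaves) would use. The hostname "master" and the port numbers are illustrative assumptions, not values from our testbed; the property names are the standard Hadoop 1.x ones.

core-site.xml (address of the HDFS NameNode):
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>  <!-- hostname/port are assumptions -->
  </property>

hdfs-site.xml (block size and replication factor discussed above):
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>            <!-- 64 MB blocks -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>                   <!-- three copies of every block -->
  </property>

mapred-site.xml (address of the JobTracker):
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>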
4.2. Data nodes

All nodes that can store data are called DataNodes. A DataNode connects to the NameNode at startup and then performs file system operations. Applications can talk directly to a DataNode once the NameNode has provided its location, and TaskTracker instances also communicate with the DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients, and they also perform block creation, deletion and replication upon instruction from the NameNode [15].
4.3. Name node

The NameNode is responsible for keeping an index of the data residing on the different nodes [16]. Hadoop treats all nodes as data nodes but designates one node as the NameNode. The NameNode decides, for each Hadoop file, the location of the disks on which the file blocks are stored; it keeps track of all this information in the form of a table that it stores locally. When a node fails, the NameNode identifies all the file blocks stored on the disks of that node, retrieves those file blocks from other healthy nodes, finds new nodes to store another copy of them, stores the copies there, and updates the information about the new file blocks in its table. An application gets its data directly once it knows the location and is not dependent on the NameNode thereafter. Hadoop maintains three copies of each file, scattered across the network.
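The following Python fragment is a toy sketch (not Hadoop source code) of the bookkeeping just described: a block table mapping each file block to the nodes holding a copy, and the re-replication carried out when a node fails. All names in it are our own illustrations.

import random

REPLICATION = 3  # HDFS default: three copies of every block

class BlockTable:
    """Toy model of the NameNode's table: block id -> nodes holding a copy."""

    def __init__(self, nodes):
        self.nodes = set(nodes)      # healthy data nodes
        self.locations = {}          # block id -> set of node names

    def add_block(self, block_id):
        # Place REPLICATION copies on distinct nodes.
        self.locations[block_id] = set(
            random.sample(sorted(self.nodes), REPLICATION))

    def handle_node_failure(self, failed):
        self.nodes.discard(failed)   # take the failed node out of service
        for block_id, holders in self.locations.items():
            holders.discard(failed)  # identify blocks that lost a copy
            while len(holders) < REPLICATION:
                # copy the block from a healthy holder to a new node
                # and record the new location in the table
                holders.add(random.choice(sorted(self.nodes - holders)))

table = BlockTable(["node1", "node2", "node3", "node4"])
table.add_block("blk_0001")
table.handle_node_failure("node2")
print(table.locations)               # blk_0001 has three live replicas again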
4.4. Secondary Name node

There is a risk associated with the NameNode being a single point of failure: if the NameNode fails, all the information about files and their respective file blocks would be lost. Hadoop addresses this problem by maintaining a backup of the NameNode's metadata in the Secondary NameNode, which can be used for recovery when the NameNode fails.

4.5. Job tracker

The role of the JobTracker component is to break a bigger job into smaller components called tasks [16] and to send each set of tasks to the TaskTrackers. The JobTracker combines the results collected back from the TaskTrackers and sends the final result to the application. If any TaskTracker fails, the JobTracker assigns its task to another TaskTracker. The JobTracker performs resource management as well as job life-cycle management.

4.6. Task tracker

The TaskTracker is a slave: it performs the tasks allocated to it by the JobTracker and sends the completed results back. It takes direction from the JobTracker to run tasks to completion, and to monitor and report failures.

Hadoop thus provides the programmer a mechanism that takes care of file locations, fault tolerance, the division of a program into tasks, and the scaling of the program. Programmers can concentrate on writing a scale-free program, and the scaling is managed by Hadoop itself; the cost of scaling is also linear.

4.7. Mapreduce

MapReduce is a computational mechanism for executing an application in parallel on the computing cores of the nodes by dividing it into tasks that are allocated to the data chunks. The intermediate results are collected and redistributed by MapReduce, and failures such as a node going down are also managed by the MapReduce mechanism. MapReduce consists of two phases, map and reduce, which execute sequentially one after the other. In the map phase, all nodes perform the same computation, each against the part of the data allocated to that node: the map job takes input data and produces key-value pairs. The reduce step takes those key-value pairs and combines or aggregates them to give the result. MapReduce is very good at moving compute to the data: processing takes place where the data is placed.
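As a concrete illustration of the two phases, the pair of scripts below counts the occurrences of every k-mer in a set of reads using Hadoop Streaming. They are our own minimal sketch, not part of any tool used in this paper: they assume one read per input line (not FASTA), and the value of K is arbitrary. The mapper emits a key-value pair per k-mer; Hadoop sorts the pairs by key, and the reducer aggregates the counts.

mapper.py:

import sys

K = 8  # k-mer length; arbitrary, for illustration only

for line in sys.stdin:
    read = line.strip().upper()
    for i in range(len(read) - K + 1):
        # emit one (k-mer, 1) key-value pair per k-mer in the read
        print(read[i:i + K] + "\t1")

reducer.py:

import sys

current, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:                  # input arrives sorted by key
        if current is not None:
            print(current + "\t" + str(count))
        current, count = key, 0
    count += int(value)
if current is not None:
    print(current + "\t" + str(count))

On a Hadoop 1.x cluster such a job would be launched with the streaming jar shipped under contrib/ (the exact jar file name depends on the installed version), e.g. bin/hadoop jar contrib/streaming/hadoop-streaming-*.jar -input reads -output kmer_counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py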
4.8. Bioinformatics Tools and Softwares on Hadoop

Hadoop began to be used in bioinformatics in May 2009. CloudBurst [4] was the first bioinformatics software developed to run on Hadoop, and various packages have been developed since. The advantages of running tools on Hadoop include parallelization, scalability, redundancy, automatic monitoring, etc. Various bioinformatics software packages run on Hadoop:

CloudBurst: A parallel read-mapping algorithm optimized for mapping NGS data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping and personal genomics. It was the first attempt to parallelize a bioinformatics algorithm with MapReduce/Hadoop, developed by Professor Michael Schatz at the University of Maryland. It is modeled after the short-read-mapping program RMAP and reports either all alignments or the unambiguous best alignment for each read, with any number of mismatches or differences.

Crossbow: Crossbow [5] is a scalable software pipeline for whole-genome resequencing analysis. It is a cloud version of Bowtie: it aligns reads to a reference genome with Bowtie, exploiting the large pool of Hadoop/MapReduce compute nodes, and then uses SOAPsnp for genotyping the sample. The main aim of the tool is to analyse sequence data on a Hadoop cluster with existing tools, making the computation fast while retaining all the features of Hadoop, such as scalability, fault tolerance, reliability, parallelization and distributed computing, together with the accuracy of the existing tools.

Myrna: Myrna [17] is a cloud computing tool for calculating differential gene expression in large RNA-seq datasets. It uses Bowtie for short-read alignment and R/Bioconductor for interval calculations, normalization and statistical testing.

Contrail: Contrail [18] is a de Bruijn graph based assembler that uses Hadoop for de novo assembly of large genomes from short sequencing reads.

Jnomics: Jnomics [19] provides many common genomics tasks such as sorting, merging, filtering and selection; all of these tasks can be performed over data distributed across the machines of a Hadoop cluster.

Hadoop-BAM: Hadoop-BAM [20] is a library for scalable manipulation of aligned NGS data in the Hadoop framework, a specialized data platform that works with the BAM format and uses MapReduce to perform functions such as genotyping and peak calling.

CloudBLAST: CloudBLAST [21] is a MapReduce version of the commonly used bioinformatics application NCBI BLAST.

SEAL: SEAL [22] is a suite of distributed applications for aligning short DNA reads and for manipulating and analyzing short-read alignments. SEAL applications generally run on the Hadoop framework and are made to scale well in the number of computing nodes available and the amount of data to process.

PeakRanger: PeakRanger [23] is a cloud-enabled peak caller for ChIP-seq data.

Quake: Quake [24] is a quality-aware detection and sequencing-error-correction tool.

BlastReduce: BlastReduce [25] is a high-performance short-read mapping tool.

5. Case Study: Problem definition

In the genomics world, large data sets are generated in NGS labs. The Illumina sequencers at the Sanger Institute generate about 2 PB of data in a year, and the institute faces problems such as running out of disk, memory and power for this continuously increasing data [26]. Adding more memory or disks may look like a solution, but it is not, because the existing algorithms are not ready to scale to such amounts of data. There is therefore a need for distributed HPC solutions to handle these huge amounts of data. The three main problems in sequencing are:

• Read mapping.
• Mapping and SNP discovery.
• De novo genome assembly.

Various classic methods are available for solving these problems, but they take a very long time: approximately 10K CPU hours or more for a single alignment. Thus there is a requirement for solutions that can scale up and deal with large data sets easily. Traditional tools work on high-performance computing principles; these methods use the message passing interface (MPI) [27], which is a very suitable mechanism for compute-intensive problems. NGS applications, however, have added another dimension to the problem: dealing with huge data. Newer methods such as MapReduce are more suitable for these data-intensive problems [28]. Based on this experience with past technologies, we have established a Hadoop implementation infrastructure, which is used to analyse open-access cancer data using open-source tools.

5.1. Running CloudBurst on Hadoop Cluster

Figure 1. 4-node hadoop cluster

We used CloudBurst with its sample data package on the 4-node Hadoop cluster (Figure 1); here we discuss the process of using the package to get the desired results. CloudBurst has several parameters that control the sensitivity of the alignment algorithm. Here it finds the unambiguous best alignment for 100,000 reads, allowing up to 3 mismatches, when mapping to the corresponding S. suis genome. To run it, the Hadoop cluster must be running, and both the read file and the reference file must be converted into .br files and uploaded into HDFS. CloudBurst comes as a Java archive file, so it can be used directly in the Hadoop environment. The following steps are performed:

1. Start the Hadoop cluster (this starts both HDFS and MapReduce):
hduser@master> hadoop/bin/start-all.sh
2. Upload the read and reference files into HDFS:
bin/hadoop dfs -put /path/to/read/&/reference/file /path/into/hdfs

3. Run CloudBurst:
bin/hadoop jar CloudBurst.jar (followed by all input parameters)

5.2. Input sets and output parameters

Input of the CloudBurst program:
s_suis.fa: Streptococcus suis reference genome sequence (converted into a .br file)
100k.fa: 100,000 36 bp Illumina reads (converted into a .br file)

Figure 2. Input and output parameters

The output of CloudBurst (Figure 2) comes in the following format:

1. chrom: name of the reference sequence
2. start: start coordinate on the reference
3. end: end coordinate on the reference
4. name: name of the read
5. score: number of mismatches/differences
6. strand: +/- for forward/reverse
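Assuming the alignments have been exported from HDFS as tab-separated text in the six-field order listed above (CloudBurst's on-disk records may require conversion to plain text first), a short script such as the following sketch can summarize them. The script and its names are our own illustration, not part of CloudBurst.

import sys
from collections import Counter

mismatches = Counter()     # score -> number of aligned reads
per_reference = Counter()  # chrom -> number of aligned reads

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        continue           # skip malformed lines
    chrom, start, end, name, score, strand = fields[:6]
    mismatches[int(score)] += 1
    per_reference[chrom] += 1

for score in sorted(mismatches):
    print("%d mismatch(es): %d reads" % (score, mismatches[score]))
for chrom, n in per_reference.most_common():
    print("%s: %d aligned reads" % (chrom, n))

It could be fed, for example, with: bin/hadoop dfs -cat /path/to/results/part-* | python summarize_alignments.py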
6. Results

CloudBurst's running time scales linearly with the number of reads mapped, with near-linear speedup as the number of nodes in the Hadoop cluster increases. CloudBurst reduces the running time from hours to mere minutes for typical jobs involving the mapping of millions of short reads to the human genome, and it is available open source as a model for parallelizing other bioinformatics algorithms with MapReduce. Hadoop-based tools are thus helpful in analysing NGS data related to cancer because of this linear-scalability feature: we can keep adding nodes to the Hadoop cluster to obtain the expected scalability for a given data size.

7. Discussion

Cancer is a prominent cause of death in developed as well as developing countries, and timely prediction of cancer is the only way to ensure the survival of the patient. Genomic surveillance is an essential component of new cancer diagnostic methods. NGS methods can be very useful in mapping the variation of genes across the various stages of cancer, and can be used to detect the prevalence of strongly predictive cancer genes. Today most cancer research laboratories are working to identify the genes that seem to be responsible for the different kinds of cancer. Most of the advancements, however, are made in the detection, prevention and treatment of cancer, not in understanding its cause.

8. CONCLUSIONS

Biological complexities are posing grand challenges which cannot be solved by the available technologies and mechanisms. The problems are so complex that a single domain of knowledge is not sufficient to solve them. There is an immense need for collaborative efforts among expert researchers from various fields, such as bioinformatics, artificial intelligence, big data analytics, data science and statistics, to come together and solve the problems of today. The development of high-performance expert systems, with big data machine learning algorithms applied to the bioinformatics data generated by next-generation sequencers, can solve today's biological problems. We have taken a small step to demonstrate the capability of such collaborative efforts by solving a proof-of-concept problem, and we hope that the proposed solution can be scaled in the future to provide an intensive production environment for solving health analytics problems.

ACKNOWLEDGEMENT

The work was carried out as a part of the Special Interest Groups (SIG) activity of C-DAC Pune. The authors would like to thank Dr. Hemant Darbari, Executive Director, C-DAC, Pune, for providing a platform to support innovative ideas as a SIG activity. The authors are also immensely grateful to Dr. Rajendra Joshi, Associate Director & HOD, Bioinformatics Group, C-DAC Pune, for his guidance at every step of the work.

References

[1] J. S. Reis-Filho et al., "Next-generation sequencing," Breast Cancer Res, vol. 11, no. Suppl 3, p. S12, 2009.
[2] M. Y. Galperin and E. V. Koonin, "Who's your neighbor? New computational approaches for functional genomics," Nature Biotechnology, vol. 18, no. 6, pp. 609–613, 2000.
[3] Apache Hadoop, 2009. [Online]. Available: http://hadoop.apache.org
[4] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, no. 11, pp. 1363–1369, 2009.
[5] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, "Searching for SNPs with cloud computing," Genome Biology, vol. 10, no. 11, p. R134, 2009.
[6] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping," BMC Research Notes, vol. 4, no. 1, p. 171, 2011.
[7] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.
[8] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003, pp. 29–43.
[9] M. Cafarella and D. Cutting, "Building Nutch: open source search," ACM Queue, vol. 2, no. 2, pp. 54–61, 2004.
[10] R. P. Padhy, "Big data processing with Hadoop-MapReduce in cloud systems," International Journal of Cloud Computing and Services Science (IJ-CLOSER), vol. 2, no. 1, pp. 16–27, 2012.
[11] R. C. Taylor, "An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics," BMC Bioinformatics, vol. 11, no. Suppl 12, p. S1, 2010.
[12] K. Arumugam, Y. S. Tan, B. S. Lee, and R. Kanagasabai, "Cloud-enabling sequence alignment with Hadoop MapReduce: A performance analysis," in 4th International Conference on Bioinformatics and Biomedical Technology, 2012.
[13] D. W. Walker and J. J. Dongarra, "MPI: a standard message passing interface," Supercomputer, vol. 12, pp. 56–68, 1996.
[14] A. Katal, M. Wazid, and R. Goudar, "Big data: Issues, challenges, tools and good practices," in Contemporary Computing (IC3), 2013 Sixth International Conference on. IEEE, 2013, pp. 404–409.
[15] D. Borthakur, "The Hadoop distributed file system: Architecture and design," Hadoop Project Website, vol. 11, p. 21, 2007.
[16] S. Humbetov, "Data-intensive computing with map-reduce and Hadoop," in Application of Information and Communication Technologies (AICT), 2012 6th International Conference on. IEEE, 2012, pp. 1–5.
[17] B. Langmead, K. D. Hansen, J. T. Leek et al., "Cloud-scale RNA-sequencing differential expression analysis with Myrna," Genome Biology, vol. 11, no. 8, p. R83, 2010.
[18] Contrail: a Hadoop-based genome assembler for assembling large genomes in the clouds. [Online]. Available: http://sourceforge.net/projects/contrail-bio/
[19] Jnomics: a collection of cloud-scale DNA sequence analysis tools. [Online]. Available: http://sourceforge.net/projects/jnomics/
[20] M. Niemenmaa, A. Kallio, A. Schumacher, P. Klemelä, E. Korpelainen, and K. Heljanko, "Hadoop-BAM: directly manipulating next generation sequencing data in the cloud," Bioinformatics, vol. 28, no. 6, pp. 876–877, 2012.
[21] A. Matsunaga, M. Tsugawa, and J. Fortes, "CloudBLAST: Combining MapReduce and virtualization on distributed resources for bioinformatics applications," in eScience, 2008. eScience'08. IEEE Fourth International Conference on. IEEE, 2008, pp. 222–229.
[22] L. Pireddu, S. Leo, and G. Zanetti, "SEAL: a distributed short read mapping and duplicate removal tool," Bioinformatics, vol. 27, no. 15, pp. 2159–2160, 2011.
[23] X. Feng, R. Grossman, and L. Stein, "PeakRanger: a cloud-enabled peak caller for ChIP-seq data," BMC Bioinformatics, vol. 12, no. 1, p. 139, 2011.
[24] D. R. Kelley, M. C. Schatz, S. L. Salzberg et al., "Quake: quality-aware detection and correction of sequencing errors," Genome Biology, vol. 11, no. 11, p. R116, 2010.
[25] M. C. Schatz, "BlastReduce: high performance short read mapping with MapReduce," University of Maryland, http://cgis.cs.umd.edu/Grad/scholarlypapers/papers/MichaelSchatz.pdf, 2008.
[26] E. Dart, "BER science network requirements," Lawrence Berkeley National Laboratory, 2011.
[27] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard," Parallel Computing, vol. 22, no. 6, pp. 789–828, 1996.
[28] S. N. Srirama, P. Jakovits, and E. Vainikko, "Adapting scientific computing problems to clouds using MapReduce," Future Generation Computer Systems, vol. 28, no. 1, pp. 184–192, 2012.
