Big Data Challenges
in Bioinformatics
BARCELONA
SUPERCOMPUTING
CENTER
COMPUTER
SCIENCE
DEPARTMENT
Autonomic
Systems
and
eBusiness
Platforms
Jordi
Torres
[email protected]
Talk outline
We talk about Petabytes?
Deluge of Data
Data is now considered the
Fourth Paradigm in Science
the first three paradigms were experimental, theoretical and
computational science.
This shift is being driven by the rapid growth in data
from improvements in
scientific instruments
Scientific instruments
Physics
Large Hadron Collider
produced around 15 petabytes
of data in 2012.
Astronomy
The Large Synoptic Survey Telescope is anticipated to produce around 10 petabytes per year.
Example: In Genome Research?
Cost of sequencing a human-sized genome
[Chart: cost per genome, September 2001 to January 2013, on a logarithmic cost axis from $1,000 to $100,000,000]
Source: National Human Genome Research Institute (NHGRI) http://www.genome.gov/sequencingcosts/
Data Deluge: driven by changes in how big data is generated
Example: Biomedicine
Image source: Big Data in biomedicine, Drug Discovery Today. Fabricio F. Costa (2013)
Important open issues
Transfer of data from one location to another (*)
shipping external hard disks
processing the data while it is being transferred
Future? Data won't be moved!
(*) Out of scope of this presentation
Source: http://footage.shutterstock.com/clip-4721783-stock-footage-animation-presents-datatransfer-between-a-computer-and-a-cloud-a-concept-of-cloud-computing.html
Important open issues
Security and privacy of the data from individuals (*)
The same problems that appear in other areas
Use advanced encryption algorithms
(Out of scope of this presentation)
Source: http://www.tbase.com/corporate/privacy-and-security
Important open issues
Increased need to store data (*)
Cloud-based computing solutions have emerged
Source: http://www.custodia-documental.com/wp-content/uploads/Cloud-Big-Data.jpg
Important open issues
Increased need to store data (*)
Cloud-based computing solutions have emerged
The most common Cloud Computing inhibitors should be
tackled
Security, Privacy, Lack of Standards, Data Integrity, Regulatory, Data Recovery, Control, Vendor Maturity, ...
The most critical open issue
DERIVING VALUE VIA HARNESSING VOLUME, VARIETY AND VELOCITY (*)
Information is not actionable knowledge.
(*) Big Data definition?
Source: http://www.theatlantic.com/health/archive/2012/05/big-data-can-save-health-care-0151-but-at-what-cost-to-privacy/257621/
What is the usefulness of Big Data?
Performs predictions of outcomes and behaviors
[Diagram: Data + Volume → Value, Information, Knowledge]
Approach: Machine Learning works in the sense that these methods detect subtle structure in data relatively easily, without having to make strong assumptions about the parameters of the underlying distributions.
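To make this concrete, here is a minimal sketch of letting a method find structure rather than positing a distribution. The library choice (scikit-learn's KMeans) and the synthetic data are illustrative assumptions, not something prescribed by the talk.

```python
# Minimal sketch: detect structure (clusters) in data without positing a
# parametric model up front. scikit-learn and the synthetic data are only
# illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two made-up groups of 5-dimensional samples (e.g., expression profiles).
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
               rng.normal(5.0, 1.0, size=(100, 5))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_[:3], model.labels_[-3:])  # the two groups get separate labels
```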
Data Analytics: To extract knowledge
Big data uses inductive statistics and concepts
from nonlinear system identification to infer laws
(regressions, nonlinear relationships, and causal
effects) from large data sets to reveal
relationships, dependencies, and to perform
predictions of outcomes and behaviors.
(*) Wikipedia
(Out of scope of this presentation) :-(
Talk outline
We talk about Petabytes?
Challenges related to us?
The big data problem: in the end it is a Computing Challenge
Computing Challenge
Researchers need to crunch a large amount of data very
quickly (and easily) using high-performance computers
Example: A de novo assembly algorithm for DNA data
finds reads whose sequences overlap and records
those overlaps in a huge diagram called an assembly
graph. For a large genome, this graph can occupy many
terabytes of RAM, and completing the genome
sequence can require weeks or months of
computation on a world-class supercomputer.
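As an illustration only, and not the production assemblers or data structures used on real genomes, here is a toy sketch of the overlap step: an edge is recorded when the suffix of one read matches the prefix of another. The reads and the minimum-overlap threshold are invented.

```python
# Toy overlap-graph construction: an edge (a, b, k) means the last k bases of
# read a equal the first k bases of read b. Real assemblers use far more
# compact structures; for a large genome this graph can need terabytes of RAM.
def overlap(a, b, min_len=3):
    """Longest suffix of a that is a prefix of b, if at least min_len; else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def build_overlap_graph(reads, min_len=3):
    edges = []
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(reads[a], reads[b], min_len)
                if k:
                    edges.append((a, b, k))
    return edges

reads = {"r1": "ATGCGT", "r2": "CGTACC", "r3": "ACCTGA"}  # made-up reads
print(build_overlap_graph(reads))  # [('r1', 'r2', 3), ('r2', 'r3', 3)]
```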
What does life science research do at BSC?
Pipeline schema:
0. Receive the data: Raw Genome Data (120 GB) into Common Storage
MareNostrum: 25-40 h / 50-100 CPUs
Altix / CLL cluster: 50-60 h / 1-10 CPUs
The data deluge also appears in genomics
The DNA data deluge comes from thousands of sources:
more than 2,000 sequencing instruments around the world produce more than 15 petabytes of genetic data per year.
And soon, tens of thousands of sources!
Image source: https://share.sandia.gov/news/resources/
news_releases/images/2009/biofuel_genes.jpg
The total computing burden is growing
DNA sequencing
is on the path to
becoming an
everyday tool in
medicine.
Computing, not sequencing, is now the slower
and more costly aspect of genomics research.
How can we help at BSC?
Something must be done now, or else we'll need to put vital research on hold while the necessary computational techniques catch up, or are invented.
What is clear is that it will involve both better algorithms and a renewed focus on big data approaches to managing and processing data.
How?
Doing outstanding research to speed up this process
What is the time required to retrieve information?
1 Petabyte = 1,000 x 1 Terabyte
Assume a read throughput of 100 MB/sec per disk:
scanning 1 Terabyte takes about 10,000 seconds (roughly 3 hours)
scanning 1 Petabyte takes about 10,000,000 seconds (roughly 3,000 hours)
Solution? Massive parallelism, not only in computation but also in storage.
Assume 10,000 disks: scanning 1 TB takes 1 second.
Source: http://www.google.com/about/datacenters/gallery/images/_2000/IDI_018.jpg
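A quick sanity check of the arithmetic above in plain Python, using the slide's 100 MB/s assumption and decimal units:

```python
# Back-of-the-envelope scan times under the slide's assumptions:
# 100 MB/s sustained read throughput per disk, 1 TB = 10**6 MB (decimal units).
THROUGHPUT_MB_S = 100
TB_IN_MB = 10**6

tb_seconds = TB_IN_MB / THROUGHPUT_MB_S      # 10,000 s  -> roughly 3 hours
pb_seconds = 1000 * tb_seconds               # 10,000,000 s -> roughly 3,000 hours
parallel_tb = tb_seconds / 10_000            # 1 s with 10,000 disks in parallel

print(tb_seconds / 3600, pb_seconds / 3600, parallel_tb)
```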
Talk outline
We talk about Petabytes?
Research in Big Data at BSC
To support this massive data parallelism & distribution it is necessary to redefine and improve:
Data Processing across hundreds of thousands of servers
Data Management across hundreds of thousands of data devices
Dealing with new System Data Infrastructure
How?
How do companies like Google read and process data from 10,000 disks in parallel?
Source: http://www.google.com/about/datacenters/gallery/images/_2000/IDI_018.jpg
GOOGLE: New programming model
To meet the challenges: MapReduce, a programming model introduced by Google in the early 2000s to support distributed computing (with special emphasis on fault tolerance).
An ecosystem of big data processing tools has grown around it: open source, distributed, and running on commodity hardware.
MapReduce: some details
The key innovation of
MapReduce is
the ability to take a query over a
data set, divide it, and run it in
parallel over many nodes.
Two phases: a Map phase and a Reduce phase.
[Diagram: Input Data → Mappers → Reducers → Output Data]
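To make the two phases tangible, here is a minimal single-machine sketch of the map/shuffle/reduce pattern in plain Python. It is not Hadoop or Google's implementation; the word-count example is simply the classic illustration.

```python
# Minimal map/shuffle/reduce sketch (single process). In a real system the
# mappers and reducers run on many nodes and failed tasks are re-executed.
from collections import defaultdict

def map_phase(record):
    # emit (key, value) pairs -- here (word, 1)
    return [(word, 1) for word in record.split()]

def reduce_phase(key, values):
    return key, sum(values)

records = ["big data in bioinformatics", "big data challenges"]

groups = defaultdict(list)          # the shuffle: group values by key
for record in records:              # mappers (would run in parallel)
    for key, value in map_phase(record):
        groups[key].append(value)

print([reduce_phase(k, v) for k, v in groups.items()])   # reducers
# [('big', 2), ('data', 2), ('in', 1), ('bioinformatics', 1), ('challenges', 1)]
```

In a real deployment the shuffle moves intermediate data across the network, and the runtime restarts failed map or reduce tasks transparently.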
Limitations of MapReduce as a Programming model?
MapReduce is great, but not everyone is a MapReduce expert
("I am a Python expert, but ...")
There is a class of algorithms that
cannot be efficiently implemented with
the MapReduce programming model
Different programming models deal
with different challenges
Example: pyCOMPSs from BSC
[Diagram: Input Data → OmpSs/COMPSs → Output Data]
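As a hedged sketch of what such a decomposition can look like with PyCOMPSs (using the @task decorator and compss_wait_on described in the PyCOMPSs documentation; the GC-content task and the chunking are invented for illustration):

```python
# Sketch of a PyCOMPSs-style decomposition: ordinary Python functions become
# asynchronous tasks via a decorator, and the COMPSs runtime schedules them
# on the available resources. The task itself is a made-up example.
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

@task(returns=int)
def count_gc(chunk):
    """Count G and C bases in one chunk of sequence."""
    return sum(1 for base in chunk if base in "GC")

def gc_content(sequence, chunk_size=1_000_000):
    chunks = [sequence[i:i + chunk_size]
              for i in range(0, len(sequence), chunk_size)]
    partial = [count_gc(c) for c in chunks]   # tasks are launched asynchronously
    partial = compss_wait_on(partial)         # synchronize and collect results
    return sum(partial) / len(sequence)
```

The point of the model is that the code stays sequential-looking Python while the runtime extracts the parallelism, which is exactly the gap left by "not everyone is a MapReduce expert".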
Big Data resource management
Big Data characteristics → Requirements from the data store:
Volume → Scalability
Variety → Schema-less
Velocity → Relaxed consistency & capacity to digest
Relational databases are not suitable for Big Data problems:
lack of horizontal scalability
complex structures to express complex relationships
hard consistency
Non-relational databases (NoSQL) are the alternative data store:
relaxed consistency (eventual consistency)
General view of NoSQL storage (and replication)
[Diagram: data partitions 1-5 distributed and replicated across several storage nodes]
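A toy sketch in the spirit of this diagram: each key is hashed to a primary partition and copied to the next node(s) for replication. The node names and replica count are invented, and real NoSQL stores generally use consistent hashing with virtual nodes rather than this naive modulo placement.

```python
# Toy hash-based partitioning with replication: each key gets a primary node
# plus (REPLICAS - 1) copies on the following nodes in the ring.
import hashlib

NODES = ["node0", "node1", "node2", "node3"]
REPLICAS = 2

def placement(key, nodes=NODES, replicas=REPLICAS):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    primary = h % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(replicas)]

print(placement("sample_42"))   # e.g. ['node1', 'node2']
```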
Big Data resource management: open issues in NoSQL
Query performance depends heavily on data model
Designed to support many concurrent short queries
Solutions:
automatic configuration, query plan and data organization
BSC- Aeneas: https://github.com/cugni/aeneas
[Diagram: client query → query-driven query plan → data models Model_A / Model_B / Model_C]
New System Data Infrastructure required
Example: Current computer systems available at genomics research institutions are commonly designed to run general computational jobs where, traditionally, the limiting resource is the CPU. They also provide a large common storage space shared by all nodes.
Example: Computers in use for bioinformatic jobs
Typical mix of such computer systems and common bioinformatics applications: bottleneck and underutilization problems.
[Figure: an EDW Appliance is a loosely coupled, shared-nothing, MPP architecture]
First approach:
Big Data Rack Architecture:
Shared Nothing
Storage Technology: Non Volatile Memory evolution
Evolution of Flash Adoption: FLASH + DISK → FLASH AS DISK → FLASH AS MEMORY
Source: SNIA NVM Summit, April 28, 2013
(*) HDD is 100x cheaper than RAM, but 1,000 times slower
Example: Computers in use for bioinformatic jobs
Jobs are responsible for managing the input data: partitioning, organisation, and merging of intermediate results.
Large parts of the code are not functional logic, but housekeeping tasks.
Solutions: active storage strategies that leverage high-performance in-memory key/value databases to accelerate data-intensive tasks.
[Diagram: dense compute fabric → Active Storage Fabric → Archival Storage (Disk/Tape)]
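A schematic sketch of the active-storage idea, not BSC's actual implementation: the filter runs next to the data held in an in-memory key/value layer (a plain dict stands in for a real high-performance database), and only the small result travels back to the compute job.

```python
# Schematic "active storage": push the computation to where the data lives
# instead of shipping raw data to the job. A dict stands in for the real
# in-memory key/value database; records and the quality filter are made up.
class InMemoryKVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def run_near_data(self, predicate):
        """Evaluate a filter where the data lives; return only matching keys."""
        return [k for k, v in self._data.items() if predicate(v)]

store = InMemoryKVStore()
store.put("read_001", {"quality": 37, "length": 150})
store.put("read_002", {"quality": 12, "length": 150})

high_quality = store.run_near_data(lambda rec: rec["quality"] >= 30)
print(high_quality)   # ['read_001'] -- the job never touches the raw records
```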
Important: Remote Nodes Have Gotten Closer
Interconnects have
become much faster
IB (InfiniBand) latency of 2,000 ns is only 20x slower than RAM and 100x faster than SSD
Source: http://www.slideshare.net/blopeur/hecatonchire-kvm-forum2012benoithudzia
Conclusion: Paradigm shift
Old → New: from a Compute-centric Model to a Data-centric Model
Key ingredients: Manycore, FPGA, Massive Parallelism, Persistent Memory (Flash, Phase Change)
Source: Heiko Joerg http://www.slideshare.net/schihei/petascale-analytics-the-world-of-big-data-requires-big-analytics
Conclusions: How can we help?
How can IT researchers help scientists like you
cope with the onslaught of data?
This is a crucial question and there is no definitive answer yet.
What is clear is that it will involve both better algorithms and a renewed focus on big data approaches such as data infrastructure, data management and data processing.
Questions & Answers
Over to you,
what do you think?
Thank you for your attention! - Jordi
Thank you to
More information
Updated information will be posted at
www.JordiTorres.eu