Big Data Challenges
in Bioinformatics
BARCELONA
SUPERCOMPUTING
CENTER
COMPUTER
SCIENCE
DEPARTMENT
Autonomic
Systems
and
eBusiness
Platforms
Jordi
Torres
[email protected]
Talk outline
We talk about Petabytes?
Deluge of Data
Data is now considered the
Fourth Paradigm in Science
the first three paradigms were experimental, theoretical and
computational science.
This shift is being driven by the rapid growth in data
from improvements in
scientific instruments
Scientific instruments
Physics
Large Hadron Collider
produced around 15 petabytes
of data in 2012.
Astronomy
The Large Synoptic Survey Telescope is anticipated to produce around 10 petabytes per year.
Example: In Genome Research?
Cost of sequencing a human-sized genome
[Chart: cost per genome, September 2001 to January 2013, on a logarithmic cost axis from $1,000 to $100,000,000]
Source: National Human Genome Research Institute (NHGRI) http://www.genome.gov/sequencingcosts/
Data Deluge: driven by changes in how big data is generated
Example: Biomedicine
Image source: Big Data in biomedicine, Drug Discovery Today. Fabricio F. Costa (2013)
Important open issues
Transfer of data from one location to another (*)
shipping external hard disks
processing the data while it is being transferred
Future? Data won't be moved!
(*) Out of scope of this presentation
Source: http://footage.shutterstock.com/clip-4721783-stock-footage-animation-presents-datatransfer-between-a-computer-and-a-cloud-a-concept-of-cloud-computing.html
Important open issues
Security and privacy of the data from individuals (*)
The same problems that appear in other areas
Use advanced encryption algorithms
(Out of scope of this presentation)
Source: http://www.tbase.com/corporate/privacy-and-security
Important open issues
Increased need to store data (*)
Cloud-based computing solutions have emerged
Source: http://www.custodia-documental.com/wp-content/uploads/Cloud-Big-Data.jpg
Important open issues
Increased need to store data (*)
Cloud-based computing solutions have emerged
The most common Cloud Computing inhibitors should be
tackled
Security, Privacy, Lack of Standards, Data Integrity, Regulatory, Data Recovery, Control, Vendor Maturity, ...
The most critical open issue
DERIVING VALUE VIA HARNESSING VOLUME, VARIETY AND VELOCITY (*)
Information is not actionable knowledge.
(*) Big Data definition?
Source: http://www.theatlantic.com/health/archive/2012/05/big-data-can-save-health-care-0151-but-at-what-cost-to-privacy/257621/
What is the usefulness of Big Data?
Performs predictions of outcomes and behaviors
[Diagram: Data + Volume → Value, Information, Knowledge]
Approach: Machine Learning works in the sense that these methods detect subtle structure in data relatively easily, without having to make strong assumptions about the parameters of the underlying distributions.
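To make this concrete, here is a minimal sketch of letting a method find structure rather than positing a distribution. The library choice (scikit-learn's KMeans) and the synthetic data are illustrative assumptions, not something prescribed by the talk.

```python
# Minimal sketch: detect structure (clusters) in data without positing a
# parametric model up front. scikit-learn and the synthetic data are only
# illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two made-up groups of 5-dimensional samples (e.g., expression profiles).
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
               rng.normal(5.0, 1.0, size=(100, 5))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_[:3], model.labels_[-3:])  # the two groups get separate labels
```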
Data Analytics: To extract knowledge
Big data uses inductive statistics and concepts
from nonlinear system identification to infer laws
(regressions, nonlinear relationships, and causal
effects) from large data sets to reveal
relationships, dependencies, and to perform
predictions of outcomes and behaviors.
(*) Wikipedia
(Out of scope of this presentation) :-(
Talk outline
We talk about Petabytes?
Challenges related to us?
The big data problem: in the end it is a Computing Challenge
Computing Challenge
Researchers need to crunch a large amount of data very
quickly (and easily) using high-performance computers
Example: A de novo assembly algorithm for DNA data
finds reads whose sequences overlap and records
those overlaps in a huge diagram called an assembly
graph. For a large genome, this graph can occupy many
terabytes of RAM, and completing the genome
sequence can require weeks or months of
computation on a world-class supercomputer.
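As an illustration only, and not the production assemblers or data structures used on real genomes, here is a toy sketch of the overlap step: an edge is recorded when the suffix of one read matches the prefix of another. The reads and the minimum-overlap threshold are invented.

```python
# Toy overlap-graph construction: an edge (a, b, k) means the last k bases of
# read a equal the first k bases of read b. Real assemblers use far more
# compact structures; for a large genome this graph can need terabytes of RAM.
def overlap(a, b, min_len=3):
    """Longest suffix of a that is a prefix of b, if at least min_len; else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def build_overlap_graph(reads, min_len=3):
    edges = []
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(reads[a], reads[b], min_len)
                if k:
                    edges.append((a, b, k))
    return edges

reads = {"r1": "ATGCGT", "r2": "CGTACC", "r3": "ACCTGA"}  # made-up reads
print(build_overlap_graph(reads))  # [('r1', 'r2', 3), ('r2', 'r3', 3)]
```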
What does life science research do at BSC?
Pipeline schema:
0. Receive the data: Raw Genome Data (120 GB) into Common Storage
MareNostrum: 25-40 h / 50-100 CPUs
Altix / CLL cluster: 50-60 h / 1-10 CPUs
The data deluge also appears in genomics
The DNA data deluge comes from thousands of sources:
more than 2,000 sequencing instruments around the world produce more than 15 petabytes of genetic data per year.
And soon, tens of thousands of sources!
Image source: https://share.sandia.gov/news/resources/
news_releases/images/2009/biofuel_genes.jpg
The total computing burden is growing
DNA sequencing
is on the path to
becoming an
everyday tool in
medicine.
Computing, not sequencing, is now the slower
and more costly aspect of genomics research.
How can we help at BSC?
Something must be done now, or else we'll need to put vital research on hold while the necessary computational techniques catch up, or are invented.
What is clear is that it will involve both better algorithms and a renewed focus on big data approaches to managing and processing data.
How?
Doing outstanding research to speed up this process
What is the time required to retrieve information?
1 Petabyte = 1,000 x 1 Terabyte
Assume a read throughput of 100 MB/sec per disk:
scanning 1 Terabyte takes about 10,000 seconds (roughly 3 hours)
scanning 1 Petabyte takes about 10,000,000 seconds (roughly 3,000 hours)
Solution? Massive parallelism, not only in computation but also in storage.
Assume 10,000 disks: scanning 1 TB takes 1 second.
Source: http://www.google.com/about/datacenters/gallery/images/_2000/IDI_018.jpg
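A quick sanity check of the arithmetic above in plain Python, using the slide's 100 MB/s assumption and decimal units:

```python
# Back-of-the-envelope scan times under the slide's assumptions:
# 100 MB/s sustained read throughput per disk, 1 TB = 10**6 MB (decimal units).
THROUGHPUT_MB_S = 100
TB_IN_MB = 10**6

tb_seconds = TB_IN_MB / THROUGHPUT_MB_S      # 10,000 s  -> roughly 3 hours
pb_seconds = 1000 * tb_seconds               # 10,000,000 s -> roughly 3,000 hours
parallel_tb = tb_seconds / 10_000            # 1 s with 10,000 disks in parallel

print(tb_seconds / 3600, pb_seconds / 3600, parallel_tb)
```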
Talk outline
We talk about Petabytes?
Research in Big Data at BSC
To support this massive data parallelism & distribution it is necessary to redefine and improve:
Data Processing across hundreds of thousands of servers
Data Management across hundreds of thousands of data devices
Dealing with new System Data Infrastructure
How?
How do companies like Google read and process data from 10,000 disks in parallel?
Source: http://www.google.com/about/datacenters/gallery/images/_2000/IDI_018.jpg
GOOGLE: New programming model
To meet the challenges: MapReduce, a programming model introduced by Google in the early 2000s to support distributed computing (with special emphasis on fault tolerance).
An ecosystem of big data processing tools has grown around it: open source, distributed, and running on commodity hardware.
MapReduce: some details
The key innovation of
MapReduce is
the ability to take a query over a
data set, divide it, and run it in
parallel over many nodes.
Two phases: a Map phase and a Reduce phase.
[Diagram: Input Data → Mappers → Reducers → Output Data]
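To make the two phases tangible, here is a minimal single-machine sketch of the map/shuffle/reduce pattern in plain Python. It is not Hadoop or Google's implementation; the word-count example is simply the classic illustration.

```python
# Minimal map/shuffle/reduce sketch (single process). In a real system the
# mappers and reducers run on many nodes and failed tasks are re-executed.
from collections import defaultdict

def map_phase(record):
    # emit (key, value) pairs -- here (word, 1)
    return [(word, 1) for word in record.split()]

def reduce_phase(key, values):
    return key, sum(values)

records = ["big data in bioinformatics", "big data challenges"]

groups = defaultdict(list)          # the shuffle: group values by key
for record in records:              # mappers (would run in parallel)
    for key, value in map_phase(record):
        groups[key].append(value)

print([reduce_phase(k, v) for k, v in groups.items()])   # reducers
# [('big', 2), ('data', 2), ('in', 1), ('bioinformatics', 1), ('challenges', 1)]
```

In a real deployment the shuffle moves intermediate data across the network, and the runtime restarts failed map or reduce tasks transparently.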
Limitations of MapReduce as a Programming model?
MapReduce is great, but not everyone is a MapReduce expert
("I am a Python expert, but ...")
There is a class of algorithms that
cannot be efficiently implemented with
the MapReduce programming model
Different programming models deal
with different challenges
Example: pyCOMPSs from BSC
[Diagram: Input Data → OmpSs/COMPSs → Output Data]
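As a hedged sketch of what such a decomposition can look like with PyCOMPSs (using the @task decorator and compss_wait_on described in the PyCOMPSs documentation; the GC-content task and the chunking are invented for illustration):

```python
# Sketch of a PyCOMPSs-style decomposition: ordinary Python functions become
# asynchronous tasks via a decorator, and the COMPSs runtime schedules them
# on the available resources. The task itself is a made-up example.
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

@task(returns=int)
def count_gc(chunk):
    """Count G and C bases in one chunk of sequence."""
    return sum(1 for base in chunk if base in "GC")

def gc_content(sequence, chunk_size=1_000_000):
    chunks = [sequence[i:i + chunk_size]
              for i in range(0, len(sequence), chunk_size)]
    partial = [count_gc(c) for c in chunks]   # tasks are launched asynchronously
    partial = compss_wait_on(partial)         # synchronize and collect results
    return sum(partial) / len(sequence)
```

The point of the model is that the code stays sequential-looking Python while the runtime extracts the parallelism, which is exactly the gap left by "not everyone is a MapReduce expert".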
Big Data resource management
Big Data characteristics → Requirements from the data store:
Volume → Scalability
Variety → Schema-less
Velocity → Relaxed consistency & capacity to digest
Relational databases are not suitable for Big Data problems:
lack of horizontal scalability
complex structures to express complex relationships
hard consistency
Non-relational databases (NoSQL) are the alternative data store:
relaxed consistency (eventual consistency)
General view of NoSQL storage (and replication)
[Diagram: data partitions 1-5 distributed and replicated across several storage nodes]
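A toy sketch in the spirit of this diagram: each key is hashed to a primary partition and copied to the next node(s) for replication. The node names and replica count are invented, and real NoSQL stores generally use consistent hashing with virtual nodes rather than this naive modulo placement.

```python
# Toy hash-based partitioning with replication: each key gets a primary node
# plus (REPLICAS - 1) copies on the following nodes in the ring.
import hashlib

NODES = ["node0", "node1", "node2", "node3"]
REPLICAS = 2

def placement(key, nodes=NODES, replicas=REPLICAS):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    primary = h % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(replicas)]

print(placement("sample_42"))   # e.g. ['node1', 'node2']
```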
Big Data resource management: open issues in NoSQL
Query performance depends heavily on data model
Designed to support many concurrent short queries
Solutions:
automatic configuration, query plan and data organization
BSC- Aeneas: https://github.com/cugni/aeneas
[Diagram: client query → query-driven query plan → data models Model_A / Model_B / Model_C]
New System Data Infrastructure required
Example: Current computer systems available at genomics research institutions are commonly designed to run general computational jobs where, traditionally, the limiting resource is the CPU. They also provide a large common storage space shared by all nodes.
Example: Computers in use for bioinformatic jobs
Typical mix of such computer systems and common bioinformatics applications: bottleneck and underutilization problems.
[Figure: an EDW Appliance is a loosely coupled, shared-nothing, MPP architecture]
First approach:
Big Data Rack Architecture:
Shared Nothing
Storage Technology: Non Volatile Memory evolution
Evolution of Flash Adoption: FLASH + DISK → FLASH AS DISK → FLASH AS MEMORY
Source: SNIA NVM Summit, April 28, 2013
(*) HDD is 100x cheaper than RAM, but 1,000 times slower
Example: Computers in use for bioinformatic jobs
Jobs are responsible for managing the input data: partitioning, organisation, and merging of intermediate results.
Large parts of the code are not functional logic, but housekeeping tasks.
Solutions: active storage strategies that leverage high-performance in-memory key/value databases to accelerate data-intensive tasks.
[Diagram: dense compute fabric → Active Storage Fabric → Archival Storage (Disk/Tape)]
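A schematic sketch of the active-storage idea, not BSC's actual implementation: the filter runs next to the data held in an in-memory key/value layer (a plain dict stands in for a real high-performance database), and only the small result travels back to the compute job.

```python
# Schematic "active storage": push the computation to where the data lives
# instead of shipping raw data to the job. A dict stands in for the real
# in-memory key/value database; records and the quality filter are made up.
class InMemoryKVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def run_near_data(self, predicate):
        """Evaluate a filter where the data lives; return only matching keys."""
        return [k for k, v in self._data.items() if predicate(v)]

store = InMemoryKVStore()
store.put("read_001", {"quality": 37, "length": 150})
store.put("read_002", {"quality": 12, "length": 150})

high_quality = store.run_near_data(lambda rec: rec["quality"] >= 30)
print(high_quality)   # ['read_001'] -- the job never touches the raw records
```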
Important: Remote Nodes Have Gotten Closer
Interconnects have
become much faster
IB (InfiniBand) latency of 2,000 ns is only 20x slower than RAM and 100x faster than SSD
Source: http://www.slideshare.net/blopeur/hecatonchire-kvm-forum2012benoithudzia
Conclusion: Paradigm shift
Old → New: from a Compute-centric Model to a Data-centric Model
Key ingredients: Manycore, FPGA, Massive Parallelism, Persistent Memory (Flash, Phase Change)
Source: Heiko Joerg http://www.slideshare.net/schihei/petascale-analytics-the-world-of-big-data-requires-big-analytics
Conclusions: How can we help?
How can IT researchers help scientists like you
cope with the onslaught of data?
This is a crucial question and there is no definitive answer yet.
What is clear is that it will involve both better algorithms and a renewed focus on big data approaches such as data infrastructure, data management and data processing.
Questions & Answers
Over to you,
what do you think?
Thank you for your attention! - Jordi
Thank you to
More information
Updated information will be posted at
www.JordiTorres.eu