Distributed NoSQL Storage for
Extreme-scale System Services
Short Bio
6th-year PhD student, DataSys Lab, CS Department, Illinois Institute of Technology, Chicago
Academic advisor: Dr. Ioan Raicu
Research internships
Summer 2014, Argonne National Lab, Lemont, IL
Summer 2013, Lawrence Berkeley National Lab, Berkeley, CA
Research interests:
Distributed systems: NoSQL storage systems, file systems, system
services, HPC, Clouds and Big data
Collaborators
Dr. Dongfang Zhao (PNNL)
Dr. Ke Wang (Intel)
Dr. Kate Keahey (ANL/MCS)
Dr. Lavanya Ramakrishnan (LBL/ACS)
Dr. Zhao Zhang (UC Berkeley/AMP Lab)
Outline
Background and motivation
ZHT: distributed key-value store for high-end computing
systems
Proposed work: overview
System Design and Implementation
Features
Evaluation
Application use cases
Conclusion and Future work
World's most powerful machines
Ways to better performance:
Scale up
Scale up: build ONE bigger machine
Faster CPUs, larger memory, make everything faster
Exponentially expensive
Can't go too big
Ways to better performance:
Scale out
Scale out: build a bigger GROUP of machines
Up-to-date commodity hardware, fast network
Biggest issue: scalability (1 + 1 < 2)
Workload: Amdahl's law, Gustafson's law (see the speedup formula below)
System services: metadata, scheduling, controlling,
System services: metadata, scheduling, controlling,
monitoring
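For reference, a sketch of Amdahl's law (Gustafson's law is the scaled-workload counterpart): if a fraction p of the workload is parallelizable over N nodes, the serial remainder caps the achievable speedup.

```latex
% Amdahl's law: speedup S(N) for a workload whose fraction p is parallelizable.
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
% Example: p = 0.95 caps the speedup at 20x, no matter how many nodes are added.
```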
System services: bad example
IBM GPFS parallel file system metadata performance
(BlueGene/P, Intrepid, Argonne)
Background: storage problem
Limitations of current architecture:
Separate compute and storage: shared network infrastructure
All I/Os are (eventually) remote
Frequent checkpointing: extremely write-intensive
New architecture: distributed storage
System services are critical!
Metadata management (file systems, databases)
Scheduling systems (job, I/O)
System state monitoring
Problem statement and motivation
Scalability is limited by system management and
services.
The decades-old, relatively centralized architecture no longer keeps up with the growth of system scale.
There is no suitable storage system to support large-scale distributed system services.
Proposed work
Build a scalable storage system that meets the
needs of scalable system services for exa-scale
machines
High Performance
High Scalability
Fault tolerance
Various storage system solutions
Storage type      | Capacity   | Single unit size limit | Latency     | Scalability | Resilience       | Query & indexing
File systems      | Very large | Large                  | O(10) ms    | Low         | Medium           | No
SQL databases     | Large      | Various, small         | O(10) ms    | Very low    | Very high (ACID) | Best
NoSQL data stores | Large      | Various, small         | O(0.1~1) ms | High        | High             | Good
Limitations of Current
NoSQL DBs
Performance
High latency: 10ms ~ seconds
Insufficient scalability
Logarithmic routing algorithms
No deployments at O(1000)-node scale
Portability
Many are implemented in Java, which is not supported on supercomputers
Complex dependencies
Proposed NoSQL solution
ZHT (zero-hop distributed hash table)
A building block for HPC systems and clouds
High Performance
A fast and lightweight distributed key-value store
Low Latency
High Throughput
Scalability towards O(10K) nodes
Reliability across failures
Lead author peer-reviewed
publications
1 Journal paper
A Convergence of Distributed Key-Value Storage in Cloud Computing and Supercomputing,
Journal of Concurrency and Computation Practice and Experience (CCPE) 2015
5 Conference papers
A Flexible QoS Fortified Distributed Key-Value Storage System for the Cloud, IEEE Big Data
2015
Distributed NoSQL Storage for Extreme-Scale System Services, SC 15, PhD showcase
A Dynamically Scalable Cloud Data Infrastructure for Sensor Networks, ScienceCloud 2015
Scalable State Management for Scientific Applications in the Cloud, IEEE BigData 2014
ZHT: A Light-weight Reliable Persistent Dynamic Scalable Zero-hop Distributed Hash Table,
IEEE IPDPS 2013
3 research posters
GRAPH/Z: A Key-Value Store Based Scalable Graph Processing System, IEEE Cluster 2015
A Cloud-based Interactive Data Infrastructure for Sensor Networks, IEEE/ACM SC 2014
Exploring Distributed Hash Tables in High-End Computing, ACM SIGMETRICS PER 2011
Co-author peer-reviewed
Publications
2 Journal papers
Load-balanced and locality-aware scheduling for data-intensive workloads at extreme
scales, Journal of Concurrency and Computation: Practice and Experience (CCPE), 2015
Understanding the Performance and Potential of Cloud Computing for Scientific
Applications, IEEE Transactions on Cloud Computing (TCC), 2015
4 Conference papers
Overcoming Hadoop scaling limitations through distributed task execution, IEEE Cluster
2015
FaBRiQ: Leveraging Distributed Hash Tables towards Distributed Publish-Subscribe
Message Queues, IEEE/ACM BDC 2015
FusionFS: Towards Supporting Data-Intensive Scientific Applications on Extreme-Scale
High-Performance Computing Systems, IEEE Big Data 2014
Optimizing Load Balancing and Data-Locality with Data-aware Scheduling, IEEE Big Data
2014
Technical reports and other posters
Distributed Key-Value Store on HPC and Cloud Systems, GCASR 2013
NoVoHT: a Lightweight Dynamic Persistent NoSQL Key/Value Store,
GCASR 2013
FusionFS: a distributed file system for large scale data-intensive
computing, GCASR 2013
OHT: Hierarchical Distributed Hash Tables, IIT 2013
Exploring Eventual Consistency Support in ZHT, IIT 2013
Understanding the Cost of Cloud Computing and Storage, GCASR
2012
ZHT: a Zero-hop DHT for High-End Computing Environment, GCASR
2012
Overview: ZHT highlights
ZHT: a zero-hop key-value store system
Written in C/C++, few dependencies
Tuned for supercomputers and clouds
Performance Highlights (on BG/P)
Scale: 8K nodes and 32K instances
Latency: 1.5 ms at 32K-core scale
Throughput: 18M ops/s
ZHT Features
Basic operations: insert, lookup, remove
Unconventional operations: append, compare_swap, change_callback (see the interface sketch below)
Dynamic membership: nodes may join and leave
Modified consistent hashing
Constant routing: at most 2 hops
Bulk partition moving upon rehashing
Fault tolerance
Replication
Strong or eventual consistency
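A minimal sketch of what a ZHT-style client interface could look like for the operations listed above; the class and method signatures are illustrative assumptions, not ZHT's actual API.

```cpp
#include <string>

// Hypothetical ZHT-style client interface (signatures are assumptions for
// illustration, not ZHT's real API). Integer return codes signal success/failure.
class KVClient {
public:
    // Basic operations: place, fetch, or delete a value by key.
    virtual int insert(const std::string& key, const std::string& value) = 0;
    virtual int lookup(const std::string& key, std::string& value) = 0;
    virtual int remove(const std::string& key) = 0;

    // Unconventional operations from the slide:
    //  - append: concatenate to the stored value (e.g., grow a list in place)
    //  - compare_swap: update only if the value still equals `expected`,
    //    giving callers an atomic test-and-set
    //  - change_callback: return (or notify) when the value under `key` changes
    virtual int append(const std::string& key, const std::string& delta) = 0;
    virtual int compare_swap(const std::string& key, const std::string& expected,
                             const std::string& desired, std::string& current) = 0;
    virtual int change_callback(const std::string& key, std::string& new_value) = 0;

    virtual ~KVClient() = default;
};
```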
Zero-hop hash mapping
[Figure: a client hashes key j or key k directly to its home node among nodes 1..n; each value is stored as three replicas on consecutive nodes.]
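A minimal sketch of the mapping shown above, assuming every client holds the full membership table: a key hashes directly to its home node (no routing hops), and replicas land on the following nodes. This illustrates the idea only; ZHT's actual hash function and replica placement may differ.

```cpp
#include <functional>
#include <string>
#include <vector>

// Where a key lives: its home node plus replica nodes.
struct Placement {
    std::size_t primary;
    std::vector<std::size_t> replicas;
};

// Zero-hop lookup: with the node list known to the client, the destination
// is computed locally instead of being routed through other nodes.
Placement locate(const std::string& key, std::size_t num_nodes,
                 std::size_t num_replicas) {
    std::size_t home = std::hash<std::string>{}(key) % num_nodes;
    Placement p{home, {}};
    for (std::size_t r = 1; r <= num_replicas; ++r)
        p.replicas.push_back((home + r) % num_nodes);  // consecutive nodes
    return p;
}
```

Because the client computes the destination itself, the common case needs no routing; forwarding after a recent membership change would account for the "at most 2 hops" bound on the previous slide.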
ZHT architecture
Topology
Server architecture
ZHT Related work
Many DHTs: Chord, Kademlia, Pastry, Cassandra, C-MPI,
Memcached, Dynamo
Why another?
Name      | Impl. | Routing time | Persistence | Dynamic membership | Additional operations
Cassandra | Java  | log(N)       | Yes         | Yes                | No
C-MPI     | C/MPI | log(N)       | No          | No                 | No
Dynamo    | Java  | 0 to log(N)  | Yes         | Yes                | No
Memcached | C     | 0            | No          | No                 | No
ZHT       | C++   | 0 to 2       | Yes         | Yes                | Yes
Evaluation
Test beds
IBM Blue Gene/P supercomputer
Up to 8K nodes and 32K cores
32K instances deployed
Commodity Cluster
Up to 64 nodes
Amazon EC2
Cassandra, Memcached, DynamoDB
m1.medium and cc2.8xlarge instance types
96 VMs, 768 ZHT instances deployed
Systems comparison
Performance on supercomputer:
Intrepid, IBM BG/P
[Figure: latency in ms vs. number of nodes on BG/P for ZHT over TCP without connection caching, TCP with connection caching, and UDP, compared with Memcached.]
Performance on a cloud:
Amazon EC2
[Figure: latency in ms vs. number of nodes (1 to 96) for ZHT on m1.medium instances (1 per node), ZHT on cc2.8xlarge instances (8 per node), and DynamoDB.]
Performance on a commodity
cluster: HEC
[Figure: latency in ms vs. scale (1 to 64 nodes) for ZHT, Cassandra, and Memcached.]
Efficiency:
Simulation with 1M nodes
[Figure: efficiency vs. scale for measured ZHT, simulated ZHT (up to 1M nodes), and Memcached; the difference between measurement and simulation ranges from 0% to about 15% across scales.]
ZHT/Q: a Flexible QoS Fortified
Distributed Key-Value Storage System
for Data Centers and Clouds
Built on top of ZHT
Meet the needs of clouds and data centers
Support for multiple simultaneous applications
QoS on request response time (latency)
Dynamic request batching strategies
Publication
A Flexible QoS Fortified Distributed Key-Value Storage System
for the Cloud, IEEE Big Data 2015
Client side batcher
[Figure: client-side batcher architecture. On each client proxy, a client API wrapper and request handler push incoming requests into per-server batch buckets (B1..Bn); a batching strategy engine with pluggable strategies and a condition monitor/sender check the flush conditions and send batches across the data-center network to the K/V store servers on the physical servers; returned batch results are unpacked into a response buffer, and measured latencies feed back into the strategy engine.]
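A minimal sketch of the batching idea behind the diagram: requests accumulate in a per-server bucket, and the bucket is flushed either when it is full or when waiting any longer would risk the tightest latency budget in it. Class names and the flush policy are assumptions for illustration, not ZHT/Q's actual strategy plugins.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::milliseconds;

// One pending request with its QoS latency budget.
struct Request {
    std::string key, value;
    Ms budget;                  // maximum acceptable response time
    Clock::time_point arrival;  // when the client issued it
};

// Per-destination batch bucket (one of B1..Bn in the diagram).
class BatchBucket {
public:
    BatchBucket(std::size_t max_batch, Ms send_cost)
        : max_batch_(max_batch), send_cost_(send_cost) {}

    void push(Request r) { pending_.push_back(std::move(r)); }

    // Flush when full, or when any request's budget is nearly spent.
    bool should_flush(Clock::time_point now) const {
        if (pending_.size() >= max_batch_) return true;
        for (const auto& r : pending_) {
            auto waited = std::chrono::duration_cast<Ms>(now - r.arrival);
            if (waited + send_cost_ >= r.budget) return true;
        }
        return false;
    }

    // Hand the batch to the sender, which packs it into one network message.
    std::vector<Request> take() {
        std::vector<Request> batch;
        batch.swap(pending_);
        return batch;
    }

private:
    std::size_t max_batch_;
    Ms send_cost_;  // estimated network + server processing time
    std::vector<Request> pending_;
};
```

Measured batch latencies (the "latency feedback" path in the diagram) would then tune the batch size and cost estimate per QoS class.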
Mixed workloads with multi-QoS
[Figure: throughput in ops/s under five workload QoS patterns (no batching and patterns 1-4), with values of roughly 1,519; 1,940; 3,997; 10,847; and 13,417 ops/s, together with the batch latency distribution over request latencies from 0.1 to 50 ms.]
Application Use Cases
File systems: FusionFS [TPDS15, BigData14, CCGrid14, SC13,
Cluster13, TaPP13, SC12, LSAP11]
Sensor network storage: WaggleDB [SC14, ScienceCloud15]
State management: FRIEDA-State [BigData14]
Graph processing system: GRAPH/Z [Cluster15]
MTC scheduling: MATRIX [BigData14, HPC13, IIT13, SC12]
HPC scheduling: Slurm++ [HPDC15, HPDC14]
Distributed message queues: FaBRiQ [BDC15]
Simulations: DKVS [SC13]
(The use cases above cite lead-author papers, co-author papers, and papers based on my work.)
Use case FusionFS:
a distributed file system
A fully distributed file system, based on FUSE
All-in-one: compute node, storage server, metadata
server on one machine
Using ZHT for metadata management (see the sketch below)
Evaluated on BlueGene/P and Kodiak at up to 1K nodes
Collaboration with Dongfang Zhao, PhD 2015
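A minimal sketch of keeping file-system metadata in a distributed key-value store, with the full path as the key and a serialized record as the value. The record layout and calls below are illustrative assumptions, not FusionFS's actual schema.

```cpp
#include <cstdint>
#include <string>

// Hypothetical per-file metadata record; FusionFS's real schema differs.
struct FileMeta {
    uint64_t size = 0;
    uint32_t mode = 0644;
    uint64_t mtime = 0;
    std::string location;  // node(s) holding the file's data blocks
};

// A real system would use a proper serialization format; a delimited
// string keeps this sketch short.
std::string pack(const FileMeta& m) {
    return std::to_string(m.size) + "|" + std::to_string(m.mode) + "|" +
           std::to_string(m.mtime) + "|" + m.location;
}

// With a key-value client like the earlier interface sketch, metadata
// operations become single-key operations distributed across all nodes:
//   kv.insert("/data/run42/out.dat", pack(meta));  // create / update
//   kv.lookup("/data/run42/out.dat", packed);      // stat
//   kv.remove("/data/run42/out.dat");              // unlink
```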
Use case FusionFS:
a distributed file system
Metadata performance:
Weak scaling: creating 10k empty files in 1 directory on each node
Use case MATRIX:
a distributed scheduling framework
Optimized for many-task computing
Fully distributed design
Adaptive work stealing
Using ZHT to submit jobs and monitor task status (see the sketch below)
Collaboration with Ke Wang, PhD 2015
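A minimal sketch of how a scheduler could use a key-value store for job submission and status monitoring, with one record per task keyed by task ID; the record format and the use of compare_swap for claiming stolen tasks are illustrative assumptions, not MATRIX's actual protocol.

```cpp
#include <string>

// Hypothetical task record stored under the key "task/<id>".
struct TaskRecord {
    std::string state;  // "queued", "running", "done"
    std::string owner;  // scheduler currently responsible for the task
};

// Submission writes the record; clients and monitors simply look it up:
//   kv.insert("task/000123", "queued|scheduler-7");
//   kv.lookup("task/000123", packed);   // poll task status
//
// During work stealing, an atomic compare_swap (see the earlier interface
// sketch) could keep two idle schedulers from claiming the same task:
//   kv.compare_swap("task/000123",
//                   "queued|scheduler-7",   // expected: still owned by victim
//                   "queued|scheduler-2",   // desired: claimed by the thief
//                   current);
```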
Use case MATRIX: a distributed scheduling framework
[Figure: average efficiency (%) of MATRIX vs. Falkon across task durations in seconds; scheduling efficiency compared at different job granularities.]
Use case WaggleDB:
NoSQL DB for sensor network data service
Project: WaggleDB at ANL, 2014
NoSQL database + distributed message queue
Collaborator: Kate Keahey, ANL
Publication:
A Dynamically Scalable Cloud Data Infrastructure for Sensor Networks, ScienceCloud 2015
A Cloud-based Interactive Data Infrastructure for
Sensor Networks, SC14 research poster
WaggleDB Architecture
Use case FRIEDA-State:
NoSQL DB for
State Management on Cloud
Project: FRIEDA-State at LBL, 2013
Collaborator: Lavanya Ramakrishnan, LBL
Publication: Scalable State Management for Scientific
Applications in the Cloud, IEEE BigData 2014
Funding info
DOE DE-AC02-05CH11231
NSF No. 0910812
FRIEDA-State architecture
Conclusion
Demonstrated that NoSQL systems are a fundamental building block for more complex distributed systems
Storage systems: ZHT/Q, FusionFS, Istore
Provenance: FusionProv
Job scheduling systems: MATRIX, Slurm++
Event streaming systems: WaggleDB, FRIEDA-State
Message queue systems: FaBRiQ
Lessons Learned
Decentralized architecture
High Performance
Excellent scalability
Improved fault tolerance
Simplicity
Light-weight design
Little reliance on complex software stacks
Easy adoption as a building block for more complex systems
Future Research Directions
Building extreme-scale system services with
NoSQL storage systems
Using NoSQL storage to help with traditional HPC application problems
Scalability
Fault tolerance
Thank you!
Q&A