Distributed NoSQL Storage for
Extreme-scale System Services
Short Bio
6th-year PhD student, DataSys Lab, CS Department, Illinois Institute of Technology, Chicago
Academic advisor: Dr. Ioan Raicu
Research internships
Summer 2014, Argonne National Lab, Lemont, IL
Summer 2013, Lawrence Berkeley National Lab, Berkeley, CA
Research interests:
Distributed systems: NoSQL storage systems, file systems, system
services, HPC, Clouds and Big data
Collaborators
Dr. Dongfang Zhao (PNNL)
Dr. Ke Wang (Intel)
Dr. Kate Keahey (ANL/MCS)
Dr. Lavanya Ramakrishnan (LBL/ACS)
Dr. Zhao Zhang (UC Berkeley/AMP Lab)
Outline
Background and motivation
ZHT: distributed key-value store for high-end computing
systems
Proposed work: overview
System Design and Implementation
Features
Evaluation
Application use cases
Conclusion and Future work
World's most powerful machines
Ways to better performance:
Scale up
Scale up: build ONE bigger machine
Faster CPUs, larger memory, make everything faster
Exponentially expensive
Can't go too big
Ways to better performance:
Scale out
Scale out: build a bigger GROUP of machines
Up-to-date commodity hardware, fast network
Biggest issue: scalability (1 + 1 < 2)
Workload: Amdahl's law, Gustafson's law (see the speedup formula below)
System services: metadata, scheduling, controlling,
System services: metadata, scheduling, controlling,
monitoring
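For reference, a sketch of Amdahl's law (Gustafson's law is the scaled-workload counterpart): if a fraction p of the workload is parallelizable over N nodes, the serial remainder caps the achievable speedup.

```latex
% Amdahl's law: speedup S(N) for a workload whose fraction p is parallelizable.
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
% Example: p = 0.95 caps the speedup at 20x, no matter how many nodes are added.
```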
System services: bad example
IBM GPFS parallel file system metadata performance
(BlueGene/P, Intrepid, Argonne)
Background: storage problem
Limitations of current architecture:
Separate compute and storage: shared network infrastructure
All I/Os are (eventually) remote
Frequent checkpointing: extremely write-intensive
New architecture: distributed storage
System services are critical!
Metadata management (file systems, databases)
Scheduling systems (job, I/O)
System state monitoring
Problem statement and motivation
Scalability is limited by system management and
services.
The decades-old, relatively centralized architecture no longer keeps up with the growth of system scale.
There is no suitable storage system to support large-scale distributed system services.
Proposed work
Build a scalable storage system that meets the
needs of scalable system services for exa-scale
machines
High Performance
High Scalability
Fault tolerance
Various storage system solutions
Storage type      | Capacity   | Single unit size limit | Latency     | Scalability | Resilience       | Query & indexing
File systems      | Very large | Large                  | O(10) ms    | Low         | Medium           | No
SQL databases     | Large      | Various, small         | O(10) ms    | Very low    | Very high (ACID) | Best
NoSQL data stores | Large      | Various, small         | O(0.1~1) ms | High        | High             | Good
Limitations of Current
NoSQL DBs
Performance
High latency: 10ms ~ seconds
Insufficient scalability
Logarithmic routing algorithms
No deployments at O(1000)-node scale
Portability
Many are implemented in Java, which is not supported on supercomputers
Complex dependencies
Proposed NoSQL solution
ZHT (zero-hop distributed hash table)
A building block for HPC systems and clouds
High Performance
A fast and lightweight distributed key-value store
Low Latency
High Throughput
Scalability towards O(10K) nodes
Reliability across failures
Lead author peer-reviewed
publications
1 Journal paper
A Convergence of Distributed Key-Value Storage in Cloud Computing and Supercomputing,
Journal of Concurrency and Computation Practice and Experience (CCPE) 2015
5 Conference papers
A Flexible QoS Fortified Distributed Key-Value Storage System for the Cloud, IEEE Big Data
2015
Distributed NoSQL Storage for Extreme-Scale System Services, SC 15, PhD showcase
A Dynamically Scalable Cloud Data Infrastructure for Sensor Networks, ScienceCloud 2015
Scalable State Management for Scientific Applications in the Cloud, IEEE BigData 2014
ZHT: A Light-weight Reliable Persistent Dynamic Scalable Zero-hop Distributed Hash Table,
IEEE IPDPS 2013
3 research posters
GRAPH/Z: A Key-Value Store Based Scalable Graph Processing System, IEEE Cluster 2015
A Cloud-based Interactive Data Infrastructure for Sensor Networks, IEEE/ACM SC 2014
Exploring Distributed Hash Tables in High-End Computing, ACM SIGMETRICS PER 2011
Co-author peer-reviewed
Publications
2 Journal papers
Load-balanced and locality-aware scheduling for data-intensive workloads at extreme
scales, Journal of Concurrency and Computation: Practice and Experience (CCPE), 2015
Understanding the Performance and Potential of Cloud Computing for Scientific
Applications, IEEE Transactions on Cloud Computing (TCC), 2015
4 Conference papers
Overcoming Hadoop scaling limitations through distributed task execution, IEEE Cluster
2015
FaBRiQ: Leveraging Distributed Hash Tables towards Distributed Publish-Subscribe
Message Queues, IEEE/ACM BDC 2015
FusionFS: Towards Supporting Data-Intensive Scientific Applications on Extreme-Scale
High-Performance Computing Systems, IEEE Big Data 2014
Optimizing Load Balancing and Data-Locality with Data-aware Scheduling, IEEE Big Data
2014
Technical reports and other posters
Distributed Key-Value Store on HPC and Cloud Systems, GCASR 2013
NoVoHT: a Lightweight Dynamic Persistent NoSQL Key/Value Store,
GCASR 2013
FusionFS: a distributed file system for large scale data-intensive
computing, GCASR 2013
OHT: Hierarchical Distributed Hash Tables, IIT 2013
Exploring Eventual Consistency Support in ZHT, IIT 2013
Understanding the Cost of Cloud Computing and Storage, GCASR
2012
ZHT: a Zero-hop DHT for High-End Computing Environment, GCASR
2012
Overview: ZHT highlights
ZHT: a zero-hop key-value store system
Written in C/C++, few dependencies
Tuned for supercomputers and clouds
Performance Highlights (on BG/P)
Scale: 8K nodes and 32K instances
Latency: 1.5 ms at 32K-core scale
Throughput: 18M ops/s
ZHT Features
Basic operations: insert, lookup, remove
Unconventional operations: append, compare_swap, change_callback (see the interface sketch below)
Dynamic membership: nodes may join and leave
Modified consistent hashing
Constant routing: at most 2 hops
Bulk partition moving upon rehashing
Fault tolerance
Replication
Strong or eventual consistency
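A minimal sketch of what a ZHT-style client interface could look like for the operations listed above; the class and method signatures are illustrative assumptions, not ZHT's actual API.

```cpp
#include <string>

// Hypothetical ZHT-style client interface (signatures are assumptions for
// illustration, not ZHT's real API). Integer return codes signal success/failure.
class KVClient {
public:
    // Basic operations: place, fetch, or delete a value by key.
    virtual int insert(const std::string& key, const std::string& value) = 0;
    virtual int lookup(const std::string& key, std::string& value) = 0;
    virtual int remove(const std::string& key) = 0;

    // Unconventional operations from the slide:
    //  - append: concatenate to the stored value (e.g., grow a list in place)
    //  - compare_swap: update only if the value still equals `expected`,
    //    giving callers an atomic test-and-set
    //  - change_callback: return (or notify) when the value under `key` changes
    virtual int append(const std::string& key, const std::string& delta) = 0;
    virtual int compare_swap(const std::string& key, const std::string& expected,
                             const std::string& desired, std::string& current) = 0;
    virtual int change_callback(const std::string& key, std::string& new_value) = 0;

    virtual ~KVClient() = default;
};
```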
Zero-hop hash mapping
[Figure: a client hashes key j or key k directly to its home node among nodes 1..n; each value is stored as three replicas on consecutive nodes.]
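A minimal sketch of the mapping shown above, assuming every client holds the full membership table: a key hashes directly to its home node (no routing hops), and replicas land on the following nodes. This illustrates the idea only; ZHT's actual hash function and replica placement may differ.

```cpp
#include <functional>
#include <string>
#include <vector>

// Where a key lives: its home node plus replica nodes.
struct Placement {
    std::size_t primary;
    std::vector<std::size_t> replicas;
};

// Zero-hop lookup: with the node list known to the client, the destination
// is computed locally instead of being routed through other nodes.
Placement locate(const std::string& key, std::size_t num_nodes,
                 std::size_t num_replicas) {
    std::size_t home = std::hash<std::string>{}(key) % num_nodes;
    Placement p{home, {}};
    for (std::size_t r = 1; r <= num_replicas; ++r)
        p.replicas.push_back((home + r) % num_nodes);  // consecutive nodes
    return p;
}
```

Because the client computes the destination itself, the common case needs no routing; forwarding after a recent membership change would account for the "at most 2 hops" bound on the previous slide.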
ZHT architecture
Topology
Server architecture
ZHT Related work
Many DHTs: Chord, Kademlia, Pastry, Cassandra, C-MPI,
Memcached, Dynamo
Why another?
Name      | Impl. | Routing time | Persistence | Dynamic membership | Additional operations
Cassandra | Java  | log(N)       | Yes         | Yes                | No
C-MPI     | C/MPI | log(N)       | No          | No                 | No
Dynamo    | Java  | 0 to log(N)  | Yes         | Yes                | No
Memcached | C     | 0            | No          | No                 | No
ZHT       | C++   | 0 to 2       | Yes         | Yes                | Yes
Evaluation
Test beds
IBM Blue Gene/P supercomputer
Up to 8K nodes and 32K cores
32K instances deployed
Commodity Cluster
Up to 64 nodes
Amazon EC2
Cassandra, Memcached, DynamoDB
m1.medium and cc2.8xlarge instance types
96 VMs, 768 ZHT instances deployed
Systems comparison
Performance on supercomputer:
Intrepid, IBM BG/P
[Figure: latency in ms vs. number of nodes on BG/P for ZHT over TCP without connection caching, TCP with connection caching, and UDP, compared with Memcached.]
Performance on a cloud:
Amazon EC2
[Figure: latency in ms vs. number of nodes (1 to 96) for ZHT on m1.medium instances (1 per node), ZHT on cc2.8xlarge instances (8 per node), and DynamoDB.]
Performance on a commodity
cluster: HEC
[Figure: latency in ms vs. scale (1 to 64 nodes) for ZHT, Cassandra, and Memcached.]
Efficiency:
Simulation with 1M nodes
[Figure: efficiency vs. scale for measured ZHT, simulated ZHT (up to 1M nodes), and Memcached; the difference between measurement and simulation ranges from 0% to about 15% across scales.]
ZHT/Q: a Flexible QoS Fortified
Distributed Key-Value Storage System
for Data Centers and Clouds
Built on top of ZHT
Meet the needs of clouds and data centers
Support for multiple simultaneous applications
QoS on request response time (latency)
Dynamic request batching strategies
Publication
A Flexible QoS Fortified Distributed Key-Value Storage System
for the Cloud, IEEE Big Data 2015
Client side batcher
[Figure: client-side batcher architecture. On each client proxy, a client API wrapper and request handler push incoming requests into per-server batch buckets (B1..Bn); a batching strategy engine with pluggable strategies and a condition monitor/sender check the flush conditions and send batches across the data-center network to the K/V store servers on the physical servers; returned batch results are unpacked into a response buffer, and measured latencies feed back into the strategy engine.]
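A minimal sketch of the batching idea behind the diagram: requests accumulate in a per-server bucket, and the bucket is flushed either when it is full or when waiting any longer would risk the tightest latency budget in it. Class names and the flush policy are assumptions for illustration, not ZHT/Q's actual strategy plugins.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::milliseconds;

// One pending request with its QoS latency budget.
struct Request {
    std::string key, value;
    Ms budget;                  // maximum acceptable response time
    Clock::time_point arrival;  // when the client issued it
};

// Per-destination batch bucket (one of B1..Bn in the diagram).
class BatchBucket {
public:
    BatchBucket(std::size_t max_batch, Ms send_cost)
        : max_batch_(max_batch), send_cost_(send_cost) {}

    void push(Request r) { pending_.push_back(std::move(r)); }

    // Flush when full, or when any request's budget is nearly spent.
    bool should_flush(Clock::time_point now) const {
        if (pending_.size() >= max_batch_) return true;
        for (const auto& r : pending_) {
            auto waited = std::chrono::duration_cast<Ms>(now - r.arrival);
            if (waited + send_cost_ >= r.budget) return true;
        }
        return false;
    }

    // Hand the batch to the sender, which packs it into one network message.
    std::vector<Request> take() {
        std::vector<Request> batch;
        batch.swap(pending_);
        return batch;
    }

private:
    std::size_t max_batch_;
    Ms send_cost_;  // estimated network + server processing time
    std::vector<Request> pending_;
};
```

Measured batch latencies (the "latency feedback" path in the diagram) would then tune the batch size and cost estimate per QoS class.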
Mixed workloads with multi-QoS
[Figure: throughput in ops/s under five workload QoS patterns (no batching and patterns 1-4), with values of roughly 1,519; 1,940; 3,997; 10,847; and 13,417 ops/s, together with the batch latency distribution over request latencies from 0.1 to 50 ms.]
Application Use Cases
File systems: FusionFS [TPDS15, BigData14, CCGrid14, SC13,
Cluster13, TaPP13, SC12, LSAP11]
Sensor network storage: WaggleDB [SC14, ScienceCloud15]
State management: FRIEDA-State [BigData14]
Graph processing system: GRAPH/Z [Cluster15]
MTC scheduling: MATRIX [BigData14, HPC13, IIT13, SC12]
HPC scheduling: Slurm++ [HPDC15, HPDC14]
Distributed message queues: FaBRiQ [BDC15]
Simulations: DKVS [SC13]
(The use cases above cite lead-author papers, co-author papers, and papers based on my work.)
Use case FusionFS:
a distributed file system
A fully distributed file system, based on FUSE
All-in-one: compute node, storage server, metadata
server on one machine
Using ZHT for metadata management (see the sketch below)
Evaluated on BlueGene/P and Kodiak at up to 1K nodes
Collaboration with Dongfang Zhao, PhD 2015
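A minimal sketch of keeping file-system metadata in a distributed key-value store, with the full path as the key and a serialized record as the value. The record layout and calls below are illustrative assumptions, not FusionFS's actual schema.

```cpp
#include <cstdint>
#include <string>

// Hypothetical per-file metadata record; FusionFS's real schema differs.
struct FileMeta {
    uint64_t size = 0;
    uint32_t mode = 0644;
    uint64_t mtime = 0;
    std::string location;  // node(s) holding the file's data blocks
};

// A real system would use a proper serialization format; a delimited
// string keeps this sketch short.
std::string pack(const FileMeta& m) {
    return std::to_string(m.size) + "|" + std::to_string(m.mode) + "|" +
           std::to_string(m.mtime) + "|" + m.location;
}

// With a key-value client like the earlier interface sketch, metadata
// operations become single-key operations distributed across all nodes:
//   kv.insert("/data/run42/out.dat", pack(meta));  // create / update
//   kv.lookup("/data/run42/out.dat", packed);      // stat
//   kv.remove("/data/run42/out.dat");              // unlink
```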
Use case FusionFS:
a distributed file system
Metadata performance:
Weak scaling: creating 10k empty files in 1 directory on each node
Use case MATRIX:
a distributed scheduling framework
Optimized for many-task computing
Fully distributed design
Adaptive work stealing
Using ZHT to submit jobs and monitor task status (see the sketch below)
Collaboration with Ke Wang, PhD 2015
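A minimal sketch of how a scheduler could use a key-value store for job submission and status monitoring, with one record per task keyed by task ID; the record format and the use of compare_swap for claiming stolen tasks are illustrative assumptions, not MATRIX's actual protocol.

```cpp
#include <string>

// Hypothetical task record stored under the key "task/<id>".
struct TaskRecord {
    std::string state;  // "queued", "running", "done"
    std::string owner;  // scheduler currently responsible for the task
};

// Submission writes the record; clients and monitors simply look it up:
//   kv.insert("task/000123", "queued|scheduler-7");
//   kv.lookup("task/000123", packed);   // poll task status
//
// During work stealing, an atomic compare_swap (see the earlier interface
// sketch) could keep two idle schedulers from claiming the same task:
//   kv.compare_swap("task/000123",
//                   "queued|scheduler-7",   // expected: still owned by victim
//                   "queued|scheduler-2",   // desired: claimed by the thief
//                   current);
```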
Use case MATRIX: a distributed scheduling framework
[Figure: average efficiency (%) of MATRIX vs. Falkon across task durations in seconds; scheduling efficiency compared at different job granularities.]
Use case WaggleDB:
NoSQL DB for sensor network data service
Project: WaggleDB at ANL, 2014
NoSQL database + distributed message queue
Collaborator: Kate Keahey, ANL
Publication:
A Dynamically Scalable Cloud Data Infrastructure for Sensor Networks, ScienceCloud 2015
A Cloud-based Interactive Data Infrastructure for
Sensor Networks, SC14 research poster
WaggleDB Architecture
Use case FRIEDA-State:
NoSQL DB for
State Management on Cloud
Project: FRIEDA-State at LBL, 2013
Collaborator: Lavanya Ramakrishnan, LBL
Publication: Scalable State Management for Scientific
Applications in the Cloud, IEEE BigData 2014
Funding info
DOE DE-AC02-05CH11231
NSF No. 0910812
FRIEDA-State architecture
Conclusion
Demonstrated that NoSQL systems are a fundamental building block for more complex distributed systems
Storage systems: ZHT/Q, FusionFS, Istore
Provenance: FusionProv
Job scheduling systems: MATRIX, Slurm++
Event streaming systems: WaggleDB, FRIEDA-State
Message queue systems: FaBRiQ
Lessons Learned
Decentralized architecture
High Performance
Excellent scalability
Improved fault tolerance
Simplicity
Light-weight design
Little reliance on complex software stacks
Easy adoption as a building block for more complex systems
Future Research Directions
Building extreme-scale system services with
NoSQL storage systems
Using NoSQL storage to help with traditional HPC application problems
Scalability
Fault tolerance
Thank you!
Q&A