Dynamometer and A Case Study in NameNode GC
Erik Krogen
Senior Software Engineer, Hadoop & HDFS
Dynamometer
• Realistic performance benchmark & stress test for HDFS
• Open sourced on LinkedIn GitHub, contributing to Apache
• Evaluate scalability limits
• Provide confidence before new feature/config deployment
What’s a Dynamometer?
“A dynamometer or "dyno" for short, is a device for measuring force, torque, or power. For example, the power produced by an engine…” – Wikipedia
Image taken from https://flic.kr/p/dtkCRU and redistributed under the CC BY-SA 2.0 license
Main Goals

High Fidelity
• Accurate namespace: Namespace characteristics have a big impact
• Accurate client workload: Request types and the timing of requests both have a big impact
• Accurate system workload: Load imposed by system management (block reports, etc.) has a big impact

Efficiency
• Low cost: Offline infra has high utilization; can’t afford to keep around unused machines for testing
• Low developer effort: Deploying to a large number of machines can be cumbersome; make it easy
• Fast iteration cycle: Should be able to iterate quickly
Simplify the Problem
NameNode is the central component and the most frequent bottleneck: focus here
Dynamometer
SIMULATED HDFS CLUSTER RUNS IN YARN CONTAINERS
• How to schedule and coordinate? Use YARN!
• Real NameNode, fake DataNodes to run on ~1% of the hardware
[Architecture diagram: the Dynamometer Driver launches a DynoAM on the host YARN cluster; the real NameNode and many simulated DataNodes run in YARN containers across YARN nodes; the FsImage and block listings are fed in from the host HDFS cluster.]
Dynamometer
SIMULATED HDFS CLIENTS RUN IN YARN CONTAINERS
• Clients can run on YARN too!
• Replay real traces from production cluster audit logs (example line below)
[Architecture diagram: alongside the Dynamometer infrastructure application (Driver, NameNode, simulated DataNodes), a workload MapReduce job launches simulated clients in YARN containers on the host YARN cluster; the simulated clients replay audit logs stored on the host HDFS cluster.]
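For context, the replayed traces are standard HDFS audit log entries. A representative line is sketched below; the timestamp, user, IP, and path are illustrative, not taken from a real cluster:

2019-01-30 10:23:45,123 INFO FSNamesystem.audit: allowed=true ugi=etl_user (auth:KERBEROS) ip=/10.12.34.56 cmd=getfileinfo src=/data/tracking/events/2019/01/30 dst=null perm=null proto=rpc

The workload job reads the command and timestamp from each such line so the simulated clients can re-issue the same request types with the same relative timing.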
Contributing to Apache
• Working to put it into hadoop-tools
• Easier place for community to access and contribute
• Increased chance of others helping to maintain it
• Follow HDFS-12345 (actual ticket, not a placeholder)
NameNode GC: A Dynamometer Case Study
NameNode GC Primer
• Why do we care?
• NameNode heaps are huge (multi-hundred GB)
• GC is a big factor in performance
• What’s special about NameNode GC?
• Huge working set: can have over 100GB of long-lived objects
• Massive young gen churn (from RPC requests)
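The pause data analyzed in the following slides is the kind produced by standard HotSpot GC logging; on a Java 8 NameNode, a minimal set of flags to capture it might look like this (the log path is a placeholder):

-Xloggc:/var/log/hadoop/namenode-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime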
Can we use a new GC algorithm to squeeze
more performance out of the NameNode?
Q U E S T I O N
Experimental Setup
• 16-hour production trace: Long enough to experience 2 rounds of mixed GC
• Measure performance via standard metrics (client latency, RPC queue time)
• Measure GC pauses during startup and normal workloads
• Let’s try G1GC even though we know we’re pushing the limits:
• “The region sizes can vary from 1 MB to 32 MB depending on the heap size. The goal is to have no more than 2048 regions.” – Oracle*
• This implies that the heap should be 64 GB and under, but at this scale we are well beyond that (see the quick check below)

*Garbage First Garbage Collector Tuning – https://www.oracle.com/technetwork/articles/java/g1gc-198453
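A quick back-of-the-envelope check of that guideline against the 150 GB heap that shows up in the GC log on the next slide:

32 MB/region × 2048 regions = 64 GB   (largest heap within the guideline)
150 GB ÷ 32 MB/region ≈ 4,800 regions (well over 2× the recommended maximum)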
Can You Spot the Issue?
[Parallel Time: 17676.0 ms, GC Workers: 16]
[GC Worker Start (ms): Min: 883574.6, Avg: 883574.8, Max: 883575.0, Diff: 0.3]
[Ext Root Scanning (ms): Min: 1.0, Avg: 1.2, Max: 2.1, Diff: 1.1, Sum: 18.8]
[Update RS (ms): Min: 31.7, Avg: 32.2, Max: 32.8, Diff: 1.1, Sum: 514.7]
[Processed Buffers: Min: 25, Avg: 30.2, Max: 38, Diff: 13, Sum: 484]
[Scan RS (ms): Min: 17011.1, Avg: 17052.9, Max: 17400.5, Diff: 389.4, Sum: 272846.4]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.6, Diff: 0.6, Sum: 1.0]
[Object Copy (ms): Min: 169.8, Avg: 500.5, Max: 534.3, Diff: 364.5, Sum: 8007.3]
[Termination (ms): Min: 0.0, Avg: 88.8, Max: 96.5, Diff: 96.5, Sum: 1421.5]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.4]
[GC Worker Total (ms): Min: 17675.5, Avg: 17675.6, Max: 17675.8, Diff: 0.3, Sum: 282810.3]
[GC Worker End (ms): Min: 901250.4, Avg: 901250.4, Max: 901250.5, Diff: 0.0]
[Code Root Fixup: 0.7 ms]
[Code Root Migration: 2.3 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 6.7 ms]
[Other: 1194.8 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 2.8 ms]
[Ref Enq: 0.4 ms]
[Redirty Cards: 468.4 ms]
[Free CSet: 4.0 ms]
[Eden: 7360.0M(7360.0M)->0.0B(6720.0M) Survivors: 320.0M->960.0M Heap: 92.6G(150.0G)->87.2G(150.0G)]
[Times: user=223.20 sys=0.22, real=18.88 secs]
902.102: Total time for which application threads were stopped: 18.8815330 seconds
Key lines, annotated:
902.102: Total time for which application threads were stopped: 18.8815330 seconds  ← Huge pause!
[Eden: 7360.0M(7360.0M)->0.0B(6720.0M) Survivors: 320.0M->960.0M Heap: 92.6G(150.0G)->87.2G(150.0G)]  ← Only a few GB of Eden cleared – big, but not huge
[Object Copy (ms): Min: 169.8, Avg: 500.5, Max: 534.3, Diff: 364.5, Sum: 8007.3]  ← ~500 ms of the pause due to object copy
[Scan RS (ms): Min: 17011.1, Avg: 17052.9, Max: 17400.5, Diff: 389.4, Sum: 272846.4]  ← 17.5 s of the pause due to “Scan RS”!
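One way to dig further into remembered-set behavior like this on Java 8 is HotSpot’s RSet summary; these are diagnostic flags, so they must be unlocked, and the period value here is just an example:

-XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeRSetStats -XX:G1SummarizeRSetStatsPeriod=10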
Tuning G1GC with Dynamometer
• G1GC has lots of tunables – how do we optimize all of them without hurting our production system?
• Dynamometer to the rescue
• Easily set up experiments sweeping over different values for a param
• Fire-and-forget: test with many combinations and analyze later (see the sketch below)
• Main parameters needing significant tuning were for the remembered sets (details to follow in appendix)
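As a rough sketch of that fire-and-forget sweep, one run could be launched per candidate value along these lines; the launcher script name and its flags are placeholders for illustration, not the actual Dynamometer CLI:

for entries in 1536 2048 4096 8192; do
  # hypothetical launcher and flags – substitute the real Dynamometer invocation
  ./start-dynamometer-cluster.sh \
      --namenode-args "-XX:+UseG1GC -XX:G1RSetRegionEntries=${entries}" \
      --results-dir "results/rset-entries-${entries}"
done
# come back later and compare GC pause metrics across the results directories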
How Much Does G1GC Help?

                            Startup            Normal Operation
METRIC                      CMS      G1GC      CMS      G1GC
Avg Client Latency (ms)       –        –        19        18
Total Pause Time (s)        200      180       550       160
Median Pause Time (s)       1.1      0.5       0.12      0.06
Max Pause Time (s)         13.4      3.3       1.4       0.6

* Values are approximate and provided primarily to give a sense of scale

Takeaways: excellent reduction in pause times; not much impact on throughput (client latency).
Looking towards the future…
• Question: How does G1GC fare extrapolating to future workloads?
• 600GB+ heap size, 1 billion blocks, 1 billion files
• Answer: Not so well
• RSet entry count has to be increased even further to obtain reasonable performance
• Off-heap overheads in the hundreds of gigabytes
• Wouldn’t recommend it
Looking towards the future…
• Anything we can do besides G1GC?
• Extensive testing with Azul’s C4 GC available in Zing® JVM
• Good performance with no tuning
• Results in a test environment:
• 99th percentile pause time ~1ms, max in tens of ms
• Average client latency dropped ~30%
• Continued to see good performance up to 600GB heap size
Zing JVM: https://www.azul.com/products/zing/
Azul C4 GC: https://www.azul.com/resources/azul-technology/azul-c4-garbage-collec
Looking towards the future…
• Anything we can do that isn’t proprietary?
• Wait for OpenJDK next gen GC algorithms to mature:
• Shenandoah
• ZGC
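For reference, enabling these is JDK-build dependent; in JDK 11, for example, ZGC was still experimental and Shenandoah shipped only in some builds (and in later JDKs):

-XX:+UnlockExperimentalVMOptions -XX:+UseZGC
-XX:+UseShenandoahGC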
Appendix: Detailed G1GC Tuning Tips
• -XX:G1RSetRegionEntries: Solves the problem from the previous slide. 4096 worked well (default of 1536)
• Comes with high off-heap memory overheads
• -XX:G1RSetUpdatingPauseTimePercent: Reduce this to reduce the “Update RS” pause time and push more work to concurrent threads (the NameNode is not really that concurrent – the extra cores are better used by the GC algorithm)
• -XX:G1NewSizePercent: The default of 5% is unreasonably large for heaps > 100GB; reducing it will help shorten pauses during high-churn periods (startup, failover)
• -XX:MaxTenuringThreshold, -XX:ParallelGCThreads, -XX:ConcGCThreads: Set empirically based on experiments sweeping over values. This is where Dynamometer really shines
• MaxTenuringThreshold is particularly interesting: based on the NN usage pattern (objects are either very long-lived or very short-lived), you would expect low values (1 or 2) to be best, but in practice a value closer to the default of 8 performs better
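Pulling these together, an illustrative (not prescriptive) starting point for a very large NameNode heap is sketched below; 4096 and the tenuring threshold near 8 come from the experiments above, while the other values are assumptions to be re-tuned per cluster:

-XX:+UseG1GC
-XX:+UnlockExperimentalVMOptions          # needed for some of the flags below on JDK 8
-XX:G1RSetRegionEntries=4096              # from the sweep above; watch the off-heap overhead
-XX:G1RSetUpdatingPauseTimePercent=5      # assumed example value; lower pushes more RS work to concurrent threads
-XX:G1NewSizePercent=2                    # assumed example value; below the 5% default for >100GB heaps
-XX:MaxTenuringThreshold=8                # near the default performed best in practice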
Thank you!
Editor's Notes
  • #3: Quick show of hands, who has heard of Dynamometer? How can we know that deploying a new config won’t hurt us? How to know if working on a project is worth it?
  • #6: First key insight: focus on the NN only. This greatly reduces the size of the problem: only a single “real” node is needed, for the NameNode
  • #11: Quick background for those not familiar
  • #12: Enable it on a small cluster and see what happens? Sure, but without the load (heap and RPC) it won’t really tell us much. Enable it on production? Sure, but how many apologies do we need to make if something goes wrong? Try it on NNThroughputBenchmark? Sure, but block reports and varying client workloads contribute heavily. Dynamometer!
  • #13: Macro effect: performance viewed by the client/server. Micro effect: the GC pauses themselves. GC characteristics are very different during startup/failover, so measure these separately
  • #14: Can you see what’s wrong here? Object copying time? No… RSet scanning! Essentially the set of references; a tradeoff between overhead and how expensive it is to scan them. NN performance ground to a halt. This is why we can’t test it out on production!
  • #15: Getting these values is where Dynamometer really comes through strong. We tried probably around 50 different combinations of parameters. Dynamometer allowed us to set up an experiment where we could sweep over a number of parameters, let it run over a long weekend, and come back at the end to a bunch of data – no human necessary in between
  • #20: Getting these values is where Dynamometer really comes through strong. We tried probably around 50 different combinations of parameters. Dynamometer allowed us to set up an experiment where we could sweep over a number of parameters, let it run over a long weekend, and come back at the end to a bunch of data – no human necessary in between
  • #21: Don’t forget to show the appendix!