Dynamometer and A Case Study in NameNode GC
Erik Krogen
Senior Software Engineer, Hadoop & HDFS
Dynamometer
• Realistic performance benchmark & stress test for HDFS
• Open sourced on LinkedIn GitHub, contributing to Apache
• Evaluate scalability limits
• Provide confidence before new feature/config deployment
What’s a Dynamometer?
“A dynamometer or "dyno" for short, is a device for measuring force, torque, or power. For example, the power produced by an engine…” – Wikipedia
Image taken from https://flic.kr/p/dtkCRU and redistributed under the CC BY-SA 2.0 license
Main Goals

High Fidelity
• Accurate namespace: Namespace characteristics have a big impact
• Accurate client workload: Request types and the timing of requests both have a big impact
• Accurate system workload: Load imposed by system management (block reports, etc.) has a big impact

Efficiency
• Low cost: Offline infra has high utilization; can’t afford to keep around unused machines for testing
• Low developer effort: Deploying to a large number of machines can be cumbersome; make it easy
• Fast iteration cycle: Should be able to iterate quickly
Simplify the Problem
NameNode is the central component and the most frequent bottleneck: focus here
Dynamometer
SIMULATED HDFS CLUSTER RUNS IN YARN CONTAINERS
• How to schedule and coordinate? Use YARN!
• Real NameNode, fake DataNodes to run on ~1% of the hardware
[Architecture diagram: the Dynamometer Driver launches a DynoAM on the host YARN cluster; the real NameNode and many simulated DataNodes run in YARN containers across YARN nodes; the FsImage and block listings are fed in from the host HDFS cluster.]
Dynamometer
SIMULATED HDFS CLIENTS RUN IN YARN CONTAINERS
• Clients can run on YARN too!
• Replay real traces from production cluster audit logs (example line below)
[Architecture diagram: alongside the Dynamometer infrastructure application (Driver, NameNode, simulated DataNodes), a workload MapReduce job launches simulated clients in YARN containers on the host YARN cluster; the simulated clients replay audit logs stored on the host HDFS cluster.]
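For context, the replayed traces are standard HDFS audit log entries. A representative line is sketched below; the timestamp, user, IP, and path are illustrative, not taken from a real cluster:

2019-01-30 10:23:45,123 INFO FSNamesystem.audit: allowed=true ugi=etl_user (auth:KERBEROS) ip=/10.12.34.56 cmd=getfileinfo src=/data/tracking/events/2019/01/30 dst=null perm=null proto=rpc

The workload job reads the command and timestamp from each such line so the simulated clients can re-issue the same request types with the same relative timing.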
Contributing to Apache
• Working to put it into hadoop-tools
• Easier place for community to access and contribute
• Increased chance of others helping to maintain it
• Follow HDFS-12345 (actual ticket, not a placeholder)
NameNode GC: A Dynamometer Case Study
NameNode GC Primer
• Why do we care?
• NameNode heaps are huge (multi-hundred GB)
• GC is a big factor in performance
• What’s special about NameNode GC?
• Huge working set: can have over 100GB of long-lived objects
• Massive young gen churn (from RPC requests)
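The pause data analyzed in the following slides is the kind produced by standard HotSpot GC logging; on a Java 8 NameNode, a minimal set of flags to capture it might look like this (the log path is a placeholder):

-Xloggc:/var/log/hadoop/namenode-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime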
Can we use a new GC algorithm to squeeze
more performance out of the NameNode?
Q U E S T I O N
Experimental Setup
• 16-hour production trace: Long enough to experience 2 rounds of mixed GC
• Measure performance via standard metrics (client latency, RPC queue time)
• Measure GC pauses during startup and normal workloads
• Let’s try G1GC even though we know we’re pushing the limits:
• “The region sizes can vary from 1 MB to 32 MB depending on the heap size. The goal is to have no more than 2048 regions.” – Oracle*
• This implies that the heap should be 64 GB and under, but at this scale we are well beyond that (see the quick check below)

*Garbage First Garbage Collector Tuning – https://www.oracle.com/technetwork/articles/java/g1gc-198453
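A quick back-of-the-envelope check of that guideline against the 150 GB heap that shows up in the GC log on the next slide:

32 MB/region × 2048 regions = 64 GB   (largest heap within the guideline)
150 GB ÷ 32 MB/region ≈ 4,800 regions (well over 2× the recommended maximum)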
Can You Spot the Issue?
[Parallel Time: 17676.0 ms, GC Workers: 16]
[GC Worker Start (ms): Min: 883574.6, Avg: 883574.8, Max: 883575.0, Diff: 0.3]
[Ext Root Scanning (ms): Min: 1.0, Avg: 1.2, Max: 2.1, Diff: 1.1, Sum: 18.8]
[Update RS (ms): Min: 31.7, Avg: 32.2, Max: 32.8, Diff: 1.1, Sum: 514.7]
[Processed Buffers: Min: 25, Avg: 30.2, Max: 38, Diff: 13, Sum: 484]
[Scan RS (ms): Min: 17011.1, Avg: 17052.9, Max: 17400.5, Diff: 389.4, Sum: 272846.4]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.6, Diff: 0.6, Sum: 1.0]
[Object Copy (ms): Min: 169.8, Avg: 500.5, Max: 534.3, Diff: 364.5, Sum: 8007.3]
[Termination (ms): Min: 0.0, Avg: 88.8, Max: 96.5, Diff: 96.5, Sum: 1421.5]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.4]
[GC Worker Total (ms): Min: 17675.5, Avg: 17675.6, Max: 17675.8, Diff: 0.3, Sum: 282810.3]
[GC Worker End (ms): Min: 901250.4, Avg: 901250.4, Max: 901250.5, Diff: 0.0]
[Code Root Fixup: 0.7 ms]
[Code Root Migration: 2.3 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 6.7 ms]
[Other: 1194.8 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 2.8 ms]
[Ref Enq: 0.4 ms]
[Redirty Cards: 468.4 ms]
[Free CSet: 4.0 ms]
[Eden: 7360.0M(7360.0M)->0.0B(6720.0M) Survivors: 320.0M->960.0M Heap: 92.6G(150.0G)->87.2G(150.0G)]
[Times: user=223.20 sys=0.22, real=18.88 secs]
902.102: Total time for which application threads were stopped: 18.8815330 seconds
Key lines, annotated:
902.102: Total time for which application threads were stopped: 18.8815330 seconds  ← Huge pause!
[Eden: 7360.0M(7360.0M)->0.0B(6720.0M) Survivors: 320.0M->960.0M Heap: 92.6G(150.0G)->87.2G(150.0G)]  ← Only a few GB of Eden cleared – big, but not huge
[Object Copy (ms): Min: 169.8, Avg: 500.5, Max: 534.3, Diff: 364.5, Sum: 8007.3]  ← ~500 ms of the pause due to object copy
[Scan RS (ms): Min: 17011.1, Avg: 17052.9, Max: 17400.5, Diff: 389.4, Sum: 272846.4]  ← 17.5 s of the pause due to “Scan RS”!
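One way to dig further into remembered-set behavior like this on Java 8 is HotSpot’s RSet summary; these are diagnostic flags, so they must be unlocked, and the period value here is just an example:

-XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeRSetStats -XX:G1SummarizeRSetStatsPeriod=10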
Tuning G1GC with Dynamometer
• G1GC has lots of tunables – how do we optimize all of them without hurting our production system?
• Dynamometer to the rescue
• Easily set up experiments sweeping over different values for a param
• Fire-and-forget: test with many combinations and analyze later (see the sketch below)
• Main parameters needing significant tuning were for the remembered sets (details to follow in appendix)
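As a rough sketch of that fire-and-forget sweep, one run could be launched per candidate value along these lines; the launcher script name and its flags are placeholders for illustration, not the actual Dynamometer CLI:

for entries in 1536 2048 4096 8192; do
  # hypothetical launcher and flags – substitute the real Dynamometer invocation
  ./start-dynamometer-cluster.sh \
      --namenode-args "-XX:+UseG1GC -XX:G1RSetRegionEntries=${entries}" \
      --results-dir "results/rset-entries-${entries}"
done
# come back later and compare GC pause metrics across the results directories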
How Much Does G1GC Help?

                            Startup            Normal Operation
METRIC                      CMS      G1GC      CMS      G1GC
Avg Client Latency (ms)       –        –        19        18
Total Pause Time (s)        200      180       550       160
Median Pause Time (s)       1.1      0.5       0.12      0.06
Max Pause Time (s)         13.4      3.3       1.4       0.6

* Values are approximate and provided primarily to give a sense of scale

Takeaways: excellent reduction in pause times; not much impact on throughput (client latency).
Looking towards the future…
• Question: How does G1GC fare extrapolating to future workloads?
• 600GB+ heap size, 1 billion blocks, 1 billion files
• Answer: Not so well
• RSet entry count has to be increased even further to obtain reasonable performance
• Off-heap overheads in the hundreds of gigabytes
• Wouldn’t recommend it
Looking towards the future…
• Anything we can do besides G1GC?
• Extensive testing with Azul’s C4 GC available in Zing® JVM
• Good performance with no tuning
• Results in a test environment:
• 99th percentile pause time ~1ms, max in tens of ms
• Average client latency dropped ~30%
• Continued to see good performance up to 600GB heap size
Zing JVM: https://www.azul.com/products/zing/
Azul C4 GC: https://www.azul.com/resources/azul-technology/azul-c4-garbage-collec
Looking towards the future…
• Anything we can do that isn’t proprietary?
• Wait for OpenJDK next gen GC algorithms to mature:
• Shenandoah
• ZGC
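For reference, enabling these is JDK-build dependent; in JDK 11, for example, ZGC was still experimental and Shenandoah shipped only in some builds (and in later JDKs):

-XX:+UnlockExperimentalVMOptions -XX:+UseZGC
-XX:+UseShenandoahGC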
Appendix: Detailed G1GC Tuning Tips
• -XX:G1RSetRegionEntries: Solves the problem from the previous slide. 4096 worked well (default of 1536)
• Comes with high off-heap memory overheads
• -XX:G1RSetUpdatingPauseTimePercent: Reduce this to reduce the “Update RS” pause time and push more work to concurrent threads (the NameNode is not really that concurrent – the extra cores are better used by the GC algorithm)
• -XX:G1NewSizePercent: The default of 5% is unreasonably large for heaps > 100GB; reducing it will help shorten pauses during high-churn periods (startup, failover)
• -XX:MaxTenuringThreshold, -XX:ParallelGCThreads, -XX:ConcGCThreads: Set empirically based on experiments sweeping over values. This is where Dynamometer really shines
• MaxTenuringThreshold is particularly interesting: based on the NN usage pattern (objects are either very long-lived or very short-lived), you would expect low values (1 or 2) to be best, but in practice a value closer to the default of 8 performs better
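Pulling these together, an illustrative (not prescriptive) starting point for a very large NameNode heap is sketched below; 4096 and the tenuring threshold near 8 come from the experiments above, while the other values are assumptions to be re-tuned per cluster:

-XX:+UseG1GC
-XX:+UnlockExperimentalVMOptions          # needed for some of the flags below on JDK 8
-XX:G1RSetRegionEntries=4096              # from the sweep above; watch the off-heap overhead
-XX:G1RSetUpdatingPauseTimePercent=5      # assumed example value; lower pushes more RS work to concurrent threads
-XX:G1NewSizePercent=2                    # assumed example value; below the 5% default for >100GB heaps
-XX:MaxTenuringThreshold=8                # near the default performed best in practice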
Thank you!
Editor's Notes
  • #3: Quick show of hands, who has heard of Dynamometer? How can we know that deploying a new config won’t hurt us? How to know if working on a project is worth it?
  • #6: First key insight: focus on the NN only. This greatly reduces the size of the problem: only a single “real” node is needed, for the NameNode
  • #11: Quick background for those not familiar
  • #12: Enable it on a small cluster and see what happens? Sure, but without the load (heap and RPC) it won’t really tell us much. Enable it on production? Sure, but how many apologies do we need to make if something goes wrong? Try it on NNThroughputBenchmark? Sure, but block reports and varying client workloads contribute heavily. Dynamometer!
  • #13: Macro effect: performance viewed by the client/server. Micro effect: the GC pauses themselves. GC characteristics are very different during startup/failover, so measure these separately
  • #14: Can you see what’s wrong here? Object copying time? No… RSet scanning! Essentially the set of references; a tradeoff between overhead and how expensive it is to scan them. NN performance ground to a halt. This is why we can’t test it out on production!
  • #15: Getting these values is where Dynamometer really comes through strong. We tried probably around 50 different combinations of parameters. Dynamometer allowed us to set up an experiment where we could sweep over a number of parameters, let it run over a long weekend, and come back at the end to a bunch of data – no human necessary in between
  • #20: Getting these values is where Dynamometer really comes through strong. We tried probably around 50 different combinations of parameters. Dynamometer allowed us to set up an experiment where we could sweep over a number of parameters, let it run over a long weekend, and come back at the end to a bunch of data – no human necessary in between
  • #21: Don’t forget to show the appendix!