DB Capacity Planning at eBay
Feng Qu, Sr MTS
Bass Chorng, Principal Capacity Engineer
#CassandraSummit2015
Who Am I?
Bass Chorng – Principal Capacity Engineer @ eBay
Specializes in database performance, availability & scalability in a large website.
Established DB capacity team at eBay in 2003.
Loves mountain biking.
eBay Site DB Traffic At A Glance
NoSQL Total – 52 B/Day
  Cassandra – 15 B
  Mongo – 15 B
  CouchBase – 12 B
  PushVM – 10 B
RDBMS Total – 350 B
  MySQL – 10 B
  Oracle – 340 B
[Chart: Billion SQL Calls per Day – Cassandra 15, Mongo 15, CouchBase 12, PushVM 10, MySQL 10, Oracle 340]
Peak Traffic – 8M/sec
Site Total DB Calls – 400B/Day across 2,000 NoSQL Nodes + 450 Oracle Nodes
Hosting 800M Active items & 120M Active Users
Y-o-Y Growth – 30% ~ 35%
Capacity Planning - Simply Put
Ø Analyze Traffic
o Data
Ø Analyze Utilization
o Data
Ø Analyze The Relationship Of The Above Two
o Same Data
Ø Forecast Growth
o Simple Models, Then Impress Your Boss.
Ø Convert Resource Need into $
o A Calculator, Then Impress Your CIO
BTW, You Also Need To Know …
• Platform Domain Knowledge – Server, DB Engine, IO Subsystem, Networks …
• Relationship Between System Overhead & Utilization
• Seasonality & Workload Characteristics
• Bottlenecks – Components, Systems, Platforms, Architecture, Site & Apps
• New Technologies
Domain Knowledge Stack
aka Whom To Blame Stack
APPS
DB
UNIX
Bottom of food chain => STORAGE
[Diagram: layered stack, with CAPACITY written vertically alongside the layers]
Data
Ø What To Collect?
o Apps, Database, Sessions, CPU, Memory, Connections, IOPS, IO Time, NIC, HBA, Array
Ø How To Collect?
o Time Resolution, Aggregation Level, Retention
Ø How To Use It?
o Average, Max, 95th percentile, Dashboard, Reporting, Trending
[Charts: two daily time series, dated 1/26/2015 to 3/1/2015 and 5/1/2015 to 5/27/2015; axis labels not recoverable]
Forecast
Ø Model Traffic, Not Resources
Ø Need One Year Trend
Ø Forecast At Daily Level
Ø Eliminate Outliers
Ø No Data Is Better Than Wrong Data
Ø Convert Traffic To Resource Usage
Ø Linear Extrapolation Only (CPU Utilization, not IO Time)
Ø Simple Excel Formula Works Well
Ø For Long Term Resource Planning Only
Ø Use Average, Not Max
Ø Not All Workloads Are Predictable
[Chart: CATY Traffic Forecast; y-axis: Billion Calls, 0 to 70; x-axis: 01/01/2012 to 01/01/2015; series: Forecast, Actual, Capacity]
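The forecasting bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the speakers' actual tooling: the function names, the residual-based outlier rule (3 median absolute deviations), the sample history, and the CPU-per-traffic ratio are all assumptions made for the example. It fits a straight line to daily average traffic, drops outliers, refits, extrapolates, and converts the projected traffic to CPU utilization with a linear ratio.

```python
from statistics import mean, median


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def fit_without_outliers(points, k=3.0):
    """Fit once, drop points whose residual is more than k MADs from the
    median residual, then refit. The MAD rule is an assumption of this sketch."""
    slope, intercept = fit_line(points)
    residuals = [y - (slope * x + intercept) for x, y in points]
    med = median(residuals)
    mad = median(abs(r - med) for r in residuals)
    if mad == 0:
        return slope, intercept
    kept = [p for p, r in zip(points, residuals) if abs(r - med) <= k * mad]
    return fit_line(kept)


def forecast(points, future_day, cpu_pct_per_billion):
    """Extrapolate traffic linearly, then convert traffic to CPU utilization."""
    slope, intercept = fit_without_outliers(points)
    traffic = slope * future_day + intercept
    return traffic, traffic * cpu_pct_per_billion


# Hypothetical history: one year of daily AVERAGE traffic (billion calls/day),
# growing from 40 B by 0.03 B per day, with one bad sample on day 100.
history = [(day, 40.0 + 0.03 * day) for day in range(365)]
history[100] = (100, 5.0)  # e.g. a collection gap recorded as a low value

# Project one year past the end of the history; 0.5% CPU per billion calls is made up.
traffic, cpu = forecast(history, future_day=730, cpu_pct_per_billion=0.5)
print(f"day 730: {traffic:.1f} B calls/day, roughly {cpu:.1f}% CPU")
```

The same straight-line fit is what a spreadsheet trend formula would give; the point of the sketch is the order of operations: clean the traffic data first, forecast traffic, and only then translate it into resource usage.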
Things To Watch For
Myths
Ø More CPU Makes Apps Run Faster
Ø More Data Makes Apps Run Slower
Ø Apps Run Twice As Fast On CPU Twice The Speed
Ø High Session = High Load
Pitfalls
Ø Cause vs. Symptom
Ø Time Resolution Masks Issues
Ø Look At The Whole Picture
Ø Slow Down In Order To Go Faster (Throttle)
Challenges
Ø Data Quality – Data Missing, Data Source Changes, F/O Data Residency, Data Errors …
Ø Varieties of Data Formats & Resolutions
Ø Data Collection In Secured Zones
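To illustrate the "Time Resolution Masks Issues" pitfall, here is a small sketch with invented numbers. The `rollup` helper and the per-second CPU samples are hypothetical; the idea is only that a one-minute average can hide a short saturation spike that the per-second data (or the max) would reveal.

```python
def rollup(samples, window):
    """Aggregate consecutive samples into windows of `window` points,
    returning one (average, max) pair per window."""
    out = []
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        out.append((sum(chunk) / len(chunk), max(chunk)))
    return out


# Hypothetical per-second CPU utilization (%): one minute at 20%,
# with a 5-second spike to 100% in the middle.
per_second = [20.0] * 60
per_second[30:35] = [100.0] * 5

avg_1min, max_1min = rollup(per_second, 60)[0]
print(f"1-minute average: {avg_1min:.1f}%, 1-minute max: {max_1min:.0f}%")
```

At one-minute resolution the average looks comfortable while the CPU was fully saturated for five seconds, which is why the aggregation level and the choice of statistic matter when collecting data.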
Me: Everything NoSQL
Ø Prior to 2011: Worked on Oracle at DoubleClick/Yahoo/Intuit
Ø Worked on NoSQL on the eBay Database Infrastructure team:
  Ø Cassandra since 2011
  Ø MongoDB since 2012
  Ø Couchbase since 2014
Ø Cassandra Summit speaker for 2013, 2014, 2015
Ø DataStax Cassandra MVP for 2014, 2015
For Cassandra
Ø Capacity Measurements
  Ø Throughput
  Ø Latency
  Ø E.g. 30,000 reads/sec with SLA of P99 at 5ms
Ø Hardware SKU Example
  Ø CPU: 20 cores
  Ø Memory: 128GB RAM
  Ø Storage: 1.5TB local SSD
  Ø Network: 10 GbE NIC
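A capacity goal such as "30,000 reads/sec with SLA of P99 at 5ms" combines a throughput target with a latency percentile. The sketch below shows one way to check a benchmark run against such a goal. The helper names and the sample latencies are hypothetical, and the nearest-rank percentile definition is an assumption; monitoring tools may interpolate differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_goal(latencies_ms, reads_per_sec, target_rps=30_000, p99_sla_ms=5.0):
    """True when both the throughput target and the P99 latency SLA are met."""
    return reads_per_sec >= target_rps and percentile(latencies_ms, 99) <= p99_sla_ms


# Hypothetical benchmark run: 1,000 read latencies in milliseconds.
latencies = [1.0] * 980 + [4.0] * 15 + [20.0] * 5

print("P99 latency (ms):", percentile(latencies, 99))
print("goal met at 32,000 reads/sec:", meets_goal(latencies, 32_000))
```

Note that the five slowest reads at 20 ms do not break the P99 SLA here, although they would dominate a max-based view.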
Benchmarking
Ø Benchmarking for different hardware
  Ø High I/O SKU
  Ø High memory SKU
  Ø High storage SKU
  Ø Bare metal or cloud
Ø Benchmarking for different software releases
Ø Benchmarking for different workloads
  Ø 100% Writes
  Ø 50% Writes, 50% Reads
  Ø 5% Writes, 95% Reads
  Ø 100% Reads
Ø Benchmarking Tools
  Ø YCSB
  Ø cassandra-stress
Ø Proactive and repeated process using near-real-time traffic in a production-like environment
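As a rough sketch of how the four workload mixes above might be scripted, the Python below only builds cassandra-stress command lines; it does not run them. The helper name, operation count, thread count, and node address are placeholders, and the options used (`write`, `read`, `mixed ratio(...)`, `n=`, `-rate threads=`, `-node`) should be checked against the cassandra-stress version installed with your Cassandra release.

```python
def stress_command(write_pct, ops=1_000_000, threads=50, node="127.0.0.1"):
    """Build a cassandra-stress argument list for a given write percentage.

    All defaults are placeholders for illustration, not recommended values.
    """
    if not 0 <= write_pct <= 100:
        raise ValueError("write_pct must be between 0 and 100")
    if write_pct == 100:
        workload = ["write"]
    elif write_pct == 0:
        workload = ["read"]
    else:
        workload = ["mixed", f"ratio(write={write_pct},read={100 - write_pct})"]
    return ["cassandra-stress", *workload, f"n={ops}",
            "-rate", f"threads={threads}", "-node", node]


# The four mixes listed on the slide, expressed as write percentages.
WRITE_PERCENTAGES = [100, 50, 5, 0]
commands = [stress_command(pct) for pct in WRITE_PERCENTAGES]

for cmd in commands:
    print(" ".join(cmd))
```

Read and mixed runs need data to read, so the 100% write run would normally come first. Passing the list form to `subprocess.run` avoids having to shell-escape the parentheses in the `ratio(...)` argument.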
Capacity Planning
Ø Key to avoiding surprises in production
Ø The concept behind capacity planning is simple, but the mechanics are harder.
Ø Business requirements may grow, so forecast how many resources must be added to the system to keep the user experience uninterrupted
Ø Input: a clearly defined capacity goal from business requirements and a performance baseline from benchmark tests
Ø Output: the resources to be added, such as memory, CPU, storage, I/O, network
Ø Always prepare for peak + headroom
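"Peak + headroom" can be turned into a node count once a per-node baseline exists from benchmarking. The sketch below is a simplified illustration: the function name, the 30% headroom, and the 500,000 reads/sec peak are assumptions, and it treats the 30,000 reads/sec figure from the earlier slide as a per-node baseline, which the slides do not state explicitly. It also ignores replication and uneven load distribution.

```python
import math


def nodes_for_peak(peak_ops_per_sec, per_node_ops_per_sec, headroom_pct=30):
    """Nodes needed to serve peak traffic plus a headroom percentage,
    given the benchmarked throughput one node sustains within its SLA."""
    if per_node_ops_per_sec <= 0:
        raise ValueError("per-node throughput must be positive")
    required = peak_ops_per_sec * (100 + headroom_pct) / 100
    return math.ceil(required / per_node_ops_per_sec)


# Hypothetical inputs: 500,000 reads/sec forecast peak, 30,000 reads/sec per node.
needed = nodes_for_peak(500_000, 30_000, headroom_pct=30)
print("nodes needed:", needed)
```

Sizing against the average instead of the peak in this step would give 17 nodes at 500,000 reads/sec with no headroom, compared with 22 when headroom is included.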
Capacity Planning Process
Ø Initial Sizing
  Ø Storage size vs. data size
  Ø Compaction overhead, compression ratio, RF, indexes
  Ø Cost-effective configuration to meet capacity/latency SLA
Ø Routine Review
  Ø System utilization on I/O, storage, network, CPU, memory, etc.
  Ø Cassandra metrics on GC, compaction, latency, throughput, etc.
  Ø nodetool compactionstats, cfhistograms, tpstats, etc.
Ø Forecasting
  Ø Historical comparison
  Ø Traffic projection
  Ø Flex up or Flex down
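The initial-sizing factors (storage size vs. data size, compaction overhead, compression ratio, RF, indexes) can be combined into a back-of-the-envelope storage estimate. The sketch below is one possible formulation, not a formula from the slides: the function name and every input value are assumptions, apart from the 1.5 TB disk taken from the example SKU. The 50% compaction headroom in particular is a conservative placeholder that depends on the compaction strategy.

```python
import math


def nodes_for_storage(raw_data_tb, replication_factor, compression_ratio,
                      index_overhead, disk_per_node_tb, compaction_headroom):
    """Estimate on-disk size and node count from storage alone.

    compression_ratio   : compressed size / uncompressed size (0.5 = halves the data)
    index_overhead      : extra space for indexes as a fraction of the data (0.1 = 10%)
    compaction_headroom : fraction of each disk kept free for compaction
    """
    on_disk_tb = (raw_data_tb * replication_factor * compression_ratio
                  * (1 + index_overhead))
    usable_per_node_tb = disk_per_node_tb * (1 - compaction_headroom)
    return on_disk_tb, math.ceil(on_disk_tb / usable_per_node_tb)


# Hypothetical inputs: 12 TB of raw data, RF 3, 2:1 compression,
# 10% index overhead, 1.5 TB SSD per node, half of each disk kept free.
on_disk, nodes = nodes_for_storage(12, 3, 0.5, 0.1, 1.5, 0.5)
print(f"on-disk size: {on_disk:.1f} TB, nodes needed for storage: {nodes}")
```

The final node count would be the larger of this storage-driven number and the throughput-driven number from benchmarking, which is where the cost-effective configuration trade-off comes in.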
Scale Up vs. Scale Out
Ø Scale Up (vertical)
  Ø Pros
    Ø Smaller data center footprint, such as space, power, cooling
    Ø Lower license cost
  Ø Cons
    Ø Likely costs more due to proprietary hardware
    Ø Less fault tolerant
    Ø Limited upgradability in future
Ø Scale Out (horizontal)
  Ø Pros
    Ø Cheaper using commodity hardware
    Ø More fault tolerant
    Ø (Unlimited) upgradability
  Ø Cons
    Ø Bigger data center footprint
    Ø Higher license cost
    Ø Likely needs more network equipment
Questions ?
eBay is hiring experienced NoSQL professionals. Please send your resume to [email protected]