Netflix running Presto in the AWS Cloud

Netflix running Presto in the AWS Cloud
Zhenxiao Luo
Senior Software Engineer @ Netflix

Outline
● BigDataPlatform@Netflix
● Use cases & requirements
● What we did
○ Reading/Writing from/to Amazon S3
○ Operations
○ Deployment
○ Performance
● What’s next?

Use Cases
● Big Batch Jobs
○ high throughput, fault tolerant, ETL
○ data spills to disk
○ Hive on Tez, Pig on Tez
● Adhoc Queries
○ low latency, interactive, data exploration
○ in-memory, but limited data size
○ Impala, Redshift, Spark, Presto

Netflix Requirement
● SQL like Language
● Low latency for adhoc queries
● Work well on AWS cloud
● Good integration with Hadoop stack
● Scale to 1000+ node cluster
● Open source with community support

Reading/Writing to/from S3
● Option 1: Apache Hadoop NativeS3FileSysyem
● Option 2: PrestoS3FileSystem
○ retry logic for read timeout
○ write directly to final S3 path
● Option 3: emrFileSystem
○ disable hadoop logging
○ disable hadoop FileSystem cache

Bug Fixes
● https://github.
com/facebook/presto/commit/cf0b2d66f4050fb1959c832809fa76e323d6d4
6e
● https://github.
com/facebook/presto/commit/594b06c3e93a482dc162d2c49c9bd265795ef
b86
● https://github.com/facebook/presto/pull/1147
● https://github.com/facebook/presto/pull/1300
● https://github.com/facebook/presto/issues/1285
● https://github.com/facebook/presto/issues/1264

Our Operations Environment
● Launch script on top of EMR
● Ganglia integration
● Usage graphs - concurrent queries & tasks

Current Deployment
● Presto in Production @ Netflix
● 100+ nodes Presto Cluster
● 1000+ queries running per day
● Presto query against the same Petabyte Scale S3 Data
Warehouse as Hive and Pig

Observed Performance @ Netflix
● Data in Sequence File Format
● One MapReduce Job SmallTableScan
○ MapReduce overhead dominates the query execution time
○ Presto is always ~10X faster than Hive
● One MapReduce Job BigTableScan
○ MapReduce overhead is marginal compared with big table scan time
○ Presto performs similar to Hive
● Multiple MapReduce Aggregation
○ Presto is always > 10X faster than Hive
● Joins
○ Presto is always > 2X faster than Hive

What we are working on
● Support Parquet File Format
○ https://github.com/facebook/presto/pull/1147
○ Parquet performs similar to Sequence, but not as fast as RCFile
● ODBC/JDBC driver for Presto
○ Support Microstrategy running on Presto

Some inconveniences ...
● Support Server Side “Use Schema”
○ Workaround: Client Side “Use Schema” Or “Schema.Table”
● Recurse the partition directory
○ Different behavior with Hive
● Metadata caching
○ have to rerun the query a number of times to see the metadata
change
● Extend JSON extract functions to allow . notation
○ json_extract_scalar(mapColumn, '$.namePart1.namePart2')
○ Workaround: regexp_extract
● WebUI running slow
○ load query task info on demand

Features we would like
● Big table join
● User Defined Functions
● Break down one column value into several tuples
○ In Hive: lateral view explode json_tuple
● Decimal type
● Scheduler
● Writes
○ Insert overwrite
○ Alter table add partition
○ Parallel writes from workers (not client only)

Netflix running Presto in the AWS Cloud

More Related Content

What's hot (20)

Viewers also liked (17)

Similar to Netflix running Presto in the AWS Cloud (20)

More from Zhenxiao Luo (8)

Recently uploaded (20)

Netflix running Presto in the AWS Cloud