Netflix running Presto in the AWS Cloud
Zhenxiao Luo
Senior Software Engineer @ Netflix
Outline
● BigDataPlatform@Netflix
● Use cases & requirements
● What we did
○ Reading/Writing from/to Amazon S3
○ Operations
○ Deployment
○ Performance
● What’s next?
BigDataPlatform @ Netflix
Use Cases
● Big Batch Jobs
○ high throughput, fault tolerant, ETL
○ data spills to disk
○ Hive on Tez, Pig on Tez
● Adhoc Queries
○ low latency, interactive, data exploration
○ in-memory, but limited data size
○ Impala, Redshift, Spark, Presto
Netflix Requirements
● SQL-like language
● Low latency for adhoc queries
● Work well on AWS cloud
● Good integration with Hadoop stack
● Scale to 1000+ node cluster
● Open source with community support
What did Netflix do?
Reading/Writing to/from S3
● Option 1: Apache Hadoop NativeS3FileSystem
● Option 2: PrestoS3FileSystem
○ retry logic for read timeout
○ write directly to final S3 path
● Option 3: emrFileSystem
○ disable hadoop logging
○ disable hadoop FileSystem cache
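The "retry logic for read timeout" bullet can be illustrated with a minimal sketch. This is not the PrestoS3FileSystem code; `ReadTimeout`, `read_with_retry`, `make_flaky_reader`, and the backoff parameters are hypothetical names and values chosen only to show the general pattern of retrying a transient read failure with exponential backoff.

```python
import time


class ReadTimeout(Exception):
    """Hypothetical stand-in for a transient S3 read timeout."""


def read_with_retry(read_fn, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call read_fn(), retrying on ReadTimeout with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return read_fn()
        except ReadTimeout:
            if attempt == max_attempts:
                raise  # out of attempts: surface the timeout to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_reader(failures, payload=b"row data"):
    """Return a reader that times out `failures` times, then succeeds."""
    state = {"calls": 0}

    def reader():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ReadTimeout("simulated read timeout")
        return payload

    return reader
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production client would also cap the delay and likely add jitter.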
Bug Fixes
● https://github.com/facebook/presto/commit/cf0b2d66f4050fb1959c832809fa76e323d6d46e
● https://github.com/facebook/presto/commit/594b06c3e93a482dc162d2c49c9bd265795efb86
● https://github.com/facebook/presto/pull/1147
● https://github.com/facebook/presto/pull/1300
● https://github.com/facebook/presto/issues/1285
● https://github.com/facebook/presto/issues/1264
Our Operations Environment
● Launch script on top of EMR
● Ganglia integration
● Usage graphs - concurrent queries & tasks
Current Deployment
● Presto in Production @ Netflix
● 100+ node Presto cluster
● 1000+ queries running per day
● Presto queries the same petabyte-scale S3 data warehouse as Hive and Pig
Observed Performance @ Netflix
● Data in Sequence File Format
● One MapReduce Job SmallTableScan
○ MapReduce overhead dominates the query execution time
○ Presto is always ~10X faster than Hive
● One MapReduce Job BigTableScan
○ MapReduce overhead is marginal compared with big table scan time
○ Presto performs similarly to Hive
● Multiple MapReduce Aggregation
○ Presto is always > 10X faster than Hive
● Joins
○ Presto is always > 2X faster than Hive
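The small-table versus big-table observation follows from simple arithmetic if MapReduce adds a roughly fixed startup cost. The sketch below is a toy model, not a measurement: it assumes both engines scan at the same rate and that only Hive pays the job overhead, and every number in it is hypothetical.

```python
def hive_to_presto_ratio(scan_seconds, mr_overhead_seconds):
    """Ratio of Hive time to Presto time under a fixed-overhead model.

    Assumes (hypothetically) that both engines scan at the same rate and
    that only Hive pays the MapReduce job overhead.
    """
    hive_seconds = mr_overhead_seconds + scan_seconds
    presto_seconds = scan_seconds
    return hive_seconds / presto_seconds


# Hypothetical numbers, chosen only to show the shape of the effect.
small = hive_to_presto_ratio(scan_seconds=3, mr_overhead_seconds=27)
big = hive_to_presto_ratio(scan_seconds=3000, mr_overhead_seconds=27)
print(f"small table: {small:.1f}x, big table: {big:.2f}x")
```

With these made-up inputs the ratio is about 10x when the overhead dominates and close to 1x when the scan dominates, which matches the pattern on the slide.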
What we are working on
● Support Parquet File Format
○ https://github.com/facebook/presto/pull/1147
○ Parquet performs similarly to SequenceFile, but not as fast as RCFile
● ODBC/JDBC driver for Presto
○ Support MicroStrategy running on Presto
Some inconveniences ...
● Support Server Side “Use Schema”
○ Workaround: Client Side “Use Schema” Or “Schema.Table”
● Recurse the partition directory
○ Behavior differs from Hive
● Metadata caching
○ a query has to be rerun several times before a metadata change becomes visible
● Extend JSON extract functions to allow . notation
○ json_extract_scalar(mapColumn, '$.namePart1.namePart2')
○ Workaround: regexp_extract
● WebUI running slow
○ load query task info on demand
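To make the JSON item concrete, here is a small sketch contrasting a path-style lookup with the regex workaround. It is written in Python rather than Presto SQL, and it assumes the path `'$.namePart1.namePart2'` is meant to address a nested key; the slide does not say so explicitly. `extract_scalar` and `extract_with_regex` are hypothetical helpers, not Presto functions.

```python
import json
import re


def extract_scalar(json_text, path):
    """Follow a '$.a.b' style path through nested JSON objects."""
    value = json.loads(json_text)
    for key in path.lstrip("$").strip(".").split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    # Only scalars are returned; objects and arrays yield None.
    if isinstance(value, (dict, list)):
        return None
    return value if isinstance(value, str) else json.dumps(value)


def extract_with_regex(json_text, inner_key):
    """Regex workaround: grab the first string value stored under inner_key.

    Ignores nesting and escaped quotes, so it is more fragile than a parser.
    """
    pattern = r'"%s"\s*:\s*"([^"]*)"' % re.escape(inner_key)
    match = re.search(pattern, json_text)
    return match.group(1) if match else None


doc = '{"namePart1": {"namePart2": "value", "other": 1}}'
print(extract_scalar(doc, "$.namePart1.namePart2"))
print(extract_with_regex(doc, "namePart2"))
```

Both calls return the same string for this document, but the regex version would also match a `namePart2` key anywhere else in the text, which is why it is only a workaround.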
Features we would like
● Big table join
● User Defined Functions
● Break down one column value into several tuples
○ In Hive: lateral view explode json_tuple
● Decimal type
● Scheduler
● Writes
○ Insert overwrite
○ Alter table add partition
○ Parallel writes from workers (not client only)
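The "break down one column value into several tuples" request refers to the row-expanding behavior of Hive's lateral view with explode. A minimal Python sketch of that semantics, using hypothetical row data and a hypothetical `explode` helper, might look like this:

```python
def explode(rows, column):
    """Yield one output row per element of the list stored in `column`.

    Rows whose list is empty produce no output, as with a Hive lateral
    view that is not declared OUTER.
    """
    for row in rows:
        for element in row[column]:
            exploded = dict(row)  # copy so the input row is not mutated
            exploded[column] = element
            yield exploded


# Hypothetical input: one row per user, with a list-valued column.
rows = [
    {"user": "a", "titles": ["x", "y"]},
    {"user": "b", "titles": []},
]
for out in explode(rows, "titles"):
    print(out)
```

Each input row fans out into as many output rows as its list has elements, with the other columns repeated.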
Q & A
Thank you!
