Copyright 2013 by Hortonworks and Microsoft
ORC File & Vectorization
Improving Hive Data Storage and Query Performance
June 2013
Owen O’Malley
owen@hortonworks.com
@owen_omalley
Jitendra Pandey
jitendra@hortonworks.com
Eric Hanson
ehans@microsoft.com
ORC – Optimized RC File
History
Remaining Challenges
Requirements
File Structure
Stripe Structure
File Layout
[Figure: ORC file layout. Labels: File Footer; Postscript; three 256 MB stripes, each containing Index Data, Row Data, and a Stripe Footer; Columns 1 to 8 (shown twice); Streams 2.1 to 2.4.]
Compression
Integer Column Serialization
String Column Serialization
Hive Compound Types
[Figure: column tree with numbered nodes. 0: Struct, 1: Int, 2: Map, 3: String, 4: Struct, 5: String, 6: Double, 7: Time.]
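The node numbers in the figure are consistent with numbering the type tree in pre-order (a parent before its children). A minimal sketch of that numbering follows. The `Node` class and the nesting of the schema are assumptions made for illustration: the slide only gives the ids and type names, and the shape struct<int, map<string, struct<string, double>>, time> is one tree that reproduces them.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnIds {
    // Hypothetical, simplified type-tree node; not a Hive or ORC class.
    static final class Node {
        final String kind;
        final List<Node> children = new ArrayList<>();

        Node(String kind, Node... kids) {
            this.kind = kind;
            for (Node k : kids) {
                children.add(k);
            }
        }
    }

    // Pre-order walk: a node receives its id before any of its children.
    static void assign(Node node, List<String> out) {
        out.add(out.size() + ": " + node.kind);
        for (Node child : node.children) {
            assign(child, out);
        }
    }

    // Assumed schema: struct<int, map<string, struct<string, double>>, time>
    static List<String> example() {
        Node schema = new Node("Struct",
            new Node("Int"),
            new Node("Map",
                new Node("String"),
                new Node("Struct", new Node("String"), new Node("Double"))),
            new Node("Time"));
        List<String> ids = new ArrayList<>();
        assign(schema, ids);
        return ids;
    }

    public static void main(String[] args) {
        for (String line : example()) {
            System.out.println(line);
        }
    }
}
```

Under the assumed schema, the walk yields the same eight id/type pairs that appear in the figure.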
Compound Type Serialization
Generic Compression
Column Projection
How Do You Use ORC
Managing Memory
TPC-DS File Sizes
ORC Predicate Pushdown
Additional Details
Current work for Hive 0.12
Future Work
Comparison
Feature                        RC File   Trevni   Parquet     ORC
Hive Integration               Y         N        N           Y
Active Development             N         N        Y           Y
Hive Type Model                N         N        N           Y
Shred complex columns          N         Y        Y           Y
Splits found quickly           N         Y        Y           Y
Files per bucket               1         many     1 or many   1
Versioned metadata             N         Y        Y           Y
Run length data encoding       N         N        Y           Y
Store strings in dictionary    N         N        Y           Y
Store min, max, sum, count     N         N        N           Y
Store internal indexes         N         N        N           Y
No overhead for non-null       N         N        N           Y (≥ 0.12)
Predicate Pushdown             N         N        N           Y (≥ 0.12)
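The "Store min, max, sum, count" and "Predicate Pushdown" rows are related: stored minimum and maximum values let a reader skip data that cannot satisfy a filter. The sketch below illustrates the idea for a predicate of the form value >= bound. It is not ORC's reader code; `StripeStats` and `stripesToRead` are hypothetical names, and the statistics are made-up sample values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MinMaxPruning {
    // Hypothetical per-stripe statistics for a single long column.
    static final class StripeStats {
        final long min;
        final long max;

        StripeStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    // Returns the indexes of stripes that may contain a row with value >= bound.
    static List<Integer> stripesToRead(List<StripeStats> stats, long bound) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < stats.size(); i++) {
            // If even the largest value is below the bound, no row can match.
            if (stats.get(i).max >= bound) {
                keep.add(i);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<StripeStats> stats = Arrays.asList(
            new StripeStats(0, 99),
            new StripeStats(100, 199),
            new StripeStats(150, 400));
        // Stripe 0 is skipped because its maximum (99) is below 180.
        System.out.println(stripesToRead(stats, 180)); // prints [1, 2]
    }
}
```

A stripe that is kept may still contain non-matching rows, so the filter must still be applied to the rows that are read.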
Vectorization
Why row-at-a-time execution is slow
• Hive uses Object Inspectors to work on a row
• Enables a level of abstraction
• Carries a major performance cost
• Exacerbated by using lazy SerDes
• Inner loop has many method calls, new() allocations, and if-then-else branches
• Lots of CPU instructions
• Pipeline stalls
• Poor instructions per cycle
• Poor cache locality
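The costs listed above can be made concrete with a small sketch of a row-at-a-time inner loop. This is not Hive code: `FieldInspector` is a hypothetical stand-in for an object inspector, and rows are modeled as plain `Object[]` arrays purely for illustration.

```java
final class RowAtATimeSketch {
    // Hypothetical stand-in for an object inspector: one virtual call per field access.
    interface FieldInspector {
        Object get(Object row, int field);
    }

    static final class ArrayRowInspector implements FieldInspector {
        @Override
        public Object get(Object row, int field) {
            return ((Object[]) row)[field];
        }
    }

    // Adds a scalar to one field of every row, one row at a time.
    static Object[] addScalar(Object[][] rows, int field, long scalar, FieldInspector oi) {
        Object[] out = new Object[rows.length];
        for (int r = 0; r < rows.length; r++) {
            Object value = oi.get(rows[r], field);      // interface method call per row
            if (value == null) {                        // branch per row
                out[r] = null;
            } else {
                long sum = ((Long) value) + scalar;     // cast and unbox
                out[r] = Long.valueOf(sum);             // box again (may allocate)
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Object[][] rows = { {1L, "a"}, {null, "b"}, {40L, "c"} };
        Object[] result = addScalar(rows, 0, 2L, new ArrayRowInspector());
        System.out.println(java.util.Arrays.toString(result)); // prints [3, null, 42]
    }
}
```

Every row pays for an interface call, a null check, a cast, and boxing, which is the per-row overhead the bullets describe.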
How the code works (simplified)
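import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

// Adds a constant to every live value of a long column, one batch at a time.
class LongColumnAddLongScalarExpression {
  int inputColumn;
  int outputColumn;
  long scalar;

  void evaluate(VectorizedRowBatch batch) {
    long[] inVector =
        ((LongColumnVector) batch.columns[inputColumn]).vector;
    long[] outVector =
        ((LongColumnVector) batch.columns[outputColumn]).vector;
    if (batch.selectedInUse) {
      // Only the rows listed in batch.selected are live.
      for (int j = 0; j < batch.size; j++) {
        int i = batch.selected[j];
        outVector[i] = inVector[i] + scalar;
      }
    } else {
      // All of the first batch.size rows are live.
      for (int i = 0; i < batch.size; i++) {
        outVector[i] = inVector[i] + scalar;
      }
    }
  }
}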
• No method calls
• Low instruction count
• Cache locality to 1024 values
• No pipeline stalls
• SIMD in Java 8
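To run the same loop without a Hive dependency, the sketch below swaps in a simplified stand-in for the row batch. The `Batch` class, the `addScalar` signature, and the sample values are assumptions for illustration only; the field names mirror those used in the listing above.

```java
final class VectorizedAddDemo {
    // Simplified stand-in for a row batch that holds only long columns.
    static final class Batch {
        long[][] columns;        // one primitive array per column
        int size;                // number of live rows
        boolean selectedInUse;   // true when `selected` lists the live row indexes
        int[] selected;
    }

    // Same loop structure as the listing above, over plain arrays.
    static void addScalar(Batch batch, int in, int out, long scalar) {
        long[] inVector = batch.columns[in];
        long[] outVector = batch.columns[out];
        if (batch.selectedInUse) {
            for (int j = 0; j < batch.size; j++) {
                int i = batch.selected[j];
                outVector[i] = inVector[i] + scalar;
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                outVector[i] = inVector[i] + scalar;
            }
        }
    }

    static long[] example() {
        Batch b = new Batch();
        b.columns = new long[][] { {10, 20, 30, 40}, new long[4] };
        b.selectedInUse = true;          // pretend an earlier filter kept rows 1 and 3
        b.selected = new int[] {1, 3};
        b.size = 2;
        addScalar(b, 0, 1, 5);
        return b.columns[1];
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(example())); // prints [0, 25, 0, 45]
    }
}
```

Rows 0 and 2 of the output column stay untouched because they are not in the selection, so filtered-out rows cost no work in later expressions.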
Vectorization project
Preliminary performance results
• NOT a benchmark
• 218 million row fact table of real data, 25 columns
• 18GB raw data
• 6 core, 12 thread workstation, 1 disk, 16GB RAM
• select a, b, count(*) from t where c >= const group by a, b -- 53 row result
Warm start times    RC non-vectorized            ORC non-vectorized       ORC vectorized
                    (default, not compressed)    (default, compressed)    (default, compressed)
Runtime (sec)       261                          58                       43
Total CPU (sec)     381                          159                      42
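The relative improvements implied by the table can be worked out directly. The short sketch below does only that arithmetic on the figures above; the class and method names are illustrative.

```java
final class SpeedupTable {
    // How many times smaller `improved` is than `baseline`.
    static double ratio(double baseline, double improved) {
        return baseline / improved;
    }

    public static void main(String[] args) {
        // Runtime in seconds, taken from the table above.
        System.out.printf("Runtime, RC to ORC:                  %.2fx%n", ratio(261, 58));
        System.out.printf("Runtime, ORC to ORC vectorized:      %.2fx%n", ratio(58, 43));
        System.out.printf("Runtime, RC to ORC vectorized:       %.2fx%n", ratio(261, 43));
        // Total CPU in seconds, taken from the table above.
        System.out.printf("Total CPU, RC to ORC:                %.2fx%n", ratio(381, 159));
        System.out.printf("Total CPU, ORC to ORC vectorized:    %.2fx%n", ratio(159, 42));
        System.out.printf("Total CPU, RC to ORC vectorized:     %.2fx%n", ratio(381, 42));
    }
}
```

On these numbers, vectorization cuts total CPU time by roughly 3.8x relative to non-vectorized ORC while runtime improves by about 1.35x, so the wall-clock gain in this run is smaller than the CPU gain.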
Thanks to contributors!
• Microsoft Big Data:
• Eric Hanson, Remus Rusanu, Sarvesh Sakalanaga, Tony Murphy, Ashit Gosalia
• Hortonworks:
• Jitendra Pandey, Owen O’Malley, Gopal V
• Others:
• Teddy Choi, Tim Chen
Jitendra/Eric are joint leads
