The columnar roadmap:
Apache Parquet and Apache Arrow
Julien Le Dem
VP Apache Parquet, Apache Arrow PMC, Principal Engineer WeWork
@J_
julien.ledem@wework.com
June 2018
Julien Le Dem
@J_
Principal Engineer
Data Platform
• Author of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez
• Used Hadoop first at Yahoo in 2007
• Formerly Twitter Data platform and Dremio
Agenda
• Community-Driven Standard
• Benefits of columnar representation
• Vertical integration: Parquet to Arrow
• Arrow-based communication
Community-Driven Standard
An open source standard
• Parquet: common need for on-disk columnar storage.
• Arrow: common need for in-memory columnar data.
• Arrow builds on the success of Parquet.
• Top-level Apache project
• Standard from the start:
– Members from 13+ major open source projects involved
• Benefits:
– Share the effort
– Create an ecosystem
Involved projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
Interoperability and Ecosystem
Before:
• Each system has its own internal memory format
• 70-80% of CPU is wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow:
• All systems use the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g. a Parquet-to-Arrow reader, sketched below)
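As one example of shared functionality, a minimal pyarrow sketch (the file path is a placeholder) of reading a Parquet file straight into Arrow memory; any Arrow-aware consumer can then use the result without conversion:

    import pyarrow.parquet as pq

    # One shared reader: Parquet bytes on disk become Arrow vectors in
    # memory, usable by any engine that understands the Arrow format.
    table = pq.read_table('events.parquet')

    # Hand-off to Pandas from the same Arrow representation.
    df = table.to_pandas()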
Benefits of Columnar representation
Columnar layout
[Diagram: a logical table representation and its two physical layouts, row layout (record after record) and column layout (column after column). Cat picture via @EmrgencyKittens]
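A minimal Python sketch (hypothetical three-row table) of the same logical table in both layouts:

    # Logical table: rows with columns (id, name, score).
    rows = [(1, 'a', 0.5), (2, 'b', 0.7), (3, 'c', 0.1)]

    # Row layout: all values of one record are adjacent.
    row_layout = [1, 'a', 0.5, 2, 'b', 0.7, 3, 'c', 0.1]

    # Column layout: all values of one column are adjacent, which favors
    # vectorized scans, compression, and reading only some columns.
    column_layout = {
        'id':    [1, 2, 3],
        'name':  ['a', 'b', 'c'],
        'score': [0.5, 0.7, 0.1],
    }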
On Disk and in Memory
• Different trade-offs:
– On disk: storage.
• Accessed by multiple queries.
• Priority to I/O reduction (but still needs good CPU throughput).
• Mostly streaming access.
– In memory: transient.
• Specific to one query execution.
• Priority to CPU throughput (but still needs good I/O).
• Streaming and random access.
Parquet on disk columnar format
• Nested data structures
• Compact format:
– type-aware encodings
– better compression
• Optimized I/O (sketched below):
– Projection push down (column pruning)
– Predicate push down (filters based on statistics)
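A minimal pyarrow sketch of both optimizations (file path and column names are placeholders):

    import pyarrow.parquet as pq

    # Projection push down: only the column chunks for 'a' and 'b' are read.
    # Predicate push down: row groups whose min/max statistics prove that
    # no row can satisfy b > 0 are skipped without being decoded.
    table = pq.read_table(
        'data.parquet',
        columns=['a', 'b'],
        filters=[('b', '>', 0)],
    )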
Parquet nested representation
Schema tree:
Document
├─ DocId
├─ Links
│  ├─ Backward
│  └─ Forward
└─ Name
   ├─ Language
   │  ├─ Code
   │  └─ Country
   └─ Url
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
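In Parquet's schema definition syntax, the Dremel paper's example document type reads as follows (repetition and optionality as in the paper). Each leaf becomes one column, and repetition/definition levels record where every value sits in the nesting:

    message Document {
      required int64 DocId;
      optional group Links {
        repeated int64 Backward;
        repeated int64 Forward;
      }
      repeated group Name {
        repeated group Language {
          required binary Code (UTF8);
          optional binary Country (UTF8);
        }
        optional binary Url (UTF8);
      }
    }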
Access only the data you need
[Diagram: a table with columns a, b, c and rows 1-5, shown three times: projection selects only the needed columns, columnar statistics select only the needed rows, and combined (+ =) just the intersection of cells is read.]
Columnar statistics: read only the data you need!
Parquet file layout
Arrow in memory columnar format
Arrow goals
• Well-documented and cross-language compatible
• Designed to take advantage of modern CPUs
• Embeddable
– in execution engines, storage layers, etc.
• Interoperable
Arrow in memory columnar format
• Nested Data Structures
• Maximize CPU throughput
– Pipelining
– SIMD
– Cache locality
• Scatter/gather I/O
CPU pipeline
Columnar data
persons = [{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}, {
  name: 'Jack',
  age: 37,
  phones: [ '555-333-3333' ]
}]
Record Batch Construction
Stream: schema negotiation, then a dictionary batch, then record batch after record batch.
Each record batch starts with a data header (describes offsets into data), followed by the buffers of each vector:
• name: bitmap, offset, data
• age: bitmap, data
• phones: bitmap, list offset, offset, data
Example record:
{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}
Each box (vector) is contiguous memory.
The entire record batch is contiguous on the wire.
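A minimal pyarrow sketch (current API, shown for illustration) that builds this record batch, one Arrow vector per column:

    import pyarrow as pa

    # Validity bitmaps track nulls; the list offsets delimit each
    # person's phone numbers inside one contiguous values buffer.
    names  = pa.array(['Joe', 'Jack'])
    ages   = pa.array([18, 37])
    phones = pa.array([['555-111-1111', '555-222-2222'],
                       ['555-333-3333']])

    batch = pa.RecordBatch.from_arrays([names, ages, phones],
                                       names=['name', 'age', 'phones'])
    print(batch.schema)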
Vertical integration: Parquet to Arrow
These slides walk the Parquet-to-Arrow read path from a naïve decoder to a vectorized one (a sketch of the contrast follows this list):
• Representation comparison for flat schema
• Naïve conversion: a data-dependent branch per value
• Peeling away abstraction layers: vectorized read (they have layers)
• Bit packing case: no branch
• Run length encoding case: handle a block of defined values, or a block of null values, at once
• Predicate pushdown
• Example: filter and projection, from a naïve implementation to peeling away abstractions
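A minimal, hypothetical Python sketch of the naïve versus run-length-encoded decode (simplified: real Parquet decoding operates on encoded pages, not Python lists):

    def naive_decode(def_levels, values, max_def):
        # Naïve conversion: one data-dependent branch per value.
        out, it = [], iter(values)
        for level in def_levels:
            out.append(next(it) if level == max_def else None)
        return out

    def rle_decode(runs, values, max_def):
        # Run length encoding case: the decoder yields
        # (definition_level, run_length) pairs, so whole blocks of
        # defined or null values are emitted without a per-value branch.
        out, pos = [], 0
        for level, length in runs:
            if level == max_def:
                out.extend(values[pos:pos + length])  # block of defined values
                pos += length
            else:
                out.extend([None] * length)           # block of null values
        return out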
Arrow-based communication
Universal high-performance UDFs
[Diagram: a SQL engine exchanges data with a user defined function running in a separate Python process; SQL Operator 1 produces Arrow data that the UDF reads, and SQL Operator 2 reads the UDF's Arrow output. Both sides read the same memory format.]
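Spark's Arrow-backed pandas UDFs (SPARK-13534, cited in the results below) follow this pattern; a minimal sketch in current PySpark syntax:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()

    @pandas_udf('double')
    def plus_one(v: pd.Series) -> pd.Series:
        # Runs in a Python worker; data arrives and leaves as Arrow
        # record batches, so there is no per-row serialization.
        return v + 1.0

    df = spark.range(5).withColumn('y', plus_one('id'))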
Arrow RPC/REST API
• Generic way to retrieve data in Arrow format
• Generic way to serve data in Arrow format
• Simplify integrations across the ecosystem
• Arrow-based pipe
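The payload such an API carries is the Arrow IPC stream format; a minimal pyarrow sketch of a round trip, where the wire bytes are the in-memory buffers themselves:

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(['a', 'b', 'c'])],
        names=['id', 'tag'])

    # Write: a small header plus the raw vector buffers; no row-by-row
    # encoding step.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)

    # Read: the buffers are mapped back as Arrow vectors as-is.
    reader = pa.ipc.open_stream(sink.getvalue())
    received = reader.read_next_batch()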
RPC: Arrow-based storage interchange
The memory representation is sent over the wire: no serialization overhead.
[Diagram: a SQL execution layer runs several Scanner and Operator pairs; each Scanner sends projection/predicate push down to a storage node (memory and disk) and streams Arrow batches back to its Operator.]
RPC: Arrow-based cache
The memory representation is sent over the wire: no serialization overhead.
[Diagram: SQL execution Operators read from an Arrow-based cache with projection push down; the cache serves Arrow batches directly.]
RPC: single system execution
The memory representation is sent over the wire: no serialization overhead.
[Diagram: Scanners read Parquet files with projection push down (read only a and b), feed Partial Agg operators, shuffle Arrow batches to Agg operators, and produce the Result.]
Results
• PySpark integration: 53x speedup (IBM Spark work on SPARK-13534)
http://s.apache.org/arrowresult1
• Streaming Arrow performance: 7.75 GB/s data movement
http://s.apache.org/arrowresult2
• Arrow Parquet C++ integration: 4 GB/s reads
http://s.apache.org/arrowresult3
• Pandas integration: 9.71 GB/s
http://s.apache.org/arrowresult4
Language Bindings
Parquet
• Target languages:
– Java
– C++
– Python & Pandas
• Engines integration:
– Many!
Arrow
• Target languages:
– Java
– C++, Python
– R (underway)
– C, Ruby, JavaScript
• Engines integration:
– Drill
– Pandas, R
– Spark (underway)
Arrow Releases
Version   Date                 Days   Changes
0.1.0     October 10, 2016     237    178
0.2.0     February 18, 2017    131    195
0.3.0     May 5, 2017          76     311
0.4.0     May 22, 2017         17     89
0.5.0     July 23, 2017        62     138
0.6.0     August 14, 2017      22     93
0.7.0     September 17, 2017   34     137
Current activity:
• Spark Integration (SPARK-13534)
• Pages index in Parquet footer (PARQUET-922)
• Arrow REST API (ARROW-1077)
• Bindings:
– C, Ruby (ARROW-631)
– JavaScript (ARROW-541)
Get Involved
• Join the community
– dev@{arrow,parquet}.apache.org
– http://{arrow,parquet}.apache.org
– Follow @Apache{Parquet,Arrow}
THANK YOU!
julien.ledem@wework.com
Julien Le Dem @J_
June 2018
We’re hiring!
Contact: julien.ledem@wework.com
We’re growing
We’re hiring!
• WeWork doubled in size last year and the year before
• We expect to double in size this year as well
• Broadened scope: WeWork Labs, Powered by We, Flatiron School, Meetup, WeLive, …
WeWork in numbers
• 283 locations in 75 cities internationally
• Over 253,000 members worldwide
• More than 40,000 member companies
Technology jobs, locations:
• San Francisco: platform teams
• New York
• Tel Aviv
• Singapore
Contact: julien.ledem@wework.com
Questions?
Julien Le Dem @J_ julien.ledem@wework.com
June 2018