Vertica Unify 2021 - A Deep Dive of Complex Data Types
Deepak Majeti, Technical Lead
Ryan Roelke, Technical Lead
Overview
▪ Introduction to Complex Data Types
- Description
- Use Cases
- File Formats
▪ Performance Evaluation
- Flattening vs. Complex Data Type
- Join vs. Complex Data Type
Introduction to Complex Data Types
Examples
Customer ROW(
name VARCHAR,
id INT
)
Customer (name, id)
{'Alex', 10001}
{'Mary', 20002}
ShipDates ARRAY[DATE]
ShipDates[DATE]
['2021-07-22', '2021-07-24']
['2021-08-02', '2021-08-01', '2021-06-22']
In-Lined Nested Complex Data Type
WebSales ROW(
customerKey INT,
orders ARRAY[
ROW(itemKey INT,
itemPrice NUMERIC)
],
pageCounters ROW(
page ARRAY[
ROW (pageKey INTEGER,
pageVersion INTEGER)
],
durationSeconds ARRAY[INTEGER]),
shipDate DATE
)
Uses of Complex Data Types
Use Cases - Exploit Data Locality
Use Cases – Analyze Web Sales Data
CREATE EXTERNAL TABLE webSales(sessionKey INT, customerKey INT, shipDate DATE) AS COPY FROM ...;
CREATE EXTERNAL TABLE orders(sessionKey INT, itemKey INT, itemPrice NUMERIC) AS COPY FROM ...;
VS.
CREATE EXTERNAL TABLE webSales(
customerKey INT, shipDate DATE,
orders ARRAY[ROW(itemKey INT, itemPrice NUMERIC)],
pageCounters ROW(pageMetadata ARRAY[ROW(pageKey INTEGER, pageVersion INTEGER)],
durationSeconds ARRAY[INTEGER])) AS COPY FROM ...;
▪ Query the page with maximum duration spent, and ship date of those customers who ordered
more than 5 items in a session
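The query above can be sketched over nested records to show why the complex-type layout needs no join. This is a minimal Python model, not Vertica SQL: the sample rows and the function `max_duration_pages` are hypothetical, and the sketch assumes `pageMetadata` and `durationSeconds` are parallel arrays (entry i of one describes entry i of the other), which the slides do not state.

```python
from datetime import date

# Hypothetical sample rows shaped like the nested webSales schema.
web_sales = [
    {
        "customerKey": 10001,
        "shipDate": date(2021, 7, 22),
        "orders": [{"itemKey": k, "itemPrice": 9.99} for k in range(6)],
        "pageCounters": {
            "pageMetadata": [
                {"pageKey": 1, "pageVersion": 1},
                {"pageKey": 2, "pageVersion": 3},
            ],
            "durationSeconds": [12, 40],
        },
    },
    {
        "customerKey": 20002,
        "shipDate": date(2021, 8, 2),
        "orders": [{"itemKey": 7, "itemPrice": 4.50}],
        "pageCounters": {
            "pageMetadata": [{"pageKey": 5, "pageVersion": 1}],
            "durationSeconds": [90],
        },
    },
]


def max_duration_pages(rows, min_items=5):
    """Return (customerKey, shipDate, pageKey) for sessions with more than min_items orders."""
    results = []
    for row in rows:
        # Filter step: everything needed is already inside the row.
        if len(row["orders"]) <= min_items:
            continue
        counters = row["pageCounters"]
        durations = counters["durationSeconds"]
        if not durations:
            continue
        # Index of the longest duration; assumed to line up with pageMetadata.
        longest = max(range(len(durations)), key=durations.__getitem__)
        page = counters["pageMetadata"][longest]
        results.append((row["customerKey"], row["shipDate"], page["pageKey"]))
    return results
```

Each session is handled in a single pass over one row, whereas the normalized layout has to bring the orders and page counters back together by `sessionKey` first.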
Use Cases – Query on top of Normalized Tables
Use Cases - Query Plan with Join
Join:
webSales x maxDurationPage x itemCount
Filter Step:
itemCount > 5
Join:
pageCounters x pageCounters
Aggregate + GroupBy Step:
itemCount
Use Cases - Query and Plan with Complex Data Types
Load Step:
webSales
Data Formats that Support Complex Data Types
▪ JSON
- Arrays, Objects
- Schema-less, row-oriented, text-based
- Redundant keys on every row
▪ AVRO
- Records, Enums, Arrays, Maps, Unions, and Fixed
- Has a schema, row-oriented, binary compressed
▪ Parquet / ORC
- Arrays, Maps, Structs
- Has a schema, column-oriented, binary compressed
- Additionally, Parquet encodes the complex-type hierarchy as
repetition and definition levels
Complex Types in Vertica
Querying Complex Data Types in Vertica
▪ Flex Tables
- Use a binary format called VMap to store data as key-value pairs
- Schema-less and weakly typed
- Trade performance for flexibility
Flex Tables
"restaurants.json": Restaurant Data in JSON format
{
"name" : "Bob's pizzeria",
"cuisine" : "Italian",
"locations" : ["Cambridge", "Pittsburgh"],
"menu" : [{"item" : "cheese pizza", "price" : "$8.25"},
{"item" : "spinach pizza", "price" : "$10.50"}]
}
{
"name" : "Bakersfield Tacos",
"cuisine" : "Mexican",
"locations" : ["Pittsburgh"],
"capacity" : 150,
"menu" : [{"item" : "veggie taco", "price" : "$9.95"},
{"item" : "steak taco", "price" : "$10.95"}]
}
Loading into a Flex Table
=> CREATE FLEX TABLE restaurants();
=> COPY restaurants FROM 's3://bucket/data/restaurants.json' PARSER fjsonparser();
Rows Loaded
--------------
2
(1 row)
The __raw__ column contains the key-value pairs, encoded in the VMap binary format and backed by the LONG VARBINARY type
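As a rough illustration of what "key-value pairs" means for a nested JSON record, the sketch below flattens one record into a single-level map. It is a simplified model only: the function `flatten` and the dotted, index-based key names are choices made for this example, and the sketch does not reproduce the VMap binary encoding or the parser's actual flattening options.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested objects and arrays into one {key: scalar} map."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        # Array elements are keyed by position (an assumption of this sketch).
        items = ((str(i), v) for i, v in enumerate(value))
    else:
        return {prefix: value}
    pairs = {}
    for key, child in items:
        child_prefix = f"{prefix}.{key}" if prefix else key
        pairs.update(flatten(child, child_prefix))
    return pairs


record = json.loads(
    '{"name": "Bakersfield Tacos", "locations": ["Pittsburgh"], "capacity": 150,'
    ' "menu": [{"item": "veggie taco", "price": "$9.95"}]}'
)
pairs = flatten(record)
```

Because every record carries its own keys, a record may have keys that others lack (such as `capacity`), which is the flexibility that flex tables trade performance for.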
Materializing Flex Tables
Strongly Typed Tables
New Types added to Vertica: ROW, ARRAY, SET
▪ SET is a collection of elements where the order is not defined. Implemented to optimize element look-up.
Features Unique to Vertica
Optimizations
Native Internal Representation
▪ Complex type values are represented as a combination of structure data and value data
▪ Values are stored in their native binary format
▪ This columnar representation aligns with Vertica's execution engine for optimal performance
Type: ARRAY[ARRAY[INT]]
Data: [NULL, [NULL, 100], [200]]
Array[isNull]: [NULL, False, False]
Array[offsets]: [0, 0, 2, 3]
Array[Elements]: [NULL, 100, 200]
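The split into structure data and value data can be sketched as follows. This is an illustrative Python model, not Vertica's internal code: the function names are hypothetical, and the sketch marks a NULL inner array with `True` in the null-flag list, which may differ from how the example above writes that first flag.

```python
def encode_nested_int_array(value):
    """Split one ARRAY[ARRAY[INT]] value into null flags, offsets, and flat elements."""
    is_null = []   # structure data: one flag per inner array
    offsets = [0]  # structure data: offsets[i]..offsets[i + 1] bounds inner array i
    elements = []  # value data: all inner elements, concatenated
    for inner in value:
        if inner is None:
            is_null.append(True)
        else:
            is_null.append(False)
            elements.extend(inner)
        # A NULL inner array contributes no elements, so its offset repeats.
        offsets.append(len(elements))
    return is_null, offsets, elements


def decode_nested_int_array(is_null, offsets, elements):
    """Rebuild the nested value from the three flat lists."""
    return [
        None if null else elements[offsets[i]:offsets[i + 1]]
        for i, null in enumerate(is_null)
    ]


is_null, offsets, elements = encode_nested_int_array([None, [None, 100], [200]])
```

The value list stays a flat, contiguous run of integers, which is the property that lets a columnar engine scan it without walking the nesting.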
Performance Evaluation
▪ We construct a micro-benchmark to evaluate flattening the data vs. using a complex data type
▪ The micro-benchmark consists of user ids and the search words each user searched for
- The data set consists of 1,000,000 users, each with up to 1,000 distinct search words
▪ A four-node Vertica cluster was used, and the data was segmented on userName
▪ The table schemas used are below
Flattening vs. Complex Data Types
▪ The queries used to evaluate benefit from locality
Normalization vs. Complex Data Types
▪ We construct a benchmark to evaluate normalizing the data vs. using a complex data type
▪ The benchmark is based on the webSales schema
- The data set consists of 100,000,000 sessions
- Each session contains up to 10 items ordered
- Each session contains up to 10 page counters
- The data is stored on HDFS in the Parquet file format
▪ Query the page with maximum duration spent, and ship date of those customers who ordered more than
5 items in a session
CREATE EXTERNAL TABLE webSales(sessionKey INT, customerKey INT, shipDate DATE) AS COPY FROM ...;
CREATE EXTERNAL TABLE orders(sessionKey INT, itemKey INT, itemPrice NUMERIC) AS COPY FROM ...;
VS.
CREATE EXTERNAL TABLE webSales(
customerKey INT, shipDate DATE,
orders ARRAY[ROW(itemKey INT, itemPrice NUMERIC), 10],
pageCounters ROW(pageMetadata ARRAY[ROW(pageKey INTEGER, pageVersion INTEGER), 10],
durationSeconds ARRAY[INTEGER, 10])) AS COPY FROM ...;
Best Practices
Complex Data Types
▪ Use complex types when
- Queries can exploit data locality
- Data duplication can be avoided
- The data is sparse
▪ Simplify queries
▪ Take advantage of array bounds where possible
▪ Use SETs where element order does not matter
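The data-duplication point can be illustrated with a small model of the two layouts: one row per (user, search word) pair versus one row per user holding an array of words. This is a hypothetical Python sketch with made-up sample data; it does not reproduce the benchmark schemas or any measured result.

```python
def flatten_rows(nested):
    """One output row per (user, word): the user value repeats for every word."""
    return [(user, word) for user, words in nested for word in words]


def nest_rows(flat):
    """Group flat (user, word) rows back into one row per user."""
    grouped = {}
    for user, word in flat:
        grouped.setdefault(user, []).append(word)
    return list(grouped.items())


# Hypothetical sample: two users and their search words.
nested = [("alex", ["vertica", "parquet", "orc"]), ("mary", ["json"])]
flat = flatten_rows(nested)

user_values_flat = len(flat)      # user stored once per word
user_values_nested = len(nested)  # user stored once per row
```

In the flat layout the user value is stored once per search word, and a per-user query has to regroup the rows; in the nested layout the words for a user already sit together in one row.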
11.0 Release
Features
▪ More 1D Array support
- JSON, AVRO, Kafka formats
- string_to_array
- Bounded arrays
- Implode
- Casting
- Export to Parquet
▪ JDBC Client Driver Support
- ARRAY and SET are java.sql.Array
- ROW and MAP are java.sql.Struct
- Access with ResultSet getArray and getObject
Future Roadmap
Complex Types Roadmap
▪ ROS support
▪ JSON / AVRO support
▪ UNNEST
▪ Map support
▪ Export Complex Types
▪ Combine Flex with Strong Complex Types
- Handle semi-structured data efficiently
▪ DB Designer
▪ Considering adding support in Vertica Python and other Client Drivers
Thank you