Vertica Unify 2021 - A Deep Dive of Complex Data Types

The document provides an overview of complex data types in Vertica including: 1) Examples of complex data types like rows, arrays, and maps and how they can represent nested or multi-valued data compared to normalized tables. 2) Use cases for complex data types like co-locating related data to avoid joins, modeling domains where complex types are common, and handling sparse data. 3) How complex data types are supported in file formats like JSON, Avro, Parquet/ORC and how they are loaded and queried in Vertica using flex tables and strongly typed complex types.

A Deep Dive of Complex Data Types
Deepak Majeti, Technical Lead
Ryan Roelke, Technical Lead
Overview
▪ Introduction to Complex Data Types
- Description
- Use Cases
- File Formats
▪ Complex Data Types in Vertica
- Flex Tables
- Strongly-Typed Tables
- Comparison with other Tools
▪ Optimizations
- Native Internal Representation
- Field Pushdown
▪ Performance Evaluation
- Flattening vs. Complex Data Type
- Join vs. Complex Data Type
▪ Best Practices
- When and how to use Complex Types
▪ 11.0 Release
- See the list of exciting features coming soon to a Vertica DB near you!
▪ Future Roadmap
Introduction to Complex Data Types
Examples

Customer ROW(name VARCHAR, id INT)
    {'Alex', 10001}
    {'Mary', 20002}

ShipDates ARRAY[DATE]
    ['2021-07-22', '2021-07-24']
    ['2021-08-02', '2021-08-01', '2021-06-22']

HttpRequests MAP<VARCHAR, VARCHAR> (headerKey, headerValue)
    [<'Pragma', 'no-cache'>, <'Host', 'ab.c.com'>]
    [<'Host', 'a.bc.com'>, <'Request', 'b.bc.com'>]

▪ ROW is the same as STRUCT

▪ MAP<VARCHAR, VARCHAR> can be handled using ARRAY[ROW(key VARCHAR, value VARCHAR)]
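
The MAP-as-ARRAY[ROW] idea can be sketched in plain Python. This is only an illustration: the helper names `map_to_rows` and `lookup` are hypothetical, and Python dicts and lists stand in for Vertica types.

```python
# Hypothetical helpers: Python containers stand in for MAP and ARRAY[ROW(key, value)].
def map_to_rows(header_map):
    """Turn a mapping into a list of {'key', 'value'} rows, keeping insertion order."""
    return [{"key": k, "value": v} for k, v in header_map.items()]


def lookup(rows, key):
    """Return the first value stored under `key`, or None when the key is absent."""
    for row in rows:
        if row["key"] == key:
            return row["value"]
    return None


http_request = {"Pragma": "no-cache", "Host": "ab.c.com"}
rows = map_to_rows(http_request)
host = lookup(rows, "Host")
```

A lookup becomes a scan over the array of rows, which is the trade-off of this representation compared with a native map.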

In-Lined Nested Complex Data Type

WebSales ROW(
    customerKey INT,
    orders ARRAY[
        ROW(itemKey INT,
            itemPrice NUMERIC)
    ],
    pageCounters ROW(
        page ARRAY[
            ROW(pageKey INTEGER,
                pageVersion INTEGER)
        ],
        durationSeconds ARRAY[INTEGER]),
    shipDate DATE
)

Uses of Complex Data Types

▪ Co-locate related data in the same row
- Avoid data normalization and the resultant costly join
- Avoid data duplication and the resultant extra storage cost
▪ Natural fit for data coming from domains where Complex Types are first class citizens
- NoSQL, Python, etc.
▪ Better handle sparse data prominent in Web 2.0 applications
▪ Simplify query analyses
- When joins are avoided
- By using rich functions that can iteratively process data

 User Name | Search Words
-----------+----------------------------------------------------
 Adam      | 'XBox', 'Board Games', 'Movies', 'Gardening Tools'
 Eve       | 'PS5', 'Baking Goods', 'Movies', 'Hiking Gear'
Use Cases - Exploit Data Locality

▪ Queries in Clickstream or Search applications benefit if the data is co-located


▪ Popular queries:
- Count the number of search words used by a user
- Find if a search word is used by a user
▪ Arrays/Sets can be used for these use-cases

=> CREATE TABLE words_flat(userName VARCHAR, words VARCHAR);

VS.

=> CREATE TABLE words_set(userName VARCHAR, words SET[VARCHAR]);

Use Cases - Exploit Data Locality

Count the number of search words used by each user


=> SELECT userName, count(DISTINCT words) FROM words_flat GROUP BY userName;

Find users that used a specific search word


=> SELECT userName FROM words_flat WHERE words = 'XBox';

VS.

Count the number of search words used by each user


=> SELECT userName, apply_count(words) FROM words_set;

Find users that used a specific search word


=> SELECT userName FROM words_set WHERE contains(words,'XBox');
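
The difference between the two layouts can be sketched in plain Python. This is a rough model, not Vertica's execution: the sample data and function names are made up for illustration, and `len` and `in` only play the role of apply_count and contains.

```python
# Flattened layout: one (userName, word) pair per row, like words_flat.
words_flat = [
    ("Adam", "XBox"), ("Adam", "Movies"), ("Adam", "XBox"),
    ("Eve", "PS5"), ("Eve", "Movies"),
]

# Co-located layout: one row per user holding a set of words, like words_set.
words_set = {"Adam": {"XBox", "Movies"}, "Eve": {"PS5", "Movies"}}


def count_flat(rows):
    """count(DISTINCT words) ... GROUP BY userName: rows must be grouped first."""
    groups = {}
    for user, word in rows:
        groups.setdefault(user, set()).add(word)
    return {user: len(words) for user, words in groups.items()}


def count_set(table):
    """apply_count analogue: each row already holds all of its words."""
    return {user: len(words) for user, words in table.items()}


def users_with_word_flat(rows, word):
    """Scan every (user, word) row and collect the matching users."""
    return sorted({user for user, w in rows if w == word})


def users_with_word_set(table, word):
    """contains analogue: one membership test per user row."""
    return sorted(user for user, words in table.items() if word in words)
```

Both layouts give the same answers; the co-located one answers them row by row, without a grouping pass.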

Use Cases – Analyze Web Sales Data
CREATE EXTERNAL TABLE webSales(sessionKey INT, customerKey INT, shipDate DATE) AS COPY FROM ...;

CREATE EXTERNAL TABLE orders(sessionKey INT, itemKey INT, itemPrice NUMERIC) AS COPY FROM ...;

CREATE EXTERNAL TABLE pageCounters(sessionKey INT, pageKey INT, pageVersion INT,
    durationSeconds INT) AS COPY FROM ...;

VS.

CREATE EXTERNAL TABLE webSales(
    customerKey INT, shipDate DATE,
    orders ARRAY[ROW(itemKey INT, itemPrice NUMERIC)],
    pageCounters ROW(page ARRAY[ROW(pageKey INTEGER, pageVersion INTEGER)],
                     durationSeconds ARRAY[INTEGER])) AS COPY FROM ...;

▪ Query: find the page with the maximum duration spent and the ship date for customers who ordered more than 5 items in a session
Use Cases – Query on top of Normalized Tables

=> SELECT maxDurationPageKey, maxDurationPageVersion, shipDate FROM
     (SELECT sessionKey, shipDate FROM webSales) webSales
   JOIN
     (SELECT pageCounters.sessionKey, pageKey maxDurationPageKey,
             pageVersion maxDurationPageVersion FROM pageCounters
      JOIN
        (SELECT sessionKey, max(durationSeconds) durationSeconds FROM
         pageCounters GROUP BY sessionKey) maxDurationPage
      USING (sessionKey, durationSeconds)) maxDurationPage
   USING (sessionKey)
   JOIN
     (SELECT sessionKey, count(itemKey) itemCount FROM orders
      GROUP BY sessionKey) AS numOrders
   USING (sessionKey) WHERE itemCount > 5;
Use Cases - Query Plan with Join

Join: webSales x maxDurationPage x itemCount
Filter Step: itemCount > 5
Join: pageCounters x pageCounters
Aggregate + GroupBy Step: itemCount
Load Step: webSales
Load Step: pageCounters
Load Step: orders
Use Cases - Query and Plan with Complex Data Types

=> SELECT pageCounters.page[array_find(pageCounters.durationSeconds, apply_max(pageCounters.durationSeconds))],
          shipDate FROM webSales WHERE apply_count(orders) > 5;

Expression Evaluation Step:
    index = array_find(pageCounters.durationSeconds, apply_max(pageCounters.durationSeconds))
    pageCounters.page[index]

Scalar + Filter Step:
    apply_count(orders) > 5

Load Step:
    webSales
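
The three plan steps can be traced in plain Python. This is a sketch under assumptions: the sample sessions are invented, nested dicts and lists stand in for ROW and ARRAY, and `max`, `list.index` and `len` play the roles of apply_max, array_find and apply_count.

```python
def page_with_max_duration(session):
    """Expression evaluation step: pick the page at the index of the largest duration."""
    counters = session["pageCounters"]
    durations = counters["durationSeconds"]
    if not durations:
        return None
    index = durations.index(max(durations))  # first position of the maximum
    return counters["page"][index]


def query(sessions):
    """Filter step (more than 5 orders), then per-row expression evaluation."""
    return [
        (page_with_max_duration(s), s["shipDate"])
        for s in sessions
        if len(s["orders"]) > 5
    ]


# Load step: two invented sessions in the nested webSales shape.
sessions = [
    {
        "customerKey": 1,
        "shipDate": "2021-07-22",
        "orders": [{"itemKey": k, "itemPrice": 1.0} for k in range(6)],
        "pageCounters": {
            "page": [{"pageKey": 10, "pageVersion": 1}, {"pageKey": 11, "pageVersion": 2}],
            "durationSeconds": [30, 95],
        },
    },
    {
        "customerKey": 2,
        "shipDate": "2021-07-24",
        "orders": [{"itemKey": 7, "itemPrice": 2.5}, {"itemKey": 8, "itemPrice": 4.0}],
        "pageCounters": {
            "page": [{"pageKey": 12, "pageVersion": 1}],
            "durationSeconds": [12],
        },
    },
]
result = query(sessions)
```

Every step works on a single row, which is why no join or group-by appears in the plan.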

Data Formats that Support Complex Data Types

▪ Complex data types are supported by many popular file formats

▪ JSON
- Arrays, Objects
- Schema-less, row-oriented, text-based
- Redundant keys on every row

▪ AVRO
- Records, Enums, Arrays, Maps, Unions, and Fixed
- Has a schema, row-oriented, binary compressed

Data Formats that Support Complex Data Types
▪ Parquet / ORC
- Arrays, Maps, Structs
- Has a schema, column-oriented, binary compressed
- Additionally, Parquet encodes the complex-type hierarchy as repetition and definition levels

 File Type   | Schema | Binary Format | Columnar
-------------+--------+---------------+----------
 JSON        | NO     | NO            | NO
 AVRO        | YES    | YES           | NO
 PARQUET/ORC | YES    | YES           | YES
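
Repetition and definition levels can be illustrated with a small Python sketch. It assumes one column shaped as an optional list of required strings (an assumption chosen to keep the example small), so the maximum definition level is 2 and the maximum repetition level is 1. It models the idea only, not Parquet's on-disk encoding.

```python
def encode_levels(rows):
    """Flatten rows (None, [] or a list of strings) into levels plus a value stream."""
    reps, defs, values = [], [], []
    for row in rows:
        if row is None:          # list itself is NULL: nothing defined
            reps.append(0)
            defs.append(0)
        elif len(row) == 0:      # list exists but has no elements
            reps.append(0)
            defs.append(1)
        else:
            for i, value in enumerate(row):
                reps.append(0 if i == 0 else 1)  # 0 starts a new row, 1 continues it
                defs.append(2)                   # element is fully defined
                values.append(value)
    return reps, defs, values


def decode_levels(reps, defs, values):
    """Rebuild the original rows from the levels and the value stream."""
    rows, stream = [], iter(values)
    for rep, definition in zip(reps, defs):
        if rep == 1:
            rows[-1].append(next(stream))
        elif definition == 0:
            rows.append(None)
        elif definition == 1:
            rows.append([])
        else:
            rows.append([next(stream)])
    return rows


rows = [["XBox", "Movies"], None, [], ["PS5"]]
reps, defs, values = encode_levels(rows)
```

Only non-NULL leaf values land in the value stream; the two level streams carry the structure.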

Complex Types in Vertica
Querying Complex Data Types in Vertica

▪ Flex Tables
- Use a text format called VMap to store data as key-value pairs
- Schema-less, weakly typed
- Data represented in VMap (text) format
- Trade performance for flexibility

▪ Strongly-Typed Complex Types
- New Complex Data Types introduced
- Have a schema, strongly typed
- Data represented in native format
- Give the best possible performance
Flex Tables
"restaurants.json": Restaurant Data in JSON format

{
    "name" : "Bob's pizzeria",
    "cuisine" : "Italian",
    "locations" : ["Cambridge", "Pittsburgh"],
    "menu" : [{"item" : "cheese pizza", "price" : "$8.25"},
              {"item" : "spinach pizza", "price" : "$10.50"}]
}
{
    "name" : "Bakersfield Tacos",
    "cuisine" : "Mexican",
    "locations" : ["Pittsburgh"],
    "capacity" : 150,
    "menu" : [{"item" : "veggie taco", "price" : "$9.95"},
              {"item" : "steak taco", "price" : "$10.95"}]
}
Loading into a Flex Table
=> CREATE FLEX TABLE restaurants();
=> COPY restaurants FROM 's3://bucket/data/restaurants.json' PARSER fjsonparser();
Rows Loaded
--------------
2
(1 row)

=> SELECT * FROM restaurants;


__identity__ | __raw__
--------------+------------------------------------------------------
1 | ...\000..cheese pizza$8.25spinach pizza$10.50...\000
2 | ...\001...\000taco$9.95steak taco$10.95...\000\100
(2 rows)

The __raw__ column contains key-value pairs encoded in the VMap binary format, backed by the LONG VARBINARY type
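
The key-value idea can be pictured with a short Python sketch that flattens one JSON record into string pairs. The dotted key paths and the `flatten` helper are assumptions for illustration only; they do not describe the actual VMap layout.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested objects and arrays into {'a.b.0': 'text'} pairs; leaves become strings."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = ((str(i), v) for i, v in enumerate(value))
    else:
        return {prefix: str(value)}
    pairs = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else key
        pairs.update(flatten(child, path))
    return pairs


record = json.loads(
    '{"name": "Bakersfield Tacos", "capacity": 150, "locations": ["Pittsburgh"],'
    ' "menu": [{"item": "veggie taco", "price": "$9.95"}]}'
)
pairs = flatten(record)
```

Every value ends up as text, including the numeric capacity, which mirrors the schema-less, weakly typed trade-off described for flex tables.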
Materializing Flex Tables

Turn your schema-less data into strong types

=> ALTER TABLE restaurants ADD COLUMN "name" VARCHAR;
=> ALTER TABLE restaurants ADD COLUMN "cuisine" VARCHAR;
=> ALTER TABLE restaurants ADD COLUMN "locations" ARRAY[VARCHAR];
=> ALTER TABLE restaurants ADD COLUMN "menu" LONG VARBINARY;
=> COPY restaurants FROM 's3://bucket/data/restaurants.json' PARSER fjsonparser();
Strongly Typed Tables
New Types added to Vertica: ROW, ARRAY, SET

▪ SET is a collection of elements where the order is not defined. Implemented to optimize element look-up.

=> CREATE EXTERNAL TABLE
     orders (customer ROW(id INT, name VARCHAR),
             items ARRAY[ARRAY[VARCHAR, 100]],
             contact ARRAY[ROW(mobile VARCHAR, zip VARCHAR)])
   AS COPY FROM 's3://bucket/data/orders.parquet' PARQUET;

=> SELECT customer.id FROM orders;
=> SELECT items[0][1] FROM orders;
=> SELECT contact[0].mobile FROM orders;
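
The three access paths can be mirrored on a nested Python value. This is a sketch only: the sample row is invented, and the `get_path` helper, including its choice to return None for a missing step, is a hypothetical convenience rather than a statement about Vertica's behavior.

```python
def get_path(value, *steps):
    """Follow field names and zero-based indexes; return None if any step is missing."""
    for step in steps:
        try:
            value = value[step]
        except (KeyError, IndexError, TypeError):
            return None
    return value


# One invented row in the shape of the orders table above.
order = {
    "customer": {"id": 10001, "name": "Alex"},
    "items": [["pen", "ink"], ["paper"]],
    "contact": [{"mobile": "555-0100", "zip": "02139"}],
}

customer_id = get_path(order, "customer", "id")           # customer.id
second_item = get_path(order, "items", 0, 1)              # items[0][1]
first_mobile = get_path(order, "contact", 0, "mobile")    # contact[0].mobile
```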

Features Unique to Vertica

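 Feature                            | Vertica | BigQuery | Snowflake | Spark
------------------------------------+---------+----------+-----------+-------
 SET Type                           | Yes     | Yes      | No        | No
 Full Support for Null/Empty Arrays | Yes     | No       | Yes       | No
 Bounded Arrays                     | Yes     | No       | No        | No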
Optimizations
Native Internal Representation
▪ Complex type values are represented as a combination of structure data and value data
▪ Values are stored in their native binary format
▪ This columnar representation aligns with Vertica's execution engine for optimal performance

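Type: ROW(id INT, name VARCHAR)

 Data           | ROW(isNull) | id   | name
----------------+-------------+------+---------
 (1, 'Vertica') | False       | 1    | Vertica
 NULL           | NULL        | NULL | NULL

Type: ARRAY[ARRAY[INT]]

 Data                       | Array[isNull]        | Array[offsets] | Array[Elements]
----------------------------+----------------------+----------------+------------------
 [NULL, [NULL, 100], [200]] | [NULL, False, False] | [0, 0, 2, 3]   | [NULL, 100, 200]
 NULL                       | NULL                 | NULL           | NULL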
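
The split into structure data and value data can be sketched for one ARRAY[ARRAY[INT]] value in Python. This is a simplified model, not Vertica's storage code: the function names are hypothetical, and the sketch uses a plain boolean True for the null inner array where the slide prints NULL.

```python
def encode(arrays):
    """Split a list of inner arrays into null flags, offsets and a flat element list."""
    is_null, offsets, elements = [], [0], []
    for inner in arrays:
        is_null.append(inner is None)
        if inner is not None:
            elements.extend(inner)
        offsets.append(len(elements))  # end position of this inner array
    return is_null, offsets, elements


def decode(is_null, offsets, elements):
    """Rebuild the inner arrays by slicing the flat element list between offsets."""
    return [
        None if null else elements[offsets[i]:offsets[i + 1]]
        for i, null in enumerate(is_null)
    ]


value = [None, [None, 100], [200]]
is_null, offsets, elements = encode(value)
```

The offsets and elements match the table above: inner array i spans elements[offsets[i]:offsets[i + 1]].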


Field Pushdown
▪ Materialize only the required fields

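=> SELECT customer.id FROM requests;

 Data           | ROW(isNull) | id   | name
----------------+-------------+------+---------
 (1, 'Vertica') | False       | 1    | Vertica
 NULL           | NULL        | NULL | NULL

=> SELECT count(customer) FROM requests;

 Data           | ROW(isNull) | id   | name
----------------+-------------+------+---------
 (1, 'Vertica') | False       | 1    | Vertica
 NULL           | NULL        | NULL | NULL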
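
Field pushdown can be sketched with a toy columnar store in Python that records which columns each query touches. The class and function names are hypothetical, and which columns Vertica actually materializes for each query is an assumption of this sketch.

```python
class RowColumn:
    """Columnar storage for ROW(id INT, name VARCHAR): one list per field plus null flags."""

    def __init__(self, rows):
        self.columns = {
            "is_null": [r is None for r in rows],
            "id": [None if r is None else r[0] for r in rows],
            "name": [None if r is None else r[1] for r in rows],
        }
        self.reads = []  # names of the columns that queries have touched

    def read(self, name):
        self.reads.append(name)
        return self.columns[name]


def select_id(col):
    """SELECT customer.id: only the id field is materialized."""
    return list(col.read("id"))


def count_rows(col):
    """SELECT count(customer): only the null flags are needed."""
    return sum(1 for null in col.read("is_null") if not null)


customers = RowColumn([(1, "Vertica"), None])
ids = select_id(customers)
non_null = count_rows(customers)
```

Neither query reads the name field, which is the saving the slide describes.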
Performance Evaluation
Flattening vs. Complex Data Types

▪ We construct a micro-benchmark to evaluate flattening the data vs. using a complex data type
▪ The micro-benchmark consists of user ids and the search words used by each user
- The data set consists of 1,000,000 users, each with up to 1000 distinct search words
▪ A four-node Vertica cluster was used, and the data was segmented on userName
▪ The table schemas are shown below

=> CREATE TABLE words_flat (userName VARCHAR, words VARCHAR);
=> CREATE TABLE words_array (userName VARCHAR, words ARRAY[VARCHAR, 1000]);
=> CREATE TABLE words_set (userName VARCHAR, words SET[VARCHAR, 1000]);
Flattening vs. Complex Data Types
▪ The queries used to evaluate the benefit of locality

Query 1: Count the number of search words used by each user

=> SELECT userName, count(words) FROM words_flat GROUP BY userName;
=> SELECT userName, apply_count(words) FROM words_array;
=> SELECT userName, apply_count(words) FROM words_set;

Query 2: Find users that used a specific search word

=> SELECT userName FROM words_flat WHERE words = 'XBox';
=> SELECT userName FROM words_array WHERE contains(words,'XBox');
=> SELECT userName FROM words_set WHERE contains(words,'XBox');
Flattening vs. Complex Data Types
[Figures: benchmark result charts]
Normalization vs. Complex Data Types

▪ We construct a benchmark to evaluate normalizing the data vs. using a complex data type
▪ The benchmark is based on the webSales schema
- The data set consists of 100,000,000 sessions
- Each session contains up to 10 items ordered
- Each session contains up to 10 page counters
- This data is stored on HDFS in Parquet file format
Normalization vs. Complex Data Types
▪ Query: find the page with the maximum duration spent and the ship date for customers who ordered more than 5 items in a session

CREATE EXTERNAL TABLE webSales(sessionKey INT, customerKey INT, shipDate DATE) AS COPY FROM ...;

CREATE EXTERNAL TABLE orders(sessionKey INT, itemKey INT, itemPrice NUMERIC) AS COPY FROM ...;

CREATE EXTERNAL TABLE pageCounters(sessionKey INT, pageKey INT, pageVersion INT,
    durationSeconds INT) AS COPY FROM ...;

VS.

CREATE EXTERNAL TABLE webSales(
    customerKey INT, shipDate DATE,
    orders ARRAY[ROW(itemKey INT, itemPrice NUMERIC), 10],
    pageCounters ROW(page ARRAY[ROW(pageKey INTEGER, pageVersion INTEGER), 10],
                     durationSeconds ARRAY[INTEGER, 10])) AS COPY FROM ...;
Normalization vs. Complex Data Types

=> SELECT maxDurationPageKey, maxDurationPageVersion, shipDate FROM
     (SELECT sessionKey, shipDate FROM webSales) webSales
   JOIN
     (SELECT pageCounters.sessionKey, pageKey maxDurationPageKey,
             pageVersion maxDurationPageVersion FROM pageCounters
      JOIN
        (SELECT sessionKey, max(durationSeconds) durationSeconds FROM
         pageCounters GROUP BY sessionKey) maxDurationPage
      USING (sessionKey, durationSeconds)) maxDurationPage
   USING (sessionKey)
   JOIN
     (SELECT sessionKey, count(itemKey) itemCount FROM orders
      GROUP BY sessionKey) AS numOrders
   USING (sessionKey) WHERE itemCount > 5;
Normalization vs. Complex Data Types

=> SELECT pageCounters.page[array_find(pageCounters.durationSeconds, apply_max(pageCounters.durationSeconds))],
          shipDate FROM webSales WHERE apply_count(orders) > 5;
Normalization vs. Complex Data Types
[Figures: benchmark result charts]
Best Practices
Complex Data Types
▪ Use Complex Types when
- Queries can exploit data locality
- Data duplication can be avoided
- The data is sparse
▪ Simplify queries
▪ Take advantage of array bounds where possible
▪ Use SETs where element order does not matter
11.0 Release
Features

▪ Complete Support for Parquet file format
▪ Complete Support for ORC file format
▪ Supported Functions
- Explode
- TO_JSON
- Array functions
▪ Create Views
▪ CTAS
- Limited to types supported in ROS
▪ SDK support
Features
▪ More 1D Array support
- JSON, AVRO, Kafka formats
- string_to_array
- Bounded arrays
- Implode
- Casting
- Export to Parquet
▪ JDBC Client Driver Support
- ARRAY and SET are java.sql.Array
- ROW and MAP are java.sql.Struct
- Access with ResultSet getArray and getObject
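
The Explode and Implode entries in the feature lists name inverse operations, which a short Python sketch can illustrate. The function bodies are an illustration of the general idea, assuming (key, array) input rows and (key, position, value) output rows; they are not Vertica's implementation or exact output format.

```python
def explode(rows):
    """Turn each (key, array) row into one (key, position, value) row per element."""
    return [
        (key, position, value)
        for key, array in rows
        for position, value in enumerate(array)
    ]


def implode(rows):
    """Collect (key, position, value) rows back into one (key, array) row per key."""
    grouped = {}
    for key, _position, value in rows:
        grouped.setdefault(key, []).append(value)
    return list(grouped.items())


rows = [("Adam", ["XBox", "Movies"]), ("Eve", ["PS5"])]
exploded = explode(rows)
round_trip = implode(exploded)
```

In this sketch a row with an empty array produces no exploded rows, so it would not survive the round trip.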

Future Roadmap
Complex Types Roadmap

▪ ROS support
▪ JSON / AVRO support
▪ UNNEST
▪ Map support
▪ Export Complex Types
▪ Combine Flex with Strong Complex Types
- Handle semi-structured data efficiently
▪ DB Designer
▪ Considering adding support in Vertica Python and other Client Drivers
Thank you

Join the Vertica Academy: academy.vertica.com
