Iceberg Spark Catalog Configuration Guide

This document provides an overview of Iceberg's capabilities for creating and altering tables, inserting and merging data, and working with catalogs and metadata tables in Spark SQL. It describes Iceberg's support for primitive and nested data types, partitioning transforms, schema evolution operations, and writing data from DataFrames.


Command reference for Iceberg with Spark 3.3

CREATE and ALTER TABLE

Example syntax:

CREATE TABLE IF NOT EXISTS logs (
    level string, event_ts timestamp, msg string, ...)
USING iceberg
PARTITIONED BY (level, hours(event_ts))

Catalogs

Configure a catalog called "sandbox":

spark.sql.extensions=\
    org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.sandbox=\
    org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.sandbox.type=rest
spark.sql.catalog.sandbox.uri=...
spark.sql.defaultCatalog=sandbox
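The same pattern works for other catalog types; a hedged sketch of a filesystem-backed ("hadoop") catalog, where the warehouse path is a hypothetical example:

spark.sql.catalog.sandbox=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.sandbox.type=hadoop
spark.sql.catalog.sandbox.warehouse=s3://bucket/warehouse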

Supported types

Primitive types:

boolean, int, bigint, float, double, decimal(P,S),
date, timestamp, string, binary

Note: Spark's timestamp type is Iceberg's timestamp with time zone type

Nested types:

struct<name type, ...>, array<item_type>, map<key_type, value_type>
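As a sketch, a DDL declaring one of each nested type (table and column names are hypothetical; Spark's DDL parser uses colons between struct field names and types):

CREATE TABLE sandbox.db.readings (
    id bigint,
    location struct<lat: double, lon: double>,
    tags array<string>,
    attrs map<string, string>)
USING iceberg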


Supported partition transforms

column                Partition by the unmodified column value
years(event_ts)       Year granularity, e.g. 2023
months(event_ts)      Month granularity, e.g. 2023-03
days(event_ts)        Day granularity, e.g. 2023-03-01
hours(event_ts)       Hour granularity, e.g. 2023-03-01-10
truncate(width, col)  Truncate strings or numbers in col
bucket(width, col)    Hash col values into width buckets
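Transforms can be combined in a single PARTITIONED BY clause; a sketch with hypothetical table and column names:

CREATE TABLE sandbox.db.events (
    id bigint, account string, event_ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), truncate(4, account), days(event_ts))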


Working with multiple catalogs in SQL

See the session's current catalog and database:

SHOW CURRENT NAMESPACE

Set the current catalog and database:

USE catalog.database

List databases and tables:

SHOW DATABASES
SHOW TABLES

Writes

INSERT

INSERT INTO table SELECT id, data FROM ...
INSERT INTO table VALUES (1, 'a'), (2, 'b'), ...

MERGE

MERGE INTO target_table t
USING source_changes s ON t.id = s.id
WHEN MATCHED AND s.op = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.count = t.count + s.count
WHEN NOT MATCHED THEN INSERT (id, count) VALUES (s.id, s.count)

For performance, add filters to the ON clause for the target table:

ON t.id = s.id AND t.event_ts >= date_add(current_date(), -2)

UPDATE

UPDATE table SET count = count + 1 WHERE id = 5

DELETE FROM

DELETE FROM table WHERE id = 5

Copy-on-write vs merge-on-read

UPDATE, DELETE, and MERGE use the mode set by the write.(update|delete|merge).mode table properties.

Note: When in doubt, use copy-on-write for the best read performance.

To enable merge-on-read:

ALTER TABLE target_table SET TBLPROPERTIES (
    'format-version'='2',
    'write.update.mode'='merge-on-read',
    'write.delete.mode'='merge-on-read',
    'write.merge.mode'='merge-on-read')

Schema evolution (ALTER TABLE table ...)

ADD COLUMN line_no int AFTER event_ts

-- widen type (int to bigint, float to double, etc.)
ALTER COLUMN line_no TYPE bigint

ALTER COLUMN line_no COMMENT 'Line number'
ALTER COLUMN line_no FIRST
ALTER COLUMN line_no AFTER event_ts
RENAME COLUMN msg TO message
DROP COLUMN line_no

Adding/updating nested types:

ADD COLUMN location struct<lat: float, long: float>
ADD COLUMN location.alt float

Note: ALTER COLUMN can't modify struct types

Alter partition spec

ALTER TABLE ... ADD PARTITION FIELD days(event_ts) AS day
ALTER TABLE ... DROP PARTITION FIELD days(event_ts)

Setting distribution and sort order

Globally sort by event_ts:

ALTER TABLE logs WRITE ORDERED BY event_ts

Distribute by partitions to writers and locally sort by event_ts:

ALTER TABLE logs WRITE DISTRIBUTED BY PARTITION LOCALLY ORDERED BY event_ts

Remove write order:

ALTER TABLE logs WRITE UNORDERED

Queries & metadata tables

Simple select example:

SELECT count(1) as row_count FROM logs
WHERE event_ts >= date_add(current_date(), -7)
  AND event_ts < current_date()

Note: Filters automatically select files using partitions and value stats

Time travel

SELECT ... FROM table FOR VERSION AS OF ref_or_id

SELECT ... FROM table
FOR TIMESTAMP AS OF '2022-04-14 00:00:00-07:00'

-- Also works with metadata tables

Metadata tables

-- lists all tags and branches
SELECT * FROM db.table.refs

-- all known revisions of the table
SELECT * FROM db.table.snapshots

-- history of the main branch
SELECT * FROM db.table.history

Others: partitions, manifests, files, data_files, delete_files

Note: Must be loaded using the full table name
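As a usage sketch, listing the most recent snapshots and the operations that produced them (the table name is a placeholder; the columns come from the snapshots table's schema):

SELECT committed_at, snapshot_id, operation
FROM db.table.snapshots
ORDER BY committed_at DESC
LIMIT 5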


Loading a table from a metadata file

df = spark.read.format("iceberg").load(
    "s3://bucket/path/to/metadata.json")

Metadata columns

_file        The file location containing the record
_pos         The position within _file of the record
_partition   The partition tuple used to store the record
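They can be selected alongside regular columns; a brief sketch (the filter is illustrative):

SELECT _file, _pos, _partition, level, msg
FROM logs
WHERE level = 'ERROR'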

Functions

Call Iceberg transform functions:

SELECT system.truncate(10, name) FROM table
SELECT system.bucket(16, id) FROM table

Inspect the Iceberg library version:

SELECT system.iceberg_version() as version

Table properties

Set table properties:

ALTER TABLE table SET TBLPROPERTIES ('prop'='val')

format-version
    Format version: 1 or 2. Note: Must be 2 for merge-on-read
history.expire.max-snapshot-age-ms
    Age limit for snapshot retention
history.expire.min-snapshots-to-keep
    Minimum number of snapshots to retain
write.(update|delete|merge).mode
    Mode by command: copy-on-write or merge-on-read
write.(update|delete|merge).isolation-level
    Isolation level by command: snapshot or serializable
read.split.target-size
    Target size, in bytes, for split combining for the table
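For example, a sketch tightening snapshot retention (the values are illustrative, not recommendations):

ALTER TABLE logs SET TBLPROPERTIES (
    'history.expire.max-snapshot-age-ms'='432000000',
    'history.expire.min-snapshots-to-keep'='5')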

Dataframe writes

Create a writer:

writer = df.writeTo(tableName)

Note: In catalogs with multiple formats, add .using("iceberg")

Create from a dataframe:

df.writeTo("db.table").partitionedBy($"col").create()

Append:

df.writeTo("db.table").append()

Overwrite:

df.writeTo("db.table").overwrite($"report_date" === d)
df.writeTo("db.table").overwritePartitions()
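In SQL, overwritePartitions corresponds roughly to a dynamic INSERT OVERWRITE; a sketch, assuming spark.sql.sources.partitionOverwriteMode=dynamic and a hypothetical staging_logs source:

INSERT OVERWRITE logs
SELECT * FROM staging_logs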

Stored procedures

Basic syntax:

CALL system.procedure_name(named_arg => value, ...)

Compaction

Compact data and rewrite all delete files:

CALL system.rewrite_data_files(
    table => 'table_name',
    where => 'col1 = "value"',
    options => map('min-input-files', '2',
                   'delete-file-threshold', '1'))

Compact and sort:

CALL system.rewrite_data_files(
    table => 'table_name',
    strategy => 'sort',
    sort_order => 'col1, col2 desc')

Compact and sort using z-order:

CALL system.rewrite_data_files(
    table => 'table_name',
    strategy => 'sort',
    sort_order => 'zorder(col1, col2)')

Optimize table metadata:

CALL system.rewrite_manifests(table => 'table')

Roll back to a previous snapshot or time:

CALL system.rollback_to_snapshot(
    table => 'table_name',
    snapshot_id => 9180664844100633321)

CALL system.rollback_to_timestamp(
    table => 'table_name',
    timestamp => TIMESTAMP '2023-01-01 00:00:00.000')

Inspecting tables

DESCRIBE db.table

tabular.io • docs.tabular.io
v0.4.4
