Iceberg Spark 3.3

CREATE and ALTER TABLE

Example syntax
CREATE TABLE IF NOT EXISTS logs (
    level string, event_ts timestamp, msg string, ...)
USING iceberg
PARTITIONED BY (level, hours(event_ts))
Catalogs
Configure a catalog, called "sandbox"
spark.sql.catalog.sandbox=\
    org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.sandbox.type=rest
spark.sql.catalog.sandbox.uri=\
    https://...
spark.sql.catalog.sandbox.warehouse=sandbox
spark.sql.catalog.sandbox.credential=...
spark.sql.defaultCatalog=sandbox
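The same catalog settings can also be applied when building a session in code. A minimal Scala sketch; the app name, URI, and credential are placeholders:

import org.apache.spark.sql.SparkSession

// Session-level equivalent of the config above (placeholder values)
val spark = SparkSession.builder()
  .appName("iceberg-sandbox")
  .config("spark.sql.catalog.sandbox", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.sandbox.type", "rest")
  .config("spark.sql.catalog.sandbox.uri", "https://...")
  .config("spark.sql.defaultCatalog", "sandbox")
  .getOrCreate()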
Supported types
Primitive types:
boolean, int, bigint, float, double, decimal(P,S),
date, timestamp, string, binary
Note: Spark's timestamp type is Iceberg's timestamp with time zone type
Nested types:
struct<name type, ...>, array<item_type>, map<key_type, value_type>

Supported partition transforms
column                Partition by the unmodified column value
years(event_ts)       Year granularity, e.g. 2023
months(event_ts)      Month granularity, e.g. 2023-03
days(event_ts)        Day granularity, e.g. 2023-03-01
hours(event_ts)       Hour granularity, e.g. 2023-03-01-10
truncate(width, col)  Truncate strings or numbers in col
bucket(width, col)    Hash col values into width buckets
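For example, transforms can be combined in one partition spec. A sketch, assuming a made-up events table:

// Hypothetical table partitioned by day and by 16 hash buckets of id
spark.sql("""
  CREATE TABLE IF NOT EXISTS sandbox.db.events (
    id bigint, event_ts timestamp, payload string)
  USING iceberg
  PARTITIONED BY (days(event_ts), bucket(16, id))""")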
Working with multiple catalogs in SQL
See the session's current catalog and database
SHOW CURRENT DATABASE
Set the current catalog and database
USE catalog.database
List databases and tables
SHOW DATABASES
SHOW TABLES
Schema evolution (ALTER TABLE table ...)
ADD COLUMN line_no int AFTER event_ts
-- widen type (int to bigint, float to double, etc.)
ALTER COLUMN line_no TYPE bigint
ALTER COLUMN line_no COMMENT 'Line number'
ALTER COLUMN line_no FIRST
ALTER COLUMN line_no AFTER event_ts
RENAME COLUMN msg TO message
DROP COLUMN line_no

Adding/updating nested types
ADD COLUMN location struct<lat float, long float>
ADD COLUMN location.alt float
Note: ALTER COLUMN can't modify struct types
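Each clause above attaches to a full ALTER TABLE statement. A sketch, assuming the logs table from the create example:

// Fully-qualified forms of two of the clauses above (table name assumed)
spark.sql("ALTER TABLE sandbox.db.logs ADD COLUMN line_no int AFTER event_ts")
spark.sql("ALTER TABLE sandbox.db.logs ALTER COLUMN line_no TYPE bigint")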
Alter partition spec
ALTER TABLE ... ADD PARTITION FIELD days(event_ts) AS day
ALTER TABLE ... DROP PARTITION FIELD days(event_ts)
Setting distribution and sort order
Globally sort by event_ts
ALTER TABLE logs WRITE ORDERED BY event_ts
Distribute by partitions to writers and locally sort by event_ts
ALTER TABLE logs WRITE DISTRIBUTED BY PARTITION
    LOCALLY ORDERED BY event_ts
Remove write order
ALTER TABLE logs WRITE UNORDERED
Table properties
Set table properties
ALTER TABLE table SET TBLPROPERTIES ('prop'='val')

Format version: 1 or 2
format-version
Note: Must be 2 for merge-on-read

Age limit for snapshot retention
history.expire.max-snapshot-age-ms

Minimum number of snapshots to retain
history.expire.min-snapshots-to-keep

Mode by command: copy-on-write or merge-on-read
write.(update|delete|merge).mode

Isolation level by command: snapshot or serializable
write.(update|delete|merge).isolation-level

Target size, in bytes, for split combining for the table
read.split.target-size
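For example, the two snapshot-retention properties can be set together. A sketch with arbitrary values (5 days, 50 snapshots):

// Expire snapshots older than 5 days (432000000 ms), but keep at least 50
spark.sql("""
  ALTER TABLE sandbox.db.logs SET TBLPROPERTIES (
    'history.expire.max-snapshot-age-ms'='432000000',
    'history.expire.min-snapshots-to-keep'='50')""")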
Writes

INSERT
INSERT INTO table SELECT id, data FROM ...
INSERT INTO table VALUES (1, 'a'), (2, 'b'), ...

MERGE
MERGE INTO target_table t
USING source_changes s ON t.id = s.id
WHEN MATCHED AND s.op = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.count = t.count + s.count
WHEN NOT MATCHED THEN INSERT (id, count) VALUES (s.id, s.count)

For performance, add filters to the ON clause for the target table:
ON t.id = s.id AND t.event_ts >= date_add(current_date(), -2)
Uses write.merge.mode: copy-on-write or merge-on-read

copy-on-write vs merge-on-read
Note: When in doubt, use copy-on-write for the best read performance
To enable merge-on-read:
ALTER TABLE target_table SET TBLPROPERTIES (
    'format-version'='2',
    'write.merge.mode'='merge-on-read')

UPDATE
UPDATE table SET count = count + 1 WHERE id = 5

DELETE FROM
DELETE FROM table WHERE id = 5
Dataframe writes
Create a writer
writer = df.writeTo(tableName)
Note: In catalogs with multiple formats, add .using("iceberg")
Create from dataframe
df.writeTo("db.table").partitionedBy($"col").create()
Append
df.writeTo("db.table").append()
Overwrite
df.writeTo("db.table").overwrite($"report_date" === d)
df.writeTo("db.table").overwritePartitions()
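Putting the writer calls together, a minimal end-to-end sketch; the table name and sample rows are made up:

import spark.implicits._

// Create a table from a small dataframe, then append more rows
val df = Seq((1L, "a"), (2L, "b")).toDF("id", "data")
df.writeTo("sandbox.db.demo").using("iceberg").create()
Seq((3L, "c")).toDF("id", "data").writeTo("sandbox.db.demo").append()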
Stored procedures
Basic syntax
CALL system.procedure_name(named_arg => value, ...)

Compaction
Compact data and rewrite all delete files
CALL catalog.system.rewrite_data_files(
    table => 'table_name',
    where => 'col1 = "value"',
    options => map('min-input-files', '2',
                   'delete-file-threshold', '1'))
Compact and sort
CALL catalog.system.rewrite_data_files(
    table => 'table_name',
    strategy => 'sort',
    sort_order => 'col1, col2 desc')
Compact and sort using z-order
CALL catalog.system.rewrite_data_files(
    table => 'table_name',
    strategy => 'sort',
    sort_order => 'zorder(col1, col2)')

Optimize table metadata
CALL catalog.system.rewrite_manifests(table => 'table')

Roll back to previous snapshot or time
CALL catalog.system.rollback_to_snapshot(
    table => 'table_name',
    snapshot_id => 9180664844100633321)
CALL catalog.system.rollback_to_timestamp(
    table => 'table_name',
    timestamp => TIMESTAMP '2023-01-01 00:00:00.000')
Queries & metadata tables

Simple select example
SELECT count(1) as row_count FROM logs
WHERE event_ts >= date_add(current_date(), -7)
  AND event_ts < current_date()
Note: Filters automatically select files using partitions and value stats

Metadata tables
-- lists all tags and branches
db.table.refs
-- all known revisions of the table
db.table.snapshots
-- history of the main branch
db.table.history
Note: Must be loaded using the full table name
Others: partitions, manifests, files, data_files, delete_files
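For example, recent commits can be read from the snapshots table. A sketch, assuming a table named db.table:

// One row per snapshot; newest commits first
spark.sql("""
  SELECT committed_at, snapshot_id, operation
  FROM db.table.snapshots
  ORDER BY committed_at DESC""").show()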
Inspecting tables
DESCRIBE db.table

Time travel
SELECT ... FROM table FOR VERSION AS OF ref_or_id
SELECT ... FROM table
    FOR TIMESTAMP AS OF '2022-04-14 00:00:00.000-07:00'
-- Also works with metadata tables
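Time travel also works through DataFrame reads via Iceberg's snapshot-id and as-of-timestamp read options. A sketch; the snapshot ID and millisecond timestamp are placeholders:

// Read a specific snapshot by ID (placeholder value)
val bySnapshot = spark.read
  .option("snapshot-id", 9180664844100633321L)
  .format("iceberg")
  .load("db.table")

// Read the table state as of a timestamp (milliseconds since epoch)
val byTime = spark.read
  .option("as-of-timestamp", 1650956400000L)
  .format("iceberg")
  .load("db.table")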
Loading a table from a metadata file
df = spark.read.format("iceberg").load(
    "s3://bucket/path/to/metadata.json")

Metadata columns
_file        The file location containing the record
_pos         The position within _file of the record
_partition   The partition tuple used to store the record
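A sketch of selecting the metadata columns alongside regular columns; the logs table is assumed:

// Metadata columns are hidden: select them by name, not via SELECT *
spark.sql("SELECT _file, _pos, _partition, msg FROM sandbox.db.logs").show()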
Functions
Call Iceberg transform functions
SELECT system.truncate(10, name) FROM table
SELECT system.bucket(16, id) FROM table
Inspect the Iceberg library version
SELECT system.iceberg_version() as version
tabular.io • docs.tabular.io
v0.4.4