DBT
1. wtf is Data Build Tool
is a Python tool for processing data in an ELT style (EX: load raw data into SQL first, then transform it there) with key
features:
manage tables and views
modularize the code (each model becomes a node in a DAG, which makes it much easier to trace where cleaned data comes from)
version the code
automatic data quality control
2. install
pip install dbt-core dbt-postgres
3. Adding
start a new project with a Postgres connection:
dbt init --profiles-dir ./
--profiles-dir ./ : look for (and write) the connection profiles in the current working directory instead of the default ~/.dbt/
note: this command asks several questions to connect to a database; read each prompt
and enter a suitable option. These configurations can be modified later in ./profiles.yml
example configuration:
brazillian_ecom: # dbt project name
  outputs:
    dev: # postgres setup
      dbname: brazillian_ecommerce
      host: localhost
      port: 5432
      schema: analytics # p_remind: schema in the database to build into
      threads: 1
      type: postgres
      user: admin
      pass: admin123
  target: dev
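for orientation, the rough project layout that dbt init scaffolds (a sketch; exact files vary by dbt version):
    brazillian_ecom/
      dbt_project.yml   # project name, paths, default materializations
      models/           # .sql model files + schema.yml tests
      seeds/            # csv files for dbt seed
      snapshots/        # SCD2 snapshot definitions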
sql code in a model
by default dbt materializes each model as a view; to materialize it as a table, an explicit config is needed, e.g. {{ config(materialized='table') }} at the top of the file
compact code generation: the SQL source can be templated with Jinja (same idea as EJS in a web app)
syntax:
from {{ref("<model_name>")}}
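a minimal model sketch (the file path and table names are hypothetical, not from the project above):
    -- ./models/stg_orders.sql
    {{ config(materialized='table') }}  -- override the default view materialization
    SELECT
        order_id,
        customer_id,
        order_status
    FROM {{ ref('raw_orders') }}  -- ref() resolves another model and records the DAG edge
dbt compiles the Jinja, resolves ref() to the real schema-qualified name, and builds models in dependency order.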
action
note: run these commands at the project root
run all models: dbt run --profiles-dir ./
> run a single model (a sql file that generates a single table/view):
dbt run --profiles-dir ./ --select <sql_file_name_without_extension>
dbt run --profiles-dir ./ --select +<sql_file_name_without_extension> # run this model along with its upstream dependencies
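the suffix form also exists (standard dbt node-selection syntax, not shown above):
dbt run --profiles-dir ./ --select <sql_file_name_without_extension>+ # run this model along with everything downstream of it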
test model quality: dbt test --profiles-dir ./
> note: tests are set up at: ./models/<model_name>/schema.yml
> syntax: research later
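a minimal sketch of dbt's built-in generic tests in a schema.yml (model and column names are hypothetical):
    version: 2
    models:
      - name: stg_orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
dbt test runs one query per test and fails the run if any rows violate the rule.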
generate the docs UI (lineage graph + model documentation):
dbt docs generate --profiles-dir ./
dbt docs serve --profiles-dir ./
seed: load csv files into the database as tables
note: files to seed must be placed in ./seeds/ (dbt's default seed path)
dbt seed --profiles-dir ./
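a tiny seed sketch, matching the category table used in the snapshot example below (file name and columns are assumptions), hypothetical file ./seeds/category.csv:
    cate_id,cate_name,updated_at
    1,books,2024-01-01
    2,toys,2024-01-01
after dbt seed, this becomes a table in the target schema and can be referenced from models with {{ ref('category') }}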
implement SCD2 with dbt snapshots: dbt snapshot --profiles-dir ./ (add the global --debug flag, i.e. dbt --debug snapshot, to also log the generated SQL)
files to set up are conventionally placed in ./snapshots/
EX: create an SCD2 table named category_scd2, defined at
./snapshots/category_scd2.sql with content:
{% snapshot category_scd2 %}
{{ config(target_schema='snapshots', unique_key='cate_id', strategy='timestamp',
updated_at='updated_at', invalidate_hard_deletes=True) }}
-- category is a dimension table in the current database that needs an SCD2
-- table to track record versions and their valid-from/valid-to ranges
SELECT * FROM category
{% endsnapshot %}
> this file creates a table named category_scd2 in the database (under the snapshots schema) that records the history of the
category table; dbt adds dbt_valid_from/dbt_valid_to columns marking each record version's validity window
note: with --debug, dbt logs the SQL code it generates for this operation
note: it seems that every time the source table is updated, the snapshot command has to be run again so that the
update is recorded into the snapshot table
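a quick way to read the snapshot (dbt_valid_from/dbt_valid_to are standard dbt snapshot meta columns; the schema name follows target_schema above):
    -- current version of every category record
    SELECT * FROM snapshots.category_scd2 WHERE dbt_valid_to IS NULL;
    -- full history of one record, ordered by version
    SELECT * FROM snapshots.category_scd2 WHERE cate_id = 1 ORDER BY dbt_valid_from;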
4. my work
simple dbt pipeline: build a pipeline that processes data from csv and loads it into Postgres
> note: the container name was changed manually (it differs from the docker compose file) to distinguish it from the
other containers
5. working section:
video at -1:11