Data Build Tool for Data Engineers

1. What is Data Build Tool (dbt)?


dbt is a Python tool for processing data following the ELT approach (EX: load raw data into SQL first, then transform it inside the database), with key features:
 manage tables and views
 modular code (models are organized as a DAG, which makes the workflow easier to follow and shows the source/lineage of the cleaned data)
 versioning the code (models are plain text files, so they work with version control)
 automatic data quality controls (tests)
2. install
pip install dbt-core dbt-postgres
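A quick check that the install worked, using dbt's standard CLI:

    dbt --version   # prints the installed dbt-core and adapter versions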
3. Adding a project
 start a new project with a Postgres connection setup
 dbt init --profiles-dir ./
 --profiles-dir ./ : tells dbt to look for profiles.yml (the connection config) in the current working directory
note: this command asks several questions about the database connection; read the prompts
and enter the suitable options. These configurations can be modified later in ./profiles.yml
example configuration:
brazillian_ecom: # dbt project name
  outputs:
    dev: # postgres connection setup
      dbname: brazillian_ecommerce
      host: localhost
      port: 5432
      schema: analytics # reminder: the database schema dbt builds into
      threads: 1
      type: postgres
      user: admin
      pass: admin123
  target: dev
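After editing profiles.yml, the connection can be verified with dbt's built-in check:

    dbt debug --profiles-dir ./   # validates profiles.yml and tests the database connection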
 sql code in models
 by default each model is created as a view in SQL; to create it as a table, an explicit config is needed (see the sketch below)
 templated code generation: the SQL source code can be programmed with Jinja (similar to EJS in a web app)
 syntax:
from {{ ref("<table_name>") }}
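A minimal sketch of a model file, assuming a hypothetical upstream model stg_orders already exists:

    -- models/analytics/fct_orders.sql (hypothetical file name)
    {{ config(materialized='table') }}  -- override the default view materialization

    select
        order_id,
        customer_id,
        order_status
    from {{ ref('stg_orders') }}  -- ref() resolves the model name and records the DAG dependency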
 action
Inote: run this command at project root
 run all model: dbt run --profiles-dir ./
> run a single model (a sql file that generate single table/view):
 dbt run --profiles-dir ./ --select <sql_file_name_without_extention>
 dbt run --profiles-dir ./ --select +<sql_file_name_without_extention> # run this
model along with its dependencies
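The graph selector also works in the other direction (standard dbt selection syntax):

    dbt run --profiles-dir ./ --select <sql_file_name_without_extension>+   # run this model along with everything that depends on it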

 test a model's quality: dbt test --profiles-dir ./
> note: tests are set up at: ./models/<model_name>/schema.yml
> syntax: research later (a minimal sketch below)
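A minimal sketch of the test syntax, using dbt's built-in generic tests (model and column names are hypothetical):

    # ./models/analytics/schema.yml
    version: 2
    models:
      - name: fct_orders        # hypothetical model name
        columns:
          - name: order_id
            tests:
              - unique
              - not_null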
 generate the docs UI (model docs and lineage graph):
 dbt docs generate --profiles-dir ./
 dbt docs serve --profiles-dir ./
 seed: load csv files into the database as tables
 note: files to seed need to be placed in ./seeds/
 dbt seed --profiles-dir ./
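A sketch of how seeding maps files to tables (file name and columns are hypothetical): a file ./seeds/category.csv such as

    cate_id,cate_name,updated_at
    1,electronics,2024-01-01
    2,furniture,2024-01-01

becomes a table named category in the target schema after running dbt seed.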
 implement SCD2 with dbt: dbt --debug snapshot --profiles-dir ./
 files to set this up are conventionally placed in ./snapshots/
 EX: create an SCD2 table named category_scd2, defined at
./snapshots/category_scd2.sql with content:

    {% snapshot category_scd2 %}
    {{ config(target_schema='snapshots', unique_key='cate_id', strategy='timestamp',
              updated_at='updated_at', invalidate_hard_deletes=True) }}
    -- category is a dim table in the current database that needs an SCD2 table
    -- to track the version of each record and its valid-from/valid-to range
    SELECT * FROM category
    {% endsnapshot %}

> this file creates a table named category_scd2 in the database that records the history of the
category table
 note: with --debug it logs the SQL code it runs for this operation
 note: it seems that every time the source table is updated, the snapshot command needs to be re-run so the
update is recorded into the snapshot table
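dbt adds bookkeeping columns to the snapshot table (dbt_valid_from and dbt_valid_to, among others), so the history can be queried directly; a minimal sketch (cate_name is a hypothetical column):

    -- current version of each record: dbt_valid_to is null for the live row
    SELECT cate_id, cate_name, dbt_valid_from, dbt_valid_to
    FROM snapshots.category_scd2
    WHERE dbt_valid_to IS NULL;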

4. my work
 simple dbt pipeline: build a pipeline that processes data from csv into postgres
> note: the container name was changed manually (differs from the docker compose file) to distinguish it from the
other containers
5. working session:
 video at -1:11
