ETL Pipeline, Class Notes
Imagine an ML pipeline that produces intermediate data: how do you make sure you don’t rerun a part
of the script that has already finished?
Workflow managers
- Apache Airflow
o Disadvantage: needs a constantly running server (scheduler plus web interface)
- Luigi
o Checks for the existence of intermediate output files; if a file does not exist, it reruns that
part of the pipeline (see the sketch after this list)
- MLFlow
- Kubeflow
o Create jobs in Kubernetes
- Metaflow
o Improved version of Luigi
- Prefect
o Built on Dask
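A minimal sketch of how Luigi’s file-existence check works in practice (the task names, file paths,
and logic below are made up for illustration): every task declares an output target, and Luigi only
runs a task whose output file does not exist yet.

import luigi

class CleanData(luigi.Task):
    def output(self):
        # Luigi checks whether this file exists before deciding to run the task
        return luigi.LocalTarget("data/clean.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("col1,col2\n1,2\n")

class TrainModel(luigi.Task):
    def requires(self):
        # depends on CleanData; CleanData is skipped if data/clean.csv already exists
        return CleanData()

    def output(self):
        return luigi.LocalTarget("data/model.txt")

    def run(self):
        with self.input().open("r") as f:
            rows = f.readlines()
        with self.output().open("w") as f:
            f.write(f"trained on {len(rows) - 1} rows\n")

if __name__ == "__main__":
    # local_scheduler=True runs everything in-process, handy for small pipelines
    luigi.build([TrainModel()], local_scheduler=True)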
Airflow
- You really have to write the code yourself (flexible)
- One data pipeline is a DAG
- It also uses a metadata database
o Postgres (better for production)
o SQLite (no parallel task execution, not recommended for production, but fine for simpler local setups)
- In Airflow, you create the DAGs yourself in Python (see the sketch after the setup steps)
- Setting up and starting Airflow
o Initialize database
▪ airflow db init
o Run scheduler
▪ airflow scheduler
o Run web interface
▪ airflow webserver
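To make “one data pipeline is a DAG” concrete, here is a minimal DAG sketch, assuming Airflow 2.x;
the dag_id my_first_dag, the daily schedule, and the two tasks are illustrative choices, not
something prescribed by Airflow. The file is saved in the dags/ folder so the scheduler can pick it up.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def extract():
    # stand-in for a real extraction step
    print("extracting data")

with DAG(
    dag_id="my_first_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = BashOperator(task_id="load", bash_command="echo loading")

    # >> defines the dependency: extract must finish before load starts
    extract_task >> load_task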
Airflow commands
- python dag.py
o Runs the DAG file through the Python interpreter; if it finishes without errors, dag.py is
syntactically correct and its imports work
- airflow tasks
o Manage tasks
- airflow tasks test <dag_id> <task_id> [<execution_date>]
o Runs a single task instance in isolation: it ignores upstream dependencies and does not record
any state in the metadata database. You provide the DAG ID, the task ID, and optionally the
logical (execution) date to specify which task instance to run. Useful for debugging one task
at a time.
- airflow dags test my_first_dag
o Runs one complete DAG run for a given logical date on the local machine. Task dependencies are
respected, but no state is registered in the metadata database, so it is a good way to test a
whole DAG end to end before deploying it.
- airflow dags trigger my_first_dag
o Manually triggers a new DAG run. The scheduler then executes the whole workflow, running the
tasks in the order given by their dependencies, and the run is recorded in the metadata
database. This is how you start a DAG run on demand, outside of (or in addition to) the DAG’s
regular schedule; scheduled runs themselves are started automatically by the scheduler.
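Putting the commands together for the DAG sketched above (the dag_id, task_id, date, and file path
are assumptions for illustration), a typical local test-and-run sequence looks roughly like this:

python ~/airflow/dags/my_first_dag.py              # does the file parse and import cleanly?
airflow tasks test my_first_dag extract 2024-01-01 # run the extract task alone, no state saved
airflow dags test my_first_dag 2024-01-01          # run the whole DAG once, no state saved
airflow dags trigger my_first_dag                  # start a real DAG run via the scheduler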