Hands-on Lab: Build ETL Data Pipelines with BashOperator using Apache
Airflow
Project Scenario
You are a data engineer at a data analytics consulting company. You have been assigned a project to decongest the national highways by analyzing the road traffic data
from different toll plazas. Each highway is operated by a different toll operator with a different IT setup that uses different file formats. Your job is to collect data available
in different formats and consolidate it into a single file.
Objectives
In this assignment, you will develop an Apache Airflow DAG that will:
- Extract data from a CSV file
- Extract data from a TSV file
- Extract data from a fixed-width file
- Consolidate the extracted data into a single CSV file
- Transform the data and load it into the staging area
Throughout this lab, you will be prompted to take screenshots and save them on your device. You will need to upload the screenshots for peer review. You can use various
free screen grabbing tools or your operating system's shortcut keys (Alt + PrintScreen in Windows, for example) to capture the required screenshots. You can save the
screenshots with the .jpg or .png extension.
2. Open a terminal and create the following directory structure for the staging area (for example, with mkdir -p):
/home/project/airflow/dags/finalassignment/staging
4. Download the data set from the source to the destination directory using the curl command.
1. Create a new file named ETL_toll_data.py in the /home/project directory and open it in the file editor.
2. Import the required libraries into the ETL_toll_data.py file.
3. Define the DAG arguments as per the following details in the ETL_toll_data.py file:
| Parameter | Value |
| --- | --- |
| owner | <You may use any dummy name> |
| start_date | today |
| email | <You may use any dummy email> |
| email_on_failure | True |
| email_on_retry | True |
| retries | 1 |
| retry_delay | 5 minutes |
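For reference, a minimal sketch of the imports and default arguments might look like the following. Airflow 2.x import paths are assumed, and the owner name and email are placeholders to replace with your own values:

```python
# ETL_toll_data.py -- imports and DAG default arguments (a sketch)
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator  # older releases use airflow.operators.bash_operator
from airflow.utils.dates import days_ago

# DAG arguments matching the table above; replace owner and email with your own values
default_args = {
    'owner': 'dummy_name',
    'start_date': days_ago(0),              # today
    'email': ['dummy_email@example.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
```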
Take a screenshot of the command and output you used. Name the screenshot dag_args.jpg.
4. Define the DAG in the ETL_toll_data.py file using the following details.
| Parameter | Value |
| --- | --- |
| DAG id | ETL_toll_data |
| Schedule | Daily once |
| default_args | As you have defined in the previous step |
| description | Apache Airflow Final Assignment |
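A matching sketch of the DAG definition, assuming the default_args dictionary from the previous step, could be:

```python
# DAG definition using the parameters from the table above
dag = DAG(
    dag_id='ETL_toll_data',
    schedule_interval=timedelta(days=1),   # run once daily
    default_args=default_args,
    description='Apache Airflow Final Assignment',
)
```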
Take a screenshot of the command and output you used. Name the screenshot dag_definition.jpg.
At the end of this exercise, you should have the following screenshots with .jpg or .png extension:
1. dag_args.jpg
2. dag_definition.jpg
1. Create a task named unzip_data to unzip the downloaded data into the staging directory. You can locally untar and read through the file fileformats.txt to understand the column details.
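For reference, a BashOperator sketch of this task might look like the following. The archive name tolldata.tgz and the directory paths are assumptions; substitute the names used in your environment:

```python
# Hypothetical unzip_data task: extracts the downloaded archive into the staging area.
# The archive name (tolldata.tgz) and paths are assumptions.
unzip_data = BashOperator(
    task_id='unzip_data',
    bash_command='tar -xzf /home/project/airflow/dags/finalassignment/tolldata.tgz '
                 '-C /home/project/airflow/dags/finalassignment/staging',
    dag=dag,
)
```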
2. Create a task named extract_data_from_csv to extract the fields Rowid, Timestamp, Anonymized Vehicle number, and Vehicle type from the vehicle-data.csv file
and save them into a file named csv_data.csv.
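A possible sketch of this task, assuming the four fields are the first four comma-separated columns (confirm with fileformats.txt) and that the extracted files live in the staging directory:

```python
# Hypothetical extract_data_from_csv task: keeps the first four comma-separated columns.
# The column positions (1-4) and the working directory are assumptions.
extract_data_from_csv = BashOperator(
    task_id='extract_data_from_csv',
    bash_command='cd /home/project/airflow/dags/finalassignment/staging && '
                 'cut -d"," -f1-4 vehicle-data.csv > csv_data.csv',
    dag=dag,
)
```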
3. Create a task named extract_data_from_tsv to extract the fields Number of axles, Tollplaza id, and Tollplaza code from the tollplaza-data.tsv file and save them
into a file named tsv_data.csv.
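A possible sketch, assuming the three fields are columns 5 to 7 of the tab-separated file and that the output should be comma-separated:

```python
# Hypothetical extract_data_from_tsv task: pulls three tab-separated fields and
# rewrites them as comma-separated values. Column positions (5-7) are an assumption.
extract_data_from_tsv = BashOperator(
    task_id='extract_data_from_tsv',
    bash_command='cd /home/project/airflow/dags/finalassignment/staging && '
                 "cut -f5-7 tollplaza-data.tsv | tr '\\t' ',' > tsv_data.csv",
    dag=dag,
)
```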
4. Create a task named extract_data_from_fixed_width to extract the fields Type of Payment code and Vehicle Code from the fixed-width file payment-data.txt and
save them into a file named fixed_width_data.csv.
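A possible sketch, assuming (purely for illustration) that the two fields occupy characters 59 to 67 of each line; verify the actual positions in fileformats.txt:

```python
# Hypothetical extract_data_from_fixed_width task: slices two fields by character
# position and separates them with a comma. The character range (59-67) is an assumption.
extract_data_from_fixed_width = BashOperator(
    task_id='extract_data_from_fixed_width',
    bash_command='cd /home/project/airflow/dags/finalassignment/staging && '
                 "cut -c59-67 payment-data.txt | tr ' ' ',' > fixed_width_data.csv",
    dag=dag,
)
```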
5. Create a task named consolidate_data to consolidate data extracted from previous tasks. This task should create a single csv file named extracted_data.csv by
combining data from the following files:
- csv_data.csv
- tsv_data.csv
- fixed_width_data.csv
The final CSV file should use the fields in the order given below:
- Rowid
- Timestamp
- Anonymized Vehicle number
- Vehicle type
- Number of axles
- Tollplaza id
- Tollplaza code
- Type of Payment code
- Vehicle Code
Hint: Use the bash paste command that merges the columns of the files passed as a command-line parameter and sends the output to a new file specified. You
can use the command man paste to explore more.
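Following the hint, a sketch of the consolidation task could look like this (the staging-directory path is the same assumption as in the earlier tasks):

```python
# Hypothetical consolidate_data task: paste merges the three files column-wise,
# separating them with commas, in the field order listed above.
consolidate_data = BashOperator(
    task_id='consolidate_data',
    bash_command='cd /home/project/airflow/dags/finalassignment/staging && '
                 'paste -d"," csv_data.csv tsv_data.csv fixed_width_data.csv > extracted_data.csv',
    dag=dag,
)
```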
Take a screenshot of the command and output you used. Name the screenshot consolidate_data.jpg.
6. Create a task named transform_data to transform the vehicle_type field in extracted_data.csv into capital letters and save it into a file named
transformed_data.csv in the staging directory.
Hint: You can use the tr command within the BashOperator in Airflow.
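Following the hint, a sketch using tr could be the following. Note that tr applied this way uppercases every letter in the file, which is a simplification; use a field-aware tool such as awk if only the vehicle_type column should change:

```python
# Hypothetical transform_data task: uppercases letters with tr, as the hint suggests.
# Caveat: this uppercases all alphabetic characters in the file, not just vehicle_type.
transform_data = BashOperator(
    task_id='transform_data',
    bash_command='cd /home/project/airflow/dags/finalassignment/staging && '
                 'tr "[a-z]" "[A-Z]" < extracted_data.csv > transformed_data.csv',
    dag=dag,
)
```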
Take a screenshot of the command and output you used. Name the screenshot transform.jpg.
7. Define the task pipeline as per the details given below:

| Task | Functionality |
| --- | --- |
| First task | unzip_data |
| Second task | extract_data_from_csv |
| Third task | extract_data_from_tsv |
| Fourth task | extract_data_from_fixed_width |
| Fifth task | consolidate_data |
| Sixth task | transform_data |
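Assuming the task variables defined in the previous steps, the pipeline can be chained with Airflow's bit-shift dependency operator:

```python
# Task pipeline: each task runs only after the previous one has completed
unzip_data >> extract_data_from_csv >> extract_data_from_tsv >> \
    extract_data_from_fixed_width >> consolidate_data >> transform_data
```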
Take a screenshot of the task pipeline section of the DAG. Name the screenshot task_pipeline.jpg.
At the end of this exercise, you should have the following screenshots with .jpg or .png extension:
1. unzip_data.jpg
2. extract_data_from_csv.jpg
3. extract_data_from_tsv.jpg
4. extract_data_from_fixed_width.jpg
5. consolidate_data.jpg
6. transform.jpg
7. task_pipeline.jpg
1. Submit the DAG by copying ETL_toll_data.py into the Airflow dags folder, and confirm that ETL_toll_data appears in the list of DAGs. Take a screenshot of the command and output you used. Name the screenshot submit_dag.jpg.
Note: If you don't find your DAG in the list, you can check for import errors from the terminal, for example with the command airflow dags list-import-errors.
2. Unpause and trigger the DAG through the CLI or the Web UI.
3. Take a screenshot of the unpaused DAG through the CLI or the Web UI. Name the screenshot unpause_trigger_dag.jpg.
4. Take a screenshot of the tasks in the DAG run through the CLI or the Web UI. Name the screenshot dag_tasks.jpg.
5. Take a screenshot of the DAG runs on the Airflow console through the CLI or the Web UI. Name the screenshot dag_runs.jpg.
Screenshot checklist
You should have the following screenshots with .jpg or .png extension:
1. submit_dag.jpg
2. unpause_trigger_dag.jpg
3. dag_tasks.jpg
4. dag_runs.jpg
Authors
Lavanya T S
Ramesh Sannareddy
Other Contributors
Rav Ahuja
© IBM Corporation. All rights reserved.