DAGs: The Definitive Guide
Revised Edition

Everything you need to know about Airflow DAGs

Powered by Astronomer
Table of Contents

• Deferrable Operators
• DAG Design
• DAG Writing Best Practices in Apache Airflow
• Passing Data Between Airflow Tasks
• Using Task Groups in Airflow
• Cross-DAG Dependencies
• Dynamically Generating DAGs
• Testing Airflow DAGs
• Debugging DAGs
• 7 Common Errors to Check when Debugging DAGs
• Error Notifications in Airflow

Editor's Note

Welcome to the ultimate guide to Apache Airflow DAGs, brought to you by the Astronomer team. This ebook covers everything you need to know to work with DAGs, from the building blocks that make them up to best practices for writing them, dynamically generating them, testing and debugging them, and more. It's a guide written by practitioners for practitioners.
1. DAGs
Where to Begin?

What Exactly is a DAG?

A DAG is a Directed Acyclic Graph — a conceptual representation of a series of activities, or, in other words, a mathematical abstraction of a data pipeline. Although used in different circles, both terms, DAG and data pipeline, represent an almost identical mechanism. In a nutshell, a DAG (or a pipeline) defines a sequence of execution stages in any non-recurring algorithm.

DIRECTED — In general, if multiple tasks exist, each must have at least one defined upstream (previous) or downstream (subsequent) task, or one or more of both. (It's important to note, however, that there are also DAGs that have multiple parallel tasks — meaning no dependencies.)

ACYCLIC — No task can create data that goes on to reference itself. That could cause an infinite loop, which could give rise to a problem or two. There are no cycles in DAGs.
"At Astronomer, we believe using a code-based data pipeline tool like Airflow should be a standard," says Kenten Danas, Lead Developer Advocate at Astronomer. There are many reasons for this, but these high-level concepts are crucial:

• Code-based pipelines are extremely dynamic. If you can write it in code, then you can do it in your data pipeline.
• Code-based pipelines are highly extensible. You can integrate with basically every system out there, as long as it has an API.
• Code-based pipelines are more manageable: Since everything is in code, it can integrate seamlessly into your source control, CI/CD, and general developer workflows. There's no need to manage external things differently.

An Example of a DAG

[Figure: an example DAG with seven nodes (A, B, C, D, E, F, G) connected by directed vertices.]

Consider the directed acyclic graph above. In this DAG, each vertex (line) has a specific direction (denoted by the arrow) connecting different nodes. This is the key quality of a directed graph: data can follow only in the direction of the vertex. In this example, data can go from A to B, but never B to A. In the same way that water flows through pipes in one direction, data must follow in the direction defined by the graph. Nodes from which a directed vertex extends are considered upstream, while nodes at the receiving end of a vertex are considered downstream.

In addition to data moving in one direction, nodes never become self-referential. That is, they can never inform themselves, as this could create an infinite loop. So data can go from A to B to C/D/E, but once there, no subsequent process can ever lead back to A/B/C/D/E as data moves down the graph. Data coming from a new source, such as node G, can still lead to nodes that are already connected, but no subsequent data can be passed back into G. This is the defining quality of an acyclic graph.

Why must this be true for data pipelines? If F had a downstream process in the form of D, we would see a graph where D informs E, which informs F, which informs D, and so on. It creates a scenario where the pipeline could run indefinitely without ever ending. Like water that never makes it to the faucet, such a loop would be a waste of data flow.

To put this example in real-world terms, imagine the DAG above represents a data engineering story:

• Node A could be the code for pulling data out of an API.
• Node B could be the code for anonymizing the data and dropping any IP address.
• Node D could be the code for checking that no duplicate record IDs exist.
• Node E could be putting that data into a database.
• Node F could be running a SQL query on the new tables to update a dashboard.
DAGs in Airflow

• DAG dependencies ensure that your data tasks are executed in the same order every time, making them reliable for your everyday data infrastructure.
• The graphing component of DAGs allows you to visualize dependencies in Airflow's user interface.
• Because every path in a DAG is linear, it's easy to develop and test your data pipelines against expected outcomes.

An Airflow DAG starts with a task written in Python. You can think of tasks as the nodes of your DAG: Each one represents a single action, and it can be dependent on both upstream and downstream tasks.

Tasks are wrapped by operators, which are the building blocks of Airflow, defining the behavior of their tasks. For example, a Python Operator task will execute a Python function, while a task wrapped in a Sensor Operator will wait for a signal before completing an action.

The following diagram shows how these concepts work in practice. As you can see, by writing a single DAG file in Python, you can begin to define complex relationships between data and actions.

[Figure: a DAG composed of tasks, each wrapped by an operator that defines the task's behavior.]
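To make these concepts concrete, here is a minimal sketch of a DAG file (the DAG ID, task names, and Python callable are hypothetical, not taken from the example above): each operator instantiation creates one task, and the bitshift operator wires up the dependency.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def _transform():
    # Placeholder Python callable; in a real DAG this would contain your logic.
    print("transforming data")

with DAG(
    dag_id="example_dag",            # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task is one node of the DAG, defined by an operator.
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting data'")
    transform = PythonOperator(task_id="transform", python_callable=_transform)

    # The >> operator sets the dependency (extract runs upstream of transform).
    extract >> transform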
You can see the flexibility of DAGs in the following real-world example. Using a single DAG (like the Customer Operations one shown in yellow), you are able to:

• Extract data from a legacy data store and load it into an AWS S3 bucket.
• Either train a data model or complete a data transformation, depending on the data you're using.
• Store the results of the previous action in a database.
• Send information about the entire process to various metrics and reporting systems.

Organizations use DAGs and pipelines that integrate with separate, interface-driven tools to extract, load, and transform data. But without an orchestration platform like Astro from Astronomer, these tools aren't talking to each other. If there's an error during the loading, the other tools won't know about it. The transformation will be run on bad data, or yesterday's data, and deliver an inaccurate report. It's easy to avoid this, though — a data orchestration platform can sit on top of everything, tying the DAGs together, orchestrating the dataflow, and alerting in case of failures. Overseeing the end-to-end life cycle of data allows businesses to maintain interdependency across all systems, which is vital for effective management of data.

From Operators to DagRuns: Implementing DAGs in Airflow

While DAGs are simple structures, defining them in code requires some more complex infrastructure and concepts beyond nodes and vertices. This is especially true when you need to execute DAGs on a frequent, reliable basis.

Airflow includes a number of structures that enable us to define DAGs in code. While they have unique names, they roughly equate to various concepts that we've discussed in the book thus far.

How Work Gets Executed in Airflow

• Operators are the building blocks of Airflow.
Operators contain the logic of how data is processed in a pipeline. There are different operators for different types of work: some operators execute general types of code, while others are designed to complete very specific types of work. We'll cover various types of operators in the Operators 101 chapter.
• Tasks are nodes in a DAG.
In Airflow, a DAG is a group of tasks that have been configured to run in a directed, acyclic manner. Airflow's Scheduler parses DAGs to find tasks which are ready for execution based on their dependencies. If a task is ready for execution, the Scheduler sends it to an Executor.

A real-time run of a task is called a task instance (it's also common to call this a task run). Airflow logs information about task instances, including their running time and status, in a metadata database.
Building Blocks

Scheduling and Timetables in Airflow

One of the most fundamental features of Apache Airflow is the ability to schedule jobs. Historically, Airflow users could schedule their DAGs by specifying a schedule_interval with a cron expression, a timedelta, or a preset Airflow schedule.

Timetables, released in Airflow 2.2, brought new flexibility to scheduling. Timetables allow users to create their own custom schedules using Python, effectively eliminating the limitations of cron. With timetables, you can now schedule DAGs to run at any time for any use case.

In this guide, we'll walk through Airflow scheduling concepts and the different ways you can schedule a DAG with a focus on timetables. For additional instruction on these concepts, check out our Scheduling in Airflow webinar.

There are a couple of terms and parameters in Airflow that are important to understand related to scheduling.

• Data interval
The data interval is a property of each DAG Run that represents the period of data that each task should operate on. For example, for a DAG scheduled hourly, each data interval will begin at the top of the hour (minute 0) and end at the close of the hour (minute 59). The DAG Run is typically executed at the end of the data interval, depending on whether your DAG's schedule has "gaps" in it.

• Logical Date
The logical date of a DAG Run is the same as the start of the data interval. It does not represent when the DAG will actually be executed. Prior to Airflow 2.2, this was referred to as the execution date. (The sketch after this list shows how these values appear in a task.)

• Backfilling and Catchup
We won't cover these concepts in depth here, but they can be related to scheduling. We recommend reading the Apache Airflow documentation on them to understand how they work and whether they're relevant for your use case.
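As a quick illustration of how these values surface in practice (a sketch; the DAG ID and task are hypothetical), the data interval boundaries are available to tasks as template variables:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="data_interval_demo",            # hypothetical
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # data_interval_start equals the logical date; the run is typically executed at data_interval_end.
    show_interval = BashOperator(
        task_id="show_interval",
        bash_command=(
            'echo "data_interval_start (logical date): {{ data_interval_start }}" && '
            'echo "data_interval_end: {{ data_interval_end }}"'
        ),
    )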
Parameters

The following parameters are derived from the concepts described above and are important for ensuring your DAG runs at the correct time.

Example

As a simple example of how these concepts work together, say we have a DAG that is scheduled to run every 5 minutes. Taking the most recent DAG Run, the logical date is 2021-10-08 19:12:36, which is the same as the data_interval_start shown in the screenshot below. The data_interval_end is 5 minutes later.

In the sections below, we'll walk through how to use the schedule_interval or timetables to schedule your DAG.
Schedule Interval

For pipelines with basic schedules, you can define a schedule_interval in your DAG. For versions of Airflow prior to 2.2, this is the only mechanism for defining a DAG's schedule.

Note: Do not make your DAG's schedule dynamic (e.g. datetime.now())! This will cause an error in the Scheduler.
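For example, a minimal sketch of a DAG with a basic schedule (the DAG ID and task are hypothetical; a timedelta or a preset such as "@daily" would work in the same spot as the cron expression):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="basic_schedule_example",          # hypothetical
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",            # cron expression: run at 6:00 every day
    # schedule_interval=timedelta(hours=1),   # a timedelta also works
    # schedule_interval="@daily",             # as does a preset Airflow schedule
    catchup=False,
) as dag:
    start = DummyOperator(task_id="start")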
Schedule Interval Limitations

The relationship between a DAG's schedule_interval and its logical_date leads to particularly unintuitive results when the spacing between DagRuns is irregular. The most common example of irregular spacing is when DAGs run only during business days (Mon-Fri). In this case, the DagRun with an execution_date of Friday will not run until Monday, even though all of Friday's data will be available on Saturday. This means that a DAG whose desired behavior is to summarize results at the end of each business day actually cannot be set using only the schedule_interval. In versions of Airflow prior to 2.2, one must instead schedule the DAG to run every day (including the weekend) and include logic in the DAG itself to skip all tasks for days on which the DAG doesn't really need to run.

In addition, it is difficult or impossible to implement situations like the following using a schedule interval:

• Schedule a DAG at different times on different days, like 2pm on Thursdays and 4pm on Saturdays.
• Schedule a DAG daily except for holidays.
• Schedule a DAG at multiple times daily with uneven intervals (e.g. 1pm and 4:30pm).

In the next section, we'll describe how these limitations were addressed in Airflow 2.2 with the introduction of timetables.

Timetables

Timetables, introduced in Airflow 2.2, address the limitations of cron expressions and timedeltas by allowing users to define their own schedules in Python code. All DAG schedules are determined by their internal timetable.

Going forward, timetables will become the primary method for scheduling in Airflow. You can still define a schedule_interval, but Airflow converts this to a timetable behind the scenes. And if a cron expression or timedelta is not sufficient for your use case, you can define your own timetable.

Custom timetables can be registered as part of an Airflow plugin. They must be a subclass of Timetable, and they should contain the following methods, both of which return a DataInterval with a start and an end:

• next_dagrun_info: Returns the data interval for the DAG's regular schedule
• infer_manual_data_interval: Returns the data interval when the DAG is manually triggered

Below we'll show an example of implementing these methods in a custom timetable.

Example Custom Timetable

For this implementation, let's run our DAG at 6:00 and 16:30. Because this schedule has run times with differing hours and minutes, it can't be represented by a single cron expression. But we can easily implement this schedule with a custom timetable!

To start, we need to define the next_dagrun_info and infer_manual_data_interval methods. Before diving into the code, it's helpful to think through what the data intervals will be for the schedule we want. Remember that the time the DAG runs (run_after) should be the end of the data interval since our interval has no gaps. So in this case, for a DAG that we want to run at 6:00 and 16:30, we have two different alternating intervals:

• Run at 6:00: Data interval is from 16:30 the previous day to 6:00 the current day.
• Run at 16:30: Data interval is from 6:00 to 16:30 the current day.

With that in mind, first we'll define next_dagrun_info. This method provides Airflow with the logic to calculate the data interval for scheduled runs. It also contains the logic to handle the DAG's start_date, end_date, and catchup parameters.
def next_dagrun_info(
    self,
    *,
    last_automated_data_interval: Optional[DataInterval],
    restriction: TimeRestriction,
) -> Optional[DagRunInfo]:
    if last_automated_data_interval is not None:  # There was a previous run on the regular schedule.
        last_start = last_automated_data_interval.start
        delta = timedelta(days=1)
        if last_start.hour == 6:  # If previous period started at 6:00, next period will start at 16:30 and end at 6:00 the following day
            next_start = last_start.set(hour=16, minute=30).replace(tzinfo=UTC)
            next_end = (last_start + delta).replace(tzinfo=UTC)
        else:  # If previous period started at 16:30, next period will start at 6:00 the next day and end at 16:30
            next_start = (last_start + delta).set(hour=6, minute=0).replace(tzinfo=UTC)
            next_end = (last_start + delta).replace(tzinfo=UTC)
    else:  # This is the first ever run on the regular schedule. First data interval will always start at 6:00 and end at 16:30
        next_start = restriction.earliest
        if next_start is None:  # No start_date. Don't schedule.
            return None
        if not restriction.catchup:  # If the DAG has catchup=False, today is the earliest to consider.
            next_start = max(next_start, DateTime.combine(Date.today(), Time.min).replace(tzinfo=UTC))
        next_start = next_start.set(hour=6, minute=0).replace(tzinfo=UTC)
        next_end = next_start.set(hour=16, minute=30).replace(tzinfo=UTC)
    if restriction.latest is not None and next_start > restriction.latest:
        return None  # Over the DAG's scheduled end; don't schedule.
    return DagRunInfo.interval(start=next_start, end=next_end)

Walking through the logic, this code is equivalent to:

If there was a previous run for the DAG:
• If the previous DAG Run started at 6:00, then the next DAG Run should start at 16:30 and end at 6:00 the next day.
• If the previous DAG Run started at 16:30, then the next DAG Run should start at 6:00 the next day and end at 16:30 the next day.

If it is the first run of the DAG:
• Check for a start date. If there isn't one, the DAG can't be scheduled.
• Check if catchup=False. If so, the earliest date to consider should be the current date. Otherwise it is the DAG's start date.
• We're mandating that the first DAG Run should always start at 6:00, so update the time of the interval start to 6:00 and the end to 16:30.

If the DAG has an end date, do not schedule the DAG after that date has passed.
Then we define the data interval for manually triggered DAG Runs by defining
the infer_manual_data_interval method. The code looks like this:
def infer_manual_data_interval(self, run_after: DateTime) -> DataInterval:
    delta = timedelta(days=1)
    # If time is between 6:00 and 16:30, period ends at 6:00 and starts at 16:30 the previous day
    if run_after >= run_after.set(hour=6, minute=0) and run_after <= run_after.set(hour=16, minute=30):
        start = (run_after - delta).set(hour=16, minute=30, second=0).replace(tzinfo=UTC)
        end = run_after.set(hour=6, minute=0, second=0).replace(tzinfo=UTC)
    # If time is after 16:30 but before midnight, period is between 6:00 and 16:30 the same day
    elif run_after >= run_after.set(hour=16, minute=30) and run_after.hour <= 23:
        start = run_after.set(hour=6, minute=0, second=0).replace(tzinfo=UTC)
        end = run_after.set(hour=16, minute=30, second=0).replace(tzinfo=UTC)
    # If time is after midnight but before 6:00, period is between 6:00 and 16:30 the previous day
    else:
        start = (run_after - delta).set(hour=6, minute=0).replace(tzinfo=UTC)
        end = (run_after - delta).set(hour=16, minute=30).replace(tzinfo=UTC)
    return DataInterval(start=start, end=end)

This method figures out what the most recent complete data interval is based on the current time. There are three scenarios:

• The current time is between 6:00 and 16:30: In this case, the data interval is from 16:30 the previous day to 6:00 the current day.
• The current time is after 16:30 but before midnight: In this case, the data interval is from 6:00 to 16:30 the current day.
• The current time is after midnight but before 6:00: In this case, the data interval is from 6:00 to 16:30 the previous day.

We need to account for time periods in the same timeframe (6:00 to 16:30) on different days than the day that the DAG is triggered, which requires three sets of logic. When defining custom timetables, always keep in mind what the last complete data interval should be based on when the DAG should run.

Now we can take those two methods and combine them into a Timetable class which will make up our Airflow plugin. The full custom timetable plugin is below:
# Imports needed for the plugin (not shown in the excerpts above)
from datetime import timedelta
from typing import Optional

from pendulum import Date, DateTime, Time, timezone

from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable

UTC = timezone("UTC")


class UnevenIntervalsTimetable(Timetable):

    def infer_manual_data_interval(self, run_after: DateTime) -> DataInterval:
        delta = timedelta(days=1)
        # If time is between 6:00 and 16:30, period ends at 6:00 and starts at 16:30 the previous day
        if run_after >= run_after.set(hour=6, minute=0) and run_after <= run_after.set(hour=16, minute=30):
            start = (run_after - delta).set(hour=16, minute=30, second=0).replace(tzinfo=UTC)
            end = run_after.set(hour=6, minute=0, second=0).replace(tzinfo=UTC)
        # If time is after 16:30 but before midnight, period is between 6:00 and 16:30 the same day
        elif run_after >= run_after.set(hour=16, minute=30) and run_after.hour <= 23:
            start = run_after.set(hour=6, minute=0, second=0).replace(tzinfo=UTC)
            end = run_after.set(hour=16, minute=30, second=0).replace(tzinfo=UTC)
        # If time is after midnight but before 6:00, period is between 6:00 and 16:30 the previous day
        else:
            start = (run_after - delta).set(hour=6, minute=0).replace(tzinfo=UTC)
            end = (run_after - delta).set(hour=16, minute=30).replace(tzinfo=UTC)
        return DataInterval(start=start, end=end)

    def next_dagrun_info(
        self,
        *,
        last_automated_data_interval: Optional[DataInterval],
        restriction: TimeRestriction,
    ) -> Optional[DagRunInfo]:
        if last_automated_data_interval is not None:  # There was a previous run on the regular schedule.
            last_start = last_automated_data_interval.start
            delta = timedelta(days=1)
            if last_start.hour == 6:  # If previous period started at 6:00, next period will start at 16:30 and end at 6:00 the following day
                next_start = last_start.set(hour=16, minute=30).replace(tzinfo=UTC)
                next_end = (last_start + delta).replace(tzinfo=UTC)
            else:  # If previous period started at 16:30, next period will start at 6:00 the next day and end at 16:30
                next_start = (last_start + delta).set(hour=6, minute=0).replace(tzinfo=UTC)
                next_end = (last_start + delta).replace(tzinfo=UTC)
        else:  # This is the first ever run on the regular schedule. First data interval will always start at 6:00 and end at 16:30
            next_start = restriction.earliest
            if next_start is None:  # No start_date. Don't schedule.
                return None
            if not restriction.catchup:  # If the DAG has catchup=False, today is the earliest to consider.
                next_start = max(next_start, DateTime.combine(Date.today(), Time.min).replace(tzinfo=UTC))
            next_start = next_start.set(hour=6, minute=0).replace(tzinfo=UTC)
            next_end = next_start.set(hour=16, minute=30).replace(tzinfo=UTC)
        if restriction.latest is not None and next_start > restriction.latest:
            return None  # Over the DAG's scheduled end; don't schedule.
        return DagRunInfo.interval(start=next_start, end=next_end)


class UnevenIntervalsTimetablePlugin(AirflowPlugin):
    name = "uneven_intervals_timetable_plugin"
    timetables = [UnevenIntervalsTimetable]
Note that because timetables are plugins, you will need to restart the Airflow Scheduler and Webserver after adding or updating them.

In the DAG, we can then import the custom timetable plugin and use it to schedule the DAG by setting the timetable parameter:
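A sketch of what that DAG file could look like (the DAG ID, task, and plugin module path are assumptions for illustration; adjust the import to wherever you placed the plugin file):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

# Assumes the plugin above lives in a file named uneven_intervals_timetable.py in your plugins folder.
from uneven_intervals_timetable import UnevenIntervalsTimetable

with DAG(
    dag_id="example_timetable_dag",           # hypothetical
    start_date=datetime(2021, 10, 9),
    timetable=UnevenIntervalsTimetable(),     # use the custom timetable instead of schedule_interval
    catchup=True,
) as dag:
    start = DummyOperator(task_id="start")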
Looking at the Tree View in the UI (as of Airflow 2.3, this has been replaced by the Grid View), we can see that this DAG has run twice per day at 6:00 and 16:30 since the start date of 10/9/2021.

If we run the DAG manually after 16:30 but before midnight, we can see the data interval for the triggered run was between 6:00 and 16:30 that day as expected.

This is a simple timetable that could easily be adjusted to suit other use cases. In general, timetables are completely customizable as long as the methods above are implemented.

Current Limitations

There are some limitations to keep in mind when implementing custom timetables:

• Timetable methods should return the same result every time they are called (e.g. avoid things like HTTP requests). They are not designed to implement event-based triggering.
Operators 101

Operators are the main building blocks of Airflow DAGs. They are classes that contain the logic for how to complete a unit of work.

You can use operators in Airflow by instantiating them in tasks. A task frames the work that an operator does within the context of a DAG.

To browse and search all of the available operators in Airflow, visit the Astronomer Registry. The following are examples of operators that are frequently used in Airflow projects.

BashOperator

t1 = BashOperator(
    task_id='bash_hello_world',
    dag=dag,
    bash_command='echo "Hello World"'
)

This BashOperator simply runs a bash command and echoes "Hello World".

Github: BashOperator Code

Python Operator

def hello(**kwargs):
    print('Hello from {kw}'.format(kw=kwargs['my_keyword']))

t2 = PythonOperator(
    task_id='python_hello',
    dag=dag,
    python_callable=hello,
    op_kwargs={'my_keyword': 'Airflow'}
)

The PythonOperator calls a Python function defined earlier in our code. You can pass parameters to the function via the op_kwargs parameter. This task will print "Hello from Airflow" when it runs.

Github: PythonOperator Code

Postgres Operator

t3 = PostgresOperator(
    task_id='PostgresOperator',
    sql='CREATE TABLE my_table (my_column varchar(10));',
    postgres_conn_id='my_postgres_connection',
    autocommit=False
)
This operator issues a SQL statement against a Postgres database. Credentials for the database are stored in an Airflow connection called my_postgres_connection. If you look at the code for the PostgresOperator, it uses a PostgresHook to actually interact with the database.

Github: PostgresOperator

SSH Operator

t4 = SSHOperator(
    task_id='SSHOperator',
    ssh_conn_id='my_ssh_connection',
    command='echo "Hello from SSH Operator"'
)

Like the BashOperator, the SSHOperator allows you to run a bash command, but has built-in support to SSH into a remote machine to run commands there.

The private key to authenticate to the remote server is stored in Airflow Connections as my_ssh_connection. This key can be referred to in all DAGs, so the operator itself only needs the command you want to run. This operator uses an SSHHook to establish the ssh connection and run the command.

Github: SSHOperator Code

S3 To Redshift Operator

t5 = S3ToRedshiftOperator(
    task_id='S3ToRedshift',
    schema='public',
    table='my_table',
    s3_bucket='my_s3_bucket',
    s3_key='{{ ds_nodash }}/my_file.csv',
    redshift_conn_id='my_redshift_connection',
    aws_conn_id='my_aws_connection'
)

The S3ToRedshiftOperator loads data from S3 to Redshift via Redshift's COPY command. This operator is a "Transfer Operator", which is a type of operator designed to move data from one system to another. In this case, we're moving data from S3 to Redshift using two separate Airflow Connections: one for S3 and one for Redshift.

This also uses another concept: macros and templates. In the s3_key parameter, Jinja template notation is used to pass the execution date for this DAG Run formatted as a string with no dashes (ds_nodash, a predefined macro in Airflow). It will look for a key formatted similarly to my_s3_bucket/20190711/my_file.csv, with the timestamp dependent on when the DAG ran.

Templates can be used to set runtime parameters (e.g. the range of data for an API call) and also make your code idempotent (each intermediary file is named for the data range it contains).
Hooks 101

Overview

Hooks are one of the fundamental building blocks of Airflow. At a high level, a hook is an abstraction of a specific API that allows Airflow to interact with an external system. Hooks are built into many operators, but they can also be used directly in DAG code.

In this guide, we'll cover the basics of using hooks in Airflow and when to use them directly in DAG code. We'll also walk through an example of implementing two different hooks in a DAG.

Hooks wrap around APIs and provide methods to interact with different external systems. Because hooks standardize the way you can interact with external systems, using them makes your DAG code cleaner, easier to read, and less prone to errors.

To use a hook, you typically only need a connection ID to connect with an external system. More information on how to set up connections can be found in Managing your Connections in Apache Airflow or in the example section below.

All hooks inherit from the BaseHook class, which contains the logic to set up an external connection given a connection ID. On top of making the connection to an external system, each hook might contain additional methods to perform various actions within that system. These methods might rely on different Python libraries for these interactions.

For example, the S3Hook, which is one of the most widely used hooks, relies on the boto3 library to manage its connection with S3. The S3Hook contains over 20 methods to interact with S3 buckets, including methods like:

• check_for_bucket: Checks if a bucket with a specific name exists.
• list_prefixes: Lists prefixes in a bucket according to specified parameters.
• list_keys: Lists keys in a bucket according to specified parameters.
• load_file: Loads a local file to S3.
• download_file: Downloads a file from the S3 location to the local file system.

A few guidelines for when to use hooks:

• If you write a custom operator to interact with an external system, it should use a hook to do so.
• If an operator with built-in hooks exists for your specific use case, then it is best practice to use the operator over setting up hooks manually.
• If you regularly need to connect to an API for which no hook exists yet, consider writing your own and sharing it with the community!
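As a small illustration of calling a hook directly inside a task (a sketch; the connection ID, bucket name, and prefix are hypothetical), ahead of the fuller example in the next section:

from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

@task
def list_raw_keys():
    # The hook only needs the ID of an Airflow connection to authenticate.
    s3_hook = S3Hook(aws_conn_id="my_aws_connection")
    # list_keys is one of the S3Hook methods listed above.
    keys = s3_hook.list_keys(bucket_name="my_s3_bucket", prefix="raw/")
    print(f"Found {len(keys)} keys")
    return keys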
Example Implementation

The following example shows how you can use two hooks (S3Hook and SlackHook) to retrieve values from files in an S3 bucket, run a check on them, post the result of the check on Slack, and log the response of the Slack API.

For this use case, we use hooks directly in our Python functions because none of the existing S3 Operators can read data from several files within an S3 bucket. Similarly, none of the existing Slack Operators can return the response of a Slack API call, which you might want to log for monitoring purposes.

The full source code of the hooks used can be found here:

• S3Hook source code
• SlackHook source code

Before running the example DAG, make sure you have the necessary Airflow providers installed. If you are using the Astro CLI, you can do this by adding the following packages to your requirements.txt:

apache-airflow-providers-amazon
apache-airflow-providers-slack

Next you will need to set up connections to the S3 bucket and Slack in the Airflow UI. Go to Admin -> Connections and click on the plus sign to add a new connection.

1. Select Amazon S3 as connection type for the S3 bucket (if the connection type is not showing up, double check that you installed the provider correctly) and provide the connection with your AWS access key ID as login and your AWS secret access key as password (see AWS documentation for how to retrieve your AWS access key ID and AWS secret access key).

2. Create a new connection. Select Slack Webhook as the connection type and provide your Bot User OAuth Token as a password. This token can be obtained by going to Features > OAuth & Permissions on api.slack.com/apps.

The DAG below uses Airflow Decorators to define tasks and XCom to pass information between them. The name of the S3 bucket and the names of the files that the first task reads are stored as environment variables for security purposes.
# importing necessary packages
import os
from datetime import datetime
from airflow import DAG
from airflow.decorators import task
from airflow.providers.slack.hooks.slack import SlackHook
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# import environmental variables for privacy (set in Dockerfile)
S3BUCKET_NAME = os.environ.get('S3BUCKET_NAME')
S3_EXAMPLE_FILE_NAME_1 = os.environ.get('S3_EXAMPLE_FILE_NAME_1')
S3_EXAMPLE_FILE_NAME_2 = os.environ.get('S3_EXAMPLE_FILE_NAME_2')
S3_EXAMPLE_FILE_NAME_3 = os.environ.get('S3_EXAMPLE_FILE_NAME_3')

# task to read 3 keys from your S3 bucket
@task.python
def read_keys_form_s3():
    s3_hook = S3Hook(aws_conn_id='hook_tutorial_s3_conn')
    response_file_1 = s3_hook.read_key(key=S3_EXAMPLE_FILE_NAME_1,
                                       bucket_name=S3BUCKET_NAME)
    response_file_2 = s3_hook.read_key(key=S3_EXAMPLE_FILE_NAME_2,
                                       bucket_name=S3BUCKET_NAME)
    response_file_3 = s3_hook.read_key(key=S3_EXAMPLE_FILE_NAME_3,
                                       bucket_name=S3BUCKET_NAME)

    response = {'num1': int(response_file_1),
                'num2': int(response_file_2),
                'num3': int(response_file_3)}

    return response

# task running a check on the data retrieved from your S3 bucket
@task.python
def run_sum_check(response):
    if response['num1'] + response['num2'] == response['num3']:
        return (True, response['num3'])
    return (False, response['num3'])

# task posting to slack depending on the outcome of the above check
# and returning the server response
@task.python
def post_to_slack(sum_check_result):
    slack_hook = SlackHook(slack_conn_id='hook_tutorial_slack_conn')

    if sum_check_result[0] == True:
        server_response = slack_hook.call(api_method='chat.postMessage',
                                          json={"channel": "#test-airflow",
                                                "text": f"""All is well in your bucket!
                                                Correct sum: {sum_check_result[1]}!"""})
    else:
        server_response = slack_hook.call(api_method='chat.postMessage',
                                          json={"channel": "#test-airflow",
                                                "text": f"""A test on your bucket contents failed!
                                                Target sum not reached: {sum_check_result[1]}"""})

    # return the response of the API call (for logging or use downstream)
    return server_response

# implementing the DAG
with DAG(dag_id='hook_tutorial',
         start_date=datetime(2022, 5, 20),
         schedule_interval='@daily',
         catchup=False,
         ) as dag:
    # the dependencies are automatically set by XCom
    response = read_keys_form_s3()
    sum_check_result = run_sum_check(response)
    post_to_slack(sum_check_result)

The DAG above completes the following steps:

1. Use a decorated Python Operator with a manually implemented S3Hook to read three specific keys from S3 with the read_key method. Returns a dictionary with the file contents converted to integers.

2. With the results of the first task, use a second decorated Python Operator to complete a simple sum check.

3. Post the result of the check to a Slack channel using the call method of the SlackHook and return the response from the Slack API.

Sensors 101

Sensors are a special kind of operator. When they run, they check to see if a certain criterion is met before they let downstream tasks execute. This is a great way to have portions of your DAG wait on some external check or process to complete.

To browse and search all of the available Sensors in Airflow, visit the Astronomer Registry. Take the following sensor as an example:

S3 Key Sensor

s1 = S3KeySensor(
    task_id='s3_key_sensor',
    bucket_key='{{ ds_nodash }}/my_file.csv',
    bucket_name='my_s3_bucket',
    aws_conn_id='my_aws_connection',
)

The S3KeySensor checks for the existence of a specified key in S3 every few seconds until it finds it or times out. If it finds the key, it will be marked as a success and allow downstream tasks to run. If it times out, it will fail and prevent downstream tasks from running.

S3KeySensor Code
Sensor Params

There are sensors for many use cases, such as ones that check a database for a certain row, wait for a certain time of day, or sleep for a certain amount of time. All sensors inherit from the BaseSensorOperator and have 4 parameters you can set on any sensor.

• soft_fail: Set to true to mark the task as SKIPPED on failure.
• poke_interval: Time in seconds that the job should wait in between each try. The poke interval should be more than one minute to prevent too much load on the scheduler.
• timeout: Time, in seconds, before the task times out and fails.
• mode: How the sensor operates. Options are: { poke | reschedule }, default is poke. When set to poke the sensor will take up a worker slot for its whole execution time (even between pokes). Use this mode if the expected runtime of the sensor is short or if a short poke interval is required; when set to reschedule, the sensor releases its worker slot between pokes (see the sketch after this list).
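Putting these parameters together, here is a sketch of the S3 key sensor from earlier configured for a longer wait (the connection and bucket names are hypothetical, and the exact import path depends on your Amazon provider version):

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_file = S3KeySensor(
    task_id="wait_for_file",
    bucket_key="{{ ds_nodash }}/my_file.csv",
    bucket_name="my_s3_bucket",
    aws_conn_id="my_aws_connection",
    soft_fail=True,          # mark the task as SKIPPED instead of FAILED if it never finds the key
    poke_interval=5 * 60,    # check every 5 minutes
    timeout=60 * 60 * 3,     # give up after 3 hours
    mode="reschedule",       # release the worker slot between pokes
)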
Deferrable Operators

Prior to Airflow 2.2, all task execution occurred within your worker resources. For tasks whose work was occurring outside of Airflow (e.g. a Spark Job), your tasks would sit idle waiting for a success or failure signal. These idle tasks would occupy worker slots for their entire duration, potentially queuing other tasks and delaying their start times.

With the release of Airflow 2.2, Airflow has introduced a new way to run tasks in your environment: deferrable operators. These operators leverage Python's asyncio library to efficiently run tasks waiting for an external resource to finish. This frees up your workers, allowing you to utilize those resources more effectively. In this guide, we'll walk through the concepts of deferrable operators, as well as the new components introduced to Airflow related to this feature.

Deferrable Operator Concepts

There are some terms and concepts that are important to understand when discussing deferrable operators:
Note: The terms "deferrable" and "async" or "asynchronous" are often used interchangeably. They mean the same thing in this context.

With traditional operators, a task might submit a job to an external system (e.g. a Spark Cluster) and then poll the status of that job until it completes. Even though the task might not be doing significant work, it would still occupy a worker slot during the polling process. As worker slots become occupied, tasks may be queued, resulting in delayed start times. Visually, this is represented in the diagram below:

[Figure: with a traditional operator, a single worker slot stays allocated across "Submit Job to Spark Cluster", "Poll Spark Cluster for Job Status", and the next "Submit Job to Spark Cluster".]

With deferrable operators, worker slots can be released while polling for job status. When the task is deferred (suspended), the polling process is offloaded as a trigger to the triggerer, freeing up the worker slot. The triggerer has the potential to run many asynchronous polling tasks concurrently, preventing this work from occupying your worker resources. When the terminal status for the job is received, the task resumes, taking a worker slot while it finishes. Visually, this is represented in the diagram below:

[Figure: with a deferrable operator, a worker slot is allocated for "Submit Job to Spark Cluster", the triggerer process handles "Poll Spark Cluster for Job Status", and a worker slot is allocated again to "Receive Terminal Status for Job on Spark Cluster".]

When and Why to Use Deferrable Operators

In general, deferrable operators should be used whenever you have tasks that occupy a worker slot while polling for a condition in an external system. For example, using deferrable operators for sensor tasks (e.g. poking for a file on an SFTP server) can result in efficiency gains and reduced operational costs. In particular, if you are currently working with smart sensors, you should consider using deferrable operators instead. Compared to smart sensors, which were deprecated in Airflow 2.2.4, deferrable operators are more flexible and better supported by Airflow.

Currently, the following deferrable operators are available in Airflow:

• TimeSensorAsync
• DateTimeSensorAsync

However, this list will grow quickly as the Airflow community makes more investments into these operators. In the meantime, you can also create your own (more on this in the last section of this guide). Additionally, Astronomer maintains some deferrable operators available only on Astro Runtime.

There are numerous benefits to using deferrable operators. Some of the most notable are:

• Reduced resource consumption: Depending on the available resources and the workload of your triggers, you can run hundreds to thousands of deferred tasks in a single triggerer process. This can lead to a reduction in the number of workers needed to run tasks during periods of high concurrency. With fewer workers needed, you are able to scale down the underlying infrastructure of your Airflow environment.

• Resiliency against restarts: Triggers are stateless by design. This means your deferred tasks will not be set to a failure state if a triggerer needs to be restarted due to a deployment or infrastructure issue. Once a triggerer is back up and running in your environment, your deferred tasks will resume.

• Paves the way to event-based DAGs: The presence of asyncio in core Airflow is a potential foundation for event-triggered DAGs.
Example Workflow Using Deferrable Operators

Let's say we have a DAG that is scheduled to run a sensor every minute, where each task can take up to 20 minutes. Using the default settings with 1 worker, we can see that after 20 minutes we have 16 tasks running, each holding a worker slot:

from datetime import datetime
from airflow import DAG
from airflow.sensors.date_time import DateTimeSensor

with DAG(
    "sync_dag",
    start_date=datetime(2021, 12, 22, 20, 0),
    end_date=datetime(2021, 12, 22, 20, 19),
    schedule_interval="* * * * *",
    catchup=True,
    max_active_runs=32,
    max_active_tasks=32
) as dag:

    sync_sensor = DateTimeSensor(
        task_id="sync_task",
        target_time="""{{ macros.datetime.utcnow() + macros.timedelta(minutes=20) }}""",
    )

Because worker slots are held during task execution time, we would need at least 20 worker slots available for this DAG to ensure that future runs are not delayed. To increase concurrency, we would need to add additional resources to our Airflow infrastructure (e.g. another worker pod).
In contrast, the same workflow implemented with the deferrable DateTimeSensorAsync looks like this:

from datetime import datetime
from airflow import DAG
from airflow.sensors.date_time import DateTimeSensorAsync

with DAG(
    "async_dag",
    start_date=datetime(2021, 12, 22, 20, 0),
    end_date=datetime(2021, 12, 22, 20, 19),
    schedule_interval="* * * * *",
    catchup=True,
    max_active_runs=32,
    max_active_tasks=32
) as dag:

    async_sensor = DateTimeSensorAsync(
        task_id="async_task",
        target_time="""{{ macros.datetime.utcnow() + macros.timedelta(minutes=20) }}""",
    )

Running Deferrable Tasks in Your Airflow Environment

To run deferrable operators, you need a triggerer process running in your Airflow environment in addition to the Scheduler; you can start one with the airflow triggerer CLI command.

Note that if you are running Airflow on Astro, the triggerer runs automatically if you are on Astro Runtime 4.0+. If you are using Astronomer Software 0.26+, you can add a triggerer to an Airflow 2.2+ deployment in the Deployment Settings tab. This guide details the steps for configuring this feature in the platform.

As tasks are raised into a deferred state, triggers are registered in the triggerer. You can set the number of concurrent triggers that can run in a single triggerer process with the default_capacity configuration setting in Airflow. This can also be set via the AIRFLOW__TRIGGERER__DEFAULT_CAPACITY environment variable. By default, this variable's value is 1,000.

High Availability

Note that triggers are designed to be highly-available. You can implement this by starting multiple triggerer processes. Similar to the HA scheduler introduced in Airflow 2.0, Airflow ensures that they co-exist with correct locking and HA. You can reference the Airflow docs for further information on this topic.

Creating Your Own Deferrable Operator

If you have an operator that would benefit from being asynchronous but does not yet exist in OSS Airflow or Astro Runtime, you can create your own. The Airflow docs have great instructions to get you started.
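As a bare-bones sketch of that pattern, modeled on the example in the Airflow documentation, a deferrable sensor calls self.defer() with a trigger and the name of the method to resume in once the trigger fires:

from datetime import timedelta
from airflow.sensors.base import BaseSensorOperator
from airflow.triggers.temporal import TimeDeltaTrigger

class WaitOneHourSensor(BaseSensorOperator):
    """A minimal deferrable sensor: waits one hour without holding a worker slot."""

    def execute(self, context):
        # Defer execution: the worker slot is released and the trigger is handed to the triggerer.
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(hours=1)),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Called on a worker once the trigger fires; the task then finishes immediately.
        return None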
3. DAG Design

Because Airflow is 100% code, knowing the basics of Python is all it takes to get started writing DAGs. However, writing DAGs that are efficient, secure, and scalable requires some Airflow-specific finesse. In this section, we will cover some best practices for developing DAGs that make the most of what Airflow has to offer.

In general, most of the best practices we cover here fall into one of two categories:

• DAG design
• Using Airflow as an orchestrator

Reviewing Idempotency

Before we jump into best practices specific to Airflow, we need to review one concept which applies to all data pipelines: idempotency. Designing idempotent DAGs decreases recovery time from failures and prevents data loss.

The following DAG design principles will help to make your DAGs idempotent, efficient, and readable.

Keep Tasks Atomic

When breaking up your pipeline into individual tasks, ideally each task should be atomic. This means each task should be responsible for one operation that can be rerun independently of the others. Said another way, in an atomized task, a success in part of the task means a success of the entire task.

For example, in an ETL pipeline you would ideally want your Extract, Transform, and Load operations covered by three separate tasks. Atomizing these tasks allows you to rerun each operation in the pipeline independently, which supports idempotence.
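Here is a sketch of that idea using the TaskFlow API (the DAG name, task bodies, and sample data are placeholders): each of the three operations is its own task, so any one of them can be rerun on its own.

from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2022, 1, 1), schedule_interval="@daily", catchup=False)
def atomic_etl():

    @task
    def extract():
        # Pull raw records from the source system.
        return [{"id": 1, "value": 10}]

    @task
    def transform(records):
        # Apply business logic to the extracted records.
        return [{**r, "value": r["value"] * 2} for r in records]

    @task
    def load(records):
        # Write the transformed records to the destination.
        print(f"loading {len(records)} records")

    load(transform(extract()))

atomic_etl()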
Contrary to our best practices, the following example defines variables based on datetime Python functions:

# Variables used by tasks
# Bad example - Define today's and yesterday's date using datetime module
today = datetime.today()
yesterday = datetime.today() - timedelta(1)

If this code is in a DAG file, these functions will be executed on every Scheduler heartbeat, which may not be performant. Even more importantly, this doesn't produce an idempotent DAG: If you needed to rerun a previously failed DAG Run for a past date, you wouldn't be able to because datetime.today() is relative to the current date, not the DAG execution date. A better way of implementing this is by using an Airflow variable:

# Variables used by tasks
# Good example - Define yesterday's date with an Airflow variable
yesterday = {{ yesterday_ds_nodash }}

You can use one of Airflow's many built-in variables and macros, or you can create your own templated field to pass in information at runtime. For more on this topic, check out our guide on templating and macros in Airflow.

Incremental Record Filtering

It is ideal to break out your pipelines into incremental extracts and loads wherever possible. For example, if you have a DAG that runs hourly, each DAG Run should process only records from that hour, rather than the whole dataset. When the results in each DAG Run represent only a small subset of your total dataset, a failure in one subset of the data won't prevent the rest of your DAG Runs from completing successfully. And if your DAGs are idempotent, you can rerun a DAG for only the data that failed rather than reprocessing the entire dataset.

There are multiple ways you can achieve incremental pipelines. The two best and most common methods are described below.

• Last Modified Date
Using a "last modified" date is the gold standard for incremental loads. Ideally, each record in your source system has a column containing the last time the record was modified. With this design, a DAG Run looks for records that were updated within specific dates from this column. For example, with a DAG that runs hourly, each DAG Run will be responsible for loading any records that fall between the start and end of its hour. If any of those runs fail, it will not impact other Runs. (A sketch of this pattern follows after this list.)

• Sequence IDs
When a last modified date is not available, a sequence or incrementing ID can be used for incremental loads. This logic works best when the source records are only being appended to and never updated. While we recommend implementing a "last modified" date system in your records if possible, basing your incremental logic off of a sequence ID can be a sound way to filter pipeline records without a last modified date.
Avoid Top-Level Code in Your DAG File

In the context of Airflow, we use "top-level code" to mean any code that isn't part of your DAG or operator instantiations.

Airflow executes all code in the dags_folder on every min_file_process_interval, which defaults to 30 seconds (you can read more about this parameter in the Airflow docs). Because of this, top-level code that makes requests to external systems, like an API or a database, or makes function calls outside of your tasks can cause performance issues. Additionally, including code that isn't part of your DAG or operator instantiations in your DAG file makes the DAG harder to read, maintain, and update.

Treat your DAG file like a config file and leave all of the heavy lifting to the hooks and operators that you instantiate within the file. If your DAGs need to access additional code such as a SQL script or a Python function, keep that code in a separate file that can be read into a DAG Run.

For one example of what not to do, in the DAG below a PostgresOperator executes a SQL query that was dropped directly into the DAG file:

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime, timedelta

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

# Instantiate DAG
with DAG('bad_practices_dag_1',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False
         ) as dag:

    t0 = DummyOperator(task_id='start')

    # Bad example with top level SQL code in the DAG file
    query_1 = PostgresOperator(
        task_id='covid_query_wa',
        postgres_conn_id='postgres_default',
        sql='''with yesterday_covid_data as (
                SELECT *
                FROM covid_state_data
                WHERE date = {{ params.today }}
                AND state = 'WA'
            ),
            today_covid_data as (
                SELECT *
                FROM covid_state_data
                WHERE date = {{ params.yesterday }}
                AND state = 'WA'
            ),
            two_day_rolling_avg as (
                SELECT AVG(a.state, b.state) as two_day_avg
                FROM yesterday_covid_data a
                JOIN yesterday_covid_data b
                ON a.state = b.state
            )
            SELECT a.state, b.state, c.two_day_avg
            FROM yesterday_covid_data a
            JOIN today_covid_data b
            ON a.state=b.state
            JOIN two_day_rolling_avg c
            ON a.state=b.two_day_avg;''',
        params={'today': today, 'yesterday': yesterday}
    )

Keeping the query in the DAG file like this makes the DAG harder to read and maintain. Instead, in the DAG below we call in a file named covid_state_query.sql into our PostgresOperator instantiation, which embodies the best practice:

# Instantiate DAG
with DAG('good_practices_dag_1',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False,
         template_searchpath='/usr/local/airflow/include'  # include path to look for external files
         ) as dag:

    query = PostgresOperator(
        task_id='covid_query_{0}'.format(state),
        postgres_conn_id='postgres_default',
        sql='covid_state_query.sql',  # reference query kept in separate file
        params={'state': "'" + state + "'"}
    )
Use a Consistent Method for Task Dependencies

In Airflow, task dependencies can be set multiple ways. You can use the set_upstream() and set_downstream() functions, or you can use the << and >> operators. Which method you use is a matter of personal preference, but for readability it's best practice to choose one method and stick with it. For example, instead of mixing the two methods in one DAG, pick one style and use it throughout.
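Here is a minimal sketch with hypothetical tasks t0 through t3; both snippets define the same dependencies, but the second is easier to scan:

# Inconsistent: mixing function calls and bitshift operators
t0.set_downstream(t1)
t1 >> t2
t3.set_upstream(t2)

# Consistent: one style throughout (here, bitshift operators)
t0 >> t1 >> t2 >> t3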
Leverage Airflow Features

The next category of best practices relates to using Airflow as what it was originally designed to be: an orchestrator. Using Airflow as an orchestrator makes it easier to scale and pull in the right tools based on your needs.

For easy discovery of all the great provider packages out there, check out the Astronomer Registry.
We recommend that you consider the size of your data now and in the future when deciding whether to process data within Airflow or offload to an external tool. If your use case is well suited to processing data within Airflow, then we would recommend the following:

Depending on your data retention policy, you could modify the load logic and rerun the entire historical pipeline without having to rerun the extracts. This is also useful in situations where you no longer have access to the source system (e.g. you hit an API limit).

Use an ELT Framework

Whenever possible, look to implement an ELT (extract, load, transform) data pipeline pattern with your DAGs. This means that you should look to offload as much of the transformation logic to the source systems or the destination systems as possible, which leverages the strengths of all tools in your data ecosystem. Many modern data warehouse tools, such as Snowflake, give you easy access to compute to support the ELT framework, and are easily used in conjunction with Airflow.

Other Best Practices

Finally, here are a few other noteworthy best practices that don't fall under the two categories above.

Use DAG Name and Start Date Properly

You should always use a static start_date with your DAGs. A dynamic start_date is misleading and can cause failures when clearing out failed task instances and missing DAG runs.

Additionally, if you change the start_date of your DAG you should also change the DAG name. Changing the start_date of a DAG creates a new entry in Airflow's database, which could confuse the scheduler because there will be two DAGs with the same name but different schedules.

Changing the name of a DAG also creates a new entry in the database, which powers the dashboard, so follow a consistent naming convention, since changing a DAG's name doesn't delete the entry in the database for the old name.
Set Retries at the DAG Level

Even if your code is perfect, failures happen. In a distributed environment where task containers are executed on shared hosts, it's possible for tasks to be killed off unexpectedly. When this happens, you might see Airflow's logs mention a zombie process.

Issues like this can be resolved by using task retries. The best practice is to set retries as a default_arg so they are applied at the DAG level and get more granular for specific tasks only where necessary. A good range to try is ~2–4 retries.
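For example (a sketch with illustrative values): retries set in default_args apply to every task in the DAG, and an individual task can still override them where needed.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy import DummyOperator

default_args = {
    "retries": 3,                      # applied to every task in this DAG
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="retries_example",          # hypothetical
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    regular_task = DummyOperator(task_id="regular_task")
    # Override at the task level only where necessary:
    fragile_task = DummyOperator(task_id="fragile_task", retries=5)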
Ensure Idempotency

An important concept for any data pipeline, including an Airflow DAG, is idempotency. This is the property whereby an operation can be applied multiple times without changing the result. We often hear about this concept as it applies to your entire DAG: if you execute the same DAGRun multiple times, you will get the same result. However, this concept also applies to tasks within your DAG; if every task in your DAG is idempotent, your full DAG will be idempotent as well.

When designing a DAG that passes data between tasks, it is important to ensure that each task is idempotent. This will help you recover and ensure no data is lost should you have any failures.
Any time a task returns a value (e.g. if your Python callable for your PythonOperator has a return), that value will automatically be pushed to XCom. Tasks can also be configured to push XComs by calling the xcom_push() method. Similarly, xcom_pull() can be used in a task to receive an XCom.

You can view your XComs in the Airflow UI by navigating to Admin > XComs. You should see something like this:

XCom cannot be used for passing large data sets between tasks. The limit for the size of the XCom is determined by which metadata database you are using:

• Postgres: 1 Gb
• SQLite: 2 Gb
• MySQL: 64 Kb

You can see that these limits aren't very big. And even if you think your data might squeak just under, don't use XComs. Instead, see the section below on using intermediary data storage, which is more appropriate for larger chunks of data.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

import requests
import json

url = 'https://covidtracking.com/api/v1/states/'
state = 'wa'

def get_testing_increase(state, ti):
    """
    Gets totalTestResultsIncrease field from Covid API for given state and returns value
    """
    res = requests.get(url+'{0}/current.json'.format(state))
    testing_increase = json.loads(res.text)['totalTestResultsIncrease']

    ti.xcom_push(key='testing_increase', value=testing_increase)

def analyze_testing_increases(state, ti):
    """
    Evaluates testing increase results
    """
    testing_increases = ti.xcom_pull(key='testing_increase',
                                     task_ids='get_testing_increase_data_{0}'.format(state))
    print('Testing increases for {0}:'.format(state), testing_increases)
    # run some analysis here

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('xcom_dag',
         start_date=datetime(2021, 1, 1),
         max_active_runs=2,
         schedule_interval=timedelta(minutes=30),
         default_args=default_args,
         catchup=False
         ) as dag:

    opr_get_covid_data = PythonOperator(
        task_id='get_testing_increase_data_{0}'.format(state),
        python_callable=get_testing_increase,
        op_kwargs={'state': state}
    )

    opr_analyze_testing_data = PythonOperator(
        task_id='analyze_data',
        python_callable=analyze_testing_increases,
        op_kwargs={'state': state}
    )

    opr_get_covid_data >> opr_analyze_testing_data
In this DAG we have two PythonOperator tasks which share data using the xcom_push and xcom_pull functions. Note that in the get_testing_increase function, we used the xcom_push method so that we could specify the key name. Alternatively, we could have made the function return the testing_increase value, because any value returned by an operator in Airflow will automatically be pushed to XCom; if we had used this method, the XCom key would be "return_value".

If we run this DAG and then go to the XComs page in the Airflow UI, we see that a new row has been added for our get_testing_increase_data_wa task with the key testing_increase and the value returned from the API.

Another way to implement this use case is to use the TaskFlow API that was released with Airflow 2.0. With the TaskFlow API, returned values are pushed to XCom as usual, but XCom values can be pulled simply by adding the key as an input to the function as shown in the following DAG:
# url, state, default_args, requests, and json are defined as in the previous example
@dag('xcom_taskflow_dag', schedule_interval='@daily',
     default_args=default_args, catchup=False)
def taskflow():

    @task
    def get_testing_increase(state):
        """
        Gets totalTestResultsIncrease field from Covid API for
        given state and returns value
        """
        res = requests.get(url+'{0}/current.json'.format(state))
        return {'testing_increase': json.loads(res.text)['totalTestResultsIncrease']}

    @task
    def analyze_testing_increases(testing_increase: int):
        """
        Evaluates testing increase results
        """
        print('Testing increases for {0}:'.format(state), testing_increase)
        # run some analysis here

    analyze_testing_increases(get_testing_increase(state))

dag = taskflow()

This DAG is functionally the same as the first one, but thanks to the TaskFlow API there is less code required overall and no additional code required for passing the data between the tasks using XCom.

Intermediary Data Storage

As mentioned above, XCom can be a great option for sharing data between tasks because it doesn't rely on any tools external to Airflow itself. However, it is only designed to be used for very small amounts of data. What if the data you need to pass is a little bit larger, for example, a small dataframe?

The best way to manage this use case is to use intermediary data storage. This means saving your data to some system external to Airflow at the end of one task, then reading it in from that system in the next task. This is commonly done using cloud file storage such as S3, GCS, Azure Blob Storage, etc., but it could also be done by loading the data into either a temporary or persistent table in a database.

We will note here that while this is a great way to pass data that is too large to be managed with XCom, you should still exercise caution. Airflow is meant to be an orchestrator, not an execution framework. If your data is very large, it is probably a good idea to complete any processing using a framework like Spark, a compute-optimized data warehouse like Snowflake, or a transformation tool like dbt.

Example DAG

Building on our Covid example above, let's say instead of a specific value of testing increases, we are interested in getting all of the daily Covid data for a state and processing it. This case would not be ideal for XCom, but since the data returned is a small dataframe, it is likely okay to process using Airflow.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from datetime import datetime, timedelta

from io import StringIO
import pandas as pd
import requests

s3_conn_id = 's3-conn'
bucket = 'astro-workshop-bucket'
state = 'wa'
date = '{{ yesterday_ds_nodash }}'

def upload_to_s3(state, date):
    '''Grabs data from Covid endpoint and saves to flat file on S3
    '''
    # Connect to S3
    s3_hook = S3Hook(aws_conn_id=s3_conn_id)

    # Get data from API
    url = 'https://covidtracking.com/api/v1/states/'
    res = requests.get(url+'{0}/{1}.csv'.format(state, date))

    # Save data to CSV on S3
    s3_hook.load_string(res.text, '{0}_{1}.csv'.format(state, date),
                        bucket_name=bucket, replace=True)

def process_data(state, date):
    '''Reads data from S3, processes, and saves to new S3 file
    '''
    # Connect to S3
    s3_hook = S3Hook(aws_conn_id=s3_conn_id)

    # Read data
    data = StringIO(s3_hook.read_key(key='{0}_{1}.csv'.format(state, date),
                                     bucket_name=bucket))
    df = pd.read_csv(data, sep=',')

    # Process data
    processed_data = df[['date', 'state', 'positive', 'negative']]

    # Save processed data to CSV on S3
    s3_hook.load_string(processed_data.to_string(),
                        '{0}_{1}_processed.csv'.format(state, date),
                        bucket_name=bucket, replace=True)

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

with DAG('intermediary_data_storage_dag',
         start_date=datetime(2021, 1, 1),
         max_active_runs=1,
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False
         ) as dag:

    generate_file = PythonOperator(
        task_id='generate_file_{0}'.format(state),
        python_callable=upload_to_s3,
        op_kwargs={'state': state, 'date': date}
    )

    process_data = PythonOperator(
        task_id='process_data_{0}'.format(state),
        python_callable=process_data,
        op_kwargs={'state': state, 'date': date}
    )

    generate_file >> process_data

In this DAG we make use of the S3Hook to save data retrieved from the API to a CSV on S3 in the generate_file task. The process_data task then grabs that data from S3, converts it to a dataframe for processing, and then saves the processed data back to a new CSV on S3.

Using Task Groups in Airflow

Overview

Prior to the release of Airflow 2.0 in December 2020, the only way to group tasks and create modular workflows within Airflow was to use SubDAGs. SubDAGs were a way of presenting a cleaner-looking DAG by capitalizing on code patterns. For example, ETL DAGs usually share a pattern of tasks that extract data from a source, transform the data, and load it somewhere. The SubDAG would visually group the repetitive tasks into one UI task, making the pattern between tasks clearer.

However, SubDAGs were really just DAGs embedded in other DAGs. This caused both performance and functional issues:

• When a SubDAG is triggered, the SubDAG and child tasks take up worker slots until the entire SubDAG is complete. This can delay other task processing and, depending on your number of worker slots, can lead to deadlocking.
• SubDAGs have their own parameters, schedule, and enabled settings. When these are not consistent with their parent DAG, unexpected behavior can occur.

Unlike SubDAGs, Task Groups are just a UI grouping concept. Starting in Airflow 2.0, you can use Task Groups to organize tasks within your DAG's graph view in the Airflow UI. This avoids the added complexity and performance issues of SubDAGs, all while using less code!

In this section, we will walk through how to create Task Groups and show some example DAGs to demonstrate their scalability.
Creating Task Groups

To use Task Groups you'll need to use the following import statement.

from airflow.utils.task_group import TaskGroup

For our first example, we will instantiate a Task Group using a with statement and provide a group_id. Inside our Task Group, we will define our two tasks, t1 and t2, and their respective dependencies.

You can use dependency operators (<< and >>) on Task Groups in the same way that you can with individual tasks. Dependencies applied to a Task Group are applied across its tasks. In the following code, we add dependencies between the Task Group and two additional tasks, t0 and t3, which automatically applies the same dependencies across t1 and t2:

t0 = DummyOperator(task_id='start')

# Start Task Group definition
with TaskGroup(group_id='group1') as tg1:
    t1 = DummyOperator(task_id='task1')
    t2 = DummyOperator(task_id='task2')

    t1 >> t2
# End Task Group definition

t3 = DummyOperator(task_id='end')

# Set Task Group's (tg1) dependencies
t0 >> tg1 >> t3

In the Airflow UI, Task Groups look like tasks with blue shading. When we expand group1 by clicking on it, we see blue circles where the Task Group's dependencies have been applied to the grouped tasks. The task(s) immediately to the right of the first blue circle (t1) get the group's upstream dependencies and the task(s) immediately to the left (t2) of the last blue circle get the group's downstream dependencies.

Note: When your task is within a Task Group, your callable task_id will be the task_id prefixed with the group_id (i.e. group_id.task_id). This ensures the uniqueness of the task_id across the DAG. This is important to remember when calling specific tasks with XCom passing or branching operator decisions.
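As a minimal illustration of that prefixed task_id (the downstream task here is hypothetical, and it assumes task1 has pushed an XCom), pulling a value from a task inside group1 might look like this:

from airflow.operators.python import PythonOperator

def print_group1_value(ti):
    # The task lives inside group1, so its full task_id is "group1.task1".
    value = ti.xcom_pull(task_ids='group1.task1')
    print(value)

check_value = PythonOperator(
    task_id='check_value',
    python_callable=print_group1_value,
)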
Dynamically Generating Task Groups

Just like with DAGs, Task Groups can be dynamically generated to make use of patterns within your code. In an ETL DAG, you might have similar downstream tasks that can be processed independently, such as when you call different API endpoints for data that needs to be processed and stored in the same way. For this use case, we can dynamically generate Task Groups by API endpoint. Just like with manually written Task Groups, generated Task Groups can be drilled into from the Airflow UI to see specific tasks.

In the code below, we use iteration to create multiple Task Groups. While the tasks and dependencies remain the same across Task Groups, we can change which parameters are passed in to each Task Group based on the group_id:
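A minimal sketch of such a loop (inside a DAG definition, and mirroring the group and task naming used in the surrounding examples) might look like the following:

from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.task_group import TaskGroup

# Illustrative: generate two Task Groups, group1 and group2,
# each containing the same task pattern.
for g_id in range(1, 3):
    with TaskGroup(group_id=f'group{g_id}') as tg1:
        t1 = DummyOperator(task_id='task1')
        t2 = DummyOperator(task_id='task2')

        t1 >> t2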
This screenshot shows the expanded view of the Task Groups we generated above in the Airflow UI:

By default, using a loop to generate your Task Groups will put them in parallel. If your Task Groups are dependent on elements of another Task Group, you'll want to run them sequentially. For example, when loading tables with foreign keys, your primary table records need to exist before you can load your foreign table.
In the example below, our third dynamically generated Task Group has a foreign
key constraint on both our first and second dynamically generated Task Groups, so
we will want to process it last. To do this, we will create an empty list and append
our Task Group objects as they are generated. Using this list, we can reference the
Task Groups and define their dependencies to each other:
groups = []
for g_id in range(1, 4):
    tg_id = f'group{g_id}'
    with TaskGroup(group_id=tg_id) as tg1:
        t1 = DummyOperator(task_id='task1')
        t2 = DummyOperator(task_id='task2')

        t1 >> t2

        if tg_id == 'group1':
            t3 = DummyOperator(task_id='task3')
            t1 >> t3

    groups.append(tg1)

[groups[0], groups[1]] >> groups[2]

The following screenshot shows how these Task Groups appear in the Airflow UI:

Conditioning on Task Groups

In the above example, we added an additional task to group1 based on our group_id. This was to demonstrate that even though we are dynamically creating Task Groups to take advantage of patterns, we can still introduce variations to the pattern while avoiding code redundancies from building each Task Group definition manually.

Nesting Task Groups

For additional complexity, you can nest Task Groups. Building on our previous ETL example, when calling API endpoints, we may need to process new records for each endpoint before we can process updates to them.

In the following code, our top-level Task Groups represent our new and updated record processing, while the nested Task Groups represent our API endpoint processing:

groups = []
for g_id in range(1, 3):
    with TaskGroup(group_id=f'group{g_id}') as tg1:
        t1 = DummyOperator(task_id='task1')
        t2 = DummyOperator(task_id='task2')

        sub_groups = []
        for s_id in range(1, 3):
            with TaskGroup(group_id=f'sub_group{s_id}') as tg2:
                st1 = DummyOperator(task_id='task1')
                st2 = DummyOperator(task_id='task2')

                st1 >> st2
                sub_groups.append(tg2)

        t1 >> sub_groups >> t2
    groups.append(tg1)

groups[0] >> groups[1]

The following screenshot shows the expanded view of the nested Task Groups in the Airflow UI:

Takeaways

Task Groups are a dynamic and scalable UI grouping concept that eliminates the functional and performance issues of SubDAGs.

Ultimately, Task Groups give you the flexibility to group and organize your tasks in a number of ways. To help guide your implementation of Task Groups, think about:
Cross-DAG Dependencies

• Upstream DAG: A DAG that must reach a specified state before a downstream DAG can run
• Downstream DAG: A DAG that cannot run until an upstream DAG reaches a specified state

According to the Airflow documentation on cross-DAG dependencies, designing DAGs in this way can be useful when:

• Two DAGs are dependent, but they have different schedules.
• Two DAGs are dependent, but they are owned by different teams.
• A task depends on another task but for a different execution date.

For any scenario where you have dependent DAGs, we've got you covered! In this section, we will discuss multiple methods for implementing cross-DAG dependencies, including how to implement dependencies if your dependent DAGs are located in different Airflow deployments.

Implementing Cross-DAG Dependencies

TriggerDagRunOperator

The TriggerDagRunOperator is an easy way to implement cross-DAG dependencies. This Operator allows you to have a task in one DAG that triggers another DAG in the same Airflow environment.

The TriggerDagRunOperator is ideal in situations where you have one upstream DAG that needs to trigger one or more downstream DAGs. It can also work if you have dependent DAGs that have both upstream and downstream tasks in the upstream DAG (i.e. the dependent DAG is in the middle of tasks in the upstream DAG). Because you can use this operator for any task in your DAG, it is highly flexible. It's also an ideal replacement for SubDAGs.

Below is an example DAG that implements the TriggerDagRunOperator to trigger the dependent-dag between two other tasks.
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from datetime import datetime, timedelta

def print_task_type(**kwargs):
    """
    Dummy function to call before and after dependent DAG.
    """
    print(f"The {kwargs['task_type']} task has completed.")

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('trigger-dagrun-dag',
         start_date=datetime(2021, 1, 1),
         max_active_runs=1,
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False
         ) as dag:

    start_task = PythonOperator(
        task_id='starting_task',
        python_callable=print_task_type,
        op_kwargs={'task_type': 'starting'}
    )

    trigger_dependent_dag = TriggerDagRunOperator(
        task_id="trigger_dependent_dag",
        trigger_dag_id="dependent-dag",
        wait_for_completion=True
    )

    end_task = PythonOperator(
        task_id='end_task',
        python_callable=print_task_type,
        op_kwargs={'task_type': 'ending'}
    )

    start_task >> trigger_dependent_dag >> end_task

In the following graph view, you can see that the trigger_dependent_dag task in the middle is the TriggerDagRunOperator, which runs the dependent-dag.
There are a couple of things to note when using this Operator:

• If your dependent DAG requires a config input or a specific execution date, these can be specified in the operator using the conf and execution_date params respectively.
• If your upstream DAG has downstream tasks that require the downstream DAG to finish first, you should set the wait_for_completion param to True as shown in the example above. This param defaults to False, meaning once the downstream DAG has started, the upstream DAG will mark the task as a success and move on to any downstream tasks.

ExternalTaskSensor

The next method for creating cross-DAG dependencies is to add an ExternalTaskSensor to your downstream DAG. The downstream DAG will wait until a task is completed in the upstream DAG before moving on to the rest of the DAG. You can find more info on this sensor on the Astronomer Registry.

This method is not as flexible as the TriggerDagRunOperator, since the dependency is implemented in the downstream DAG. It is ideal in situations where you have a downstream DAG that is dependent on multiple upstream DAGs. An example DAG using the ExternalTaskSensor is shown below:

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor
from datetime import datetime, timedelta

def downstream_function():
    """
    Downstream function with print statement.
    """
    print('Upstream DAG has completed. Starting other tasks.')

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('external-task-sensor-dag',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval='*/1 * * * *',
         catchup=False
         ) as dag:

    downstream_task1 = ExternalTaskSensor(
        task_id="downstream_task1",
        external_dag_id='example_dag',
        external_task_id='bash_print_date2',
        allowed_states=['success'],
        failed_states=['failed', 'skipped']
    )

    downstream_task2 = PythonOperator(
        task_id='downstream_task2',
        python_callable=downstream_function,
        provide_context=True
    )

    downstream_task1 >> downstream_task2

If you want the downstream DAG to wait for the entire upstream DAG to finish instead of a specific task, you can set the external_task_id to None. In this case, we specify that the external task must have a state of success for the downstream task to succeed, as defined by the allowed_states and failed_states.

Also, in the example above, the upstream DAG (example_dag) and downstream DAG (external-task-sensor-dag) must have the same start date and schedule interval. This is because the ExternalTaskSensor will look for the completion of the specified task or DAG at the same execution_date. To look for the completion of the external task at a different date, you can make use of either the execution_delta or execution_date_fn parameter (these are described in more detail in the documentation linked above).

Airflow API

The Airflow API is another way of creating cross-DAG dependencies. This is especially useful in Airflow 2.0, which has a full stable REST API. To use the API to trigger a DAG run, you can make a POST request to the DAGRuns endpoint as described in the Airflow documentation.
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator
from datetime import datetime, timedelta
import json

# Define body of POST request for the API call to trigger another DAG
date = '{{ ds }}'
request_body = {
    "logical_date": date
}
json_body = json.dumps(request_body)

def print_task_type(**kwargs):
    """
    Dummy function to call before and after downstream DAG.
    """
    print(f"The {kwargs['task_type']} task has completed.")
    print(request_body)

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('api-dag',
         start_date=datetime(2021, 1, 1),
         max_active_runs=1,
         schedule_interval='@daily',
         catchup=False
         ) as dag:

    start_task = PythonOperator(
        task_id='starting_task',
        python_callable=print_task_type,
        op_kwargs={'task_type': 'starting'}
    )

    api_trigger_dependent_dag = SimpleHttpOperator(
        task_id="api_trigger_dependent_dag",
        http_conn_id='airflow-api',
        endpoint='/api/v1/dags/dependent-dag/dagRuns',
        method='POST',
        headers={'Content-Type': 'application/json'},
        data=json_body
    )

    end_task = PythonOperator(
        task_id='end_task',
        python_callable=print_task_type,
        op_kwargs={'task_type': 'ending'}
    )

    start_task >> api_trigger_dependent_dag >> end_task
This DAG has a similar structure to the TriggerDagRunOperator DAG above but
instead uses the SimpleHttpOperator to trigger the dependent-dag using the
Airflow API. The graph view looks like this:
The screenshot below shows the dependencies created by the TriggerDagRunOperator and ExternalTaskSensor example DAGs in the sections above.

Cross-Deployment Dependencies

To implement cross-DAG dependencies on two different Airflow environments on the Astronomer platform, we can follow the same general steps for triggering a DAG using the Airflow API described above. It may be helpful to first read our documentation on making requests to the Airflow API from Astronomer. When you're ready to implement a cross-deployment dependency, follow these steps:

4. Ensure the downstream DAG is turned on, then run the upstream DAG.
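The trigger task itself can look just like the SimpleHttpOperator example above, with the HTTP connection pointed at the other deployment's API. A minimal sketch (the connection ID is hypothetical and the target DAG ID is illustrative):

from airflow.providers.http.operators.http import SimpleHttpOperator

# Trigger `dependent-dag` in a different Airflow deployment. The
# 'remote-airflow-api' connection is assumed to hold the other
# deployment's base URL and authentication.
trigger_remote_dag = SimpleHttpOperator(
    task_id='trigger_remote_dag',
    http_conn_id='remote-airflow-api',
    endpoint='/api/v1/dags/dependent-dag/dagRuns',
    method='POST',
    headers={'Content-Type': 'application/json'},
    data='{"logical_date": "{{ ds }}"}',
)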
4. Dynamically Generating DAGs in Airflow

Overview

Note: All code in this section can be found in this Github repo.

In Airflow, DAGs are defined as Python code. Airflow executes all Python code in the DAG_FOLDER and loads any DAG objects that appear in globals(). The simplest way of creating a DAG is to write it as a static Python file.

However, sometimes manually writing DAGs isn't practical. Maybe you have hundreds or thousands of DAGs that do similar things with just a parameter changing between them. Or perhaps you need a set of DAGs to load tables but don't want to manually update DAGs every time those tables change. In these cases and others, it can make more sense to generate DAGs dynamically.

Because everything in Airflow is code, you can dynamically generate DAGs using Python alone. As long as a DAG object in globals() is created by Python code that lives in the DAG_FOLDER, Airflow will load it. In this section, we will cover a few of the many ways of generating DAGs. We will also discuss when DAG generation is a good option and some pitfalls to watch out for when doing this at scale.

Single-File Methods

One method for dynamically generating DAGs is to have a single Python file that generates DAGs based on some input parameter(s) (e.g., a list of APIs or tables). An everyday use case for this is an ETL or ELT-type pipeline with many data sources or destinations. It would require creating many DAGs that all follow a similar pattern.

Some benefits of the single-file method:

+ It's simple and easy to implement.
+ It can accommodate input parameters from many different sources (see a few examples below).
+ Adding DAGs is nearly instantaneous since it requires only changing the input parameters.

However, there are also drawbacks:

× Since a DAG file isn't actually being created, your visibility into the code behind any specific DAG is limited.
× Since this method requires a Python file in the DAG_FOLDER, the generation code will be executed on every Scheduler heartbeat. It can cause performance issues if the total number of DAGs is large or if the code is connecting to an external system such as a database. For more on this, see the Scalability section below.
In the following examples, the single-file method is implemented differently based on which input parameters are used for generating DAGs.

EXAMPLE:
Use a Create_DAG Method

To dynamically create DAGs from a file, we need to define a Python function that will generate the DAGs based on an input parameter. In this case, we are going to define a DAG template within a create_dag function. The code here is very similar to what you would use when creating a single DAG, but it is wrapped in a method that allows for custom parameters to be passed in.

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def create_dag(dag_id,
               schedule,
               dag_number,
               default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag

In this example, the input parameters can come from any source that the Python script can access. We can then set a simple loop (range(1, 4)) to generate these unique parameters and pass them to the global scope, thereby registering them as valid DAGs within the Airflow scheduler:

# build a dag for each number in range(1, 4)
for n in range(1, 4):
    dag_id = 'loop_hello_world_{}'.format(str(n))

    default_args = {'owner': 'airflow',
                    'start_date': datetime(2021, 1, 1)
                    }

    schedule = '@daily'
    dag_number = n

    globals()[dag_id] = create_dag(dag_id,
                                   schedule,
                                   dag_number,
                                   default_args)

And if we look at the Airflow UI, we can see the DAGs have been created. Success!

EXAMPLE:
Generate DAGs from Airflow Variables
We can retrieve the value of an Airflow Variable (dag_number in this example) by importing the Variable class and passing it into our range. We want the interpreter to register this file as valid regardless of whether the variable exists, so the default_var is set to 3.

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator
from datetime import datetime


def create_dag(dag_id,
               schedule,
               dag_number,
               default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag


number_of_dags = Variable.get('dag_number', default_var=3)
number_of_dags = int(number_of_dags)

for n in range(1, number_of_dags):
    dag_id = 'hello_world_{}'.format(str(n))

    default_args = {'owner': 'airflow',
                    'start_date': datetime(2021, 1, 1)
                    }

    schedule = '@daily'
    dag_number = n
    globals()[dag_id] = create_dag(dag_id,
                                   schedule,
                                   dag_number,
                                   default_args)

If we look at the scheduler logs, we can see this variable was pulled into the DAG, and 15 DAGs were added to the DagBag based on its value.

We can then go to the Airflow UI and see all of the new DAGs that have been created.
EXAMPLE:
Generate DAGs from Connections

Another way to define input parameters for dynamically generating DAGs is by defining Airflow connections. It can be a good option if each of your DAGs connects to a database or an API. Because you will be setting up those connections anyway, creating the DAGs from that source avoids redundant work.

To implement this method, we can pull the connections we have in our Airflow metadata database by instantiating the "Session" and querying the "Connection" table. We can also filter this query so that it only pulls connections that match specific criteria.

from airflow import DAG, settings
from airflow.models import Connection
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def create_dag(dag_id,
               schedule,
               dag_number,
               default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag


session = settings.Session()
conns = (session.query(Connection.conn_id)
         .filter(Connection.conn_id.ilike('%MY_DATABASE_CONN%'))
         .all())

for conn in conns:
    dag_id = 'connection_hello_world_{}'.format(conn[0])

    default_args = {'owner': 'airflow',
                    'start_date': datetime(2018, 1, 1)
                    }

    schedule = '@daily'
    dag_number = conn

    globals()[dag_id] = create_dag(dag_id,
                                   schedule,
                                   dag_number,
                                   default_args)

Notice that, as before, we access the Models library to bring in the Connection class (as we did previously with the Variable class). We are also accessing the Session() class from settings, which will allow us to query the current database session.

We can see that all of the connections that match our filter have now been created as a unique DAG. The one connection we had which did not match (SOME_OTHER_DATABASE) has been ignored.
Multiple-File Methods

On the other hand, this method includes drawbacks:

× It can be complex to set up.
× Changes to DAGs or additional DAGs won't be generated until the script is run, which in some cases requires deployment.

Let's see a simple example of how this method could be implemented.

EXAMPLE:

First, we create a DAG template file that defines the DAG's structure, with placeholders (dag_id, scheduletoreplace, querytoreplace) for the parameters that will differ between the generated DAGs:

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
from datetime import datetime

default_args = {'owner': 'airflow',
                'start_date': datetime(2021, 1, 1)
                }

dag = DAG(dag_id,
          schedule_interval=scheduletoreplace,
          default_args=default_args,
          catchup=False)

with dag:
    t1 = PostgresOperator(
        task_id='postgres_query',
        postgres_conn_id=connection_id,
        sql=querytoreplace)

Next, we create a dag-config folder that will contain a JSON config file for each DAG. The config file should define the parameters that we noted above: the DAG ID, schedule interval, and query to be executed.

{
    "DagId": "dag_file_1",
    "Schedule": "'@daily'",
    "Query": "'SELECT * FROM table1;'"
}

Finally, we write a Python script to create the DAG files based on the template and the config files. The script loops through every config file in the dag-config/ folder, makes a copy of the template in the dags/ folder, and overwrites the placeholder parameters in that file with the parameters from the config file.

import json
import os
import shutil
import fileinput

config_filepath = 'include/dag-config/'
dag_template_filename = 'include/dag-template.py'

for filename in os.listdir(config_filepath):
    f = open(config_filepath + filename)
    config = json.load(f)

    new_filename = 'dags/' + config['DagId'] + '.py'
    shutil.copyfile(dag_template_filename, new_filename)

    for line in fileinput.input(new_filename, inplace=True):
        line = line.replace("dag_id", "'" + config['DagId'] + "'")
        line = line.replace("scheduletoreplace", config['Schedule'])
        line = line.replace("querytoreplace", config['Query'])
        print(line, end="")

To generate our DAG files, we either run this script ad-hoc as part of our CI/CD workflow, or we create another DAG that would run it periodically. After running the script, our final directory would look like the example below, where the include/ directory contains the files shown above, and the dags/ directory contains the two dynamically generated DAGs:

dags/
├── dag_file_1.py
├── dag_file_2.py
include/
├── dag-template.py
├── generate-dag-files.py
└── dag-config
    ├── dag1-config.json
    └── dag2-config.json
This is obviously a simple starting example that works only if all DAGs follow the same pattern. However, it could be expanded upon to have dynamic inputs for tasks, dependencies, different operators, etc.

DAG Factory

A notable tool for dynamically creating DAGs from the community is dag-factory, an open-source Python library for dynamically generating Airflow DAGs from YAML files.

To use dag-factory, you can install the package in your Airflow environment and create YAML configuration files for generating your DAGs. You can then build the DAGs by calling the generate_dags() method in a Python script, like this example from the dag-factory README:

from airflow import DAG
import dagfactory

dag_factory = dagfactory.DagFactory("/path/to/dags/config_file.yml")

dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())

Scalability

Dynamically generating DAGs can cause performance issues when used at scale. Whether or not any particular method will cause problems is dependent on your total number of DAGs, your Airflow configuration, and your infrastructure. Here are a few general things to look out for:

• Any code in the DAG_FOLDER will run on every Scheduler heartbeat. Methods where that code dynamically generates DAGs, such as the single-file method, are more likely to cause performance issues at scale.
• If the DAG parsing time (i.e., the time to parse all code in the DAG_FOLDER) is greater than the Scheduler heartbeat interval, the scheduler can get locked up, and tasks won't get executed. If you are dynamically generating DAGs and tasks aren't running, this is a good metric to review in the beginning of troubleshooting.

Upgrading to Airflow 2.0 to make use of the HA Scheduler should help with these performance issues. But it can still take some additional optimization work depending on the scale you're working at. There is no single right way to implement or scale dynamically generated DAGs. Still, the flexibility of Airflow means there are many ways to arrive at a solution that works for a particular use case.
Testing Airflow DAGs

One of the core principles of Airflow is that your DAGs are defined as Python code. Because you can treat data pipelines like you would any other piece of code, you can integrate them into a standard software development lifecycle using source control, CI/CD, and automated testing.

Although DAGs are 100% Python code, effectively testing DAGs requires accounting for their unique structure and relationship to other code and data in your environment. This guide will discuss a couple of types of tests that we would recommend to anybody running Airflow in production, including DAG validation testing, unit testing, and data and pipeline integrity testing.

Before you begin

If you are newer to test-driven development, or CI/CD in general, we'd recommend checking out introductory resources on those topics to get started.

We also recommend checking out Airflow's documentation on testing DAGs and testing guidelines for contributors; we will walk through some of the concepts covered in those docs in more detail below.

Note on test runners: Before we dive into different types of tests for Airflow, we have a quick note on test runners. There are multiple test runners available for Python, including unittest, pytest, and nose2. The OSS Airflow project uses pytest, so we will do the same in this section. However, Airflow doesn't require using a specific test runner. In general, choosing a test runner is a matter of personal preference and experience level, and some test runners might work better than others for a given use case.
DAG Validation Testing

DAG validation tests are designed to ensure that your DAG objects are defined correctly, acyclic, and free from import errors.

These are things that you would likely catch if you were starting with the local development of your DAGs. But in cases where you may not have access to a local Airflow environment or want an extra layer of security, these tests can ensure that simple coding errors don't get deployed and slow down your development.

DAG validation tests apply to all DAGs in your Airflow environment, so you only need to create one test suite.

To test whether your DAG can be loaded, meaning there aren't any syntax errors, you can run the Python file:

python your-dag-file.py

Or to test for import errors specifically (which might be syntax related but could also be due to incorrect package import paths, etc.), you can use something like the following:

import pytest
from airflow.models import DagBag

def test_no_import_errors():
    dag_bag = DagBag()
    assert len(dag_bag.import_errors) == 0, "No Import Failures"

You may also use DAG validation tests to test for properties that you want to be consistent across all DAGs. For example, if your team has a rule that all DAGs must have two retries for each task, you might write a test like this to enforce that rule:

def test_retries_present():
    dag_bag = DagBag()
    for dag in dag_bag.dags:
        retries = dag_bag.dags[dag].default_args.get('retries', [])
        error_msg = 'Retries not set to 2 for DAG {id}'.format(id=dag)
        assert retries == 2, error_msg

To see an example of running these tests as part of a CI/CD workflow, check out this repo, which uses GitHub Actions to run the test suite before deploying the project to an Astronomer Airflow deployment.
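Validation tests like these can enforce any convention your team cares about. As one more sketch (the convention checked here is just an example, not something this guide prescribes), you could verify that every DAG in the DagBag defines at least one task:

from airflow.models import DagBag

def test_dags_have_tasks():
    dag_bag = DagBag()
    for dag_id, dag in dag_bag.dags.items():
        # Every loaded DAG should contain at least one task.
        assert len(dag.tasks) > 0, 'DAG {0} has no tasks'.format(dag_id)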
Unit Testing

Unit testing is a software testing method where small chunks of source code are tested individually to ensure they function as intended. The goal is to isolate testable logic inside of small, well-named functions, for example:

def test_function_returns_5():
    assert my_function(input) == 5

In the context of Airflow, you can write unit tests for any part of your DAG, but they are most frequently applied to hooks and operators. All official Airflow hooks, operators, and provider packages have unit tests that must pass before merging the code into the project. For an example, check out the AWS S3Hook, which has many accompanying unit tests.

If you have your own custom hooks or operators, we highly recommend using unit tests to check logic and functionality. For example, say we have a custom operator that checks if a number is even:

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class EvenNumberCheckOperator(BaseOperator):
    @apply_defaults
    def __init__(self, my_operator_param, *args, **kwargs):
        self.operator_param = my_operator_param
        super(EvenNumberCheckOperator, self).__init__(*args, **kwargs)

    def execute(self, context):
        if self.operator_param % 2:
            return False
        else:
            return True

We would then write a test_evencheckoperator.py file with unit tests like the following:

import unittest
import pytest
from datetime import datetime
from airflow import DAG
from airflow.models import TaskInstance
from airflow.operators import EvenNumberCheckOperator

DEFAULT_DATE = datetime(2021, 1, 1)

class TestEvenNumberCheckOperator(unittest.TestCase):

    def setUp(self):
        super().setUp()
        self.dag = DAG('test_dag', default_args={'owner': 'airflow',
                                                 'start_date': DEFAULT_DATE})
        self.even = 10
        self.odd = 11

    def test_even(self):
        """Tests that the EvenNumberCheckOperator returns True for 10."""
        task = EvenNumberCheckOperator(my_operator_param=self.even,
                                       task_id='even', dag=self.dag)
        ti = TaskInstance(task=task, execution_date=datetime.now())
        result = task.execute(ti.get_template_context())
        assert result is True

    def test_odd(self):
        """Tests that the EvenNumberCheckOperator returns False for 11."""
        task = EvenNumberCheckOperator(my_operator_param=self.odd,
                                       task_id='odd', dag=self.dag)
        ti = TaskInstance(task=task, execution_date=datetime.now())
        result = task.execute(ti.get_template_context())
        assert result is False

Note that if your DAGs contain PythonOperators that execute your Python functions, it is a good idea to write unit tests for those functions as well.

The most common way of implementing unit tests in production is to automate them as part of your CI/CD process. Your CI tool executes the tests and stops the deployment process if any errors occur.

Mocking

Sometimes unit tests require mocking: the imitation of an external system, dataset, or another object. For example, you might use mocking with an Airflow unit test if you are testing a connection but don't have access to the metadata database. Another example could be testing an operator that executes an external service through an API endpoint, but you don't want to wait for that service to run a simple test.

Many Airflow tests have examples of mocking. This blog post also has a helpful section on mocking Airflow that may help get started.

Data Integrity Testing

Data integrity tests are designed to prevent data quality issues from breaking your pipelines or negatively impacting downstream systems. These tests could also be used to ensure your DAG tasks produce the expected output when processing a given piece of data. They are somewhat different in scope than the code-related tests described in previous sections since your data is not static like a DAG.

One straightforward way of implementing data integrity tests is to build them directly into your DAGs. This allows you to use Airflow dependencies to manage any errant data in whatever way makes sense for your use case.

There are many ways you could integrate data checks into your DAG. One method worth calling out is using Great Expectations (GE), an open-source Python framework for data validations. You can make use of the Great Expectations provider package to easily integrate GE tasks into your DAGs. In practice, you might have something like the following DAG, which runs an Azure Data Factory pipeline that generates data, then runs a GE check on the data before sending an email.

from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.email_operator import EmailOperator
from airflow.operators.python_operator import PythonOperator
from airflow.providers.microsoft.azure.hooks.azure_data_factory import AzureDataFactoryHook
from airflow.providers.microsoft.azure.hooks.wasb import WasbHook
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator
#Get yesterday's date, in the correct format
yesterday_date = '{{ yesterday_ds_nodash }}'

#Define Great Expectations file paths
data_dir = '/usr/local/airflow/include/data/'
data_file_path = '/usr/local/airflow/include/data/'
ge_root_dir = '/usr/local/airflow/include/great_expectations'

def run_adf_pipeline(pipeline_name, date):
    '''Runs an Azure Data Factory pipeline using the AzureDataFactoryHook
    and passes in a date parameter
    '''

    #Create a dictionary with date parameter
    params = {}
    params["date"] = date

    #Make connection to ADF, and run pipeline with parameter
    hook = AzureDataFactoryHook('azure_data_factory_conn')
    hook.run_pipeline(pipeline_name, parameters=params)

def get_azure_blob_files(blobname, output_filename):
    '''Downloads file from Azure blob storage
    '''
    azure = WasbHook(wasb_conn_id='azure_blob')
    azure.get_file(output_filename, container_name='covid-data',
                   blob_name=blobname)


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5)
}

with DAG('adf_great_expectations',
         start_date=datetime(2021, 1, 1),
         max_active_runs=1,
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False
         ) as dag:

    run_pipeline = PythonOperator(
        task_id='run_pipeline',
        python_callable=run_adf_pipeline,
        op_kwargs={'pipeline_name': 'pipeline1', 'date': yesterday_date}
    )

    download_data = PythonOperator(
        task_id='download_data',
        python_callable=get_azure_blob_files,
        op_kwargs={'blobname': 'or/' + yesterday_date + '.csv',
                   'output_filename': data_file_path + 'or_' + yesterday_date + '.csv'}
    )

    ge_check = GreatExpectationsOperator(
        task_id='ge_checkpoint',
        expectation_suite_name='azure.demo',
        batch_kwargs={
            'path': data_file_path + 'or_' + yesterday_date + '.csv',
            'datasource': 'data__dir'
        },
        data_context_root_dir=ge_root_dir
    )

    send_email = EmailOperator(
        task_id='send_email',
        to='[email protected]',
        subject='Covid to S3 DAG',
        html_content='<p>The great expectations checks passed successfully.</p>'
    )

    run_pipeline >> download_data >> ge_check >> send_email

If the GE check fails, any downstream tasks will be skipped. Implementing checkpoints like this allows you to either conditionally branch your pipeline to deal with data that doesn't meet your criteria or potentially skip all downstream tasks so problematic data won't be loaded into your data warehouse or fed to a model. For more information on conditional DAG design, check out the documentation on Airflow Trigger Rules and our guide on branching in Airflow.

It's also worth noting that data integrity testing will work better at scale if you design your DAGs to load or process data incrementally. We talk more about incremental loading in our Airflow Best Practices guide. Still, in short, processing smaller, incremental chunks of your data in each DAG Run ensures that any data quality issues have a limited blast radius and are easier to recover from.

DAG Authoring for Apache Airflow

The Astronomer Certification: DAG Authoring for Apache Airflow gives you the opportunity to challenge yourself and show the world your ability to create incredible data pipelines. And don't worry, we've also prepared a preparation course to give you the best chance of success!

Concepts Covered:
• Variables • Idempotency
• Pools • Dynamic DAGs
• Trigger Rules • DAG Best Practices
• DAG Dependencies • DAG Versioning and much more

Get Certified
Note: Following the Airflow 2.0 release in December of 2020, the open-source project has addressed a significant number of pain points commonly reported by users running previous versions. We strongly encourage your team to upgrade to Airflow 2.x.

7 Common Errors to Check when Debugging Airflow DAGs

Apache Airflow is the industry standard for workflow orchestration. It's an incredibly flexible tool that powers mission-critical projects, from machine learning model training to traditional ETL at scale, for startups and Fortune 50 teams alike.

Airflow's breadth and extensibility, however, can make it challenging to adopt — especially for those looking for guidance beyond day-one operations. In an effort to provide best practices and expand on existing resources, our team at Astronomer has collected some of the most common issues we see Airflow users face.

Whether you're new to Airflow or an experienced user, check out this list of common errors and some corresponding fixes to consider.

1. Your DAG Isn't Running at the Expected Time

You wrote a new DAG that needs to run every hour and you're ready to turn it on. You set an hourly interval beginning today at 2pm, setting a reminder to check back in a couple of hours. You hop on at 3:30pm to find that your DAG did in fact run, but your logs indicate that there was only one recorded execution at 2pm. Huh — what happened to the 3pm run?

Before you jump into debugging mode (you wouldn't be the first), rest assured that this is expected behavior. The functionality of the Airflow Scheduler can be counterintuitive, but you'll get the hang of it.

The two most important things to keep in mind about scheduling are:

• By design, an Airflow DAG will run at the end of its schedule_interval.
• Airflow operates in UTC by default.
Airflow's Schedule Interval

As stated above, an Airflow DAG will execute at the completion of its schedule_interval, which means one schedule_interval AFTER the start date. An hourly DAG, for example, will execute its 2:00 PM run when the clock strikes 3:00 PM. This happens because Airflow can't ensure that all of the data from 2:00 PM - 3:00 PM is present until the end of that hourly interval.

This quirk is specific to Apache Airflow, and it's important to remember — especially if you're using default variables and macros. Thankfully, Airflow 2.2+ simplifies DAG scheduling with the introduction of timetables!

Use Timetables for Simpler Scheduling

There are some data engineering use cases that are difficult or even impossible to address with Airflow's original scheduling method. Scheduling DAGs to skip holidays, run only at certain times, or otherwise run on varying intervals can cause major headaches if you're relying solely on cron jobs or timedeltas.

This is why Airflow 2.2 introduced timetables as the new default scheduling method. Essentially, a timetable is a DAG-level parameter that you can set to a Python function that contains your execution schedule.

A timetable is significantly more customizable than a cron job or timedelta. You can program varying schedules, conditional logic, and more, directly within your DAG schedule. And because timetables are imported as Airflow plugins, you can use community-developed timetables to quickly — and literally — get your DAG up to speed.

We recommend using timetables as your de facto scheduling mechanism in Airflow 2.2+. You might be creating timetables without even knowing it: if you define a schedule_interval, Airflow 2.2+ will convert it to a timetable behind the scenes.

Airflow Time Zones

Airflow stores datetime information in UTC internally and in the database. This behavior is shared by many databases and APIs, but it's worth clarifying. You should not expect your DAG executions to correspond to your local timezone. If you're based in US Pacific Time, a DAG run of 19:00 will correspond to 12:00 local time.

In recent releases, the community has added more time zone-aware features to the Airflow UI. For more information, refer to the Airflow documentation.

2. One of Your DAGs Isn't Running

If workflows on your Deployment are generally running smoothly but you find that one specific DAG isn't scheduling tasks or running at all, it might have something to do with how you set it to schedule.

Make Sure You Don't Have datetime.now() as Your start_date

It's intuitive to think that if you tell your DAG to start "now" it'll execute immediately. But that's not how Airflow reads datetime.now().

For a DAG to be executed, the start_date must be a time in the past, otherwise Airflow will assume that it's not yet ready to execute. When Airflow evaluates your DAG file, it interprets datetime.now() as the current timestamp (i.e. NOT a time in the past) and decides that it's not ready to run.

To properly trigger your DAG to run, make sure to insert a fixed time in the past and set catchup=False if you don't want to perform a backfill.
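As a minimal sketch of that advice (the DAG ID and interval are illustrative), a DAG declared with a fixed, past start_date and catchup disabled looks like this:

from datetime import datetime
from airflow import DAG

# A fixed start_date in the past lets the scheduler create runs.
# datetime.now() would be re-evaluated on every parse and never count
# as "in the past", so the DAG would never be scheduled.
with DAG(
    'properly_scheduled_dag',          # hypothetical DAG ID
    start_date=datetime(2021, 1, 1),
    schedule_interval='@hourly',
    catchup=False,                     # don't backfill runs since start_date
) as dag:
    ...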
Note: You can manually trigger a DAG run via Airflow's UI directly on your dashboard (it looks like a "Play" button). A manual trigger executes immediately and will not interrupt regular scheduling, though it will be limited by any concurrency configurations you have at the deployment level, DAG level, or task level. When you look at corresponding logs, the run_id will show manual__ instead of scheduled__.

If your team is running Airflow 1 and would like help establishing a migration path, reach out to us.

3. You're Seeing a 503 Error

In our experience, a 503 often indicates that your Webserver is crashing. If you push up a deploy and your Webserver takes longer than a few seconds to start, it might hit a timeout period (10 secs by default) that "crashes" the Webserver before it has time to spin up. That triggers a retry, which crashes again, and so on and so forth.

If your Deployment is in this state, your Webserver might be hitting a memory limit when loading your DAGs even as your Scheduler and Worker(s) continue to schedule and execute tasks.

If you've already refreshed the page once or twice and continue to see a 503 error, read below for some Webserver-related guidelines.

Increase the Webserver Timeout Period
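If the Webserver is timing out while parsing and loading DAGs, one knob to look at is the webserver timeout configuration in airflow.cfg (the values below are illustrative, not recommendations; on hosted platforms the equivalent setting may be managed for you):

[webserver]
# Seconds the gunicorn master/worker processes may take before being killed and restarted.
web_server_master_timeout = 120
web_server_worker_timeout = 120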
Avoid Making Requests Outside of an Operator

If you're making API calls, JSON requests, or database requests outside of an Airflow operator at a high frequency, your Webserver is much more likely to time out.

When Airflow interprets a file to look for any valid DAGs, it first runs all code at the top level (i.e. outside of operators). Even if the operator itself only gets executed at execution time, everything outside of an operator is called every heartbeat, which can be very taxing on performance.

Depending on your use case, we'd suggest considering the following:

• Create a DAG that runs at a more frequent interval.
• Trigger a Lambda function.
• Set mode='reschedule'. If you have more sensors than worker slots, the sensor will now get thrown into an up_for_reschedule state, which frees up its worker slot.

Replace Sensors with Deferrable Operators
Update Concurrency Settings

The potential root cause for a bottleneck is specific to your setup. For example, are you running many DAGs at once, or one DAG with hundreds of concurrent tasks?

Regardless of your use case, configuring a few settings as parameters or environment variables can help improve performance. Use this section to learn what those variables are and how to set them.

Most users can set parameters in Airflow's airflow.cfg file. If you're using Astro, you can also set environment variables via the Astro UI or your project's Dockerfile. We've formatted these settings as parameters for readability – the environment variables for these settings are formatted as AIRFLOW__CORE__PARAMETER_NAME. For all default values, refer here.

Parallelism

parallelism determines how many task instances can run in parallel across all DAGs given your environment resources. Think of this as "maximum active tasks anywhere." To increase the limit of tasks set to run in parallel, set this value higher than its default of 32.

DAG Concurrency

max_active_tasks_per_dag (formerly dag_concurrency) determines how many task instances your Scheduler is able to schedule at once per DAG. Think of this as "maximum tasks that can be scheduled at once, per DAG." The default is 16, but you should increase this if you're not noticing an improvement in performance after provisioning more resources to Airflow.

Max Active Runs per DAG

max_active_runs_per_dag determines the maximum number of active DAG runs per DAG. This setting is most relevant when backfilling, as all of your DAGs are immediately vying for a limited number of resources. The default value is 16.

Worker Concurrency

Set via AIRFLOW__CELERY__WORKER_CONCURRENCY, worker_concurrency determines how many tasks each Celery Worker can run at any given time. The Celery Executor will run a max of 16 tasks concurrently by default. Think of this as "how many tasks each of my workers can take on at any given time."

It's important to note that this number will naturally be limited by max_active_tasks_per_dag. If you have 1 Worker and want it to match your Deployment's capacity, worker_concurrency should be equal to parallelism. The default value is 16.

Pro-tip: If you consider setting DAG or deployment-level concurrency configurations to a low number to protect against API rate limits, we'd recommend instead using "pools" - they'll allow you to limit parallelism at the task level and won't limit scheduling or execution outside of the tasks that need it.
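To make the parameter names concrete, here is an illustrative airflow.cfg fragment (the values are placeholders to adapt to your environment, not recommendations):

[core]
parallelism = 64
max_active_tasks_per_dag = 32
max_active_runs_per_dag = 16

[celery]
worker_concurrency = 16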
Try Scaling Up Your Scheduler or Adding a Worker

If tasks are getting bottlenecked and your concurrency configurations are already optimized, the issue might be that your Scheduler is underpowered or that your Deployment could use another worker. If you're running on Astro, we generally recommend 5 AU (0.5 CPUs and 1.88 GiB of memory) as the default minimum for the Scheduler and 10 AU (1 CPU and 3.76 GiB of memory) for workers.

Whether you scale your current resources or add an extra Celery Worker depends on your use case, but we generally recommend the following:

• If you're running a relatively high number of light tasks across DAGs and at a relatively high frequency, you're likely better off having 2 or 3 "light" workers to spread out the work.
• If you're running fewer but heavier tasks at a lower frequency, you're likely better off with a single but "heavier" worker that can more efficiently execute those tasks.

For more information on the differences between Executors, we recommend reading Airflow Executors: Explained.

If you're missing logs, you might see something like this under "Log by attempts" in the Airflow UI:

Failed to fetch log file from worker. Invalid URL 'http://:8793/log/staging_to_presentation_pipeline_v5/redshift_to_s3_Order_Payment_17461/2019-01-11T00:00:00+00:00/1.log': No host supplied

A few things to try:

• Clear the task instance via the Airflow UI to see if logs show up. This will prompt your task to run again.
• Change the log_fetch_timeout_sec to something greater than 5 seconds. Defined in seconds, this setting determines the amount of time that the Webserver will wait for an initial handshake while fetching logs from other workers.
• Give your workers a little more power. If you're using Astro, you can do this in the Configure tab of the Astro UI.
• Are you looking for a log from over 15 days ago? If you're using Astro, the log retention period is an Environment Variable we have hard-coded on our platform. For now, you won't have access to logs over 15 days old.
• Exec into one of your Celery workers to look for the log files. If you're running Airflow on Kubernetes or Docker, you can use kubectl or Docker commands to run $ kubectl exec -it {worker_name} bash. Log files should be in ~/logs. From there, they'll be split up by DAG/TASK/RUN.
• Try checking your Scheduler and Webserver logs to see if there are any errors that might tell you why your task logs are missing.

If your tasks are slower than usual to get scheduled, you might need to update Scheduler settings to increase performance and optimize your environment.
Pro-tip: Scheduler performance was a critical part of the Airflow 2 release and has seen significant improvements since December of 2020. If you are experiencing Scheduler issues, we strongly recommend upgrading to Airflow 2.x. For more information, read our blog post: The Airflow 2.0 Scheduler.

Just like with concurrency settings, users can set parameters in Airflow's airflow.cfg file. If you're using Astro, you can also set environment variables via the Astro UI or your project's Dockerfile. We've formatted these settings as parameters for readability – the environment variables for these settings are formatted as AIRFLOW__CORE__PARAMETER_NAME. For all default values, refer here.
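For example, the parallelism and per-DAG concurrency limits discussed above could be raised with settings along these lines. The values are illustrative, and on Airflow versions older than 2.2 the second option is named dag_concurrency.

[core]
# Maximum active tasks across all DAGs in the environment
parallelism = 64
# Maximum tasks that can be scheduled at once, per DAG
max_active_tasks_per_dag = 32

# Or, as environment variables (e.g. in your project's Dockerfile):
# ENV AIRFLOW__CORE__PARALLELISM=64
# ENV AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG=32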
Error Notifications in Airflow

Overview

A key question when using any data orchestration tool is "How do I know if something has gone wrong?" Airflow users always have the option to check the UI to see the status of their DAGs, but this is an inefficient way of managing errors systematically, especially if certain failures need to be addressed promptly or by multiple team members. Fortunately, Airflow has built-in notification mechanisms that can be leveraged to configure error notifications in a way that works for your team.

In this section, we will cover the basics of Airflow notifications and how to set up common notification mechanisms including email, Slack, and SLAs. We will also discuss how to make the most of Airflow alerting when using the Astronomer platform.

Airflow has an incredibly flexible notification system. Having your DAGs defined as Python code gives you full autonomy to define your tasks and notifications in whatever way makes sense for your use case. In this section, we will cover some of the options available when working with notifications in Airflow.

Notification Levels

Sometimes it makes sense to standardize notifications across your entire DAG. Notifications set at the DAG level will filter down to each task in the DAG. These notifications are usually defined in default_args, as in the DAG below:

from datetime import datetime
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 30),
    'email': ['[email protected]'],
    'email_on_failure': True
}

with DAG('sample_dag',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    ...

In contrast, it's sometimes useful to have notifications only for certain tasks. The BaseOperator that all Airflow Operators inherit from has support for built-in notification arguments, so you can configure each task individually as needed. In the DAG below, email notifications are turned off by default at the DAG level but are specifically enabled for the will_email task.
144 145
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 30),
    'email_on_failure': False,
    'email': ['[email protected]'],
    'retries': 1
}

with DAG('sample_dag',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    wont_email = DummyOperator(
        task_id='wont_email'
    )

    will_email = DummyOperator(
        task_id='will_email',
        email_on_failure=True
    )

Notification Triggers

The most common trigger for notifications in Airflow is a task failure. However, notifications can be set based on other events, including retries and successes.

Emails on retries can be useful for debugging indirect failures; if a task needed to retry but eventually succeeded, this might indicate that the problem was caused by extraneous factors like a load on an external system. To turn on email notifications for retries, simply set the email_on_retry parameter to True as shown in the DAG below.

from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 30),
    'email': ['[email protected]'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retry_exponential_backoff': True,
    'retry_delay': timedelta(seconds=300),
    'retries': 3
}

with DAG('sample_dag',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    ...
When working with retries, you should configure a retry_delay. This is the amount of time between a task failure and when the next try will begin. You can also turn on retry_exponential_backoff, which progressively increases the wait time between retries. This can be useful if you expect that extraneous factors might cause failures periodically.

Finally, you can also set any task to email on success by setting the email_on_success parameter to True. This is useful when your pipelines have conditional branching, and you want to be notified if a certain path is taken (i.e. certain tasks get run).

Custom Notifications

The email notification parameters shown in the sections above are an example of built-in Airflow alerting mechanisms. These simply have to be turned on and don't require any configuration from the user.

You can also define your own notifications to customize how Airflow alerts you about failures or successes. The most straightforward way of doing this is by defining on_failure_callback and on_success_callback Python functions. These functions can be set at the DAG or task level, and the functions will be called when a failure or success occurs respectively. For example, the following DAG has a custom on_failure_callback function set at the DAG level and an on_success_callback function for just the success_task.

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

def custom_failure_function(context):
    "Define custom failure notification behavior"
    dag_run = context.get('dag_run')
    task_instances = dag_run.get_task_instances()
    print("These task instances failed:", task_instances)

def custom_success_function(context):
    "Define custom success notification behavior"
    dag_run = context.get('dag_run')
    task_instances = dag_run.get_task_instances()
    print("These task instances succeeded:", task_instances)

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 30),
    'on_failure_callback': custom_failure_function,
    'retries': 1
}

with DAG('sample_dag',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    failure_task = DummyOperator(
        task_id='failure_task'
    )

    success_task = DummyOperator(
        task_id='success_task',
        on_success_callback=custom_success_function
    )
Note that custom notification functions can be used in addition to email notifications.

Email Notifications

Email notifications are a native feature in Airflow and are easy to set up. As shown above, the email_on_failure and email_on_retry parameters can be set to True either at the DAG level or task level to send emails when tasks fail or retry. The email parameter can be used to specify which email(s) you want to receive the notification. If you want to enable email alerts on all failures and retries in your DAG, you can define that in your default arguments like this:

from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 30),
    'email': ['[email protected]'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retry_delay': timedelta(seconds=300),
    'retries': 1
}

with DAG('sample_dag',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    ...

In order for Airflow to send emails, you need to configure an SMTP server in your Airflow environment. You can do this by filling out the SMTP section of your airflow.cfg like this:

[smtp]
# If you want airflow to send emails on retries, failure, and you want to use
# the airflow.utils.email.send_email_smtp function, you have to configure an
# smtp server here
smtp_host = your-smtp-host.com
smtp_starttls = True
smtp_ssl = False
# Uncomment and set the user/pass settings if you want to use SMTP AUTH
# smtp_user =
# smtp_password =
smtp_port = 587
smtp_mail_from = [email protected]

You can also set these values using environment variables. In this case, all parameters are preceded by AIRFLOW__SMTP__, consistent with Airflow environment variable naming convention. For example, smtp_host can be specified by setting the AIRFLOW__SMTP__SMTP_HOST variable. For more on Airflow email configuration, check out the Airflow documentation.

Note: If you are running on the Astronomer platform, you can set up SMTP using environment variables since the airflow.cfg cannot be directly edited. For more on email alerting on the Astronomer platform, see the 'Notifications on Astronomer' section below.
Customizing Email Notifications

By default, email notifications will be sent in a standard format as defined in the email_alert() and get_email_subject_content() methods of the TaskInstance class. The default email content is defined like this:

default_subject = 'Airflow alert: {{ti}}'
# For reporting purposes, we report based on 1-indexed,
# not 0-indexed lists (i.e. Try 1 instead of
# Try 0 for the first attempt).
default_html_content = (
    'Try {{try_number}} out of {{max_tries + 1}}<br>'
    'Exception:<br>{{exception_html}}<br>'
    'Log: <a href="{{ti.log_url}}">Link</a><br>'
    'Host: {{ti.hostname}}<br>'
    'Mark success: <a href="{{ti.mark_success_url}}">Link</a><br>'
)

To see the full method, check out the source code here.

You can overwrite this default with your custom content by setting the subject_template and/or html_content_template variables in your airflow.cfg with the path to your jinja template files for subject and content respectively.
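As a sketch, in Airflow 2.0.1 and later these options live in the [email] section of airflow.cfg; the template paths below are illustrative placeholders for files in your own project.

[email]
# Jinja templates used to render the alert subject and body
subject_template = /usr/local/airflow/include/email_subject_template.j2
html_content_template = /usr/local/airflow/include/email_content_template.html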
Slack Notifications

Sending notifications to Slack is another common way of alerting with Airflow.

There are multiple ways you can send messages to Slack from Airflow. In this section, we will cover how to use the Slack Provider's SlackWebhookOperator with a Slack Webhook to send messages, since this is Slack's recommended way of posting messages from apps. To get started, follow these steps:

1. From your Slack workspace, create a Slack app and an incoming Webhook. The Slack documentation here walks through the necessary steps. Make a note of the Slack Webhook URL to use in your Python function.
2. Create an Airflow connection to provide your Slack Webhook to Airflow. Choose an HTTP connection type (if you are using Airflow 2.0 or greater, you will need to install the apache-airflow-providers-http provider for the HTTP connection type to appear in the Airflow UI). Enter https://hooks.slack.com/services/ as the Host, and enter the remainder of your Webhook URL from the last step as the Password (formatted as T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX).
3. Create a Python function to use as your on_failure_callback method. Within the function, define the information you want to send and invoke the SlackWebhookOperator to send the message.

Note: In Airflow 2.0 or greater, to use the SlackWebhookOperator you will need to install the apache-airflow-providers-slack provider package.

Here's an example:
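The sketch below assumes the connection from step 2 was saved with the connection ID slack_webhook; the connection ID and message format are illustrative, and newer versions of the Slack provider rename http_conn_id to slack_webhook_conn_id.

from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator

def slack_notification(context):
    # Build a message from the task instance that failed
    slack_msg = """
            :red_circle: Task Failed.
            *Task*: {task}
            *Dag*: {dag}
            *Execution Time*: {exec_date}
            *Log Url*: {log_url}
            """.format(
        task=context.get('task_instance').task_id,
        dag=context.get('task_instance').dag_id,
        exec_date=context.get('execution_date'),
        log_url=context.get('task_instance').log_url,
    )
    failed_alert = SlackWebhookOperator(
        task_id='slack_notification',
        http_conn_id='slack_webhook',  # connection created in step 2
        message=slack_msg
    )
    return failed_alert.execute(context=context)

You can then pass this function as the on_failure_callback in your DAG's default_args or on an individual task, just like the custom failure function shown earlier.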
Task States

• None (Light Blue): No associated state. Syntactically, set as Python None.
• Queued (Gray): The task is waiting to be executed, set as queued.
• Scheduled (Tan): The task has been scheduled to run.
• Running (Lime): The task is currently being executed.
• Failed (Red): The task failed.
• Success (Green): The task was executed successfully.
• Skipped (Pink): The task has been skipped due to an upstream condition.
• Shutdown (Blue): The task was externally requested to shut down while it was running.
• Removed (Light Grey): The task has been removed.
• Retry (Gold): The task is up for retry.
• Upstream Failed (Orange): The task will not run because of a failed upstream dependency.

Airflow SLAs

Airflow SLAs are a type of notification that you can use if your tasks are taking longer than expected to complete. If a task takes longer than a maximum amount of time to complete as defined in the SLA, the SLA will be missed and notifications will be triggered. This can be useful in cases where you have potentially long-running tasks that might require user intervention after a certain period of time, or if you have tasks that need to complete by a certain deadline.

Note that exceeding an SLA will not stop a task from running. If you want tasks to stop running after a certain time, try using timeouts instead.
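For instance, the standard execution_timeout argument on any operator fails the task outright (rather than just flagging it, as an SLA does) once it runs past the given duration. The DAG and timeout values below are illustrative.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
import time

def long_job():
    time.sleep(900)

with DAG('timeout-dag',
         start_date=datetime(2021, 1, 1),
         schedule_interval='@daily',
         catchup=False) as dag:

    # Fail this task if it runs for more than 10 minutes.
    long_task = PythonOperator(
        task_id='long_task',
        python_callable=long_job,
        execution_timeout=timedelta(minutes=10)
    )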
You can set an SLA for all tasks in your DAG by defining 'sla' as a default argument, as shown in the DAG below:

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import time

def my_custom_function(ts, **kwargs):
    print("task is sleeping")
    time.sleep(40)

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': True,
    'email': '[email protected]',
    'email_on_retry': False,
    'sla': timedelta(seconds=30)
}

# Using a DAG context manager, you don't have to specify the dag
# property of each task
with DAG('sla-dag',
         start_date=datetime(2021, 1, 1),
         max_active_runs=1,
         schedule_interval=timedelta(minutes=2),
         default_args=default_args,
         catchup=False
         ) as dag:

    t0 = DummyOperator(
        task_id='start'
    )

    t1 = DummyOperator(
        task_id='end'
    )

    sla_task = PythonOperator(
        task_id='sla_task',
        python_callable=my_custom_function
    )

    t0 >> sla_task >> t1

SLAs have some unique behaviors that you should consider before implementing them:

• SLAs are relative to the DAG execution date, not the task start time. For example, in the DAG above the sla_task will miss the 30 second SLA because it takes at least 40 seconds to complete. The t1 task will also miss the SLA, because it is executed more than 30 seconds after the DAG execution date. In that case, the sla_task will be considered "blocking" to the t1 task.
• SLAs will only be evaluated on scheduled DAG Runs. They will not be evaluated on manually triggered DAG Runs.
• SLAs can be set at the task level if a different SLA is required for each task. In this case, all task SLAs are still relative to the DAG execution date. For example, in the DAG below, t1 has an SLA of 500 seconds. If the upstream tasks (t0 and sla_task) combined take 450 seconds to complete, and t1 takes 60 seconds to complete, then t1 will miss its SLA even though the task did not take more than 500 seconds to execute.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import time

def my_custom_function(ts, **kwargs):
    print("task is sleeping")
    time.sleep(40)

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': True,
    'email': '[email protected]',
    'email_on_retry': False
}

# Using a DAG context manager, you don't have to specify the dag
# property of each task
with DAG('sla-dag',
         start_date=datetime(2021, 1, 1),
         max_active_runs=1,
         schedule_interval=timedelta(minutes=2),
         default_args=default_args,
         catchup=False
         ) as dag:

    t0 = DummyOperator(
        task_id='start',
        sla=timedelta(seconds=50)
    )

    t1 = DummyOperator(
        task_id='end',
        sla=timedelta(seconds=500)
    )

    sla_task = PythonOperator(
        task_id='sla_task',
        python_callable=my_custom_function,
        sla=timedelta(seconds=5)
    )

    t0 >> sla_task >> t1

Any SLA misses will be shown in the Airflow UI. You can view them by going to Browse > SLA Misses.

If you configured an SMTP server in your Airflow environment, you will also receive an email with notifications of any missed SLAs.

Note that there is no functionality to disable email alerting for SLAs. If you have an 'email' array defined and an SMTP server configured in your Airflow environment, an email will be sent to those addresses for each DAG Run that has missed SLAs.

Notifications on Astronomer

If you are running Airflow on the Astronomer platform, you have multiple options for managing your Airflow notifications. All of the methods above for sending task notifications from Airflow are easily implemented on Astronomer. Our documentation discusses how to leverage these notifications on the platform, including how to set up SMTP to enable email alerts.
Thank you
We hope you've enjoyed our guide to DAGs. Please follow us on Twitter and LinkedIn, and share any feedback you have.
Get Started