What I Wish I Knew Before Using Apache Airflow
What Apache Airflow is
Apache Airflow, originally created by Airbnb, is an Apache Incubator project. Although it has only been around for two years, it has quickly become a staple of many ETL platforms. Airflow is a workflow management system that provides dependency control, task management, task recovery, charting, logging, alerting, history, folder watching, trending and, my personal favorite, dynamic tasks.
What Apache Airflow is not
Apache Airflow offers many tools and a lot of power, which can greatly simplify your life. However, it is not perfect. Since it's an open-source project that depends on its community for development, there are definite bugs in the system. Follow these tips to avoid the pitfalls I had to learn the hard way.
An Overview of Apache Airflow DAGs
Let’s talk about the governing force behind Airflow: DAGs, or Directed Acyclic Graphs. A DAG is a collection of all the tasks you want to run, or interact with, in a process. DAGs are written in Python. An operator is a description of how a task is performed. Operators are easy to program, so you can make your own. Anything you need to execute, and any way you need to execute it, you can program through a DAG.
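As a concrete sketch of what a DAG file looks like (the DAG ID, task names, and commands here are hypothetical, using the Airflow 1.x-style API), two operators are declared and chained with a dependency:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

# Hypothetical example: a daily DAG with two dependent tasks.
dag = DAG(
    dag_id="example_etl",          # must be unique across all DAG files
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(
    task_id="extract",
    bash_command="echo extracting",
    dag=dag,
)

def _transform(**context):
    # "ds" is the execution date string Airflow passes into the context
    print("transforming data for", context["ds"])

transform = PythonOperator(
    task_id="transform",
    python_callable=_transform,
    provide_context=True,
    dag=dag,
)

extract >> transform  # transform runs only after extract succeeds
```

The `>>` operator is how task dependencies are expressed; the scheduler uses that graph to decide what can run and when.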
1. Apache Airflow Task Runs
On the left-hand side of the DAG UI, you will see on/off switches. Nothing in Airflow will run unless it’s turned on. Even if a DAG appears in the list and you hit the play button, nothing will happen until you flip the on switch. Make sure to monitor this.
2. Labeling DAGs in Apache Airflow
A word of warning: even if you have multiple Python files, if they use the same DAG ID, only one will show. Be careful of that. It is imperative that DAG IDs are truly unique.
3. Apache Airflow Task Runs & Dates
In Airflow, everything is based on UTC, and one thing that is really difficult for a lot of people is that the dates work so differently than on any other platform. If a scheduled task ran right now and just finished executing, you would think the last run date displayed would be today. The date displayed is actually yesterday.
In Airflow, dates are always a day behind, and you may have to normalize for that. Whether you run through test, backfill, or the scheduler, they each handle dates differently, so be aware.
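Since the execution date marks the start of the schedule interval rather than the moment the run fires, one way to normalize is a small helper like the sketch below (the function name is mine, and the one-day default assumes a daily schedule):

```python
from datetime import datetime, timedelta

def run_fired_date(execution_date, interval=timedelta(days=1)):
    """Return the wall-clock date a run actually fired.

    Airflow's execution_date is the START of the schedule interval,
    so a daily run that fires today carries yesterday's date; adding
    one interval recovers the date the run actually kicked off.
    """
    return execution_date + interval

# A daily run labeled 2018-01-01 actually fired on 2018-01-02.
print(run_fired_date(datetime(2018, 1, 1)))
```

Adjust the `interval` argument to match your DAG’s schedule if it is not daily.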
airflow test DAG TASK DATE: The date passed is the date you specify, and it is used as the execution date.
airflow backfill DAG -s START_DATE -e END_DATE: Both a start and an end date are passed, and the end date will more than likely be the one you want. Also, when backfilling, remember this: if you want to run for 2018-01-02, the start date must be 2018-01-01 or you’ll have the wrong date.
Airflow scheduled runs: next_execution_date is the date of execution. Depending on your schedule this may be appropriate; otherwise, it may be the END_DATE again. Just be careful that you’re picking up the correct date.
When testing DAGs that depend on dates in any way, you will need to test in all three modes to ensure you’re handling everything correctly.
4. Concurrency & Max Runs in Apache Airflow
If you’re not careful, you can have multiple copies of the same task running at once. We found that was not something we wanted, so we enabled these two arguments at the DAG level:
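The exact snippet did not survive here, but given the section title the two arguments are presumably max_active_runs and concurrency; a hedged sketch of what that DAG definition might look like (DAG ID is hypothetical):

```python
from datetime import datetime

from airflow import DAG

# Hypothetical DAG capped to strictly sequential execution.
dag = DAG(
    dag_id="sequential_etl",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    max_active_runs=1,  # at most one DAG run in flight at a time
    concurrency=1,      # at most one task instance running within the DAG
)
```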
By enabling both of these, a single DAG can only have one DAG run executing at a time, with no extra parallel execution. This fit our processing patterns better. However, another word of caution: backfill disregards these arguments. So if you’re backfilling data and need it to run sequentially, wait for each job to complete first.
5. Using Apache Airflow Task State Callbacks
The following task callbacks can be used to perform extra functions based on task state: on_retry_callback to alert us of a delay, on_failure_callback to call a cleanup function, or on_success_callback to move files out of a processing queue. The only limit is your imagination. Decide what would work best for your company.
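A hedged sketch of what such callbacks might look like (the function names and return strings are mine for illustration): Airflow invokes each callback with a context dict, and task_instance_key_str is one of its standard keys.

```python
# Each callback receives Airflow's context dict when the task changes state.

def alert_on_retry(context):
    # e.g. notify an on-call channel that a task is delayed and retrying
    return "retrying: " + context["task_instance_key_str"]

def cleanup_on_failure(context):
    # e.g. remove half-written output so the retry starts clean
    return "cleaning up after: " + context["task_instance_key_str"]

def advance_queue_on_success(context):
    # e.g. move processed files out of the intake queue
    return "dequeued: " + context["task_instance_key_str"]

# In a DAG you would wire these into an operator, e.g.:
# PythonOperator(..., on_retry_callback=alert_on_retry,
#                on_failure_callback=cleanup_on_failure,
#                on_success_callback=advance_queue_on_success)

# Simulating a call with a minimal fake context:
fake_context = {"task_instance_key_str": "my_dag__my_task__20180101"}
print(alert_on_retry(fake_context))
```

In production these would send a Slack message, delete files, and so on; returning strings here just keeps the sketch self-contained.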
6. Apache Airflow Task Action Window
If a task has run, failed, or is in retry mode, you must clear it out before you can proceed. Otherwise, the UI will say it’s been set to run, but the scheduler will never run it. Be careful when clearing things out, though; it’s easy to clear more than you intend.
7. Manual vs Scheduled Runs in Apache Airflow
When you go to the graph view, the view you see is based on time. If you are looking at a DAG with a scheduled run, the newest run will appear. However, if you ran a manual load (pressed the play button) just before this scheduled run, the scheduled run will NOT be at the top; you will be looking at the manual run. Why? Remember, Airflow scheduled runs are a day behind. This part of the UI displays dates according to the START_DATE. As such, the start date for a scheduled run is a day behind, and your manual runs will always be on top.
Be aware of this so you do not try to look at logs for a manual run when you are debugging a scheduled run.
Remember, just because the UI says it works, sometimes it won’t. Also be careful: not all options in Airflow play nicely with each other. As we stated earlier, concurrency arguments are disregarded when running backfill operations.
For more information watch the full Apache Airflow Webinar.
If you know Python and you want to contribute to making this platform better, visit https://airflow.apache.org
About Dovy Paukstys:
Dovy is a seasoned cloud compute expert with vast experience in IoT, user-facing systems, ETL pipeline design, and on-prem tear-downs. He is also the owner and creator of Redux Framework. His goal is to provide the best solutions to fulfill the needs of his clients. You can follow Dovy on Twitter @dovyp and connect with him on LinkedIn.