Developing an Apache Airflow Addiction
Why you need Apache Airflow, Spark, and Notebooks in your ETL code.
Assumption: Everyone uses Spark
What’s everyone’s obsession with Jupyter notebooks?
The titles data engineer and data scientist are thrown around by recruiters and employees alike, but the two roles' opinions of the Jupyter notebook certainly differ. Engineers will shrug and say, “yeah, it’s not a bad IDE,” while Data Scientists are likely to sacrifice their firstborn to ensure it stays open source.
Why? Notebooks make it easy to explore with Spark code. A Data Scientist can run a small piece of code, delete it, add another piece, overwrite that, decorate it with some graphs, put in a few titles, and they have an interesting, executable presentation. What’s more, when used correctly, notebooks can be common ground for both engineers and data scientists.
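To make that concrete, here is a minimal, notebook-style PySpark exploration cell. The dataset path and column names are hypothetical placeholders; the point is simply how quickly you can read, aggregate, and eyeball data in a cell before turning it into a chart or a slide.

```python
# One notebook cell: load data, peek at the schema, and build a quick aggregate.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("exploration").getOrCreate()

# Hypothetical dataset and columns, purely for illustration.
events = spark.read.parquet("s3://my-bucket/events.parquet")
events.printSchema()

# A quick daily count to inspect in the notebook, then plot with your favorite charting library.
daily_counts = (
    events.groupBy(F.to_date("event_ts").alias("day"))
    .count()
    .orderBy("day")
)
daily_counts.show(10)
```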
Better still, you’re in luck: Google Datalab, Amazon EMR, and Databricks all let you write code in notebooks freely, then put them (and their Spark code) into production as scheduled jobs.
Do I really need a task manager?
So now you’re using Jupyter notebooks, and you might be thinking: great, my ETL for Data Science is taken care of!
Sorry, think again.
The model that the Data Scientist built can only run once Job X is done, and Job X in turn depends on Job Y. So we need a task manager. In comes Apache Airflow, an open-source Python task manager with a dashboard, worker nodes, and even a few easy-to-use Docker containers.
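Here is a minimal sketch of what that dependency chain looks like as an Airflow DAG. The task names and bash commands are illustrative only, and import paths vary a bit between Airflow versions (older releases use `airflow.operators.bash_operator`).

```python
# Minimal Airflow DAG sketch: Job X only runs after Job Y succeeds.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # older Airflow: airflow.operators.bash_operator

with DAG(
    dag_id="etl_dependencies",
    start_date=datetime(2018, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Placeholder commands; in practice these would trigger your notebooks or Spark jobs.
    job_y = BashOperator(task_id="job_y", bash_command="echo 'run ingestion step'")
    job_x = BashOperator(task_id="job_x", bash_command="echo 'run model-prep step'")

    job_y >> job_x  # declare the dependency: Job Y, then Job X
```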
Even better, it has hooks for Google, Amazon, and Databricks. Use Airflow to set up your dependencies, plug in your notebooks, and you have a sturdy, scalable, transparent ETL task manager that your Data Engineers can easily work with and your Data Scientists can geek out about.
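As one example of plugging a notebook in, the Databricks provider for Airflow ships a `DatabricksSubmitRunOperator` that can run a notebook on a cluster. The cluster spec, notebook path, and connection ID below are placeholders, and this is a sketch assuming the `apache-airflow-providers-databricks` package is installed; the Google and Amazon providers offer analogous operators.

```python
# Sketch: schedule a Databricks notebook as an Airflow task.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="notebook_etl",
    start_date=datetime(2018, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_notebook = DatabricksSubmitRunOperator(
        task_id="run_etl_notebook",
        databricks_conn_id="databricks_default",   # connection configured in Airflow
        new_cluster={                              # placeholder cluster spec
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
        notebook_task={"notebook_path": "/Shared/etl/ingest_notebook"},  # placeholder path
    )
```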
Getting hooked on Apache Airflow.
Sound complicated? I’ve been doing it for the past few months and I’m not gonna lie, it kind of is. However, three fellow engineers and I completed a multi-source big data ingestion platform using the above technologies for a large corporation, and it was a strong learning experience. Once you get past the learning curve, the benefits are well worth it, and before you know it, you’re hooked.
I strongly advocate for using Apache Airflow, Spark, and notebooks in your ETL code.
My company, Caserta, will be breaking down the technical details of using these technologies on Google Cloud at our next Meetup on Tuesday (5/8). If you want to reduce the learning curve and get hooked, register for the Meetup and join me.
If you can’t make the Meetup but still want to learn, sign up for the Webinar (5/11).