Monday, April 20, 2020

Chocolate Chip Banana Bread

This is AB's banana bread recipe from IJHFMF with a few tiny modifications and comments. I've made it probably 10 times over the last year.
 
Ingredients
(dry team)
220g AP flour
35g oat flour, which means 35g oats put through a spice grinder or food processor
1 teaspoon salt
1 teaspoon baking soda

(wet team A)
1 stick unsalted butter, melted and cooled
2 eggs
1 teaspoon vanilla extract (or almond extract etc if you're feeling adventurous)

(wet team B)
4 bananas, extremely overripe. Like, seriously, they will be nearly black; it takes a couple of weeks for them to get this way.
180-210g sugar, to taste (original recipe says 210). I like to sub a little brown sugar or honey.
 
(misc)
extra butter for pan
dark(!) chocolate chips (only optional if you don't like being awesome)
chopped nuts (pecans or walnuts)

Note: if you have 6 bananas, or whatever, this recipe can be scaled up, but as written it already fills a loaf pan, so scaling up means you will need additional pans.


Tools
kitchen scale (we bake by weight, not volume!)
3 bowls
mixing spoon
electric hand mixer (optional)
loaf pan (mine is about 10"x5"x3") or muffin pan(s)
parchment paper
oven (duh)
cooling rack


Procedure
1. Peel the bananas and pile them in a bowl. This step is first so that you can abort if you find that they're moldy.

2. Pre-heat the oven to 350°F.

3. Melt the butter and set aside to cool.

4. Assemble wet team B by adding the sugar to the bananas and mashing/mixing thoroughly. I use a hand mixer.

5. Assemble the dry team. Toast the oats before grinding if you're feeling ambitious.

6. Finish wet team A by adding the eggs and vanilla to the butter and mixing gently. Just break the egg membranes and scramble them a bit. If the butter is hot when you do this, it will cook the eggs, and that is Bad.

7. Add wet team A to wet team B and mix. Again I use a hand mixer.

8. Add the combined wet team to the dry team. Mix only until combined (meaning no pockets of un-moistened flour). If your bananas were huge, or you used more than 4, the batter may seem too wet. Add a bit of extra flour. Getting this right takes practice.

9. Mix in chocolate chips and/or nuts to taste.

10a. If using a loaf pan, rub the inside with butter and then line with parchment paper (only the long sides need to be papered; the short sides will be touched by the batter, and this is fine).
10b. If using a muffin pan, use muffin wrappers (or don't, in which case you're on your own).

11. Pour in your batter. For the loaf pan this is simple. For muffins, I haven't yet figured out how much should go in each one. Best of luck.

12a. For a loaf, bake in the center of the oven at 350°F for 45 minutes, then raise to 380-400°F (experiment) for 15 minutes more (this browns the outside and firms the crust). Ovens vary and you may need to tweak times and temps. When it's done, a toothpick will come out not-quite-clean (unlike a cake). If a toothpick comes out totally clean, it's probably overbaked and the voice of Paul Hollywood will haunt your dreams.
12b. For muffins, bake for uhhhh less time than that? I haven't gotten them right yet.

13. Cool in the pan on a rack for 15 minutes, then remove from the pan and cool for 60 minutes more, or until you can't stand to wait any longer.

14. I wrap the loaf tightly in plastic, then foil, and keep it on the counter. It will definitely keep for a week. You will almost certainly eat it all before a week goes by.

Sunday, April 5, 2020

Far Too Many Words About Airflow

Author's note: I recently wrote the below in nearly-unbroken stream-of-consciousness mode targeted at a specific audience of one. It is reproduced here with just a few minor redactions. The subject/prompt was "why I dislike Airflow".
 

Subject: Airflow

 
I've never used Prefect, but they wrote a detailed piece called "Why Not Airflow?" that hits on many of the relevant issues.

In my own experience with Airflow I identified three major issues (some of which are covered at the above link):
1. Scheduling is based on fixed points
(docs here: https://airflow.apache.org/docs/stable/scheduler.html; look how confusing that is!)
When we think about schedules we naturally think of "when is this thing supposed to run?" It might be at a specific time, or it might be an interval description like "every hour" or "every day at 02:30", but it is almost certainly not "...the job instance is started once the period it covers has ended" or "The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period", as the Airflow docs describe it. Our natural conception of scheduling is future-oriented, whereas Airflow's is past-oriented. One way this manifests is that if I have a "daily" job and it first runs at, say, 2020-04-01T11:57:23-06:00 (roughly now), its next run will be at 2020-04-02T11:57:23-06:00. That is effectively never what I want. I want to be able to set up a job to run, e.g., daily at 11:00, and then, since it's a little after 11:00 right now, kick off a manual run now without impacting that future schedule. Airflow can't do this. They try to paper over their weird notion of scheduling by supporting "@daily", "@hourly", and cron expressions, but these are all translated to their bizarre internal interval concept.
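
To make that concrete, here's a minimal sketch using Airflow's standard DAG API (the DAG name and dates are invented for illustration):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # With this definition, the run whose execution_date is 2020-04-01
    # does not actually start until the period it "covers" has ended,
    # i.e. on 2020-04-02. "Daily starting April 1" != "runs April 1".
    dag = DAG(
        dag_id="daily_example",          # hypothetical name
        schedule_interval="@daily",
        start_date=datetime(2020, 4, 1),
    )

    noop = DummyOperator(task_id="noop", dag=dag)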

(Counterpoint: their schedule model does give rise to built-in backfill support, which is cool)

2. Schedules are optimized for machines, not humans
[Upfront weird bias note: I am cursed to trip over every timezone bug present in any system I use. As a result I have become very picky and opinionated about timezone handling.]

We run jobs on a schedule because of human concerns, not machine concerns. Any system that forces humans to bear the load of thinking about the gnarly details of time, rather than making the machine do it, is not well designed. Originally, Airflow would only run in UTC. By now they've added support for running in other timezones, but they still do not support DST, which basically means they don't actually support timezones. Now, standardizing on UTC certainly makes sense for some use cases at some firms, but for any firm headquartered in the US which mainly does business in the US, DST is a reality that affects humans, and that means we have to deal with it. If we deny that, we're going to have problems. For example, if I run a job at 05:00 UTC-7, a.k.a. Mountain Standard Time, chosen such that it will complete and make data available by 08:00 UTC-7 when employees start arriving to work, I am setting myself up for problems every March when my employees change their clocks and start showing up at 08:00 UTC-6 (which is 07:00 UTC-7!) because they are now on Mountain Daylight Time. If I insist on scheduling in UTC or a fixed UTC offset, I am probably going to have to move half my schedules twice a year. That's crazy! Computers can do this for us!
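
To see the drift concretely, here's a quick sketch (plain Python with pendulum, the library Airflow itself uses; the times are the hypothetical ones from above):

    import pendulum

    # The job fires at 05:00 UTC-7 year-round (a fixed offset)...
    for day in ("2020-01-15", "2020-06-15"):
        fire = pendulum.parse(day + "T05:00:00-07:00")
        # ...but employees live on local wall-clock time, which observes DST.
        local = fire.in_timezone("America/Denver")
        print(day, "->", local.format("HH:mm zz"), "local")

    # 2020-01-15 -> 05:00 MST local
    # 2020-06-15 -> 06:00 MDT local  (an hour later by the clock on the wall)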

3. DAGs cannot be dynamic
At the time I was seriously evaluating Airflow at [previous employer], this is what killed it.

A powerful technique in software design is to make our code data-driven. We don't often use that term, but it's a common technique; in fact, it's so common that we don't much notice it anymore. The simple way to think of it: I should be able to make my software do new things by giving it new input rather than by writing new code.

Consider a page like this one (from a former employer): https://shop.example.com/category-slug-foo/product-slug-bar/60774 [link removed, use your imagination]
No doubt you've been to thousands of such pages in your life as an internet user. And as an engineer, you know how they work. See that 60774 at the end? That's an ID, and we can infer that a request router will match against this URL, pull off that ID, and look it up in a database. The results of that lookup will be fed into a template, and the result of that template rendering will be the page that we see. In this way, one request handler and one template can render any product in the system, and the consequence of that is that adding new products requires only that we add data.
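
In code, the pattern is something like this (a hypothetical Flask-style sketch; the dict stands in for the database):

    from flask import Flask, abort, render_template

    app = Flask(__name__)

    # Stand-in for the product database (hypothetical data).
    PRODUCTS = {60774: {"name": "Bar", "price": 19.99}}

    @app.route("/<category_slug>/<product_slug>/<int:product_id>")
    def product_page(category_slug, product_slug, product_id):
        # The slugs are cosmetic/SEO; the trailing ID drives the lookup.
        product = PRODUCTS.get(product_id)
        if product is None:
            abort(404)
        return render_template("product.html", product=product)

Adding a new product means adding a row, not a route.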

Airflow doesn't work this way!

In Airflow's marketing material (for lack of a better term), they say that you build up your DAG with code, and that this is better than specifying static configuration. What they don't tell you is that your DAG-constructing code is expected to evaluate to the same result every time. In order to change the shape of your DAG, you must release new code. Sometimes this arguably makes sense. If my DAG at v1 is A -> B, and I change it in v2 to be A -> B -> C, perhaps it makes sense for that to be a new thing, or a new version of a thing. But what if my DAG is A -> B -> C, and I want to parallelize B, perhaps over an unpredictable number of input file chunks, as in A -> {B0, B1, ..., Bn} -> C, where n is unknown until runtime? Airflow doesn't allow this, because, again, our DAG-construction code must evaluate to the same shape every run. This means that if we want data to drive our code, that data must be stored inline with the code, and we must re-deploy our code whenever that data changes.

This is not good. I have built multiple flows using Luigi that expand at runtime to thousands of dynamically-constructed task nodes, and whose behavior could be adjusted between runs by adding or changing rows in a table. These flows cannot be expressed in Airflow. You will find posts suggesting the contrary (e.g. https://towardsdatascience.com/creating-a-dynamic-dag-using-apache-airflow-a7a6f3c434f3), but note what is going on there: configuration is being fed to the DAG code, but that configuration is stored with the code, and changing it requires a code push. If you can't feed it input without a code push, it's not dynamic.
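
For the record, here's roughly what runtime expansion looks like in Luigi (a toy sketch; the chunk list is stubbed, but in a real flow it comes from a directory listing or a table, i.e. from data):

    import luigi

    class ProcessChunk(luigi.Task):
        chunk = luigi.Parameter()

        def output(self):
            return luigi.LocalTarget("out/%s.done" % self.chunk)

        def run(self):
            with self.output().open("w") as f:
                f.write("processed\n")

    class Combine(luigi.Task):
        def output(self):
            return luigi.LocalTarget("out/combined.done")

        def run(self):
            # Discover the chunks at runtime (stubbed here; in a real flow,
            # this listing is the data that drives the shape of the graph).
            chunks = ["a", "b", "c"]
            # Luigi's dynamic dependencies: yielding tasks from run() adds
            # them to the graph *now*, so n need not be known at parse time.
            yield [ProcessChunk(chunk=c) for c in chunks]
            with self.output().open("w") as f:
                f.write("combined %d chunks\n" % len(chunks))

    if __name__ == "__main__":
        luigi.build([Combine()], local_scheduler=True)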


Airflow and the team at Airbnb that built it deserve a lot of credit for popularizing the concept of DAG-oriented structuring of data jobs in a way that Luigi (which predates it by years) failed to do. The slick UI, built-in scheduler, and built-in job executor are likewise praiseworthy. Ultimately, though, I've found that tightly coupling your flow structure to your scheduling system is a mis-feature. The fact that Luigi jobs must be initiated by an outside force is actually a powerful simplification: it means that a Luigi program is just a program, which can be run from anywhere and does not (necessarily) require complex execution infrastructure. (Prefect can be used in this way as well, or with its own supplied scheduler.)

I also concede that there is value in wholesale adoption of Airflow (or something like it) as the central unifying structure of one's data wrangling universe. Regardless of the specific tech, having a single central scheduler is a great idea, because it makes the answers to "where is X scheduled?" or "is there anything that runs at time Y?" trivial to find. What's worrisome about Airflow specifically in that role is all the things it prevents you from doing, or allows only through dirty hacks like writing DAGs that use Luigi internally, or using code-generation to push dynamism to "build time".

Lastly, I have to concede that Airflow's sheer popularity is a vote in its favor. There's a lot of enthusiasm and momentum behind it, which bodes well for future feature additions and so on. There are already managed Airflow-as-a-service products like Astronomer. I think it's still early, though. I've had a serious interest in dependency-structured data workflows since at least 2007, and until I encountered Luigi in 2014 I was aware of zero products that addressed this need, other than giant commercial monsters like Informatica. There's still a great deal of room for innovation and new players in this space.
 
[Original rant concludes here.]

Addendum

For whatever reason this topic keeps turning over in my head, so here are even more words.
 
I recently interviewed with 4 companies, of which 2 are using Airflow and a third is/was planning to adopt it. My current employer also uses it. I have little idea whether any of them are actually happy with it, or whether they understand the value and/or struggles it's creating for them.

Workflow / pipeline structuring is far from a solved problem. As I noted in my rant, the problem has existed basically forever, but general solutions -- platforms, frameworks -- have only started popping up in the last decade or so (again deliberately ignoring Informatica et al). There seems to be a temptation in the industry to treat Airflow as the de facto standard solution just because it's popular and appears to have a slick UI (the UI is actually clunky as hell, which you will discover within 30 seconds of trying to use it).

The options in this space by my reckoning are:
  • Luigi (Spotify, 2012)
  • Drake (Factual, 2013)
  • Airflow (Airbnb, 2015)
  • Dagster (dagster.io, 2018)
  • Prefect (prefect.io, 2019)
  • AWS Step Functions (AWS, 2016)
  • chaining jobs in Jenkins (2011?)
  • miscellaneous proprietary shit
These are all very different from each other! This is not a choice like React vs Vue, Flask vs FastAPI, Dropwizard vs Spring Boot, AWS vs GCP vs Azure, or [pick your favorite].
 
These tools aren't even all the same kind of thing. Luigi is a library, Drake is a CLI tool, Airflow and Prefect are libraries and schedulers and distributed task executors, Step Functions is a managed service, and Jenkins is a full server process and plugin ecosystem nominally intended for doing software builds.
 
They also differ markedly in how they model a workflow/pipeline. Luigi has Tasks which produce Targets and depend on other Targets, a design which almost fully externalizes state, with the result that Tasks whose outputs already exist will not be run, even if you ask them to. Airflow tightly couples its tracking of Task state to a server process + database, and requires Tasks to have an execution_date, so whether a Task will run when you ask depends on whether the server thinks it has already run for the specified date. Drake, like make, uses output file timestamps to determine what needs to run. Jenkins just does whatever you tell it to (unless some plugin makes it work completely differently!).
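
The Luigi point is worth a tiny demonstration, because it surprises people: completeness is just "does the Target exist," so a second build is a no-op (toy sketch):

    import luigi

    class Hello(luigi.Task):
        def output(self):
            return luigi.LocalTarget("hello.txt")  # existence == done

        def run(self):
            with self.output().open("w") as f:
                f.write("hello\n")

    luigi.build([Hello()], local_scheduler=True)  # runs the task
    luigi.build([Hello()], local_scheduler=True)  # no-op: hello.txt exists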

We don't even have standardized language for talking about these things. Several of these tools use the word "task" to name a major concept in their model. They're usually similar but hardly interchangeable. Perhaps a better example is trying to talk about "dynamic" DAGs, like I mentioned in my rant. I mean something very specific when I say a DAG is dynamic: that the shape of the execution graph is determined at runtime. Other people describe DAGs as dynamic simply because they were assembled by running code rather than specified as configuration data. These definitions are apples and oranges, and the result is a great deal of confusion in discussions of capabilities and alternatives, particularly in the very limited space of public conversation.

I encourage everyone to go out and try this stuff. Build a trivial, dummy pipeline and implement it in 3+ tools. Then repeat that exercise with a small pipeline that does real work and can stand in for the kind of problem you typically tackle. Then start building a solution to a serious problem. You don't have to build the whole thing. If you've gotten this far, simply writing stubbed-out functions/classes and focusing on how they wire up will tell you a great deal. Tasks that sleep for a random time and then touch a file or insert a row are often all you need to simulate your entire data processing world (see the sketch below). As a final step, think and work through what happens as you change things. Most of these tools don't discuss their implied deployment models, and the devil is in the details.
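
A stub task of the kind I mean is only a few lines (tool-agnostic Python; port it into each framework under test):

    import random
    import time
    from pathlib import Path

    def stub_task(name, out_dir="out"):
        """Stand-in for real work: sleep a bit, then touch an output file."""
        time.sleep(random.uniform(0.1, 2.0))
        Path(out_dir).mkdir(exist_ok=True)
        Path(out_dir, name + ".done").touch()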

The bottom line for me is that this remains an active research area, even though I've been working on it for over a decade. I've learned quite a bit in that time but my wisdom remains dwarfed by my ignorance. Don't believe anyone who's trying to tell you that we have this figured out.