Monday, April 20, 2020

Chocolate Chip Banana Bread

This is AB's banana bread recipe from IJHFMF with a few tiny modifications and comments. I've made it probably 10 times over the last year.
 
Ingredients
(dry team)
220g AP flour
35g oat flour, which means 35g oats put through a spice grinder or food processor
1 teaspoon salt
1 teaspoon baking soda

(wet team A)
1 stick unsalted butter, melted and cooled
2 eggs
1 teaspoon vanilla extract (or almond extract etc if you're feeling adventurous)

(wet team B)
4 bananas, extremely overripe. Like, seriously, they will be nearly black; it takes a couple of weeks for them to get this way.
180-210g sugar, to taste (original recipe says 210). I like to sub a little brown sugar or honey.
 
(misc)
extra butter for pan
dark(!) chocolate chips (only optional if you don't like being awesome)
chopped nuts (pecans or walnuts)

Note: if you have 6 bananas, or whatever, this recipe can be scaled up, but as written it already fills a loaf pan, so scaling up means you will need additional pans.


Tools
kitchen scale (we bake by weight, not volume!)
3 bowls
mixing spoon
electric hand mixer (optional)
loaf pan (mine is about 10"x5"x3") or muffin pan(s)
parchment paper
oven (duh)
cooling rack


Procedure
1. Peel the bananas and pile them in a bowl. This step is first so that you can abort if you find that they're moldy.

2. Pre-heat the oven to 350°F.

3. Melt the butter and set aside to cool.

4. Assemble wet team B by adding the sugar to the bananas and mashing/mixing thoroughly. I use a hand mixer.

5. Assemble the dry team. Toast the oats before grinding if you're feeling ambitious.

6. Finish wet team A by adding the eggs and vanilla to the butter and mixing gently. Just break the egg membranes and scramble them a bit. If the butter is hot when you do this, it will cook the eggs, and that is Bad.

7. Add wet team A to wet team B and mix. Again I use a hand mixer.

8. Add the combined wet team to the dry team. Mix only until combined (meaning no pockets of un-moistened flour). If your bananas were huge, or you used more than 4, the batter may seem too wet. Add a bit of extra flour. Getting this right takes practice.

9. Mix in chocolate chips and/or nuts to taste.

10a. If using a loaf pan, rub the inside with butter and then line with parchment paper (only the long sides need to be papered; the short sides will be touched by the batter, and this is fine).
10b. If using a muffin pan, use muffin wrappers (or don't, in which case you're on your own).

11. Pour in your batter. For the loaf pan this is simple. For muffins, I haven't yet figured out how much should go in each one. Best of luck.

12a. For a loaf, bake in the center of the oven at 350°F for 45 minutes, then raise to 380-400°F (experiment) for 15 minutes more (this browns the outside and firms the crust). Ovens vary and you may need to tweak times and temps. When it's done, a toothpick will come out not-quite-clean (unlike a cake). If a toothpick comes out totally clean, it's probably overbaked and the voice of Paul Hollywood will haunt your dreams.
12b. For muffins, bake for uhhhh less time than that? I haven't gotten them right yet.

13. Cool in the pan on a rack for 15 minutes, then remove from the pan and cool for 60 minutes more, or until you can't stand to wait any longer.

14. I wrap the loaf tightly in plastic, then foil, and keep it on the counter. It will definitely keep for a week. You will almost certainly eat it all before a week goes by.

Sunday, April 5, 2020

Far Too Many Words About Airflow

Author's note: I recently wrote the below in nearly-unbroken stream-of-consciousness mode targeted at a specific audience of one. It is reproduced here with just a few minor redactions. The subject/prompt was "why I dislike Airflow".
 

Subject: Airflow

 
I've never used Prefect, but they wrote a detailed piece called "Why Not Airflow?" that hits on many of the relevant issues.

In my own experience with Airflow I identified three major issues (some of which are covered at the above link):
1. Scheduling is based on fixed points
(docs here: https://airflow.apache.org/docs/stable/scheduler.html; look how confusing that is!)
When we think about schedules we naturally think of "when is this thing supposed to run?" It might be at a specific time, or it might be an interval description like "every hour" or "every day at 02:30", but it is almost certainly not "...the job instance is started once the period it covers has ended" or "The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period", as the Airflow docs describe it. Our natural conception of scheduling is future-oriented, whereas Airflow's is past-oriented. One way this manifests is that if I have a "daily" job and it first runs at, say, 2020-04-01T11:57:23-06:00 (roughly now), its next run will be at 2020-04-02T11:57:23-06:00. That is effectively never what I want. I want to be able to set up a job to run, e.g., daily at 11:00, and then, since it's a little after 11:00 right now, kick off a manual run now without impacting that future schedule. Airflow can't do this. They try to paper over their weird notion of scheduling by supporting "@daily", "@hourly", and cron expressions, but these are all translated to their bizarre internal interval concept.
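
To make that concrete, here's a minimal sketch using Airflow's standard DAG API (the DAG name and dates are invented for illustration):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # With this definition, the run whose execution_date is 2020-04-01
    # does not actually start until the period it "covers" has ended,
    # i.e. on 2020-04-02. "Daily starting April 1" != "runs April 1".
    dag = DAG(
        dag_id="daily_example",          # hypothetical name
        schedule_interval="@daily",
        start_date=datetime(2020, 4, 1),
    )

    noop = DummyOperator(task_id="noop", dag=dag)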

(Counterpoint: their schedule model does give rise to built-in backfill support, which is cool)

2. Schedules are optimized for machines, not humans
[Upfront weird bias note: I am cursed to trip over every timezone bug present in any system I use. As a result I have become very picky and opinionated about timezone handling.]

We run jobs on a schedule because of human concerns, not machine concerns. Any system that forces humans to bear the load of thinking about the gnarly details of time, rather than making the machine do it, is not well designed. Originally, Airflow would only run in UTC. By now they've added support for running in other timezones, but they still do not support DST, which basically means they don't actually support timezones. Now, standardizing on UTC certainly makes sense for some use cases at some firms, but for any firm headquartered in the US which mainly does business in the US, DST is a reality that affects humans, and that means we have to deal with it. If we deny that, we're going to have problems. For example, if I run a job at 05:00 UTC-7, a.k.a. Mountain Standard Time, chosen such that it will complete and make data available by 08:00 UTC-7 when employees start arriving to work, I am setting myself up for problems every March when my employees change their clocks and start showing up at 08:00 UTC-6 (which is 07:00 UTC-7!) because they are now on Mountain Daylight Time. If I insist on scheduling in UTC or a fixed UTC offset, I am probably going to have to move half my schedules twice a year. That's crazy! Computers can do this for us!
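
To see the drift concretely, here's a quick sketch (plain Python with pendulum, the library Airflow itself uses; the times are the hypothetical ones from above):

    import pendulum

    # The job fires at 05:00 UTC-7 year-round (a fixed offset)...
    for day in ("2020-01-15", "2020-06-15"):
        fire = pendulum.parse(day + "T05:00:00-07:00")
        # ...but employees live on local wall-clock time, which observes DST.
        local = fire.in_timezone("America/Denver")
        print(day, "->", local.format("HH:mm zz"), "local")

    # 2020-01-15 -> 05:00 MST local
    # 2020-06-15 -> 06:00 MDT local  (an hour later by the clock on the wall)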

3. DAGs cannot be dynamic
At the time I was seriously evaluating Airflow at [previous employer], this is what killed it.

A powerful technique in software design is to make our code data-driven. We don't often use that term, but it's a common technique; in fact, it's so common that we don't much notice it anymore. The simple way to think of it: I should be able to make my software do new things by giving it new input rather than by writing new code.

Consider a page like this one (from a former employer): https://shop.example.com/category-slug-foo/product-slug-bar/60774 [link removed, use your imagination]
No doubt you've been to thousands of such pages in your life as an internet user. And as an engineer, you know how they work. See that 60774 at the end? That's an ID, and we can infer that a request router will match against this URL, pull off that ID, and look it up in a database. The results of that lookup will be fed into a template, and the result of that template rendering will be the page that we see. In this way, one request handler and one template can render any product in the system, and the consequence of that is that adding new products requires only that we add data.
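
In code, the pattern is something like this (a hypothetical Flask-style sketch; the dict stands in for the database):

    from flask import Flask, abort, render_template

    app = Flask(__name__)

    # Stand-in for the product database (hypothetical data).
    PRODUCTS = {60774: {"name": "Bar", "price": 19.99}}

    @app.route("/<category_slug>/<product_slug>/<int:product_id>")
    def product_page(category_slug, product_slug, product_id):
        # The slugs are cosmetic/SEO; the trailing ID drives the lookup.
        product = PRODUCTS.get(product_id)
        if product is None:
            abort(404)
        return render_template("product.html", product=product)

Adding a new product means adding a row, not a route.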

Airflow doesn't work this way!

In Airflow's marketing material (for lack of a better term), they say that you build up your DAG with code, and that this is better than specifying static configuration. What they don't tell you is that your DAG-constructing code is expected to evaluate to the same result every time. In order to change the shape of your DAG, you must release new code. Sometimes this arguably makes sense. If my DAG at v1 is A -> B, and I change it in v2 to be A -> B -> C, perhaps it makes sense for that to be a new thing, or a new version of a thing. But what if my DAG is A -> B -> C, and I want to parallelize B, perhaps over an unpredictable number of input file chunks, as in A -> {B0, B1, ..., Bn} -> C, where n is unknown until runtime? Airflow doesn't allow this, because, again, our DAG-construction code must evaluate to the same shape every run. This means that if we want data to drive our code, that data must be stored inline with the code, and we must re-deploy our code whenever that data changes.

This is not good. I have built multiple flows using Luigi that expand at runtime to thousands of dynamically-constructed task nodes, and whose behavior could be adjusted between runs by adding or changing rows in a table. These flows cannot be expressed in Airflow. You will find posts suggesting the contrary (e.g. https://towardsdatascience.com/creating-a-dynamic-dag-using-apache-airflow-a7a6f3c434f3), but note what is going on there: configuration is being fed to the DAG code, but that configuration is stored with the code, and changing it requires a code push. If you can't feed it input without a code push, it's not dynamic.
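
For the record, here's roughly what runtime expansion looks like in Luigi (a toy sketch; the chunk list is stubbed, but in a real flow it comes from a directory listing or a table, i.e. from data):

    import luigi

    class ProcessChunk(luigi.Task):
        chunk = luigi.Parameter()

        def output(self):
            return luigi.LocalTarget("out/%s.done" % self.chunk)

        def run(self):
            with self.output().open("w") as f:
                f.write("processed\n")

    class Combine(luigi.Task):
        def output(self):
            return luigi.LocalTarget("out/combined.done")

        def run(self):
            # Discover the chunks at runtime (stubbed here; in a real flow,
            # this listing is the data that drives the shape of the graph).
            chunks = ["a", "b", "c"]
            # Luigi's dynamic dependencies: yielding tasks from run() adds
            # them to the graph *now*, so n need not be known at parse time.
            yield [ProcessChunk(chunk=c) for c in chunks]
            with self.output().open("w") as f:
                f.write("combined %d chunks\n" % len(chunks))

    if __name__ == "__main__":
        luigi.build([Combine()], local_scheduler=True)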


Airflow and the team at Airbnb that built it deserve a lot of credit for popularizing the concept of DAG-oriented structuring of data jobs in a way that Luigi (which predates it by years) failed to do. The slick UI, built-in scheduler, and built-in job executor are likewise praiseworthy. Ultimately, though, I've found that tightly coupling your flow structure to your scheduling system is a mis-feature. The fact that Luigi jobs must be initiated by an outside force is actually a powerful simplification: it means that a Luigi program is just a program, which can be run from anywhere and does not (necessarily) require complex execution infrastructure. (Prefect can be used in this way as well, or with its own supplied scheduler.)

I also concede that there is value in wholesale adoption of Airflow (or something like it) as the central unifying structure of one's data wrangling universe. Regardless of the specific tech, having a single central scheduler is a great idea, because it makes the answers to "where is X scheduled?" or "is there anything that runs at time Y?" trivial to find. What's worrisome about Airflow specifically in that role is all the things it prevents you from doing, or allows only through dirty hacks like writing DAGs that use Luigi internally, or using code-generation to push dynamism to "build time".

Lastly, I have to concede that Airflow's sheer popularity is a vote in its favor. There's a lot of enthusiasm and momentum behind it, which bodes well for future feature additions and so on. There are already managed Airflow-as-a-service products like Astronomer. I think it's still early, though. I've had a serious interest in dependency-structured data workflows since at least 2007, and until I encountered Luigi in 2014 I was aware of zero products that addressed this need, other than giant commercial monsters like Informatica. There's still a great deal of room for innovation and new players in this space.
 
[Original rant concludes here.]

Addendum

For whatever reason this topic keeps turning over in my head, so here are even more words.
 
I recently interviewed with 4 companies, of which 2 are using Airflow and a third is/was planning to adopt it. My current employer also uses it. I have little idea whether any of them are actually happy with it, or whether they understand the value and/or struggles it's creating for them.

Workflow / pipeline structuring is far from a solved problem. As I noted in my rant, the problem has existed basically forever, but general solutions -- platforms, frameworks -- have only started popping up in the last decade or so (again deliberately ignoring Informatica et al). There seems to be a temptation in the industry to treat Airflow as the de facto standard solution just because it's popular and appears to have a slick UI (the UI is actually clunky as hell, which you will discover within 30 seconds of trying to use it).

The options in this space by my reckoning are:
  • Luigi (Spotify, 2012)
  • Drake (Factual, 2013)
  • Airflow (Airbnb, 2015)
  • Dagster (dagster.io, 2018)
  • Prefect (prefect.io, 2019)
  • AWS Step Functions (AWS, 2016)
  • chaining jobs in Jenkins (2011?)
  • miscellaneous proprietary shit
These are all very different from each other! This is not a choice like React vs Vue, Flask vs FastAPI, Dropwizard vs Spring Boot, AWS vs GCP vs Azure, or [pick your favorite].
 
These tools aren't even all the same kind of thing. Luigi is a library, Drake is a CLI tool, Airflow and Prefect are libraries and schedulers and distributed task executors, Step Functions is a managed service, and Jenkins is a full server process and plugin ecosystem nominally intended for doing software builds.
 
They also differ markedly in how they model a workflow/pipeline. Luigi has Tasks which produce Targets and depend on other Targets, a design which almost fully externalizes state, with the result that Tasks whose outputs already exist will not be run, even if you ask them to. Airflow tightly couples its tracking of Task state to a server process + database, and requires Tasks to have an execution_date, so whether a Task will run when you ask depends on whether the server thinks it has already run for the specified date. Drake, like make, uses output file timestamps to determine what needs to run. Jenkins just does whatever you tell it to (unless some plugin makes it work completely differently!).
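
The Luigi point is worth a tiny demonstration, because it surprises people: completeness is just "does the Target exist," so a second build is a no-op (toy sketch):

    import luigi

    class Hello(luigi.Task):
        def output(self):
            return luigi.LocalTarget("hello.txt")  # existence == done

        def run(self):
            with self.output().open("w") as f:
                f.write("hello\n")

    luigi.build([Hello()], local_scheduler=True)  # runs the task
    luigi.build([Hello()], local_scheduler=True)  # no-op: hello.txt exists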

We don't even have standardized language for talking about these things. Several of these tools use the word "task" to name a major concept in their model. They're usually similar but hardly interchangeable. Perhaps a better example is trying to talk about "dynamic" DAGs, like I mentioned in my rant. I mean something very specific when I say a DAG is dynamic: that the shape of the execution graph is determined at runtime. Other people describe DAGs as dynamic simply because they were assembled by running code rather than specified as configuration data. These definitions are apples and oranges, and the result is a great deal of confusion in discussions of capabilities and alternatives, particularly in the very limited space of public conversation.

I encourage everyone to go out and try this stuff. Build a trivial, dummy pipeline and implement it in 3+ tools. Then repeat that exercise with a small pipeline that does real work and can stand in for the kind of problem you typically tackle. Then start building a solution to a serious problem. You don't have to build the whole thing. If you've gotten this far, simply writing stubbed-out functions/classes and focusing on how they wire up will tell you a great deal. Tasks that sleep for a random time and then touch a file or insert a row are often all you need to simulate your entire data processing world (see the sketch below). As a final step, think and work through what happens as you change things. Most of these tools don't discuss their implied deployment models, and the devil is in the details.
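
A stub task of the kind I mean is only a few lines (tool-agnostic Python; port it into each framework under test):

    import random
    import time
    from pathlib import Path

    def stub_task(name, out_dir="out"):
        """Stand-in for real work: sleep a bit, then touch an output file."""
        time.sleep(random.uniform(0.1, 2.0))
        Path(out_dir).mkdir(exist_ok=True)
        Path(out_dir, name + ".done").touch()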

The bottom line for me is that this remains an active research area, even though I've been working on it for over a decade. I've learned quite a bit in that time but my wisdom remains dwarfed by my ignorance. Don't believe anyone who's trying to tell you that we have this figured out.