It’s been more than a year since I was introduced to AWS Step Functions and my team decided to use them for our internal data workflows. It all started with the goal of supporting our ETL processes: extracting data from our various databases, external platforms and APIs, loading it into our Snowflake data warehouse, and transforming the tables there with dbt. However, since we also started implementing product-facing features that required data transformations and exports to external systems, it made sense to extend our AWS Step Functions usage to those as well.

Back when we made the decision, the team was new and the whole company, devops team included, was in the middle of its AWS migration. It therefore made sense to choose a managed service provided by AWS rather than self-host our own Airflow or Prefect orchestrator, or even use their managed versions. Around September 2021, AWS Step Functions supported only around ten services rather than the whole AWS ecosystem, but AWS Lambda and EKS support was enough for us back then. It was really easy to deploy our ETL scripts as AWS Lambdas, prepare a few Step Functions per ETL workflow and put them in production. We were surprised to find out, a couple of weeks after deploying our workflows to production, that AWS had added support for practically all AWS services, which allowed us to use many more task types: SSM calls to retrieve secrets from the Parameter Store, RDS calls, Athena queries and S3 operations, to name but a few. We had implemented those with Lambdas, but it was really straightforward to switch over. While most of our workflows were Standard Step Functions, we used Express for the product-facing ones, which let us orchestrate and easily scale real-time tasks such as extracting and transforming data and sending it to customers.
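
For a sense of how little ceremony this involved: each workflow boiled down to a state machine definition wrapping our Lambdas. Here is a minimal boto3 sketch of that setup; the function name, ARNs and payload are made up, and in practice we provision these through infrastructure-as-code rather than ad-hoc API calls.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical single-step ETL workflow: invoke the Lambda that extracts one source.
definition = {
    "Comment": "Extract one source and land the result in S3",
    "StartAt": "ExtractSource",
    "States": {
        "ExtractSource": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",  # optimized Lambda integration
            "Parameters": {
                "FunctionName": "extract-billing-db",      # hypothetical Lambda name
                "Payload": {"source": "billing_db"},
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="etl-billing-db",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/step-functions-etl",  # hypothetical role
    type="STANDARD",  # our product-facing workflows use type="EXPRESS" instead
)
```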

But why do I miss Airflow? Well, I have been a heavy Apache Airflow user throughout my career and, despite its cons, I have found ways to use it efficiently in various companies. I never relied on its data connectors and integrations, but always used it as a scheduler of tasks executed in the infrastructure rather than within the Airflow workers. I have debugged so many errors and experimented with so many different Celery settings and resource allocations to accommodate the tasks the workers had to execute that I eventually learned to rely only on its scheduling functionality and informative UI. AWS Step Functions is exactly the opposite. It’s the de facto way to execute tasks within the AWS ecosystem, and its serverless nature hides all that complexity and those errors from you. So what do I miss?

  • A user interface. Perhaps the Airflow UI is not the best, but it fulfills its purpose. You can quickly see the last execution, the last failed DAGs and quick statistics, and you can click on a task and see its logs. You can even give access to someone less technical and they can understand whether something failed or not (with the exception that you might need to explain to them why the execution timestamps are yesterday’s). The Step Functions UI has very limited functionality: it only shows you the executions and the input and output of each task, and redirects you elsewhere to figure out their logs (EKS tasks don’t expose logs at all, so for errors there we have to check Datadog, which makes them tough to follow sometimes).

  • Scheduling. We use EventBridge rules to schedule our state machines, but that is a separate service doing many other things, and I would prefer if that part of it were bundled into AWS Step Functions (see the scheduling sketch after this list).

  • Backfill and pause. There have been many times in AWS Step Functions when a workflow failed due to some bug in our code and we couldn’t rerun just parts of it; we had to run it again from scratch. In Airflow, I would go to the failed task and backfill from that specific date until now. The same applies to pausing tasks and DAGs: I was always able to pause a DAG until something else was resolved. With AWS Step Functions I need to remember to disable the EventBridge rule, and then remember to enable it again (also covered in the scheduling sketch below).

  • Airflow DAG definition. It has always been straightforward for me to create a new Airflow DAG and connect tasks in Python within minutes, while the same workflow in AWS Step Functions might take hours and lots of dry runs. Drag-and-drop of tasks makes it easy in some sense, but anything that has to do with keeping or exchanging state and variables between tasks has a steep learning curve, especially since in some cases it isn’t documented anywhere what a Lambda returns as output, which parameter to pick from the JSON, and so on (see the DAG sketch after this list).

  • Manipulating state and parameters between tasks. This is probably what I miss the most. Besides the couple of intrinsic functions that AWS supports, you cannot define or manipulate any parameters or state between tasks: I cannot get the date of the execution, I cannot manipulate dates across tasks (unless I create a special Lambda for this), I cannot perform basic string manipulation on the S3 bucket path I receive from one task before passing it to another (again, a special Lambda has to do this), and so on. Airflow’s pythonic way of setting up tasks and transferring parameters between them is really efficient, in my opinion (the last sketch below shows what I mean).
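
To make the scheduling and backfill points concrete, here is roughly what this looks like for us with boto3 today; every name and ARN below is made up.

```python
import json
from datetime import date, timedelta

import boto3

events = boto3.client("events")
sfn = boto3.client("stepfunctions")

STATE_MACHINE_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:etl-billing-db"

# Scheduling lives in EventBridge, not in Step Functions itself.
events.put_rule(Name="etl-billing-db-daily", ScheduleExpression="cron(0 4 * * ? *)")
events.put_targets(
    Rule="etl-billing-db-daily",
    Targets=[{
        "Id": "etl-billing-db",
        "Arn": STATE_MACHINE_ARN,
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-start-execution",
    }],
)

# "Pausing" the workflow means disabling the rule, and remembering to enable it again.
events.disable_rule(Name="etl-billing-db-daily")
events.enable_rule(Name="etl-billing-db-daily")

# A "backfill" is a loop of manual executions, with the date passed in as input.
day = date(2022, 11, 1)
while day <= date.today():
    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({"run_date": day.isoformat()}),
    )
    day += timedelta(days=1)
```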
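
For comparison, this is roughly how quickly the same kind of workflow comes together as an Airflow DAG; the dag_id, schedule and commands are again hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="etl_billing_db",
    start_date=datetime(2022, 11, 1),
    schedule_interval="0 4 * * *",  # scheduling is part of the DAG definition itself
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py --source billing_db")
    load = BashOperator(task_id="load", bash_command="python load.py --target snowflake")
    transform = BashOperator(task_id="transform", bash_command="dbt run --select billing")

    extract >> load >> transform  # the whole dependency graph in one line
```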
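
And this is the parameter handling I keep missing, as a TaskFlow sketch: the execution date and an S3 path flow between tasks as plain Python values, no glue Lambda required (the bucket and paths are invented).

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval="@daily", start_date=datetime(2022, 11, 1), catchup=False)
def export_to_customer():
    @task
    def extract(ds=None):
        # `ds` is the execution date ("YYYY-MM-DD") that Airflow injects for free;
        # in Step Functions I would need a dedicated Lambda just to compute it.
        return f"s3://exports-bucket/raw/{ds}/data.parquet"

    @task
    def transform(raw_path):
        # Plain Python string manipulation between tasks.
        return raw_path.replace("/raw/", "/clean/")

    @task
    def send(clean_path):
        print(f"sending {clean_path} to the customer")

    send(transform(extract()))


export_to_customer()
```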

This is my take on AWS Step Functions and what I miss from Apache Airflow. To be honest, if I had a proper user interface integrating scheduling, logs and alerts, plus better parameter handling, I would be extremely happy with them. As it stands, we are considering whether a migration to Airflow or another orchestrator is worth it.