<h1 id="step-functions-airflow">Missing Apache Airflow, is that even possible?</h1>
<p><em>2022-11-11</em></p>

<p>It’s been more than a year since I was introduced to AWS Step Functions and my team decided to use them for our internal data workflows. It all started with the aim of supporting all our ETL processes: extracting data from our various databases, external platforms and APIs, and eventually loading and transforming our tables using dbt into our Snowflake data warehouse. However, since we also started implementing product-facing features which required data transformations and exports to external systems, it made sense to extend our AWS Step Functions usage to those as well.</p>
<p>Back when we made the decision, the team was new, and the whole company and the devops team were in the middle of an AWS migration. It therefore made sense to choose a managed service provided by AWS rather than self-host our own Airflow or Prefect orchestrator, or even use their managed versions. Around September 2021, AWS Step Functions supported only around 10 services rather than the whole AWS ecosystem, but AWS Lambda and EKS support sufficed back then. It was really easy to deploy our ETL scripts as AWS Lambdas, prepare a few Step Functions per ETL workflow and put them in production. We were surprised to find out that AWS extended Step Functions to support practically all AWS services a couple of weeks after we deployed our workflows to production, which allowed us to use more tasks per service: SSM functions to retrieve secrets from the parameter store, RDS functions, Athena queries, and S3 operations, to name but a few. We had implemented those with Lambdas, but it was really straightforward to switch them over. While most of our workflows were Standard AWS Step Functions, we used Express for our product-facing ones, which allows us to orchestrate and easily scale real-time tasks such as extracting and transforming data and sending them to customers.</p>
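<p>For anyone unfamiliar with the service, a workflow is defined as a JSON state machine. As a rough illustration of the shape this takes (every name and ARN below is a made-up placeholder, not our actual setup), a minimal two-Lambda ETL workflow could be created with boto3 like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

import boto3

sfn = boto3.client("stepfunctions")

# A minimal two-state workflow: extract with one Lambda, load with another.
# All names and ARNs are hypothetical placeholders.
definition = {
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:example-extract",
            "Next": "Load",
        },
        "Load": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:example-load",
            "End": True,
        },
    },
}

response = sfn.create_state_machine(
    name="example-etl-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/example-step-functions-role",
    type="STANDARD",  # product-facing workflows would use "EXPRESS" instead
)
</code></pre></div></div>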
<p>But why do I miss Airflow? Well, I have been a heavy Apache Airflow user throughout my career, and despite its cons, I have found a way to use it efficiently in various companies. I never relied on its data connectors and integrations, but always used it as a scheduler of tasks executed in the infrastructure rather than within the Airflow workers. I have debugged so many errors and experimented with so many different celery settings and resource allocations to facilitate the tasks the workers had to execute, which eventually made me rely on just its scheduling functionality and informative UI. AWS Step Functions is exactly the opposite. It’s the de-facto way to execute tasks within the AWS ecosystem, and its transparent serverless nature hides all this complexity and these errors from you. So what do I miss?</p>
<ul>
<li>
<p>A user interface. Perhaps the Airflow UI is not the best, but it fulfills its purpose. You can quickly see the last execution, the last failed DAGs and quick statistics, and you can click on a task and see its logs. You can even give access to someone less technical, who can understand whether something failed or not (with the exception that you might need to explain to them why the execution timestamps are yesterday’s). The Step Functions UI has very limited functionality, only showing you the executions and the input and output of the tasks, and redirecting you to the tasks themselves to figure out their logs (EKS tasks don’t expose logs, so for errors there we have to check in Datadog, which makes things tough to follow sometimes).</p>
</li>
<li>
<p>Scheduling. Well, we use EventBridge rules to schedule workflows, but it’s a separate service doing multiple things, and I would prefer if a part of it were bundled with AWS Step Functions.</p>
</li>
<li>
<p>Backfill and pause. Many times in AWS Step Functions a workflow has failed due to some bug in our code, and we can’t rerun parts of the workflow but need to fully run it from scratch. In Airflow, I would be able to go to the failed task and backfill from that specific date until now. The same applies to pausing tasks and DAGs. I was always able to go and pause a DAG until something else was resolved. Now in AWS Step Functions I need to remember to disable the AWS EventBridge rule, and then remember again to re-enable it (see the sketch after this list).</p>
</li>
<li>
<p>Airflow DAG definition. It was always straightforward for me to create a new Airflow DAG and connect tasks using Python in minutes, while the same workflow in AWS Step Functions might take hours and lots of dry runs. Drag-and-drop of tasks makes it easy in some sense, but anything that has to do with keeping or exchanging state and variables between tasks has a steep learning curve, especially since in some cases it isn’t documented anywhere what a Lambda returns as output, which parameter to pick from the JSON, and so on.</p>
</li>
<li>
<p>Manipulating state and parameters between tasks. This is probably what I miss the most. Besides a couple of intrinsic functions that AWS supports, you cannot define or manipulate any parameters or state between tasks. For example, I cannot get the date of the execution, I cannot manipulate dates across tasks (unless I create a special Lambda for this), and I cannot perform some basic string manipulation on the s3 bucket path I received from one task in another (again, the special Lambda will do this). Airflow’s pythonic way to set up tasks and transfer parameters between them is really efficient IMO (see the DAG sketch after this list).</p>
</li>
</ul>
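<p>To make the last two points concrete, here is roughly the kind of DAG I mean — a minimal sketch with hypothetical task names and paths, showing how Airflow hands you the execution date and passes intermediate values between Python tasks for free:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(ds, **context):
    # "ds" is the execution date Airflow injects automatically -- the kind
    # of state Step Functions needs a dedicated Lambda to compute.
    return f"s3://example-bucket/raw/{ds}/data.parquet"  # auto-pushed to XCom


def load(ti, **context):
    # Pull the path produced by the upstream task; no special glue needed.
    path = ti.xcom_pull(task_ids="extract")
    print(f"Loading {path}")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
</code></pre></div></div>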
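<p>And the pause workaround mentioned above, as a sketch — the rule name is a placeholder for whichever EventBridge rule triggers the workflow:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import boto3

events = boto3.client("events")

# "Pausing" a Step Functions workflow means disabling the EventBridge rule
# that triggers it -- and remembering to enable it again afterwards.
events.disable_rule(Name="example-etl-schedule-rule")
# ... resolve whatever blocked the workflow ...
events.enable_rule(Name="example-etl-schedule-rule")
</code></pre></div></div>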
<p>This is my take on AWS Step Functions and what I miss from Apache Airflow. To be honest, if I had a proper user interface integrating scheduling, logs, alerts and enhanced parameter handling capabilities, I would be extremely happy with them. For now, we are considering whether a migration to Airflow or another orchestrator is worth it.</p>

<h1 id="on-call">My take on on-call</h1>
<p><em>2022-05-22</em></p>

<p>On-call is a huge issue in every tech company and a large discussion topic in tech communities and fora. Most people hate it; others burn out or switch positions, jobs and careers just to avoid being on-call.</p>
<p>I was on-call myself in the past, and eventually I hated it. I remember getting automated calls at 3.00am for at least a whole week because our services were down. I had to wake up, open my laptop, trace the logs, figure out whether it was a software or hardware (infrastructure) issue, restart services or the docker agent on the affected hosts or our load balancer, check that everything was back to normal, close the laptop and get back to sleep after spending anywhere from 15 minutes to 2 hours solving it. Sometimes I even had to patch the code and push some temporary container until the morning, when the rest of the team would properly solve it. Sometimes I couldn’t even sleep again due to the stress. The issues persisted for weeks and months, and alerts came in all day through every means of communication (emails, slack alerts, calls, sms, etc.). We were just a handful of people in the company, the money was good and I had to do it. Eventually the company grew a little bit more, we hired some good devops engineers, most services stabilized a little and I was removed from on-call. But I hated it. Everybody hated it. It was a big discussion topic back then and I complained a lot.</p>
<p>Fast forward some years, and almost 6 months ago my wife was 40+ weeks pregnant. The way pregnancy works in Greece is that each woman is assigned a midwife by her doctor, who will take care of her until birth. This means anything from simple things like answering questions, doing seminars and 1-1 sessions, to actually delivering the baby together with her doctor and supporting her in the following weeks. This is a real 24/7 on-call for weeks until birth. I remember my wife calling her in the middle of the night while completely stressed, asking her simple questions. She was always calm, trying to understand the situation and provide clear feedback and responses. Once, when something really worrying happened, she suggested meeting at the hospital at 4.00am in order to be sure that everything was fine with the baby. That night she woke up and left her house, 1 hour away from the hospital, to meet my wife and do the medical examination together. I cannot count how many times we called her in the last 2 days before the birth. But I remember her calm tone and caring responses on every single phone call. I remember how she guided us through the birth night when the contractions started: her being awake all night answering our calls and text messages until we decided to go to the hospital at 4.00am, where she also joined us, and then staying almost half a day with my wife to successfully deliver the baby. I will remember her forever, and I also know how little she’s getting paid for this, while she takes care of multiple pregnant women in parallel. A continuous on-call.</p>
<p>Retrospectively, I go back to my on-call whining and realize how privileged I am to have complained about this. Yes, I hated it, I lost some sleep, my wife lost some as well, and I also burnt out, but I was clicking some buttons, checking logs while half-asleep, doing some browsing while waiting for deploys and restarts, and getting back to sleep. I don’t even remember talking to anyone at that time other than providing some report the next day. This amazing lady had to answer the phone, calm down a stressed woman, accompany her to the hospital a few times in the middle of the night, be there for her and eventually make her feel safe. People will say that it’s her job, and it’s true, but so is ours. Perhaps our company’s on-call processes are terrible, the expectations unreal, the services unstable and nobody cares, but still, we just click some buttons, and we are getting paid a lot for it compared to others. The only thing I regret is not asking to be excluded as soon as I burnt out. Our health obviously matters. But we don’t deal with other people, their emotions and their anxiety during those alerts. We just monitor some logs and restart some services; we don’t have people relying on us, on our answers and on every single word coming out of our mouths.</p>
<p>Let’s fix tech on-call as much as we can, but let’s also not make too big a deal about it.</p>

<h1 id="rds-partitioned-tables-export">My experience with exporting Postgres RDS partitioned tables to s3</h1>
<p><em>2022-05-15</em></p>

<p>One of my recent tasks was refining a script which exports our RDS databases’ snapshots into s3. The concept was straightforward: we would utilize the daily system snapshots of our RDS instances and export them to one of our s3 buckets as parquet files using the boto3 Python library. The IAM roles and s3 buckets were set up following the <a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ExportSnapshot.html">AWS documentation</a>, including a KMS key to encrypt our exports, and the <a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/rds.html#RDS.Client.start_export_task">boto3 function</a> for starting the export was also straightforward. The API has an optional list field <code class="language-plaintext highlighter-rouge">ExportOnly</code> where we could provide the list of databases, schemas or tables to include in the export. In the first iterations we decided to leave it empty, see what was exported, and experiment accordingly.</p>
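<p>For illustration, starting an export with boto3 looks roughly like this — a minimal sketch with placeholder identifiers and ARNs, with <code class="language-plaintext highlighter-rouge">ExportOnly</code> omitted as in our first iterations:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import boto3

rds = boto3.client("rds")

# All identifiers and ARNs below are hypothetical placeholders.
response = rds.start_export_task(
    ExportTaskIdentifier="example-export-2022-05-15",
    SourceArn="arn:aws:rds:eu-west-1:123456789012:snapshot:rds:example-db-2022-05-15",
    S3BucketName="example-exports-bucket",
    IamRoleArn="arn:aws:iam::123456789012:role/example-export-role",
    KmsKeyId="arn:aws:kms:eu-west-1:123456789012:key/00000000-0000-0000-0000-000000000000",
    # ExportOnly=["database_name"],  # left out here, so everything is exported
)
</code></pre></div></div>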
<h2 id="testing-in-staging">Testing in staging</h2>
<p>We have a few databases hosting different data for the application’s processing needs, and we also have 2 environments (staging and production). We started by exporting our biggest staging RDS instance, which does not include any partitioned tables. Everything went fine; the export required a couple of minutes (we also added a few more functions to our script to <a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/rds.html#RDS.Client.describe_export_tasks">monitor the export status</a>). The data also looked good, with tables exported into different folders including their schema:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.../database_name/schema_name.first_table/first.parquet
.../database_name/schema_name.first_table/second.parquet
...
.../database_name/schema_name.second_table/first.parquet
.../database_name/schema_name.second_table/second.parquet
...
</code></pre></div></div>
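<p>The monitoring functions mentioned above essentially boil down to polling <code class="language-plaintext highlighter-rouge">describe_export_tasks</code>. A minimal sketch — the task identifier is a placeholder, and the exact terminal status literals are worth verifying against the RDS docs:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

import boto3

rds = boto3.client("rds")


def wait_for_export(task_id):
    """Poll an RDS snapshot export until it reaches a terminal state."""
    while True:
        tasks = rds.describe_export_tasks(ExportTaskIdentifier=task_id)
        status = tasks["ExportTasks"][0]["Status"]
        # Terminal states per the RDS documentation (assumed literals).
        if status in ("COMPLETE", "FAILED", "CANCELED"):
            return status
        time.sleep(60)  # exports take minutes to hours, so poll lazily
</code></pre></div></div>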
<p>We then proceeded to test the export on another RDS instance where almost all tables are partitioned based on a tenant identifier. These tables have a naming scheme such as <code class="language-plaintext highlighter-rouge">table_name, table_name_1, table_name_2, ..., table_name_10000, table_name_default</code>. The staging instance had around 14k table partitions deriving from 7 parent tables, but not that much data in total. The export was as quick as the previous one, and the structure was a little weird but expected:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.../database_name/schema_name.first_table_1/first.parquet
.../database_name/schema_name.first_table_1/second.parquet
...
.../database_name/schema_name.first_table_2/first.parquet
.../database_name/schema_name.first_table_2/second.parquet
...
.../database_name/schema_name.first_table_2000/first.parquet
.../database_name/schema_name.first_table_2000/second.parquet
...
.../database_name/schema_name.first_table/first.parquet
.../database_name/schema_name.first_table/second.parquet
...
</code></pre></div></div>
<p>This means that both the parent and partitioned tables were exported, with potentially duplicate data. A quick check using a Glue crawler and an Athena query verified the duplication. <strong>For some reason AWS exports all table data into the parent table folder, as well as each partitioned table’s data into its own folder.</strong> We couldn’t find any documentation explaining why, so we just accepted the way it works and tried to adapt: we added the list of parent tables to the <code class="language-plaintext highlighter-rouge">ExportOnly</code> argument of the <code class="language-plaintext highlighter-rouge">start_export_task</code> function. The next export iteration with this setting was also quick, and the folder structure was now as if the database didn’t have any partitioned tables. The next step was testing with our production instances.</p>
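<p>In practice, restricting the export to the parent tables looks roughly like this — a sketch with placeholder names, assuming the <code class="language-plaintext highlighter-rouge">database.schema.table</code> identifier format from the AWS documentation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import boto3

rds = boto3.client("rds")

# Hypothetical names: export only the parent tables, so each partition's
# data is not written out a second time under its own folder.
parent_tables = [
    "database_name.schema_name.first_table",
    "database_name.schema_name.second_table",
]

response = rds.start_export_task(
    ExportTaskIdentifier="example-export-parents-only",
    SourceArn="arn:aws:rds:eu-west-1:123456789012:snapshot:rds:example-db-2022-05-15",
    S3BucketName="example-exports-bucket",
    IamRoleArn="arn:aws:iam::123456789012:role/example-export-role",
    KmsKeyId="arn:aws:kms:eu-west-1:123456789012:key/00000000-0000-0000-0000-000000000000",
    ExportOnly=parent_tables,
)
</code></pre></div></div>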
<h2 id="testing-in-production">Testing in production</h2>
<p>Testing in production required a different IAM role, s3 bucket and KMS key. The first test, using the RDS instance without partitioned tables, was a total success, requiring around 30 minutes from start to end. More than 1TB of data was exported to s3, and the tables’ folders contained from 1 to a few thousand parquet files. Implementation-wise this was a big relief, as we definitely needed this database export to run on a daily basis for our internal data analytics needs. We would add a few tasks to monitor snapshots, perform exports, monitor exports, and eventually load data into our data warehouse (Snowflake) using AWS Step Functions. Its pricing is another issue which I will tackle in a future post, and we are currently trying to bypass the export solution using <a href="https://www.postgresql.org/docs/current/logical-replication.html">Postgres’ WAL-based</a> change data capture (CDC).</p>
<p>When we tried exporting the highly-partitioned production database, we were surprised by the results. The database had 100k tables (5 parent tables with 20k partitions each, plus 2 unpartitioned tables) with around 1TB of data, and the export took almost 1 day. This was not a viable approach, as we wanted to set up ETL workflows for each of the databases, and having one run for over a day wouldn’t make any sense. Without looking at the data, we were expecting the export to complete in a similar timeframe to the non-partitioned one, since we were just exporting 7 tables. When we actually checked the folder structure, we were surprised to find that each folder had an additional structure like the one below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.../database_name/schema_name.first_table/1/first.parquet
.../database_name/schema_name.first_table/1/second.parquet
...
.../database_name/schema_name.first_table/2/first.parquet
.../database_name/schema_name.first_table/2/second.parquet
...
.../database_name/schema_name.first_table/20000/first.parquet
.../database_name/schema_name.first_table/20000/second.parquet
...
</code></pre></div></div>
<p>Notice the folders with the partition numbers. The export task created around 100k folders with a few thousand small parquet files each (a couple of kb each), which means a few hundred million parquet files. We assumed that the overhead was on the s3 side, also considering <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html">s3 limitations</a>, and that’s why the export was so slow. It wasn’t clear why these partition-like folders, which created the issue in the first place, were generated under the main folders (remember that in staging we didn’t see a similar structure). Unfortunately there isn’t any AWS documentation on the topic. In any case, we abandoned any further experimentation with exporting our highly-partitioned databases, and we aim to implement this using CDC.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Exporting RDS snapshots to s3 is already an expensive operation, and applying it to a highly-partitioned database is completely inefficient, especially for large databases. For another database we got an export time of almost 2 weeks, which made it almost useless. I wish AWS provided some proper documentation on how the whole operation works and how it could be optimized, as well as tackled the case of partitioned tables.</p>

<h1 id="my-first-post">Friday and 13th, my first post</h1>
<p><em>2022-05-13</em></p>

<p>It’s a superstitious day for some; it’s just another day for others, including myself. In any case, it’s my first post, just to test that this works.</p>
<p>I will try to document my experiences from life, data engineering and engineering in general, while also maintaining my CV in these pages.</p>