Apache Airflow is an orchestration platform that enables the development, scheduling, and monitoring of workflows. At Shopify, we've been running Airflow in production for over two years for a variety of workflows, including data extractions, machine learning model training, Apache Iceberg table maintenance, and dbt-powered data modeling. At the time of writing, we are running Airflow 2.2 on Kubernetes, using the Celery executor and MySQL 8.

Shopify's usage of Airflow has scaled dramatically over the past two years. In our largest environment, we run over 10,000 DAGs representing a large variety of workloads. This environment averages over 400 tasks running at any given moment and over 150,000 runs executed per day. As adoption increases within Shopify, the load on our Airflow deployments will only increase. As a result of this rapid growth, we have encountered a few challenges, including slow file access, insufficient control over DAG (directed acyclic graph) capabilities, irregular levels of traffic, and resource contention between workloads, to name a few.

Below we'll share some of the lessons we learned and solutions we built in order to run Airflow at scale.

File Access Can Be Slow When Using Cloud Storage

Fast file access is critical to the performance and integrity of an Airflow environment. A well-defined strategy for file access ensures that the scheduler can process DAG files quickly and keep your jobs up to date.

Airflow keeps its internal representation of its workflows up to date by repeatedly scanning and reparsing all the files in the configured DAG directory. These files must be scanned often in order to maintain consistency between the on-disk source of truth for each workload and its in-database representation. This means the contents of the DAG directory must be consistent across all schedulers and workers in a single environment (Airflow suggests a few ways of achieving this).

At Shopify, we use Google Cloud Storage (GCS) for the storage of DAGs. Our initial deployment of Airflow utilized GCSFuse to maintain a consistent set of files across all workers and schedulers in a single Airflow environment. However, at scale this proved to be a bottleneck on performance, as every file read incurred a request to GCS. The volume of reads was especially high because every pod in the environment had to mount the bucket separately.

After some experimentation, we found that we could vastly improve performance across our Airflow environments by running an NFS (network file system) server within the Kubernetes cluster. We then mounted this NFS server as a read-write-many volume into the worker and scheduler pods. We wrote a custom script which synchronizes the state of this volume with GCS, so that users only have to interact with GCS for uploading or managing DAGs. This script runs in a separate pod within the same cluster.

Additionally, we can use Google Cloud Platform's IAM (identity and access management) capabilities to control which users are able to upload files to a given environment. This also allows us to conditionally sync only a subset of the DAGs from a given bucket, or even sync DAGs from multiple buckets into a single file system based on the environment's configuration (more on this later).

Altogether, this provides us with fast file access as well as a stable, external source of truth, while maintaining our ability to quickly add or modify DAG files within Airflow.
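The custom GCS-to-NFS synchronization script itself isn't shown in the post. A minimal sketch of the idea, assuming `gsutil` is available in the sync pod's image and using placeholder bucket, mount-path, and interval values, might look like:

```python
import subprocess
import time

# Hypothetical values: the real bucket names, mount path, and sync
# interval are environment-specific and not given in the post.
GCS_BUCKET = "gs://example-airflow-dags"
NFS_MOUNT = "/mnt/airflow-dags"
SYNC_INTERVAL_SECONDS = 60


def build_sync_command(bucket: str, target_dir: str) -> list[str]:
    """Build a `gsutil rsync` invocation.

    `-r -d` mirrors the bucket into the target directory, deleting
    local files that were removed upstream; `-m` parallelizes transfers.
    """
    return ["gsutil", "-m", "rsync", "-r", "-d", bucket, target_dir]


def sync_forever(bucket: str = GCS_BUCKET, target_dir: str = NFS_MOUNT) -> None:
    """Re-run the mirror step on a fixed interval, as a sidecar pod would."""
    while True:
        subprocess.run(build_sync_command(bucket, target_dir), check=True)
        time.sleep(SYNC_INTERVAL_SECONDS)
```

The real script also has to handle the conditional, per-environment filtering of DAGs described above; that logic is omitted here.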