Update Airflow AWS MWAA deployment docs (#3860)
Signed-off-by: Dmitry Sorokin <[email protected]>
Co-authored-by: Ankita Katiyar <[email protected]>
Co-authored-by: Merel Theisen <[email protected]>
3 people authored May 20, 2024
1 parent 7391f4f commit 56961af
Showing 2 changed files with 88 additions and 11 deletions.
1 change: 1 addition & 0 deletions RELEASE.md

## Documentation changes
* Improved documentation for custom starters
* Added a new section on deploying a Kedro project on AWS Airflow MWAA

## Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
98 changes: 87 additions & 11 deletions docs/source/deployment/airflow.md

Apache Airflow is a popular open-source workflow management platform. It is a suitable engine to orchestrate and execute a pipeline authored with Kedro, because workflows in Airflow are modelled and organised as [DAGs](https://en.wikipedia.org/wiki/Directed_acyclic_graph).

## Introduction and strategy

The general strategy to deploy a Kedro pipeline on Apache Airflow is to run every Kedro node as an [Airflow task](https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html) while the whole pipeline is converted to an [Airflow DAG](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html). This approach mirrors the principles of [running Kedro in a distributed environment](distributed.md).

Each node will be executed within a new Kedro session, which implies that `MemoryDataset`s cannot serve as storage for the intermediate results of nodes. Instead, all datasets must be registered in the [`DataCatalog`](https://docs.kedro.org/en/stable/data/index.html) and stored in persistent storage. This approach enables nodes to access the results from preceding nodes.
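
To make the strategy concrete, below is a simplified, illustrative sketch of the kind of DAG that `kedro airflow create` (used later in this guide) produces: each Kedro node is wrapped in an operator that opens its own `KedroSession` and runs only that node, and the task dependencies mirror the pipeline's node dependencies. The operator, task, and node names here are illustrative rather than the exact generated code.

```python
# Simplified, illustrative sketch of a generated DAG: one Airflow task per Kedro node.
from datetime import datetime

from airflow.models import DAG, BaseOperator
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


class KedroOperator(BaseOperator):
    def __init__(self, package_name, pipeline_name, node_name, project_path, env, **kwargs):
        super().__init__(**kwargs)
        self.package_name = package_name
        self.pipeline_name = pipeline_name
        self.node_name = node_name
        self.project_path = project_path
        self.env = env

    def execute(self, context):
        # Each task starts a fresh KedroSession and runs exactly one node, which is why
        # intermediate results must live in the DataCatalog rather than in memory.
        configure_project(self.package_name)
        with KedroSession.create(project_path=self.project_path, env=self.env) as session:
            session.run(self.pipeline_name, node_names=[self.node_name])


kedro_args = dict(
    package_name="new_kedro_project",
    pipeline_name="__default__",
    project_path="/usr/local/airflow",
    env="airflow",
)

with DAG(dag_id="new-kedro-project", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    split_data = KedroOperator(task_id="split-data-node", node_name="split_data_node", **kedro_args)
    train_model = KedroOperator(task_id="train-model-node", node_name="train_model_node", **kedro_args)
    split_data >> train_model  # edges follow the Kedro pipeline's data dependencies
```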

This guide provides instructions on running a Kedro pipeline on different Airflow platforms. Use the links below to jump to the section that explains how to run a Kedro pipeline on:

- [Apache Airflow with Astronomer](#how-to-run-a-kedro-pipeline-on-apache-airflow-with-astronomer)
- [Amazon AWS Managed Workflows for Apache Airflow (MWAA)](#how-to-run-a-kedro-pipeline-on-amazon-aws-managed-workflows-for-apache-airflow-mwaa)
- [Apache Airflow using a Kubernetes cluster](#how-to-run-a-kedro-pipeline-on-apache-airflow-using-a-kubernetes-cluster)

## How to run a Kedro pipeline on Apache Airflow with Astronomer

The following tutorial shows how to deploy an example [Spaceflights Kedro project](https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html) on [Apache Airflow](https://airflow.apache.org/) with [Astro CLI](https://docs.astronomer.io/astro/cli/overview), a command-line tool created by [Astronomer](https://www.astronomer.io/) that streamlines the creation of local Airflow projects. You will deploy it locally first, and then transition to Astro Cloud.

[Astronomer](https://docs.astronomer.io/astro/install-cli) is a managed Airflow platform which allows users to spin up and run an Airflow cluster in production. Additionally, it also provides a set of tools to help users get started with Airflow locally in the easiest way possible.

### Prerequisites

3. Open `conf/airflow/catalog.yml` to see the list of datasets used in the project. Note that additional intermediate datasets (`X_train`, `X_test`, `y_train`, `y_test`) are stored only in memory. You can locate these in the pipeline description under `/src/new_kedro_project/pipelines/data_science/pipeline.py`. To ensure these datasets are preserved and accessible across different tasks in Airflow, we need to include them in our `DataCatalog`. Instead of repeating similar code for each dataset, you can use [Dataset Factories](https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html), a special syntax that allows defining a catch-all pattern to overwrite the default `MemoryDataset` creation. Add this code to the end of the file:
```yaml
"{base_dataset}":
  type: pandas.CSVDataset
  filepath: data/02_intermediate/{base_dataset}.csv
```
```shell
kedro airflow create --target-dir=dags/ --env=airflow
```
This step should produce a `.py` file called `new_kedro_project_airflow_dag.py` located at `dags/`.
### Deployment process with Astro CLI
```shell
cp -r new-kedro-project/conf kedro-airflow-spaceflights/conf
mkdir -p kedro-airflow-spaceflights/dist/
cp new-kedro-project/dist/new_kedro_project-0.1-py3-none-any.whl kedro-airflow-spaceflights/dist/
cp new-kedro-project/dags/new_kedro_project_airflow_dag.py kedro-airflow-spaceflights/dags/
```

Feel free to completely copy `new-kedro-project` into `kedro-airflow-spaceflights` if your project requires frequent updates, DAG recreation, and repackaging. This approach allows you to work with the Kedro and Astro projects in a single folder, eliminating the need to copy Kedro files for each development iteration. However, be aware that both projects will share common files such as `requirements.txt`, `README.md`, and `.gitignore`.

4. Add a few lines to the `Dockerfile` located in the `kedro-airflow-spaceflights` folder to set the `KEDRO_LOGGING_CONFIG` environment variable to `conf/logging.yml`, which enables custom logging in Kedro (from Kedro 0.19.6 onwards this step is unnecessary, because Kedro uses the `conf/logging.yml` file by default), and to install the `.whl` file of our prepared Kedro project into the Airflow container:

```Dockerfile
# The following ENV line is not needed from Kedro 0.19.6 onwards
ENV KEDRO_LOGGING_CONFIG="conf/logging.yml"
RUN pip install --user dist/new_kedro_project-0.1-py3-none-any.whl
```

![](../meta/images/astronomer_cloud_deployment.png)

## How to run a Kedro pipeline on Amazon AWS Managed Workflows for Apache Airflow (MWAA)

### Kedro project preparation
MWAA, or Managed Workflows for Apache Airflow, is an AWS service that makes it easier to set up, operate, and scale Apache Airflow in the cloud. Deploying a Kedro pipeline to MWAA is similar to deploying with Astronomer, but there are some key differences: you need to store your project data in an AWS S3 bucket and make the necessary changes to your `DataCatalog`. Additionally, you must configure how you upload your Kedro configuration, install your Kedro package, and set up the necessary environment variables.
1. Complete steps 1-4 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section.
2. Your project's data should not reside in the working directory of the Airflow container. Instead, [create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) and [upload your data folder from the new-kedro-project folder to your S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html). A scripted alternative to the console upload is sketched at the end of this section.
3. Modify the `DataCatalog` to reference data in your S3 bucket by updating the `filepath` and adding a `credentials` line for each dataset in `new-kedro-project/conf/airflow/catalog.yml`. Add the S3 prefix to the filepath as shown below:
```yaml
companies:
  type: pandas.CSVDataset
  filepath: s3://<your_S3_bucket>/data/01_raw/companies.csv
  credentials: dev_s3
```
4. [Set up AWS credentials](https://docs.aws.amazon.com/keyspaces/latest/devguide/access.credentials.html) to provide read and write access to your S3 bucket. Update `new-kedro-project/conf/local/credentials.yml` with your `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, and copy it to the `new-kedro-project/conf/airflow/` folder:
```yaml
dev_s3:
  client_kwargs:
    aws_access_key_id: *********************
    aws_secret_access_key: ******************************************
```
5. Add `s3fs` to your project’s `requirements.txt` in `new-kedro-project` to facilitate communication with AWS S3. Some libraries could cause dependency conflicts in the Airflow environment, so make sure to minimise the requirements list and avoid using `kedro-viz` and `pytest`.
```shell
s3fs
```
6. Follow steps 5-6 from the [Create, prepare and package example Kedro project](#create-prepare-and-package-example-kedro-project) section to package your Kedro project and generate an Airflow DAG.
7. Update the DAG file `new_kedro_project_airflow_dag.py` located in the `dags/` folder by adding `conf_source="plugins/conf-new_kedro_project.tar.gz"` to the arguments of `KedroSession.create()` in the Kedro operator execution function. This change is necessary because your Kedro configuration archive will be stored in the `plugins/` folder, not the root directory:
```python
def execute(self, context):
    configure_project(self.package_name)
    with KedroSession.create(project_path=self.project_path,
                             env=self.env, conf_source="plugins/conf-new_kedro_project.tar.gz") as session:
        session.run(self.pipeline_name, node_names=[self.node_name])
```
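
If you would rather script the data upload from step 2 than click through the S3 console, the snippet below is a minimal sketch using `boto3`. It is not part of the official MWAA workflow; the bucket name, folder layout, and locally configured AWS credentials are assumptions you should adapt to your setup.

```python
# Illustrative alternative to the console upload in step 2: copy the local data/
# folder to S3 with boto3. Assumes the bucket already exists and AWS credentials
# are configured locally (for example via `aws configure`).
from pathlib import Path

import boto3

BUCKET = "your_S3_bucket"  # placeholder, replace with your bucket name
DATA_DIR = Path("new-kedro-project") / "data"

s3 = boto3.client("s3")
for path in DATA_DIR.rglob("*"):
    if path.is_file():
        # Keep the data/... prefix so catalog paths like s3://<bucket>/data/01_raw/... resolve
        key = path.relative_to(DATA_DIR.parent).as_posix()
        s3.upload_file(str(path), BUCKET, key)
        print(f"uploaded s3://{BUCKET}/{key}")
```
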
### Deployment on MWAA
1. Archive your three files: `new_kedro_project-0.1-py3-none-any.whl` and `conf-new_kedro_project.tar.gz` located in `new-kedro-project/dist`, and `logging.yml` located in `new-kedro-project/conf/` into a file called `plugins.zip` and upload it to `s3://your_S3_bucket`.
```shell
zip -j plugins.zip dist/new_kedro_project-0.1-py3-none-any.whl dist/conf-new_kedro_project.tar.gz conf/logging.yml
```
This archive will later be unpacked into the `plugins/` folder in the working directory of the Airflow container.
2. Create a new `requirements.txt` file, add the path where your Kedro project will be unpacked in the Airflow container, and upload `requirements.txt` to `s3://your_S3_bucket`:
```shell
./plugins/new_kedro_project-0.1-py3-none-any.whl
```
Libraries from `requirements.txt` will be installed during container initialisation.
3. Upload `new_kedro_project_airflow_dag.py` from the `new-kedro-project/dags` folder to `s3://your_S3_bucket/dags`.
4. Create a `startup.sh` file for container startup commands and use it to set an environment variable for custom Kedro logging:
```shell
export KEDRO_LOGGING_CONFIG="plugins/logging.yml"
```
5. Set up a new [AWS MWAA environment](https://docs.aws.amazon.com/mwaa/latest/userguide/create-environment.html) using the following settings:
```shell
S3 Bucket:
s3://your_S3_bucket
DAGs folder:
s3://your_S3_bucket/dags
Plugins file - optional:
s3://your_S3_bucket/plugins.zip
Requirements file - optional:
s3://your_S3_bucket/requirements.txt
Startup script file - optional:
s3://your_S3_bucket/startup.sh
```
On the next page, set the `Public network (Internet accessible)` option in the `Web server access` section if you want to access your Airflow UI from the internet. Continue with the default options on the subsequent pages.
6. Once the environment is created, use the `Open Airflow UI` button to access the standard Airflow interface, where you can manage your DAG.
## How to run a Kedro pipeline on Apache Airflow using a Kubernetes cluster
The `kedro-airflow-k8s` plugin from GetInData | Part of Xebia enables you to run a Kedro pipeline on Airflow with a Kubernetes cluster. The plugin can be used together with `kedro-docker` to prepare a Docker image for pipeline execution. At present, the plugin is available for versions of Kedro < 0.18 only.