Create a documentation package for Docker image (apache#14846)
mik-laj authored Mar 21, 2021
1 parent ed872a6 commit a18cbc4
Showing 16 changed files with 783 additions and 875 deletions.
2 changes: 1 addition & 1 deletion docs/apache-airflow/installation.rst
@@ -27,7 +27,7 @@ installation with other tools as well.

.. note::

- Airflow is also distributed as a Docker image (OCI Image). For more information, see: :ref:`docker_image`
+ Airflow is also distributed as a Docker image (OCI Image). Consider using it to guarantee that software will always run the same no matter where it is deployed. For more information, see: :doc:`docker-stack:index`.

Prerequisites
'''''''''''''
847 changes: 1 addition & 846 deletions docs/apache-airflow/production-deployment.rst

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/apache-airflow/start/docker.rst
@@ -195,7 +195,7 @@ To stop and delete containers, delete volumes with database data and download im
Notes
=====

- By default, the Docker Compose file uses the latest Airflow image (`apache/airflow <https://hub.docker.com/r/apache/airflow>`__). If you need, you can :ref:`customize and extend it <docker_image>`.
+ By default, the Docker Compose file uses the latest Airflow image (`apache/airflow <https://hub.docker.com/r/apache/airflow>`__). If you need to, you can :doc:`customize and extend it <docker-stack:index>`.

What's Next?
============
12 changes: 8 additions & 4 deletions docs/build_docs.py
@@ -205,19 +205,23 @@ def main():
     _promote_new_flags()
 
     with with_group("Available packages"):
-        for pkg in available_packages:
+        for pkg in sorted(available_packages):
             print(f" - {pkg}")
 
     if package_filters:
         print("Current package filters: ", package_filters)
     current_packages = process_package_filters(available_packages, package_filters)
+
+    with with_group("Fetching inventories"):
+        # Inventories that could not be retrieved should be retrieved first. This may mean this is a
+        # new package.
+        priority_packages = fetch_inventories()
+        current_packages = sorted(current_packages, key=lambda d: -1 if d in priority_packages else 1)
+
     with with_group(f"Documentation will be built for {len(current_packages)} package(s)"):
         for pkg_no, pkg in enumerate(current_packages, start=1):
             print(f"{pkg_no}. {pkg}")
 
-    with with_group("Fetching inventories"):
-        fetch_inventories()
-
     all_build_errors: Dict[Optional[str], List[DocBuildError]] = {}
     all_spelling_errors: Dict[Optional[str], List[SpellingError]] = {}
     package_build_errors, package_spelling_errors = build_docs_for_packages(
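
With this change, a documentation build of just the new package can be used as a local smoke test - a
sketch, assuming ``build_docs.py`` accepts a ``--package-filter`` flag matching the filters mentioned above:

.. code-block:: bash

    # Build only the new docker-stack documentation package (flag name assumed)
    ./docs/build_docs.py --package-filter docker-stack
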
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -145,7 +145,7 @@
             'providers_packages_ref',
         ]
     )
-elif PACKAGE_NAME == "helm-chart":
+elif PACKAGE_NAME in ("helm-chart", "docker-stack"):
     # No extra extensions
     pass
 else:
380 changes: 380 additions & 0 deletions docs/docker-stack/build.rst

Large diffs are not rendered by default.

201 changes: 201 additions & 0 deletions docs/docker-stack/entrypoint.rst
@@ -0,0 +1,201 @@
.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
Entrypoint
==========

If you are using the default entrypoint of the production image,
there are a few actions that are automatically performed when the container starts.
In some cases, you can pass environment variables to the image to trigger some of that behaviour.

The variables that control the "execution" behaviour start with ``_AIRFLOW`` to distinguish them
from the variables used to build the image, which start with ``AIRFLOW``.
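
For example, execution behaviour is controlled with ``docker run`` environment variables, while build
behaviour is controlled with build arguments. The sketch below assumes that ``AIRFLOW_EXTRAS`` is one of the
supported build arguments and that the build is run from an Airflow source checkout that provides the
``Dockerfile``:

.. code-block:: bash

    # Execution behaviour: upgrade the metadata DB when the container starts
    docker run --env "_AIRFLOW_DB_UPGRADE=true" apache/airflow:master-python3.8 version

    # Build behaviour: variables starting with AIRFLOW are passed as build arguments
    docker build . --build-arg AIRFLOW_EXTRAS="ssh" --tag my-airflow-image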

The image entrypoint works as follows:

* If the user is not "airflow" (i.e. the container is run with an arbitrary user id) and the group id of the user is set to ``0`` (root),
  then the user is dynamically added to ``/etc/passwd`` at entry, using the ``USER_NAME`` variable to define the user name.
  This is done to accommodate the
  `OpenShift Guidelines <https://docs.openshift.com/enterprise/3.0/creating_images/guidelines.html>`_.

* ``AIRFLOW_HOME`` is set by default to ``/opt/airflow/`` - this means that DAGs
  are by default in the ``/opt/airflow/dags`` folder and logs are in the ``/opt/airflow/logs`` folder.

* The working directory is ``/opt/airflow`` by default.

* If the ``AIRFLOW__CORE__SQL_ALCHEMY_CONN`` variable is passed to the container and it is a MySQL or Postgres
  SQLAlchemy connection, then the connection is checked and the script waits until the database is reachable.
  If the ``AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD`` variable is passed to the container, it is evaluated as a
  command to execute and the result of this evaluation is used as ``AIRFLOW__CORE__SQL_ALCHEMY_CONN``. The
  ``_CMD`` variable takes precedence over the ``AIRFLOW__CORE__SQL_ALCHEMY_CONN`` variable (see the example after this list).

* If no ``AIRFLOW__CORE__SQL_ALCHEMY_CONN`` variable is set, then a SQLite database is created in
  ``${AIRFLOW_HOME}/airflow.db`` and a database reset is executed.

* If the first argument equals "bash" - you are dropped into a bash shell, or a bash command is executed
  if you specify extra arguments. For example:

  .. code-block:: bash

      docker run -it apache/airflow:master-python3.6 bash -c "ls -la"
      total 16
      drwxr-xr-x 4 airflow root 4096 Jun 5 18:12 .
      drwxr-xr-x 1 root root 4096 Jun 5 18:12 ..
      drwxr-xr-x 2 airflow root 4096 Jun 5 18:12 dags
      drwxr-xr-x 2 airflow root 4096 Jun 5 18:12 logs

* If the first argument is equal to ``python`` - you are dropped into a Python shell, or Python commands are
  executed if you pass extra parameters. For example:

  .. code-block:: bash

      > docker run -it apache/airflow:master-python3.6 python -c "print('test')"
      test

* If the first argument equals "airflow" - the rest of the arguments are treated as an airflow command
  to execute. Example:

  .. code-block:: bash

      docker run -it apache/airflow:master-python3.6 airflow webserver

* If there are any other arguments - they are simply passed to the "airflow" command:

  .. code-block:: bash

      > docker run -it apache/airflow:master-python3.6 version
      2.1.0.dev0

* If the ``AIRFLOW__CELERY__BROKER_URL`` variable is passed and an airflow command with the
  scheduler, worker or flower sub-command is used, then the script checks the broker connection
  and waits until the Celery broker database is reachable.
  If the ``AIRFLOW__CELERY__BROKER_URL_CMD`` variable is passed to the container, it is evaluated as a
  command to execute and the result of this evaluation is used as ``AIRFLOW__CELERY__BROKER_URL``. The
  ``_CMD`` variable takes precedence over the ``AIRFLOW__CELERY__BROKER_URL`` variable.
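
As an illustration of the ``_CMD`` variant, the connection string can be produced by a command - for
example read from a file mounted into the container (the secret path below is only an example):

.. code-block:: bash

    # The output of the command becomes AIRFLOW__CORE__SQL_ALCHEMY_CONN at container start
    docker run -it \
        -v "$(pwd)/sql_alchemy_conn.txt:/run/secrets/sql_alchemy_conn:ro" \
        --env "AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD=cat /run/secrets/sql_alchemy_conn" \
        apache/airflow:master-python3.6 airflow webserver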

Creating system user
--------------------

The Airflow image is OpenShift compatible, which means that you can start it with a random user ID and the group id ``0`` (root).
Airflow will automatically create such a user and make its home directory point to ``/home/airflow``.
You can read more about it in the "Support arbitrary user ids" chapter in the
`Openshift best practices <https://docs.openshift.com/container-platform/4.1/openshift_images/create-images.html#images-create-guide-openshift_create-images>`_.
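
A quick way to observe this behaviour is to start the image with an arbitrary user id and group ``0``
(the uid below is arbitrary):

.. code-block:: bash

    # The entrypoint adds the unknown uid to /etc/passwd and sets HOME to /home/airflow
    docker run -it --user "12345:0" apache/airflow:master-python3.6 bash -c 'id && echo $HOME'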

Waits for Airflow DB connection
-------------------------------

In case Postgres or MySQL DB is used, the entrypoint will wait until the airflow DB connection becomes
available. This always happens when you use the default entrypoint.

The script detects the backend type depending on the URL schema and assigns default port numbers if they are
not specified in the URL. Then it loops until a connection to the specified host/port can be established.
It tries ``CONNECTION_CHECK_MAX_COUNT`` times and sleeps ``CONNECTION_CHECK_SLEEP_TIME`` between checks.
To disable the check, set ``CONNECTION_CHECK_MAX_COUNT=0``.

Supported schemes:

* ``postgres://`` - default port 5432
* ``mysql://`` - default port 3306
* ``sqlite://``

In case of SQLite backend, there is no connection to establish and waiting is skipped.
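
The check parameters can be tuned, or the check disabled, through environment variables. For example
(the connection string and values below are illustrative only):

.. code-block:: bash

    # Wait up to 50 x 5 seconds for the metadata database to become reachable
    docker run -it \
        --env "AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres://airflow:airflow@postgres:5432/airflow" \
        --env "CONNECTION_CHECK_MAX_COUNT=50" \
        --env "CONNECTION_CHECK_SLEEP_TIME=5" \
        apache/airflow:master-python3.6 airflow webserver

    # Or skip the check entirely
    docker run -it --env "CONNECTION_CHECK_MAX_COUNT=0" apache/airflow:master-python3.6 version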

Upgrading Airflow DB
--------------------

If you set the ``_AIRFLOW_DB_UPGRADE`` variable to a non-empty value, the entrypoint will run
the ``airflow db upgrade`` command right after verifying the connection. You can also use this
when you are running airflow with the internal SQLite database (default) to upgrade the db and create
the admin user at entrypoint, so that you can start the webserver immediately. Note - using SQLite is
intended only for testing purposes; never use SQLite in production as it has severe limitations when it
comes to concurrency.
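
A minimal sketch of triggering just the upgrade against the internal SQLite database (the ``version``
argument is only there to give the container a short-lived command to run afterwards):

.. code-block:: bash

    # The entrypoint upgrades the DB, then runs "airflow version"
    docker run -it --env "_AIRFLOW_DB_UPGRADE=true" apache/airflow:master-python3.8 version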

Creating admin user
-------------------

The entrypoint can also create a webserver user automatically when you enter it. You need to set
``_AIRFLOW_WWW_USER_CREATE`` to a non-empty value in order to do that. This is not intended for
production; it is only useful if you would like to run a quick test with the production image.
You need to pass at least a password to create such a user, via ``_AIRFLOW_WWW_USER_PASSWORD`` or
``_AIRFLOW_WWW_USER_PASSWORD_CMD``. Similarly to the other ``*_CMD`` variables, the content of
the ``*_CMD`` variable is evaluated as a shell command and its output is set as the password.

User creation will fail if none of the ``PASSWORD`` variables are set - there is no default
password, for security reasons.

+-----------+--------------------------+----------------------------------------------------------------------+
| Parameter | Default | Environment variable |
+===========+==========================+======================================================================+
| username | admin | ``_AIRFLOW_WWW_USER_USERNAME`` |
+-----------+--------------------------+----------------------------------------------------------------------+
| password | | ``_AIRFLOW_WWW_USER_PASSWORD_CMD`` or ``_AIRFLOW_WWW_USER_PASSWORD`` |
+-----------+--------------------------+----------------------------------------------------------------------+
| firstname | Airflow | ``_AIRFLOW_WWW_USER_FIRSTNAME`` |
+-----------+--------------------------+----------------------------------------------------------------------+
| lastname | Admin | ``_AIRFLOW_WWW_USER_LASTNAME`` |
+-----------+--------------------------+----------------------------------------------------------------------+
| email | [email protected] | ``_AIRFLOW_WWW_USER_EMAIL`` |
+-----------+--------------------------+----------------------------------------------------------------------+
| role | Admin | ``_AIRFLOW_WWW_USER_ROLE`` |
+-----------+--------------------------+----------------------------------------------------------------------+

In case the password is specified, the entrypoint will attempt to create the user, but it will not
fail if the attempt fails (this accounts for the case where the user has already been created).

You can, for example, start the webserver in the production image, initializing the internal SQLite
database and creating an ``admin/admin`` Admin user, with the following command:

.. code-block:: bash

    docker run -it -p 8080:8080 \
      --env "_AIRFLOW_DB_UPGRADE=true" \
      --env "_AIRFLOW_WWW_USER_CREATE=true" \
      --env "_AIRFLOW_WWW_USER_PASSWORD=admin" \
      apache/airflow:master-python3.8 webserver

.. code-block:: bash

    docker run -it -p 8080:8080 \
      --env "_AIRFLOW_DB_UPGRADE=true" \
      --env "_AIRFLOW_WWW_USER_CREATE=true" \
      --env "_AIRFLOW_WWW_USER_PASSWORD_CMD=echo admin" \
      apache/airflow:master-python3.8 webserver

The commands above perform initialization of the SQLite database, create an admin user with the password
"admin" and the Admin role. They also forward local port ``8080`` to the webserver port and finally start the webserver.
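
The defaults from the table above can be overridden in the same way. For instance, the sketch below
creates a user with a different username and the ``User`` role (all values are illustrative):

.. code-block:: bash

    docker run -it -p 8080:8080 \
      --env "_AIRFLOW_DB_UPGRADE=true" \
      --env "_AIRFLOW_WWW_USER_CREATE=true" \
      --env "_AIRFLOW_WWW_USER_USERNAME=jdoe" \
      --env "_AIRFLOW_WWW_USER_PASSWORD=only-for-testing" \
      --env "_AIRFLOW_WWW_USER_ROLE=User" \
      apache/airflow:master-python3.8 webserver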

Waits for celery broker connection
----------------------------------

In case Postgres or MySQL DB is used, and one of the ``scheduler``, ``celery``, ``worker``, or ``flower``
commands is used, the entrypoint will wait until the celery broker DB connection is available.

The script detects the backend type depending on the URL schema and assigns default port numbers if they are
not specified in the URL. Then it loops until a connection to the specified host/port can be established.
It tries ``CONNECTION_CHECK_MAX_COUNT`` times and sleeps ``CONNECTION_CHECK_SLEEP_TIME`` between checks.
To disable the check, set ``CONNECTION_CHECK_MAX_COUNT=0``.

Supported schemes:

* ``amqp(s)://`` (rabbitmq) - default port 5672
* ``redis://`` - default port 6379
* ``postgres://`` - default port 5432
* ``mysql://`` - default port 3306
* ``sqlite://``

In case of SQLite backend, there is no connection to establish and waiting is skipped.
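
For example, a worker container can be pointed at a Redis broker like this (the host names and the database
number are illustrative):

.. code-block:: bash

    # The entrypoint waits for redis:6379 before starting the Celery worker
    docker run -it \
        --env "AIRFLOW__CORE__EXECUTOR=CeleryExecutor" \
        --env "AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres://airflow:airflow@postgres:5432/airflow" \
        --env "AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0" \
        apache/airflow:master-python3.6 celery worker
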
Binary file added docs/docker-stack/img/docker-logo.png
54 changes: 54 additions & 0 deletions docs/docker-stack/index.rst
@@ -0,0 +1,54 @@
.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
.. image:: /img/docker-logo.png
    :width: 100

Docker Image for Apache Airflow
===============================

.. toctree::
    :hidden:

    Home <self>
    build
    entrypoint
    recipes

.. toctree::
    :hidden:
    :caption: References

    build-arg-ref

For the ease of deployment in production, the community releases a production-ready reference container
image.

The docker image provided (as a convenience binary package) in the
`apache/airflow DockerHub <https://hub.docker.com/r/apache/airflow>`_ is a bare image
that has a few external dependencies and extras installed.

The Apache Airflow image provided as a convenience package is optimized for size, so
it provides just a bare minimal set of the extras and dependencies installed, and in most cases
you will want to either extend or customize the image. You can see all possible extras in
:doc:`extra-packages-ref`. The set of extras used in the Airflow production image is available in the
`Dockerfile <https://github.com/apache/airflow/blob/2c6c7fdb2308de98e142618836bdf414df9768c8/Dockerfile#L39>`_.
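
As a quick illustration of extending the image, a sketch along these lines should work (the extra PyPI
package is just an example):

.. code-block:: bash

    # Write a minimal Dockerfile that adds one Python package on top of the reference image
    cat > Dockerfile <<'EOF'
    FROM apache/airflow:2.0.1
    RUN pip install --no-cache-dir --user lxml
    EOF

    docker build . --tag my-airflow-image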

The production images are built in DockerHub from released versions and release candidates. There
are also images published from branches, but they are used mainly for development and testing purposes.
See `Airflow Git Branching <https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#airflow-git-branches>`_
for details.
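
To try the reference image locally, it can simply be pulled and run (the tag shown is an example of a
released version):

.. code-block:: bash

    docker pull apache/airflow:2.0.1
    docker run --rm apache/airflow:2.0.1 version
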
70 changes: 70 additions & 0 deletions docs/docker-stack/recipes.rst
@@ -0,0 +1,70 @@
.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
Recipes
=======

Users sometimes share interesting ways of using the Docker images. We encourage users to contribute these
recipes to the documentation by submitting a pull request, in case they prove useful to other members
of the community. The sections below capture this knowledge.

Google Cloud SDK installation
-----------------------------

Some operators, such as :class:`~airflow.providers.google.cloud.operators.kubernetes_engine.GKEStartPodOperator`,
:class:`~airflow.providers.google.cloud.operators.dataflow.DataflowStartSqlJobOperator`, require
the installation of `Google Cloud SDK <https://cloud.google.com/sdk>`__ (includes ``gcloud``).
You can also run these commands with BashOperator.

Create a new Dockerfile like the one shown below.

.. exampleinclude:: /docker-images-recipes/gcloud.Dockerfile
    :language: dockerfile

Then build a new image.

.. code-block:: bash

    docker build . \
      --build-arg BASE_AIRFLOW_IMAGE="apache/airflow:2.0.1" \
      -t my-airflow-image
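
Once built, a quick smoke test can confirm that the SDK is available in the image (assuming the recipe
puts ``gcloud`` on the ``PATH``):

.. code-block:: bash

    docker run --rm my-airflow-image bash -c "gcloud version"
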
Apache Hadoop Stack installation
--------------------------------

Airflow is often used to run tasks on a Hadoop cluster. This requires the Java Runtime Environment (JRE) to run.
Below are the steps to install tools that are frequently used in the Hadoop world:

- Java Runtime Environment (JRE)
- Apache Hadoop
- Apache Hive
- `Cloud Storage connector for Apache Hadoop <https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage>`__


Create a new Dockerfile like the one shown below.

.. exampleinclude:: /docker-images-recipes/hadoop.Dockerfile
    :language: dockerfile

Then build a new image.

.. code-block:: bash

    docker build . \
      --build-arg BASE_AIRFLOW_IMAGE="apache/airflow:2.0.1" \
      -t my-airflow-image
