Commit 640bef3

Add changes in response to reviewer #2: clarify the tradeoff between execution time and development time. Remove the word parity and be more explicit in comparisons of execution time.

Adam Richie-Halford committed Jun 20, 2018
1 parent 845e7c1 commit 640bef3
Showing 6 changed files with 127 additions and 75 deletions.
140 changes: 85 additions & 55 deletions papers/adam_richie-halford/adam_richie-halford.rst
@@ -98,7 +98,13 @@ resources.
Here, we introduce a new Python library: Cloudknot
:cite:`cloudknot-docs` :cite:`cloudknot-repo`, that launches Python
functions as jobs on the AWS Batch service, thereby lifting these
limitations. Cloudknot supports Python versions 2.7 and 3.5+. Rather
than introducing its own set of terms and abstractions, Cloudknot
mimics the :code:`Executor` objects of Python's :code:`concurrent.futures`
module. Users of Cloudknot need to familiarize themselves with only
one new object: the :code:`Knot`. While some of its functionality will
initially be new to users (e.g., the way that resources on AWS are
managed), its :code:`map` method should be familiar to most Python users.
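
To illustrate the pattern that :code:`Knot.map` mirrors, here is a local
:code:`concurrent.futures` sketch; the function name :code:`simulate` is
illustrative, and with Cloudknot the analogous call would dispatch each
invocation as an AWS Batch job rather than a local thread:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(temperature):
    """Stand-in for a user-defined function (UDF)."""
    return temperature ** 2

# The Executor.map pattern that Knot.map mirrors; with Cloudknot, the
# same call would run each invocation as a job in the cloud.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(simulate, [0.0, 1.0, 2.0]))
print(results)  # [0.0, 1.0, 4.0]
```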

We propose that software designed to aid computational and data
scientists should be concerned with minimizing the time from
@@ -183,8 +189,11 @@ total, the resulting command line program downloads input data from S3,
executes the UDF, and sends output back to S3. :code:`Knot` then packages
the CLI, along with its dependencies, into a Docker container. The
container is uploaded into the Amazon Elastic Container Registry (ECR).
Cloudknot's use of Docker allows it to handle non-trivial software
and data dependencies (see examples below). This is because Docker
provides a consistent and isolated environment, allowing complete
control over the software dependencies of a particular application, and
near-immediate deployment of these dependencies :cite:`Boettiger14`.

Separately, :code:`Knot` uses an AWS CloudFormation template to create
the AWS resources required by AWS Batch:
@@ -217,13 +226,21 @@ the AWS resources required by AWS Batch:
minimum, desired, and maximum number of virtual CPUs and let AWS Batch
select and manage the EC2 instances.

:code:`Knot` uses job definition and compute environment defaults
conservative enough to run most simple jobs, with the goal of minimizing
errors due to insufficient resources. The casual user may never need to
concern themselves with selecting an instance type or specifying an AMI.
Users who want to minimize costs by specifying the minimum sufficient
resources or users who need additional resources for intensive jobs can
control their jobs' memory requirements, instance types, or AMIs. This
might be necessary if the jobs require special hardware (e.g. GPGPU
computing) or if the user wants more fine-grained control over which
resources are launched.

One of the most complex aspects of AWS is its permissions model. Here,
we assume that the user has the permissions needed to run AWS Batch
in the console. We also provide users with the minimal necessary
permissions in the documentation.

Finally, :code:`Knot` exposes AWS resource tags to the user so that
they can assign metadata to each created resource. This facilitates
@@ -369,11 +386,13 @@ summary of the status of all jobs submitted with this :code:`Knot` using
Examples
--------

In this section, we present a few use cases of Cloudknot, including
real-life uses of the software in data analysis. We start with
examples that have minimal software and data dependencies and increase
the complexity by first adding data dependencies and subsequently
complex software and resource dependencies. These and other examples
are available in Jupyter Notebooks in the Cloudknot repository
:cite:`cloudknot-examples`.


Solving differential equations
@@ -397,30 +416,40 @@ In this unrealistic example, we wish to parallelize execution both over a range
of different boundary conditions and over a range of grid sizes.
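
The paper does not reproduce its solver code; the following is a minimal
Jacobi-iteration sketch of the kind of UDF one might map over boundary
temperatures. The function name, grid size, and placement of the
temperature constraint on the top edge are illustrative assumptions:

```python
import numpy as np

def solve_heat_2d(edge_temp, n=10, iterations=2000):
    """Steady-state 2D heat equation on an n x n grid via Jacobi iteration.

    One edge is held at edge_temp; the other edges are held at 0.
    """
    grid = np.zeros((n, n))
    grid[0, :] = edge_temp  # temperature constraint on one edge
    for _ in range(iterations):
        # Each interior point relaxes toward the mean of its neighbors.
        grid[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1]
                                   + grid[1:-1, :-2] + grid[1:-1, 2:])
    return grid

solution = solve_heat_2d(100.0)
```

A function like this, mapped over an array of `edge_temp` values with
:code:`Knot.map`, is what the scaling experiment below parallelizes.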

First, we hold the grid size constant at 10 x 10 and parallelize over
different temperature constraints on one edge of the simulation grid.
We investigate the scaling of job execution time as a function of the
size of the argument array. In Figure :ref:`fig.nargsscaling` we show
the execution time as a function of :math:`n_\mathrm{args}`, the length
of the argument array (with a :math:`\log_2` scale on the :math:`x`-axis
and a :math:`\log_{10}` scale on the :math:`y`-axis). We tested scaling
using Cloudknot's default parameters and also using custom parameters
[#]_. Regardless of the :code:`Knot` parameters, Pywren outperformed
Cloudknot at all argument array sizes. Indeed, Pywren appears to achieve
:math:`\mathcal{O}(1)` scaling between :math:`4 \le n_\mathrm{args}
\le 512`, revealing AWS Lambda's capabilities for massively parallel
computation. For :math:`n_\mathrm{args} > 512`, Pywren appears to
conform to :math:`\mathcal{O}(n)` scaling with a constant of roughly
0.25. By contrast, Cloudknot exhibits noisy :math:`\mathcal{O}(n)`
scaling for :math:`n_\mathrm{args} \gtrapprox 32`, with a constant that
is comparable to Pywren's scaling constant for :math:`n_\mathrm{args}
> 512`. Precise determination of these scaling constants would require
more data for a larger range of argument sizes.
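
Scaling constants like these can be estimated by a least-squares fit in
log space. A sketch with synthetic timings (the data below are
illustrative, not the paper's measurements):

```python
import numpy as np

# Synthetic timings following t = c * n (O(n) scaling with c = 0.25),
# standing in for measured execution times; not the paper's data.
n_args = np.array([64, 128, 256, 512, 1024])
times = 0.25 * n_args

# Fit log2(t) = p * log2(n) + log2(c); the slope p is the scaling
# exponent and 2**intercept recovers the constant c.
slope, intercept = np.polyfit(np.log2(n_args), np.log2(times), 1)
exponent = slope             # ~1.0 for O(n) scaling
constant = 2.0 ** intercept  # ~0.25
```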

.. [#] Default settings are :code:`min_vcpus=0`,
   :code:`desired_vcpus=8`, and :code:`max_vcpus=256`. Custom settings
   are :code:`desired_vcpus=2048`, :code:`max_vcpus=4096`, and
   :code:`min_vcpus=512`. Both default and custom Cloudknot cases were
   also limited by the EC2 service limits for our region and account,
   which vary by instance type but never exceeded 200 instances.

.. figure:: figures/nargsscaling.png

Execution time to find solutions of the 2D heat equation for many
different temperature constraints on a 10 x 10 grid. We show
scaling as a function of the number of constraints for Pywren, the
default Cloudknot configuration, and a Cloudknot configuration
with more available vCPUs. Note the :math:`\log_2` scale for the
:math:`x`-axis. Pywren outperforms Cloudknot in all cases. We posit
that the additional overhead associated with building the Docker
image, along with EC2 service limits limited Cloudknot's throughput.
:label:`fig.nargsscaling`
@@ -598,31 +627,32 @@ neuroimaging and microscopy. And we've included scaling analyses
that show that Cloudknot performs comparably to other distributed
computing frameworks. On one hand, scaling charts like the ones in
Figures :ref:`fig.nargsscaling`, :ref:`fig.syssizescaling`, and
:ref:`fig.mribenchmark` are important because they show potential users
the relative cost in execution time of using Cloudknot in comparison to
other distributed computing platforms.

On the other hand, the scaling results in this paper, indeed most
scaling results in general, measure :math:`t_\mathrm{exec}` from
Eq :ref:`eq.tcompute`, capturing only partial information about
:math:`t_\mathrm{compute}`. Precisely measuring :math:`t_\mathrm{env}`
is beyond the scope of this paper so the reduction in
:math:`t_\mathrm{compute}` is admittedly speculative. But we believe an
extra 30-50% in execution time may be acceptable in some situations. For
example, if the amount of time that a user will spend learning a new
queueing system or batch processing language exceeds the time savings
due to reduced execution time, then it will be advantageous to accept
Cloudknot's suboptimal execution time in order to use its simplified
API. Beginning Cloudknot users simply add an extra import statement,
instantiate a :code:`Knot` object, call the :code:`map()` method, and
wait for results. And because Cloudknot is built using Docker and the
AWS Batch infrastructure, it can accommodate the needs of more advanced
users who want to augment their Dockerfiles or specify instance types.

Cloudknot's simple API and its conditionally acceptable execution
time compared to other distributed computing frameworks make it a
viable tool for researchers who want distributed execution of their
computational workflows, from within their Python environment, without
the steep learning curve of a new platform.


Acknowledgements
38 changes: 19 additions & 19 deletions papers/adam_richie-halford/figures/compare_heateq_results.ipynb

Large diffs are not rendered by default.

@@ -430,7 +430,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.3"
}
},
"nbformat": 4,
Binary file modified papers/adam_richie-halford/figures/nargsscaling.png
Binary file modified papers/adam_richie-halford/figures/syssizescaling.png
22 changes: 22 additions & 0 deletions papers/adam_richie-halford/mybib.bib
@@ -182,6 +182,12 @@ @misc{cloudknot-repo
year = {2018}
}

@misc{cloudknot-examples,
author = {Richie-Halford, Adam and Rokem, Ariel},
title = {Cloudknot Examples},
howpublished = {\url{https://github.com/richford/cloudknot/tree/master/examples}},
year = {2018}
}

@misc{cloudknot-docs,
author = {Richie-Halford, Adam and Rokem, Ariel},
@@ -342,3 +348,19 @@ @misc{anwar_o_nunez_elizalde_2017_1034342
doi = {10.5281/zenodo.1034342},
url = {https://doi.org/10.5281/zenodo.1034342}
}

@article{Boettiger14,
author = {Carl Boettiger},
title = {An introduction to Docker for reproducible research, with examples
from the {R} environment},
journal = {CoRR},
volume = {abs/1410.0846},
year = {2014},
url = {http://arxiv.org/abs/1410.0846},
archivePrefix = {arXiv},
eprint = {1410.0846},
timestamp = {Wed, 07 Jun 2017 14:42:34 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/Boettiger14},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
