---
title: "xesn: Echo state networks powered by Xarray and Dask"
date: 1 November 2024
bibliography: docs/references.bib
---
Xesn
is a Python package that allows scientists to easily design
Echo State Networks (ESNs) for forecasting problems.
ESNs are a Recurrent Neural Network architecture introduced by @jaeger_echo_2001
that belongs to a class of techniques termed Reservoir Computing.
One defining characteristic of these techniques is that all internal weights are
determined by a handful of global, scalar parameters, thereby avoiding the need
for backpropagation and reducing training time significantly.
Because this architecture is conceptually simple, many scientists implement ESNs
from scratch, leading to questions about computational performance.
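To make this concrete, the following is a minimal NumPy sketch of the standard ESN formulation: the recurrent and input weights are drawn randomly and rescaled by a handful of scalar parameters, and only the linear readout is fit, via ridge (Tikhonov) regression. The parameter values and the random training signal are placeholders for illustration; this sketch shows the generic technique, not xesn's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar parameters that determine all internal weights
n_input, n_reservoir = 3, 500
spectral_radius = 0.9    # rescales the recurrent weight matrix
input_scaling = 0.5      # rescales the input weight matrix
leak_rate = 0.8          # blends previous and updated reservoir states
tikhonov = 1e-6          # ridge regularization for the readout

# Random, untrained internal weights
A = rng.uniform(-1, 1, (n_reservoir, n_reservoir))
A *= spectral_radius / np.max(np.abs(np.linalg.eigvals(A)))
Win = input_scaling * rng.uniform(-1, 1, (n_reservoir, n_input))

def step(r, u):
    """One reservoir update with a leaky tanh nonlinearity."""
    return (1 - leak_rate) * r + leak_rate * np.tanh(A @ r + Win @ u)

# Drive the reservoir with a training signal and collect the hidden states
n_steps = 2000
u = rng.standard_normal((n_steps, n_input))   # stand-in for real training data
r = np.zeros(n_reservoir)
R = np.empty((n_steps, n_reservoir))
for t in range(n_steps):
    r = step(r, u[t])
    R[t] = r

# Only the readout is trained: ridge regression mapping r(t) -> u(t+1)
Wout = np.linalg.solve(
    R[:-1].T @ R[:-1] + tikhonov * np.eye(n_reservoir),
    R[:-1].T @ u[1:],
).T
one_step_prediction = R[:-1] @ Wout.T
```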
Xesn
offers a straightforward, standard implementation of ESNs that operates
efficiently on CPU and GPU hardware.
The package leverages optimization tools to automate the parameter selection
process, so that scientists can reduce the time spent finding a good architecture
and focus on using ESNs for their domain application.
Importantly, the package flexibly handles forecasting tasks for out-of-core,
multi-dimensional datasets,
eliminating the need to write parallel programming code.
Xesn
was initially developed to handle the problem of forecasting weather
dynamics, and so it integrates naturally with Python packages that have become
familiar to weather and climate scientists such as Xarray
[@hoyer_xarray_2017].
However, the software is ultimately general enough to be utilized in other
domains where ESNs have been useful, such as in
signal processing [@jaeger_harnessing_2004].
ESNs are a conceptually simple Recurrent Neural Network architecture, which leads many scientists who use them to implement ESNs from scratch. While this approach can work well for low-dimensional problems, the situation quickly becomes more complicated when:
- deploying the code on GPUs,
- interacting with a parameter optimization algorithm in order to tune the model, and
- parallelizing the architecture for higher dimensional applications.
Xesn
is designed to address all of these points.
Additionally, while the ESN architectures allow for some design flexibility, the
overall interface is streamlined based on the parameter and design impact study
of @platt_systematic_2022.
At its core, xesn
uses NumPy [@harris_array_2020] and SciPy [@scipy_2020] to perform array-based
operations on CPUs.
The package then harnesses the CPU/GPU-agnostic capabilities afforded by CuPy [@cupy_learningsys2017]
to operate on GPUs.
Although ESNs do not employ backpropagation to train internal weights,
their behavior and performance are highly sensitive to a set of
five scalar parameters.
Moreover, the interaction of these parameters is often not straightforward, and
it is therefore advantageous to optimize these parameter values
[@platt_systematic_2022].
Additionally, @platt_constraining_2023 and @smith_temporal_2023 showed that
adding invariant metrics to the loss function, such as the leading Lyapunov
exponent or the kinetic energy spectrum, improves generalizability.
As a generic form of such a constraint,
xesn
offers the capability to constrain the
system's power spectral density during parameter optimization, in addition to a
more traditional mean squared error loss function.
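As an illustration of such a combined objective, the sketch below adds a power-spectral-density mismatch term to a mean squared error loss. The use of scipy.signal.periodogram, the log-space comparison, and the relative weighting are illustrative assumptions, not necessarily how xesn implements its spectral loss.

```python
import numpy as np
from scipy.signal import periodogram

def psd_mse_loss(prediction, truth, spectral_weight=1.0):
    """Mean squared error plus a power-spectral-density mismatch penalty.

    prediction, truth: arrays of shape (time, state), e.g. an ESN forecast
    and the corresponding verification data.
    """
    mse = np.mean((prediction - truth) ** 2)

    # Power spectral density along the time axis for every state variable
    _, psd_pred = periodogram(prediction, axis=0)
    _, psd_true = periodogram(truth, axis=0)

    # Compare the spectra in log space so all frequencies contribute
    eps = 1e-12
    spectral_error = np.mean(
        (np.log10(psd_pred + eps) - np.log10(psd_true + eps)) ** 2
    )
    return mse + spectral_weight * spectral_error
```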
Xesn
enables parameter optimization by integrating with the Surrogate Modeling
Toolbox [@bouhlel_scalable_2020], which has a Bayesian optimization
implementation.
Xesn
provides a simple interface so that the user can specify all of the
settings for training, parameter optimization, and testing with a single YAML
file.
By doing so, all parts of the experiment are easier to reproduce and to manage
with scheduling systems like SLURM in HPC environments or in the cloud.
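As a sketch of what this single-file workflow can look like, the example below parses a small YAML configuration whose sections cover the model, training, optimization, and testing stages. The section and field names here are hypothetical placeholders chosen for illustration; the exact schema is defined in the xesn documentation.

```python
import yaml

# Hypothetical configuration illustrating the single-YAML-file workflow;
# the section and field names are placeholders, not the exact xesn schema.
config_text = """
esn:
  n_reservoir: 2000
  leak_rate: 0.8
  spectral_radius: 0.9
training:
  n_spinup: 500
  tikhonov_parameter: 1.0e-6
optimization:
  parameters: [spectral_radius, leak_rate]
  loss: [mse, psd]
testing:
  n_forecast_steps: 1000
"""

config = yaml.safe_load(config_text)

# Every stage of an experiment (training, optimization, testing) reads from
# the same file, which keeps runs reproducible and easy to submit as, e.g.,
# SLURM jobs that differ only in which YAML file they point to.
print(config["esn"]["n_reservoir"], config["optimization"]["parameters"])
```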
It is typical for ESNs to use a hidden layer that is much larger than the input
dimension, so memory requirements can grow quickly for high-dimensional problems.
Xesn
enables prediction for multi-dimensional systems by integrating its high
level operations with Xarray
[@hoyer_xarray_2017].
As with Xarray
, users refer to dimensions based on their named axes.
Xesn
parallelizes the core array-based operations by using Dask
[@dask_2016; @rocklin_scipy_2015]
to map them across available resources, from a laptop to a distributed HPC or
cloud cluster.
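For example, a user might hold a multi-dimensional dataset in a chunked xarray.DataArray with named dimensions, which the lazy architecture can then map over in parallel. The snippet below uses only standard Xarray and Dask calls; the closing comment only gestures at where xesn.LazyESN enters, since its exact call signature should be checked against the package documentation.

```python
import numpy as np
import xarray as xr

# A multi-dimensional dataset with named dimensions, e.g. (space, time)
data = xr.DataArray(
    np.random.standard_normal((256, 10_000)),
    dims=("x", "time"),
    name="u",
)

# Chunk along the spatial dimension with Dask; each chunk can become an
# independent ESN "group" that is trained in parallel across the cluster
chunked = data.chunk({"x": 32})
print(chunked.chunks)   # lazy, out-of-core representation

# xesn.LazyESN would then be configured to operate over these named,
# chunked dimensions (see the xesn documentation for the exact interface).
```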
It is important to note that there is already an existing software package in
Python for Reservoir Computing, called ReservoirPy
[@Trouvain2020].
To our knowledge, its purpose is distinctly different from that of xesn.
The focus of ReservoirPy
is to develop a highly generic framework for Reservoir
Computing, for example, allowing one to change the network node type and graph structure
underlying the reservoir, and allowing for delayed connections.
On the other hand, xesn
is focused specifically on implementing ESN
architectures that can scale to multi-dimensional forecasting tasks.
Additionally, while ReservoirPy
enables hyperparameter grid search capabilities
via Hyperopt [@hyperopt], xesn
enables Bayesian optimization as noted above.
Another ESN implementation is that of [@arcomano_machine_2020; @arcomano_hybrid_2022; @arcomano_hybrid_2023], available at [@arcomano_code]. That code implements ESNs in Fortran and focuses on using ESNs for hybrid physics-ML modeling.
Here we present brief scaling results to show
how the standard (eager)
xesn.ESN
architecture scales with increasing hidden and input dimensions.
Additionally, we provide some baseline results to serve as guidance when
configuring Dask
to use the parallelized
xesn.LazyESN
architecture.
The scripts used to set up, execute, and visualize these scaling tests can be
found
here.
For methodological details on these two architectures, please refer to
the methods section of the documentation.
For reference, in \autoref{fig:eager} we show the wall time and peak memory usage required to
train the standard (eager) ESN architecture as a function of the input and hidden dimensions.
These tests were run in the us-central1-c zone on Google Cloud Platform (GCP), using
a single c2-standard-60 instance to test the CPU (NumPy) implementation
and a single a2-highgpu-8g instance (i.e., with 8 A100 cards) to test the GPU
(CuPy) implementation.
The training data was generated from the Lorenz96 model
[@lorenz_predictability_1996] at each input dimension shown.
In the CPU tests, wall time scales quadratically with the reservoir size, while
it is mostly constant on a GPU.
For this problem, it becomes advantageous to use GPUs once the reservoir size is
approximately
In order to evaluate the performance of the parallelized architecture, we take
the Lorenz96 system and train xesn.LazyESN with a variety of dask.distributed
cluster configurations (a sketch of the corresponding cluster setup follows this list), testing:
- Purely threaded mode (CPU only).
- The relevant default "LocalCluster" (i.e., single node) configuration for our resources. On the CPU resource, a GCP c2-standard-60 instance, the default dask.distributed.LocalCluster has 6 workers, each with 5 threads. On the GPU resource, a GCP a2-highgpu-8g instance, the default dask_cuda.LocalCUDACluster has 8 workers, each with 1 thread.
- A LocalCluster with 1 Dask worker per group. On GPUs, this assumes 1 GPU per worker, and we are able to use a maximum of 8 workers due to our available resources.
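For reference, cluster configurations like those listed above can be created with standard Dask APIs, as in the sketch below; the worker and thread counts simply mirror the defaults reported for our instances.

```python
from dask.distributed import Client, LocalCluster

# CPU: mimic the default LocalCluster on a c2-standard-60 instance
# (6 worker processes, 5 threads each), or set n_workers to one per ESN group.
cluster = LocalCluster(n_workers=6, threads_per_worker=5)
client = Client(cluster)

# GPU: the dask_cuda default on an a2-highgpu-8g instance is 8 workers with
# 1 thread each, i.e. one worker per A100 card (requires the dask_cuda package).
# from dask_cuda import LocalCUDACluster
# client = Client(LocalCUDACluster())
```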
\autoref{fig:lazy} shows the strong scaling results of xesn.LazyESN
for each of these
cluster configurations, where each point shows the ratio of the
wall time with the standard (serial) architecture to that of the lazy (parallel)
architecture under the given Dask cluster configuration.
On CPUs, using one Dask worker process per ESN group generally scales well,
which makes sense because each group is trained entirely independently.
On GPUs, the timing is largely determined by how many workers (GPUs) there are relative to the number of groups. When there are fewer workers than groups, performance degrades. However, when there is at least one worker per group, the timing is almost the same as in the single-worker case, improving performance by only 10-20%. While the strong scaling is somewhat muted, the invariance of wall time to reservoir size in \autoref{fig:eager} and to the number of groups in \autoref{fig:lazy} means that the distributed GPU implementation is able to tackle larger problems at roughly the same computational cost.
T.A. Smith and S.G. Penny acknowledge support from NOAA Grant NA20OAR4600277. S.G. Penny and J.A. Platt acknowledge support from the Office of Naval Research (ONR) Grants N00014-19-1-2522 and N00014-20-1-2580. T.A. Smith acknowledges support from the Cooperative Institute for Research in Environmental Sciences (CIRES) at the University of Colorado Boulder. The authors thank the editor Jonny Saunders for comments that significantly improved the manuscript, and the reviewers Troy Arcomano and William Nicholas.