Releases: CaNS-World/CaNS
v2.0
CaNS 2.0 is finally released! 🎉
This is the most significant revision of our toolkit so far.
Co-authored by Pedro Costa, Massimiliano Fatica, and Josh Romero.
Summary
This release marks the completion of a fresh porting effort for massively parallel simulations on modern architectures, from one to thousands of GPUs, with a focus on performance while ensuring a flexible and sustainable implementation that is easy to extend to more complex physical problems. We used OpenACC directives for loop acceleration and host/device data transfer, interoperating with NVIDIA's cuFFT and the new cuDecomp domain decomposition library.
cuDecomp is the heart of the multi-GPU implementation, ensuring the solver's performance by bringing a novel, hardware-adaptive parallelization of the transposes in the Poisson/Helmholtz solver, and of the halo-exchange operations.
Although quite performant, the implementation is also flexible, allowing for an easy change of solver profiles, such as X-aligned default pencils, which are optimal for a fully explicit time integration, or Z-aligned default pencils, which are optimal for a Z-implicit time integration for wall flows.
Finally, another noteworthy (optional) feature is CaNS' new mixed-precision mode, where the pressure Poisson equation is solved in lower precision. This mode makes a huge difference in performance for many-GPU calculations across multiple nodes.
In addition to these big-picture changes, there have been many impactful changes that make the solver more versatile and robust. All relevant changes are summarized below.
Changes:
- GPU acceleration using OpenACC directives for loops and data movement, which is interfaced with CUDA whenever needed
- Hardware-adaptive multi-GPU implementation using the cuDecomp library for transposes (seven possible communication backends) and halo exchanges (five possible communication backends), with different flavors of MPI, NCCL and NVSHMEM implementations
- Lean memory footprint on GPUs, which can be made even leaner by exploiting cuDecomp's in-place transposes
- Mixed-precision mode implemented on both CPUs and GPUs
- Hybrid MPI-OpenMP parallelization is still supported on CPUs
- Any default pencil orientation is supported, on both CPUs and GPUs
- A fast-kernel mode is used by default to speed up the calculation of the prediction velocity, on both CPUs and GPUs
- The 2DECOMP library is still used for the many-CPU parallelization of the Poisson solver, and some of the parallel data I/O
- Build process made much simpler and more robust, with the dependencies determined automatically
- Refactoring of the FFT-based Fourier, cosine, and sine transforms on GPUs, together with the Gauss elimination kernels, with improvements both in terms of speed and maintainability
- Support for uneven decompositions and odd numbers of grid points along any direction; perhaps surprisingly, setups with an odd number of points near the desired resolution may at times result in a more efficient FFT computation
- External domain decomposition libraries, cuDecomp and 2DECOMP, loaded as Git submodules
- Many changes for improved performance and robustness, with a focus on minimizing the memory footprint and computation intensity while keeping the tool versatile
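
The remark above about odd grid sizes can be illustrated with a small helper. This is a minimal sketch, not part of CaNS: it assumes that an FFT-friendly size is one whose prime factors are all small (here at most 7, which is an assumption chosen for illustration), and the function names are hypothetical.

```python
from typing import List


def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n (n >= 2), by trial division."""
    largest = 1
    p = 2
    while p * p <= n:
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    if n > 1:  # whatever remains is itself a prime factor
        largest = n
    return largest


def fft_friendly_sizes(target: int, window: int = 8, max_prime: int = 7) -> List[int]:
    """List sizes within +/- window of target whose prime factors are all <= max_prime.

    The threshold max_prime=7 is an illustrative assumption, not a CaNS setting.
    """
    lo = max(2, target - window)
    return [n for n in range(lo, target + window + 1)
            if largest_prime_factor(n) <= max_prime]


if __name__ == "__main__":
    # Candidate resolutions near 250 points; some of them are odd.
    print(fft_friendly_sizes(250))
```

Near a target of 250 points, this selection includes the odd sizes 243 and 245 alongside 250, 252 and 256, which is the kind of situation the note above refers to.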
Acknowledgements
CaNS 2.0 has been tested on several GPU-accelerated systems such as Marconi 100, Meluxina, Perlmutter, Selene, Summit and Vega. We acknowledge the support from CoE RAISE, NERSC and EuroHPC, which enabled thorough testing of CaNS 2.0 on these state-of-the-art supercomputers.
v1.3.1
Summary
This release features some simplifications of the OpenMP code and the removal of the nthreadsmax input parameter. It was first meant to fix an issue concerning boundary conditions, but the implementation is actually correct.
Changes
Full Changelog: v1.3.0...v1.3.1
v1.3.0
Summary
This release features a mixed-precision mode where the Poisson equation can be solved using lower precision, which may be useful for certain setups.
Changes
- Mixed-precision mode by @p-costa in #26 after discussions with @maxcuda and @romerojosh. For more details on how to set it up, see the option SINGLE_PRECISION_POISSON under doc/INFO_COMPILING.md.
v1.2.0
Summary
This release features a more robust and friendly build process (still using Make). It also features some restructuring of the documentation.
Changes:
- better build process with a few pre-defined profiles and automatic dependency generation (requires gawk). See doc/INFO_COMPILING.md
- 2DECOMP built as an external library
- documentation files brought into the doc folder
(see #25)
Full Changelog: v1.1.5...v1.2.0
v1.1.5
Summary
This release features minor changes, adding a new checkpointing mode.
Changes:
- a new checkpointing mode was added to bound the number of checkpoints per run to a maximum, which can be set using a new parameter in the input file dns.in, named nsaves_max; please see src/INFO_INPUT.md for more details.
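
One way to picture a bounded number of checkpoints is a cyclic slot index, sketched below. This is only an illustration under stated assumptions: the file-name patterns, the function name, and the convention that a non-positive nsaves_max means "no bound" are all hypothetical here; the actual behavior is documented in src/INFO_INPUT.md.

```python
def checkpoint_name(isave: int, nsaves_max: int, istep: int) -> str:
    """Return a checkpoint file name for the isave-th save (1-based) at time step istep.

    Assumptions (hypothetical, for illustration only):
      - nsaves_max > 0 bounds the number of files by reusing slots cyclically;
      - nsaves_max <= 0 means no bound, and files are named after the time step.
    """
    if nsaves_max > 0:
        slot = (isave - 1) % nsaves_max + 1  # cycles through 1..nsaves_max
        return f"fld_{slot:04d}.bin"
    return f"fld_{istep:07d}.bin"


if __name__ == "__main__":
    # Five saves with at most three files on disk: the fourth save reuses slot 1.
    for isave in range(1, 6):
        print(checkpoint_name(isave, nsaves_max=3, istep=1000 * isave))
```

With this scheme the disk usage of a long run stays fixed, at the cost of overwriting the oldest checkpoint.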
v1.1.4
Summary
This release features minor changes, with performance improvements and bugfixes.
Changes:
- implicit Z diffusion made considerably more efficient. For optimal performance, the code needs to be built with -D_DECOMP_Z, as explained in README.md;
- new example files and grid mapping functions have been added (thanks @GabrieleBoga for the temporal boundary layer setup! #22);
- other minor bugfixes.
v1.1.3
Summary
This release has one major feature: the option of using implicit diffusion along only one of the domain directions, the third one (z), where the grid can be non-uniform. Hence, CaNS can now be run with (1) fully explicit time integration; (2) implicit diffusion along all directions; or (3) implicit diffusion only along the z-direction, which comes in handy for grids that are very fine only along z. See Compilation, under README.md, for how to activate this feature.
Changes:
- Option for implicit diffusion only along z;
- Minor changes in the Poisson solver to avoid scaling of the absolute pressure under certain combinations of BCs;
- Added a two-dimensional Taylor-Green vortex case.
v1.1.2
Summary
This release adds very minor features with respect to the previous major release v1.1.0: a more robust input sanity check, and slightly more flexibility in the domain and processor grids.
Changes:
- A very robust check of the direct Helmholtz solver for cell- and face-centered variables has been enabled: at the beginning of any calculation, the Poisson and, if implicit diffusion is used, the three additional Helmholtz equations with normal boundary conditions at the cell faces are checked for random inputs under sanity.f90, if the code is built with the -D_DEBUG preprocessor flag;
- dims(:) does not have to be divisible by 2 anymore;
- possibility of using 1 grid point along a certain direction, rather than the previous minimum of 2;
- fixes an input sanity check bug introduced in v1.1.1 (#19);
- some nitpicking.
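
The idea of checking a direct solver against random inputs can be sketched in one dimension. The example below is not the code in sanity.f90; it is a simplified analog that assumes a uniform grid, homogeneous Dirichlet end values, and a tridiagonal Helmholtz operator solved by Gauss elimination (the Thomas algorithm). All function names and parameters are hypothetical.

```python
import random


def thomas_solve(a, b, c, rhs):
    """Solve a tridiagonal system by Gauss elimination (Thomas algorithm).

    a, b, c are the sub-, main and super-diagonals; a[0] and c[-1] are unused.
    """
    n = len(rhs)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = rhs[0] / b[0]
    for i in range(1, n):  # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (rhs[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def helmholtz_residual(n=64, alpha=-1.0, seed=0):
    """Solve (d2/dx2 + alpha) p = rhs for a random rhs and return max |L p - rhs|."""
    rng = random.Random(seed)
    dx = 1.0 / n
    rhs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    a = [1.0 / dx**2] * n
    b = [-2.0 / dx**2 + alpha] * n
    c = [1.0 / dx**2] * n
    p = thomas_solve(a, b, c, rhs)
    resid = 0.0
    for i in range(n):
        left = p[i - 1] if i > 0 else 0.0       # zero value outside the domain
        right = p[i + 1] if i < n - 1 else 0.0
        resid = max(resid, abs(a[i] * left + b[i] * p[i] + c[i] * right - rhs[i]))
    return resid


if __name__ == "__main__":
    print(f"max residual: {helmholtz_residual():.3e}")
```

The check passes when the residual is at the level of round-off; a wrong boundary treatment or a bug in the elimination shows up as a residual many orders of magnitude larger.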
v1.1.0
Summary
This release features significant improvements in terms of performance and scalability, but also enhances the code modularity and the implementation in general. There is no breaking of backward compatibility.
Changes:
- x-aligned pencils are now used by default in the main branch, which results in improved speed and scalability;
- support for uneven partitioning of the computational subdomains: the total number of grid points along one direction does not have to be divisible by the number of tasks;
- simplified and unified the routines used for computing the prediction velocity with and without implicit diffusion;
- improved the routines for imposing boundary conditions, and the MPI I/O checkpointing (based on those of SNaC);
- support an arbitrary extent of boundary cells when imposing boundary conditions;
- lots of polishing and minor improvements.
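
The uneven partitioning and x-aligned pencils mentioned above can be sketched as follows. This is a minimal illustration, not the 2DECOMP or CaNS implementation: the function names are hypothetical, and giving the leftover points to the lowest ranks is an arbitrary choice made here.

```python
def local_extent(n, nparts, rank):
    """Return (start, count) of the 0-based slice owned by rank when n points
    are split over nparts tasks; n need not be divisible by nparts.

    Assumption: the first (n % nparts) ranks receive one extra point each.
    """
    base, rem = divmod(n, nparts)
    count = base + (1 if rank < rem else 0)
    start = rank * base + min(rank, rem)
    return start, count


def x_pencil_shape(ng, dims, coords):
    """Local array shape of an x-aligned pencil: x is kept whole on each task,
    while y and z are split over a dims[0] x dims[1] processor grid."""
    nx, ny, nz = ng
    ny_loc = local_extent(ny, dims[0], coords[0])[1]
    nz_loc = local_extent(nz, dims[1], coords[1])[1]
    return (nx, ny_loc, nz_loc)


if __name__ == "__main__":
    # 10 points over 3 tasks: counts of 4, 3 and 3.
    print([local_extent(10, 3, r) for r in range(3)])
    # A 64 x 10 x 7 grid on a 3 x 2 processor grid, task at coordinates (0, 1).
    print(x_pencil_shape((64, 10, 7), (3, 2), (0, 1)))
```

Keeping x whole on each task means that operations along x, such as the first set of transforms in the Poisson solver, need no communication in the default layout.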