Welcome to AMD's HPC Training Examples Repo! Here you will find a variety of examples to showcase the capabilities of AMD's GPU software stack. Please be aware that the repo is continuously updated to keep up with the most recent releases of the AMD software.
Please refer to this table of contents to locate the exercises you are interested in, sorted by topic.
- HIP
  - Basic Examples
    - Stream_Overlap: shows how to share the workload of a GPU offload computation across several overlapping streams. The additional parallelism provided by the overlapping streams further reduces the execution time (README).
    - dgemm: a (d)GEMM application created as an exercise to showcase simple matrix-matrix multiplications on AMD GPUs (README).
    - basic_examples: a collection of introductory exercises, such as device-to-host data transfer and basic GPU kernel implementation (README).
    - hip_stream: a modification of the STREAM benchmark for HIP (README).
    - jacobi: a distributed Jacobi solver that uses GPUs to perform the computation and MPI for halo exchanges (README).
    - matrix_addition: an example of a HIP kernel performing a matrix addition.
    - saxpy: an example of a HIP kernel performing a saxpy operation (README).
    - stencil_examples: examples of stencil operations with a HIP kernel, including the use of timers and asynchronous copies.
    - vectorAdd: an example of a HIP kernel performing a vector addition (README).
    - vector_addition_examples: another example of a HIP kernel performing vector addition, in several versions: one using shared memory, one with timers, and a CUDA version on which to try the hipify and hipifly tools. The examples in this directory are not part of the HIP test suite.
  - CUDA to HIP Porting
  - HIP-Optimizations: a daxpy HIP kernel is used to show how an initial version can be optimized to improve performance (README).
  - HIPFort: a gemm example in Fortran using hipfort.
  - HIPStdPar: several examples showing C++ Std Parallelism on AMD GPUs (README).
  - HIP-OpenMP: an example of HIP/OpenMP interoperability.
- MPI-examples
  - Benchmarks: GPU-aware benchmarks (collective.cpp and pt2pt.cpp) to assess the performance of the communication libraries (README). NOTE: for more detailed instructions on how to run GPU-aware MPI examples, see the [GPU_aware_MPI](https://github.com/amd/HPCTrainingExamples/tree/main/GPU_aware_MPI/README.md) entry.
  - GhostExchange: a slimmed-down example of an actual physics application, where the solution is initialized on a square domain discretized with a Cartesian grid and then advanced in parallel using MPI communications. NOTE: detailed README files are provided for the different versions of the GhostExchange_ArrayAssign code; they showcase how to use Omnitrace to profile this application.
- ManagedMemory: programming model exercises covering the APU programming model, OpenMP, performance portability frameworks (Kokkos and RAJA), and the discrete GPU programming model (README).
- MLExamples: a variation of PyTorch's MNIST example code and a smoke test for mpi4py using CuPy. Instructions on how to run and test other ML frameworks are in the README.
- Occupancy: an example on modifying thread occupancy, using several variants of a matrix-vector multiplication that leverage shared memory and launch bounds.
- OmniperfExamples: several examples showing how to leverage Omniperf to perform kernel-level optimization. NOTE: detailed READMEs are provided in each subdirectory (README, Video of Presentation).
- Omnitrace
  - Omnitrace on Jacobi: Omnitrace used on the Jacobi solver example (README).
  - Omnitrace by Example: Omnitrace used on several versions of the Ghost Exchange example. READMEs are available for each of the different versions of the example code (Video of Presentation).
- Pragma_Examples: OpenMP (in Fortran, C, and C++) and OpenACC examples (README).
- Speedup_Examples: examples showing the speedup obtained by going from a CPU to a GPU implementation (README).
- atomics_openmp: examples of atomic operations using OpenMP.
- Kokkos: runs the Stream Triad example with a Kokkos implementation (README).
- Rocgdb: debugs the HPCTrainingExamples/HIP/saxpy example with Rocgdb (README, Video of Presentation).
- Rocprof: uses Rocprof to profile HPCTrainingExamples/HIPIFY/mini-nbody/hip/ (README).
- GPU_aware_MPI: OSU Micro-Benchmarks with GPU-aware MPI (README, Video of Presentation).
- rocm_blog_codes: accompanying source code examples for select HPC ROCm blogs found at https://rocm.blogs.amd.com (README).
- login_info
  - AAC: instructions on how to log in to the AMD Accelerator Cloud (AAC) resource (README).
Most of the exercises in this repo can be run as a test suite by doing:
git clone https://github.com/amd/HPCTrainingExamples && \
cd HPCTrainingExamples && \
cd tests && \
./runTests.sh
You can also run a subset of the test suite by passing the name of the subset as an option to the runTests.sh script, for instance: ./runTests.sh --pytorch. To see the full list of subsets that can be run, use ./runTests.sh --help.
NOTE: tests can also be run manually from their respective directories, provided the necessary modules have been loaded and the examples have been compiled appropriately.
We welcome your feedback and contributions: feel free to use this repo to bring up any issues or to submit pull requests.
The software made available here is released under the MIT license; more details can be found in LICENSE.md.