Hi! I'm working on a project called DVM-system. In short, DVM-system was developed at the Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, with the active participation of graduate students and students of the Faculty of Computational Mathematics and Cybernetics of Lomonosov Moscow State University. It is designed for creating parallel scientific and technical computing programs in the C-DVMH and Fortran-DVMH languages. These languages share the same parallel programming model (the DVMH model) and extend standard C and Fortran with parallelism specifications implemented as compiler directives. The directives are invisible to standard compilers, so a programmer can maintain a single program for both sequential execution and parallel execution on computers of different architectures. The C-DVMH and Fortran-DVMH compilers convert the source program into a parallel program using the standard programming technologies MPI, OpenMP, and CUDA. DVM-system also includes tools for functional and performance debugging of DVMH programs. More details are available at http://dvm-system.org/en/about/
I am a GPU application developer using the CUDA model, so I was pleasantly surprised to see your project and decided to try running our DVMH programs on an AMD RX 6750 XT GPU. I compared its performance against an Nvidia RTX 3060 Ti, which has roughly similar technical specifications. We manually optimized the NAS Parallel Benchmarks (https://www.nas.nasa.gov/software/npb.html) using our DVMH model. The results are presented below.
Specs of the RTX 3060 Ti: 16 TFLOPS FP32, 0.253 TFLOPS FP64, 448 GB/s memory bandwidth, 200 W TDP
Specs of the RX 6750 XT: 13 TFLOPS FP32, 0.832 TFLOPS FP64, 432 GB/s memory bandwidth, 250 W TDP
So, the AMD card is about 18% slower than the Nvidia card in FP32 but roughly 3 times faster in FP64, and the memory bandwidth is almost the same. We see the expected three-fold difference on the EP test, which is fully compute-bound and performs no reads from or writes to global memory.
I see a performance problem only with the FT test. This test uses arrays of double complex type. I examined all parallel loops with our performance analysis tool, and every loop slows down by approximately 1.5 times. Below is the Fortran code for a simple loop that copies one array into another.
Our DVMH compiler converts this loop into the following CUDA kernel:
For double complex we use a class with two double fields (more details in the attached file dvmhlib_f2c.h). In the general case, reading or writing an array element requires more than one global-memory transaction, because the real and imaginary parts of each element are stored with a stride.
Next, we profiled this kernel with Nsight Compute and saw a total memory throughput of approximately 72% on the RTX 3060 Ti:
Is it possible to profile this code on AMD to compare its performance? Do you have any ideas about this? Why is performance lost when using a complex data type, and how can we find out what the problem might be?