Hi! I'm working on a project called DVM-system. In short, DVM-system was developed at the Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, with the active participation of graduate students and students of the Faculty of Computational Mathematics and Cybernetics of Lomonosov Moscow State University. It is designed for creating parallel scientific and technical computing programs in the C-DVMH and Fortran-DVMH languages. These languages share the same parallel programming model (the DVMH model) and extend standard C and Fortran with parallelism specifications implemented as compiler directives. The directives are invisible to standard compilers, so a programmer can maintain a single program for both sequential execution and parallel execution on computers of different architectures. The C-DVMH and Fortran-DVMH compilers convert the source program into a parallel program using the standard programming technologies MPI, OpenMP, and CUDA. DVM-system also includes tools for functional and performance debugging of DVMH programs. More details are available at http://dvm-system.org/en/about/
I am a GPU application developer using the CUDA model, so I was pleasantly surprised to see your project and decided to try running our DVMH programs on an AMD RX 6750 XT GPU. I compared its performance against an Nvidia RTX 3060 Ti, which has roughly similar technical specifications. We manually optimized the NAS Parallel Benchmarks (https://www.nas.nasa.gov/software/npb.html) using our DVMH model. The results are presented below.
Specs of the RTX 3060 Ti: 16 TFLOPS FP32, 0.253 TFLOPS FP64, 448 GB/s memory bandwidth, 200 W TDP
Specs of the RX 6750 XT: 13 TFLOPS FP32, 0.832 TFLOPS FP64, 432 GB/s memory bandwidth, 250 W TDP
So, the AMD card is about 18% slower than the Nvidia card in FP32 but roughly 3 times faster in FP64, and the memory bandwidth is almost the same. We see the expected three-fold difference on the EP test, which is fully compute-bound and performs no reads from or writes to global memory.
I see a performance problem only with the FT test. This test uses arrays of double complex type. I examined all parallel loops with our performance analysis tool, and every loop slows down by approximately 1.5 times. Below is the Fortran code for a simple loop that copies one array into another.
Our DVMH compiler converts this loop into the following CUDA kernel:
For double complex we use a class with two double fields (more details in the attached file dvmhlib_f2c.h). In the general case, reading or writing an array element requires more than one global-memory transaction, because the real and imaginary parts of each element are stored with a stride.
Next, we profiled this kernel with Nsight Compute and saw a total memory throughput of approximately 72% on the RTX 3060 Ti:
Is it possible to profile this code on AMD to compare its performance? Do you have any ideas about this? Why is performance lost when using a complex data type, and how can we find out what the problem might be?