Vectorise running coupling scale updates (port update_scale_coupling_vec to cudacpp with SIMD/GPU)?? Maybe not #964

Open
valassi opened this issue Aug 13, 2024 · 0 comments


After introducing more detailed counters in #962, it is now clear that the update of the running coupling scale is a moderate scalar bottleneck in some processes.

One example is ggttggg, where this step takes around 20% of the ME calculation time:
https://github.com/valassi/madgraph4gpu/blob/2169f6286a3f43c295c913118909a5e75c38cda8/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt#L676


```
*** (3-cuda) EXECUTE MADEVENT_CUDA x10 (create events.lhe) ***
--------------------
CUDACPP_RUNTIME_FBRIDGEMODE = (not set)
CUDACPP_RUNTIME_VECSIZEUSED = 8192
--------------------
81920 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! ICONFIG number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
--------------------
Executing ' ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 128/128
 [XSECTION] VECSIZE_USED = 8192
 [XSECTION] MultiChannel = TRUE
 [XSECTION] Configuration = 1
 [XSECTION] ChannelId = 1
 [XSECTION] Cross section = 2.332e-07 [2.3322993086656006E-007] fbridge_mode=1
 [UNWEIGHT] Wrote 303 events (found 1531 events)
 [COUNTERS] PROGRAM TOTAL                         :   17.9617s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1382s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0704s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    1.1767s for   467913 events => throughput is 2.51E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.5383s for   180224 events => throughput is 2.99E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    1.9975s for    90112 events => throughput is 2.22E-05 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.2803s for    90112 events => throughput is 3.11E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.1079s for    90112 events => throughput is 1.20E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1654s for   467913 events => throughput is 3.53E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    1.5325s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0322s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :   11.9224s for    90112 events => throughput is 1.32E-04 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    6.0393s
 [COUNTERS] OVERALL MEs                    ( 22 ) :   11.9224s for    90112 events => throughput is 1.32E-04 events/s
```
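For concreteness, the following is a minimal sketch of what an event-vectorised coupling-scale update could look like on the cudacpp side, assuming a flat one-value-per-event layout and a simple one-loop running of alpha_s; the function name, signature and scale choice are illustrative assumptions, not the existing update_scale_coupling_vec interface.

```cpp
#include <cmath>
#include <cstddef>

// One-loop running of alpha_s from a reference value at the Z mass
// (standard textbook formula, used here only to make the loop body concrete).
inline double alphasOneLoop( double q2, double alphasMZ, double mz2, int nf )
{
  constexpr double pi = 3.14159265358979323846;
  const double b0 = ( 33. - 2. * nf ) / ( 12. * pi );
  return alphasMZ / ( 1. + alphasMZ * b0 * std::log( q2 / mz2 ) );
}

// Hypothetical event-vectorised scale/coupling update: one scale and one g_s
// per event, computed in a branch-free loop that a compiler can auto-vectorise
// with SIMD (or that could become one GPU thread per event). The name, the
// flat array layout and the per-event q2 input are assumptions for this
// sketch, not the actual madevent/cudacpp interface.
void updateScaleCouplingsVec( const double* q2,   // [nevt] per-event renormalisation scale squared
                              double* gs,         // [nevt] output: strong coupling g_s(q2)
                              std::size_t nevt,
                              double alphasMZ = 0.118,
                              double mz = 91.188,
                              int nf = 5 )
{
  constexpr double pi = 3.14159265358979323846;
  const double mz2 = mz * mz;
#pragma omp simd
  for( std::size_t ievt = 0; ievt < nevt; ievt++ )
  {
    const double as = alphasOneLoop( q2[ievt], alphasMZ, mz2, nf );
    gs[ievt] = std::sqrt( 4. * pi * as );
  }
}
```

The arithmetic in such a loop is trivially data-parallel; the difficulty discussed below is in the Fortran logic that chooses the per-event scale in the first place.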

Unlike the porting of phase space sampling #963, however, both the case for doing this and the chances of doing it successfully are much less obvious:

  • 20% of the ME time is not that much: ggttggg is still limited by the MEs themselves (and the simpler ggttgg does not have a scale bottleneck)
  • more importantly, the relevant Fortran functions, especially setclscales, look very difficult to port to data parallelism: they are full of if/then/else branches, which is likely to prevent lockstep processing (see the toy illustration after this list)
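As a toy illustration of the lockstep problem (the clamp logic and names below are invented for illustration and have nothing to do with the actual setclscales algorithm): SIMD lanes and GPU warps want every event to execute the same instructions, so per-event branches serialise lanes or cause warp divergence, and only simple branches can be rewritten branch-free with selects/masks.

```cpp
#include <algorithm>
#include <cstddef>

// Toy example only: a per-event, data-dependent if/then/else makes different
// events (SIMD lanes, or GPU threads in a warp) follow different paths.
void clampScaleBranchy( const double* q2in, double* q2out, std::size_t nevt, double q2min, double q2max )
{
  for( std::size_t ievt = 0; ievt < nevt; ievt++ )
  {
    if( q2in[ievt] < q2min )
      q2out[ievt] = q2min;        // some events take this branch...
    else if( q2in[ievt] > q2max )
      q2out[ievt] = q2max;        // ...while others take this one
    else
      q2out[ievt] = q2in[ievt];
  }
}

// A shallow branch like the above can be rewritten with selects (min/max or
// masked blends), keeping all events in lockstep.
void clampScaleBranchless( const double* q2in, double* q2out, std::size_t nevt, double q2min, double q2max )
{
  for( std::size_t ievt = 0; ievt < nevt; ievt++ )
    q2out[ievt] = std::min( std::max( q2in[ievt], q2min ), q2max );
}
```

This select/mask rewrite only works for shallow branches; the branching in setclscales is deep, nested and event-dependent, so it is not obvious that it could be flattened in the same way.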

So I am recording this here, but I am not convinced that it makes much sense at this stage.
