After introducing more detailed counters (#962), it is now clear that the running of the coupling scale is a moderate scalar bottleneck in some processes.

One example is ggttggg, where this takes 20% of the ME calculation:
https://github.com/valassi/madgraph4gpu/blob/2169f6286a3f43c295c913118909a5e75c38cda8/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt#L676
*** (3-cuda) EXECUTE MADEVENT_CUDA x10 (create events.lhe) ***
--------------------
CUDACPP_RUNTIME_FBRIDGEMODE = (not set)
CUDACPP_RUNTIME_VECSIZEUSED = 8192
--------------------
81920 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! ICONFIG number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
--------------------
Executing ' ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
[OPENMPTH] omp_get_max_threads/nproc = 1/4
[NGOODHEL] ngoodhel/ncomb = 128/128
[XSECTION] VECSIZE_USED = 8192
[XSECTION] MultiChannel = TRUE
[XSECTION] Configuration = 1
[XSECTION] ChannelId = 1
[XSECTION] Cross section = 2.332e-07 [2.3322993086656006E-007] fbridge_mode=1
[UNWEIGHT] Wrote 303 events (found 1531 events)
[COUNTERS] PROGRAM TOTAL : 17.9617s
[COUNTERS] Fortran Other ( 0 ) : 0.1382s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0704s
[COUNTERS] Fortran Random2Momenta ( 3 ) : 1.1767s for 467913 events => throughput is 2.51E-06 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.5383s for 180224 events => throughput is 2.99E-06 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 1.9975s for 90112 events => throughput is 2.22E-05 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.2803s for 90112 events => throughput is 3.11E-06 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.1079s for 90112 events => throughput is 1.20E-06 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1654s for 467913 events => throughput is 3.53E-07 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 1.5325s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0322s
[COUNTERS] CudaCpp MEs ( 19 ) : 11.9224s for 90112 events => throughput is 1.32E-04 events/s
[COUNTERS] OVERALL NON-MEs ( 21 ) : 6.0393s
[COUNTERS] OVERALL MEs ( 22 ) : 11.9224s for 90112 events => throughput is 1.32E-04 events/s
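As a rough cross-check of that fraction, here is a minimal sketch using only the counter values printed above (the 20% quoted earlier refers to the linked log line, so the exact ratio can differ from run to run):

```cpp
// Back-of-the-envelope check of the scale-update overhead, using the
// [COUNTERS] numbers copied from this specific run (assumption: comparing
// "Fortran UpdateScaleCouplings" against "CudaCpp MEs" is the relevant ratio).
#include <cstdio>

int main()
{
  const double tScale = 1.9975;  // Fortran UpdateScaleCouplings (s)
  const double tMEs = 11.9224;   // CudaCpp MEs (s)
  const double tTotal = 17.9617; // PROGRAM TOTAL (s)
  printf( "scale update / MEs   = %.0f%%\n", 100. * tScale / tMEs );   // ~17%
  printf( "scale update / total = %.0f%%\n", 100. * tScale / tTotal ); // ~11%
  return 0;
}
```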
Unlike the porting of phase space sampling (#963), however, the case for doing this, and the possibility of doing it successfully, are much less obvious:
- 20% of the MEs is not that much: ggttggg is still limited by the MEs themselves (and the simpler ggttgg does not have a scale bottleneck).
- Especially, the relevant Fortran functions, in particular setclscales, look very difficult to port to data parallelism: they are full of if/then/else branches, which is likely to prevent lockstep processing (see the generic sketch below).
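To illustrate the lockstep concern: in SIMD/SIMT execution all lanes of a vector (or threads of a warp) must follow the same instruction stream, so per-event if/then/else branches either serialize the divergent paths or must be rewritten in a branchless, masked form. The sketch below is purely generic and is NOT the setclscales algorithm; all names and formulas are invented for illustration.

```cpp
// Generic illustration of branchy vs branchless per-event code.
// NOT the real setclscales logic: names and formulas are hypothetical.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Branchy scalar version: each event may take a different path,
// which breaks lockstep execution across SIMD lanes / CUDA warps.
double scaleBranchy( double pt2, double mT2, bool isGluonEmission )
{
  if( isGluonEmission )
  {
    if( pt2 > mT2 ) return std::sqrt( pt2 );
    return std::sqrt( mT2 );
  }
  return std::sqrt( 0.5 * ( pt2 + mT2 ) );
}

// Branchless version: both alternatives are computed for every event and the
// result is chosen with an arithmetic mask, so all lanes execute the same
// instructions (this form is typically amenable to SIMD/SIMT vectorization).
double scaleBranchless( double pt2, double mT2, bool isGluonEmission )
{
  const double a = std::sqrt( std::max( pt2, mT2 ) ); // "gluon emission" alternative
  const double b = std::sqrt( 0.5 * ( pt2 + mT2 ) );  // other alternative
  const double mask = isGluonEmission ? 1. : 0.;      // intended to map to a select, not a jump
  return mask * a + ( 1. - mask ) * b;
}

int main()
{
  const std::vector<double> pt2 = { 1., 4., 9., 16. };
  const std::vector<double> mT2 = { 2., 3., 10., 8. };
  for( size_t i = 0; i < pt2.size(); ++i )
    printf( "event %zu: branchy=%f branchless=%f\n", i,
            scaleBranchy( pt2[i], mT2[i], i % 2 == 0 ),
            scaleBranchless( pt2[i], mT2[i], i % 2 == 0 ) );
  return 0;
}
```

The branchless form trades extra arithmetic (both alternatives are always evaluated) for a uniform instruction stream; whether that trade-off pays off for code as branch-heavy as setclscales is exactly what is in doubt here.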
So I am recording this here, but I am not convinced that pursuing it makes much sense at this stage.