WIP: Improve timers (lower overhead using rdtsc) and profile additional Fortran components (other than MEs) #962
base: master
…a counters namespace
…toring of counters using maps and explicit register methods
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.4510s
[COUNTERS] Fortran Overhead ( 0 ) : 1.3466s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0871s for 16384 events => throughput is 5.32E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0164s for 16399 events => throughput is 1.00E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
INFO: No Floating Point Exceptions have been reported
[COUNTERS] PROGRAM TOTAL : 1.9073s
[COUNTERS] Fortran Overhead ( 0 ) : 1.2890s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5218s for 98304 events => throughput is 5.31E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0958s for 98371 events => throughput is 9.74E-07 events/s
…ke cleanall and rebuild)
Note: the counter itself has a huge overhead...

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7742s
[COUNTERS] Fortran Overhead ( 0 ) : 0.5162s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0906s for 16384 events => throughput is 5.53E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0174s for 16399 events => throughput is 1.06E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1493s for 98304 events => throughput is 1.52E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.1335s
[COUNTERS] Fortran Overhead ( 0 ) : 2.6717s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5176s for 98304 events => throughput is 5.27E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0961s for 98371 events => throughput is 9.77E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.8474s for 589824 events => throughput is 1.44E-06 events/s
…ain, to reduce performance overhead from counters themselves

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.4700s
[COUNTERS] Fortran Overhead ( 0 ) : 1.2236s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0867s for 16384 events => throughput is 5.29E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0162s for 16399 events => throughput is 9.88E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1428s for 98304 events => throughput is 1.45E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9569s
[COUNTERS] Fortran Overhead ( 0 ) : 0.4895s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5181s for 98304 events => throughput is 5.27E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0958s for 98371 events => throughput is 9.74E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.8528s for 589824 events => throughput is 1.45E-06 events/s
…points

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7442s
[COUNTERS] Fortran Overhead ( 0 ) : 0.2437s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0871s for 16384 events => throughput is 5.32E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0162s for 16399 events => throughput is 9.86E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1335s for 98304 events => throughput is 1.36E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.2629s for 16399 events => throughput is 1.60E-05 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9099s
[COUNTERS] Fortran Overhead ( 0 ) : 0.3233s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5203s for 98304 events => throughput is 5.29E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0956s for 98371 events => throughput is 9.71E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.7980s for 589824 events => throughput is 1.35E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.1719s for 98371 events => throughput is 1.75E-06 events/s
Note: what is currently called 'Fortran Overhead' should be renamed 'Other' (not PDF, not X2F, not I/O... although I suspect it is related to I/O anyway).
The instrumentation of sample_put_point is here
NB: there is some hysteresis: the timing results depend on what was executed before.
For instance, x1 results may be 0.7s or 1.5s, and x10 results may be 1.9s or 4.1s: this does NOT depend on the software version!

Start with x1, several times: eventually it gives 0.7s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7417s
[COUNTERS] Fortran Overhead ( 0 ) : 0.2435s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0861s for 16384 events => throughput is 5.26E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0166s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1345s for 98304 events => throughput is 1.37E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.2603s for 16399 events => throughput is 1.59E-05 events/s

Then the FIRST execution of x10 gives 1.9s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9285s
[COUNTERS] Fortran Overhead ( 0 ) : 0.3277s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5237s for 98304 events => throughput is 5.33E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0964s for 98371 events => throughput is 9.80E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.8057s for 589824 events => throughput is 1.37E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.1741s for 98371 events => throughput is 1.77E-06 events/s

But the SECOND execution gives 4.1s, with the big increase coming from the I/O part (and any subsequent execution gives the same)
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.1048s
[COUNTERS] Fortran Overhead ( 0 ) : 1.1119s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5161s for 98304 events => throughput is 5.25E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0946s for 98371 events => throughput is 9.62E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.7954s for 589824 events => throughput is 1.35E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 1.5861s for 98371 events => throughput is 1.61E-05 events/s

Now the FIRST execution of x1 gives 1.5s!
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.4677s
[COUNTERS] Fortran Overhead ( 0 ) : 0.5601s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0861s for 16384 events => throughput is 5.26E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0167s for 16399 events => throughput is 1.02E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1338s for 98304 events => throughput is 1.36E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.6702s for 16399 events => throughput is 4.09E-05 events/s

But the SECOND execution gives 0.7s again, and so do all subsequent executions (so we are back at the beginning of the loop above)
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7480s
[COUNTERS] Fortran Overhead ( 0 ) : 0.2472s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0870s for 16384 events => throughput is 5.31E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0166s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1337s for 98304 events => throughput is 1.36E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.2628s for 16399 events => throughput is 1.60E-05 events/s

In the following, I will quote results for the second x1 and the first x10 only...
…een defined
I did this to try to decrease the 4.1s... but in the meantime I understood that the problem is elsewhere. In particular, this is not faster than string comparison - will revert!

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7451s
[COUNTERS] Fortran Overhead ( 0 ) : 0.2426s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0875s for 16384 events => throughput is 5.34E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0170s for 16399 events => throughput is 1.04E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1342s for 98304 events => throughput is 1.37E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.2631s for 16399 events => throughput is 1.60E-05 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.8970s
[COUNTERS] Fortran Overhead ( 0 ) : 0.3151s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5182s for 98304 events => throughput is 5.27E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0952s for 98371 events => throughput is 9.67E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.7950s for 589824 events => throughput is 1.35E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.1729s for 98371 events => throughput is 1.76E-06 events/s
…g if a counter has been defined: use string comparison to "", it is not slower
Revert "[prof] in gg_tt.mad counters.cc add a flag showing if a counter has been defined"
This reverts commit ee6f9f5.
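The choice described in this commit (testing a counter's name against the empty string instead of keeping a separate "defined" flag) can be sketched as follows. This is only an illustration under stated assumptions, not the code in counters.cc: the namespace `counters_sketch`, the table size `kMaxCounters`, and the functions `registerCounter` and `isDefined` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

namespace counters_sketch // hypothetical namespace, not the one used in counters.cc
{
  constexpr std::size_t kMaxCounters = 32; // assumed table size

  struct Counter
  {
    std::string name;   // an empty name means "slot not registered"
    double seconds = 0; // accumulated time
    long events = 0;    // accumulated number of events
  };

  // Single table of counters, created on first use
  inline std::array<Counter, kMaxCounters>& table()
  {
    static std::array<Counter, kMaxCounters> counters;
    return counters;
  }

  // A slot is defined if and only if its name is not the empty string
  inline bool isDefined( std::size_t index )
  {
    assert( index < kMaxCounters );
    return table()[index].name != "";
  }

  // Explicit registration: refuse empty names and double registration
  inline void registerCounter( std::size_t index, const std::string& name )
  {
    if( name.empty() ) throw std::invalid_argument( "counter name must not be empty" );
    if( isDefined( index ) ) throw std::runtime_error( "counter slot already registered" );
    table()[index].name = name;
  }
}
```

With this layout no extra boolean has to be kept in sync with the name, which is one plausible reason to prefer the string comparison once it is measured to be no slower.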
…BLECOUNTERS to disable individual counters
I initially wanted to use this to check if it is the individual counters that caused the 4.1s in x10 tests. But in the meantime I understood that the problem is elsewhere, and that timings depend on execution order! Will probably revert!

Note, the second x1 execution takes 0.7s, with or without CUDACPP_RUNTIME_DISABLECOUNTERS
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7485s
[COUNTERS] Fortran Overhead ( 0 ) : 0.2472s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0872s for 16384 events => throughput is 5.32E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0166s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1346s for 98304 events => throughput is 1.37E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.2621s for 16399 events => throughput is 1.60E-05 events/s
CUDACPP_RUNTIME_DISABLECOUNTERS=1 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7349s

And then the first x10 execution takes 1.9s, with or without CUDACPP_RUNTIME_DISABLECOUNTERS
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9127s
[COUNTERS] Fortran Overhead ( 0 ) : 0.3268s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5172s for 98304 events => throughput is 5.26E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0964s for 98371 events => throughput is 9.80E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.7992s for 589824 events => throughput is 1.36E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.1723s for 98371 events => throughput is 1.75E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
CUDACPP_RUNTIME_DISABLECOUNTERS=1 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.8511s

While the SECOND x10 execution takes 4.1s, with or without CUDACPP_RUNTIME_DISABLECOUNTERS
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.1152s
[COUNTERS] Fortran Overhead ( 0 ) : 1.1174s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5173s for 98304 events => throughput is 5.26E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0950s for 98371 events => throughput is 9.65E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.8117s for 589824 events => throughput is 1.38E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 1.5731s for 98371 events => throughput is 1.60E-05 events/s
CUDACPP_RUNTIME_DISABLECOUNTERS=1 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.0680s

Will therefore revert this
…CUDACPP_RUNTIME_DISABLECOUNTERS to disable individual counters
Revert "[prof] in gg_tt.mad counters add an env variable CUDACPP_RUNTIME_DISABLECOUNTERS to disable individual counters"
This reverts commit 0681a76.
…ther and make it counter[0]
No change in the timings

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7531s
[COUNTERS] Fortran Other ( 0 ) : 0.2447s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0862s for 16384 events => throughput is 5.26E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0166s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1395s for 98304 events => throughput is 1.42E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.2653s for 16399 events => throughput is 1.62E-05 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9572s
[COUNTERS] Fortran Other ( 0 ) : 0.3215s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5202s for 98304 events => throughput is 5.29E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0941s for 98371 events => throughput is 9.57E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.8486s for 589824 events => throughput is 1.44E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.1720s for 98371 events => throughput is 1.75E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7543s
[COUNTERS] Fortran Other ( 0 ) : 0.2451s
[COUNTERS] Fortran X2F ( 1 ) : 0.0163s for 16399 events => throughput is 9.95E-07 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1419s for 98304 events => throughput is 1.44E-06 events/s
[COUNTERS] Fortran I/O ( 3 ) : 0.2617s for 16399 events => throughput is 1.60E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0885s for 16384 events => throughput is 5.40E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9649s
[COUNTERS] Fortran Other ( 0 ) : 0.3239s
[COUNTERS] Fortran X2F ( 1 ) : 0.0951s for 98371 events => throughput is 9.67E-07 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.8467s for 589824 events => throughput is 1.44E-06 events/s
[COUNTERS] Fortran I/O ( 3 ) : 0.1783s for 98371 events => throughput is 1.81E-06 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.5202s for 98304 events => throughput is 5.29E-06 events/s
…xcluded from fortran other calculation)

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7510s
[COUNTERS] Fortran Other ( 0 ) : 0.2485s
[COUNTERS] Fortran X2F ( 1 ) : 0.0163s for 16399 events => throughput is 9.94E-07 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1359s for 98304 events => throughput is 1.38E-06 events/s
[COUNTERS] Fortran I/O ( 3 ) : 0.2628s for 16399 events => throughput is 1.60E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0868s for 16384 events => throughput is 5.30E-06 events/s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6822s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9135s
[COUNTERS] Fortran Other ( 0 ) : 0.3225s
[COUNTERS] Fortran X2F ( 1 ) : 0.0938s for 98371 events => throughput is 9.54E-07 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.7961s for 589824 events => throughput is 1.35E-06 events/s
[COUNTERS] Fortran I/O ( 3 ) : 0.1819s for 98371 events => throughput is 1.85E-06 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.5184s for 98304 events => throughput is 5.27E-06 events/s
[COUNTERS] PROGRAM sample_full ( 11 ) : 1.8445s
… that what is left is something inside sample_full
Rephrasing: programtotal = samplefull + initialIO
And FortranOther is inside sample_full

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7697s
[COUNTERS] Fortran Other ( 0 ) : 0.1810s
[COUNTERS] Fortran X2F ( 1 ) : 0.0166s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1355s for 98304 events => throughput is 1.38E-06 events/s
[COUNTERS] Fortran I/O ( 3 ) : 0.2672s for 16399 events => throughput is 1.63E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0877s for 16384 events => throughput is 5.35E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0808s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6860s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 2.0621s
[COUNTERS] Fortran Other ( 0 ) : 0.2829s
[COUNTERS] Fortran X2F ( 1 ) : 0.1024s for 98371 events => throughput is 1.04E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.8580s for 589824 events => throughput is 1.45E-06 events/s
[COUNTERS] Fortran I/O ( 3 ) : 0.1838s for 98371 events => throughput is 1.87E-06 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.5532s for 98304 events => throughput is 5.63E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0811s
[COUNTERS] PROGRAM sample_full ( 11 ) : 1.9780s
…side the function to the calling sequence in sample_full

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7679s
[COUNTERS] Fortran Other ( 0 ) : 0.1849s
[COUNTERS] Fortran X2F ( 1 ) : 0.0169s for 16399 events => throughput is 1.03E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1380s for 98304 events => throughput is 1.40E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2611s for 16399 events => throughput is 1.59E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0008s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0877s for 16384 events => throughput is 5.35E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0785s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6862s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9454s
[COUNTERS] Fortran Other ( 0 ) : 0.2618s
[COUNTERS] Fortran X2F ( 1 ) : 0.0961s for 98371 events => throughput is 9.77E-07 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.8161s for 589824 events => throughput is 1.38E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.1695s for 98371 events => throughput is 1.72E-06 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0008s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.5216s for 98304 events => throughput is 5.31E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0794s
[COUNTERS] PROGRAM sample_full ( 11 ) : 1.8627s
…ing (as "test12" for the moment, wip)

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7447s
[COUNTERS] Fortran Other ( 0 ) : 0.1308s
[COUNTERS] Fortran X2F ( 1 ) : 0.0163s for 16399 events => throughput is 9.93E-07 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1328s for 98304 events => throughput is 1.35E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2614s for 16399 events => throughput is 1.59E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0878s for 16384 events => throughput is 5.36E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0649s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6768s
[COUNTERS] Fortran TEST ( 12 ) : 0.0499s for 16384 events => throughput is 3.05E-06 events/s
…or the moment, wip)

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7526s
[COUNTERS] Fortran Other ( 0 ) : 0.1163s
[COUNTERS] Fortran X2F ( 1 ) : 0.0165s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1428s for 98304 events => throughput is 1.45E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2589s for 16399 events => throughput is 1.58E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0870s for 16384 events => throughput is 5.31E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0659s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6829s
[COUNTERS] Fortran TEST ( 12 ) : 0.0537s for 16384 events => throughput is 3.28E-06 events/s
[COUNTERS] Fortran TEST2 ( 13 ) : 0.0108s for 16384 events => throughput is 6.58E-07 events/s
This essentially completes the identification of all bottlenecks.
The timers must now be cleaned up, and double counting removed: "Fortran Other" is now negative.

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7581s
[COUNTERS] Fortran Other ( 0 ) : -0.0298s
[COUNTERS] Fortran X2F ( 1 ) : 0.0168s for 16399 events => throughput is 1.02E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1441s for 98304 events => throughput is 1.47E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2627s for 16399 events => throughput is 1.60E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0882s for 16384 events => throughput is 5.38E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0656s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6896s
[COUNTERS] Fortran TEST ( 12 ) : 0.0533s for 16384 events => throughput is 3.25E-06 events/s
[COUNTERS] Fortran TEST2 ( 13 ) : 0.0105s for 16384 events => throughput is 6.41E-07 events/s
[COUNTERS] Fortran TEST5 ( 16 ) : 0.1461s for 16384 events => throughput is 8.91E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7519s
[COUNTERS] Fortran Other ( 0 ) : -0.0299s
[COUNTERS] Fortran X2F ( 1 ) : 0.0165s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1421s for 98304 events => throughput is 1.45E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2589s for 16399 events => throughput is 1.58E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0873s for 16384 events => throughput is 5.33E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0651s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6838s
[COUNTERS] Fortran TEST ( 12 ) : 0.0542s for 16384 events => throughput is 3.31E-06 events/s
[COUNTERS] Fortran TEST2 ( 13 ) : 0.0102s for 16384 events => throughput is 6.26E-07 events/s
[COUNTERS] Fortran TEST5 ( 16 ) : 0.1467s for 16384 events => throughput is 8.95E-06 events/s
…er.f

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7533s
[COUNTERS] Fortran Other ( 0 ) : -0.0253s
[COUNTERS] Fortran X2F ( 1 ) : 0.0165s for 16399 events => throughput is 1.00E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1355s for 98304 events => throughput is 1.38E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2633s for 16399 events => throughput is 1.61E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0008s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0897s for 16384 events => throughput is 5.48E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0649s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6855s
[COUNTERS] Fortran TEST ( 12 ) : 0.0490s for 16384 events => throughput is 2.99E-06 events/s
[COUNTERS] Fortran TEST2 ( 13 ) : 0.0102s for 16384 events => throughput is 6.20E-07 events/s
[COUNTERS] Fortran TEST5 ( 16 ) : 0.1488s for 16384 events => throughput is 9.08E-06 events/s
…g1.f
This changes the overall balance: "Fortran Other" is positive again. This is because pdg2pdf is also called elsewhere (e.g. in unwgt?), in code that was already profiled by another counter.

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7551s
[COUNTERS] Fortran Other ( 0 ) : 0.0111s
[COUNTERS] Fortran X2F ( 1 ) : 0.0168s for 16399 events => throughput is 1.02E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.0986s for 32768 events => throughput is 3.01E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2633s for 16399 events => throughput is 1.61E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0879s for 16384 events => throughput is 5.36E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0662s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6862s
[COUNTERS] Fortran TEST ( 12 ) : 0.0515s for 16384 events => throughput is 3.14E-06 events/s
[COUNTERS] Fortran TEST2 ( 13 ) : 0.0099s for 16384 events => throughput is 6.07E-07 events/s
[COUNTERS] Fortran TEST5 ( 16 ) : 0.1492s for 16384 events => throughput is 9.11E-06 events/s
Now "Fortran Other" becomes negative again: there is again some double counting

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7511s
[COUNTERS] Fortran Other ( 0 ) : -0.0373s
[COUNTERS] Fortran X2F ( 1 ) : 0.0168s for 16399 events => throughput is 1.02E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.0965s for 32768 events => throughput is 2.94E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2598s for 16399 events => throughput is 1.58E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0008s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0868s for 16384 events => throughput is 5.30E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0670s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6811s
[COUNTERS] Fortran TEST ( 12 ) : 0.0506s for 16384 events => throughput is 3.09E-06 events/s
[COUNTERS] Fortran TEST2 ( 13 ) : 0.0099s for 16384 events => throughput is 6.01E-07 events/s
[COUNTERS] Fortran TEST3 ( 14 ) : 0.0541s for 16384 events => throughput is 3.30E-06 events/s
[COUNTERS] Fortran TEST5 ( 16 ) : 0.1462s for 16384 events => throughput is 8.93E-06 events/s
This makes it clearer that programtotal = samplefull + initialIO

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7554s
[COUNTERS] Fortran Other ( 0 ) : -0.0393s
[COUNTERS] Fortran X2F ( 1 ) : 0.0171s for 16399 events => throughput is 1.04E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.0984s for 32768 events => throughput is 3.00E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2621s for 16399 events => throughput is 1.60E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0872s for 16384 events => throughput is 5.32E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0688s
[COUNTERS] Fortran TEST ( 12 ) : 0.0521s for 16384 events => throughput is 3.18E-06 events/s
[COUNTERS] Fortran TEST2 ( 13 ) : 0.0100s for 16384 events => throughput is 6.08E-07 events/s
[COUNTERS] Fortran TEST3 ( 14 ) : 0.0507s for 16384 events => throughput is 3.09E-06 events/s
[COUNTERS] Fortran TEST5 ( 16 ) : 0.1478s for 16384 events => throughput is 9.02E-06 events/s
[COUNTERS] PROGRAM initial_I/O ( 19 ) : 0.0688s
[COUNTERS] PROGRAM sample_full ( 20 ) : 0.6838s
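The negative "Fortran Other" values can be reproduced with simple arithmetic, assuming (this is not shown in the conversation) that "Fortran Other" is the program total minus the sum of all other non-PROGRAM counters. The sketch below uses the figures of the x1 log just above; `residualSeconds` and `residualFromLogAbove` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Residual "other" time: the part of the total that no explicit counter claims.
// Assumption: this mirrors how "Fortran Other" is derived (the formula is not shown in the PR text).
inline double residualSeconds( double programTotal, const std::vector<double>& counterSeconds )
{
  const double accounted = std::accumulate( counterSeconds.begin(), counterSeconds.end(), 0.0 );
  return programTotal - accounted;
}

// Figures copied from the x1 log above: counters 1 to 16, PROGRAM counters excluded.
inline double residualFromLogAbove()
{
  const std::vector<double> counters = {
    0.0171, // Fortran X2F
    0.0984, // Fortran PDF
    0.2621, // Fortran final_I/O
    0.0007, // CudaCpp HEL
    0.0872, // CudaCpp MEs
    0.0688, // Fortran initial_I/O
    0.0521, // Fortran TEST
    0.0100, // Fortran TEST2
    0.0507, // Fortran TEST3
    0.1478  // Fortran TEST5
  };
  return residualSeconds( 0.7554, counters );
}
```

Under that assumption the residual is about -0.0395s, close to the -0.0393s printed in the log (the small difference is compatible with rounding of the printed values). A negative residual means that the counters together claim more time than the program took, which is what happens when a nested region is timed both by its own counter and by an enclosing one.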
…econds() call and go back to the old getTotalDurationSeconds
…mer overhead if CUDACPP_RUNTIME_REMOVETIMEROVERHEAD is set
However, test counters like sample_get_x need special handling
…UNTERS, remove special meaning of PROGRAM counters
…ng a TEST counter as included in a non-TEST counter, to subtract overheads
…ated counter overhead
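The overhead subtraction described in these commits can be sketched as plain arithmetic: calibrate the cost of one start/stop cycle, then subtract it once per call from each counter, including the calls of any counter nested inside it. This is a minimal sketch under those assumptions, not the PR's implementation; `overheadPerCall` and `removeTimerOverhead` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Cost in seconds of a single start/stop cycle, from a calibration run over nCycles cycles.
inline double overheadPerCall( double calibrationSeconds, std::uint64_t nCycles )
{
  assert( nCycles > 0 );
  return calibrationSeconds / static_cast<double>( nCycles );
}

// Corrected time of a counter: subtract the overhead of its own start/stop cycles
// and of the cycles of counters nested inside it (assumed to be timed while it is running).
inline double removeTimerOverhead( double rawSeconds,
                                   std::uint64_t ownCalls,
                                   std::uint64_t nestedCalls,
                                   double secondsPerCall )
{
  return rawSeconds - static_cast<double>( ownCalls + nestedCalls ) * secondsPerCall;
}
```

As a plausibility check with figures quoted later in this conversation: an overhead of 0.0179s per 1M cycles and 14136681 calls turn a raw 2.3519s into roughly 2.099s, in line with the value reported for TEST SampleGetX when overhead removal is enabled. Those two figures come from separate runs, so the agreement is only indicative.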
…SpaceSampling
These are the first results where timer overhead is removed. They look nice, but the overhead should be computed in the counters.cc calls rather than in the individual timers (this would also make more sense with respect to timermap.h, where this will not be possible; rename the env variable, too)

./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 4.4608s
[COUNTERS] Fortran Other ( 0 ) : 0.1171s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0690s
[COUNTERS] Fortran PhaseSpaceSampling ( 3 ) : 3.2317s for 1087437 events => throughput is 3.36E+05 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.0917s for 32768 events => throughput is 3.57E+05 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1719s for 16384 events => throughput is 9.53E+04 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0483s for 16384 events => throughput is 3.39E+05 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0691s for 16384 events => throughput is 2.37E+05 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1276s for 1087437 events => throughput is 8.52E+06 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4718s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0269s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s
[COUNTERS] TEST SampleGetX ( 21 ) : 2.3519s for 14136681 events => throughput is 6.01E+06 events/s
[COUNTERS] OVERALL NON-MEs ( 31 ) : 4.4251s
[COUNTERS] OVERALL MEs ( 32 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 5.2204s
[COUNTERS] Fortran Other ( 0 ) : 0.1550s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0697s
[COUNTERS] Fortran PhaseSpaceSampling ( 3 ) : 3.9335s for 1087437 events => throughput is 2.76E+05 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.0924s for 32768 events => throughput is 3.55E+05 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1722s for 16384 events => throughput is 9.52E+04 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0487s for 16384 events => throughput is 3.36E+05 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0689s for 16384 events => throughput is 2.38E+05 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1401s for 1087437 events => throughput is 7.76E+06 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4779s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0263s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0358s for 16384 events => throughput is 4.58E+05 events/s
[COUNTERS] TEST SampleGetX ( 21 ) : 2.8064s for 14136681 events => throughput is 5.04E+06 events/s
[COUNTERS] OVERALL NON-MEs ( 31 ) : 5.1846s
[COUNTERS] OVERALL MEs ( 32 ) : 0.0358s for 16384 events => throughput is 4.58E+05 events/s

CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
INFO: RdtscTimer overhead : 0.0179s for 1M start/stop cycles
[COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD : 4.4668s
[COUNTERS] PROGRAM COUNTEROVERHEAD : 0.2924s
-------------------------------------------------------------
[COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 4.1745s
[COUNTERS] Fortran Other ( 0 ) : 0.1190s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0696s
[COUNTERS] Fortran PhaseSpaceSampling ( 3 ) : 2.9612s for 1087437 events => throughput is 3.67E+05 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.0913s for 32768 events => throughput is 3.59E+05 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1709s for 16384 events => throughput is 9.59E+04 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0482s for 16384 events => throughput is 3.40E+05 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0678s for 16384 events => throughput is 2.42E+05 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1125s for 1087437 events => throughput is 9.67E+06 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4716s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0266s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0358s for 16384 events => throughput is 4.58E+05 events/s
[COUNTERS] TEST SampleGetX ( 21 ) : 2.0989s for 14136681 events => throughput is 6.74E+06 events/s
[COUNTERS] OVERALL NON-MEs ( 31 ) : 4.1387s
[COUNTERS] OVERALL MEs ( 32 ) : 0.0358s for 16384 events => throughput is 4.58E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
INFO: ChronoTimer overhead : 0.0489s for 1M start/stop cycles
[COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD : 5.2779s
[COUNTERS] PROGRAM COUNTEROVERHEAD : 0.7998s
-------------------------------------------------------------
[COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 4.4781s
[COUNTERS] Fortran Other ( 0 ) : 0.1570s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0669s
[COUNTERS] Fortran PhaseSpaceSampling ( 3 ) : 3.2485s for 1087437 events => throughput is 3.35E+05 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.0930s for 32768 events => throughput is 3.52E+05 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1716s for 16384 events => throughput is 9.55E+04 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0474s for 16384 events => throughput is 3.46E+05 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0681s for 16384 events => throughput is 2.41E+05 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.0929s for 1087437 events => throughput is 1.17E+07 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4705s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0266s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s
[COUNTERS] TEST SampleGetX ( 21 ) : 2.1629s for 14136681 events => throughput is 6.54E+06 events/s
[COUNTERS] OVERALL NON-MEs ( 31 ) : 4.4424s
[COUNTERS] OVERALL MEs ( 32 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD : 3.8210s
[COUNTERS] PROGRAM COUNTEROVERHEAD : 0.0000s
-------------------------------------------------------------
[COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 3.8210s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD : 3.8301s
[COUNTERS] PROGRAM COUNTEROVERHEAD : 0.0000s
-------------------------------------------------------------
[COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 3.8301s
…s: this will be moved to counters alone
Revert "[prof] in gux_taptamggux.mad timer.h, add instead a getTotalOverheadSeconds() call and go back to the old getTotalDurationSeconds"
This reverts commit ad9b747.
Revert "[prof] in gux_taptamggux.mad timer.h, add the option to remove overhead from getTotalDurationSeconds calls"
This reverts commit 5c0a2ed.
…unter overhead (remove it from timer.h: there will be none for timermap.h)
Rename the env as CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD to make it clear that this is in the counters.cc infrastructure
These are the results

(1) keep overhead

./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 4.5315s
[COUNTERS] Fortran Other ( 0 ) : 0.1198s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0678s
[COUNTERS] Fortran PhaseSpaceSampling ( 3 ) : 3.2691s for 1087437 events => throughput is 3.33E+05 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.1044s for 32768 events => throughput is 3.14E+05 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1757s for 16384 events => throughput is 9.33E+04 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0543s for 16384 events => throughput is 3.02E+05 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0731s for 16384 events => throughput is 2.24E+05 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1322s for 1087437 events => throughput is 8.23E+06 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4719s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0274s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0358s for 16384 events => throughput is 4.57E+05 events/s
[COUNTERS] TEST SampleGetX ( 21 ) : 2.3686s for 14136681 events => throughput is 5.97E+06 events/s
[COUNTERS] OVERALL NON-MEs ( 31 ) : 4.4957s
[COUNTERS] OVERALL MEs ( 32 ) : 0.0358s for 16384 events => throughput is 4.57E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 5.2048s
[COUNTERS] Fortran Other ( 0 ) : 0.1559s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0673s
[COUNTERS] Fortran PhaseSpaceSampling ( 3 ) : 3.9265s for 1087437 events => throughput is 2.77E+05 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.0993s for 32768 events => throughput is 3.30E+05 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1648s for 16384 events => throughput is 9.94E+04 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0514s for 16384 events => throughput is 3.19E+05 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0700s for 16384 events => throughput is 2.34E+05 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1365s for 1087437 events => throughput is 7.97E+06 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4711s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0264s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s
[COUNTERS] TEST SampleGetX ( 21 ) : 2.8006s for 14136681 events => throughput is 5.05E+06 events/s
[COUNTERS] OVERALL NON-MEs ( 31 ) : 5.1691s
[COUNTERS] OVERALL MEs ( 32 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s

(2) remove overhead

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
INFO: COUNTERS overhead : 0.0331s for 1M start/stop cycles
[COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD : 4.5208s
[COUNTERS] PROGRAM COUNTEROVERHEAD : 0.5413s
-------------------------------------------------------------
[COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 3.9795s
[COUNTERS] Fortran Other ( 0 ) : 0.1548s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0670s
[COUNTERS] Fortran PhaseSpaceSampling ( 3 ) : 2.7547s for 1087437 events => throughput is 3.95E+05 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.0988s for 32768 events => throughput is 3.32E+05 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1639s for 16384 events => throughput is 1.00E+05 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0510s for 16384 events => throughput is 3.21E+05 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0674s for 16384 events => throughput is 2.43E+05 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.0898s for 1087437 events => throughput is 1.21E+07 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4700s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0266s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0356s for 16384 events => throughput is 4.60E+05 events/s
[COUNTERS] TEST SampleGetX ( 21 ) : 1.8855s for 14136681 events => throughput is 7.50E+06 events/s
[COUNTERS] OVERALL NON-MEs ( 31 ) : 3.9439s
[COUNTERS] OVERALL MEs ( 32 ) : 0.0356s for 16384 events => throughput is 4.60E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
INFO: COUNTERS overhead : 0.0640s for 1M start/stop cycles
[COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD : 5.3491s
[COUNTERS] PROGRAM COUNTEROVERHEAD : 1.0455s
-------------------------------------------------------------
[COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 4.3036s
[COUNTERS] Fortran Other ( 0 ) : 0.2216s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0692s
[COUNTERS] Fortran PhaseSpaceSampling ( 3 ) : 3.0230s for 1087437 events => throughput is 3.60E+05 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.0992s for 32768 events => throughput is 3.30E+05 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1652s for 16384 events => throughput is 9.92E+04 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0504s for 16384 events => throughput is 3.25E+05 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0684s for 16384 events => throughput is 2.39E+05 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.0716s for 1087437 events => throughput is 1.52E+07 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4727s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0266s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s
[COUNTERS] TEST SampleGetX ( 21 ) : 1.9427s for 14136681 events => throughput is 7.28E+06 events/s
[COUNTERS] OVERALL NON-MEs ( 31 ) : 4.2679s
[COUNTERS] OVERALL MEs ( 32 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s

(3) remove overhead, disable individual timers (so here the overhead is 0)

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
INFO: COUNTERS overhead : 0.0039s for 1M start/stop cycles
[COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD : 3.7998s
[COUNTERS] PROGRAM COUNTEROVERHEAD : 0.0000s
-------------------------------------------------------------
[COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 3.7998s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
INFO: COUNTERS overhead : 0.0038s for 1M start/stop cycles
[COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD : 3.9067s
[COUNTERS] PROGRAM COUNTEROVERHEAD : 0.0000s
-------------------------------------------------------------
[COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 3.9067s
I have now added (only here in prof) a mechanism to remove the timer overhead from the timing measurements. For RDTSC timers this looks adequate (for chrono timers a bit less so, but this is not what I would use anyway). These are the results, see valassi@6dcab81 (initially this was 6083af1, then I added the last bullet 4 and force pushed)
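The overhead-removal idea can be illustrated with a minimal C++ sketch. This is not the code in counters.cc: the names `Counter`, `overheadPerCycle`, `counterOverhead` and `correctedTotal` are hypothetical, the sketch uses `std::chrono::steady_clock` for portability, and it assumes that the overhead scales linearly with the number of start/stop cycles. The subtraction itself matches what the logs report: in one of the RDTSC runs, PROGRAM TOTAL+COUNTEROVERHEAD (4.5208s) minus PROGRAM COUNTEROVERHEAD (0.5413s) gives PROGRAM TOTAL (3.9795s).

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical counter (illustration only): accumulates wall time and counts start/stop cycles.
struct Counter
{
  using Clock = std::chrono::steady_clock;
  Clock::time_point t0{};
  double seconds = 0;      // accumulated time, including the cost of the timer calls themselves
  std::uint64_t calls = 0; // number of completed start/stop cycles
  bool running = false;
  void start()
  {
    assert( !running );
    running = true;
    t0 = Clock::now();
  }
  void stop()
  {
    assert( running );
    seconds += std::chrono::duration<double>( Clock::now() - t0 ).count();
    calls++;
    running = false;
  }
};

// Calibration: average cost in seconds of one empty start/stop cycle (default 1M cycles).
inline double overheadPerCycle( std::uint64_t ncycles = 1000000 )
{
  Counter c;
  const auto t0 = Counter::Clock::now();
  for( std::uint64_t i = 0; i < ncycles; i++ )
  {
    c.start();
    c.stop();
  }
  const double total = std::chrono::duration<double>( Counter::Clock::now() - t0 ).count();
  return total / static_cast<double>( ncycles );
}

// Assumption: total overhead = number of timer cycles x per-cycle cost.
inline double counterOverhead( std::uint64_t ncalls, double perCycle )
{
  return static_cast<double>( ncalls ) * perCycle;
}

// Corrected total = raw total minus the estimated overhead (clamped at zero).
inline double correctedTotal( double rawTotal, std::uint64_t ncalls, double perCycle )
{
  const double corrected = rawTotal - counterOverhead( ncalls, perCycle );
  return corrected > 0 ? corrected : 0;
}
```

With individual timers disabled there are no start/stop cycles to correct for, so under this model `correctedTotal( raw, 0, perCycle )` simply returns the raw total.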
…ter overhead
These are the results

(1) keep overhead

./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 4.4766s
[COUNTERS] Fortran Other ( 0 ) : 0.1202s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0685s
[COUNTERS] Fortran PhaseSpaceSampling ( 3 ) : 3.2400s for 1087437 events => throughput is 3.36E+05 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.1007s for 32768 events => throughput is 3.25E+05 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1673s for 16384 events => throughput is 9.79E+04 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0521s for 16384 events => throughput is 3.14E+05 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0687s for 16384 events => throughput is 2.38E+05 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1237s for 1087437 events => throughput is 8.79E+06 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4728s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0269s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s
[COUNTERS] TEST SampleGetX ( 21 ) : 2.3496s for 14136681 events => throughput is 6.02E+06 events/s
[COUNTERS] OVERALL NON-MEs ( 31 ) : 4.4409s
[COUNTERS] OVERALL MEs ( 32 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 5.3144s
[COUNTERS] Fortran Other ( 0 ) : 0.1588s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0674s
[COUNTERS] Fortran PhaseSpaceSampling ( 3 ) : 4.0191s for 1087437 events => throughput is 2.71E+05 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.0996s for 32768 events => throughput is 3.29E+05 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1660s for 16384 events => throughput is 9.87E+04 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0508s for 16384 events => throughput is 3.22E+05 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0704s for 16384 events => throughput is 2.33E+05 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1482s for 1087437 events => throughput is 7.34E+06 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4718s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0267s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s
[COUNTERS] TEST SampleGetX ( 21 ) : 2.8646s for 14136681 events => throughput is 4.94E+06 events/s
[COUNTERS] OVERALL NON-MEs ( 31 ) : 5.2787s
[COUNTERS] OVERALL MEs ( 32 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s

(2) remove overhead

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
INFO: COUNTERS overhead : 0.0338s for 1M start/stop cycles
[COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD : 4.8244s
[COUNTERS] PROGRAM COUNTEROVERHEAD : 0.8905s
-------------------------------------------------------------
[COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 3.9339s
[COUNTERS] Fortran Other ( 0 ) : 0.2954s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0674s
[COUNTERS] Fortran PhaseSpaceSampling ( 3 ) : 2.7332s for 1087437 events => throughput is 3.98E+05 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.1003s for 32768 events => throughput is 3.27E+05 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1688s for 16384 events => throughput is 9.71E+04 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0507s for 16384 events => throughput is 3.23E+05 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0695s for 16384 events => throughput is 2.36E+05 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.0924s for 1087437 events => throughput is 1.18E+07 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4692s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0263s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s
[COUNTERS] TEST SampleGetX ( 21 ) : 1.8723s for 14136681 events => throughput is 7.55E+06 events/s
[COUNTERS] OVERALL NON-MEs ( 31 ) : 3.8982s
[COUNTERS] OVERALL MEs ( 32 ) : 0.0357s for 16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
INFO: COUNTERS overhead : 0.0637s for 1M start/stop cycles
[COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD : 5.8826s
[COUNTERS] PROGRAM COUNTEROVERHEAD : 1.6786s
-------------------------------------------------------------
[COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 4.2040s
[COUNTERS] Fortran Other ( 0 ) : 0.4831s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0691s
[COUNTERS] Fortran PhaseSpaceSampling ( 3 ) : 2.9924s for 1087437 events => throughput is 3.63E+05 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.0983s for 32768 events => throughput is 3.33E+05 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1669s for 16384 events => throughput is 9.81E+04 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0506s for 16384 events => throughput is 3.24E+05 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0676s for 16384 events => throughput is 2.42E+05 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.0698s for 1087437 events => throughput is 1.56E+07 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4712s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0267s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0350s for 16384 events => throughput is 4.68E+05 events/s
[COUNTERS] TEST SampleGetX ( 21 ) : 1.9227s for 14136681 events => throughput is 7.35E+06 events/s
[COUNTERS] OVERALL NON-MEs ( 31 ) : 4.1690s
[COUNTERS] OVERALL MEs ( 32 ) : 0.0350s for 16384 events => throughput is 4.68E+05 events/s

(3) remove overhead, disable individual timers (so here the overhead is 0)

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
INFO: COUNTERS overhead : 0.0333s for 1M start/stop cycles
[COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD : 4.1897s
[COUNTERS] PROGRAM COUNTEROVERHEAD : 0.3330s
-------------------------------------------------------------
[COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 3.8567s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
INFO: COUNTERS overhead : 0.0659s for 1M start/stop cycles
[COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD : 4.5119s
[COUNTERS] PROGRAM COUNTEROVERHEAD : 0.6594s
-------------------------------------------------------------
[COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 3.8525s

(4) do not remove overhead, disable individual timers (remove also the overhead from the estimation of the overhead)
(this test was done on another day on the same machine and build, but the results are compatible with the previous ones)

CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 3.8072s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
[COUNTERS] PROGRAM TOTAL : 3.8214s
Reminder of two things to do
…r merging
git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
…Source/makefile madgraph5#980) into prof (Checked that regenerating gg_tt.mad is all ok)
…ier merging
git checkout upstream/master $(git ls-tree --name-only HEAD tput/logs* tmad/logs*)
…nerated code except gg_tt.mad for easier merging
git checkout upstream/master $(git ls-tree --name-only upstream/master *.mad/SubProcesses/P*/auto_dsig1.f | grep -v ^gg_tt.mad)
…dhel, for360) into prof
Fix conflicts:
- epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f (use upstream/master, will add back all counters as in prof)
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1 (use upstream/master, will regenerate this)
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common (use upstream/master, will regenerate this)
…f branch before merging upstream/master (fix conflicts)
…pstream/master including june24, goodhel, for360
The only files that still need to be patched are
- 2 in patch.common: Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f
Note: this is 3 files more than those needed in upstream/master (added Source/dsample.f, auto_dsig1.f, auto_dsig.f)
./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
(Later checked that gg_tt.mad can be regenerated ok)
…' (including june24, goodhel, for360) into prof
Also add to the repo a few missing files in gux_taptamggux.mad and nobm_pp_ttW.mad
…ging
git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
…ated code except gg_tt.mad for easier merging
git checkout upstream/master $(git ls-tree --name-only upstream/master *.mad/Source/dsample.f | grep -v ^gg_tt.mad)
…also amd and v1.00.01 fixes) into prof
Fix conflicts (use upstream/master version):
- epochX/cudacpp/gg_tt.mad/Source/dsample.f
Will then regenerate patches from this gg_tt.mad
…/master including v1.00.00 and also amd and v1.00.01 fixes
The only files that still need to be patched are
- 2 in patch.common: Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f
Note: this is 3 files more than those needed in upstream/master (added Source/dsample.f, auto_dsig1.f, auto_dsig.f)
./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
(Later checked that regenerating gg_tt.mad gives no change)
…and also amd and v1.00.01 fixes)
This is a very WIP PR extending the work in #960. It is again related to the CMS issue #943 reported by @choij1589.
The idea is to further improve the timers and to profile other Fortran components.
I added profiling to
The first two are related to the findings with nice flamegraphs by @Qubitol
So far I only had time to test a simple gg_tt. Even here the results are quite interesting.
The problem is also that the overhead from the counters themselves starts to be significant, so it is difficult to do this well. The I/O counters in particular have a large overhead.
Anyway, this is WIP. For ggtt
But very often the same command gives 4s, so the timings are not very reliable...
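For context on why a tick-based timer can have a lower per-call cost than std::chrono, here is a minimal C++ sketch. It is an illustration only and not the timer implemented in this PR: `readTicks`, `ticksPerSecond` and `TickTimer` are hypothetical names, the code assumes an x86 CPU with an invariant TSC and a GCC/Clang-style `<x86intrin.h>`, and it falls back to `std::chrono::steady_clock` ticks elsewhere. The point is that start/stop only read a counter and add integers, while the conversion to seconds is done once at the end.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#if defined( __x86_64__ ) || defined( __i386__ )
#include <x86intrin.h>
#define SKETCH_HAS_RDTSC 1
#else
#define SKETCH_HAS_RDTSC 0
#endif

// Read a raw tick count: the TSC on x86, otherwise steady_clock ticks as a portable fallback.
inline std::uint64_t readTicks()
{
#if SKETCH_HAS_RDTSC
  return __rdtsc();
#else
  return static_cast<std::uint64_t>( std::chrono::steady_clock::now().time_since_epoch().count() );
#endif
}

// Estimate ticks per second by comparing tick counts against steady_clock over a short busy-wait.
inline double ticksPerSecond( double calibrationSeconds = 0.01 )
{
  using Clock = std::chrono::steady_clock;
  const auto t0 = Clock::now();
  const std::uint64_t c0 = readTicks();
  double elapsed = 0;
  do
  {
    elapsed = std::chrono::duration<double>( Clock::now() - t0 ).count();
  } while( elapsed < calibrationSeconds );
  const std::uint64_t c1 = readTicks();
  return static_cast<double>( c1 - c0 ) / elapsed;
}

// Hypothetical tick-based timer: start/stop accumulate raw ticks, seconds() converts once.
struct TickTimer
{
  std::uint64_t t0 = 0;
  std::uint64_t ticks = 0;
  void start() { t0 = readTicks(); }
  void stop() { ticks += readTicks() - t0; }
  double seconds( double tps ) const
  {
    assert( tps > 0 );
    return static_cast<double>( ticks ) / tps;
  }
};
```

A caveat of this approach is that raw TSC values are only meaningful as elapsed time if the TSC runs at a constant rate and is synchronised across cores, which is an assumption of the sketch rather than something it checks.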