Vectorized considered harmful? #4
This issue kept me up last night; I had this nagging feeling I was giving bad advice at https://discourse.julialang.org/t/best-way-to-optimize-code/37980/26 (I've followed up with the OP). FYI, this is potentially a real issue for the code in that thread: with a one-line change, my program using your package, which used to be 3x faster, goes to 80% slower, then 2x slower, and I guess keeps getting slower without bound (I just didn't have the patience to try larger values).
I can still use your package (e.g., in that program) for a speedup, without the scalability issue.
Benchmarking very short vectors (e.g., length == 1)

The vectorized RNG is unrolled: it keeps multiple sets of state so that it can generate several streams in parallel. What this means is that, when you call it, those sets of state have to be loaded from memory and then stored again when it is done, and of course, the higher the unroll factor, the more sets of state there are to load and store. This loading takes longer than turning an integer into a floating point number, hence the floating point version will be faster until your vectors are long enough to amortize that cost.
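To make that fixed overhead concrete, here is a minimal sketch of mine (not from the original exchange), using only the rand!/local_rng/default_rng calls that appear later in this thread; timings are omitted since they depend on your machine:

julia> using BenchmarkTools, Random, VectorizedRNG

julia> x1 = Vector{Float64}(undef, 1);  # a single element: loading/storing the unrolled state dominates

julia> @btime rand!($(local_rng()), $x1);  # vectorized RNG: expect it to lose at length 1

julia> @btime rand!($(Random.default_rng()), $x1);  # Base's task-local RNG carries much less state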
Deceptive buffers?
I'm not sure what you mean here.
Memory bandwidth
Memory bandwidth is much slower than the CPU. You need to do a huge amount of computation relative to the amount of memory you touch; otherwise, any program streaming through memory will be bottlenecked by memory bandwidth, and computation speed becomes irrelevant. Hence MT (MersenneTwister), a vectorized Xoshiro, or a scalar Xoshiro will all yield equal performance on sufficiently large arrays. I could help this by checking the array size and using nontemporal stores if the array is large. Nontemporal stores prevent the store from being cached, improving performance in these kinds of streaming scenarios; you can see how they improve performance once arrays greatly exceed cache size.
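As an illustration of that bandwidth ceiling (my sketch, not benchmarks from this thread): sweep array sizes from cache-resident up to far beyond cache and watch the two generators' throughput converge.

julia> using BenchmarkTools, Random, VectorizedRNG

julia> for n in (2^10, 2^14, 2^18, 2^22, 2^26)  # 8 KiB up to 512 MiB of Float64
           x = Vector{Float64}(undef, n)
           tv = @belapsed rand!($(local_rng()), $x)           # VectorizedRNG
           tb = @belapsed rand!($(Random.default_rng()), $x)  # Base's RNG
           println(lpad(n, 9), "  ", round(8n / tv / 1e9, digits = 1),
                   " GB/s vs ", round(8n / tb / 1e9, digits = 1), " GB/s")
       end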
How many numbers to generate?
Good question. As always, though: benchmark to compare, if possible.
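One way to act on that advice (a hypothetical helper of mine; the sizes are arbitrary, so tune them to your CPU): time the per-element cost across buffer sizes and pick the knee, i.e. the smallest size past which it stops improving.

julia> using BenchmarkTools, Random, VectorizedRNG

julia> for n in (16, 64, 256, 1024, 4096, 16384)
           x = Vector{Float64}(undef, n)
           t = @belapsed rand!($(local_rng()), $x)
           println(lpad(n, 6), "  ", round(t / n * 1e9, digits = 3), " ns/element")
       end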
Clock speed vs Vectorization

The README of this library shows benchmark results, but clock-speed behavior complicates such comparisons. For example, my computer runs at 4.1 GHz for AVX512, 4.3 GHz for AVX(2), and 4.6 GHz for non-AVX code. If a CPU is in non-AVX mode and encounters AVX code, it will run at the same clock speed but at something like 1/4 of the instructions per clock (IPC) for around 9 microseconds (that is 4x slower!), then not execute anything at all for 11 microseconds, before switching modes and resuming at full IPC at the reduced clock speed. 20 microseconds is a long time!

What is the rest of the code doing? Just as it would likely be better for the RNG not to run in AVX mode if the rest of the program doesn't, if the rest of the program is running AVX instructions, it would be better for the RNG to run AVX too. This is all just speculation on my part, based on articles like this one.
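Taking those admittedly speculative numbers at face value, a rough back-of-envelope estimate of the cost of a single mode transition (assuming the 4.3 GHz AVX2 clock from above):

julia> penalty_us = 9 * (1 - 1/4) + 11  # ~9 μs at quarter IPC costs ~6.75 μs of work, plus ~11 μs fully stalled
17.75

julia> penalty_us * 4.3e3               # μs × cycles per μs at 4.3 GHz: ~76k cycles lost per transition
76325.0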
Hi again, I can't believe I didn't write back at the time. I meant to, just to say I liked your very good responses; I recall I also had some comments in mind, which may be why I delayed. Anyway, I was looking this up since I was in contact with Vigna, after his work finally landed in Julia. In short: is this package now more or less outdated (while still working), since it's the same RNG, and since Julia's implementation is SIMD-vectorized too?
Hi @PallHaraldsson, not necessarily.

julia> using Random, VectorizedRNG
julia> x = Vector{Float64}(undef, 1024);
julia> lrng = local_rng(); drng = Random.default_rng(); typeof(drng)
TaskLocalRNG
julia> @benchmark rand!($lrng, $x)
samples: 10000; evals/sample: 728; memory estimate: 0 bytes; allocs estimate: 0
ns
(173.73 - 174.15] ██████████████████████████████ 8728
(174.15 - 174.56] 0
(174.56 - 174.98] 0
(174.98 - 175.4 ] ███▊1085
(175.4 - 175.81] ▍99
(175.81 - 176.23] ▎69
(176.23 - 176.64] ▏4
(176.64 - 177.06] ▏1
(177.06 - 177.48] 0
(177.48 - 177.89] 0
(177.89 - 178.31] ▏2
(178.31 - 178.73] 0
(178.73 - 179.14] 0
(179.14 - 179.56] ▏2
(179.56 - 224.13] ▏10
Counts
min: 173.729 ns (0.00% GC); mean: 174.005 ns (0.00% GC); median: 173.788 ns (0.00% GC); max: 224.135 ns (0.00% GC).
julia> @benchmark rand!($drng, $x)
samples: 10000; evals/sample: 369; memory estimate: 0 bytes; allocs estimate: 0
ns
(252.0 - 253.0] ██████████████████████████████ 8317
(253.0 - 253.9] ██▊729
(253.9 - 254.9] ▏1
(254.9 - 255.8] ██▏586
(255.8 - 256.8] █▏278
(256.8 - 257.7] ▎60
(257.7 - 258.7] ▏11
(258.7 - 259.7] ▏1
(259.7 - 260.6] 0
(260.6 - 261.6] ▏1
(261.6 - 262.5] ▏2
(262.5 - 263.5] 0
(263.5 - 264.4] ▏1
(264.4 - 265.4] ▏3
(265.4 - 384.4] ▏10
Counts
min: 252.008 ns (0.00% GC); mean: 253.063 ns (0.00% GC); median: 252.713 ns (0.00% GC); max: 384.358 ns (0.00% GC).
julia> @benchmark randn!($lrng, $x)
samples: 10000; evals/sample: 10; memory estimate: 0 bytes; allocs estimate: 0
ns
(1058.0 - 1078.0] ██████████████████████████████ 7161
(1078.0 - 1098.0] ███████████▌2723
(1098.0 - 1118.0] 0
(1118.0 - 1138.0] 0
(1138.0 - 1158.0] 0
(1158.0 - 1178.0] ▏25
(1178.0 - 1198.0] ▍64
(1198.0 - 1218.0] ▏7
(1218.0 - 1238.0] ▏4
(1238.0 - 1258.0] ▏3
(1258.0 - 1278.0] ▏1
(1278.0 - 1298.0] ▏1
(1298.0 - 1318.0] 0
(1318.0 - 1338.0] ▏1
(1338.0 - 2830.0] ▏10
Counts
min: 1.058 μs (0.00% GC); mean: 1.076 μs (0.00% GC); median: 1.076 μs (0.00% GC); max: 2.829 μs (0.00% GC).
julia> @benchmark randn!($drng, $x)
samples: 10000; evals/sample: 9; memory estimate: 0 bytes; allocs estimate: 0
ns
(2234.0 - 2285.0] ▊115
(2285.0 - 2335.0] █████████████▊2384
(2335.0 - 2386.0] ██████████████████████████████ 5208
(2386.0 - 2437.0] ███████████▎1940
(2437.0 - 2488.0] █▌240
(2488.0 - 2539.0] ▌75
(2539.0 - 2589.0] ▏19
(2589.0 - 2640.0] ▏2
(2640.0 - 2691.0] ▏1
(2691.0 - 2742.0] ▏3
(2742.0 - 2793.0] 0
(2793.0 - 2843.0] 0
(2843.0 - 2894.0] ▏1
(2894.0 - 2945.0] ▏2
(2945.0 - 5269.0] ▏10
Counts
min: 2.234 μs (0.00% GC); mean: 2.363 μs (0.00% GC); median: 2.359 μs (0.00% GC); max: 5.269 μs (0.00% GC).
julia> versioninfo()
Julia Version 1.7.0-DEV.1258
Commit 97f446baf3* (2021-06-06 20:55 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.0 (ORCJIT, cascadelake)

This computer has AVX512. For comparison, here are the same benchmarks on an Apple M1:

julia> using Random, VectorizedRNG
julia> x = Vector{Float64}(undef, 1024);
julia> lrng = local_rng(); drng = Random.default_rng(); typeof(drng)
TaskLocalRNG
julia> @benchmark rand!($lrng, $x)
samples: 10000; evals/sample: 129; memory estimate: 0 bytes; allocs estimate: 0
ns
(736.4 - 743.5] ██████████████████████████████ 9571
(743.5 - 750.5] ▏5
(750.5 - 757.5] ▏2
(757.5 - 764.6] ▉257
(764.6 - 771.6] ▎72
(771.6 - 778.7] ▏18
(778.7 - 785.7] ▏1
(785.7 - 792.7] ▏10
(792.7 - 799.8] ▏29
(799.8 - 806.8] ▏2
(806.8 - 813.8] ▏1
(813.8 - 820.9] 0
(820.9 - 827.9] ▏8
(827.9 - 835.0] ▏14
(835.0 - 937.0] ▏10
Counts
min: 736.434 ns (0.00% GC); mean: 739.276 ns (0.00% GC); median: 738.047 ns (0.00% GC); max: 937.016 ns (0.00% GC).
julia> @benchmark rand!($drng, $x)
samples: 10000; evals/sample: 155; memory estimate: 0 bytes; allocs estimate: 0
ns
(673.9 - 679.8] ██████████████████████████████ 9530
(679.8 - 685.7] ▏12
(685.7 - 691.6] ▏15
(691.6 - 697.5] ▉262
(697.5 - 703.4] ▍84
(703.4 - 709.3] ▏17
(709.3 - 715.2] ▏2
(715.2 - 721.1] ▏16
(721.1 - 727.0] ▏20
(727.0 - 732.9] ▏7
(732.9 - 738.8] ▏2
(738.8 - 744.7] ▏1
(744.7 - 750.6] ▏3
(750.6 - 756.5] ▏19
(756.5 - 797.6] ▏10
Counts
min: 673.923 ns (0.00% GC); mean: 676.059 ns (0.00% GC); median: 674.729 ns (0.00% GC); max: 797.581 ns (0.00% GC).
julia> @benchmark randn!($lrng, $x)
samples: 10000; evals/sample: 8; memory estimate: 0 bytes; allocs estimate: 0
ns
(3000.0 - 3100.0] ██████████████████████████████ 9923
(3100.0 - 3200.0] ▏2
(3200.0 - 3300.0] ▏2
(3300.0 - 3390.0] ▏18
(3390.0 - 3490.0] ▏14
(3490.0 - 3590.0] ▏12
(3590.0 - 3690.0] ▏1
(3690.0 - 3790.0] ▏2
(3790.0 - 3890.0] ▏5
(3890.0 - 3990.0] ▏2
(3990.0 - 4080.0] ▏1
(4080.0 - 4180.0] ▏1
(4180.0 - 4280.0] ▏1
(4280.0 - 4380.0] ▏6
(4380.0 - 4990.0] ▏10
Counts
min: 3.000 μs (0.00% GC); mean: 3.023 μs (0.00% GC); median: 3.016 μs (0.00% GC); max: 4.989 μs (0.00% GC).
julia> @benchmark randn!($drng, $x)
samples: 10000; evals/sample: 10; memory estimate: 0 bytes; allocs estimate: 0
ns
(1617.0 - 1667.0] ▎18
(1667.0 - 1717.0] ████▏579
(1717.0 - 1767.0] ████████████████████████▏3450
(1767.0 - 1817.0] ██████████████████████████████4296
(1817.0 - 1867.0] ██████████▌1494
(1867.0 - 1917.0] █131
(1917.0 - 1967.0] ▏3
(1967.0 - 2017.0] 0
(2017.0 - 2067.0] 0
(2067.0 - 2117.0] ▏2
(2117.0 - 2167.0] ▏7
(2167.0 - 2217.0] ▏5
(2217.0 - 2267.0] ▏3
(2267.0 - 2317.0] ▏2
(2317.0 - 2950.0] ▏10
Counts
min: 1.617 μs (0.00% GC); mean: 1.778 μs (0.00% GC); median: 1.775 μs (0.00% GC); max: 2.950 μs (0.00% GC).
julia> versioninfo()
Julia Version 1.7.0-DEV.1302
Commit f03832028a* (2021-06-10 20:51 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin20.5.0)
CPU: Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.0 (ORCJIT, cyclone)
Environment:
JULIA_NUM_THREADS = 4

I need to tune this library to work better on architectures without good SIMD performance.
Chethega and Jeff Bezanson did all the real work on SIMD there.
But you can see the advantage of lower overhead when starting SIMD generation by sampling shorter vectors:

julia> x = Vector{Float64}(undef, 128);
julia> @benchmark rand!($lrng, $x)
samples: 10000; evals/sample: 964; memory estimate: 0 bytes; allocs estimate: 0
ns
(94.1 - 95.0 ] ██████████████████████████████ 9630
(95.0 - 95.9 ] ▏4
(95.9 - 96.9 ] ▏1
(96.9 - 97.8 ] ▊211
(97.8 - 98.7 ] ▎75
(98.7 - 99.7 ] ▏15
(99.7 - 100.6] 0
(100.6 - 101.5] ▏6
(101.5 - 102.5] ▏27
(102.5 - 103.4] ▏4
(103.4 - 104.3] 0
(104.3 - 105.3] 0
(105.3 - 106.2] ▏2
(106.2 - 107.1] ▏16
(107.1 - 109.4] ▏9
Counts
min: 94.052 ns (0.00% GC); mean: 94.444 ns (0.00% GC); median: 94.269 ns (0.00% GC); max: 109.354 ns (0.00% GC).
julia> @benchmark rand!($drng, $x)
samples: 10000; evals/sample: 930; memory estimate: 0 bytes; allocs estimate: 0
ns
(111.1 - 112.1] ██████████████████████████████ 9513
(112.1 - 113.1] ▏5
(113.1 - 114.0] ▏1
(114.0 - 115.0] █278
(115.0 - 116.0] ▎56
(116.0 - 116.9] ▎53
(116.9 - 117.9] 0
(117.9 - 118.9] ▏17
(118.9 - 119.8] ▏37
(119.8 - 120.8] ▏7
(120.8 - 121.8] 0
(121.8 - 122.7] ▏2
(122.7 - 123.7] ▏3
(123.7 - 124.7] ▏18
(124.7 - 127.6] ▏10
Counts
min: 111.111 ns (0.00% GC); mean: 111.490 ns (0.00% GC); median: 111.246 ns (0.00% GC); max: 127.643 ns (0.00% GC).
julia> x = Vector{Float64}(undef, 127);
julia> @benchmark rand!($lrng, $x)
samples: 10000; evals/sample: 965; memory estimate: 0 bytes; allocs estimate: 0
ns
(94.2 - 95.2 ] ██████████████████████████████ 9615
(95.2 - 96.1 ] ▏5
(96.1 - 97.0 ] ▏2
(97.0 - 98.0 ] ▊209
(98.0 - 98.9 ] ▎80
(98.9 - 99.9 ] ▏10
(99.9 - 100.8] ▏2
(100.8 - 101.7] ▏10
(101.7 - 102.7] ▏24
(102.7 - 103.6] ▏6
(103.6 - 104.6] ▏2
(104.6 - 105.5] ▏1
(105.5 - 106.4] ▏2
(106.4 - 107.4] ▏23
(107.4 - 123.1] ▏9
Counts
min: 94.213 ns (0.00% GC); mean: 94.581 ns (0.00% GC); median: 94.388 ns (0.00% GC); max: 123.057 ns (0.00% GC).
julia> @benchmark rand!($drng, $x)
samples: 10000; evals/sample: 923; memory estimate: 0 bytes; allocs estimate: 0
ns
(113.5 - 114.5] ██████████████████████████████ 9531
(114.5 - 115.5] ▏1
(115.5 - 116.5] ▏2
(116.5 - 117.5] █279
(117.5 - 118.5] ▎74
(118.5 - 119.5] ▏34
(119.5 - 120.5] ▏2
(120.5 - 121.5] ▏8
(121.5 - 122.5] ▏28
(122.5 - 123.5] ▏7
(123.5 - 124.5] ▏1
(124.5 - 125.5] ▏4
(125.5 - 126.5] ▏6
(126.5 - 127.5] ▏13
(127.5 - 130.1] ▏10
Counts
min: 113.533 ns (0.00% GC); mean: 113.985 ns (0.00% GC); median: 113.759 ns (0.00% GC); max: 130.101 ns (0.00% GC).

The 10980XE (AVX512):

julia> x = Vector{Float64}(undef, 128);
julia> @benchmark rand!($lrng, $x)
samples: 10000; evals/sample: 996; memory estimate: 0 bytes; allocs estimate: 0
ns
(23.42 - 23.61] ██████████████████████████████ 9748
(23.61 - 23.81] ▏12
(23.81 - 24.0 ] 0
(24.0 - 24.19] 0
(24.19 - 24.38] 0
(24.38 - 24.57] ▋167
(24.57 - 24.77] ▏40
(24.77 - 24.96] ▏6
(24.96 - 25.15] ▏7
(25.15 - 25.34] ▏6
(25.34 - 25.53] ▏1
(25.53 - 25.73] ▏1
(25.73 - 25.92] ▏1
(25.92 - 26.11] ▏1
(26.11 - 58.54] ▏10
Counts
min: 23.422 ns (0.00% GC); mean: 23.502 ns (0.00% GC); median: 23.449 ns (0.00% GC); max: 58.538 ns (0.00% GC).
julia> @benchmark rand!($drng, $x)
samples: 10000; evals/sample: 968; memory estimate: 0 bytes; allocs estimate: 0
ns
(77.63 - 78.03 ] ▏21
(78.03 - 78.43 ] ▏38
(78.43 - 78.83 ] ██████████████████████████████ 9131
(78.83 - 79.23 ] ▏32
(79.23 - 79.63 ] ▏6
(79.63 - 80.04 ] ██▍692
(80.04 - 80.44 ] ▎49
(80.44 - 80.84 ] ▏16
(80.84 - 81.24 ] ▏2
(81.24 - 81.64 ] 0
(81.64 - 82.04 ] ▏1
(82.04 - 82.44 ] 0
(82.44 - 82.85 ] ▏1
(82.85 - 83.25 ] ▏1
(83.25 - 113.52] ▏10
Counts
min: 77.626 ns (0.00% GC); mean: 78.804 ns (0.00% GC); median: 78.710 ns (0.00% GC); max: 113.522 ns (0.00% GC).
julia> x = Vector{Float64}(undef, 127);
julia> @benchmark rand!($lrng, $x)
samples: 10000; evals/sample: 996; memory estimate: 0 bytes; allocs estimate: 0
ns
(23.21 - 23.37] ██████████████████████████████ 9733
(23.37 - 23.52] ▏19
(23.52 - 23.68] 0
(23.68 - 23.83] 0
(23.83 - 23.99] 0
(23.99 - 24.14] 0
(24.14 - 24.29] ▏3
(24.29 - 24.45] ▌162
(24.45 - 24.6 ] ▎48
(24.6 - 24.76] ▏16
(24.76 - 24.91] ▏1
(24.91 - 25.07] ▏6
(25.07 - 25.22] ▏1
(25.22 - 25.37] ▏1
(25.37 - 57.93] ▏10
Counts
min: 23.214 ns (0.00% GC); mean: 23.361 ns (0.00% GC); median: 23.315 ns (0.00% GC); max: 57.930 ns (0.00% GC).
julia> @benchmark rand!($drng, $x)
samples: 10000; evals/sample: 957; memory estimate: 0 bytes; allocs estimate: 0
ns
(89.22 - 89.65 ] █████▎1355
(89.65 - 90.08 ] ██████████████████████████████ 7761
(90.08 - 90.51 ] ▏12
(90.51 - 90.95 ] ██▏526
(90.95 - 91.38 ] █▏286
(91.38 - 91.81 ] ▎45
(91.81 - 92.24 ] ▏2
(92.24 - 92.67 ] 0
(92.67 - 93.1 ] ▏1
(93.1 - 93.53 ] ▏1
(93.53 - 93.96 ] 0
(93.96 - 94.39 ] 0
(94.39 - 94.82 ] 0
(94.82 - 95.25 ] ▏1
(95.25 - 123.91] ▏10
Counts
min: 89.222 ns (0.00% GC); mean: 89.881 ns (0.00% GC); median: 89.765 ns (0.00% GC); max: 123.913 ns (0.00% GC).
A. First, I want to start on a positive note.

Your RNG implementation is 2.5 times faster (probably just because the RNG itself is faster, not merely because it's vectorized).

While you also beat Base on Float64 (by a smaller factor), it's interesting that your Int RNG is still slower than your Float generation (you do not generate a Float first, which is what I think Base does, so it's understandable for theirs).

While I remember: Int generation doesn't work with PCG (I don't care), but should your tagline be changed (maybe you just started with PCG, and maybe I influenced you to add the other RNG)? It still reads: "Vectorized PCG uniform, normal, and exponential random samplers."
B. What you really want to show is, e.g., the possible 3.4x gain that you can see in this kind of microbenchmark, but I think that may be harmful.
Now, as you know, I've pointed to and used your package on Discourse, and we could discuss this there in the open, but I want to make sure I'm not misunderstanding before I post a warning about my program (the one that's 3x faster, mostly thanks to your package).

Someone else used this trick of precalculating the random numbers first, and I just improved that program while keeping the idea. I've seen it before, maybe first in Debian's Benchmarks Game.

And if others do it, you are kind of forced to do so too in a microbenchmark to match other languages, since it works.
I just think this may be a very deceptive idea, and I think your vectorized package depends on it. At some point, for me around 2048 precalculated random numbers, it stops being faster and gets slower with larger arrays; I guess that's around the point where the generated numbers no longer fit in L1 cache. And that's just the best case.

In real-world programs you're going to do something with those values, and having the L1 cache available for other data is valuable, so there the right size might be much smaller. So, is there a way to have your cake and eat it too?

I'm not sure what a safe number of generated random numbers is. With just a few numbers generated you get almost all of the gain (maybe keep the next few generated values in registers only?). But in programs that use this idea, I'm not sure you can do away with a small buffer in memory; see the sketch below.
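One possible "have your cake and eat it too" compromise, as a sketch (sum_uniform_chunked is a hypothetical helper of mine, and chunk = 512 is a guess to be tuned): refill one small, L1-resident buffer and consume it chunk by chunk, instead of precalculating one huge array.

julia> using Random, VectorizedRNG

julia> function sum_uniform_chunked(n; chunk = 512)  # 512 Float64 = 4 KiB, comfortably inside L1
           rng = local_rng()
           buf = Vector{Float64}(undef, chunk)       # the one small buffer, reused every iteration
           s = 0.0
           remaining = n
           while remaining > 0
               rand!(rng, buf)                       # vectorized refill; buffer stays hot in cache
               m = min(chunk, remaining)             # may generate a few unused values on the last pass
               @inbounds for i in 1:m
                   s += buf[i]                       # consume the values while they are still in L1
               end
               remaining -= m
           end
           return s
       end;

julia> sum_uniform_chunked(10^6) / 10^6  # sanity check: mean of Uniform(0,1) draws, should be ≈ 0.5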