Support also doing processing on GPUs #349
Comments
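An alternative approach that may help is to use OpenMP, which has support for "offloading" work to GPUs (and the API for doing so may be less low-level than OpenCL). One upside of this approach may be that it would be straightforward to just annotate code with OpenMP pragmas.

Resources:
https://gcc.gnu.org/wiki/Offloading (Ubuntu's GCC seems to have NVIDIA offloading support as standard config, but Strawberry Perl 5.32's does not)
https://stackoverflow.com/questions/66307810/openmp-runtime-does-not-sees-my-gpu-devices
https://on-demand.gputechconf.com/gtc/2018/presentation/s8344-openmp-on-gpus-first-experiences-and-best-practices.pdf (presentation from 2018)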
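To illustrate what "just annotate code with OpenMP pragmas" could look like, here is a minimal sketch in C. The function name `add_offload` is hypothetical and not part of PDL. It assumes a compiler built with OpenMP offloading support and invoked with `-fopenmp`; without that flag the pragma is ignored and the loop runs serially on the host, and with it but no usable device the OpenMP runtime normally falls back to the host as well.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PDL function): out[i] = a[i] + b[i].
 * The pragma asks OpenMP to run the loop on a target device, copying
 * a and b to the device and out back to the host.
 * Assumption: the toolchain was configured for offloading; otherwise
 * this is just an ordinary host loop. */
void add_offload(const double *a, const double *b, double *out, size_t n)
{
    assert(a && b && out);
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n], b[0:n]) map(from: out[0:n])
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The loop body is untouched by the annotation, which is the attraction: the same source can serve CPU-only and GPU-capable builds.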
|
GitHub-hosted Actions runners don't have GPUs, but emulators seem to exist that would allow CI of GPU offloading, since we aren't worried about performance, only about checking basic functionality. Resources:
|
A lot of scientists have Macs and there seem to be no good cross-platform solutions for this.
CUDA is NVIDIA-only and not supported on macOS. OpenCL is deprecated in favour of Metal, which is Apple-only.
I don’t know if there is a good solution?
Karl
On 15 Nov 2021, at 12:41 pm, mohawk2 wrote:
An alternative approach that may help is to use OpenMP, which has support for "offloading" work to GPUs (and the API for doing so may be less low-level than OpenCL).
One upside of this approach may be that it would be straightforward to just annotate code with OpenMP pragmas.
Resources:
https://gcc.gnu.org/wiki/Offloading (Ubuntu's GCC seems to have NVIDIA offloading support as standard config, but Strawberry Perl 5.32's does not)
https://stackoverflow.com/questions/66307810/openmp-runtime-does-not-sees-my-gpu-devices
https://on-demand.gputechconf.com/gtc/2018/presentation/s8344-openmp-on-gpus-first-experiences-and-best-practices.pdf (presentation from 2018)
|
PDL today (as of 2.059) on Macs will automatically use all available CPU cores using POSIX threads. That's not terrible already.

To use OpenMP on Mac, these are apparently effective instructions: https://stackoverflow.com/questions/43555410/enable-openmp-support-in-clang-in-mac-os-x-sierra-mojave (alternative: https://mac.r-project.org/openmp/#do). While there isn't yet support for OpenMP offloading to GPU on Mac, this recent discussion suggests such may not be far away: https://groups.google.com/g/llvm-dev/c/l45OcKvt_0w

Otherwise, given how relatively expensive Mac hardware is for the power you get, you might consider a PC gaming laptop (for powerful CPU and GPU) and run Linux on it ;-)
|
Sure troll me!
I think the recent benchmarks have shown the Mac laptops smoking the Intel ones, and way way better per watt. I have one and that is my experience too, and it wasn’t that expensive!
Karl
On 15 Nov 2021, at 4:55 pm, mohawk2 wrote:
PDL today (as of 2.059) on Macs will automatically use all available CPU cores using POSIX threads. That's not terrible already.
To use OpenMP on Mac, these are apparently effective instructions: https://stackoverflow.com/questions/43555410/enable-openmp-support-in-clang-in-mac-os-x-sierra-mojave (alternative: https://mac.r-project.org/openmp/#do). While there isn't yet support for OpenMP offloading to GPU on Mac, this recent discussion suggests such may not be far away: https://groups.google.com/g/llvm-dev/c/l45OcKvt_0w
Otherwise, given how relatively expensive Mac hardware is for the power you get, you might consider a PC gaming laptop (for powerful CPU and GPU) and run Linux on it ;-)
|
This shows a proper analysis of M1 for science: https://github.com/neurolabusc/AppleSiliconForNeuroimaging - for me, a bit of a bombshell is that Metal doesn't do double precision, which seems to limit its value for scientific purposes. My reading indicates you're 100% right on M1 being vastly better per watt. My question: is it better per pound sterling? |
I will try and not get dragged further in to a ‘A is better than B’ debate :-), I will just say my own experience in one year with an M1 Macbook Air has been entirely positive. Everything runs faster and cooler, with much better battery life and I find the cost premium (20%?) a net benefit.
“Everything” still includes a lot of open source science code running under Rosetta (which all works fine BTW!), because it will take a while for the open source community to do architecture ports. The last seems esp. true of deep Python stacks - still waiting on Anaconda! On that note I do kind of think ‘how hard can it be’, as open source code should not in principle really care about your CPU, and I recompiled PDL myself just fine back in January and relatively easily.
Re Metal and single precision. Yes if that is right then a big ouch indeed! Hopefully it is an API issue that can be improved, and not a limitation of the underlying GPU. BTW Tensorflow has been ported to Metal, which is an example of scientific acceleration in practice.
K
On 16 Nov 2021, at 2:44 am, mohawk2 wrote:
This shows a proper analysis of M1 for science: https://github.com/neurolabusc/AppleSiliconForNeuroimaging - for me, a bit of a bombshell is that Metal doesn't do double precision, which seems to limit its value for scientific purposes.
My reading indicates you're 100% right on M1 being vastly better per watt. My question: is it better per pound sterling?
|
Glad to hear it!
hashcat/hashcat#2976 indicates that the Apple Silicon doesn't have double precision in hardware:

Device Name Apple M1
[...]
Double-precision Floating-point support (n/a)

so there's no reason for Apple to make a Metal API for it. I think that implies that massively parallel double-precision stuff will be much faster on NVIDIA hardware, but unless the processing is highly CPU-intensive, the performance limit will still be memory bandwidth. That also has implications for how much benefit there is to adding GPGPU support to PDL (possibly not much).

An alternative approach to increasing locality for more complex processing might be JIT operator-building. This would take the transformations being applied to ndarrays and combine them into a single JIT-created operator, which would do all the steps in each "broadcastloop" while the data is still in cache, thereby limiting how many round-trips data takes out of main RAM and back.
|
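The locality argument can be sketched in plain C. This is only an illustration of the general idea of combining steps into one loop, not of how PDL or any JIT would actually generate code. The function names are hypothetical, and a simple multiply-add stands in for a longer chain of operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Unfused: each step is its own pass over the data, with a temporary
 * array in between, much as chained array operations behave.
 * Returns 0 on success, -1 if the temporary could not be allocated. */
int mul_add_unfused(const double *a, const double *b, const double *c,
                    double *out, size_t n)
{
    assert(a && b && c && out);
    double *tmp = malloc(n * sizeof *tmp);
    if (n && !tmp)
        return -1;
    for (size_t i = 0; i < n; i++)      /* pass 1: tmp = b * c */
        tmp[i] = b[i] * c[i];
    for (size_t i = 0; i < n; i++)      /* pass 2: out = a + tmp */
        out[i] = a[i] + tmp[i];
    free(tmp);
    return 0;
}

/* Fused: both steps happen while each element is still in a register,
 * so there is one pass over memory and no temporary array. */
void mul_add_fused(const double *a, const double *b, const double *c,
                   double *out, size_t n)
{
    assert(a && b && c && out);
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i] * c[i];
}
```

For arrays much larger than cache, the unfused version writes `tmp` out to main RAM and reads it back, which is the round-trip a combined operator would avoid.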
On 17 Nov 2021, at 1:21 pm, mohawk2 wrote:
Glad to hear it!
Re Metal and single precision. Yes if that is right then a big ouch indeed! Hopefully it is an API issue that can be improved, and not a limitation of the underlying GPU. BTW Tensorflow has been ported to Metal, which is an example of scientific acceleration in practice.
hashcat/hashcat#2976 indicates that the Apple Silicon doesn't have double precision in hardware:
Device Name Apple M1
[...]
Double-precision Floating-point support (n/a)
so there's no reason for Apple to make a Metal API for it. I think that implies that massive parallel double-precision stuff will be much faster on NVIDIA hardware, but unless the processing is highly CPU-intensive, the performance limits will still be of memory bandwidth. Which also has implications for how much benefit there is to adding GPGPU support to PDL (possibly not much).
Well this is OpenCL reporting that, and Apple now deprecates that, so it may not be accurate.
But you are right: until the situation settles down, there is no benefit in trying to use GPUs to accelerate our array math.
An alternative approach to increasing locality for more complex processing might be to make JIT operator-building. This would take the transformations being applied to ndarrays, and combine them into a single, JIT-created, operator, which would do all the steps in each "threadloop" while the data is still in cache, thereby limiting how many round-trips data takes out of main RAM and back.
This is an idea that has been expressed before in this mailing list, and remains a good one
Karl
|
Deprecating it doesn't mean they are not supporting it for some time to come. In particular, it in no way means the |
To expand a bit on the above-mentioned point about "kernels", @zmughal has correctly pointed out this is basically about "loop fusion". Further thinking on that (as also discussed on IRC The Wikipedia page linked above says that latest clang (12) and gcc (11.1) don't do redundant-allocation removal, and it seems to me that avoiding redundant allocation is very important. One way forward might be to use variable-length arrays (allocated on stack) so the compiler will know it doesn't necessarily ever need to go in main RAM. |
Actual measurement of performance with a simple loop-fusion, code:

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use PDL;
use Inline Pdlpp => 'DATA';

# Run a block and print how long it took, in milliseconds
sub with_time (&) {
  my @t = gettimeofday();
  $_[0]->();
  printf "%g ms\n", tv_interval(\@t) * 1000;
}

# Allow very large ndarrays; assigned twice to avoid a "used only once" warning
$PDL::BIGPDL = $PDL::BIGPDL = 1;
my $N = $ARGV[0] or die "usage: $0 N\n"; # number of elements
my ($a, $b, $c) = (ones($N), sequence($N), sequence($N));

# Each operator creates a full-size intermediate ndarray
print "with intermediates\n";
with_time { print +($a + $b * sin($c))->info, "\n" } for 1..5;

# One PP-generated operation does the whole expression per element
print "manual loop-fusion\n";
with_time { print PDL::a_plus_b_sin_c($a, $b, $c)->info, "\n" } for 1..5;

__DATA__
__Pdlpp__
pp_def('a_plus_b_sin_c',
  Pars => 'a(); b(); c(); [o]d()',
  GenericTypes => ['D'],
  Code => '$d() = $a() + $b() * sin($c());',
);
```

Results with
In nice round numbers, the manually loop-fused operation above was a little over twice as fast as the naive version. This was also true with lower values of |
I am merging #354 into this.
|
Just been re-reading the COS docs. It is primarily about dynamic polymorphism, which I cannot see a great need for in any current or future PDL. The heart of PDL is to write some type-generic A need for PDL that I am starting to see is first-class dimensions that could be referred to in I am open to the idea that COS could be part of the solution. Named dimensions (HDF5 etc style) would be a beneficial thing generally, and an "einops" style of expressing things might be part of that reversible dimension-translating. It seems unlikely, but not impossible, that ArrayFire is going to help us much. However, they do support farming processing out to OpenCL and CUDA, so it could be an alternative to OpenMP. What would be needed is an Alien::ArrayFire similar to Alien::OpenMP, then a generalisation of broadcastloop to make our current pthread-only implementation be just one possibility. Given the language is a bit different from C, that might be challenging, and OpenMP/OpenCL would probably be easier. However, the generalisation would need to happen in any case. Also needed would be to store in the |
Another optimisation to think about which isn't as well supported within PDL is "loop tiling/blocking". https://www.intel.com/content/www/us/en/developer/articles/technical/loop-optimizations-where-blocks-are-required.html. I noticed it being used as I was looking over the current |
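As a concrete picture of loop tiling/blocking, here is a minimal sketch in C of a blocked matrix transpose. It is a generic textbook illustration rather than code from PDL. The function name and the tile size of 32 are assumptions, and a real implementation would tune the tile to the cache.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 32 };  /* assumed tile edge length; tune per cache size */

/* Writes the transpose of the row-major rows x cols matrix src into
 * dst (cols x rows, row-major). Working tile by tile keeps both the
 * rows being read and the columns being written within a small region
 * of memory, instead of striding across the whole of dst per row. */
void transpose_tiled(const double *src, double *dst,
                     size_t rows, size_t cols)
{
    assert(src && dst && src != dst);
    for (size_t ib = 0; ib < rows; ib += TILE) {
        for (size_t jb = 0; jb < cols; jb += TILE) {
            /* clamp the tile at the matrix edges */
            size_t imax = ib + TILE < rows ? ib + TILE : rows;
            size_t jmax = jb + TILE < cols ? jb + TILE : cols;
            for (size_t i = ib; i < imax; i++)
                for (size_t j = jb; j < jmax; j++)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

The result is identical to a plain two-level loop; only the order in which elements are visited changes.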
I've read the I'm also reading https://www.alcf.anl.gov/sites/default/files/2020-01/OpenMP45_Bertoni.pdf which is a really good summary of OpenMP 4.5 constructs, especially the |
Thinking about how to make
This doesn't yet help with a proper A more generalised |
#324 acts as a reminder once this gets closer to completion that PDL::Dataflow will want a thorough update since it still has stuff in there about "families" and some quite complicated stuff that I don't think would be very useful. |
Note to self: when (not if) we complete this, we'd update the PDL::PP docs to revise this:
|
Following on from a long-running branch on main PDL (
It occurs to me that What we could have in addition is "narrowcasting". It would generate additional |
Another idea that occurred while pondering the above can be considered moving in the opposite direction; small and lightweight vs increasingly magnificent abstractions. Currently PDL operations (in particular the XS wrapper functions) if given non-broadcasting inputs, and/or single elements, and/or Perl scalars, do the full conversion of those to ndarrays, then set up a full This has been noted as being expensive (i.e. slow). An alternative would be to have the XS wrappers have some extra generated code which detects when the full machinery is not required, and interpolates a modified version of the fundamental code (or "kernel"). That could even allow for just using This would allow PDL libraries to compete directly with e.g. https://metacpan.org/pod/Math::GSL |
As noted in the updated PDL::Dataflow and on this PerlMonks thread, it would be interesting to shift PDL's threading model from "wait while all the threads return" to "start them off then react to their completion" using an event loop. |
Loop fusion note: using |
Discussion of parsing |
This might be a discussion for PDL::LinearAlgebra, but FYI in case it is relevant: "The NVBLAS Library is built on top of the cuBLAS Library using only the CUBLASXT API (refer to the CUBLASXT API section of the cuBLAS Documentation for more details). NVBLAS also requires the presence of a CPU BLAS lirbary on the system. Currently NVBLAS intercepts only compute intensive BLAS Level-3 calls (see table below). Depending on the charateristics of those BLAS calls, NVBLAS will redirect the calls to the GPUs present in the system or to CPU. That decision is based on a simple heuristic that estimates if the BLAS call will execute for long enough to amortize the PCI transfers of the input and output data to the GPU" [ https://docs.nvidia.com/cuda/nvblas/ ] It currently supports these routines, all others are passed to the system BLAS:
|
As per discussion in IRC, the following are worth looking at: |
Coming down to choosing which GPGPU tech to use:
|
This was discussed on pdl-devel back in Dec 2015 (see https://sourceforge.net/p/pdl/mailman/pdl-devel/?viewmonth=201512). There is also some discussion in Jun this year on PerlMonks (https://www.perlmonks.org/?node_id=11134476).
I recently revisited this; my thinking so far is that the best platform for mixed CPU/GPU processing seems to be OpenCL. However, OpenCL C doesn't support native-complex processing, but it would be possible to use OpenCL C++ with a class that overloads the arithmetic operators for source compatibility with existing `Code`; an easy first-cut alternative would be to only support GPU use on real types (which may also be further restricted to not supporting `double` for <OpenCL 2.0 environments - GPUs have long been aimed at speed rather than extreme precision, because of graphics' needs).

Defining some terms: "kernel" is a term in GPGPU-land for the code that actually runs in the GPU, typically being run once per "work item", which is basically identical to the "inner loop" inside a `broadcastloop %{ /* blah */ %}` (and/or `loop`). Current PDL operations have code generated by PP that doesn't distinguish between the "kernel" and the setup code (which runs on the "host"), since the output is just a bunch of C to put inside the function. When that distinction starts to matter, it will need some work.

Things that would need doing to allow this:
Split `Code` (etc) into kernel and setup code; another benefit of doing so would be to aid lazy-building ndarrays since PDL could be made to set up a set of very small ndarrays, call the kernel on that, then do something with the results, then throw them away - an example might be to `sumover` a very long `sequence` using basically no memory

Resources:
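As a rough illustration of the kernel/setup split and the `sumover`-of-a-long-`sequence` idea described in this post, here is a minimal sketch in plain C. Nothing here is PDL or OpenCL API: the names `sequence_kernel` and `sum_sequence_chunked` and the chunk size are assumptions made for the example, and the "kernel" simply runs on the CPU.

```c
#include <assert.h>
#include <stddef.h>

enum { CHUNK = 1024 };  /* assumed size of the small working buffer */

/* "Kernel": the work for one element (one "work item"). It knows
 * nothing about looping or memory management. */
static inline double sequence_kernel(size_t i)
{
    return (double)i;
}

/* "Host"/setup code: repeatedly fills a small buffer by calling the
 * kernel, reduces it, and discards it. The full sequence of n
 * elements is never held in memory at once. */
double sum_sequence_chunked(size_t n)
{
    double buf[CHUNK];
    double total = 0.0;
    for (size_t start = 0; start < n; start += CHUNK) {
        size_t len = n - start < CHUNK ? n - start : CHUNK;
        assert(len <= CHUNK);
        for (size_t k = 0; k < len; k++)   /* kernel over this chunk */
            buf[k] = sequence_kernel(start + k);
        for (size_t k = 0; k < len; k++)   /* reduce, then reuse buf */
            total += buf[k];
    }
    return total;
}
```

Memory use is fixed at `CHUNK` doubles however large `n` is, and the inner kernel loop is the part that a GPU back end would take over.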