Support also doing processing on GPUs #349

mohawk2 · 2021-11-14T23:33:26Z

This was discussed on pdl-devel back in Dec 2015 (see https://sourceforge.net/p/pdl/mailman/pdl-devel/?viewmonth=201512). There is also some discussion in Jun this year on PerlMonks (https://www.perlmonks.org/?node_id=11134476).

I recently revisited this; my thinking so far is that the best platform for mixed CPU/GPU processing seems to be OpenCL. However, OpenCL C doesn't support native-complex processing, but it would be possible to use OpenCL C++ with a class that overloads the arithmetic operators for source compatibility with existing Code; an easy first-cut alternative would be to only support GPU use on real types (which may also be further restricted to not supporting double for <OpenCL 2.0 environments - GPUs have long been aimed at speed rather than extreme precision, because of graphics' needs).

Defining some terms: "kernel" is a term in GPGPU-land for the code that actually runs in the GPU, typically being run once per "work item", which is basically identical to the "inner loop" inside a broadcastloop %{ /* blah */ %} (and/or loop). Current PDL operations have code generated by PP that doesn't distinguish between the "kernel" and the setup code (which runs on the "host"), since the output is just a bunch of C to put inside the function. When that distinction starts to matter, it will need some work.

Things that would need doing to allow this:

split Code (etc) into kernel and setup code; another benefit of doing so would be to aid lazy-building ndarrays since PDL could be made to set up a set of very small ndarrays, call the kernel on that, then do something with the results, then throwing them away - an example might be to sumover a very long sequence using basically no memory
figure a way to allow calling at least one other different broadcastloop initiator (as called in PP-generated C code); for back-compat with current CPU-only it's not desirable to make PDL have a hard requirement of OpenCL to operate at all
implement dividing work into a task queue with 4x as many items as the number of processing-units as alluded to here (https://sourceforge.net/p/pdl/mailman/pdl-devel/?viewmonth=202109) (which will also help with parallelising non-deterministic tasks)
either figuring how to identify code that calls into GPU-unavailable C functions (e.g. LAPACK) and having that not use GPU, or figure the best way to manage opt-in / opt-out of GPU use - an easy way might be to simply try OpenCL-compiling the generated code, which undoubtedly will fail if any unsuitable function calls (or headers) are found

Resources:

https://github.com/CastXML/CastXMLSuperbuild which illustrates how to use CMake to build an external project (here, LLVM might be needed)
http://portablecl.org/docs/html/install.html shows how to install Portable OpenCL (POCL)
https://en.wikipedia.org/wiki/OpenCL#Example:_complex_number_arithmetic shows OpenCL C++ code to implement overloaded arithmetic operations for complex numbers
https://github.com/KhronosGroup/libclcxx OpenCL C++ libraries
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html the actual spec for OpenCL 3.0 C

The text was updated successfully, but these errors were encountered:

mohawk2 · 2021-11-15T01:40:58Z

An alternative approach that may help is to use OpenMP, which has support for "offloading" work to GPUs (and the API for doing so may be less low-level than OpenCL).

One upside of this approach may be that it would be straightforward to just annotate code with OpenMP pragmas.

Resources:

https://gcc.gnu.org/wiki/Offloading (Ubuntu's GCC seems to have NVIDIA offloading support as standard config, but Strawberry Perl 5.32's does not)
https://stackoverflow.com/questions/66307810/openmp-runtime-does-not-sees-my-gpu-devices
https://on-demand.gputechconf.com/gtc/2018/presentation/s8344-openmp-on-gpus-first-experiences-and-best-practices.pdf presentation from 2018

mohawk2 · 2021-11-15T02:11:17Z

GH-hosted GitHub Actions-runners don't have GPUs. But emulators seem to exist that would allow CI of GPU offloading, since we aren't worried about performance, only checking basic functionality.

Resources:

karlglazebrook · 2021-11-15T04:44:59Z

A lot of scientists have Macs and there seems to be no good cross-platform solutions for this. CUDA is NVIDIA-only and not supported on MacOS. OpenCL is deprecated in favour of Metal which is Apple-only I don’t know if there is a good solution? Karl

…

On 15 Nov 2021, at 12:41 pm, mohawk2 ***@***.***> wrote: An alternative approach that may help is to use OpenMP, which has support for "offloading" work to GPUs (and the API for doing so may be less low-level than OpenCL). One upside of this approach may be that it would be straightforward to just annotate code with OpenMP pragmas. Resources: https://gcc.gnu.org/wiki/Offloading <https://gcc.gnu.org/wiki/Offloading> (Ubuntu's GCC seems to have NVIDIA offloading support as standard config, but Strawberry Perl 5.32's does not) https://stackoverflow.com/questions/66307810/openmp-runtime-does-not-sees-my-gpu-devices <https://stackoverflow.com/questions/66307810/openmp-runtime-does-not-sees-my-gpu-devices> https://on-demand.gputechconf.com/gtc/2018/presentation/s8344-openmp-on-gpus-first-experiences-and-best-practices.pdf <https://on-demand.gputechconf.com/gtc/2018/presentation/s8344-openmp-on-gpus-first-experiences-and-best-practices.pdf> presentation from 2018 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#349 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADU7FGT2LKCVB6DL6HQJXHTUMBQLLANCNFSM5IAM7H5Q>. Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

mohawk2 · 2021-11-15T05:55:10Z

PDL today (as of 2.059) on Macs will automatically use all available CPU cores using POSIX threads. That's not terrible already.

To use OpenMP on Mac, these are apparently effective instructions: https://stackoverflow.com/questions/43555410/enable-openmp-support-in-clang-in-mac-os-x-sierra-mojave (alternative: https://mac.r-project.org/openmp/#do). While there isn't yet support for OpenMP offloading to GPU on Mac, this recent discussion suggests such may not be far away: https://groups.google.com/g/llvm-dev/c/l45OcKvt_0w

Otherwise, given how relatively expensive Mac hardware is for the power you get, you might consider a PC gaming laptop (for powerful CPU and GPU) and run Linux on it ;-)

karlglazebrook · 2021-11-15T07:08:47Z

Sure troll me! I think the recent benchmarks have shown the Mac laptops smoking the Intel ones, and way way better per watt. I have one and that is my experience too, and it wasn’t that expensive! Karl

…

On 15 Nov 2021, at 4:55 pm, mohawk2 ***@***.***> wrote: PDL today (as of 2.059) on Macs will automatically use all available CPU cores using POSIX threads. That's not terrible already. To use OpenMP on Mac, these are apparently effective instructions: https://stackoverflow.com/questions/43555410/enable-openmp-support-in-clang-in-mac-os-x-sierra-mojave <https://stackoverflow.com/questions/43555410/enable-openmp-support-in-clang-in-mac-os-x-sierra-mojave> (alternative: https://mac.r-project.org/openmp/#do <https://mac.r-project.org/openmp/#do>). While there isn't yet support for OpenMP offloading to GPU on Mac, this recent discussion suggests such may not be far away: https://groups.google.com/g/llvm-dev/c/l45OcKvt_0w <https://groups.google.com/g/llvm-dev/c/l45OcKvt_0w> Otherwise, given how relatively expensive Mac hardware is for the power you get, you might consider a PC gaming laptop (for powerful CPU and GPU) and run Linux on it ;-) — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#349 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADU7FGRYTAX2JG3EPABNZYLUMCOETANCNFSM5IAM7H5Q>. Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

mohawk2 · 2021-11-15T15:43:51Z

This shows a proper analysis of M1 for science: https://github.com/neurolabusc/AppleSiliconForNeuroimaging - for me, a bit of a bombshell is that Metal doesn't do double precision, which seems to limit its value for scientific purposes.

My reading indicates you're 100% right on M1 being vastly better per watt. My question: is it better per pound sterling?

karlglazebrook · 2021-11-17T00:24:55Z

I will try and not get dragged further in to a ‘A is better than B’ debate :-), I will just say my own experience in one year with an M1 Macbook Air has been entirely positive. Everything runs faster and cooler, with much better battery life and I find the cost premium (20%?) a net benefit. “Everything” still includes a lot of open source science code running under Rosetta (which all work fine BTW!), because it will take a while for the open source community to do architecture ports. The last seems esp. try of deep python stacks - still waiting on anaconda! On that note I do kind of think ‘how hard can it be’ as open source code should not in principle really care about your cpu, and I recompiled PDL myself just fine back in January and relatively easily. Re Metal and single precision. Yes if that is right then a big ouch indeed! Hopefully it is an API issue that can be improved, and not a limitation of the underlying GPU. BTW Tensorflow has been ported to Metal, which is an example of scientific acceleration in practice. K

…

On 16 Nov 2021, at 2:44 am, mohawk2 ***@***.***> wrote: This shows a proper analysis of M1 for science: https://github.com/neurolabusc/AppleSiliconForNeuroimaging <https://github.com/neurolabusc/AppleSiliconForNeuroimaging> - for me, a bit of a bombshell is that Metal doesn't do double precision, which seems to limit its value for scientific purposes. My reading indicates you're 100% right on M1 being vastly better per watt. My question: is it better per pound sterling? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#349 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADU7FGVXQB42ELP36QVSFXDUMETEDANCNFSM5IAM7H5Q>. Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

mohawk2 · 2021-11-17T02:21:41Z

I will try and not get dragged further in to a ‘A is better than B’ debate :-), I will just say my own experience in one year with an M1 Macbook Air has been entirely positive.

Glad to hear it!

Re Metal and single precision. Yes if that is right then a big ouch indeed! Hopefully it is an API issue that can be improved, and not a limitation of the underlying GPU. BTW Tensorflow has been ported to Metal, which is an example of scientific acceleration in practice.

hashcat/hashcat#2976 indicates that the Apple Silicon doesn't have double precision in hardware:

  Device Name                                     Apple M1
[...]
Double-precision Floating-point support         (n/a)

so there's no reason for Apple to make a Metal API for it. I think that implies that massive parallel double-precision stuff will be much faster on NVIDIA hardware, but unless the processing is highly CPU-intensive, the performance limits will still be of memory bandwidth. Which also has implications for how much benefit there is to adding GPGPU support to PDL (possibly not much).

An alternative approach to increasing locality for more complex processing might be to make JIT operator-building. This would take the transformations being applied to ndarrays, and combine them into a single, JIT-created, operator, which would do all the steps in each "broadcastloop" while the data is still in cache, thereby limiting how many round-trips data takes out of main RAM and back.

karlglazebrook · 2021-11-17T02:46:09Z

On 17 Nov 2021, at 1:21 pm, mohawk2 ***@***.***> wrote: lad to hear it! Re Metal and single precision. Yes if that is right then a big ouch indeed! Hopefully it is an API issue that can be improved, and not a limitation of the underlying GPU. BTW Tensorflow has been ported to Metal, which is an example of scientific acceleration in practice. hashcat/hashcat#2976 <hashcat/hashcat#2976> indicates that the Apple Silicon doesn't have double precision in hardware: Device Name Apple M1 [...] Double-precision Floating-point support (n/a) so there's no reason for Apple to make a Metal API for it. I think that implies that massive parallel double-precision stuff will be much faster on NVIDIA hardware, but unless the processing is highly CPU-intensive, the performance limits will still be of memory bandwidth. Which also has implications for how much benefit there is to adding GPGPU support to PDL (possibly not much).

Well this is OpenCL reporting that, and Apple now deprecates that, so it may not be accurate. But you are right until the situation settles down there is no benefit for trying to use GPUs to accelerate our array math

An alternative approach to increasing locality for more complex processing might be to make JIT operator-building. This would take the transformations being applied to ndarrays, and combine them into a single, JIT-created, operator, which would do all the steps in each "threadloop" while the data is still in cache, thereby limiting how many round-trips data takes out of main RAM and back.

This is an idea that has been expressed before in this mailing list, and remains a good one Karl

mohawk2 · 2021-11-17T06:46:04Z

Device Name Apple M1
[...]
Double-precision Floating-point support (n/a)

Well this is OpenCL reporting that, and Apple now deprecates that, so it may not be accurate. But you are right until the situation settles down there is no benefit for trying to use GPUs to accelerate our array math

Deprecating it doesn't mean they are not supporting it for some time to come. In particular, it in no way means the clinfo is incorrect. Even Apple aren't quite that arrogant.

mohawk2 · 2021-11-18T20:18:24Z

To expand a bit on the above-mentioned point about "kernels", @zmughal has correctly pointed out this is basically about "loop fusion".

Further thinking on that (as also discussed on IRC #native); PDL operations are referred to internally as "transformations". Tuomas's vision as I infer it from docs and code was there would be function-type ones as well as slice-type ones (EDIT in pdl.transtype there was also possibly a "retype"-type, but the field was never used so it was removed this year), though only slice-type ones were implemented. To achieve this loop-fusion there needs to be parsing of the code as supplied (PP already does this, so it could just be stored somewhere), then sellotaping the relevant loops together based on knowledge of the desired output. That knowledge would come from making function-type transforms a real thing, which would be done by lazy-evaluating more, such that a not-yet-evaluated multi-stage transform could, on evaluation, be queried for all its shape, and that could then be sellotaped together into one big bespoke transform.

The Wikipedia page linked above says that latest clang (12) and gcc (11.1) don't do redundant-allocation removal, and it seems to me that avoiding redundant allocation is very important. One way forward might be to use variable-length arrays (allocated on stack) so the compiler will know it doesn't necessarily ever need to go in main RAM.

mohawk2 · 2021-11-20T22:36:56Z

Actual measurement of performance with a simple loop-fusion, code:

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use PDL;
use Inline Pdlpp => 'DATA';

sub with_time (&) {
  my @t = gettimeofday();
  $_[0]->();
  printf "%g ms\n", tv_interval(\@t) * 1000;
}

$PDL::BIGPDL = $PDL::BIGPDL = 1;
my $N = $ARGV[0];
my ($a, $b, $c) = (ones($N), sequence($N), sequence($N));

print "with intermediates\n";
with_time { print +($a + $b * sin($c))->info, "\n" } for 1..5;
print "manual loop-fusion\n";
with_time { print PDL::a_plus_b_sin_c($a, $b, $c)->info, "\n" } for 1..5;

__DATA__
__Pdlpp__
pp_def('a_plus_b_sin_c',
  Pars => 'a(); b(); c(); [o]d()',
  GenericTypes => ['D'],
  Code => '$d() = $a() + $b() * sin($c());',
);

Results with $N of 60e6 (tuned to put real load on my setup, but not cause meaningful swapping), with the info removed:

with intermediates
1393.23 ms
1402.13 ms
1385.48 ms
1389.91 ms
1375.89 ms
manual loop-fusion
663.921 ms
646.778 ms
639.667 ms
641.872 ms
646.683 ms

In nice round numbers, the manually loop-fused operation above was a little over twice as fast as the naive version. This was also true with lower values of $N.

mohawk2 · 2021-12-31T02:51:36Z

I am merging #354 into this.

In an email in Dec 2015, Chris wrote things that could be incorporated in future PDL*:

ArrayFire GPGPU parallel computing:
http://www.arrayfire.com/docs/index.htm
This could give PDLng a quick start for
high performance using GPUs.

COS, the C Object System:
https://github.com/CObjectSystem/COS/blob/master/doc/cos-draft-dls09.pdf
https://github.com/CObjectSystem/COS
This could give PDLng an object oriented
interface that maintains a C FFI so that
PDLng could call or be called from other
languages. Matplotlib anyone?

HDF5 Attributes:
http://www.hdfgroup.org/HDF5/doc1.6/UG/13_Attributes.html
I was thinking that our PDLng implementation
should cover the bases from FITS, HDF5, named
dimensions,.. Ingo has already done some
work along these lines.

mohawk2 · 2021-12-31T02:53:33Z

Just been re-reading the COS docs. It is primarily about dynamic polymorphism, which I cannot see a great need for in any current or future PDL. The heart of PDL is to write some type-generic Code (with possible Bad, and/or Back analogues) which would do the right thing on any supported PDL types.

A need for PDL that I am starting to see is first-class dimensions that could be referred to in Code, which would allow dimension-updates to be loop-fused (and setdims etc to be a first-class operation rather than underlying API). That would imply RedoDims stuff to also be first-class, for the same reason. A thing that would help both RedoDims and BackCode would be a more mathematical (and potentially therefore reversible) way to express transformations. That suggests a more declarative style, which seems to be antithetical to the imperative style of C.

I am open to the idea that COS could be part of the solution.

Named dimensions (HDF5 etc style) would be a beneficial thing generally, and an "einops" style of expressing things might be part of that reversible dimension-translating.

It seems unlikely, but not impossible, that ArrayFire is going to help us much. However, they do support farming processing out to OpenCL and CUDA, so it could be an alternative to OpenMP. What would be needed is an Alien::ArrayFire similar to Alien::OpenMP, then a generalisation of broadcastloop to make our current pthread-only implementation be just one possibility. Given the language is a bit different from C, that might be challenging, and OpenMP/OpenCL would probably be easier. However, the generalisation would need to happen in any case. Also needed would be to store in the pdl_transvtable the Code etc, for accessing by OpenCL/ArrayFire/whatever.

zmughal · 2022-01-02T01:56:10Z

To expand a bit on the above-mentioned point about "kernels", @zmughal has correctly pointed out this is basically about "loop fusion".

Another optimisation to think about which isn't as well supported within PDL is "loop tiling/blocking". https://www.intel.com/content/www/us/en/developer/articles/technical/loop-optimizations-where-blocks-are-required.html. I noticed it being used as I was looking over the current matmult implementation. There are also cache-oblivious algorithms and libraries that implement more performant versions of these operations for particular sizes and kinds of matrices, so loop tiling may not be as necessary as loop fusion.

mohawk2 · 2022-01-02T02:46:49Z

Another optimisation to think about which isn't as well supported within PDL is "loop tiling/blocking". https://www.intel.com/content/www/us/en/developer/articles/technical/loop-optimizations-where-blocks-are-required.html. I noticed it being used as I was looking over the current matmult implementation. There are also cache-oblivious algorithms and libraries that implement more performant versions of these operations for particular sizes and kinds of matrices, so loop tiling may not be as necessary as loop fusion.

I've read the matmult code in the past, and at the time (for me) it was quite eye-opening! This does bring in the idea of enhancing >1-dimensional loop() constructs to use tiling (probably with a check that the contained $x() lookups all use no specified index-lookups, to avoid any out-of-order sequencing causing problems).

I'm also reading https://www.alcf.anl.gov/sites/default/files/2020-01/OpenMP45_Bertoni.pdf which is a really good summary of OpenMP 4.5 constructs, especially the map stuff. It would be conceivable to use that with a bit of inserted code to calculate tiles, to insert OpenMP #pragmas to achieve tiling.

mohawk2 · 2022-01-03T01:03:48Z

Thinking about how to make axisvalues a proper transformation, including the same for ndcoords. What might help with the need to loop over arbitrary numbers of dims would be a new multidimloop directive, something like (with the benefit of any ndarray "access" benefiting from the "context" so being able to just be $x() instead of spelling out the dimensions in some way):

multidimloop(PARENT, n) %{
  $CHILD() = n;
%}

This doesn't yet help with a proper reshape transformation, though as that is pure Perl of a clump(-1) then a setdims, that just needs a first-class setdims.

A more generalised multiloop is not yet clear to me but seems necessary.

mohawk2 · 2022-01-04T18:42:28Z

#324 acts as a reminder once this gets closer to completion that PDL::Dataflow will want a thorough update since it still has stuff in there about "families" and some quite complicated stuff that I don't think would be very useful.

mohawk2 · 2022-02-19T10:25:44Z

Note to self: when (not if) we complete this, we'd update the PDL::PP docs to revise this:

generate code that takes advantage of the vectorising/parallel computing features of your machine (this a project for the future)

mohawk2 · 2022-03-29T15:00:54Z

Following on from a long-running branch on main PDL (index_ops) which has the bare start of slightly-more-wide implementation of "index operations" (extending the idea of indadd to multiplication, copying, etc) and some discussion on pdl-porters (https://www.mail-archive.com/[email protected]/msg05754.html). That links to the High-Performance Fortran (HPF) 2.0 spec (http://dacnet.rice.edu/versions/hpf2/hpf-v20.pdf) which on pages 91-97 (PDF pages 101-107) discusses "scatter-combine" operations. Those can be considered equivalent to using indexND with plus etc.

indexND is implemented using the range operation, which sort of cheats in PDL terms by not having any Pars at all. This is alluded to in the "Practical Magick" paper.

It occurs to me that indadd, and the above "index operations", and indexND, can be considered a generalisation of "broadcasting". That broadcasts the given operation over the full range of its broadcasted-up inputs. One consequence of this is that read (and writeback with two-way dataflow) routinely copy the whole broadcasted range, which as noted in the docs of range is wasteful.

What we could have in addition is "narrowcasting". It would generate additional read (and if needed writeback) code, possibly in different functions with additional index-array parameters (these might or might not want to be proper ndarrays). This would be parallel with the "vaffine" form of transformation which is a rectangular view onto another ndarray's actual data block.

mohawk2 · 2022-03-29T15:26:03Z

Another idea that occurred while pondering the above can be considered moving in the opposite direction; small and lightweight vs increasingly magnificent abstractions. Currently PDL operations (in particular the XS wrapper functions) if given non-broadcasting inputs, and/or single elements, and/or Perl scalars, do the full conversion of those to ndarrays, then set up a full pdl_trans, then call the full broadcasting apparatus.

This has been noted as being expensive (i.e. slow). An alternative would be to have the XS wrappers have some extra generated code which detects when the full machinery is not required, and interpolates a modified version of the fundamental code (or "kernel"). That could even allow for just using IV/NV without further conversion, and even allow translating $x(n=>n1,m=>m1) into lookups into n-dimensional Perl arrays-of-arrays. One could think of this as making PDL function as a sort of Perl Language for operating on Data (not very catchy, probably needs a better name).

This would allow PDL libraries to compete directly with e.g. https://metacpan.org/pod/Math::GSL

mohawk2 · 2022-04-21T08:40:01Z

As noted in the updated PDL::Dataflow and on this PerlMonks thread, it would be interesting to shift PDL's threading model from "wait while all the threads return" to "start them off then react to their completion" using an event loop.

mohawk2 · 2022-05-22T01:42:29Z

Loop fusion note: using which, index, where, etc should be dealt with appropriately inline, rather than actually generating the index sets in a naive way. Observation from https://perlmonks.org/?node_id=11144070

mohawk2 · 2022-06-07T04:17:18Z

Discussion of parsing Code chunks in order to support MPFR, etc, which might also facilitate translating into OpenCL: https://perlmonks.org/?node_id=11144472

KJ7LNW · 2022-11-01T00:13:37Z

This might be a discussion for PDL::LinearAlgebra, but FYI in case it is relevant:

"The NVBLAS Library is built on top of the cuBLAS Library using only the CUBLASXT API (refer to the CUBLASXT API section of the cuBLAS Documentation for more details). NVBLAS also requires the presence of a CPU BLAS lirbary on the system. Currently NVBLAS intercepts only compute intensive BLAS Level-3 calls (see table below). Depending on the charateristics of those BLAS calls, NVBLAS will redirect the calls to the GPUs present in the system or to CPU. That decision is based on a simple heuristic that estimates if the BLAS call will execute for long enough to amortize the PCI transfers of the input and output data to the GPU"

[ https://docs.nvidia.com/cuda/nvblas/ ]

It currently supports these routines, all others are passed to the system BLAS:

gemm
syrk
herk
syr2k
her2k
trsm
trmm
symm
hemm

zmughal · 2022-12-26T00:13:12Z

As per discussion in IRC, the following are worth looking at:

XLA: Optimizing Compiler for Machine Learning - TensorFlow
Automatic differentiation - Wikipedia
- GitHub - google/jax: Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
- Enzyme AD

mohawk2 · 2024-11-06T03:41:39Z

Coming down to choosing which GPGPU tech to use:

https://github.com/DTolm/VkFFT/tree/develop is an FFT that runs on Vulkan, OpenCL, CUDA, Metal
https://github.com/Aransentin/RT-Engine is raytracing with interop between OpenCL and OpenGL
https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html latest OpenCL shows cl_khr_fp64 required
https://registry.khronos.org/OpenGL/extensions/ARB/ARB_gpu_shader_fp64.txt shows GL_ARB_gpu_shader_fp64 is OpenGL 3.2+
https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VkPhysicalDeviceFeatures.html shows how to look up if a Vulkan device has shaderFloat64

mohawk2 mentioned this issue Dec 31, 2021

Next generation steps identified by CHM #354

Closed

mohawk2 mentioned this issue Dec 31, 2021

Make PDL be a C library with a Perl interface so it can be used from other C (or dynamic language) code #358

Open

10 tasks

mohawk2 mentioned this issue Jan 15, 2022

sequence() fails as instance method #370

Closed

This was referenced Apr 16, 2022

Tweaks moocow-the-bovine/PDL-CCS#5

Merged

segfault during window test with tiling window manager (i3) Perl-GPU/OpenGL-GLUT#8

Open

mohawk2 mentioned this issue May 8, 2022

How to prepare for forking? danaj/Math-Prime-Util#65

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support also doing processing on GPUs #349

Support also doing processing on GPUs #349

mohawk2 commented Nov 14, 2021 •

edited

Loading

mohawk2 commented Nov 15, 2021

mohawk2 commented Nov 15, 2021

karlglazebrook commented Nov 15, 2021 via email

mohawk2 commented Nov 15, 2021

karlglazebrook commented Nov 15, 2021 via email

mohawk2 commented Nov 15, 2021

karlglazebrook commented Nov 17, 2021 via email

mohawk2 commented Nov 17, 2021 •

edited

Loading

karlglazebrook commented Nov 17, 2021 via email

mohawk2 commented Nov 17, 2021

mohawk2 commented Nov 18, 2021 •

edited

Loading

mohawk2 commented Nov 20, 2021

mohawk2 commented Dec 31, 2021

mohawk2 commented Dec 31, 2021 •

edited

Loading

zmughal commented Jan 2, 2022

mohawk2 commented Jan 2, 2022

mohawk2 commented Jan 3, 2022 •

edited

Loading

mohawk2 commented Jan 4, 2022 •

edited

Loading

mohawk2 commented Feb 19, 2022

mohawk2 commented Mar 29, 2022

mohawk2 commented Mar 29, 2022

mohawk2 commented Apr 21, 2022

mohawk2 commented May 22, 2022

mohawk2 commented Jun 7, 2022

KJ7LNW commented Nov 1, 2022

zmughal commented Dec 26, 2022

mohawk2 commented Nov 6, 2024

Support also doing processing on GPUs #349

Support also doing processing on GPUs #349

Comments

mohawk2 commented Nov 14, 2021 • edited Loading

mohawk2 commented Nov 15, 2021

mohawk2 commented Nov 15, 2021

karlglazebrook commented Nov 15, 2021 via email

mohawk2 commented Nov 15, 2021

karlglazebrook commented Nov 15, 2021 via email

mohawk2 commented Nov 15, 2021

karlglazebrook commented Nov 17, 2021 via email

mohawk2 commented Nov 17, 2021 • edited Loading

karlglazebrook commented Nov 17, 2021 via email

mohawk2 commented Nov 17, 2021

mohawk2 commented Nov 18, 2021 • edited Loading

mohawk2 commented Nov 20, 2021

mohawk2 commented Dec 31, 2021

mohawk2 commented Dec 31, 2021 • edited Loading

zmughal commented Jan 2, 2022

mohawk2 commented Jan 2, 2022

mohawk2 commented Jan 3, 2022 • edited Loading

mohawk2 commented Jan 4, 2022 • edited Loading

mohawk2 commented Feb 19, 2022

mohawk2 commented Mar 29, 2022

mohawk2 commented Mar 29, 2022

mohawk2 commented Apr 21, 2022

mohawk2 commented May 22, 2022

mohawk2 commented Jun 7, 2022

KJ7LNW commented Nov 1, 2022

zmughal commented Dec 26, 2022

mohawk2 commented Nov 6, 2024

mohawk2 commented Nov 14, 2021 •

edited

Loading

mohawk2 commented Nov 17, 2021 •

edited

Loading

mohawk2 commented Nov 18, 2021 •

edited

Loading

mohawk2 commented Dec 31, 2021 •

edited

Loading

mohawk2 commented Jan 3, 2022 •

edited

Loading

mohawk2 commented Jan 4, 2022 •

edited

Loading