GPUs
Marcelo Forets edited this page Nov 13, 2020 · 2 revisions
- XSpeed: presentation slides, webpage.
Two main approaches to leveraging GPUs:
- High-level approach:
  - Use a standard sequential algorithm (computing the support function in a given direction, looping over all directions) that calls GPU libraries for heavy operations such as matrix computations.
  - The gain here is limited by Amdahl's law and by host-device memory transfers.
- Low-level approach:
  - Decompose the problem (computing the support function in a given direction, in parallel over all directions) and execute the algorithm as a kernel on the GPU.
  - If there is no thread divergence and little memory transfer, the speedup is linear in the number of GPU cores. This is where the major gain lies.
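The Amdahl's law bound on the high-level approach can be made concrete with a small sketch (the 95% parallel fraction and the core count below are illustrative assumptions, not figures from this page):

```python
# Amdahl's law: if a fraction p of the runtime is parallelizable over
# n cores, the overall speedup is bounded by 1 / ((1 - p) + p / n).
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Illustrative numbers: even with 95% of the work offloaded to
# thousands of GPU cores, the remaining sequential 5% caps the
# overall speedup near 1 / 0.05 = 20x.
print(round(amdahl_speedup(0.95, 4096), 1))  # -> 19.9
```

This is why the low-level approach, which parallelizes the whole computation rather than offloading individual heavy operations, is the one with linear scaling potential.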
Challenges to computing the support function in a kernel:
- Potential obstacle: the SIMD architecture. Operations are parallel only over data: all cores in a block either execute the same instruction or wait.
- The performance gain on the GPU hinges directly on thread divergence: conditional statements drastically slow down the GPU, since threads in the same block must wait for each other until they reach the same instruction again.
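To see why the support function fits the SIMD model well, here is a minimal sketch (plain Python standing in for a GPU kernel; the vertex representation and the example box are illustrative assumptions). For a polytope given by its vertices, rho(d) = max over vertices v of <v, d>, and each direction can be evaluated independently by the same branch-free expression:

```python
# Support function of a polytope given by its vertex list:
# rho(d) = max over vertices v of the inner product <v, d>.
def support(vertices, d):
    return max(sum(vi * di for vi, di in zip(v, d)) for v in vertices)

def support_batch(vertices, directions):
    # Each direction is independent. In a low-level GPU implementation,
    # one thread per direction would evaluate this same expression, so
    # all threads in a block execute identical instructions on
    # different data -- no divergence, only a data-parallel reduction.
    return [support(vertices, d) for d in directions]

# Unit box in 2D with vertices (+-1, +-1); its support function in
# direction d is |d1| + |d2|.
box = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
dirs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(support_batch(box, dirs))  # -> [1.0, 1.0, 2.0]
```

The `max` reduction contains no data-dependent branching, which is what keeps all threads of a block on the same instruction stream.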