Logbook
Marco Antognini edited this page Jun 3, 2013
This is my logbook for this project. You'll find pretty much everything I read here. For code progression, see git log.
- MacBookPro10,1; Intel Core i7-3720QM @ 2.60 GHz with NVIDIA GeForce GT 650M
- OS X 10.8
- Homebrew
- Scala 2.10 via `brew`
- Java 1.7.0_15
- Thrust version master
- followed part of Heterogeneous Parallel Programming
- Installation of NVIDIA CUDA 5.0 on Config #1
- Download page
- Getting started guide for OS X
- requires MPI
- installed mpich2 with brew: `brew install mpich2 --enable-shared`
- Run examples on Config #1
- deviceQuery
```
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GT 650M"
  CUDA Driver Version / Runtime Version          5.0 / 5.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1024 MBytes (1073414144 bytes)
  ( 2) Multiprocessors x (192) CUDA Cores/MP:    384 CUDA Cores
  GPU Clock rate:                                900 MHz (0.90 GHz)
  Memory Clock rate:                             2508 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GT 650M
```
Note: some results look strange
- bandwidthTest
```
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GT 650M
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
  Transfer Size (Bytes)   Bandwidth(MB/s)
  33554432                5994.2

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
  Transfer Size (Bytes)   Bandwidth(MB/s)
  33554432                6212.6

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
  Transfer Size (Bytes)   Bandwidth(MB/s)
  33554432                40200.3
```
- Go through C++AMP and its subpages
- Read about Scala Parallel Collections – Measuring Performance
- DONE: Find an alternative to the deprecated `scala.testing.Benchmark` for Scala 2.10? -> Scalameter
- Read about Anatomy of a flawed microbenchmark.
- Read about Java theory and practice: Dynamic compilation and performance measurement
- Found Nvidia's Get Started - Parallel Computing and Nvidia's GPU-Accelerated Libraries
- Thrust is a general-purpose, high-level, GPU-accelerated C++ library.
- Github repository
- Go through Quick Start Guide
- Requires `set -Ux DYLD_LIBRARY_PATH /Developer/NVIDIA/CUDA-5.0/lib` on Config #1 to be run from Terminal.
- See this and that to set DYLD_LIBRARY_PATH for graphical apps.
- Thrust v1.5 was installed with CUDA 5 on Config #1.
- Current version is 1.6 and current progress can be found here.
- Documentation
- FAQ
- Went quickly through OpenCL 1.2 Spec
- OpenCL is waaay too low level/complex to be approached in a very short time.
- Also, the C++ wrapper doesn't look better.
- Read Microbenchmarking C++, C#, and Java; stats are old; nothing fancy
- Looking for an accurate way to micro benchmark small code paths written in C++ and running on Linux/OSX: a good source for performance analysis tools
- Micro-Benchmarking Done Wrong, And For The Wrong Reasons: not the most relevant article for this project but highlights some pitfalls.
- Read Microbenchmark of SSE in C++ and Java; not very relevant here; mostly IO cache issues / SSE optimisations.
- Read about several scientific topics which could be interesting for the benchmarks.
- Artificial neural network
- Mandelbrot set
- Brute-force search
- Basic Local Alignment Search Tool (BLAST) and other Sequence alignment algorithms (Needleman–Wunsch and Smith–Waterman)
- Genetic algorithm, Evolutionary algorithm & Genetic programming
- Other Life Science topics (boids, ...)
- Some interesting Bachelor projects that can be used for inspiration:
- Improving Parallel Graph Processing through introduction of Parallel Collections: Page Rank algorithm
- Parallel Natural Language Processing Algorithms in Scala: Naive Bayes + Maximum Entropy
- An Expectation Maximization Algorithm for Gaussian Mixture Models: Expectation Maximization Algorithm
- An interesting reading about Scala/CUDA actors
- Implement basic Mandelbrot algorithm in C++11
- 2000x2000 set requires ~30sec on Config #1
- Implement parallel Mandelbrot algorithm in C++98 with Thrust
- 2000x2000 set requires ~3sec on Config #1
- Scalameter is a good alternative to `scala.testing.Benchmark`.
- Define more precise objectives for the project:
- Most benchmarks will be implemented with:
- A. C++98 or C++11,
- B. C++98 with Thrust,
- C. Scala 2.10,
- D. Scala 2.10 with parallel collection,
- E. and Scala with the new parallel collection.
- The following topics will be used for the benchmarks:
- Implement the Mandelbrot set with different regions to test load balancing.
- Implement a basic Monte Carlo algorithm to compute π/4, to test the parallelism of random number generation and `count_if`.
- Implement a basic triangular matrix multiplication to test load balancing.
- Implement brute-force TSP
- Implement brute-force text segmentation
- Implement a genetic algorithm to find the best candidate according to a fitness function.
- Implement a few microbenchmarks for basic operations like map, count, reduce, sort, ...
- Register for intense GPU computing service at Access
- Read about Scalameter
- While losing my hair on a simple matrix multiplication with Thrust, I found this nice article: Thrust: A Productivity-Oriented Library for CUDA
- Matrix multiplication:
- It appears infeasible to implement matrix multiplication with Thrust alone.
- See the bottom of the roadmap.
- However, there is another library built on top of Thrust that handles matrix operations (and more): cusplibrary.
- Changed SBT_OPTS as follows to prevent out-of-memory exceptions with MB:

```
> cat ~/.sbtconfig
SBT_OPTS="-Xmx8G -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=8G -Xss2M"
```
- An interesting issue about Thrust's syntax and this EXTREMELY COOL article preaching for ranges over iterators in C++
- Currently it's not possible to invoke Thrust algos from the GPU
- Multi GPUs or cluster support are not yet available.
- Debugging session:
- Got some crashes with Thrust 1.5 (some hidden out-of-memory errors.. classic!). Let's update to 1.6.
- Thrust 1.6 doesn't change a thing.. however, there is something strange about this memory error: it occurs when copying the data back from the GPU to the CPU...
- Thrust 1.6 fires a lot of warnings.. switching to the latest development version -> master.
- Apparently using the `-g` option of `nvcc` is bad!
- List of people using Thrust
- Genetic Algorithm:
- TSP:
- Parallel algorithm - Solving the Traveling Salesman Problem using Branch and Bound: good with threads but not really collections.
- Recursive algorithm - Stack Overflow: Brute force algorithm for the Traveling Salesman Problem in Java
- GPU algorithm - High Performance GPU Accelerated Local Optimization in TSP: a paper on CUDA implementations of TSP.
- GPU algorithm - A PARALLEL ALGORITHM FOR FLIGHT ROUTE PLANNING ON GPU USING CUDA: another paper, algorithm based on GA.
- Read about boxing and how to avoid it.
- Java profiler: `jvisualvm`
- Read about audio computing on GPU: Realtime GPU Audio
- The C++ GA implementation for solving equations is more precise than Mathematica!
- Mandelbrot is now fully implemented.
- At first sight, the new parallel collections seem much faster! A more detailed analysis is still required though.
- Updated Thrust to the latest master version.
- AOS vs SOA by example with thrust
- A few points that should be discussed in the final report (brainstorming):
- language complexity,
- framework(s) complexity,
- debugging complexity,
- compilation time,
- development time/cost,
- material cost,
- absolute performance,
- relative performance (according to some of the above topics)
- code scalability,
- software complexity
- kind of software