Logbook
Marco Antognini edited this page Jun 3, 2013
This is my logbook for this project. You'll find pretty much everything I read here. For code progression, see git log.
- MacBookPro10,1; Intel Core i7-3720QM @ 2.60 GHz with NVIDIA GeForce GT 650M
- OS X 10.8
- Homebrew
- Scala 2.10 via `brew`
- Java 1.7.0_15
- Thrust version master
- followed part of Heterogeneous Parallel Programming
- Installation of NVIDIA CUDA 5.0 on Config #1
- Download page
- Getting started guide for OS X
- requires MPI
- installed mpich2 with brew: `brew install mpich2 --enable-shared`
- Run examples on Config #1
- deviceQuery
```
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GT 650M"
  CUDA Driver Version / Runtime Version          5.0 / 5.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1024 MBytes (1073414144 bytes)
  ( 2) Multiprocessors x (192) CUDA Cores/MP:    384 CUDA Cores
  GPU Clock rate:                                900 MHz (0.90 GHz)
  Memory Clock rate:                             2508 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GT 650M
```
Note: some results look strange
- bandwidthTest
```
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GT 650M
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
  Transfer Size (Bytes)   Bandwidth(MB/s)
  33554432                5994.2

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
  Transfer Size (Bytes)   Bandwidth(MB/s)
  33554432                6212.6

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
  Transfer Size (Bytes)   Bandwidth(MB/s)
  33554432                40200.3
```
- Go through C++AMP and its subpages
- Read about Scala Parallel Collections – Measuring Performance
- DONE: Find an alternative to the deprecated `scala.testing.Benchmark` for Scala 2.10? -> Scalameter
- Read about Anatomy of a flawed microbenchmark.
- Read about Java theory and practice: Dynamic compilation and performance measurement
- Found Nvidia's Get Started - Parallel Computing and Nvidia's GPU-Accelerated Libraries
- Thrust is a general-purpose, high-level, GPU-accelerated C++ library.
- Github repository
- Go through Quick Start Guide
- Requires `set -Ux DYLD_LIBRARY_PATH /Developer/NVIDIA/CUDA-5.0/lib` on Config #1 to be run from Terminal.
- See this and that to set DYLD_LIBRARY_PATH for graphical apps.
- Thrust v1.5 was installed with CUDA 5 on Config #1.
- Current version is 1.6 and current progress can be found here.
- Documentation
- FAQ
- Went quickly through OpenCL 1.2 Spec
- OpenCL is waaay too low level/complex to be approached in a very short time.
- Also, the C++ wrapper doesn't look better.
- Read Microbenchmarking C++, C#, and Java; stats are old; nothing fancy
- Looking for an accurate way to micro benchmark small code paths written in C++ and running on Linux/OSX: a good source for performance analysis tools
- Micro-Benchmarking Done Wrong, And For The Wrong Reasons: not the most relevant article for this project but highlights some pitfalls.
- Read Microbenchmark of SSE in C++ and Java; not very relevant here; mostly IO cache issues / SSE optimisations.
- Read about several scientific topics which could be interesting for the benchmarks.
- Artificial neural network
- Mandelbrot set
- Brute-force search
- Basic Local Alignment Search Tool (BLAST) and other Sequence alignment algorithms (Needleman–Wunsch and Smith–Waterman)
- Genetic algorithm, Evolutionary algorithm & Genetic programming
- Other Life Science topics (boids, ...)
- Some interesting Bachelor projects that can be used for inspiration:
- Improving Parallel Graph Processing through introduction of Parallel Collections: Page Rank algorithm
- Parallel Natural Language Processing Algorithms in Scala: Naive Bayes + Maximum Entropy
- An Expectation Maximization Algorithm for Gaussian Mixture Models: Expectation Maximization Algorithm
- An interesting reading about Scala/CUDA actors
- Implement basic Mandelbrot algorithm in C++11
- 2000x2000 set requires ~30sec on Config #1
- Implement parallel Mandelbrot algorithm in C++98 with Thrust
- 2000x2000 set requires ~3sec on Config #1
- Scalameter is a good alternative to `scala.testing.Benchmark`.
- Define more precise objectives for the project:
- Most benchmarks will be implemented with:
- A. C++98 or C++11,
- B. C++98 with Thrust,
- C. Scala 2.10,
- D. Scala 2.10 with parallel collection,
- E. and Scala with the new parallel collection.
- The following topics will be used for the benchmarks:
- Implement the Mandelbrot set with different regions to test load balancing.
- Implement a basic Monte Carlo algorithm to compute π/4, to test the parallelism of random number generation and `count_if`.
- Implement a basic triangular matrix multiplication to test load balancing.
- Implement brute-force TSP
- Implement brute-force text segmentation
- Implement a genetic algorithm to find the best candidate according to a fitness function.
- Implement a few microbenchmarks for basic operations like map, count, reduce, sort, ...
- Register for intense GPU computing service at Access
- Read about Scalameter
- While losing my hair on a simple matrix multiplication with Thrust, I found this nice article: Thrust: A Productivity-Oriented Library for CUDA
- Matrix multiplication:
- It appears infeasible to implement matrix multiplication with Thrust alone.
- See the bottom of the roadmap.
- However, there is another library built on top of Thrust that handles matrix operations (and more): cusplibrary.
- Changed SBT_OPTS as follows to prevent out-of-memory exceptions with MB:

```
> cat ~/.sbtconfig
SBT_OPTS="-Xmx8G -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=8G -Xss2M"
```
- An interesting issue about Thrust's syntax and this EXTREMELY COOL article preaching for ranges over iterators in C++
- Currently it's not possible to invoke Thrust algos from the GPU
- Multi GPUs or cluster support are not yet available.
- Debugging session:
- Got some crashes with Thrust 1.5 (some hidden out-of-memory errors.. classic!). Let's update to 1.6.
- Thrust 1.6 doesn't change a thing.. however, there is something strange about this memory error: it occurs when copying the data back from the GPU to the CPU...
- Thrust 1.6 fires a lot of warnings.. switching to the latest development version -> master.
- Apparently using the `-g` option of `nvcc` is bad!
- List of people using Thrust
- Genetic Algorithm:
- TSP:
- Parallel algorithm - Solving the Traveling Salesman Problem using Branch and Bound: good with threads but not really collections.
- Recursive algorithm - Stack Overflow: Brute force algorithm for the Traveling Salesman Problem in Java
- GPU algorithm - High Performance GPU Accelerated Local Optimization in TSP: a paper on CUDA implementations of TSP.
- GPU algorithm - A PARALLEL ALGORITHM FOR FLIGHT ROUTE PLANNING ON GPU USING CUDA: another paper, algorithm based on GA.
- Read about boxing and how to avoid it.
- Java profiler: `jvisualvm`
- Read about audio computing on GPU: Realtime GPU Audio
- The C++ GA implementation for solving equations is more precise than Mathematica!
- Mandelbrot is now fully implemented.
- At first sight, the new parallel collections seem much faster! A more detailed analysis is still required though.
- Updated Thrust to the latest master version.
- AOS vs SOA by example with thrust
- A few points that should be discussed in the final report (brainstorming):
- language complexity,
- framework(s) complexity,
- debugging complexity,
- compilation time,
- development time/cost,
- material cost,
- absolute performance,
- relative performance (according to some of the above topics)
- code scalability,
- software complexity
- kind of software