Logbook

Marco Antognini edited this page Jun 3, 2013 · 80 revisions

This is my logbook for this project. You'll find pretty much everything I read here. For code progression, see git log.

Configurations

Config #1

GPU stats & CPU stats

Logs

January 2013

19/02/2013

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 650M"
  CUDA Driver Version / Runtime Version          5.0 / 5.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1024 MBytes (1073414144 bytes)
  ( 2) Multiprocessors x (192) CUDA Cores/MP:    384 CUDA Cores
  GPU Clock rate:                                900 MHz (0.90 GHz)
  Memory Clock rate:                             2508 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GT 650M

Note: some results look strange

  • bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GT 650M
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			5994.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6212.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			40200.3

20/02/2013

22/02/2013

26/02/2013

05/03/2013

07/03/2013

11-24/03/2013

27/03/2013

  • Implement basic Mandelbrot algorithm in C++11
    • 2000x2000 set requires ~30sec on Config #1
  • Implement parallel Mandelbrot algorithm in C++98 with Thrust
    • 2000x2000 set requires ~3sec on Config #1
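For reference, the sequential variant can be sketched roughly as below. This is not the project's actual code, only a minimal escape-time sketch in C++11; the names `escape_time` and `mandelbrot`, the bounds parameters and the row-major output layout are all assumptions made for illustration. It also shows why the workload is unbalanced: points inside the set run for `max_iter` iterations, while points far outside escape after one or two.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of iterations of z = z^2 + c performed
// before |z| exceeds 2, capped at max_iter.
inline unsigned escape_time(std::complex<double> c, unsigned max_iter) {
    std::complex<double> z(0.0, 0.0);
    unsigned n = 0;
    // std::norm returns |z|^2, so compare against 4 instead of 2.
    while (n < max_iter && std::norm(z) <= 4.0) {
        z = z * z + c;
        ++n;
    }
    return n;
}

// Hypothetical driver: evaluates escape_time on a width x height grid
// covering [x_min, x_max] x [y_min, y_max]; result is stored row-major.
std::vector<unsigned> mandelbrot(std::size_t width, std::size_t height,
                                 double x_min, double x_max,
                                 double y_min, double y_max,
                                 unsigned max_iter) {
    assert(width > 1 && height > 1);
    std::vector<unsigned> out(width * height);
    const double dx = (x_max - x_min) / (width - 1);
    const double dy = (y_max - y_min) / (height - 1);
    for (std::size_t row = 0; row < height; ++row) {
        for (std::size_t col = 0; col < width; ++col) {
            const std::complex<double> c(x_min + col * dx, y_min + row * dy);
            out[row * width + col] = escape_time(c, max_iter);
        }
    }
    return out;
}
```

Since every pixel is independent, the double loop maps naturally onto a parallel transform over pixel indices, which is presumably what a Thrust version would do.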

28/03/2013

  • ScalaMeter is a good alternative to scala.benchmark.
  • Define more precise objectives for the project:
    • Most benchmarks will be implemented with:
      • A. C++98 or C++11,
      • B. C++98 with Thrust,
      • C. Scala 2.10,
      • D. Scala 2.10 with parallel collections,
      • E. and Scala with the new parallel collections.
    • The following topics will be used as benchmarks:
      1. Implement the Mandelbrot set with different sets to test load balancing.
      2. Implement a basic Monte Carlo algorithm to compute π/4, to test the parallelism of random number generation and count_if.
      3. Implement a basic triangular matrix multiplication to test load balancing.
      4. Implement brute force TSP
      5. Implement brute force text segmentation
      6. Implement a genetic algorithm to find the best value according to a fitness value.
      7. Implement a few microbenchmarks for basic operations such as map, count, reduce, sort, ...
    • Register for intense GPU computing service at Access
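As an illustration of benchmark 2 above, here is a minimal sequential sketch in C++11 of the Monte Carlo estimate of π/4 using `std::count_if`. It is only an assumption of what such a benchmark could look like, not the project's implementation: the function name `estimate_quarter_pi`, the choice of `std::mt19937` and the explicit seed parameter are all hypothetical. Points are drawn uniformly in the unit square, and the fraction that lands inside the quarter unit circle approximates π/4.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical benchmark kernel: estimates pi/4 from `samples` random points.
double estimate_quarter_pi(std::size_t samples, unsigned seed) {
    assert(samples > 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    // Step 1: generate the random points (x, y) in the unit square.
    std::vector<std::pair<double, double> > points(samples);
    std::generate(points.begin(), points.end(),
                  [&]() -> std::pair<double, double> {
                      const double x = dist(gen);
                      const double y = dist(gen);
                      return std::make_pair(x, y);
                  });

    // Step 2: count the points that fall inside the quarter unit circle.
    const auto inside = std::count_if(
        points.begin(), points.end(),
        [](const std::pair<double, double>& p) {
            return p.first * p.first + p.second * p.second <= 1.0;
        });

    return static_cast<double>(inside) / static_cast<double>(samples);
}
```

The two steps are kept separate on purpose: the generation step is the hard one to parallelise because the generator carries state, whereas the counting step is a plain reduction.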

29/03/2013

02/04/2013

04/04/2013

  • Matrix multiplication:
    • Computing a matrix multiplication with Thrust definitely does not appear feasible.
    • See bottom of the roadmap.
    • However, there is another library built on top of Thrust that handles matrix operations (and more): cusplibrary.
  • Changed SBT_OPTS as follows to prevent out-of-memory exceptions with MB:
> cat ~/.sbtconfig
SBT_OPTS="-Xmx8G -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=8G -Xss2M"

05/04/2013

06/04/2013

08/04/2013

17/04/2013

08/05/2013

11/05/2013

  • The C++ GA implementation for solving equations is more precise than Mathematica!

12/05/2013

  • Mandelbrot is now fully implemented.
  • At first sight, the new parallel collections seem much faster! A more detailed analysis is still required, though.

15/05/2013

16/05/2013

  • A few points that should be discussed in the final report (brainstorming) :
    • language complexity,
    • framework(s) complexity,
    • debugging complexity,
    • compilation time,
    • development time/cost,
    • material cost,
    • absolute performance,
    • relative performance (according to some of the above topics)
    • code scalability,
    • software complexity
    • kind of software

31/05/2013

02/06/2013

03/06/2013