zhaotianjing edited this page Feb 11, 2019 · 32 revisions

In Julia, the following operations call OpenBLAS, which enables parallel (multithreaded) computing:

  1. matrix * matrix (*)

  2. matrix * vector (*)

  3. dot (dot())

  • The maximum number of cores openblas allows is #.
  • distributed for loop
  • dot() is sped up by BLAS: dot() is itself a BLAS function.
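
As a minimal sketch of the three operations listed above, the snippet below runs each one on dense Float64 arrays, the case in which Julia normally hands the work to BLAS. The sizes and variable names are hypothetical and chosen only for illustration. The explicit BLAS.dot call is included to show that it gives the same value as dot().

```julia
using LinearAlgebra

# Hypothetical sizes and random data, chosen only for illustration.
n, p = 200, 50
A = rand(n, p)
B = rand(p, n)
x = rand(p)
u = rand(n)
v = rand(n)

C = A * B          # 1. matrix * matrix
w = A * x          # 2. matrix * vector
d = dot(u, v)      # 3. dot product through LinearAlgebra.dot

# The same dot product through the explicit BLAS wrapper:
# (length, first vector, stride, second vector, stride).
d_blas = BLAS.dot(n, u, 1, v, 1)

println(d ≈ d_blas)
```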

Progress:

  1. Changed . to @.; the speed is almost the same.

  2. test:

    X'y

    BLAS.gemv('T',X,y) #<-faster

  3. test:

    Xa*[mu;α]: where Xa is an UpperTriangular matrix.

    BLAS.trmv('U', 'N', 'N', Xa, [mu;α]): where Xa is a normal Array.

    Result: same speed (because Julia uses BLAS.trmv for UpperTriangular matrices)

  4. test:

    Xa'ya: where Xa is an UpperTriangular matrix.

    BLAS.trmv('U', 'T', 'N', Xa, ya): where Xa is a normal Array.

    Result: same speed (because Julia uses BLAS.trmv for UpperTriangular matrices)

  5. Speed up the Cholesky decomposition with LAPACK:

    LAPACK.potrf!('U', BB)

    Xa = UpperTriangular(BB)

  6. Speed up computing the max eigenvalue with LAPACK:

    tmp = muX'muX

    LAPACK.syev!('N', 'U', tmp)[end]

  7. Test BLAS on Windows (IIBLMM_BLAS.jl).

    BLAS.vendor() -> openblas64

    set_num_threads(1) : 60s

    set_num_threads(2) : 57s

    set_num_threads(3) : 57s

    set_num_threads(4) : 62s

    set_num_threads(8) : 57s

  8. test BLAS on server(farm).

    BLAS.vendor() ->

  9. test BLAS on server(Gausi).
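
The comparisons in items 2 to 4 can be reproduced roughly as follows. This is only a sketch: the dimensions and random data are assumptions, and b stands in for [mu;α]. It checks that the direct BLAS calls return the same values as the high-level expressions, which should be confirmed before timing them (for example with @elapsed or @time).

```julia
using LinearAlgebra

# Hypothetical dimensions and random data, for illustration only.
n, p = 300, 40
X = rand(n, p)
y = rand(n)

# Item 2: X'y computed two ways.
r1 = X' * y
r2 = BLAS.gemv('T', X, y)

# Items 3 and 4: triangular products computed two ways.
BB = triu(rand(p, p))        # normal Array with zeros below the diagonal
Xa = UpperTriangular(BB)     # triangular wrapper around the same data
b  = rand(p)                 # stands in for [mu;α]
ya = rand(p)

t1 = Xa * b
t2 = BLAS.trmv('U', 'N', 'N', BB, b)     # upper, no transpose, non-unit diagonal

s1 = Xa' * ya
s2 = BLAS.trmv('U', 'T', 'N', BB, ya)    # upper, transpose, non-unit diagonal

println(r1 ≈ r2, " ", t1 ≈ t2, " ", s1 ≈ s2)
```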

Next step:

  1. BLAS is multithreaded. How can it be sped up by using more threads?

    On Windows, changing the number of BLAS threads seems to make no difference. This may be a problem with my laptop, so I need to test on a server (try both the gausi and farm servers).
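
One way to separate the effect of the thread setting from the rest of the program is to time a single large BLAS call in isolation. The sketch below does that for a dense matrix product. The matrix size and the thread counts tried are assumptions, not values taken from IIBLMM_BLAS.jl, and the timings will depend on the machine.

```julia
using LinearAlgebra

# Hypothetical workload: one dense matrix product.
n = 1000
A = rand(n, n)
B = rand(n, n)
A * B                                # warm-up run so compilation is not timed

timings = Dict{Int,Float64}()
for t in (1, 2, 4)
    BLAS.set_num_threads(t)          # request t BLAS threads
    timings[t] = @elapsed A * B      # seconds for one product
end

for t in sort(collect(keys(timings)))
    println("threads = ", t, ": ", round(timings[t], digits = 3), " s")
end
```

If the isolated call speeds up with more threads but the full run does not, the run time is probably dominated by code that is not inside large BLAS calls.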

Ideas on paper:

  1. Xa is an UpperTriangular matrix, so we can use a BLAS function to get faster speed than with a normal matrix. In fact, Julia will use the triangular BLAS routine (BLAS.trmv) if the input matrix is UpperTriangular.

    If Xa is a normal matrix, time: 93s

    If Xa is UpperTriangular, time: 58s

    If Xa is a normal matrix + BLAS, time: 57s

  2. Correct typo in paper.

  3. BLAS/LAPACK functions can also be used to get the max eigenvalue and do the Cholesky decomposition. (In the paper, you only mentioned the GPU as a way to speed these up.)
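
A minimal sketch of the Cholesky and max-eigenvalue steps through Julia's LAPACK wrappers is given below. The data are random stand-ins: muX and BB here are hypothetical matrices, and adding I is only an assumption to keep the example matrix positive definite. Both LAPACK calls overwrite their input, so copies are passed in.

```julia
using LinearAlgebra

# Hypothetical data standing in for the matrices in the notes.
n, p = 200, 30
muX = rand(n, p)

# Cholesky decomposition of a symmetric positive definite matrix.
BB = muX' * muX + I                  # adding I keeps it positive definite
F, info = LAPACK.potrf!('U', copy(BB))
Xa = UpperTriangular(F)              # only the upper triangle holds the factor

# Largest eigenvalue of the symmetric matrix muX'muX.
tmp = muX' * muX
vals = LAPACK.syev!('N', 'U', copy(tmp))   # eigenvalues only, ascending order
maxeig = vals[end]

println(info == 0, " ", Xa' * Xa ≈ BB)
```

The info value returned by potrf! is 0 on success, so it is worth checking before using the factor.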

BLAS:

1.8 Q Is number of thread limited?

   Basically, there is no limitation about number of threads. You
   can specify number of threads as many as you want, but larger
   number of threads will consume extra resource. I recommend you to
   specify minimum number of threads.