misc

The following operations call OpenBLAS in Julia, which enables multithreaded computation:

matrix * matrix (*)
matrix * vector (*)
dot (dot())

The maximum number of cores openblas allows is #.
distributed for loop
dot() is sped up by BLAS, because dot is itself a BLAS function.
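As a rough check of the list above, the sketch below compares the high-level operators with the explicit BLAS wrappers. The array names and sizes are made up for illustration, and BLAS.get_num_threads() assumes Julia 1.6 or later.

```julia
using LinearAlgebra

# Hypothetical test data, not from the project.
A = rand(200, 300)
B = rand(300, 100)
x = rand(300)
y = rand(300)

# High-level operators; for dense Float64 arrays Julia forwards these to BLAS.
C_high = A * B
v_high = A * x
d_high = dot(x, y)

# The same operations through the explicit BLAS wrappers.
C_blas = BLAS.gemm('N', 'N', A, B)         # matrix * matrix
v_blas = BLAS.gemv('N', A, x)              # matrix * vector
d_blas = BLAS.dot(length(x), x, 1, y, 1)   # dot product, unit strides

println("results agree: ", C_high ≈ C_blas && v_high ≈ v_blas && d_high ≈ d_blas)
println("BLAS threads in use: ", BLAS.get_num_threads())
```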

Progress:

Changed . to @.: speed is almost the same.
test:

X'y

BLAS.gemv('T',X,y) #<-faster
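A minimal way to repeat this comparison is sketched below. X and y here are random stand-ins, not the project data, and the timings depend on the Julia version, the matrix size, and the BLAS thread count, so the gap may not reproduce everywhere.

```julia
using LinearAlgebra

# Hypothetical stand-ins for the real X and y.
X = rand(2000, 500)
y = rand(2000)

r_op   = X' * y                    # operator form
r_blas = BLAS.gemv('T', X, y)      # explicit BLAS call computing X'y

# Warm-up is done by the two calls above; now time repeated calls.
t_op   = @elapsed for _ in 1:100; X' * y; end
t_blas = @elapsed for _ in 1:100; BLAS.gemv('T', X, y); end

println("same result: ", r_op ≈ r_blas)
println("X'y: ", t_op, " s, BLAS.gemv: ", t_blas, " s")
```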
test:

Xa*[mu;α]: when Xa is UpperTriangular Array.

BLAS.trmv('U', 'N', 'N', Xa, [mu;α]): where Xa is normal Array.

Result: same speed (because Julia uses BLAS.trmv for an UpperTriangular matrix)
test:

Xa'ya: when Xa is UpperTriangular Array.

BLAS.trmv('U', 'T', 'N', Xa, ya): where Xa is normal Array

Result: same speed (because Julia uses BLAS.trmv for an UpperTriangular matrix)
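The two triangular tests can be checked for correctness with a sketch like the one below. The vector b is a hypothetical stand-in for [mu;α] and ya, and Xa is random upper-triangular data rather than the real matrix.

```julia
using LinearAlgebra

n  = 500
Xa = triu(rand(n, n))          # plain Array with a zero lower triangle
b  = rand(n)                   # stand-in for [mu;α] or ya
U  = UpperTriangular(Xa)       # wrapper type, shares storage with Xa

# Non-transposed product: Xa * b
p_wrap = U * b
p_blas = BLAS.trmv('U', 'N', 'N', Xa, b)

# Transposed product: Xa' * b
q_wrap = U' * b
q_blas = BLAS.trmv('U', 'T', 'N', Xa, b)

println("N case agrees: ", p_wrap ≈ p_blas)
println("T case agrees: ", q_wrap ≈ q_blas)
```

BLAS.trmv reads only the triangle named by its first argument, so the plain Array does not need the wrapper type to get the triangular routine.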
speed up Cholesky decomposition with LAPACK (potrf! is a LAPACK routine, not BLAS):

LAPACK.potrf!('U', BB)

Xa = UpperTriangular(BB)
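A self-contained version of these two lines might look like the sketch below. BB is built here from random data so that it is symmetric positive definite; the info check is an addition, since potrf! reports failure through its return value.

```julia
using LinearAlgebra

# Hypothetical symmetric positive definite matrix.
Z   = rand(300, 50)
BB  = Z'Z + I
BB0 = copy(BB)                       # potrf! overwrites BB, keep a copy to verify

# potrf! returns the overwritten matrix and an info code (0 means success).
_, info = LAPACK.potrf!('U', BB)
info == 0 || error("matrix is not positive definite, info = $info")

# Only the upper triangle of BB holds the factor; the wrapper ignores the rest.
Xa = UpperTriangular(BB)

println("Xa'Xa reproduces BB: ", Xa' * Xa ≈ BB0)
```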
speed up computing the max eigenvalue with LAPACK (syev! is a LAPACK routine, not BLAS):

tmp = muX'muX

LAPACK.syev!('N', 'U', tmp)[end]
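The sketch below runs the same call on random data and compares it with eigmax. muX is a made-up matrix here. syev! returns the eigenvalues in ascending order, which is why [end] picks the largest, and it overwrites tmp, so the reference value is computed on a copy first.

```julia
using LinearAlgebra

# Hypothetical stand-in for the real muX.
muX = rand(400, 60)
tmp = muX'muX                               # symmetric 60x60 matrix

# Reference value computed on a copy, because syev! destroys its input.
reference = eigmax(Symmetric(copy(tmp)))

# 'N' = eigenvalues only, 'U' = read the upper triangle.
lambda_max = LAPACK.syev!('N', 'U', tmp)[end]

println("matches eigmax: ", lambda_max ≈ reference)
```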
test BLAS on Windows (IIBLMM_BLAS.jl).

BLAS.vendor() -> openblas64

set_num_threads(1) : 60s

set_num_threads(2) : 57s

set_num_threads(3) : 57s

set_num_threads(4) : 62s

set_num_threads(8) : 57s
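A small harness for this kind of thread sweep is sketched below. It times one large matrix product instead of the full IIBLMM_BLAS.jl run, so it only shows whether the BLAS library itself scales on a given machine. The matrix size and thread counts are arbitrary choices, and BLAS.get_num_threads() assumes Julia 1.6 or later.

```julia
using LinearAlgebra

# Hypothetical workload: one dense matrix product.
A = rand(1000, 1000)
B = rand(1000, 1000)
A * B                                   # warm-up so compilation is not timed

original = BLAS.get_num_threads()       # remember the current setting
timings  = Dict{Int,Float64}()

for n in (1, 2, 4)
    BLAS.set_num_threads(n)
    timings[n] = @elapsed A * B         # seconds for one product
end

BLAS.set_num_threads(original)          # restore the previous setting

for n in sort(collect(keys(timings)))
    println("threads = ", n, ": ", timings[n], " s")
end
```

If this product speeds up with more threads while the full run does not, one possible reading is that the full run spends most of its time outside large BLAS calls; that would need profiling to confirm.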
test BLAS on server(farm).

BLAS.vendor() ->
test BLAS on server(Gausi).

Next step:

BLAS is multithreaded. How can it be sped up by setting more threads?

On Windows, setting different numbers of BLAS threads seems to make no difference. This may be a problem with my laptop. I need to test on a server (try both the gausi and farm servers).

Ideas on paper:

Xa is an UpperTriangular matrix, so BLAS functions can run faster on it than on a normal matrix. In fact, Julia calls BLAS automatically when the input matrix is UpperTriangular.

If Xa is a normal matrix, time: 93s

If Xa is UpperTriangular, time: 58s

If Xa is a normal matrix + explicit BLAS call, time: 57s
Correct typo in paper.
BLAS/LAPACK functions can also be used to get the max eigenvalue and do the Cholesky decomposition. (In the paper, you only mentioned the GPU as a way to speed these up.)

BLAS:

1.8 Q Is number of thread limited?

   Basically, there is no limitation about number of threads. You
   can specify number of threads as many as you want, but larger
   number of threads will consume extra resource. I recommend you to
   specify minimum number of threads.