Roundtrip performance issue #465
-
I'm testing out a recurrent network architecture called Star, and it requires me to call into torch a whole bunch of times. I hadn't realized it, but that round trip is incredibly slow: my test takes 4.1s (TorchSharp) compared to 0.8s in pytorch. This simulates my load:
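Roughly along these lines (a simplified sketch, not my exact benchmark; it just hammers small in-place ops the way a recurrent cell stepped thousands of times does, assuming TorchSharp's `torch.rand(long[])` and `Tensor.add_`):

```csharp
using System;
using System.Diagnostics;
using static TorchSharp.torch;

// Tight loop of small tensor ops, one managed -> native round trip each,
// roughly what a recurrent cell stepped thousands of times ends up doing.
var x = rand(new long[] { 64, 64 });
var y = rand(new long[] { 64, 64 });

var sw = Stopwatch.StartNew();
for (int i = 0; i < 100_000; i++)
{
    x.add_(y);   // in-place add, just to generate many cheap native calls
}
sw.Stop();
Console.WriteLine($"Elapsed={sw.Elapsed.TotalSeconds:0.0}s");
```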
Elapsed=4.1s
And in pytorch:
Finished in 0.7994 seconds
Any ideas, or will I have to write tight loops in C/C++?
/m
-
Turns out my assumption was false: I thought the time delay was related to P/Invoke and marshalling, but it seems it isn't. If I short-circuit the call in THSTensor_add_ then it's super fast:
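The measurement loop looked roughly like this (a sketch, not the exact code: the DllImport signature is my paraphrase of TorchSharp's internal declaration, and the native body of THSTensor_add_ was gutted to return immediately, which is why the result is meaningless):

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using static TorchSharp.torch;

var x = rand(new long[] { 64, 64 });
var y = rand(new long[] { 64, 64 });

var sw = Stopwatch.StartNew();
for (int i = 0; i < 470_000; i++)
{
    // With the native body short-circuited, this times only the
    // managed -> native transition, not the actual add.
    Native.THSTensor_add_(x.Handle, y.Handle, IntPtr.Zero);
}
sw.Stop();
Console.WriteLine($"=> {sw.Elapsed.TotalSeconds:0.0000}s");

static class Native
{
    // Assumed shape of the internal declaration; it is not public in TorchSharp.
    [DllImport("LibTorchSharp")]
    public static extern IntPtr THSTensor_add_(IntPtr tensor, IntPtr other, IntPtr alpha);
}
```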
=> 0,0096s
(I hacked THSTensor_add_ to be public so that I only measure the P/Invoke time.)
The code obviously doesn't work any more, but it performs 470k calls in 0,01 seconds. If add_ itself really were extremely slow, then it should be super slow when running in torch as well, shouldn't it? I'm thinking the delay is somewhere later in the chain; perhaps I'm stuck with some kind of debug binaries? But the package is slow too, and the TorchSharp git project is three times slower than the package.
I haven't been able to run the tests in Release mode in the TorchSharp git version. It fails with:
build.proj(53, 5): [MSB3073] The command ""C:\Dev\TorchSharp\src\Native\build.cmd" Release x64 --libtorchpath C:\Dev\TorchSharp\bin/obj/AnyCPU.Release\libtorch-cpu\libtorch-win-shared-with-deps-1.10.0cpu\libtorch\share\cmake\Torch" exited with code 1.
Is there a synchronize being called at every step of the way, somewhere?
cheers,
-
Are you sure your Python code is also running on CUDA?
If not, then that'd be expected (there's a lot of latency when dealing with GPU, whereas the CPU path is just a C-call away).
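One quick way to see what you're actually comparing is to print the device on both sides. On the TorchSharp end that could look roughly like this (a sketch, assuming the usual torch.cuda.is_available(), Tensor.device and Tensor.cuda() members):

```csharp
using System;
using static TorchSharp.torch;

// Print where the tensors actually live; per-call latency on CUDA is
// dominated by launch/sync overhead rather than the op itself.
Console.WriteLine($"CUDA available: {cuda.is_available()}");

var t = rand(new long[] { 8, 8 });
Console.WriteLine($"Tensor device: {t.device}");       // e.g. cpu

if (cuda.is_available())
{
    var g = t.cuda();                                   // explicit move to the GPU
    Console.WriteLine($"After .cuda(): {g.device}");    // e.g. cuda:0
}
```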