Roundtrip performance issue #465
-
I'm testing out a recurrent network architecture called Star, and it requires me to call into torch a whole bunch of times. I hadn't realized it, but that round trip is incredibly slow: my test takes 4.1s (TorchSharp) compared to 0.8s in pytorch. This simulates my load:
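Roughly along these lines (a simplified sketch, not my exact benchmark; it just hammers small in-place ops the way a recurrent cell stepped thousands of times does, assuming TorchSharp's `torch.rand(long[])` and `Tensor.add_`):

```csharp
using System;
using System.Diagnostics;
using static TorchSharp.torch;

// Tight loop of small tensor ops, one managed -> native round trip each,
// roughly what a recurrent cell stepped thousands of times ends up doing.
var x = rand(new long[] { 64, 64 });
var y = rand(new long[] { 64, 64 });

var sw = Stopwatch.StartNew();
for (int i = 0; i < 100_000; i++)
{
    x.add_(y);   // in-place add, just to generate many cheap native calls
}
sw.Stop();
Console.WriteLine($"Elapsed={sw.Elapsed.TotalSeconds:0.0}s");
```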
Elapsed=4.1s
And in pytorch:
Finished in 0.7994 seconds
Any ideas, or will I have to write tight loops in C/C++?
/m
-
Turns out my assumption was false: I thought the time delay was related to P/Invoke and marshalling, but it seems it isn't. If I short-circuit the call in THSTensor_add_ then it's super fast:
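The measurement loop looked roughly like this (a sketch, not the exact code: the DllImport signature is my paraphrase of TorchSharp's internal declaration, and the native body of THSTensor_add_ was gutted to return immediately, which is why the result is meaningless):

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using static TorchSharp.torch;

var x = rand(new long[] { 64, 64 });
var y = rand(new long[] { 64, 64 });

var sw = Stopwatch.StartNew();
for (int i = 0; i < 470_000; i++)
{
    // With the native body short-circuited, this times only the
    // managed -> native transition, not the actual add.
    Native.THSTensor_add_(x.Handle, y.Handle, IntPtr.Zero);
}
sw.Stop();
Console.WriteLine($"=> {sw.Elapsed.TotalSeconds:0.0000}s");

static class Native
{
    // Assumed shape of the internal declaration; it is not public in TorchSharp.
    [DllImport("LibTorchSharp")]
    public static extern IntPtr THSTensor_add_(IntPtr tensor, IntPtr other, IntPtr alpha);
}
```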
=> 0,0096s
(I hacked THSTensor_add_ to be public so that I only measure the P/Invoke time.)
The code obviously doesn't work any more, but it performs 470k calls in 0,01 seconds. If add_ itself really were extremely slow, then it should be super slow when running in torch as well, shouldn't it? I'm thinking the delay is somewhere later in the chain; perhaps I'm stuck with some kind of debug binaries? But the package is slow too, and the TorchSharp git project is three times slower than the package.
I haven't been able to run the tests in Release mode in the TorchSharp git version. It fails with:
build.proj(53, 5): [MSB3073] The command ""C:\Dev\TorchSharp\src\Native\build.cmd" Release x64 --libtorchpath C:\Dev\TorchSharp\bin/obj/AnyCPU.Release\libtorch-cpu\libtorch-win-shared-with-deps-1.10.0cpu\libtorch\share\cmake\Torch" exited with code 1.
Is there a synchronize being called at every step of the way, somewhere?
cheers,
-
Are you sure your Python code is also running on CUDA?
If not, then that'd be expected (there's a lot of latency when dealing with GPU, whereas the CPU path is just a C-call away).
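One quick way to see what you're actually comparing is to print the device on both sides. On the TorchSharp end that could look roughly like this (a sketch, assuming the usual torch.cuda.is_available(), Tensor.device and Tensor.cuda() members):

```csharp
using System;
using static TorchSharp.torch;

// Print where the tensors actually live; per-call latency on CUDA is
// dominated by launch/sync overhead rather than the op itself.
Console.WriteLine($"CUDA available: {cuda.is_available()}");

var t = rand(new long[] { 8, 8 });
Console.WriteLine($"Tensor device: {t.device}");       // e.g. cpu

if (cuda.is_available())
{
    var g = t.cuda();                                   // explicit move to the GPU
    Console.WriteLine($"After .cuda(): {g.device}");    // e.g. cuda:0
}
```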