Hello,

I profiled the training and it seems that GPU operations and CPU operations are not executed in parallel as they should be, despite the "dataset.prefetch(4)" line in train.py.
Here is a screenshot of what I mean: https://imgur.com/a/tHHQ3OK
So I tried something simple: I converted the dataset into a TFRecordDataset and read from that instead of the current pipeline. The resulting pipeline was 1) much faster and 2) executed in parallel. Here is the new profile: https://imgur.com/a/Vlh2VmG
On a K80 it roughly doubled the pos/s on a 6x64 network (EDIT: after removing the profiler the speedup is 4.7x).
Here is some code to transform into a TFRecordDataset:
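This is only a minimal sketch: the feature names ('planes', 'probs', 'winner') and the helper names are illustrative rather than the project's actual chunk format, and it assumes a TF version where `tf.io` is available (on older 1.x releases the equivalents are `tf.python_io.TFRecordWriter` and `tf.parse_single_example`).

```python
import tensorflow as tf

def serialize_position(planes, probs, winner):
    # Pack one training position into a tf.train.Example.
    # planes / probs / winner are assumed to already be encoded byte strings.
    feature = {
        'planes': tf.train.Feature(bytes_list=tf.train.BytesList(value=[planes])),
        'probs':  tf.train.Feature(bytes_list=tf.train.BytesList(value=[probs])),
        'winner': tf.train.Feature(bytes_list=tf.train.BytesList(value=[winner])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

def write_tfrecord(positions, path):
    # positions: an iterable of (planes, probs, winner) byte-string tuples.
    with tf.io.TFRecordWriter(path) as writer:
        for planes, probs, winner in positions:
            writer.write(serialize_position(planes, probs, winner))
```

And here is a sketch of reading it back with a TFRecordDataset:

```python
def parse_position(serialized):
    # Decode one serialized Example back into its raw byte strings.
    features = {
        'planes': tf.io.FixedLenFeature([], tf.string),
        'probs':  tf.io.FixedLenFeature([], tf.string),
        'winner': tf.io.FixedLenFeature([], tf.string),
    }
    return tf.io.parse_single_example(serialized, features)

dataset = tf.data.TFRecordDataset(['train.tfrecords'])
dataset = dataset.map(parse_position, num_parallel_calls=4)
```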
The reason I am interested in this is that I would like to train very small networks to try different architectures, but for small networks the input pipeline becomes the bottleneck.
I think pre-reading the data and writing it out in TFRecord format is worth it. Or is there a simpler solution?
Do you have any thoughts on that? I have no idea how the current pipeline works.
Thanks
EDIT: Actually, after removing the profiler the gain was even bigger.
With the current pipeline: 773.638 pos/s
With TFRecordDataset: 3661.37 pos/s
This is still on a K80 with a 6x64 network, batch_size 1024 and no batch split.
I'm getting ~5000 pos/s on a single GTX 1080Ti with our current architecture (6 CPU cores with HT)
The TFRecord examples will probably blow up in memory as they are fed into the shuffle buffer, compared to our binary format.
Using TensorFlow's built-in dataset functions is a good way to achieve better parallelism (see the sketch below).
I'm not CPU bottlenecked, but it's very likely that our implementation isn't parallel with respect to CPU / GPU usage. As you said on Discord, you have 2 CPUs, which is likely the issue.
I'm implementing a multi-GPU version that makes better use of the standard TensorFlow API, which should help with this. Thank you for reporting!
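Roughly what I mean by the built-in functions, as a sketch only (not our actual pipeline; it reuses the hypothetical `parse_position` from the snippet above): shuffle the small serialized strings before decoding, so the buffer holds compact records rather than decoded tensors, and let the parallel map plus prefetch provide the CPU/GPU overlap.

```python
# Rough sketch only, not the project's actual pipeline.
# Reuses the hypothetical parse_position() from the earlier snippet;
# filenames is assumed to be a list of .tfrecords paths.
dataset = (tf.data.TFRecordDataset(filenames)
           .shuffle(1 << 16)                           # buffer holds compact serialized records
           .map(parse_position, num_parallel_calls=4)  # decoding runs on several CPU threads
           .batch(1024)
           .prefetch(4))                               # overlaps the input pipeline with GPU steps
```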