Hello!
I am using the spark-tensorflow-distributor package to run TensorFlow jobs on our Spark-on-YARN 3-node cluster. We are running another cluster with the exact same specs, but using native TensorFlow distribution instead of Spark-on-YARN. Both clusters feature 64-core CPUs, 188 GB of usable RAM, and 12 GPUs with 10 GB of RAM each.

Both clusters run Python 3.7.3 with tensorflow==2.4.1. The Spark cluster additionally has spark-tensorflow-distributor==0.1.0 installed.

To get some insight into the performance differences, we trained the ResNet152 network on the CIFAR-10 dataset on both clusters, since both are included out of the box in the TF packages. I'll attach the code below.
Although we are using the exact same code on both clusters, with the same dataset and the same network, the Spark run consumes far more RAM than the one distributed by TF itself: the Spark run initially climbs to about 137 GB of RAM and stays there most of the time (with peaks of 148 GB), while the TF-distributed run starts at around 17 GB and peaks at only 28 GB.

Everything else we compared (GPU memory, GPU utilization, CPU usage, network I/O, etc.) looks roughly comparable between the two, but the RAM usage differs dramatically. With a bigger dataset, the Spark run even overflows the RAM at some point during training, causing an EOF exception, while the natively distributed run uses only about 50 GB of RAM and finishes without problems.
This is the code I am using:
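Roughly, it follows the MirroredStrategyRunner pattern from the spark-tensorflow-distributor README; in the sketch below, the batch size, epoch count, and num_slots are placeholder assumptions rather than the exact values from our runs:

```python
# Sketch of the Spark-distributed training job; BATCH_SIZE, EPOCHS and
# num_slots are placeholder assumptions, not the exact values we used.
from spark_tensorflow_distributor import MirroredStrategyRunner


def train():
    import tensorflow as tf

    BATCH_SIZE = 128   # assumption
    EPOCHS = 10        # assumption

    # CIFAR-10 and ResNet152 both ship with the TF packages.
    (x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.cast(x_train, tf.float32) / 255.0, y_train)
    ).shuffle(10000).batch(BATCH_SIZE)

    # Shard the data across the workers that the runner starts.
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = \
        tf.data.experimental.AutoShardPolicy.DATA
    dataset = dataset.with_options(options)

    # MirroredStrategyRunner runs this function inside a
    # MultiWorkerMirroredStrategy scope, so the model is simply
    # built and fitted here.
    model = tf.keras.applications.ResNet152(
        weights=None, input_shape=(32, 32, 3), classes=10
    )
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    model.fit(dataset, epochs=EPOCHS)


# One slot per GPU across the 3 nodes (12 GPUs total); also an assumption.
MirroredStrategyRunner(num_slots=12, use_gpu=True).run(train)
```

On the second cluster, the equivalent training is launched through TensorFlow's own multi-worker distribution instead of the runner.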
Any clue about this strange behaviour, or what might be causing it?
Many thanks in advance! :-)