
Memory requirements #16

Open

jchook opened this issue Dec 10, 2017 · 13 comments

jchook commented Dec 10, 2017

Hello, I am attempting to run this code:

python3 experiment.py --settings_file test

But I am running out of memory (OOM error):

2017-12-09 23:17:18.540786: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ***************************************************************************************************x
2017-12-09 23:17:18.540796: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[3988,3988]
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3988,3988]
	 [[Node: mul_790 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Neg_102, add_467)]]
	 [[Node: truediv_233/_165 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_216_truediv_233", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "experiment.py", line 221, in <module>
    mmd2, that_np = sess.run(mix_rbf_mmd2_and_ratio(eval_test_real, eval_test_sample,biased=False, sigmas=sigma))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3988,3988]
	 [[Node: mul_790 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Neg_102, add_467)]]
	 [[Node: truediv_233/_165 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_216_truediv_233", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'mul_790', defined at:
  File "experiment.py", line 221, in <module>
    mmd2, that_np = sess.run(mix_rbf_mmd2_and_ratio(eval_test_real, eval_test_sample,biased=False, sigmas=sigma))
  File "/home/jchook/dev/RGAN/mmd.py", line 71, in mix_rbf_mmd2_and_ratio
    K_XX, K_XY, K_YY, d = _mix_rbf_kernel(X, Y, sigmas, wts)
  File "/home/jchook/dev/RGAN/mmd.py", line 52, in _mix_rbf_kernel
    K_YY += wt * tf.exp(-gamma * (-2 * YY + c(Y_sqnorms) + r(Y_sqnorms)))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 894, in binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 1117, in _mul_dispatch
    return gen_math_ops._mul(x, y, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2726, in _mul
    "Mul", x=x, y=y, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3988,3988]
	 [[Node: mul_790 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Neg_102, add_467)]]
	 [[Node: truediv_233/_165 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_216_truediv_233", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

What are the minimum GPU memory requirements?

corcra (Collaborator) commented Dec 18, 2017

Sorry for the delayed response - to give you a partial answer, we used GTX 1080s for some of the experiments, and sometimes we ran on the CPU (with 16-32 GB of RAM).

In case it's helpful at all, the particular bit of code you're getting stuck on here originally came from this repository: https://github.com/dougalsutherland/opt-mmd

jchook (Author) commented Jan 7, 2018

I tried this with a 1080 Ti (11 GB of VRAM) and 32 GB of RAM and am still getting the "Out of Memory" error. Here is a full output log.

Is there a parameter in the settings file I can change to reduce the memory requirements?

UPDATE

Yay fixed the issue! Here is what I did in case it helps someone else (or me again haha):

  1. Uninstall TensorFlow (previously installed via pip)
  2. Uninstall CUDA and cuDNN
  3. Re-install CUDA 8.0 and cuDNN 7.0.5 (for CUDA 8) using .deb packages. Note: I installed all 3 cuDNN packages (lib, dev, and doc), then ran all the tests available to ensure I had installed everything properly.
  4. Compile and install TensorFlow from source

Some notes from my TensorFlow configuration in case it's useful:

  • On my distro I had to enter /usr/bin/python3 for my Python path
  • Told it I was using CUDA 8 and cuDNN 7.0.5
  • Used dpkg -L libcudnn7 to find out where the .deb packages installed cuDNN (in my case it was /usr/lib/x86_64-linux-gnu) and entered that path into the config
  • Enabled CUDA, but chose the default for most other "enable [y/N]" steps

jchook closed this as completed Jan 10, 2018

jchook (Author) commented Jan 22, 2018

Dammit. Something happened on reboot that caused the problem to re-appear.

I have completely uninstalled and re-installed various versions of CUDA + cuDNN + NVIDIA drivers + TensorFlow in as many permutations as I thought might work... and I am getting the exact same error every time.

I wrote a custom settings file (based on the MNIST example) with custom data and am also getting the exact same error right around 50 epochs. I really wish I understood this problem. I have also tried varying many of the settings.

jchook reopened this Jan 22, 2018

corcra (Collaborator) commented Jan 22, 2018

What happens if you turn off all MMD-related calculations? You could do this by changing the "if" statement on this line so that it is never true: https://github.com/ratschlab/RGAN/blob/master/experiment.py#L188
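
To make that concrete, roughly this (the flag and loop below are an illustrative sketch, not the actual code in experiment.py):

COMPUTE_MMD = False   # flip to True to re-enable the MMD evaluation
EVAL_FREQ = 1         # how often (in epochs) the evaluation would run

for epoch in range(5):
    # ... the normal training step would run here ...
    if COMPUTE_MMD and epoch % EVAL_FREQ == 0:
        # this is where experiment.py would call mix_rbf_mmd2_and_ratio(...)
        pass
    print(f"epoch {epoch}: trained (MMD evaluation skipped: {not COMPUTE_MMD})")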

corcra (Collaborator) commented Jan 22, 2018

You could also vary the size of the set used in evaluation (which gets fed into the MMD calculation), which is set on this line: https://github.com/ratschlab/RGAN/blob/master/experiment.py#L75. Here batch_multiplier is how many batches' worth of data we want to include in the evaluation set.

The problem with reducing the evaluation set size is that it reduces the accuracy of the MMD calculation, but depending on your use case that may be an acceptable price to pay for the code actually running on your hardware. (I'm assuming based on your error log that the OOM is happening due to the MMD calculation, which is quadratic in the number of samples.)
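
As a rough illustration of why this helps (back-of-the-envelope figures, not measurements of the actual graph):

# Rough size of one dense [n, n] float32 kernel matrix; the MMD graph holds
# several such intermediates (K_XX, K_XY, K_YY plus per-sigma exponentials).
def kernel_matrix_mib(n, bytes_per_float=4):
    return n * n * bytes_per_float / 2**20

for n in (3988, 2000, 1000, 500):
    print(f"n = {n:4d}: ~{kernel_matrix_mib(n):6.1f} MiB per matrix")

Halving the evaluation set size cuts each of those matrices to a quarter of its previous memory.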

jchook (Author) commented Jan 24, 2018

You are a saint! Removing the MMD calculations allowed the script to finish. Thank you.

Reducing eval_size is also working.

The problem with reducing the evaluation set size is that it reduces the accuracy of the MMD calculation...

Does this affect training performance or only post-training evaluation?

corcra (Collaborator) commented Jan 24, 2018

The MMD score is only used for evaluation, so it shouldn't affect training.

The main way it might affect you is that we use the MMD score (on the validation set) to decide when to save model parameters (https://github.com/ratschlab/RGAN/blob/master/experiment.py#L227), so without it you will default to the normal frequency, which is every 50 epochs (https://github.com/ratschlab/RGAN/blob/master/experiment.py#L273).
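
For clarity, roughly what those two behaviours look like (variable names below are illustrative, not the ones in experiment.py):

SAVE_EVERY = 50           # the default frequency mentioned above
use_mmd = False           # True when the MMD evaluation is running
best_mmd = float("inf")

def should_save(epoch, val_mmd=None):
    global best_mmd
    if use_mmd and val_mmd is not None:
        if val_mmd < best_mmd:      # validation MMD improved
            best_mmd = val_mmd
            return True
        return False
    return epoch % SAVE_EVERY == 0  # fixed-frequency fallback

print([epoch for epoch in range(200) if should_save(epoch)])  # -> [0, 50, 100, 150]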

dmortem commented Sep 28, 2018

Hi @corcra,
On the line https://github.com/ratschlab/RGAN/blob/master/experiment.py#L75, what does '5000' mean? Is it the size of the validation set? If the size of my own dataset is less than 5000, should I change this constant?
Thanks!

corcra (Collaborator) commented Sep 28, 2018

Hi @dmortem: yes, 5000 is the (approximate) size of the validation set we use to compute the MMD during training (technically, we use up to 5000 examples, because we use multiples of the batch size). So if your validation set is smaller than this, or if you just want cheaper (but noisier) evaluations, you can change this number.
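
In other words, something like this (illustrative arithmetic only, not the exact expression in experiment.py):

def eval_set_size(n_available, batch_size, cap=5000):
    # use at most `cap` examples, rounded down to a whole number of batches
    n = min(cap, n_available)
    return (n // batch_size) * batch_size

print(eval_set_size(60000, 28))  # cap dominates: 4984 examples
print(eval_set_size(1200, 28))   # small validation set dominates: 1176 examples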

dmortem commented Sep 28, 2018

Thank you for your explanation! @corcra
I notice that when I train the model on my own dataset, 'mmd' and 'that' first become inf after several epochs, and then become nan. I have replaced the constant '5000' with the size of my own training set. Have you ever encountered this problem?

corcra (Collaborator) commented Sep 29, 2018

@dmortem It sounds like you're getting numerical issues/overflow in either the MMD calculation or the t-hat calculation. It might be coming from a few different things, but as a first sanity check you could try inspecting the values of the computed kernel for anything strange (e.g. look at the output of this function: https://github.com/ratschlab/RGAN/blob/master/mmd.py#L21).
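
For example, something along these lines (the helper and the X_eval / Y_eval names are just for illustration; only the _mix_rbf_kernel signature is taken from your traceback):

import numpy as np

def check_kernel(name, k):
    # report non-finite entries and the value range of a kernel matrix
    k = np.asarray(k)
    print(f"{name}: all finite = {np.isfinite(k).all()}, "
          f"min = {k.min():.3g}, max = {k.max():.3g}")

# Hypothetical usage on whatever the kernel function returns, e.g.:
#   K_XX, K_XY, K_YY, d = sess.run(_mix_rbf_kernel(X_eval, Y_eval, sigmas, wts))
#   for name, k in [("K_XX", K_XX), ("K_XY", K_XY), ("K_YY", K_YY)]:
#       check_kernel(name, k)

check_kernel("demo", np.exp(-np.random.rand(4, 4)))  # tiny self-contained demo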

Another thing: 5000 (or whatever other constant you set it to) refers to the validation set in our code. I guess you could use your training data at that point as well, but then you're checking how similar your generated data is to the training data, which may be overly optimistic.

dmortem commented Sep 30, 2018

Thank you @corcra, I will check the values you mentioned.

For the constant '5000', I think it refers to the size of the loaded training set (e.g. MNIST_train.csv), and this set is further divided into a smaller 'training set', a validation set, and a test set with the ratio [0.6, 0.2, 0.2]. According to the line https://github.com/ratschlab/RGAN/blob/master/experiment.py#L77, I think 5000 should be equal to or smaller than the size of the original training set? (In the MNIST case, that would be 60000 or less?)
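
To make my reasoning concrete, here is the arithmetic I have in mind (the 0.6/0.2/0.2 ratios are my reading of the code, so treat this as an assumption):

def split_sizes(n_total, ratios=(0.6, 0.2, 0.2)):
    # split into train / validation / test portions
    return [int(n_total * r) for r in ratios]

print(split_sizes(60000))  # MNIST-style: [36000, 12000, 12000] -> a cap of 5000 fits
print(split_sizes(10000))  # smaller dataset: [6000, 2000, 2000] -> 5000 would be too large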

diogofm commented Apr 12, 2019

Hi guys,
I'm trying to reproduce the paper's experiments as well.
I'm running it with 64 GB of RAM, so it was supposed to run fine without the MMD calculation work-around.
The MNIST dataset isn't that big, and I still can't run it. I'm afraid of trying to run the eICU experiments and hitting the same problem.
Can you suggest anything that I can try?
Which other variables in this script influence the memory usage? batch_size maybe?
I didn't feel comfortable changing eval_size. If you could post a working script, I'd appreciate it.

Thanks in advance.
