Memory requirements #16
Sorry for the delayed response. To give you a partial answer: we use GTX 1080s for some of the experiments, and sometimes we used the CPU (with 16-32GB of RAM). In case it's helpful at all, the particular bit of code you're getting stuck on here originally came from this repository: https://github.com/dougalsutherland/opt-mmd
I tried this with a 1080Ti (11GB of VRAM) and 32GB of RAM and was still getting an "Out of Memory" error. Here is a full output log. Is there a parameter in the settings file I can change to reduce the memory requirements? UPDATE: Yay, fixed the issue! Here is what I did in case it helps someone else (or me again haha):
Some notes from my tensorflow configuration in case it's useful:
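The actual configuration notes were not preserved in this thread. For illustration only, a commonly used TensorFlow 1.x setting that stops the process from pre-allocating all GPU memory up front looks like this (hypothetical; not necessarily what the commenter used):

```python
import tensorflow as tf  # TensorFlow 1.x API

# Let TensorFlow allocate GPU memory on demand instead of
# reserving the whole card at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory this process may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

sess = tf.Session(config=config)
```

This only changes how memory is reserved; it will not help if a single tensor (such as a large kernel matrix) genuinely exceeds the available VRAM.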
Dammit. Something happened on reboot that caused the problem to reappear. I have completely uninstalled and reinstalled various versions of CUDA + cuDNN + Nvidia drivers + TensorFlow in as many permutations as I thought might work, and I am getting the exact same error every time. I wrote a custom settings file (based on the mnist example) with custom data and am also getting the exact same error right around 50 epochs. I really wish I understood this problem. I have also tried varying many of the settings.
What happens if you turn off all MMD-related calculations? You could do this by setting the "if" statement on this line: https://github.com/ratschlab/RGAN/blob/master/experiment.py#L188 to never be true.
You could also vary the size of the set used in evaluation (which gets fed into the MMD calculation), which is set on this line: https://github.com/ratschlab/RGAN/blob/master/experiment.py#L75 The problem with reducing the evaluation set size is that it reduces the accuracy of the MMD calculation, but depending on your use case that may be an acceptable price to pay for the code actually running on your hardware. (I'm assuming based on your error log that the OOM is happening due to the MMD calculation, which is quadratic in the number of samples.)
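To see why the MMD calculation is quadratic in memory, here is a minimal NumPy sketch of a biased RBF-kernel MMD estimate (a simplification of the opt-mmd code referenced above, not the repository's exact implementation):

```python
import numpy as np

def rbf_kernel_matrix(X, Y, sigma=1.0):
    # Pairwise squared distances; this (n, m) matrix is what blows up memory.
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    # Biased MMD^2 estimate; needs three full kernel matrices at once.
    Kxx = rbf_kernel_matrix(X, X, sigma)
    Kyy = rbf_kernel_matrix(Y, Y, sigma)
    Kxy = rbf_kernel_matrix(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

# With 5000 evaluation samples, each kernel matrix is 5000 x 5000 float64:
# 3 * 5000**2 * 8 bytes is roughly 600 MB for the kernels alone, before
# any intermediate distance tensors the framework materialises.
```

Halving the evaluation set size therefore cuts the kernel-matrix memory by a factor of four, at the cost of a noisier MMD estimate.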
You are a saint! Removing the MMD calculations allowed the script to finish. Thank you. Reducing
Does this affect training performance or only post-training evaluation?
The MMD score is only used for evaluation, so it shouldn't affect training. The main way it might affect you is that we use the MMD score (on the validation set) to decide when to save model parameters (https://github.com/ratschlab/RGAN/blob/master/experiment.py#L227), so without it you will default to the normal frequency, which is every 50 epochs (https://github.com/ratschlab/RGAN/blob/master/experiment.py#L273).
Hi,
Hi @dmortem: yes, 5000 is the (approximate) size of the validation set we use to compute the MMD during training (technically, we use up to 5000 examples, because we use multiples of the batch size). So if your validation set is smaller than this, or if you just want to have cheaper (but noisier) evaluations, you can change this number.
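The "multiples of the batch size" detail above amounts to rounding the requested size down to the largest usable multiple. A one-line sketch (variable names are illustrative, not the repository's):

```python
batch_size = 28   # hypothetical value from a settings file
requested = 5000  # the constant discussed in this thread

# Largest multiple of batch_size that does not exceed the request.
eval_size = (requested // batch_size) * batch_size
print(eval_size)  # 4984 for a batch size of 28
```

So the effective evaluation set can be slightly smaller than the constant you set, and it must not exceed the number of validation examples you actually have.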
Thank you for your explanation! @corcra |
@dmortem It sounds like you're getting numerical issues/overflow in either the MMD calculation or the t-hat calculation. I guess it might be coming from different things, but as a first sanity check you could try checking the values of the computed kernel for strange things (e.g. look at the output of this function: https://github.com/ratschlab/RGAN/blob/master/mmd.py#L21). Another thing: 5000 (or whatever other constant you set it to) is referring to the validation set in our code. I guess you could use your training data at that point as well, but then you're checking how similar your generated data is to the training data, which may be overly optimistic.
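The sanity check suggested above can be done with a small helper. This is a sketch, not code from the repository; it assumes an RBF-style kernel whose entries should be finite and lie in [0, 1]:

```python
import numpy as np

def check_kernel(K):
    # Flag the usual suspects behind MMD / t-hat blow-ups: NaN, inf,
    # and values outside the [0, 1] range expected of an RBF kernel.
    problems = []
    if not np.all(np.isfinite(K)):
        problems.append("non-finite entries")
    elif K.min() < 0 or K.max() > 1:
        problems.append("values outside [0, 1]")
    return problems
```

Running this on the matrix returned by the kernel function linked above (after pulling it out of the graph, e.g. via a session run) would quickly show whether the overflow originates in the kernel itself or later, in the t-hat statistic.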
Thank you @corcra, I will check the values you mentioned. As for the constant '5000': I think it refers to the size of the original training set (e.g. MNIST_train.csv), which is then further divided into another 'training set', a validation set, and a test set with the ratio [0.6, 0.2, 0.2]. According to the line https://github.com/ratschlab/RGAN/blob/master/experiment.py#L77, I think 5000 should be at most the size of the original training set? (In the MNIST case, that would be 60000 or less?)
Hi guys, Thanks in advance.
Hello, I am attempting to run this code:
python3 experiment.py --settings_file test
But I am running out of memory (OOM error):
What are the minimum GPU memory requirements?