CUDA out of memory #8

Open
mapengsen opened this issue May 10, 2024 · 3 comments

Comments

@mapengsen

When the dataset is larger, the evaluation raises this error:

Datasets files num is: 1344726
Datasets path is: /root/autodl-tmp/MDT/log_chekpoints/sampleImage/2024_05_10_9
Datasets files num is: 50000
Traceback (most recent call last):
  File "evaluations/fld/eval_image.py", line 81, in <module>
    main()
  File "evaluations/fld/eval_image.py", line 54, in main
    Precision_value = PrecisionRecall(mode="Precision").compute_metric(train_feat, None, gen_feat) # Default precision
  File "/root/autodl-tmp/MDT/evaluations/fld/fld/metrics/PrecisionRecall.py", line 57, in compute_metric
    return self.pct_in_manifold(gen_feat, train_feat).item()
  File "/root/autodl-tmp/MDT/evaluations/fld/fld/metrics/PrecisionRecall.py", line 33, in pct_in_manifold
    nn_dists = self.get_nn_dists(manifold_feat)
  File "/root/autodl-tmp/MDT/evaluations/fld/fld/metrics/PrecisionRecall.py", line 24, in get_nn_dists
    curr_dists = torch.cdist(feat[start:end], feat)
  File "/root/miniconda3/envs/MDT/lib/python3.8/site-packages/torch/functional.py", line 1315, in cdist
    return _VF.cdist(x1, x2, p, None) # type: ignore[attr-defined]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.10 GiB. GPU 0 has a total capacty of 23.70 GiB of which 996.56 MiB is free. Process 148558 has 22.72 GiB memory in use. Of the allocated memory 21.07 GiB is allocated by PyTorch, and 216.20 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@marcojira
Owner

For precision, the distance computation is batched for the gen_feat but not for the train_feat. Does it work if you take a subset of your train_feat?
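As a rough illustration of the two options here (subsetting `train_feat`, or batching the distance computation along both axes), the sketch below may help. It is not the code in `PrecisionRecall.py`: the helper names `subsample_rows` and `kth_nn_dists_chunked`, the chunk sizes, and the default `k=5` are all assumptions made for the example. For context, the 50.10 GiB request in the traceback is consistent with one float32 block of about 10,000 x 1,344,726 distances, which is an inference from the numbers rather than something stated in the thread.

```python
import torch


def subsample_rows(feat: torch.Tensor, max_rows: int, seed: int = 0) -> torch.Tensor:
    """Return at most max_rows randomly chosen rows of feat (no replacement)."""
    if feat.shape[0] <= max_rows:
        return feat
    gen = torch.Generator().manual_seed(seed)
    idx = torch.randperm(feat.shape[0], generator=gen)[:max_rows]
    return feat[idx.to(feat.device)]


def kth_nn_dists_chunked(
    feat: torch.Tensor, k: int = 5, row_chunk: int = 1024, col_chunk: int = 8192
) -> torch.Tensor:
    """Distance from each row of feat to its k-th nearest other row.

    Hypothetical helper: at most row_chunk x (col_chunk + k + 1) distances
    are held in memory at once, instead of row_chunk x len(feat).
    """
    n = feat.shape[0]
    out = torch.empty(n, dtype=feat.dtype, device=feat.device)
    for r0 in range(0, n, row_chunk):
        rows = feat[r0:r0 + row_chunk]
        # Running k+1 smallest distances per row; the extra slot absorbs
        # the (near-)zero distance of each point to itself.
        best = torch.full(
            (rows.shape[0], k + 1), float("inf"),
            dtype=feat.dtype, device=feat.device,
        )
        for c0 in range(0, n, col_chunk):
            d = torch.cdist(rows, feat[c0:c0 + col_chunk])
            cand = torch.cat([best, d], dim=1)
            # topk with largest=False returns values sorted ascending.
            best = torch.topk(cand, k + 1, dim=1, largest=False).values
        out[r0:r0 + rows.shape[0]] = best[:, k]
    return out
```

Subsampling is the smaller change: something like `train_feat = subsample_rows(train_feat, 50000)` before calling `compute_metric` keeps the reference set the same size as the generated set. The chunked variant trades speed for memory and would only be needed if the full training set has to be used.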

@mapengsen
Author

Now recall comes out as 0. Why is that? Is this normal?

@marcojira
Owner

That would be unlikely unless your generated data has very low variance or is very out of distribution.
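One quick way to check for either condition is to compare simple statistics of the two feature sets. The snippet below is only a heuristic sketch: `feature_stats_report` is a hypothetical helper, not part of the library, and it assumes `train_feat` and `gen_feat` are 2-D tensors of shape `(num_samples, feature_dim)`.

```python
import torch


def feature_stats_report(train_feat: torch.Tensor, gen_feat: torch.Tensor) -> dict:
    """Compare spread and location of generated vs. training features."""
    train_var = train_feat.var(dim=0)
    gen_var = gen_feat.var(dim=0)
    # Distance between the feature means, in units of the overall
    # training standard deviation.
    mean_shift = (train_feat.mean(dim=0) - gen_feat.mean(dim=0)).norm()
    return {
        # Much smaller than 1 suggests the generated data has low variance.
        "var_ratio": (gen_var.mean() / train_var.mean()).item(),
        # Much larger than 1 suggests the generated data is shifted
        # away from the training distribution.
        "mean_shift_in_train_std": (mean_shift / train_var.sum().sqrt()).item(),
    }
```

A `var_ratio` near 1 together with a small mean shift would not rule out other causes, but values far from those would point toward the explanations above.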
