OOM error in training process #40

Closed
camel2000 opened this issue Dec 20, 2024 · 2 comments

camel2000 commented Dec 20, 2024

My GPU has 80 GB of memory. How can I avoid this OOM error?

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/camel2000/project/protenix/./runner/train.py", line 574, in <module>
[rank0]:     main()
[rank0]:   File "/home/camel2000/project/protenix/./runner/train.py", line 570, in main
[rank0]:     trainer.run()
[rank0]:   File "/home/camel2000/project/protenix/./runner/train.py", line 541, in run
[rank0]:     self.evaluate()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/camel2000/project/protenix/./runner/train.py", line 329, in evaluate
[rank0]:     self._evaluate()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/camel2000/project/protenix/./runner/train.py", line 373, in _evaluate
[rank0]:     batch, _ = self.model_forward(batch, mode=mode)
[rank0]:   File "/home/camel2000/project/protenix/./runner/train.py", line 283, in model_forward
[rank0]:     batch["pred_dict"], batch["label_dict"], log_dict = self.model(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/camel2000/project/protenix/protenix/model/protenix.py", line 714, in forward
[rank0]:     pred_dict, log_dict, time_tracker = self.main_inference_loop(
[rank0]:   File "/home/camel2000/project/protenix/protenix/model/protenix.py", line 322, in main_inference_loop
[rank0]:     pred_dict, log_dict, time_tracker = self._main_inference_loop(
[rank0]:   File "/home/camel2000/project/protenix/protenix/model/protenix.py", line 416, in _main_inference_loop
[rank0]:     pred_dict["coordinate"] = self.sample_diffusion(
[rank0]:   File "/home/camel2000/project/protenix/protenix/model/protenix.py", line 276, in sample_diffusion
[rank0]:     return autocasting_disable_decorator(self.configs.skip_amp.sample_diffusion)(
[rank0]:   File "/home/camel2000/project/protenix/protenix/utils/torch_utils.py", line 119, in new_func
[rank0]:     return func(
[rank0]:   File "/home/camel2000/project/protenix/protenix/model/generator.py", line 240, in sample_diffusion
[rank0]:     chunk_x_l = _chunk_sample_diffusion(
[rank0]:   File "/home/camel2000/project/protenix/protenix/model/generator.py", line 208, in _chunk_sample_diffusion
[rank0]:     x_denoised = denoise_net(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/camel2000/project/protenix/protenix/model/modules/diffusion.py", line 514, in forward
[rank0]:     r_update = self.f_forward(
[rank0]:   File "/home/camel2000/project/protenix/protenix/model/modules/diffusion.py", line 386, in f_forward
[rank0]:     s_single, z_pair = self.diffusion_conditioning(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/camel2000/project/protenix/protenix/model/modules/diffusion.py", line 115, in forward
[rank0]:     pair_z = self.linear_no_bias_z(self.layernorm_z(pair_z))
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/camel2000/project/protenix/protenix/model/layer_norm/layer_norm.py", line 129, in forward
[rank0]:     return self.kernel_forward(input)
[rank0]:   File "/home/camel2000/project/protenix/protenix/model/layer_norm/layer_norm.py", line 132, in kernel_forward
[rank0]:     return FusedLayerNormAffineFunction.apply(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:   File "/home/camel2000/project/protenix/protenix/model/layer_norm/layer_norm.py", line 66, in forward
[rank0]:     output, mean, invvar = fastfold_layer_norm_cuda.forward_affine(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 21.97 GiB. GPU

zhangyuxuann (Collaborator) commented Dec 21, 2024

Hi @camel2000, can you share the command you used for training, or is it the train_demo.sh in this repo? cc @yangyanpinghpc. Judging from the error message, I suspect the evaluation set contains a sample with a relatively large number of tokens. We recommend using an evaluation dataset whose samples have fewer than 1536 tokens, both for efficiency and to avoid OOM. Can you check whether the token filter has been added for the posebusters dataset?
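
As a rough illustration of the suggestion above, here is a minimal sketch of a token-count filter, plus a back-of-the-envelope estimate of why long samples are expensive. Everything in it is an assumption for illustration: `EvalSample`, `filter_by_token_count`, and `pair_tensor_gib` are hypothetical names, not Protenix APIs, and the channel count of 128 and 4-byte elements are assumed values. The actual filter in the repo is set through its dataset configuration, which is not shown in this thread.

```python
from dataclasses import dataclass

MAX_EVAL_TOKENS = 1536  # threshold recommended in the comment above


@dataclass
class EvalSample:
    """Hypothetical stand-in for one evaluation entry; not a Protenix class."""
    name: str
    num_tokens: int


def filter_by_token_count(samples, max_tokens=MAX_EVAL_TOKENS):
    """Keep only samples with fewer than max_tokens tokens."""
    return [s for s in samples if s.num_tokens < max_tokens]


def pair_tensor_gib(num_tokens, channels=128, bytes_per_element=4):
    """Rough size in GiB of one [N, N, C] pair tensor.

    channels=128 and 4-byte elements are assumptions for illustration.
    """
    return num_tokens * num_tokens * channels * bytes_per_element / 2**30


samples = [
    EvalSample("small_complex", 400),
    EvalSample("large_complex", 5000),
]
kept = filter_by_token_count(samples)
print([s.name for s in kept])   # only the short sample survives the filter
print(pair_tensor_gib(1536))    # size of one pair tensor at the threshold
```

Because a pair tensor grows with the square of the token count, a single very long evaluation sample can require a much larger allocation than anything seen during training, which would be consistent with the failure appearing inside `evaluate()`.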

camel2000 (Author) commented

@zhangyuxuann Thanks, the problem is solved.
