"Bad_malloc" in train_2D.py after followed the instruction. #82

Open
paperplane03 opened this issue Jan 16, 2025 · 14 comments

@paperplane03

Dear staff,
I updated the repo and followed the instructions, but I ran into a bad_alloc problem:
$ python train_2D.py
Training device: cuda
Reading csv file...
Constructing datasets...
Training datasets: 100%|██████████████████████████████████████████████████████████████████████████████| 231/231 [00:00<00:00, 1087.41it/s]
Validation datasets: 100%|██████████████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 2241.66it/s]
Verifying datasets...
Training datasets: 100%|███████████████████████████████████████████████████████████████████████████████| 231/231 [00:00<00:00, 687.97it/s]
Validation datasets: 100%|██████████████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 1000.40it/s]
Training:   0%| | 0/1000 [00:00<?, ?it/s]terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
I didn't run into this before. Is there a recent update that could be causing the error?

@rhoadesScholar
Member

@paperplane03 My guess is that you are running out of memory. Try lowering the batch_size in train_2D.py, and let us know if that solves the issue.
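For reference, a minimal sketch of where that change lands, using a stand-in TensorDataset instead of the challenge's actual dataset class (the variable names below are illustrative, not taken from train_2D.py):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in train_2D.py this would be the CellMap training dataset.
dummy = TensorDataset(torch.zeros(64, 1, 128, 128), torch.zeros(64, 1, 128, 128))

loader = DataLoader(
    dummy,
    batch_size=1,   # try 1 or 2 instead of a larger default
    shuffle=True,
    num_workers=0,  # each worker process adds its own host-memory overhead
)

for x, y in loader:
    break  # a training step would go here
```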

@rhoadesScholar rhoadesScholar self-assigned this Jan 16, 2025
@rhoadesScholar rhoadesScholar added the Participant Issue label (a code issue/bug encountered by challenge participant(s)) Jan 16, 2025
@paperplane03
Author

paperplane03 commented Jan 19, 2025

I changed the batch_size to 1, but I still get the error.
I started debugging the code and added two lines in train.py:

[screenshot: the two debug lines added in train.py]
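The screenshot is not reproduced here; a sketch of debug lines of that kind, with a toy stand-in since the actual variable holding the datasets in train.py is an assumption:

```python
# Hypothetical debug lines; in train.py, `datasets` would be the list of
# per-crop datasets built before training. A toy stand-in is used here.
datasets = [range(392), range(998118)]
print(f"length of dataset: {[len(d) for d in datasets]}")
```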
The output is:
length of dataset: [43218, 43218, 65268, 392, 392, 492, 336, 336, 336, 336, 336, 336, 432, 336, 336, 336, 336, 432, 316, 158, 316, 316, 316, 316, 48400, 404, 404, 28116, 28116, 998118, 10771848, 392, 1429218, 392, 998118, 998118, 392, 998118, 241968, 998118, 2130568, 392, 392, 998118, 2130568, 998118, 998118, 252, 516, 516, 648, 516, 922383, 1310660, 280, 160, 816642, 516, 648, 67914, 280, 392, 232, 392, 392, 392, 392, 276, 448, 448, 184, 220, 448, 65268, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 392, 392, 392, 392, 8713818, 392, 392, 392, 65268, 392, 392, 392, 8713818, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 392, 392, 392, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 392, 388, 392, 392, 376, 392, 392, 392, 340, 340, 340, 340, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 420, 328, 328, 328, 328, 328, 420, 328, 50752, 244, 328, 357773300, 6961400880, 6961400880, 357773300, 29175236416, 3437263280, 313022916, 59770888880, 121581528880, 7973970480, 29848040880, 59770888880, 59587609536, 59770888880, 206428376880, 206428376880, 533237680, 204888768, 1363045680]

I noticed some of the lengths are far too large (especially the last elements), so I printed the last dataset; it is:

[screenshot: the last dataset object]

However, the default data fetch does not download that data; the 's0' folder in 'jrc_mus-liver-3/recon-1/em/' is empty:
[screenshot: the empty 's0' folder]

That's the problem I found.

@paperplane03
Author

Does this mean that if I only download the 36.7 GB of data, it will not include all the data needed to train the example model?

@Luchixiang

I met the same problem on our 4x4090 server, even with batch_size = 1 in train_2D.py.

@rhoadesScholar
Member

Some of the datasets are indeed incredibly large. I am looking into the issue of the data not downloading for jrc_mus-liver-3 at the moment to see if this is related. Thank you for your patience.

@rhoadesScholar
Member

@paperplane03 and @Luchixiang, what OS is each of you using?

@Luchixiang

Ubuntu x86_64 with ~100 GB of free memory.

@paperplane03
Author

@rhoadesScholar We use Ubuntu x86_64. Thank you for the help!

@rhoadesScholar
Member

Does this mean that if I only download the 36.7 GB of data, it will not include all the data needed to train the example model?

Regarding the empty fibsem-uint8/s0 folders: if you do not opt to download all resolutions of the raw data (csc fetch-data -all-res), you will only download the raw data up to the resolution matching the associated annotations. This aims to save space, as downloading all of the full-resolution data takes around 600 GB.
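As an illustration only (the local path and multiscale layout below are assumptions, not taken from the challenge tooling), a quick way to see which scale levels were actually fetched is to list the group's subdirectories:

```python
# Hypothetical check: list which scale levels (s0, s1, ...) exist locally.
# The path below is an assumption about where the fetched data lives.
import os

em_group = "jrc_mus-liver-3/recon-1/em/fibsem-uint8"
if os.path.isdir(em_group):
    scales = sorted(d for d in os.listdir(em_group) if d.startswith("s"))
    for s in scales:
        n_files = sum(len(files) for _, _, files in os.walk(os.path.join(em_group, s)))
        print(f"{s}: {n_files} files")  # an empty s0 shows up as 0 files
else:
    print(f"{em_group} not found")
```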

@rhoadesScholar
Member

Still working on the "Bad_malloc" issue, unfortunately...

@rhoadesScholar
Member

@Luchixiang @paperplane03 Can you add 'device="cpu"' to the training script, start training, and then watch 'top' in the terminal to observe the memory usage? I'm having trouble replicating the bug consistently.

@paperplane03
Author

We will do that. Do you mean that on your server the "bad malloc" error does not occur?

@paperplane03
Author

I used psutil in train.py to record resident and virtual memory every 0.01 s:
[screenshot: psutil-based memory logging added to train.py]
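The screenshot is not reproduced here; a sketch of a psutil-based logger along the lines described, sampling every 0.01 s in a background thread (the file name and threading approach are assumptions):

```python
# Hypothetical memory logger: sample resident (RSS) and virtual memory
# every 0.01 s in a background thread while training runs in the main thread.
import threading
import time

import psutil

def log_memory(stop_event, interval=0.01, path="memory_log.csv"):
    proc = psutil.Process()
    with open(path, "w") as f:
        f.write("time_s,rss_bytes,vms_bytes\n")
        start = time.time()
        while not stop_event.is_set():
            mem = proc.memory_info()
            f.write(f"{time.time() - start:.3f},{mem.rss},{mem.vms}\n")
            time.sleep(interval)

stop = threading.Event()
t = threading.Thread(target=log_memory, args=(stop,), daemon=True)
t.start()
# ... run training here ...
stop.set()
t.join()
```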

I ran train_2D.py (with device='cpu' added), and here is the output:
[screenshot: recorded memory usage during the run]

@Luchixiang

Any update on this?
