"Bad_malloc" in train_2D.py after followed the instruction. #82

Open
paperplane03 opened this issue Jan 16, 2025 · 14 comments

@paperplane03

Dear staff,
I updated the repo and followed the instructions, but I ran into a bad_alloc problem:
$ python train_2D.py
Training device: cuda
Reading csv file...
Constructing datasets...
Training datasets: 100%|██████████████████████████████████████████████████████████████████████████████| 231/231 [00:00<00:00, 1087.41it/s]
Validation datasets: 100%|██████████████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 2241.66it/s]
Verifying datasets...
Training datasets: 100%|███████████████████████████████████████████████████████████████████████████████| 231/231 [00:00<00:00, 687.97it/s]
Validation datasets: 100%|██████████████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 1000.40it/s]
Training:   0%| | 0/1000 [00:00<?, ?it/s]terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
I didn't run into this before. Is there a recent update that could be causing the error?

@rhoadesScholar
Member

@paperplane03 My guess is that you are running out of memory. Try lowering the batch_size in train_2D.py, and let us know if that solves the issue.
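For reference, a minimal sketch of where that change lands, using a stand-in TensorDataset instead of the challenge's actual dataset class (the variable names below are illustrative, not taken from train_2D.py):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in train_2D.py this would be the CellMap training dataset.
dummy = TensorDataset(torch.zeros(64, 1, 128, 128), torch.zeros(64, 1, 128, 128))

loader = DataLoader(
    dummy,
    batch_size=1,   # try 1 or 2 instead of a larger default
    shuffle=True,
    num_workers=0,  # each worker process adds its own host-memory overhead
)

for x, y in loader:
    break  # a training step would go here
```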

@rhoadesScholar rhoadesScholar self-assigned this Jan 16, 2025
@rhoadesScholar rhoadesScholar added the Participant Issue label (a code issue/bug encountered by challenge participant(s)) Jan 16, 2025
@paperplane03
Author

paperplane03 commented Jan 19, 2025

I changed the batch_size to 1, but I still get the error.
I started debugging the code and added two lines in train.py:

[screenshot: the two debug lines added in train.py]
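The screenshot is not reproduced here; a sketch of debug lines of that kind, with a toy stand-in since the actual variable holding the datasets in train.py is an assumption:

```python
# Hypothetical debug lines; in train.py, `datasets` would be the list of
# per-crop datasets built before training. A toy stand-in is used here.
datasets = [range(392), range(998118)]
print(f"length of dataset: {[len(d) for d in datasets]}")
```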
The output is:
length of dataset: [43218, 43218, 65268, 392, 392, 492, 336, 336, 336, 336, 336, 336, 432, 336, 336, 336, 336, 432, 316, 158, 316, 316, 316, 316, 48400, 404, 404, 28116, 28116, 998118, 10771848, 392, 1429218, 392, 998118, 998118, 392, 998118, 241968, 998118, 2130568, 392, 392, 998118, 2130568, 998118, 998118, 252, 516, 516, 648, 516, 922383, 1310660, 280, 160, 816642, 516, 648, 67914, 280, 392, 232, 392, 392, 392, 392, 276, 448, 448, 184, 220, 448, 65268, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 392, 392, 392, 392, 8713818, 392, 392, 392, 65268, 392, 392, 392, 8713818, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 392, 392, 392, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 392, 388, 392, 392, 376, 392, 392, 392, 340, 340, 340, 340, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 998118, 420, 328, 328, 328, 328, 328, 420, 328, 50752, 244, 328, 357773300, 6961400880, 6961400880, 357773300, 29175236416, 3437263280, 313022916, 59770888880, 121581528880, 7973970480, 29848040880, 59770888880, 59587609536, 59770888880, 206428376880, 206428376880, 533237680, 204888768, 1363045680]

I noticed some of the lengths are far too large (especially the last elements), so I printed the last dataset; it is:

[screenshot: the last dataset object]

However, the default data fetch does not download that data; the 's0' folder in 'jrc_mus-liver-3/recon-1/em/' is empty:
[screenshot: the empty 's0' folder]

That's the problem I found.

@paperplane03
Author

Does this mean that if I only download the 36.7 GB of data, it will not include all the data needed to train the example model?

@Luchixiang

I met the same problem on our 4x4090 server, even with batch_size = 1 in train_2D.py.

@rhoadesScholar
Member

Some of the datasets are indeed incredibly large. I am looking into the issue of the data not downloading for jrc_mus-liver-3 at the moment to see if this is related. Thank you for your patience.

@rhoadesScholar
Member

@paperplane03 and @Luchixiang, what OS is each of you using?

@Luchixiang

Ubuntu x86_64 with ~100 GB of free memory.

@paperplane03
Author

@rhoadesScholar We use Ubuntu x86_64. Thank you for the help!

@rhoadesScholar
Member

Does this mean that if I only download the 36.7 GB of data, it will not include all the data needed to train the example model?

Regarding the empty fibsem-uint8/s0 folders: if you do not opt to download all resolutions of the raw data (csc fetch-data -all-res), you will only download the raw data up to the resolution matching the associated annotations. This aims to save space, as downloading all of the full-resolution data takes around 600 GB.
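As an illustration only (the local path and multiscale layout below are assumptions, not taken from the challenge tooling), a quick way to see which scale levels were actually fetched is to list the group's subdirectories:

```python
# Hypothetical check: list which scale levels (s0, s1, ...) exist locally.
# The path below is an assumption about where the fetched data lives.
import os

em_group = "jrc_mus-liver-3/recon-1/em/fibsem-uint8"
if os.path.isdir(em_group):
    scales = sorted(d for d in os.listdir(em_group) if d.startswith("s"))
    for s in scales:
        n_files = sum(len(files) for _, _, files in os.walk(os.path.join(em_group, s)))
        print(f"{s}: {n_files} files")  # an empty s0 shows up as 0 files
else:
    print(f"{em_group} not found")
```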

@rhoadesScholar
Member

Still working on the "Bad_malloc" issue, unfortunately...

@rhoadesScholar
Member

@Luchixiang @paperplane03 Can you add 'device="cpu"' to the training script, start training, and then watch 'top' in the terminal to observe the memory usage? I'm having trouble replicating the bug consistently.

@paperplane03
Author

We will do that. Do you mean that on your server the "bad malloc" error does not occur?

@paperplane03
Author

I used psutil in train.py to record resident and virtual memory every 0.01 s:
[screenshot: psutil-based memory logging added to train.py]
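The screenshot is not reproduced here; a sketch of a psutil-based logger along the lines described, sampling every 0.01 s in a background thread (the file name and threading approach are assumptions):

```python
# Hypothetical memory logger: sample resident (RSS) and virtual memory
# every 0.01 s in a background thread while training runs in the main thread.
import threading
import time

import psutil

def log_memory(stop_event, interval=0.01, path="memory_log.csv"):
    proc = psutil.Process()
    with open(path, "w") as f:
        f.write("time_s,rss_bytes,vms_bytes\n")
        start = time.time()
        while not stop_event.is_set():
            mem = proc.memory_info()
            f.write(f"{time.time() - start:.3f},{mem.rss},{mem.vms}\n")
            time.sleep(interval)

stop = threading.Event()
t = threading.Thread(target=log_memory, args=(stop,), daemon=True)
t.start()
# ... run training here ...
stop.set()
t.join()
```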

I ran train_2D.py (with device='cpu' added), and here is the output:
[screenshot: recorded memory usage during the run]

@Luchixiang

Any update on this?
