Core dump "Could not decode datum" during training #763

YaYaB · 2020-07-24T10:34:33Z

If Ok, please give as many details as possible to help us solve the problem more efficiently.

Configuration

Version of DeepDetect:
- Locally compiled on:
  - Ubuntu 14.04 LTS
  - Mac OSX
  - Other:
- Docker
- Amazon AMI
Commit (shown by the server when starting):
ecdfad8

Your question / the problem you're facing:

I launched a training run for an image model. LMDB creation went fine (no errors seen), but at some point during training I got a core dump.
Note that it happened during the second epoch, so all of the data had already been seen once and the test set had already been predicted once.

Error message (if any) / steps to reproduce the problem:

Here are the logs I obtained when it core dumped:

Server log output:

libpng warning: Ignoring bad adaptive filter type
libpng warning: Ignoring bad adaptive filter type
libpng warning: Ignoring bad adaptive filter type
libpng warning: Ignoring bad adaptive filter type
libpng warning: Ignoring bad adaptive filter type
libpng error: IDAT: CRC error
[2020-07-24 10:06:14.222] [caffe] [error] Could not decode datum 
terminate called after throwing an instance of 'CaffeErrorException'
  what():  src/caffe/data_transformer.cpp:895 / Check failed (custom): cv_cropped_image.data
[1]    5337 abort (core dumped)  ./dede --port 8081

I've searched a bit and it might be due to a corrupted image, but if that is the case I don't understand how the first epoch completed correctly.

beniz · 2020-07-27T06:16:30Z

Hi, libpng says it: there's an issue with an image somewhere. The best way to find it is to write a script that decodes all the images.
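As a rough first pass for such a script, the sketch below walks a directory and verifies the CRC of every PNG chunk using only the Python standard library, since the log above reports `IDAT: CRC error`. The names `check_png` and `find_bad_pngs` are made up for this example, and the dataset is assumed to be plain `.png` files on disk. It only catches CRC mismatches and truncated files, so a clean result does not prove that every image decodes; fully decoding each file with the image library used at training time is the stronger test.

```python
import struct
import sys
import zlib
from pathlib import Path
from typing import List, Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def check_png(data: bytes) -> Optional[str]:
    """Return None if all chunk CRCs match, else a short error message."""
    if not data.startswith(PNG_SIGNATURE):
        return "missing PNG signature"
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        if pos + 8 > len(data):
            return "truncated chunk header"
        # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body_end = pos + 8 + length
        if body_end + 4 > len(data):
            return "truncated %s chunk" % ctype.decode("latin-1")
        (stored_crc,) = struct.unpack(">I", data[body_end:body_end + 4])
        # The CRC covers the chunk type and the chunk data, not the length.
        actual_crc = zlib.crc32(data[pos + 4:body_end]) & 0xFFFFFFFF
        if actual_crc != stored_crc:
            return "%s: CRC error" % ctype.decode("latin-1")
        if ctype == b"IEND":
            return None
        pos = body_end + 4
    return "missing IEND chunk"


def find_bad_pngs(root: str) -> List[Tuple[Path, str]]:
    """Check every *.png under root and return (path, error) pairs."""
    bad = []
    for path in sorted(Path(root).rglob("*.png")):
        error = check_png(path.read_bytes())
        if error is not None:
            bad.append((path, error))
    return bad


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python check_pngs.py /path/to/images
    for bad_path, bad_error in find_bad_pngs(sys.argv[1]):
        print("%s\t%s" % (bad_path, bad_error))
```

Any file this reports is worth re-exporting or removing before rebuilding the LMDB.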

To debug if it's an object detector being trained, you can also try setting this check_size variable to true: https://github.com/jolibrain/deepdetect/blob/master/src/backends/caffe/caffeinputconns.cc#L871

If the two tests above do not show anything wrong, you can try deactivating all the pragmas in this layer, starting here: https://github.com/jolibrain/caffe/blob/master/src/caffe/layers/annotated_data_layer.cpp#L164

But my hunch is that you have a bad PNG somewhere. As for why the first epoch went through, I don't know: data augmentation is randomized and datums are prefetched by three threads.

YaYaB · 2020-07-28T09:10:24Z

Yeah, I may have some weird PNGs. I tried decoding all of them and it seemed okay. I'll try again to see.
