training failure - how to fix it? #70

Open
eps696 opened this issue Feb 27, 2023 · 5 comments

eps696 commented Feb 27, 2023

We have attempted to reproduce your results by training the models from scratch with the provided data and settings (pirounet_dance, pirounet_watch, and the commented hyperparams in default_config.py). Unfortunately, we didn't succeed: generations from the trained models were far from the shown examples (i.e., no movements, a random point cloud in the latent space, etc.). We would greatly appreciate any advice on how to reach proper results with your inputs before going further with our own data.
Loss plots from the training looked like this:
[image: training loss plots]
Do you get a different picture when running the published code on your side?

A few more comments:

  • If we use the commented settings from default_config.py for training, there is an error about labels_train_true (which is not used further); should it be renamed to labels_train (which is needed for train.run_train_dgm)?
  • What is the reason for using a different classifier architecture in the trained model vs. the separately provided one?
  • There is an absolute path hardcoded at datasets.py:133, which required direct manual editing (a relative-path sketch follows this list).
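
For reference, here is a minimal sketch of the kind of edit we made, assuming the data lives next to the module (the "data" directory and "labels.csv" filename are hypothetical, not the repo's actual names):

```python
# Hypothetical sketch of replacing the hardcoded absolute path at
# datasets.py:133 with one resolved relative to the module itself.
from pathlib import Path

DATA_DIR = Path(__file__).resolve().parent / "data"  # assumed layout
labels_path = DATA_DIR / "labels.csv"                # hypothetical filename
```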
mathildepapillon (Collaborator) commented

I am excited that you are trying to reproduce the results! It is indeed quite unexpected that this training is failing, considering our plots look quite similar to yours and yet our models do produce movement (see attached). By "no movement", do you mean that the generated movement gifs are static? Or that they do not look like the training data?
[image: training loss plots]
With respect to the latent space: when we visualize our latent space with PCA (see an example below), we see it encodes the movement sequences in a continuous and smooth path. While it is not disentangled, we see some level of organization. Could you give more detail about what results you are getting in the latent space?
[image: latent space PCA example]
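
In case it helps you reproduce this kind of plot, here is a minimal sketch of such a PCA projection (the `latents` and `labels` arrays are hypothetical stand-ins, not names from the repo):

```python
# Minimal sketch of a PCA projection of latent codes (hypothetical names:
# `latents` is an (N, latent_dim) array of encoded sequences, `labels` is (N,)).
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latent_pca(latents, labels):
    # Project the latent codes onto their first two principal components.
    coords = PCA(n_components=2).fit_transform(latents)
    scatter = plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title("Latent space (PCA)")
    plt.colorbar(scatter, label="label")
    plt.show()
```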

In response to your additional comments:

  • Thank you for spotting that! I've made the fix.
  • The separately provided classifier is used to evaluate the model's quantitative metrics. Unlike the linear classifier inside DeepGenerativeModel, it is not part of the model itself, but rather a tool to measure its performance (see the sketch after this list).
  • Thank you for spotting that! I switched it to a relative path.
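
To illustrate the idea, here is a sketch of how a separately trained classifier can score a conditional generator; `model.generate_cond` and `eval_classifier` are hypothetical stand-ins for the actual objects and signatures in the repo:

```python
# Sketch: generate sequences per label, then check whether a separately
# trained classifier recovers the conditioning label (hypothetical API).
import torch

@torch.no_grad()
def label_accuracy(model, eval_classifier, label_dim, n_per_label=100):
    correct = 0
    for label in range(label_dim):
        # Generate n_per_label sequences conditioned on `label`...
        seqs = model.generate_cond(label, n_per_label)  # hypothetical signature
        # ...and see if the evaluation classifier predicts that same label.
        preds = eval_classifier(seqs).argmax(dim=-1)
        correct += (preds == label).sum().item()
    return correct / (label_dim * n_per_label)
```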

eps696 (Author) commented Mar 7, 2023

Thank you so much for the response!
Below are a few samples from our training.
Generated "animations" for labels 0, 1, 2 (I guess you can see what I meant by "no movement"):
[gif: anim-0-ep499-test4]
[gif: anim-1-ep499-test4]
[gif: anim-2-ep499-test4]

Latent space:
[image: encoded_test00-499]

Confusion matrices:
[image: confusion_acc_test]
[image: confusion_acc_valid]

eps696 (Author) commented Mar 7, 2023

Regarding the plots: they do look very similar, except for the labelling recon loss.
On your side it keeps decreasing more or less the whole time, while in our case it took quite a weird step, after which it barely changed until the end. Here is the enlarged image:
[image: enlarged labelling recon loss plot]

I guessed there might be issues with reading/applying the labels during training, but at first glance their values looked reasonable (though I don't know exactly how they should look).

Can we run the training without labels, to isolate the rest of the process for troubleshooting?

mathildepapillon self-assigned this Mar 7, 2023

mathildepapillon (Collaborator) commented

Something bizarre is definitely happening! I agree that it seems to be an issue with the labels. To check, you could try training with only one label given to all the labeled sequences (just replace the file with a single-valued array, for example, or manually set the label to 0; see the sketch below). That would avoid having to change the input data size. If the training still fails, I would try a hyperparameter sweep to see whether different settings give different results.
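
One way to build that single-valued array, assuming the labels are stored as a .npy file ("labels.npy" is a hypothetical filename, so adapt it to the repo's actual label file and format):

```python
# Single-label sanity check: overwrite every label with 0, keeping the
# array shape unchanged so input data sizes stay the same.
import numpy as np

labels = np.load("labels.npy")         # original labels (hypothetical file)
single = np.zeros_like(labels)         # give every labeled sequence label 0
np.save("labels_single.npy", single)   # same shape as the original array
```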

eps696 (Author) commented Mar 13, 2023

Thanks for the hint!
Unfortunately, I had trouble following it. I presumed that:

  • amount_of_labels in the config sets the axis/column count of the CSV file. That should be 2 for the provided CSV file, but it is filtered by time/space in load_labels, so only 1 remains. That means we already use only one label for training.
  • label_dim is the number of values a label can take, with labels transformed into one-hots of size label_dim. This understanding also failed: the provided label values span a range of 4, yet label_dim is fixed to 3 in the config, and any change leads to an error in generate_cond from some hardcoded reshape (which at first glance relates to the coords). A sketch of the transform as I understand it follows this list.
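
This is hypothetical code, not the repo's, but it shows why the mismatch confuses me: a label value must lie in [0, label_dim) to be one-hot encodable.

```python
# Sketch of the one-hot transform as I understand it.
import torch
import torch.nn.functional as F

label_dim = 3
labels = torch.tensor([0, 1, 2])                            # values < label_dim: OK
onehots = F.one_hot(labels, num_classes=label_dim).float()  # shape (3, 3)
# A label value of 3 (i.e., a range of 4 values) would raise an error here,
# which is why label_dim fixed to 3 in the config puzzles us.
```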

So, could you please explain the logic behind label control & processing a bit more?
