training failure - how to fix it? #70

Open
eps696 opened this issue Feb 27, 2023 · 5 comments

eps696 commented Feb 27, 2023

We have attempted to reproduce your results by training the models from scratch with the provided data and settings (pirounet_dance, pirounet_watch, and the commented hyperparams in default_config.py). Unfortunately, we didn't succeed: generations from the trained models were far from the shown examples (i.e., no movements, a random point cloud in the latent space, etc.). We would greatly appreciate any advice on how to reach proper results with your inputs before going further with our own data.
Loss plots from the training looked like this:
[image: training loss plots]
Do you get a different picture when running the published code on your side?

A few more comments:

  • If we use the commented settings from default_config.py for training, there is an error about labels_train_true (which is not used further); should it be renamed to labels_train (which is needed for train.run_train_dgm)?
  • What is the reason for using a different classifier architecture in the trained model vs. the separately provided one?
  • There is an absolute path hardcoded at datasets.py:133, which required direct manual editing (a relative-path sketch follows this list).
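
For reference, here is a minimal sketch of the kind of edit we made, assuming the data lives next to the module (the "data" directory and "labels.csv" filename are hypothetical, not the repo's actual names):

```python
# Hypothetical sketch of replacing the hardcoded absolute path at
# datasets.py:133 with one resolved relative to the module itself.
from pathlib import Path

DATA_DIR = Path(__file__).resolve().parent / "data"  # assumed layout
labels_path = DATA_DIR / "labels.csv"                # hypothetical filename
```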
mathildepapillon (Collaborator) commented

I am excited that you are trying to reproduce the results! It is indeed quite unexpected that this training is failing, considering our plots look quite similar to yours and yet our models do produce movement (see attached). By "no movement", do you mean that the generated movement gifs are static? Or that they do not look like the training data?
[image: training loss plots]
With respect to the latent space: when we visualize our latent space with PCA (see an example below), we see it encodes the movement sequences in a continuous and smooth path. While it is not disentangled, we see some level of organization. Could you give more detail about what results you are getting in the latent space?
[image: latent space PCA example]
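
In case it helps you reproduce this kind of plot, here is a minimal sketch of such a PCA projection (the `latents` and `labels` arrays are hypothetical stand-ins, not names from the repo):

```python
# Minimal sketch of a PCA projection of latent codes (hypothetical names:
# `latents` is an (N, latent_dim) array of encoded sequences, `labels` is (N,)).
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latent_pca(latents, labels):
    # Project the latent codes onto their first two principal components.
    coords = PCA(n_components=2).fit_transform(latents)
    scatter = plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title("Latent space (PCA)")
    plt.colorbar(scatter, label="label")
    plt.show()
```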

In response to your additional comments:

  • Thank you for spotting that! I've made the fix.
  • The separately provided classifier is used to evaluate the model's quantitative metrics. Unlike the linear classifier inside DeepGenerativeModel, it is not part of the model itself, but rather a tool to measure its performance (see the sketch after this list).
  • Thank you for spotting that! I switched it to a relative path.
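
To illustrate the idea, here is a sketch of how a separately trained classifier can score a conditional generator; `model.generate_cond` and `eval_classifier` are hypothetical stand-ins for the actual objects and signatures in the repo:

```python
# Sketch: generate sequences per label, then check whether a separately
# trained classifier recovers the conditioning label (hypothetical API).
import torch

@torch.no_grad()
def label_accuracy(model, eval_classifier, label_dim, n_per_label=100):
    correct = 0
    for label in range(label_dim):
        # Generate n_per_label sequences conditioned on `label`...
        seqs = model.generate_cond(label, n_per_label)  # hypothetical signature
        # ...and see if the evaluation classifier predicts that same label.
        preds = eval_classifier(seqs).argmax(dim=-1)
        correct += (preds == label).sum().item()
    return correct / (label_dim * n_per_label)
```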

eps696 (Author) commented Mar 7, 2023

Thank you so much for the response!
Below are a few samples from our training.
Generated "animations" for labels 0, 1, 2 (I guess you can see what I meant by "no movement"):
[gif: anim-0-ep499-test4]
[gif: anim-1-ep499-test4]
[gif: anim-2-ep499-test4]

Latent space:
[image: encoded_test00-499]

Confusion matrices:
[image: confusion_acc_test]
[image: confusion_acc_valid]

eps696 (Author) commented Mar 7, 2023

Regarding the plots: they do look very similar, except for the labelling recon loss.
On your side it keeps decreasing more or less the whole time, while in our case it took quite a weird step, after which it barely changed until the end. Here is the enlarged image:
[image: enlarged labelling recon loss plot]

I guessed there might be issues with reading/applying the labels during training, but at first glance their values looked reasonable (though I don't know exactly how they should look).

Can we run the training without labels, to isolate the rest of the process for troubleshooting?

mathildepapillon self-assigned this Mar 7, 2023

mathildepapillon (Collaborator) commented

Something bizarre is definitely happening! I agree that it seems to be an issue with the labels. To check, you could try training with only one label given to all the labeled sequences (just replace the file with a single-valued array, for example, or manually set the label to 0; see the sketch below). That would avoid having to change the input data size. If the training still fails, I would try a hyperparameter sweep to see whether different settings give different results.
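
One way to build that single-valued array, assuming the labels are stored as a .npy file ("labels.npy" is a hypothetical filename, so adapt it to the repo's actual label file and format):

```python
# Single-label sanity check: overwrite every label with 0, keeping the
# array shape unchanged so input data sizes stay the same.
import numpy as np

labels = np.load("labels.npy")         # original labels (hypothetical file)
single = np.zeros_like(labels)         # give every labeled sequence label 0
np.save("labels_single.npy", single)   # same shape as the original array
```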

eps696 (Author) commented Mar 13, 2023

Thanks for the hint!
Unfortunately, I had trouble following it. I presumed that:

  • amount_of_labels in the config sets the axis/column count of the CSV file. That should be 2 for the provided CSV file, but it is filtered by time/space in load_labels, so only 1 remains. That means we already use only one label for training.
  • label_dim is the number of values a label can take, with labels transformed into one-hots of size label_dim. This understanding also failed: the provided label values span a range of 4, yet label_dim is fixed to 3 in the config, and any change leads to an error in generate_cond from some hardcoded reshape (which at first glance relates to the coords). A sketch of the transform as I understand it follows this list.
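
This is hypothetical code, not the repo's, but it shows why the mismatch confuses me: a label value must lie in [0, label_dim) to be one-hot encodable.

```python
# Sketch of the one-hot transform as I understand it.
import torch
import torch.nn.functional as F

label_dim = 3
labels = torch.tensor([0, 1, 2])                            # values < label_dim: OK
onehots = F.one_hot(labels, num_classes=label_dim).float()  # shape (3, 3)
# A label value of 3 (i.e., a range of 4 values) would raise an error here,
# which is why label_dim fixed to 3 in the config puzzles us.
```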

So, could you please explain the logic behind label control & processing a bit more?
