Epoch's evaluation takes an unusually long time #55

Open

PonteIneptique opened this issue Oct 3, 2017 · 22 comments

@PonteIneptique
Member

Compared to the previous Keras implementation, evaluation takes a really long time (on GPU). It takes a few minutes to evaluate, while it takes ~30-35 seconds to train...

This could be related to #54. Could it be that things are run twice?

@PonteIneptique
Member Author

PonteIneptique commented Oct 4, 2017

So I can definitely confirm there is a huge performance issue here: it usually took me 30 to 40 minutes on GPU to train a whole network for medieval French (200k tokens, 150 epochs). The PyTorch model ran for the whole night (9 PM to 7:18 AM) and I am only at epoch 88.

@PonteIneptique
Member Author

Additional note: overall epoch fitting takes a little more time, 27s instead of 17s, but I don't think this explains the whole efficiency drop.

@PonteIneptique
Member Author

PonteIneptique commented Oct 4, 2017

For more details, here is the configuration:

# Configuration file for the Pandora system
[global]
nb_encoding_layers = 2
nb_dense_dims = 1000
batch_size = 100
nb_left_tokens = 2
nb_right_tokens = 1
nb_embedding_dims = 100
model_dir = models/chrestien
postcorrect = False
include_token = True
include_context = True
include_lemma = label
include_pos = True
include_morph = False
include_dev = True
include_test = True
nb_filters = 150
min_token_freq_emb = 5
filter_length = 3
focus_repr = convolutions
dropout_level = 0.15
nb_epochs = 150
halve_lr_at = 75
max_token_len = 20
min_lem_cnt = 1
model = PyTorch
max_lemma_len = 32

@PonteIneptique
Member Author

PonteIneptique commented Oct 4, 2017

It takes more or less 20 minutes to evaluate the scores.

Ideas of where we might be losing performance:

  • Batch size. I am thinking we do not need batching on testing/eval. Maybe by running everything at once, it would perform a little better.
  • GPU-to-CPU conversion. Memory transfer could be an issue. I have no idea how to deal with this one. Right now we are converting each value one by one; maybe there is a way to deal with them as a group (see the sketch after this list)?
  • Maybe predict would benefit from being CPU-only? Need to see whether that would make things faster.
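
For the GPU-to-CPU point, here is a minimal sketch of the difference between per-value transfers and a single bulk transfer (the tensor name and shape are made up for illustration; this is not Pandora's actual code):

import torch

# fake batch of predicted class indices living on the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
preds = torch.randint(0, 10, (1000,), device=device)

# current pattern, roughly: one device-to-host transfer per value
labels_slow = [preds[i].item() for i in range(len(preds))]

# bulk alternative: a single device-to-host copy, then convert in one go
labels_fast = preds.cpu().tolist()

assert labels_slow == labels_fast

Each .item() call forces a separate synchronisation with the device, which adds up quickly over thousands of values, whereas the bulk version pays that cost once per batch.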

@emanjavacas

I will try to reproduce this. I haven't encountered the issue (same with #54) running on train.py. Could you try to debug a bit starting from there?

With respect to your ideas, I have already referred to a) and c) somewhere else. Basically, during inference you want as high a batch_size as you can afford, since it doesn't have an effect on the output (tagging); the same doesn't apply during training.
b) shouldn't be an issue per se.

One bottleneck is that the entire pipeline still feels too handcrafted with the Keras model in mind. The PyTorch model could benefit (in terms of speed and performance) from changes in the way the data is loaded, but that would need a considerable amount of refactoring in the client code.
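
To illustrate point a), here is a rough sketch of what large-batch inference could look like; the model call and the batches iterable are placeholders for Pandora's own objects, and torch.no_grad() stands in for however gradients are disabled in the current code:

import torch

def tag_corpus(model, batches):
    """Run the tagger over pre-built batches without tracking gradients."""
    model.eval()
    predictions = []
    with torch.no_grad():
        for batch in batches:
            # batches can be as large as memory allows: batch size does not
            # change the predicted tags, only how fast we get them
            out = model(batch)                      # assumed forward call
            predictions.append(out.argmax(dim=-1).cpu())
    return torch.cat(predictions)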

@PonteIneptique
Member Author

PonteIneptique commented Oct 4, 2017

If you have not encountered this issue, I would think it comes from the test corpora: if you were using train.py, the test corpora were not used...

@PonteIneptique
Member Author

I added a branch to keep track of what's going on with predict:
predict.txt
I forced batches of size 1000 on purpose. Predict apparently takes gradually more time and stabilizes around 0.40s per batch, which amounts to 69s more or less.

I guess having batches of size 100 is pretty bad for predict... Maybe we should introduce a batch_size_predict argument? The same computation stabilizes around 0.20s for 100-sized batches at 1k batches (and still grows after that).
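
For reference, the numbers above look like per-batch timings; here is a minimal sketch of how that kind of instrumentation could be done (model.predict and the batches iterable are hypothetical stand-ins for Pandora's internals):

import time

def timed_predict(model, batches):
    """Print how long each prediction batch takes, to spot a gradual slowdown."""
    for i, batch in enumerate(batches):
        start = time.time()
        model.predict(batch)                        # hypothetical predict call
        print("batch %d: %.2fs" % (i, time.time() - start))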

@PonteIneptique
Member Author

Note: I moved some issues into #64 and #65.

Please make sure to open new issues when a separate bug arises. This keeps the discussion clean and understandable.

@mikekestemont

@emanjavacas I can reproduce your error and will look into it now.

@PonteIneptique
Member Author

@mikekestemont Could it be that this comment is about #65?

@mikekestemont

mikekestemont commented Oct 18, 2017 via email

@PonteIneptique
Member Author

To add a little more background on this issue:

  • 100-sized batches are processed really fast in Keras
  • 1000-sized batches take about the same amount of time in PyTorch

@PonteIneptique
Member Author

Regarding this, I think the best thing to do is to add a new parameter, test_batch_size, to speed up evaluation without affecting the training batch_size.
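
A minimal sketch of how such a parameter could be read from the [global] section of the configuration shown above, falling back to the training batch_size when it is absent (the option name and file name are just part of this proposal, not existing Pandora settings):

from configparser import ConfigParser

config = ConfigParser()
config.read("config.txt")  # hypothetical path to the configuration file shown earlier

batch_size = config.getint("global", "batch_size", fallback=100)
# proposed option: a larger batch size used only for evaluation/prediction
test_batch_size = config.getint("global", "test_batch_size", fallback=batch_size)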

@emanjavacas

So I gather the issue was that evaluation was done with a very small batch size?

@PonteIneptique
Member Author

It was definitely a question of batch size. The weird thing is that this was much, much more efficient in Keras for some reason.

@emanjavacas

Mhh, could you elaborate? I still can't see why a small batch size would lead to exponentially increasing running time during evaluation (as shown in your file predict.txt).

@PonteIneptique
Member Author

Keras somehow did not take as much time to evaluate with the same batch size. I have absolutely no idea what the cause is.

The one thing I did not check is whether the network for evaluation was on CPU or GPU. But even then... I don't see how that would create that much of a difference...

I know it's one way to fix this time-consuming issue, but I don't know where it comes from, mostly because the training time is about the same...

@emanjavacas

emanjavacas commented Oct 23, 2017 via email

@PonteIneptique
Member Author

I now understand your question. It was both slower and growing. I did not check whether this growing evaluation time still happens with the new test batch size. I am pretty sure it might, but it might not be as noticeable because of the smaller number of batches (?)

I am running everything on GPU though. Gotta use the best available part of the PC :)

@emanjavacas

emanjavacas commented Oct 23, 2017 via email

@emanjavacas reopened this Oct 28, 2017
@emanjavacas

I've traced an issue with PyTorch training, which might be the culprit for what you were seeing. It affects training and not evaluation, though. It is related to Adam, and I can't see right now why this is happening. Basically, after a number of epochs there is a sudden explosion in the size of the gradients, and training slows down by a factor of 5. I've opened an issue in the PyTorch discussion forum to see if somebody can shed light on it.

https://discuss.pytorch.org/t/considerable-slowdown-in-adam-step-after-a-number-of-epochs/9185

For now, switching to Adagrad solves it.
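
For reference, the workaround boils down to constructing a different optimizer; here is a minimal sketch with a stand-in model (the learning rates are illustrative, not Pandora's actual values):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(100, 10)  # stand-in for the actual tagger model

# previous setup: Adam, which showed the sudden slowdown after many epochs
# optimizer = optim.Adam(model.parameters(), lr=1e-3)

# workaround from this thread: switch to Adagrad
optimizer = optim.Adagrad(model.parameters(), lr=1e-2)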
