Epoch evaluation takes an unusually long time #55
Comments
So I can definitely confirm there is a huge performance issue here: it usually took me 30 to 40 minutes on GPU to train a whole network for medieval French (200k tokens, 150 epochs). The PyTorch model ran for the whole night (9 PM to 7:18 AM) and I am only at epoch 88. |
Additional note: the overall epoch fitting takes a little more time, 27 s instead of 17 s, but I don't think that explains the whole efficiency drop. |
Configuration, for more details:
# Configuration file for the Pandora system
[global]
nb_encoding_layers = 2
nb_dense_dims = 1000
batch_size = 100
nb_left_tokens = 2
nb_right_tokens = 1
nb_embedding_dims = 100
model_dir = models/chrestien
postcorrect = False
include_token = True
include_context = True
include_lemma = label
include_pos = True
include_morph = False
include_dev = True
include_test = True
nb_filters = 150
min_token_freq_emb = 5
filter_length = 3
focus_repr = convolutions
dropout_level = 0.15
nb_epochs = 150
halve_lr_at = 75
max_token_len = 20
min_lem_cnt = 1
model = PyTorch
max_lemma_len = 32 |
It takes more or less 20 minutes to compute the evaluation scores. Ideas of where we might be losing performance:
|
I will try to reproduce this. I haven't encountered the issue (same with #54) running on … With respect to your ideas, I have already referred to a) and c) somewhere else. Basically, during inference you want as high a batch_size as you can afford, since it has no effect on the output (tagging); the same does not apply during training. One bottleneck is that the entire pipeline still feels too handcrafted with the Keras model in mind. The PyTorch model could benefit (in terms of speed and performance) from changes in the way the data is loaded, but that would need a considerable amount of refactoring in the client code. |
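To illustrate the point about inference batch size, here is a minimal sketch using the generic PyTorch API (not Pandora's actual classes or method names): the predicted tags do not depend on the batch size, so prediction can be batched as aggressively as memory allows.

```python
import torch

def predict_all(model, inputs, batch_size=1024, device="cuda"):
    """Tag all `inputs` (a tensor of encoded tokens) in large batches.

    The predicted tags are identical whatever the batch size; only the
    throughput changes, so the batch can be as large as memory allows.
    """
    model.eval()
    predictions = []
    with torch.no_grad():  # no gradients are needed for tagging
        for start in range(0, len(inputs), batch_size):
            batch = inputs[start:start + batch_size].to(device)
            scores = model(batch)                    # (batch, n_classes)
            predictions.append(scores.argmax(dim=-1).cpu())
    return torch.cat(predictions)
```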
If you have not encountered this issue, I would think it comes from the test corpora: if you were using train.py, the test corpora were not used at all... |
I added a branch to keep track of what's going on with predict: I guess having batches of size 100 is pretty bad for predict... Maybe we should introduce a batch_size_predict argument? The same computation is around 0.20 s per 100-sized batch at 1k batches (and still grows after that). |
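For reference, the kind of per-batch timing that makes this visible looks roughly like the sketch below (a hypothetical helper, not the code in the branch): if the time per batch keeps growing with the batch index, some state is accumulating between batches.

```python
import time
import torch

def time_predictions(model, batches):
    """Print how long each prediction batch takes.

    `batches` is an iterable of input tensors already placed on the model's
    device; a steadily growing figure points at state accumulating across
    batches rather than at the batch size itself.
    """
    model.eval()
    with torch.no_grad():
        for i, batch in enumerate(batches):
            start = time.time()
            model(batch)
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # wait for the GPU so timings are real
            print("batch %d: %.3fs" % (i, time.time() - start))
```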
@emanjavacas I can reproduce your error and will look into it now. |
@mikekestemont Could it be this comment is about #65 ? |
Yes, I'll take it there.
|
To add a little more background on this issue :
|
Regarding this, I think the best thing to do is to add a new parameter, test_batch_size, to speed up evaluation without affecting the training batch_size. |
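A minimal sketch of what the proposed option could look like on the Python side (the parameter name and helper are only suggestions, not existing Pandora code): evaluation falls back to the training batch_size when test_batch_size is not set, so existing configurations keep working unchanged.

```python
def resolve_batch_sizes(settings):
    """Return (train_batch_size, test_batch_size) from a settings dict,
    defaulting the test size to the training size when it is absent."""
    train_bs = int(settings.get("batch_size", 100))
    test_bs = int(settings.get("test_batch_size", train_bs))
    return train_bs, test_bs

# e.g. the configuration above, extended with the new option:
print(resolve_batch_sizes({"batch_size": 100, "test_batch_size": 1024}))
```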
So I gather the issue was that evaluation was done with a very small batch size? |
It was definitely a question of batch size. The weird thing is that this was much much more efficient in Keras for some reason. |
Mhh, could you elaborate? I still can't see why a small batch size would lead to exponentially increasing running time during evaluation (as shown in your file predict.txt). |
Keras somehow did not take as much time to evaluate with the same batch size. I have absolutely no idea what the cause is. The one thing I did not check is whether the network used for evaluation was on CPU or GPU. But even then... I don't see how that would create that much of a difference... I know it's one way to fix this time-consuming issue; I don't know where it comes from, mostly because the training time is about the same... |
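Checking whether the evaluated network actually sits on the GPU is a one-liner in PyTorch (a sketch; `model` stands for the loaded tagger):

```python
def model_device(model):
    """Return 'cuda' if the model's parameters live on the GPU, else 'cpu'."""
    return "cuda" if next(model.parameters()).is_cuda else "cpu"
```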
So then the running time wasn't growing exponentially, it was just slower? If you were running on the CPU, one reason might be that PyTorch (especially older versions) has been shown to be slower than other engines...
|
I now understand your question. It was both slower and growing. I did not check whether this growing eval time still happens with the new test batch size; I am pretty sure it might, but it might not be as important because of the number of batches (?). I am running everything on GPU though. Gotta use the best available part of the PC :) |
OK, then if the running time is growing exponentially, we definitely need to debug it. That should not happen: there is no reason why two consecutive batches of the same size should take different amounts of time. Could you check if the memory usage is also increasing?
|
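One way to answer that question is to log the allocated GPU memory every few batches during evaluation (a sketch assuming the torch.cuda API; this helper is not existing Pandora code):

```python
import torch

def log_gpu_memory(step, every=50):
    """Print the currently allocated GPU memory every `every` batches."""
    if torch.cuda.is_available() and step % every == 0:
        allocated = torch.cuda.memory_allocated() / (1024 ** 2)
        print("batch %d: %.1f MiB allocated on the GPU" % (step, allocated))
```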
I've traced back an issue with PyTorch training, which might be the culprit for what you were seeing. It affects training and not evaluation, though. It is related to Adam, and I can't see right now why this is happening. Basically, after a number of epochs there is a sudden explosion in the size of the gradient and training slows down by a factor of 5. I've opened an issue on the PyTorch discussion forum to see if somebody can shed light on it: https://discuss.pytorch.org/t/considerable-slowdown-in-adam-step-after-a-number-of-epochs/9185 For now, switching to Adagrad solves it. |
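The workaround amounts to a one-line change where the optimizer is built; here is a sketch with a stand-in model and illustrative learning rates (not Pandora's actual network or hyperparameters):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(100, 10)  # stand-in for the actual tagger network

# optimizer = optim.Adam(model.parameters(), lr=1e-3)    # showed the sudden slowdown
optimizer = optim.Adagrad(model.parameters(), lr=1e-2)    # workaround for now
```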
Compared to the previous Keras implementation, the evaluation takes a really long time (on GPU). It takes a few minutes to evaluate, while it takes ~30-35 seconds to train...
This could be related to #54. Could it be that things are run twice?