Epoch's evaluation takes an unusually long time #55

Open

PonteIneptique opened this issue Oct 3, 2017 · 22 comments

@PonteIneptique
Member

Compared to the previous Keras implementation, evaluation takes a really long time (on GPU). It takes a few minutes to evaluate, while it takes ~30-35 seconds to train...

This could be related to #54. Could it be that things are run twice?

@PonteIneptique
Member Author

PonteIneptique commented Oct 4, 2017

So I can definitely confirm there is a huge performance issue here: it usually took me 30 to 40 minutes on GPU to train a whole network for medieval French (200k tokens, 150 epochs). The PyTorch model ran for the whole night (9 PM to 7:18 AM) and I am only at epoch 88.

@PonteIneptique
Member Author

Additional note: overall epoch fitting takes a little more time, 27s instead of 17s, but I don't think this explains the whole efficiency drop.

@PonteIneptique
Member Author

PonteIneptique commented Oct 4, 2017

For more details, here is the configuration:

# Configuration file for the Pandora system
[global]
nb_encoding_layers = 2
nb_dense_dims = 1000
batch_size = 100
nb_left_tokens = 2
nb_right_tokens = 1
nb_embedding_dims = 100
model_dir = models/chrestien
postcorrect = False
include_token = True
include_context = True
include_lemma = label
include_pos = True
include_morph = False
include_dev = True
include_test = True
nb_filters = 150
min_token_freq_emb = 5
filter_length = 3
focus_repr = convolutions
dropout_level = 0.15
nb_epochs = 150
halve_lr_at = 75
max_token_len = 20
min_lem_cnt = 1
model = PyTorch
max_lemma_len = 32

@PonteIneptique
Member Author

PonteIneptique commented Oct 4, 2017

It takes more or less 20 minutes to evaluate the scores.

Ideas of where we might be losing performance:

  • Batch size. I am thinking we do not need batching on testing/eval. Maybe by running everything at once, it would perform a little better.
  • GPU-to-CPU conversion. Memory transfer could be an issue. I have no idea how to deal with this one. Right now we are converting each value one by one; maybe there is a way to deal with them as a group (see the sketch after this list)?
  • Maybe predict would benefit from being CPU-only? Need to see whether that would make things faster.
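
For the GPU-to-CPU point, here is a minimal sketch of the difference between per-value transfers and a single bulk transfer (the tensor name and shape are made up for illustration; this is not Pandora's actual code):

import torch

# fake batch of predicted class indices living on the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
preds = torch.randint(0, 10, (1000,), device=device)

# current pattern, roughly: one device-to-host transfer per value
labels_slow = [preds[i].item() for i in range(len(preds))]

# bulk alternative: a single device-to-host copy, then convert in one go
labels_fast = preds.cpu().tolist()

assert labels_slow == labels_fast

Each .item() call forces a separate synchronisation with the device, which adds up quickly over thousands of values, whereas the bulk version pays that cost once per batch.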

@emanjavacas

I will try to reproduce this. I haven't encountered the issue (same with #54) running on train.py. Could you try to debug a bit starting from there?

With respect to your ideas, I have already referred to a) and c) somewhere else. Basically, during inference you want as high a batch_size as you can afford, since it doesn't have an effect on the output (tagging); the same doesn't apply during training.
b) shouldn't be an issue per se.

One bottleneck is that the entire pipeline still feels too handcrafted with the Keras model in mind. The PyTorch model could benefit (in terms of speed and performance) from changes in the way the data is loaded, but that would need a considerable amount of refactoring in the client code.
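
To illustrate point a), here is a rough sketch of what large-batch inference could look like; the model call and the batches iterable are placeholders for Pandora's own objects, and torch.no_grad() stands in for however gradients are disabled in the current code:

import torch

def tag_corpus(model, batches):
    """Run the tagger over pre-built batches without tracking gradients."""
    model.eval()
    predictions = []
    with torch.no_grad():
        for batch in batches:
            # batches can be as large as memory allows: batch size does not
            # change the predicted tags, only how fast we get them
            out = model(batch)                      # assumed forward call
            predictions.append(out.argmax(dim=-1).cpu())
    return torch.cat(predictions)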

@PonteIneptique
Member Author

PonteIneptique commented Oct 4, 2017

If you have not encountered this issue, I would think it comes from the test corpora: if you were using train.py, the test corpora were not used...

@PonteIneptique
Member Author

I added a branch to keep track of what's going on with predict:
predict.txt
I forced batches of size 1000 on purpose. Predict apparently takes gradually more time and stabilizes around 0.40s per batch, which amounts to 69s more or less.

I guess having batches of size 100 is pretty bad for predict... Maybe we should introduce a batch_size_predict argument? The same computation stabilizes around 0.20s for 100-sized batches at 1k batches (and still grows after that).
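
For reference, the numbers above look like per-batch timings; here is a minimal sketch of how that kind of instrumentation could be done (model.predict and the batches iterable are hypothetical stand-ins for Pandora's internals):

import time

def timed_predict(model, batches):
    """Print how long each prediction batch takes, to spot a gradual slowdown."""
    for i, batch in enumerate(batches):
        start = time.time()
        model.predict(batch)                        # hypothetical predict call
        print("batch %d: %.2fs" % (i, time.time() - start))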

@PonteIneptique
Member Author

Note: I moved some issues into #64 and #65.

Please make sure to open new issues when a separate bug arises. This keeps the discussion clean and understandable.

@mikekestemont

@emanjavacas I can reproduce your error and will look into it now.

@PonteIneptique
Member Author

@mikekestemont Could it be that this comment is about #65?

@mikekestemont

mikekestemont commented Oct 18, 2017 via email

@PonteIneptique
Member Author

To add a little more background on this issue:

  • 100-sized batches are processed really fast in Keras
  • 1000-sized batches take about the same amount of time in PyTorch

@PonteIneptique
Member Author

Regarding this, I think the best thing to do is to add a new parameter, test_batch_size, to speed up evaluation without affecting the training batch_size.
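
A minimal sketch of how such a parameter could be read from the [global] section of the configuration shown above, falling back to the training batch_size when it is absent (the option name and file name are just part of this proposal, not existing Pandora settings):

from configparser import ConfigParser

config = ConfigParser()
config.read("config.txt")  # hypothetical path to the configuration file shown earlier

batch_size = config.getint("global", "batch_size", fallback=100)
# proposed option: a larger batch size used only for evaluation/prediction
test_batch_size = config.getint("global", "test_batch_size", fallback=batch_size)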

@emanjavacas

So I gather the issue was that evaluation was done with a very small batch size?

@PonteIneptique
Member Author

It was definitely a question of batch size. The weird thing is that this was much, much more efficient in Keras for some reason.

@emanjavacas

Mhh, could you elaborate? I still can't see why a small batch size would lead to exponentially increasing running time during evaluation (as shown in your file predict.txt).

@PonteIneptique
Member Author

Keras somehow did not take as much time to evaluate with the same batch size. I have absolutely no idea what the cause is.

The one thing I did not check is whether the network for evaluation was on CPU or GPU. But even then... I don't see how that would create that much of a difference...

I know it's one way to fix this time-consuming issue, but I don't know where it comes from, mostly because the training time is about the same...

@emanjavacas

emanjavacas commented Oct 23, 2017 via email

@PonteIneptique
Member Author

I now understand your question. It was both slower and growing. I did not check whether this growing evaluation time still happens with the new test batch size. I am pretty sure it might, but it might not be as noticeable because of the smaller number of batches (?)

I am running everything on GPU though. Gotta use the best available part of the PC :)

@emanjavacas

emanjavacas commented Oct 23, 2017 via email

@emanjavacas reopened this Oct 28, 2017
@emanjavacas

I've traced an issue with PyTorch training, which might be the culprit for what you were seeing. It affects training and not evaluation, though. It is related to Adam, and I can't see right now why this is happening. Basically, after a number of epochs there is a sudden explosion in the size of the gradients, and training slows down by a factor of 5. I've opened an issue in the PyTorch discussion forum to see if somebody can shed light on it.

https://discuss.pytorch.org/t/considerable-slowdown-in-adam-step-after-a-number-of-epochs/9185

For now, switching to Adagrad solves it.
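
For reference, the workaround boils down to constructing a different optimizer; here is a minimal sketch with a stand-in model (the learning rates are illustrative, not Pandora's actual values):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(100, 10)  # stand-in for the actual tagger model

# previous setup: Adam, which showed the sudden slowdown after many epochs
# optimizer = optim.Adam(model.parameters(), lr=1e-3)

# workaround from this thread: switch to Adagrad
optimizer = optim.Adagrad(model.parameters(), lr=1e-2)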
