
Validation / training scores mismatch #16

Open
arilmad opened this issue Dec 9, 2020 · 9 comments

Comments

@arilmad

arilmad commented Dec 9, 2020

Hi,

I have run your network, based on the notebook, in a project of mine. However, I puzzled for quite a while over my validation Jaccard scores outperforming the training scores by a large margin. I suspect the answer lies in the rounding of yp that you perform in evaluateModel(). From what I can tell, this rounding is not done in the function used during training. After removing this rounding, the scores matched as expected.
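
To make the mismatch concrete, here is a minimal numpy sketch (toy arrays, not the notebook's actual data or code): scoring the rounded predictions can give a considerably higher number than scoring the raw sigmoid outputs.

import numpy as np

# Toy ground-truth mask and raw sigmoid outputs (hypothetical values).
y_true = np.array([1, 1, 0, 0, 1], dtype=float)
y_pred = np.array([0.9, 0.6, 0.4, 0.1, 0.7])

def jaccard_np(y_true, y_pred):
    # Soft Jaccard on floating-point values, in the spirit of the training metric.
    intersection = np.sum(y_true * y_pred)
    union = np.sum(y_true) + np.sum(y_pred) - intersection
    return intersection / union

print(jaccard_np(y_true, y_pred))               # unrounded: ~0.63
print(jaccard_np(y_true, np.round(y_pred, 0)))  # rounded first, as in evaluateModel(): 1.0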

Please let me know if I'm missing the point somewhere, or if you agree with the observation.

Thanks for a superb piece of work!

Arild

@saskra

saskra commented Dec 9, 2020

I also noticed that. In which line of the code did you make the change?

@arilmad
Author

arilmad commented Dec 9, 2020

Removed yp = np.round(yp,0) in evaluateModel()

@saskra

saskra commented Dec 9, 2020

Ah, okay, I think I meant something else. Your change would only have an effect on what I would call the test scores, whereas I was concerned with the validation scores, which are determined in parallel with the training scores: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7 Strangely enough, even these validation values are always higher than the training values in my case, although logically they should not be (see below). But the test results after leave-one-out cross-validation are indeed also significantly higher.

[figure: train_history, training and validation curves]

Can you simply do without the rounding? Most of these scores are only defined for binary decisions, aren't they?

@arilmad
Author

arilmad commented Dec 9, 2020

It is in fact the same thing. Notice that evaluateModel() is run after every epoch, and that the best model is determined based on the Jaccard score in evaluateModel(). Hence, the "test" data in the notebook does function as a validation set because it is indirectly used to pick a "best" model.

That said, the problem, so to speak, is that the function jacard(), which is used during training, does not round the prediction.
[image: the jacard() function used during training]

Thus, if you score your test data after training and round the predictions before scoring them, you get a score that is not directly comparable to the one computed during training.

To address your last question: since the output layer of the model contains sigmoids, it does make sense not to round the predictions during training. However, I don't know how the model would handle it if one changed jacard() to apply rounding.
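
For reference, the training-time metric has roughly this shape (a sketch from memory, not a verbatim copy of the repository's jacard()); note that it consumes the raw sigmoid outputs directly, with no thresholding:

from tensorflow.keras import backend as K

def jacard(y_true, y_pred):
    # Soft Jaccard over the raw sigmoid outputs: no rounding/thresholding,
    # so its values are not directly comparable to the rounded score
    # computed in evaluateModel().
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    union = K.sum(y_true_f) + K.sum(y_pred_f) - intersection
    return intersection / union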

@saskra

saskra commented Dec 9, 2020

Notice that evaluateModel() is run after every epoch, and that the best model is determined based on the Jaccard score in evaluateModel(). Hence, the "test" data in the notebook does function as a validation set because it is indirectly used to pick a "best" model.

I forgot that I had changed that in my copy: I am using a trainStep() with a maximum number of epochs, early stopping, and a predefined validation data set. Afterwards, I use evaluateModel() only on the separate test set.

You are right; I am also curious about the effect of rounding versus not rounding.
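
Roughly, my setup looks like this (a simplified, self-contained sketch with a stand-in model and random data, not my actual trainStep()):

import numpy as np
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.callbacks import EarlyStopping

# Stand-in model and random data, just to show the structure of the setup.
inp = Input(shape=(32, 32, 1))
out = Conv2D(1, (1, 1), activation='sigmoid')(inp)
model = Model(inp, out)
model.compile(optimizer='adam', loss='binary_crossentropy')

X_train = np.random.rand(24, 32, 32, 1)
Y_train = (np.random.rand(24, 32, 32, 1) > 0.5).astype('float32')
X_val = np.random.rand(8, 32, 32, 1)
Y_val = (np.random.rand(8, 32, 32, 1) > 0.5).astype('float32')

# Maximum number of epochs plus early stopping on a predefined validation set.
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model.fit(X_train, Y_train,
          validation_data=(X_val, Y_val),
          epochs=100, batch_size=4,
          callbacks=[early_stop], verbose=0)

# Only afterwards is the separate test set scored, e.g. with evaluateModel()
# (call omitted here; see the notebook for its signature).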

@Jderuijter

Hi, I think the issue can also be solved by removing the batch normalization from the output layer.

Your paper states:
"All the convolutional layers in this network, except for the output layer, are activated by the ReLU (Rectified Linear Unit) activation function (LeCun et al., 2015), and are batch-normalized (Ioffe & Szegedy, 2015). "

By changing the batch-normalized output layer to a more standard one:
line 119: conv10 = conv2d_bn(mresblock9, n_labels, 1, 1, activation=self.activation)

suggestion:
conv10 = Conv2D(n_labels, (1, 1), activation=self.activation)(mresblock9)

(also applicable to the 3D net)

This resolved similar issues for me.
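
To make the difference explicit, here is a small self-contained sketch (the input tensor, n_labels, and the sigmoid activation are placeholders standing in for the surrounding model code):

from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation

# Stand-ins for the real feature map and number of output channels.
n_labels = 1
mresblock9 = Input(shape=(256, 256, 32))

# Current output layer: 1x1 convolution followed by batch normalization and the
# activation (roughly what a conv2d_bn-style helper produces).
x = Conv2D(n_labels, (1, 1), padding='same')(mresblock9)
x = BatchNormalization()(x)
conv10_bn = Activation('sigmoid')(x)

# Suggested output layer: a plain 1x1 convolution with the activation, no batch norm.
conv10 = Conv2D(n_labels, (1, 1), activation='sigmoid')(mresblock9)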

@saskra

saskra commented Dec 10, 2020

By the way: I tested rounding during training, and the results now look much more like those shown in Figure 6 of the paper. I would be interested in a learning curve that includes both training and validation scores, like the one in my example.

@nibtehaz
Owner

Thank you @arilmad for your interest in our project, and thanks to @saskra and @Jderuijter for keeping the conversation going. Apologies for my late response; I was occupied with other work over the last few days.

First things first: since the Dice coefficient and the Jaccard index are defined for binary values, we should round the predictions to compute them.

Honestly speaking, in my notebook I didn't use the metrics computed during the training procedure, so I overlooked the fact that the values were not rounded. As has been pointed out in this thread, I used the evaluateModel() function for that purpose instead.

If you wish to compute the Dice or Jaccard values during training, it would be proper to round the predictions.

Another thing worth noting is why I didn't include the rounding in those metric computations in the first place. I actually used those functions to build Dice- and Jaccard-based loss functions, i.e. jaccard loss = - jaccard index. When we compute them as metrics, we must round to obtain the actual value, by definition. But when we treat them as loss functions, we should not round; keeping floating-point numbers helps the model improve. For example, suppose in one epoch a certain value is 0.67 and in the next epoch it becomes 0.78. If we don't round, the improvement is reflected in the loss value, but if we round, the improvement gets lost, since round(0.67) = round(0.78) = 1. Since I actually used those functions to experiment with Dice- and Jaccard-based loss functions, I didn't apply rounding there.
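
Expressed in code, the distinction looks roughly like this (a sketch, not the repository's exact functions):

from tensorflow.keras import backend as K

def soft_jacard(y_true, y_pred):
    # Unrounded Jaccard over the raw sigmoid outputs.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return intersection / (K.sum(y_true_f) + K.sum(y_pred_f) - intersection)

def jacard_loss(y_true, y_pred):
    # As a loss: keep floating-point values, so gradual improvements between
    # epochs are reflected in the loss ("jaccard loss = - jaccard index").
    return -soft_jacard(y_true, y_pred)

def jacard_metric(y_true, y_pred):
    # As a metric: round the predictions to 0/1 first, matching the binary
    # definition of the Jaccard index and the score in evaluateModel().
    return soft_jacard(y_true, K.round(y_pred))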

@saskra

saskra commented Jan 19, 2021

But when we treat them as loss functions, we should not round; keeping floating-point numbers helps the model improve. For example, suppose in one epoch a certain value is 0.67 and in the next epoch it becomes 0.78. If we don't round, the improvement is reflected in the loss value, but if we round, the improvement gets lost, since round(0.67) = round(0.78) = 1. Since I actually used those functions to experiment with Dice- and Jaccard-based loss functions, I didn't apply rounding there.

To add one more aspect: I once observed how the Jaccard score on the validation data set behaves during training, recording both the relative (unrounded) values, as in the original source code, and the rounded values in the same run. Interestingly, on this dataset the relative values continue to increase for a while after a few epochs, while the rounded values decrease again. I repeated this more than 100 times as part of a leave-one-out cross-validation and observed the same pattern every time. (By the way, the y label should read "Jaccard", not "Loss".)

Relative values:
[figure: train_history_relative]

Rounded values:
[figure: train_history_round]
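
For completeness, both curves can be recorded in the same run by passing an unrounded and a rounded variant as Keras metrics (a self-contained sketch with a stand-in model and random data, not my actual code):

import numpy as np
from tensorflow.keras import Input, Model, backend as K
from tensorflow.keras.layers import Conv2D

def jacard(y_true, y_pred):
    # Unrounded ("relative") Jaccard, as in the original source code.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return intersection / (K.sum(y_true_f) + K.sum(y_pred_f) - intersection)

def jacard_rounded(y_true, y_pred):
    # Same score, but on predictions rounded to 0/1.
    return jacard(y_true, K.round(y_pred))

# Stand-in model and random data, just to show how both values get logged.
inp = Input(shape=(32, 32, 1))
out = Conv2D(1, (1, 1), activation='sigmoid')(inp)
model = Model(inp, out)
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=[jacard, jacard_rounded])

X = np.random.rand(16, 32, 32, 1)
Y = (np.random.rand(16, 32, 32, 1) > 0.5).astype('float32')
history = model.fit(X, Y, validation_split=0.25, epochs=3, batch_size=4, verbose=0)

# history.history now contains 'jacard', 'val_jacard', 'jacard_rounded' and
# 'val_jacard_rounded', so both curves can be plotted from the same run.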
