Why stair-like loss curve? #262
Replies: 11 comments 2 replies
-
I can't speak to the training runs for the graph as I didn't do them; @mitchellnw would have a better idea... but it looks like it could be a shuffling issue (as in not properly shuffling).
-
My guess is also a shuffling issue with webdataset when these were run.
-
If the data is not preshuffled, you need both shard shuffling and local (buffer) shuffling.
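The two-level idea can be sketched in plain Python. This is an illustration of the principle, not the actual webdataset API, and `buffer_size` here is a toy value; real pipelines use buffers of thousands of samples:

```python
import random

def shuffled_stream(shards, buffer_size=4, seed=0):
    """Two-level shuffle: randomize shard order first, then use a
    small in-memory buffer to shuffle samples locally as they
    stream by. Without the shard shuffle, samples from the same
    shard stay clustered; without the buffer, order within a shard
    is fixed."""
    rng = random.Random(seed)
    shards = list(shards)
    rng.shuffle(shards)               # level 1: shard-order shuffling
    buf = []
    for shard in shards:
        for sample in shard:          # stream samples shard by shard
            buf.append(sample)
            if len(buf) >= buffer_size:
                yield buf.pop(rng.randrange(len(buf)))  # level 2: local shuffle
    while buf:                        # drain the remaining buffer
        yield buf.pop(rng.randrange(len(buf)))
```

With only one of the two levels, consecutive batches are still dominated by one shard, which is exactly the kind of within-epoch correlation that can produce structured loss curves.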
-
Perhaps it is not a shuffling issue with webdataset, since I trained the model on CC3M (a CSV dataset) and observed very similar curves: loss increases within each epoch, then decreases after each epoch...
-
@ChenDelong1999 did you preshuffle the dataset (randomly sort the dataset)?
-
CsvDataset should be shuffled every epoch, so pre-shuffling isn't really relevant. Might be worth checking open_clip/src/training/train.py Line 62 in d9ee4aa
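For a map-style dataset like CsvDataset, "shuffled every epoch" boils down to drawing a fresh permutation of indices each epoch. A minimal sketch of that idea (the helper name and seeding scheme are illustrative, not open_clip's actual code):

```python
import random

def epoch_indices(n, epoch, base_seed=42):
    """Return a fresh permutation of range(n) for the given epoch.
    Seeding by (base_seed + epoch) keeps the order deterministic
    and reproducible across workers, while still presenting samples
    in a different order every epoch."""
    indices = list(range(n))
    random.Random(base_seed + epoch).shuffle(indices)
    return indices
```

This is the same pattern PyTorch's DistributedSampler implements via `set_epoch`: if that call is missing, every epoch replays the identical order, which can produce repeating per-epoch structure in the loss.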
-
Looking at this again I wonder if it is caused by the learnable logit scale. I would expect that stair-like pattern could come from it, but I have no guesses for why. To test this hypothesis I would use a 10x smaller learning rate on the logit scale parameter.
-
@mitchellnw I've noticed that the scale param has an interesting relationship with the LR/loss; I wonder if it's almost behaving in a slightly oscillatory, control-systems fashion. The scale is strongly impacted by the LR as well: if the LR is high enough, the scale will not converge to 100 until the LR lowers.
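For context on why the scale and the LR interact: in CLIP-style training the scale is a learnable parameter stored in log space, and its exponentiated value is capped at 100. A rough sketch of that parameterization (the function name is ours, not open_clip's):

```python
import math

def effective_scale(log_scale, max_scale=100.0):
    """The optimizer updates log_scale directly; the loss uses
    exp(log_scale), capped at max_scale. A large LR moves log_scale
    in big steps, which exp() amplifies, so the effective scale can
    overshoot and oscillate instead of settling at the cap."""
    return min(math.exp(log_scale), max_scale)
```

CLIP initializes the parameter at log(1/0.07), i.e. an effective scale of about 14.3, which then has to climb toward the cap during training.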
-
Interesting. I wonder how accuracy/loss would be impacted if this learnable param was replaced by a scheduled param, something like `100 - k*cosine_decay(iteration)`.
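A hedged sketch of what such a schedule might look like (the value of `k` and the decay shape are whatever you pick; here `cosine_decay` runs from 1 at step 0 to 0 at the end of training, so the scale anneals from `100 - k` up to 100):

```python
import math

def scheduled_logit_scale(step, total_steps, max_scale=100.0, k=90.0):
    """Fixed schedule replacing the learnable scale: cosine_decay
    starts at 1 and falls to 0 over total_steps, so the returned
    scale rises smoothly from (max_scale - k) to max_scale."""
    cosine_decay = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return max_scale - k * cosine_decay
```

With `k = 90` this starts at 10 and ends at 100, roughly mimicking the trajectory the learnable parameter follows anyway, but without coupling it to the main LR.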
-
Hi, thanks for your answer, but the ...
-
Hi @mitchellnw @rom1504, I observed a similar stair-like loss curve when training a ViT-B-32 model on LAION-400M. Do you think this is normal? Or could you share a loss curve from any of your models on LAION-400M? Any information would be greatly appreciated.
-
Both here and in my own implementation, stair-like loss curves are observed. Any possible reason for this?