Replies: 4 comments
-
Most likely it would, just more costly, since the visual encoder is then trained as well. https://twitter.com/rom1504/status/1593719037808320513?t=8s946794nkQOi1um4nrQQg&s=19 Do you have any specific questions?
-
Well, I tried ViT-L/14 with xlm-roberta-large, both trained from scratch, and the loss did not converge at all for quite a few iterations. The training data is laion-2B-multi.
-
What learning rate and batch size are you using? With something like batch size 90,000, learning rate 0.001, and precision amp bfloat16, you should see the loss go down significantly after about 7,500 iterations. Note that you would need around 512 GPUs to train it from scratch in less than two weeks.
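For concreteness, here is a rough sketch of how those numbers map onto an open_clip run in Python. This is only an illustration, not the exact training setup: `xlm-roberta-large-ViT-H-14` is the stock open_clip config closest to this discussion, and the per-GPU arithmetic assumes 512 GPUs as above.

```python
# Hedged sketch only: maps the suggested global settings (batch ~90k, lr 0.001,
# amp bfloat16) onto per-GPU values for an open_clip multilingual model.
import torch
import open_clip

# Stock multilingual config; a ViT-L/14 + xlm-roberta-large combination would
# need a custom model config instead.
model, _, _ = open_clip.create_model_and_transforms("xlm-roberta-large-ViT-H-14")

global_batch, n_gpus = 90_000, 512
per_gpu_batch = global_batch // n_gpus                      # ~175 samples per GPU
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # lr 0.001 as above

# "precision amp bfloat16": wrap the forward pass and contrastive loss in bf16 autocast
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    pass  # image/text forward + loss for each of the ~7,500 optimization steps
```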
-
I used a batch size of 58,880 (128 A100 80G GPUs with gradient checkpointing) and a learning rate of 5e-4, and tried both amp and amp bfloat16. After that, I tried initializing both towers from pretrained models, with the ViT taken from the laion-2B-en pretrained checkpoint; with amp bfloat16 it starts to converge quickly. Anyway, glad to know that.
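As a reference, a minimal sketch of that setup. The `xlm-roberta-large-ViT-L-14` name is hypothetical (open_clip does not ship a ViT-L/14 + xlm-roberta-large config, so it would need a custom config file), and the pretrained tag is an assumption to check against `open_clip.list_pretrained()`.

```python
# Hedged sketch: initialize the vision tower from an English laion2B ViT-L/14
# checkpoint and enable gradient checkpointing, as described above.
import open_clip

# English-pretrained ViT-L/14 (the pretrained tag is an assumption; check
# open_clip.list_pretrained() for the exact name).
en_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)

# Hypothetical custom config pairing ViT-L/14 with xlm-roberta-large; copying
# weights only works if its vision settings match ViT-L-14 exactly.
multi_model, _, _ = open_clip.create_model_and_transforms("xlm-roberta-large-ViT-L-14")
multi_model.visual.load_state_dict(en_model.visual.state_dict())

# 58,880 / 128 = 460 samples per 80G GPU; checkpointing trades compute for memory.
multi_model.set_grad_checkpointing()
```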
-
As mentioned in the README, ViT-B/32 xlm-roberta-base is trained from scratch, while ViT-H/14 xlm-roberta-large uses LiT-style training. Does ViT-H/14 xlm-roberta-large ever converge when trained from scratch?
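For context, a minimal sketch of what LiT-style training looks like in open_clip: the image tower is frozen and only the text encoder is trained against it. The config name is the stock one; loading the pretrained English vision weights (as in the earlier sketch) is omitted here.

```python
# Hedged sketch of LiT-style training: freeze the image tower so only the
# xlm-roberta text encoder receives gradients.
import open_clip

model, _, _ = open_clip.create_model_and_transforms("xlm-roberta-large-ViT-H-14")

# LiT: lock the whole vision tower (no unlocked groups, i.e. fully frozen).
model.lock_image_tower(unlocked_groups=0)
```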