Can overfitting lead to high-norm patches? #419
There do already seem to be high-norm artifacts in vit-b (more info in issue #373), though they present a bit differently than in the larger models. Specifically for vit-b, there are always (weirdly) a bunch of high-norm tokens in the top-left patches. I'm not sure about finetuning on small datasets, but the vit-b model was also used within Depth-Anything, which would've involved training on a large dataset for a different task, and it still shows similar artifacts. I'd guess that the artifacts you're seeing may just be the original ones, especially if they're concentrated in the top-left patches, and not directly related to finetuning.
Thanks for your response, @heyoeyo. In my case, the artefacts appear along the left and top edges of the image (and not just in the top-left corner). What is also interesting is that I am getting low norm values for the artefacts, but high values for the first principal component.

Norm of last-layer patch tokens: [image]
PCA(n=1) of last-layer patch tokens: [image]

The values are overlaid in red (ignore the mismatch in the image orientation). I am not sure how to explain the low norm values for the artefacts.
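For reference, a minimal sketch of this kind of visualization, assuming the public torch.hub DINOv2 entry point and a placeholder 518x518 input (not the exact code used for the figures above):

```python
import torch

# Load the official vit-b/14 backbone (assumes the torch.hub entry point).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

img = torch.rand(1, 3, 518, 518)  # placeholder input; 518 = 37 * 14, i.e. 37 patches per side

with torch.inference_mode():
    feats = model.forward_features(img)
    patch_tokens = feats["x_norm_patchtokens"]  # (1, N_patches, C)

# Per-patch L2 norm, reshaped to the patch grid for overlaying on the image.
h = w = img.shape[-1] // 14
norms = patch_tokens.norm(dim=-1).reshape(h, w)

# First principal component of the (centered) patch tokens.
tokens = patch_tokens[0]
centered = tokens - tokens.mean(dim=0, keepdim=True)
_, _, v = torch.pca_lowrank(centered, q=1)
pca1 = (centered @ v[:, 0]).reshape(h, w)

# `norms` and `pca1` can then be upsampled and overlaid on the input image.
print(norms.min().item(), norms.max().item(), pca1.abs().max().item())
```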
That seems very surprising! It's probably worth double-checking whether the original (not fine-tuned) vit-b model produces similar artifacts (if you haven't already). Another thing worth checking is whether the artifacts appear in earlier layers, and what that pattern looks like. In all cases I've seen, they aren't present in the earlier layers, but tend to appear and stay consistent in later layers, with the final layer being somewhat different from all the others.
Those results from the original model look a bit more similar to what I've seen, though it's strange that it seems inverted and that there are other tokens (not just the top-left) that seem out of place. It actually resembles the result from the larger models... As for visualizing the other layers, there's some code here which can at least give a qualitative result.
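The linked code isn't included in this thread, but as a rough sketch of that kind of per-layer check (assuming the torch.hub DINOv2 models expose get_intermediate_layers with reshape and norm flags, as in the public repo), something like this shows where the high-norm tokens first appear and what they look like before the final LayerNorm:

```python
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
img = torch.rand(1, 3, 518, 518)  # placeholder input

with torch.inference_mode():
    # norm=False returns the raw block outputs (before the final LayerNorm),
    # which is where the high-norm outliers are usually most visible.
    layer_outputs = model.get_intermediate_layers(img, n=12, reshape=True, norm=False)

for idx, feats in enumerate(layer_outputs):  # each feats: (1, C, h, w)
    patch_norms = feats.norm(dim=1)  # (1, h, w) per-patch norms for block idx
    print(f"block {idx:2d}: max patch norm = {patch_norms.max().item():.1f}")
```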
@heyoeyo you are right regarding the final layer norm. Upon commenting it out, I get high norms as expected:

Norm of last-layer patch tokens from the official vitb14: [image]
Norm of last-layer patch tokens from our fine-tuned vitb14: [image]

The values are now similar to the ones discussed in the registers paper. However, I still observe artefacts along the entirety of the top and left edges. I have a couple of queries regarding the impact on performance:
I'd guess this depends a lot on how you're using the cls token. If you've trained another model to use the cls token (and included the vitb model in this training), then I'd imagine it's ok. The cls token has the chance to 'attend' to these weird high norm tokens throughout the model, so even if they include global info (as the registers paper suggests), end-to-end training involving the cls token should be able to account for this (to some extent) I think.
I think this again depends on how the model is used. If the downstream processing is trained in conjunction with the vit output, then it's likely to outperform any hand-picked mask/thresholding settings (given how hard it would be to do this with these odd patterns).
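For context, the kind of hand-picked masking being compared against could look something like the sketch below; the norm threshold is purely hypothetical and would need tuning per model, layer, and input resolution, which is part of why a head trained end-to-end on the raw tokens tends to be the easier option:

```python
import torch

def masked_mean_pool(patch_tokens: torch.Tensor, norm_threshold: float = 100.0) -> torch.Tensor:
    """Average patch tokens while ignoring tokens whose L2 norm exceeds a threshold.

    patch_tokens: (B, N, C) patch features.
    norm_threshold: hand-picked cutoff (hypothetical value, would need tuning).
    """
    norms = patch_tokens.norm(dim=-1)              # (B, N)
    keep = (norms < norm_threshold).unsqueeze(-1)  # (B, N, 1) boolean mask
    kept_sum = (patch_tokens * keep).sum(dim=1)    # zero out the outlier tokens, then sum
    kept_count = keep.sum(dim=1).clamp(min=1)      # (B, 1), avoid division by zero
    return kept_sum / kept_count                   # (B, C) pooled feature
```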
A short update regarding the issue: |
@amundra15 would you please comment on how you modified your loss function and why/how that leads to more stable training? thank you! |
I have made a couple of major changes to the loss function, tailored to the problem at hand. One of the added loss terms was difficult to optimize (and also did not make sense intuitively), leading to increasing values for the original losses from the paper. Upon removing that term, training became more stable and the original losses also decreased consistently.
I want to finetune vitb14 on domain-specific data, and as a proof-of-concept, I am doing so on a fairly small dataset in the beginning. The resultant patch features show high-norm artefacts similar to the ones discussed in "Vision transformers need registers".
What confuses me is that the paper highlights that such artefacts were not observed for vitb14, only for larger, more representative models. This makes me wonder whether the artefacts I am seeing for vitb14 are a sign of the model overfitting.
Any thoughts on this?
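For anyone wanting to check whether such outliers come from fine-tuning or are already present in the base model, a rough comparison sketch (the fine-tuned checkpoint path is hypothetical, and a real image from the target domain should replace the random input):

```python
import torch

official = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

finetuned = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
finetuned.load_state_dict(torch.load("finetuned_vitb14.pth", map_location="cpu"))  # hypothetical path
finetuned.eval()

img = torch.rand(1, 3, 518, 518)  # placeholder; use a real domain image in practice

with torch.inference_mode():
    for name, model in [("official", official), ("fine-tuned", finetuned)]:
        norms = model.forward_features(img)["x_norm_patchtokens"].norm(dim=-1)
        frac_outliers = (norms > 2 * norms.median()).float().mean()
        print(f"{name}: median={norms.median():.1f}  max={norms.max():.1f}  "
              f"frac > 2x median = {frac_outliers:.3f}")
```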