
Can overfitting lead to high-norm patches? #419

Closed · amundra15 opened this issue May 22, 2024 · 10 comments

@amundra15

I want to fine-tune vitb14 on domain-specific data, and as a proof of concept, I am starting with a fairly small dataset. The resulting patch features show high-norm artefacts similar to the ones discussed in "Vision Transformers Need Registers".

What confuses me is that the paper highlights that such artefacts were not noticed for vitb14, only for larger, more representative models. This makes me wonder whether the artefacts I am seeing for vitb14 are a sign of overfitting.

Any thoughts on this?

@heyoeyo commented May 22, 2024

There do already seem to be high-norm artifacts in vit-b (more info in issue #373), though they present a bit differently than in the larger models. Specifically for vit-b, there is always (weirdly) a bunch of high-norm tokens in the top-left patches.

I'm not sure about fine-tuning on small datasets, but the vit-b model was also used within Depth-Anything, which would've involved training on a large dataset for a different task, and it still shows similar artifacts. I'd guess that the artifacts you're seeing may just be the original ones, especially if they're concentrated in the top-left patches, and not directly related to fine-tuning.

@amundra15 (Author)

Thanks for your response, @heyoeyo.

In my case, the artefacts appear along the left and top edges of the image (and not just the top-left corner). What is also interesting is that I am getting low norm values for the artefacts, but high values for the first principal component.

Input RGB: [image: input(1)]

Norm of last-layer patch tokens: [image: our_norm_7000iter]

PCA(n=1) of last-layer patch tokens: [image: our_fgbg]

The values are overlaid in red (ignore the mismatch in image orientation).

I am not sure how to explain the low norm values for the artefacts.
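
(For context, a minimal sketch of how such norm and PCA(n=1) maps can be computed from the torch.hub `dinov2_vitb14` model; this is illustrative only, not necessarily the exact pipeline used for the images above.)

```python
# Illustrative sketch only: per-patch L2 norms and a 1-component PCA of the
# last-layer patch tokens from the torch.hub dinov2_vitb14 model.
import torch
from sklearn.decomposition import PCA

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

# Placeholder input: in practice an ImageNet-normalized RGB image whose height/width
# are multiples of the 14-pixel patch size (e.g. 518x518 -> 37x37 patches).
img = torch.randn(1, 3, 518, 518)
grid = 518 // 14  # 37

with torch.inference_mode():
    feats = model.forward_features(img)            # dict; key names follow the DINOv2 repo
    patch_tokens = feats["x_norm_patchtokens"][0]  # (37*37, 768), after the final LayerNorm

norm_map = patch_tokens.norm(dim=-1).reshape(grid, grid)  # per-patch L2 norm

pca = PCA(n_components=1)
pc1_map = pca.fit_transform(patch_tokens.cpu().numpy()).reshape(grid, grid)  # first principal component
```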

@heyoeyo commented May 23, 2024

That seems very surprising! It's probably worth double-checking whether the original (not fine-tuned) vit-b model produces similar artifacts (if you haven't already).

Another thing worth checking is whether the artifacts appear in earlier layers, and what that pattern looks like. In all cases I've seen, they aren't present in the earlier layers, but tend to appear and stay consistent in later layers, with the final layer being somewhat different from all the others.
Not that this will explain your results specifically, but if you see a similar pattern it may be a hint that it's the same phenomenon at least, even though you're getting low-norm tokens.

@amundra15 (Author)

The original model also produces similar low-norm artefacts (though not as prominently).

Norm of last-layer patch tokens from the official vitb14: [image: dino_norm]

PCA(n=1) of last-layer patch tokens from the official vitb14: [image: dino_fgbg(1)]

It's interesting to note that the original model shows regions of high as well as low norms. The fine-tuning is exacerbating the low-norm problem already present in the top-left region. Is this phenomenon studied and documented somewhere?

I will also add the visualizations from the other layers once I have them.

@heyoeyo commented May 28, 2024

Those results from the original model look a bit more similar to what I've seen, though it's strange that they seem inverted and that there are other tokens (not just the top-left) that seem out of place. They actually resemble the results from the larger models...
Are these norm results showing the patch tokens as-is, or do they also include the final layernorm step? If the layernorm is included, I wonder if that might at least explain the inversion of low/high norms?

As for visualizing the other layers, there's some code here which can at least give a qualitative result.
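
For reference, a rough sketch of both checks is below (it assumes the torch.hub `dinov2_vitb14` model, where the `norm=` flag of `get_intermediate_layers` controls whether the final LayerNorm is applied; it's illustrative, not the linked demo code):

```python
# Rough sketch: compare patch-token norms with/without the final LayerNorm,
# and look at a few earlier blocks (assumes the torch.hub dinov2_vitb14 model).
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
img = torch.randn(1, 3, 518, 518)  # placeholder; ImageNet-normalized image in practice

with torch.inference_mode():
    # Last block, with and without the final LayerNorm applied
    for apply_norm in (True, False):
        (tokens,) = model.get_intermediate_layers(img, n=1, norm=apply_norm)
        norms = tokens[0].norm(dim=-1)
        print(f"final LayerNorm={apply_norm}: norm range {norms.min():.1f} .. {norms.max():.1f}")

    # A few earlier blocks (0-indexed); in my experience the artifacts only
    # show up in the later blocks.
    block_ids = [2, 5, 8, 11]
    for idx, tokens in zip(block_ids, model.get_intermediate_layers(img, n=block_ids, norm=False)):
        print(f"block {idx}: max patch-token norm = {tokens[0].norm(dim=-1).max():.1f}")
```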

@amundra15 (Author)

@heyoeyo you are right regarding the final layer norm. Upon commenting it out, I get high norms as expected:

Norm of last-layer patch tokens from the official vitb14: [image: ori_vitb14_patchnorm]

Norm of last-layer patch tokens from our fine-tuned vitb14: [image: ours_vitb14_featurenorm]

The values are now similar to the ones discussed in the registers paper. However, I still observe artefacts along the entirety of the top and left edges.

I have a couple of queries regarding the impact on performance:

  1. Does this artefact somehow affect the cls token performance as well?
  2. What happens if I mask/threshold the patch token outputs for the artefacts? Can this ensure better downstream performance compared to using it as is?

@heyoeyo commented Jun 4, 2024

Does this artefact somehow affect the cls token performance as well?

I'd guess this depends a lot on how you're using the cls token. If you've trained another model to use the cls token (and included the vitb model in this training), then I'd imagine it's okay. The cls token has the chance to 'attend' to these weird high-norm tokens throughout the model, so even if they include global info (as the registers paper suggests), end-to-end training involving the cls token should be able to account for this (to some extent), I think.
On the other hand, if you're attaching a separate model/other classifier onto the vitb cls token without further training of the vitb model, it may perform more poorly since there is nothing guiding the model to place the most relevant info into the cls token specifically.
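
As a sketch of that second case (nothing specific to your setup; `num_classes`, the optimizer settings, and the data are placeholders), a frozen-backbone linear probe on the cls token would look roughly like:

```python
# Rough sketch: linear probe on the frozen cls token (the "separate classifier,
# no further ViT training" case). num_classes and the data are placeholders.
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
for p in backbone.parameters():
    p.requires_grad_(False)            # ViT stays frozen

num_classes = 10                       # placeholder
head = nn.Linear(backbone.embed_dim, num_classes)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def probe_step(img, labels):
    with torch.no_grad():              # no gradients through the frozen ViT
        cls_token = backbone.forward_features(img)["x_norm_clstoken"]
    loss = nn.functional.cross_entropy(head(cls_token), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```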

What happens if I mask/threshold the patch token outputs for the artefacts? Can this ensure better downstream performance compared to using it as is?

I think this again depends on how the model is used. If the downstream processing is trained in conjunction with the vit output, then it's likely to outperform any hand-picked mask/thresholding settings (given how hard it would be to do this with these odd patterns).
The Depth-Anything models have a substantial amount of post-processing after the vit encoding and seem to perform fine, at least. Of course, there's always a chance that they could be even better if not for these weird patterns, but I think the only way to know is to try different models (especially the ones with registers, in this case).
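
If you do want to experiment with masking, a rough sketch would be something like the function below (the 3x-median threshold is arbitrary and would need tuning for your model and layer):

```python
# Rough sketch: zero out patch tokens whose norm is far above the median.
# The 3x-median threshold is arbitrary and would need tuning.
import torch

def mask_high_norm_patches(patch_tokens: torch.Tensor, factor: float = 3.0):
    """patch_tokens: (num_patches, dim); returns masked tokens and a keep-mask."""
    norms = patch_tokens.norm(dim=-1)          # (num_patches,)
    keep = norms < factor * norms.median()     # True for "normal" patches
    masked = patch_tokens.clone()
    masked[~keep] = 0.0                        # suppress suspected artifact tokens
    return masked, keep
```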

@amundra15 (Author)

A short update regarding the issue:
The artefacts turned out to be related to training instability as well (somewhat expected). After modifying our (custom) loss function, we observe more stable training, which leads to no high-norm patches in the feature space as well as better downstream performance.

@Supersak80

@amundra15 Would you please comment on how you modified your loss function and why/how that leads to more stable training? Thank you!

@amundra15 (Author)

I made a couple of major changes to the loss function, tailored to the problem at hand. One of them was difficult to optimize (and also did not make sense intuitively), leading to increasing values for the original losses from the paper. Upon removing that loss term, training became more stable and the original losses also decreased consistently.
