Notebook training classifier on custom dataset #305

Open
wants to merge 3 commits into base: main

Conversation


@legel legel commented Nov 11, 2023

It was difficult and time-consuming for me to get a "hello world" training of a DinoV2 classifier on a custom dataset.

Existing options involve diving deep into complex classes and APIs that appear to be designed especially for ImageNet.

In any case, a simple starter notebook is likely to prove useful for many others.

Inspired by successful working code and a tutorial from here, I developed and tested code for downloading a DinoV2 model and training classifier layers from scratch on a custom dataset. This is contained in a classification.ipynb notebook.

There are several GitHub issues that this notebook will address; readers and authors of those threads may wish to be aware of it.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 11, 2023
@legel legel force-pushed the main branch 2 times, most recently from abd14c2 to 332d37d Compare November 12, 2023 07:21
@tungts1101

When I used only the backbone `torch.hub.load("facebookresearch/dinov2", 'dinov2_vits14')` and my own classification head `torch.nn.Linear(in_features=1920, out_features=1000, bias=True)`, I got the error `RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x384 and 1920x1000)`, which I don't understand, since it shares the same architecture as the classification model in the docs, `torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_lc')`. Do you have any idea what could go wrong here?

@legel
Author

legel commented Nov 15, 2023

@tungts1101 if you want to use your own head (versus what's defined at https://github.com/legel/dinov2/blob/main/notebooks/classification.ipynb), note that the mismatch is between the output dimensionality of the backbone (384) and the input dimensionality of your head (1920). Changing to `in_features=384` should resolve the error. That said, it wasn't until I used the sequence of layers defined in that notebook that the classifier worked properly for me...
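For instance, here is a minimal shape check (assuming the 384-dim embedding of `dinov2_vits14`; the backbone call is stood in by a random feature tensor, so nothing is downloaded):

```python
import torch

# dinov2_vits14 emits a 384-dim embedding per image, so a custom
# head must use in_features=384 rather than 1920.
EMBED_DIM = 384
NUM_CLASSES = 1000

head = torch.nn.Linear(in_features=EMBED_DIM, out_features=NUM_CLASSES, bias=True)

# Stand-in for backbone(images): a batch of 8 feature vectors.
features = torch.randn(8, EMBED_DIM)
logits = head(features)
print(logits.shape)  # torch.Size([8, 1000])
```

With `in_features=1920`, the same forward pass raises exactly the `mat1 and mat2 shapes cannot be multiplied (8x384 and 1920x1000)` error reported above.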

@legel
Author

legel commented Nov 15, 2023

PS: I've upgraded my classification.ipynb notebook to handle multi-class problems (rather than just binary), and I've also added better automation for downloading the pre-trained .pth files, which I'll upload shortly.

@tungts1101

@legel Doing what you said works. The part I don't understand is that the linear head of the classification model has 1920 input features and 1000 output features, yet it still works with the backbone.

@legel
Author

legel commented Nov 19, 2023

@tungts1101 I'm not sure what linear head you're using...

I have great news though. There was a pretty significant error in my approach to classification: basically, I was retraining all of the DinoV2 weights instead of just the head.

On the surface this may not seem like a big deal. But computing gradients for all of the original weights, versus just the head, is dramatically more expensive in both compute and memory. It also ends up preventing you from increasing the batch size -- which can greatly help performance -- among other major drawbacks.

One of the main goals of the Meta DinoV2 was to contribute a "foundational model" for vision, where you don't have to retrain the core (very much inspired by what OpenAI's ChatGPT did for NLP).

I'm now suddenly seeing fantastic evidence of this for the first time. I'm training a plant species classifier on a dataset close in size and complexity to iNaturalist 2021. And wow! Versus "fine-tuning" the weights of the DinoV2 vision transformer, I'm seeing convergence speed improve by at least a factor of 1,000!

I'll share several large classifier notebook updates ASAP, meanwhile I'm delighted to share this news, and key insight for making the most of the DinoV2 pre-trained weights (representing information from 142 million images).

@hoominchu

Hey @legel! I was looking for something to do exactly this. Did you update the notebook with multi-label classification? I don't see a commit later than your comment here. Would love to get hold of it though. Thanks!

@yhsmiley

yhsmiley commented Apr 9, 2024

@legel I'm also interested in the updated notebook! Will you be updating the code here too? Thanks!

@legel
Author

legel commented Apr 19, 2024

@yhsmiley @hoominchu I've pushed a notebook that demonstrates multi-class classification with DinoV2. I didn't have time to clean it up -- it is very raw -- but the core techniques work well.

I should've shared the two lines of code behind the key change, which ensures the DinoV2 pre-training is fully utilized, i.e. we don't train the transformer weights:

```python
for param in model.transformer.parameters():
    param.requires_grad = False
```

So, the key is to only train the classifier weights.
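A self-contained sketch of that pattern (using a stand-in `nn.Sequential` backbone, since the exact attribute to freeze -- `model.transformer` above -- depends on how you wrap the hub model):

```python
import torch
from torch import nn

# Stand-in for a pretrained backbone; in practice this would be e.g.
# torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 384))
head = nn.Linear(384, 10)

# Freeze the backbone so gradients (and optimizer state) exist only
# for the classifier head.
for param in backbone.parameters():
    param.requires_grad = False

trainable = [p for p in list(backbone.parameters()) + list(head.parameters())
             if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
print(len(trainable))  # 2  (head weight + bias)
```

Passing only the trainable parameters to the optimizer also avoids allocating Adam moment buffers for the frozen backbone.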

@BOX-LEO

BOX-LEO commented Apr 29, 2024

Does anyone know why the input feature dimension of the linear head, as used by DINOv2 with or without registers, is 1920 instead of 384, which corresponds to the output dimension of the backbone?
Are features from multiple layers concatenated together?
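One plausible explanation, sketched purely as an assumption (not verified against the repo here): the `_lc` linear heads appear to consume the class tokens of the last 4 blocks concatenated with the average-pooled patch tokens of the final block, giving 4 × 384 + 384 = 1920 for ViT-S/14:

```python
import torch

embed_dim = 384  # ViT-S/14 embedding size
# Hypothetical shapes: class token from each of the last 4 blocks,
# plus the patch tokens of the final block (256 patches assumed).
class_tokens = [torch.randn(1, embed_dim) for _ in range(4)]
patch_tokens = torch.randn(1, 256, embed_dim)

# Concatenate the 4 class tokens with the average-pooled patch tokens.
linear_input = torch.cat(class_tokens + [patch_tokens.mean(dim=1)], dim=1)
print(linear_input.shape)  # torch.Size([1, 1920])
```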

@legel
Author

legel commented Apr 29, 2024

@BOX-LEO I'm not sure what model specifically you're referring to or what the architecture is.

Could you print out `model.eval()` and share the output here?

@OrangeNo42

@legel Hi, why did you set img_size=526? I'm confused about the img_size, because the image is resized to 256 and, in the official code, center-cropped (CenterCrop) to 224.

@joker-bian

joker-bian commented Oct 26, 2024

@legel Hi, I would like to ask: if I want to train this model on image data with other than three channels, do I need to train it from the beginning? What should I do specifically?

@larosi

larosi commented Oct 27, 2024

> ...to ask if I want to train this model on image data other than three channels, do I need to train it from the beginning? What should I do specifically?

@joker-bian I recommend adapting your images to 3 channels. For example, medical images mainly use a single channel, and a typical approach is to repeat the image three times along the channel dimension. Another idea is to train a shallow autoencoder that maps N channels to 3 channels, and preprocess your dataset with it before training your classifier.
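The channel-repeat approach is one line in PyTorch (torchvision's `transforms.Grayscale(num_output_channels=3)` does the same thing inside a transform pipeline):

```python
import torch

# Single-channel (e.g. grayscale medical) image expanded to the 3
# channels a pretrained RGB backbone expects, by repeating the channel.
gray = torch.randn(1, 224, 224)     # (C=1, H, W)
rgb_like = gray.repeat(3, 1, 1)     # (C=3, H, W), all channels identical
print(rgb_like.shape)  # torch.Size([3, 224, 224])
```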

9 participants