Notebook training classifier on custom dataset #305
When I used only the backbone |
@tungts1101 if you want to use your own head (versus what's defined at https://github.com/legel/dinov2/blob/main/notebooks/classification.ipynb) then note that your mismatch is between the dimensionality of the output of backbone (384) and the input dimensionality of your head (1920). If you change |
PS I've upgraded my classification.ipynb script to work for multi-class problems (rather than just binary) and added better automation for downloading the pre-trained .pth files; I'll upload these changes shortly.
@legel Doing what you said works. What I don't understand is that the linear head of the classification model has 1920 input features and 1000 output features, yet it still works with the backbone.
@tungts1101 I'm not sure what linear head you're using... I have great news though. There was a pretty significant "error" in my approach to classification: I was retraining all of the DinoV2 weights instead of just the head. On the surface this may not seem to be a big deal, but computing gradients for all of the original weights is dramatically more expensive, in both compute and memory, than computing them for the head alone. That cost in turn prevents increasing the batch size, which can greatly help performance, among other major benefits. One of the main goals of Meta's DinoV2 was to contribute a "foundational model" for vision, where you don't have to retrain the core (very much inspired by what OpenAI's ChatGPT did for NLP). I'm now suddenly seeing fantastic evidence of this for the first time. I'm training a plant species classifier on a dataset close in size and complexity to iNaturalist 2021. And wow! Compared with "fine-tuning" the weights of the DinoV2 vision transformer, I'm seeing convergence speed up by at least a factor of 1,000x! I'll share several large classifier notebook updates ASAP; meanwhile I'm delighted to share this news, and the key insight for making the most of the DinoV2 pre-trained weights (representing information from 142 million images).
Hey @legel! I was looking for something to do exactly this. Did you update the notebook with multi-label classification? I don't see a commit later than your comment here. Would love to get hold of it though. Thanks!
@legel I'm also interested in the updated notebook! Will you be updating the code here too? Thanks!
@yhsmiley @hoominchu I've pushed a notebook that demonstrates multi-class classification with DinoV2. I didn't have time to clean it up and it is very raw, but the core techniques work well. I should've shared the one or two lines of code for the key innovation, which ensures DinoV2 pre-training is fully utilized, i.e. we don't train the transformer weights:
So, the key is to only train the classifier weights.
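The snippet referred to above did not survive in this copy of the thread. As a minimal sketch of the general idea only (not the notebook's actual code), freezing a backbone in PyTorch and training just a classifier head might look like the following. The `backbone` here is a tiny hypothetical stand-in so the example runs without downloading weights, and the class count, learning rate, and tensor shapes are arbitrary assumptions.

```python
import torch
from torch import nn

# Hypothetical stand-in for the pre-trained DinoV2 backbone (assumption);
# it only mimics a 384-dimensional feature output so the sketch runs offline.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 14 * 14, 384))
head = nn.Linear(384, 5)  # 5 classes chosen arbitrarily for illustration

# Freeze the backbone: its weights receive no gradients and are never updated.
for param in backbone.parameters():
    param.requires_grad = False
backbone.eval()

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

images = torch.randn(8, 3, 14, 14)      # dummy batch
labels = torch.randint(0, 5, (8,))      # dummy labels

backbone_before = [p.detach().clone() for p in backbone.parameters()]
head_before = head.weight.detach().clone()

# Features are computed without building an autograd graph for the backbone.
with torch.no_grad():
    features = backbone(images)

loss = nn.functional.cross_entropy(head(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

backbone_unchanged = all(
    torch.equal(before, after)
    for before, after in zip(backbone_before, backbone.parameters())
)
head_changed = not torch.equal(head_before, head.weight)
```

Because the frozen features do not depend on the head, they could also be computed once and cached, which is one reason a head-only setup leaves room for much larger batch sizes.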
Does anyone know why the input feature dimension of the linear head, as used by DINOv2 with or without registers, is 1920 instead of 384, which is the output dimension of the backbone?
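One likely explanation, offered here as an assumption rather than something confirmed in this thread: the released linear heads appear to take the class tokens of the last four transformer blocks concatenated with the mean of the final block's patch tokens, which for a 384-wide backbone gives (4 + 1) × 384 = 1920. The sketch below reproduces only that arithmetic with random tensors; the layer count, patch count, and variable names are illustrative assumptions.

```python
import torch

embed_dim = 384       # backbone output width mentioned in the thread
num_layers = 4        # assumption: the head reads the last 4 blocks
batch, num_patches = 2, 256

# Random stand-ins for each block's output: (patch tokens, class token).
layer_outputs = [
    (torch.randn(batch, num_patches, embed_dim), torch.randn(batch, embed_dim))
    for _ in range(num_layers)
]

# Class token from each of the 4 blocks, plus mean-pooled patches of the last block.
cls_tokens = [cls for _, cls in layer_outputs]
mean_patch_tokens = layer_outputs[-1][0].mean(dim=1)

linear_input = torch.cat(cls_tokens + [mean_patch_tokens], dim=1)
print(linear_input.shape)  # torch.Size([2, 1920])
```

Under this reading, a head that consumes only the final class token would need 384 input features, which matches the mismatch discussed earlier in the thread.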
@BOX-LEO I'm not sure what model specifically you're referring to or what the architecture is. Could you print out
and share here? |
@legel Hi, why did you set img_size=526? I am confused about img_size, because the image is resized to 256, and in the official code the image is then center-cropped (CenterCrop) to 224.
@legel Hi, I would like to ask: if I want to train this model on image data that does not have three channels, do I need to train it from scratch? What should I do specifically?
@joker-bian I recommend adapting your images to 3 channels. For example, medical images mainly use a single channel, and a typical approach is to repeat that channel three times. Another idea is to train a shallow autoencoder that maps N channels to 3 channels, and to preprocess your dataset with it as a step before training your classifier.
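A minimal sketch of the channel-repetition approach described above, assuming a PyTorch batch shaped (N, 1, H, W); the helper name `to_three_channels` is hypothetical and the 224×224 size is just an example.

```python
import torch


def to_three_channels(images: torch.Tensor) -> torch.Tensor:
    """Repeat a (N, 1, H, W) batch along the channel axis to get (N, 3, H, W)."""
    if images.ndim != 4 or images.shape[1] != 1:
        raise ValueError("expected a batch shaped (N, 1, H, W)")
    # repeat() copies the single channel three times along dimension 1.
    return images.repeat(1, 3, 1, 1)


gray = torch.rand(4, 1, 224, 224)   # dummy single-channel batch
rgb_like = to_three_channels(gray)
print(rgb_like.shape)  # torch.Size([4, 3, 224, 224])
```

The three output channels are identical copies, so any per-channel normalization expected by the pre-trained model still has to be applied afterwards.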
It was difficult and time-consuming for me to get a "hello world" training of a DinoV2 classifier on a custom dataset.
Existing options involve diving deep into complex classes and APIs that appear to be designed especially for ImageNet.
In any case, a simple starter notebook is likely to prove useful for many others.
Inspired by successful working code and a tutorial from here, I developed and tested code for downloading a DinoV2 model and training classifier layers from scratch on a custom dataset. This is contained in a classification.ipynb notebook.
There are several GitHub issues that this notebook will serve, which readers / authors from there may wish to be aware of: