Can a custom dataset/tokenizer class be used without forking the project and manually splicing it in? #452
-
Hi! I'm working on training CLIP on a specialized medical imagery dataset, and we've developed a custom tokenizer to efficiently tokenize the long medical reports attached to our images. I've already successfully used OpenCLIP on this project once, but the way I went about it was messy and required me to fork the project and splice my own custom dataset class and tokenizer class into the source.

Now that I'm on the second iteration of the project, starting fresh, I figured I'd ask: is there any way to use a custom dataset and/or tokenizer class without needing to explicitly modify OpenCLIP's source code? Is there a command-line option somewhere I'm missing? Totally okay if the answer is no; if that is the answer I may end up making a pull request! Just seemed it would be good to check I'm not reinventing the wheel first. Thank you all for your hard work on this project!
Replies: 1 comment 1 reply
-
Hello, regarding tokenizers: if you add a new model config under src/open_clip/model_configs that points to a tokenizer on Hugging Face, that should work (see https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/factory.py#L76).
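For example, a minimal sketch of what such a config could look like. The config name (`med-ViT-B-32`), the Hugging Face repo names, and the vision tower settings are placeholders; the `text_cfg` fields mirror the existing HF-based configs (e.g. `roberta-ViT-B-32.json`), so verify the exact field names against the open_clip version you're on:

```bash
# Sketch only: "med-ViT-B-32" and the Hugging Face repo names are placeholders.
# The text_cfg fields mirror existing HF-based configs (e.g. roberta-ViT-B-32.json);
# double-check the exact field names against your open_clip version.
cat > src/open_clip/model_configs/med-ViT-B-32.json <<'EOF'
{
  "embed_dim": 512,
  "vision_cfg": {
    "image_size": 224,
    "layers": 12,
    "width": 768,
    "patch_size": 32
  },
  "text_cfg": {
    "hf_model_name": "your-org/your-medical-text-encoder",
    "hf_tokenizer_name": "your-org/your-medical-tokenizer",
    "proj": "mlp",
    "pooler_type": "mean_pooler"
  }
}
EOF
```

Once the config exists, passing `--model med-ViT-B-32` to the training script (or calling `open_clip.get_tokenizer('med-ViT-B-32')`) should resolve the Hugging Face tokenizer automatically via the factory code linked above.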
Regarding datasets: if you can convert your dataset to webdataset format or CSV, it is supported via the --train-data flag; otherwise, you'll have to add it manually for now.
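For the CSV route, a rough example of a launch command, assuming a comma-separated file with `filepath` and `caption` columns; the flag names follow the training README, but the entry-point module has changed across versions, so check your install:

```bash
# Rough sketch: paths, column names, and the --model value are placeholders
# (--model refers to the hypothetical config from the sketch above).
# Older releases use `python -m training.main` instead of `open_clip_train.main`.
python -m open_clip_train.main \
    --model med-ViT-B-32 \
    --train-data /path/to/train.csv \
    --dataset-type csv \
    --csv-separator "," \
    --csv-img-key filepath \
    --csv-caption-key caption \
    --batch-size 64
```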