Can a custom dataset/tokenizer class be used without forking the project and manually splicing it in? #452
-
Hi! I'm working on training CLIP on a specialized medical imagery dataset, and we've developed a custom tokenizer to efficiently tokenize the long medical reports attached to our images. I've already successfully used OpenCLIP on this project once, but the way I went about it was messy and required me to fork the project and splice my own custom dataset class and tokenizer class into the source.

Now that I'm on the second iteration of the project, starting fresh, I figured I'd ask: is there any way to use a custom dataset and/or tokenizer class without needing to explicitly modify OpenCLIP's source code? Is there a command-line option somewhere I'm missing? Totally okay if the answer is no; if that is the answer I may end up making a pull request! Just seemed it would be good to check I'm not reinventing the wheel first. Thank you all for your hard work on this project!
Replies: 1 comment 1 reply
-
Hello, regarding tokenizers: if you add a new model config under src/open_clip/model_configs that points to a tokenizer on Hugging Face, that should work (see https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/factory.py#L76).
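For example, a minimal sketch of what such a config could look like. The config name (`med-ViT-B-32`), the Hugging Face repo names, and the vision tower settings are placeholders; the `text_cfg` fields mirror the existing HF-based configs (e.g. `roberta-ViT-B-32.json`), so verify the exact field names against the open_clip version you're on:

```bash
# Sketch only: "med-ViT-B-32" and the Hugging Face repo names are placeholders.
# The text_cfg fields mirror existing HF-based configs (e.g. roberta-ViT-B-32.json);
# double-check the exact field names against your open_clip version.
cat > src/open_clip/model_configs/med-ViT-B-32.json <<'EOF'
{
  "embed_dim": 512,
  "vision_cfg": {
    "image_size": 224,
    "layers": 12,
    "width": 768,
    "patch_size": 32
  },
  "text_cfg": {
    "hf_model_name": "your-org/your-medical-text-encoder",
    "hf_tokenizer_name": "your-org/your-medical-tokenizer",
    "proj": "mlp",
    "pooler_type": "mean_pooler"
  }
}
EOF
```

Once the config exists, passing `--model med-ViT-B-32` to the training script (or calling `open_clip.get_tokenizer('med-ViT-B-32')`) should resolve the Hugging Face tokenizer automatically via the factory code linked above.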
Regarding datasets: if you can convert your dataset to webdataset format or CSV, it is supported via the --train-data flag; otherwise, you'll have to add it manually for now.
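For the CSV route, a rough example of a launch command, assuming a comma-separated file with `filepath` and `caption` columns; the flag names follow the training README, but the entry-point module has changed across versions, so check your install:

```bash
# Rough sketch: paths, column names, and the --model value are placeholders
# (--model refers to the hypothetical config from the sketch above).
# Older releases use `python -m training.main` instead of `open_clip_train.main`.
python -m open_clip_train.main \
    --model med-ViT-B-32 \
    --train-data /path/to/train.csv \
    --dataset-type csv \
    --csv-separator "," \
    --csv-img-key filepath \
    --csv-caption-key caption \
    --batch-size 64
```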