T5 fine-tuning special tokens #158
I have to apologize for the inconsistent/missing documentation on pre-training ProtT5. We will try to improve the documentation in the future to avoid wasting people's time.
However, depending on the amount of data you have available, you can also pre-train on the original T5 pre-training task (span corruption with span lengths > 1). I did this successfully for ProstT5 fine-tuning (see link above).
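For anyone landing here later, below is a minimal sketch of what span corruption with mean span length > 1 could look like for a protein sequence. The checkpoint name is only an example, and the masking function (span boundaries, rates, sentinel placement) is my own simplification of the T5 objective, not the exact ProstT5 recipe:

```python
import random
from transformers import T5Tokenizer

# Illustrative checkpoint; substitute the model you actually fine-tune.
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")

def span_corrupt(tokens, noise_density=0.15, mean_span_len=3, seed=0):
    """Toy span corruption: replace random non-overlapping spans with
    <extra_id_N> sentinels. The target lists each sentinel followed by
    the tokens it replaced, as in the original T5 objective."""
    rng = random.Random(seed)
    inputs, targets, i, sid = [], [], 0, 0
    while i < len(tokens):
        # Start a masked span with a probability tuned so that roughly
        # noise_density of all tokens end up masked.
        if sid < 99 and rng.random() < noise_density / mean_span_len:
            sentinel = f"<extra_id_{sid}>"
            inputs.append(sentinel)
            targets.append(sentinel)
            targets.extend(tokens[i:i + mean_span_len])
            i += mean_span_len
            sid += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sid}>")  # closing sentinel, as in T5
    return " ".join(inputs), " ".join(targets)

seq = "E V Q L V E S G A E"  # ProtT5 expects space-separated residues
enc_text, tgt_text = span_corrupt(seq.split())
enc = tokenizer(enc_text, return_tensors="pt")             # appends </s> itself
labels = tokenizer(tgt_text, return_tensors="pt").input_ids
```

In a real training loop you would additionally replace padding ids in `labels` with -100 so the loss ignores them.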
Thank you for the quick and complete answer! And could you confirm whether I am correct about the post-processor (see the last paragraph of my message)?
Sorry for the delayed response. To briefly answer your question, "Does it mean all those extra tokens need to be added manually, before calling tokenizer()?": no, you do not need to add those tokens manually. All tokens needed for continued pre-training are part of the tokenizer.
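To make that concrete, here is a quick sanity check (a sketch; the checkpoint name is just an example) showing that the sentinel tokens are already in the vocabulary and that `</s>` is appended automatically:

```python
from transformers import T5Tokenizer

# Example checkpoint; the same check works for any T5-style tokenizer.
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")

print(tokenizer.eos_token)                              # '</s>'
print(tokenizer.convert_tokens_to_ids("<extra_id_0>"))  # a real id, not the <unk> id
ids = tokenizer("E V Q L V E S G A E").input_ids
print(tokenizer.convert_ids_to_tokens(ids))             # last token is '</s>': EOS added for you
```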
Hello,
First, thank you all for your work.
I am struggling to understand how to fine-tune T5.
In #113, it is mentioned that there are two EOS tokens (one for the encoder, one for the decoder). However, I can only see one EOS token.
#113 also references another answer from #137, which is strange because the two answers contradict each other.
There are many other T5 fine-tuning questions in the GitHub issues, I think because the instructions are not clear.
Combining these two contradictory sources, I think the correct way to do it would be (using the example sequence "E V Q L V E S G A E"):
Encoder input: E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E </s>
Decoder input: <pad> E V Q L V E S G A E </s>
Is that how the model was trained? If yes, it would be very helpful to put this on the Hugging Face Hub page.
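For reference, my understanding of what vanilla T5 span corruption would produce for the same sequence (the decoder target contains only the sentinels and the dropped spans, and the decoder input is that target shifted right, starting with <pad>) is:

Encoder input: E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E </s>
Decoder target: <extra_id_0> Q <extra_id_1> V <extra_id_2> A <extra_id_3> </s>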
Edit: another question: does the tokenizer include a post-processor? It seems not:
(Pdb) tokenizer.post_processor
*** AttributeError: 'T5Tokenizer' object has no attribute 'post_processor'
Does this mean all those extra tokens need to be added manually, before calling tokenizer()?
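A sketch related to the post-processor question, assuming a fast tokenizer can be built for the checkpoint (the checkpoint name is again only an example): the slow T5Tokenizer appends `</s>` inside `build_inputs_with_special_tokens` rather than through a `post_processor` attribute, while the fast tokenizer exposes its post-processor via `backend_tokenizer`:

```python
from transformers import T5Tokenizer, T5TokenizerFast

slow = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")
fast = T5TokenizerFast.from_pretrained("Rostlab/prot_t5_xl_uniref50")

# Slow (sentencepiece) tokenizer: EOS is appended here; there is no
# post_processor object, hence the AttributeError above.
print(slow.build_inputs_with_special_tokens([10, 11]))  # [10, 11, <eos id>]

# Fast tokenizer: the `tokenizers` pipeline's post-processor is reachable
# through backend_tokenizer, not as a direct attribute.
print(type(fast.backend_tokenizer.post_processor))
```

Either way, special-token handling happens automatically inside `tokenizer(...)`, so nothing needs to be added by hand.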