
T5 fine-tuning special tokens #158

Open
tombosc opened this issue Nov 1, 2024 · 3 comments


tombosc commented Nov 1, 2024

Hello,

First, thank you all for your work.

I am struggling to understand how to fine-tune T5.

In #113, it is mentioned that there are 2 EOS tokens (one for the encoder, one for the decoder). However, I can only see one EOS token:

```
(Pdb) tokenizer
T5Tokenizer(name_or_path='Rostlab/prot_t5_xl_uniref50', vocab_size=28, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': [...]
```
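
For reference, a minimal sketch (my own illustration, assuming the transformers and sentencepiece packages are installed) of loading and inspecting this tokenizer:

```python
from transformers import T5Tokenizer

# Slow (SentencePiece-based) ProtT5 tokenizer, as shown in the Pdb output above.
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)

# There is a single EOS token, plus pad/unk and the additional <extra_id_*> special tokens.
print(tokenizer.eos_token, tokenizer.pad_token, tokenizer.unk_token)
print(tokenizer.additional_special_tokens[:5])
```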

#113 also references another answer from #137 which is strange:

  • no pad token (problem, because then the first token is not modelled)
  • no eos token at all (problem in the decoder, because end of sequence token is not modelled)
  • the masked token embeddings have the same ID

There are many other T5 fine-tuning questions in the GitHub issues, I think because the instructions are not clear.

Combining these 2 contradictory sources, I think the correct way to do it would be (using the example "E V Q L V E S G A E"):

  • Input: E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E </s>
  • Label: <pad> E V Q L V E S G A E </s>

Is that how the model was trained? If yes, it would be very helpful to put this on the Hugging Face Hub page.

Edit: another question: does the tokenizer include a post-processor? It seems not:

```
(Pdb) tokenizer.post_processor
*** AttributeError: 'T5Tokenizer' object has no attribute 'post_processor'
```

Does it mean all those extra tokens need to be added manually, before calling tokenizer()?

mheinzinger (Collaborator) commented Nov 4, 2024

I have to apologize for the inconsistent/missing documentation on pre-training ProtT5. We will try to improve the documentation in the future to avoid wasting people's time.
Let me clarify:

  • Regarding EOS: as you pointed out, there is only a single EOS token, </s>, which I recently used successfully for fine-tuning by appending it to both the input sequence and the output sequence.

  • No pad token: as you pointed out, without the pad token there is no way to model the very first token. This can be acceptable if you aim for representation learning (you effectively drop the decoder after fine-tuning anyway), but it is absolutely unacceptable if you want to use the model for generation. This is also why I added the <pad> token as the very first token of the decoder input when fine-tuning ProstT5. So, irrespective of the original ProtT5 pre-training, I would add it if you need it for your use case, i.e., if you aim for generative capability.

  • Using the same ID for all mask tokens only works because (a) we set the mask length to 1 during ProtT5 pre-training, and (b) we always model the full sequence in the decoder. This way, there is no need to tell the model which mask token to fill at a specific generation step, because the model always reconstructs the full sequence and each mask token corresponds to exactly one token in the prediction (no collapsing of multiple tokens into a single span).

  • Re post-processor: I simply took the Hugging Face T5 pre-training example and used the dataloader from there: https://github.com/mheinzinger/ProstT5/blob/main/scripts/pretraining_scripts/pretraining_stage1_MLM.py
    So in summary, if you want to stick closely to the original pre-training, you can use the following format (a short sketch of how to feed this pair to the model follows after this list):

    Input: E V <extra_id_0> L <extra_id_0> E S G <extra_id_0> E </s>
    Label: <pad> E V Q L V E S G A E </s>
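
A minimal sketch of how this input/label pair could be passed to the Hugging Face model (my own illustration, assuming the transformers and sentencepiece packages and that the <extra_id_*> tokens are registered in the tokenizer as shown above; note that T5ForConditionalGeneration builds the decoder input itself by shifting the labels right and prepending the <pad> token, so the <pad>-prefixed decoder input above does not have to be constructed manually):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained("Rostlab/prot_t5_xl_uniref50")

# Masked input; the tokenizer appends </s> automatically.
inputs = tokenizer("E V <extra_id_0> L <extra_id_0> E S G <extra_id_0> E", return_tensors="pt")

# Full target sequence; </s> is appended here as well.
labels = tokenizer("E V Q L V E S G A E", return_tensors="pt").input_ids

# The model shifts `labels` one position to the right and prepends
# decoder_start_token_id (= the <pad> token) to build the decoder input,
# i.e. "<pad> E V Q L V E S G A E", matching the scheme above.
outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
print(outputs.loss)
```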

However, depending on the amount of data you have available, you can also pre-train on the original T5 pre-training task (span corruption with span lengths > 1). I did this successfully for ProstT5 fine-tuning (see the link above).
One thing I would recommend if you do not aim to use your fine-tuned model for generation: simply take the ProtT5 encoder-only part and use the Hugging Face MLM example with <extra_id_0> as the MASK token.
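
A rough sketch of that encoder-only route (my own illustration; the token-level prediction head below is an assumption and not part of the released checkpoint, and the loss computation assumes each amino acid and each <extra_id_0> maps to exactly one token, which holds for ProtT5's character-level vocabulary):

```python
import torch.nn as nn
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
encoder = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")

# Hypothetical token-level prediction head; not part of the released checkpoint.
lm_head = nn.Linear(encoder.config.d_model, len(tokenizer), bias=False)

seq = "E V Q L V E S G A E"
masked = "E V <extra_id_0> L V E S G A E"  # <extra_id_0> used as the MASK token

enc = tokenizer(masked, return_tensors="pt")
targets = tokenizer(seq, return_tensors="pt").input_ids  # same length as the masked input

hidden = encoder(**enc).last_hidden_state  # (1, seq_len, d_model)
logits = lm_head(hidden)                   # (1, seq_len, vocab_size)

# Compute the loss only on the masked position(s).
mask_id = tokenizer.convert_tokens_to_ids("<extra_id_0>")
mask_positions = enc.input_ids == mask_id
loss = nn.functional.cross_entropy(logits[mask_positions], targets[mask_positions])
```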

tombosc (Author) commented Nov 4, 2024

Thank you for the quick and complete answer! Could you also confirm whether I am correct about the post-processor (cf. the last paragraph of my message)?

mheinzinger (Collaborator) commented

Sorry for the delayed response. To briefly answer your question, "Does it mean all those extra tokens need to be added manually, before calling tokenizer()?": no, you do not need to add those tokens manually. All tokens that are needed for continued pre-training are part of the tokenizer.
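
For example, a quick check along these lines (a minimal sketch on my part, again using the tokenizer shown above) should show that </s> is appended automatically and that the <pad> and <extra_id_*> tokens are already in the vocabulary:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)

ids = tokenizer("E V Q L V E S G A E").input_ids
print(tokenizer.convert_ids_to_tokens(ids))  # expected to end with '</s>'

# The special tokens needed for continued pre-training are already registered.
print(tokenizer.convert_tokens_to_ids(["<pad>", "</s>", "<extra_id_0>"]))
```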
