How to resume checkpoint? #151

Open
mrsempress opened this issue Jan 21, 2025 · 0 comments

Because of resource limitations, my training run gets interrupted partway through, and I want to continue training from where it stopped.
To save checkpoints, I set save_strategy="epoch" in TrainingArguments().
After an interruption, I have the checkpoint from the last completed epoch, for example checkpoint-1166. It contains optimizer.pt, rng_state.pth, scheduler.pt, trainer_state.json, and intervenable_model, where intervenable_model has a .bin suffix.
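
To make the checkpoint layout above concrete, here is a minimal sketch that picks the newest checkpoint folder and lists its files. It assumes the folders follow the checkpoint-<step> naming seen above (such as checkpoint-1166). The helper names find_latest_checkpoint and list_checkpoint_files are hypothetical and are not part of transformers or pyreft.

```python
import re
from pathlib import Path
from typing import List, Optional

# Assumption: checkpoint folders are named "checkpoint-<step>", e.g. checkpoint-1166.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint folder with the highest step number, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            # Compare steps numerically so that 1166 ranks above 583.
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


def list_checkpoint_files(checkpoint_dir: Path) -> List[str]:
    """Return the sorted names of the entries inside a checkpoint folder."""
    return sorted(p.name for p in checkpoint_dir.iterdir())
```

Listing the folder this way only confirms which files were written; it does not by itself show whether those files can be loaded back when resuming.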

But when I change trainer.train() to trainer.train(resume_from_checkpoint=True), I get the following error:

AttributeError: 'ReftModel' object has no attribute '_keys_to_ignore_on_save'

How can I resume training from this checkpoint?
