Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safer Checkpoint File Format #3181

Open
mbway opened this issue Apr 5, 2024 · 1 comment
Open

Safer Checkpoint File Format #3181

mbway opened this issue Apr 5, 2024 · 1 comment
Labels
enhancement New (engineering) enhancements, such as features or API changes.

Comments

@mbway
Copy link
Contributor

mbway commented Apr 5, 2024

🚀 Feature Request

I have been investigating if it is possible to store checkpoints in a format that does not depend on pickle, as pickles are inherently unsafe as they allow for arbitrary code execution. The .pt file format created by torch.save uses pickle internally.

Would this functionality be something that you would be interested in? If so I can post a PR with my implementation for consideration.

Implementation

My solution is a custom .safe format that is a .tar archive containing:

my_checkpoint.safe
├── tensors.safetensors
├── ndarrays.npz
└── other.json

To save a state dictionary, I pre-process it so anything not json-serializable gets converted to a dictionary like {"__safe_obj_type": "tensor", "id": "123"} for example. The __safe_obj_type key is then used during loading to restore the object back to what it was before. The gathered tensors and numpy arrays are stored as safetensors and npz respectively and anything else is stored in the json file.

At least for the state-dict I was trying to serialize I found that only a few non-json-serializable data types needed to be supported: torch.tensor, np.ndarray, datetime.timedelta, tuple Though a few others may be needed if this was implemented beyond a proof of concept.

@mbway mbway added the enhancement New (engineering) enhancements, such as features or API changes. label Apr 5, 2024
@mvpatel2000
Copy link
Contributor

@mbway Thanks for flagging this -- we're currently actively working on a major redesign of checkpointing that will include migrating to a safer format. We've been hesitant to change how checkpointing works as we deeply care about backwards compatibility, so we have been buffering up a long list of features we will now include in the revamp. You should start seeing things roll out this month.

CC: @eracah it might be good to publicly share a version of the roadmap as a RFC when its ready

If you already have a PR, please feel free to post it -- it's probably helpful for us to view a design and gain some inspiration. Given the refactor though, we likely wouldn't accept it as this seems like a big change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New (engineering) enhancements, such as features or API changes.
Projects
None yet
Development

No branches or pull requests

2 participants