Add `export --output-snapshot-path snap.tc`, and `--snapshot-path snap.tc` #1465
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1465
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit da16f6a with merge base 5684175.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
The snapshot load path may need some Python imports to pull in all quantization custom ops and custom kernels that quantize.py may make available, so that they are available when a model snapshot gets reloaded. The best approach may be to wholesale import all of them, rather than saving additional info in the snapshot and doing selective imports, because there just aren't enough custom ops to make a wholesale import prohibitive when reloading a snapshot.
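A minimal sketch of what that load path could look like. The modules imported for their op registrations are placeholders, not the actual torchchat layout:

```python
import torch

def load_snapshot(snapshot_path: str) -> torch.nn.Module:
    # Wholesale-import everything that registers quantization custom ops and
    # custom kernels, so unpickling the snapshot can resolve them.
    # These imports are illustrative; the real registration modules may differ.
    import torchao  # noqa: F401
    # import torchchat.utils.quantize  # noqa: F401  (hypothetical)

    assert snapshot_path.endswith(".tc"), "use .tc extension for snapshots"
    # weights_only=False because the snapshot pickles the full module,
    # including rewritten submodules and quantized-tensor subclasses.
    return torch.load(snapshot_path, weights_only=False)
```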
# helper that generates just model.config
with measure_time("Time to load model: {time:.02f} seconds"):
    model = _load_model(builder_args)
    device_sync(device=builder_args.device)
Q: Does the saved artifact still work if the device has changed? I recall this being an issue with AO (one of the reasons why we didn't add saving earlier).
No, it might not; most likely not, in fact. It really depends on which quantizations are performed and whether they're implemented on the multiple platforms, and in the same way. I.e., if it's the same PyTorch/Python code for doing computation with quantized numbers, and the same quantization formats are supported, then yes.
If it's a C/C++/CUDA operator, it needs to be supported, with the same name, or with a suitable if/then/else (i.e., don't bake the "device" setting in).
Quantization weight format layouts need to be consistent, or the loader needs to repack them at load time. (This is totally plausible to do, but I don't think we do that today. I think in the 4b case "we just know". I tried to change that, but the need/priority wasn't similarly perceived by everybody.)
If it's saved and reloaded, most (all?) decisions you made are set in stone, like quantization schemes etc. (Otherwise, you'd be loading from scratch?) In some sense that's similar to how dso/aoti/pte-output-path / load-dso/aoti/pte-path work, and that's why it's modeled after that export-and-reload facility. You don't get to change the PTE target on reload, or the device that an AOTI model has been compiled for.
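To make the repacking point above concrete, here is a purely hypothetical illustration of where a load-time repacking hook could sit, assuming a toy layout with two 4-bit values per byte; real kernels define their own layouts, and torchchat does not do this today:

```python
import torch

def repack_int4_for_device(packed: torch.Tensor, device: str) -> torch.Tensor:
    # Toy layout assumption: two 4-bit values share one uint8, low nibble first.
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    unpacked = torch.stack((low, high), dim=-1).flatten(-2)
    # A real loader would re-pack into whatever layout the target device's
    # quantized kernels expect; here we simply restore the same toy layout.
    repacked = (unpacked[..., 1::2] << 4) | unpacked[..., 0::2]
    return repacked.to(device)
```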
I think expecting the exporting conditions to be the same as the executing conditions is a fair start.
def export_snapshot(
    model: nn.Module,
    device: Optional[str] = None,
unused arg?
It's a cut-and-paste thing, and I wanted to keep the args consistent. Mind you, we could put the device in the file or some such, and check on reload that it's the same. (I think MPS/CPU are sorta fungible, which might help on Macs with quantization when you run out of kernel memory to quantize large models; CPU could use paging to SSD for that. See e.g. the discussion on #1483.)
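A minimal sketch of that idea, assuming the snapshot wraps the module in a small dict at save time; the payload keys and function names here are illustrative, not the PR's actual format:

```python
import torch
import torch.nn as nn


def save_snapshot_with_meta(model: nn.Module, device: str, output_path: str) -> None:
    # Record the device the model was quantized for alongside the module itself,
    # so the loader can sanity-check it on reload.
    torch.save({"device": device, "model": model}, output_path)


def load_snapshot_with_meta(snapshot_path: str, device: str) -> nn.Module:
    payload = torch.load(snapshot_path, weights_only=False)
    if payload["device"] != device:
        raise RuntimeError(
            f"snapshot was exported for {payload['device']!r}, "
            f"but this run requested {device!r}"
        )
    return payload["model"]
```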
        The path to the exported model.
    """
    assert output_path.endswith(".tc"), "use .tc extension for snapshots"
    torch.save(model, output_path)
Do we need the whole model, or can we get away with just the state_dict? https://github.com/pytorch/torchchat/pull/1280/files
That said, if we go with the slimmer state_dict, that's dependent on the migration to the AO quant that supports this saving.
You rewrite the code as part of the quantization. If you don't save the code, then you must exactly replicate whatever quantization options you used, create an empty model, quantize it, and then load the state dict over it. Either you transfer the whole responsibility to the user (good luck; you'll die the death of a thousand cuts when users make mistakes and complain that this facility does not work), or you need to save an ungodly amount of information about the options used for the original quantization process.
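A sketch of the two reload paths being weighed here. `build_unquantized_model` and `apply_quantization` are stand-ins for whatever torchchat actually does, purely for illustration:

```python
import torch
import torch.nn as nn


def build_unquantized_model() -> nn.Module:
    # Stand-in for rebuilding the original architecture.
    return nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))


def apply_quantization(model: nn.Module, **quant_options) -> None:
    # Stand-in for the real recipe; the point is that it must be replayed
    # with exactly the same options that were used at export time.
    pass


# Path A: the whole quantized module was pickled, so reload is one call and
# no quantization options need to be remembered.
def reload_whole_module(snapshot_path: str) -> nn.Module:
    return torch.load(snapshot_path, weights_only=False)


# Path B: only the state_dict was saved, so the quantization recipe has to be
# replayed on a freshly built model before the saved weights fit onto it.
def reload_from_state_dict(state_dict_path: str, quant_options: dict) -> nn.Module:
    model = build_unquantized_model()
    apply_quantization(model, **quant_options)
    state_dict = torch.load(state_dict_path, weights_only=False)
    model.load_state_dict(state_dict, assign=True)
    return model
```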
We can update the serialization logic as the AO migration finishes (i.e., this PR is good), but I'm not sure that's the case anymore with AO. I was under the impression that the model itself is unaffected and that only the weights are changed.
cc: @vmpuri @HDCharles
You should be able to get by with just the state dict; there are a few APIs that don't work that way, but all the subclass ones do. That's like 75% of the reason that we went with subclasses instead of module swaps.
Jack, your link is a good resource; an alternate reference is our serialization tests, to see what is explicitly tested.
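For the subclass route, a rough sketch of the state_dict round trip using torchao's `quantize_` API; the specific recipe (`int8_weight_only`) and the toy model are just examples, not what torchchat ships:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only


def state_dict_round_trip(path: str = "snap_state_dict.pt") -> nn.Module:
    model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
    quantize_(model, int8_weight_only())   # weights become AO tensor subclasses
    torch.save(model.state_dict(), path)   # only the (quantized) weights go to disk

    # On reload, a plain float model of the same architecture is enough:
    # assign=True swaps its parameters for the saved subclass tensors.
    fresh = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
    fresh.load_state_dict(torch.load(path, weights_only=False), assign=True)
    return fresh
```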
Love the PR. This has been on the backlog for a bit (and unblocks a potential path to quantize at download time and discard the unquantized snapshot). Left some comments.
You're welcome :) Glad you like it! PS: If you quantize at download, you need to have a way to specify quantization options. Totally doable, but I'm not sure about discarding the unquantized snapshot. As a user, if I change my mind, or forgot the device, or want to change the quantization, the model has to be re-downloaded. Even on my corpnet in Silicon Valley, the larger models are painful to download -- even worse for the rest of the world. But then again, I have a fairly large disk on my laptop (though on the tradeoff of download bandwidth vs. SSD size, I think SSD size wins for most?). Then again, if users think they might change their mind, they should just download and quantize separately...
Definitely want to spin up an RFC before we push the quantized download through.
Our reference here is Ollama, which gives an "it just works" experience due to its default of quantized inference (GGUF format aside). We'd give a default quantization setting when downloading in this fashion.
This will always be an option :D
Add `export --output-snapshot-path snap.tc`, and `--snapshot-path snap.tc` (#1465)
* support model snapshots to save quantized models
* import set backend
Co-authored-by: Michael Gschwind <[email protected]>
Add ability to save and restore quantized models #1032