Change the HFOnnx pipeline to use Hugging Face Optimum rather than onnxruntime directly #371
Comments
I'll have ONNX be a focus in the 5.2 release. About to release 5.1. |
I made this comment in #369, but I'll put it here as well since it seems more focused on ONNX improvements: the HFOnnx __call__() method uses opset=12 as its default value, but ONNX v1.12 has added opset 17. I wonder whether it might be possible/prudent to add some sort of version check to this method so that the highest supported opset is used by default.
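A minimal sketch of what such a check could look like (the helper name and opset ceiling here are illustrative, not txtai code):

```python
from onnx import defs

def default_opset(ceiling=17):
    # Use the highest opset the installed onnx package supports,
    # capped at a ceiling the export code has been tested against.
    return min(defs.onnx_opset_version(), ceiling)
```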
Here are the ONNX versions and their respective opsets: https://onnxruntime.ai/docs/reference/compatibility.html |
Sounds good. I've taken a preliminary look and I think the transformers.onnx package can replace some but not all of the code in HFOnnx. For whatever reason, the default opset in transformers is 11. I wonder if the thinking behind this is picking the lowest opset necessary to maximize compatibility (i.e. support more versions). |
Probably. But if there's a check for the installed version, then that should solve any compatibility issues - the highest supported version will always be used as the default. I look forward to whatever you're able to put together to make onnx model usage more accessible! From everything I've read about it, it provides a massive performance improvement. |
Rather than simply implement ONNX for Seq2Seq models, as discussed in this Slack thread, it would be beneficial and prudent to outsource the conversion work to Hugging Face Optimum. Some relevant links:
Please don't feel any pressure to implement this immediately on my account! But I do think it would be quite helpful for txtai users to be able to make full use of ONNX, and it would lighten the load on you of monitoring and implementing changes in onnxruntime. |
I've made some good progress on building a mechanism to seamlessly auto-convert any HF transformers model to ONNX with HF Optimum, which also allows the user to configure the optimization level and other important parameters. Each combination of optimization level (0-4) and quantize (True/False) generates its own model, which can be saved to a models directory of choice for quicker re-loading. I'd like to make it available to txtai via PR so that anyone could start reaping the massive performance and resource (RAM usage) improvements of a fully optimized ONNX model in any of their existing txtai pipelines/workflows/applications, with as little friction as possible. However, I am not sure what the preferred approach would be. The easiest option, it seems to me, would be to create a standalone conversion pipeline. Perhaps a better approach would be creating something more tightly integrated with the existing pipelines. Any thoughts? I'll probably proceed with the first option today as it should be relatively seamless. But I'm happy to discuss and modify things to accommodate whatever approach you think would be best. |
On second thought, I'm not sure that the first, standalone pipeline approach would be all that useful - you'd have to run a separate one for each txtai pipeline that you want to use, and probably also have to know which task each model maps to. The only reasonable way to do it would be to allow for setting arguments in any pipeline. Any thoughts on any of this? I'm happy to do all of the work if you can point me in the right direction, such that all you'd need to do is review the code and modify things to accommodate your style, and make it more robust, efficient, etc. |
For this change to be most effective, the best path is to replace this class - https://github.com/neuml/txtai/blob/master/src/python/txtai/models/onnx.py - with an Optimum version, and to change this method - https://github.com/neuml/txtai/blob/master/src/python/txtai/models/models.py#L118 - to detect loading an ONNX model. Last I checked, Optimum doesn't support loading streaming models; it expects everything to work with files. That will cause issues with some existing functionality. The other piece would be changing HFOnnx to use Optimum to convert models. From there you shouldn't need to add these extra arguments, as that would be done at model conversion time. |
Thanks! I'll explore those files and submit something for your review when ready. Yeah, Optimum works off of files. The workflow I've set up is roughly:
- Check whether a converted model already exists on disk for the requested settings
- If not, convert the transformers model to ONNX with Optimum
- Apply the chosen optimization level and/or quantization
- Save the result to a models directory so it can be re-loaded quickly later
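A rough sketch of that caching idea, assuming the Optimum ORTOptimizer API; the directory layout, helper name, and optimized file name are illustrative and may differ by Optimum version:

```python
import os

from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig


def cached_onnx_model(model_id, cache="models", level=1):
    # One folder per model + optimization level combination
    path = os.path.join(cache, f"{model_id.replace('/', '_')}-O{level}")

    if not os.path.isdir(path):
        # Export the transformers model to ONNX, optimize it and save it to disk
        model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
        optimizer = ORTOptimizer.from_pretrained(model)
        optimizer.optimize(save_dir=path, optimization_config=OptimizationConfig(optimization_level=level))

    # Reload from disk (the optimized file name can vary between Optimum versions)
    return ORTModelForSequenceClassification.from_pretrained(path, file_name="model_optimized.onnx")
```

The model returned there could then be passed to a txtai pipeline together with a tokenizer, as in the examples further down.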
I've built it such that it checks for existing files as early as possible to minimize processing, but it can also store many versions depending on what combination of optimization and quantization is desired. Once the initial download and processing is done, it should all load quickly from disk. I don't know anything about streaming - could you please point me to the relevant txtai code where streaming is used, so that I can dig in and see what it does and what options might exist for this implementation? At the very least, it seems like this could be implemented as a sort of optional alternative. It seems far easier and more maintainable going forward to offload all the work to Hugging Face than to build custom mechanisms directly on top of onnxruntime to do optimization and quantization for all the different transformer model types (Seq2Seq, FeatureExtraction, etc.). |
Did you try using txtai pipelines with ORTModelForXYZ.from_pretrained?

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

from txtai.pipeline import Labels

model = ORTModelForSequenceClassification.from_pretrained("path", from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained("path")

labels = Labels((model, tokenizer), gpu=False)
labels("Text to label")
```

And for Seq2Seq:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

from txtai.pipeline import Sequences

model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained("t5-small")

sequences = Sequences((model, tokenizer), gpu=False)
sequences("translate English to French: Hello")
```

This should all work with quantized models as well:

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# model is the ORTModel loaded above
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="quant", quantization_config=qconfig)
```

So while I like the idea of making this more integrated, it seems like it's pretty straightforward to run Optimum models with txtai as it stands right now. |
Yes, I believe I mentioned somewhere above (or in Slack) that it works quite well already by passing a (model, tokenizer) tuple into the pipeline. But it resulted in a lot of redundant code for me when I started testing it with different pipelines. So, I started refactoring and eventually arrived at something generalized and flexible - along with something that makes use of saving the various converted/optimized/quantized models. So, I figured that it would be worth going the extra mile to make something nice and integrated - both for my sake and for everyone else who doesn't have the time/interest to figure all of this out. If accompanied by a good Colab example, I think it would be well received by txtai users - it should make it simple for anyone to experiment with any of this by adding/tweaking some arguments in their existing code. It will also make it easy for people to make use of any hardware-specific acceleration they have available (via the ONNX execution providers, such as the OpenVINO EP - which seems to be highly compatible with most systems and even more powerful than raw onnxruntime). I'm going to build it anyway, so would you be willing to review a PR for this? To start, I'll leave the existing implementation untouched. |
If it can be done by changing HFOnnx, models/onnx.py and models/models.py, then yes. I don't want to go down the path of adding a bunch of Optimum options to the pipelines. That should be a one-time conversion via a process similar to HFOnnx. Sounds like you're going to work on a lot of this regardless for your own work, but in terms of a PR, the type of code above is what would make sense for core txtai. |
Thanks! I'll change those files directly then. Though, in order to allow people to tweak the onnxruntime and ONNX execution provider parameters (optimization level, model save path, and plenty more), it would definitely be necessary to add parameters to the pipelines. Again, I suspect that the cleanest approach would be to add a single kwargs parameter to each pipeline which can then be processed in models/onnx.py, so I'll give that a shot first to keep things as clean and unchanged as possible. But it's ultimately a small detail to iron out later upon review. |
I don't think we're on the same page with the idea. I envision the HFOnnx pipeline importing the Optimum ForXYZ models and loading the transformers models there. Then any logic to optimize, quantize or whatever would happen, and the model would be saved to a directory. That would be what is loaded by the pipelines. I don't see a valid use case for anything to happen at runtime in the pipelines; they should just load the model created by Optimum. I would leave models/onnx.py for backwards compatibility and make a very small change to Models.load that checks if the model path is a directory containing *.onnx files. If that is the case, it would load it with the appropriate Optimum ForXYZ model. |
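A minimal sketch of that detection check (the function name and fallback are illustrative, not the actual Models.load implementation):

```python
import glob
import os

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoModelForSequenceClassification


def load_model(path):
    # Directory containing exported *.onnx files -> load with Optimum
    if os.path.isdir(path) and glob.glob(os.path.join(path, "*.onnx")):
        return ORTModelForSequenceClassification.from_pretrained(path)

    # Otherwise fall back to a standard transformers model
    return AutoModelForSequenceClassification.from_pretrained(path)
```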
Ok thanks. I'll do my best to meet what you're looking for. Any required changes should be relatively simple to implement when I submit the PR. |
Sounds good, I appreciate the effort in giving it a try and sharing your plan. |
My pleasure - it's the least I can do to give back something to this fantastic tool! Just to clarify, after thinking some more, I think it is now fully clear to me what you envision: HFOnnx handles converting, optimizing and quantizing a model with Optimum and saves the result to a directory, and Models.load then detects that directory of ONNX files and loads it with the appropriate Optimum ForXYZ class.
Is that correct? If so, that's completely fine with me. But my only outstanding question is with regard to the last step - normally Models.load loads onnx.py, but you'd like to leave onnx.py alone for backwards compatibility. I agree. So, where does the ORTForXYZ model get loaded? Another method within onnx.py? A separate OptimumModel class in models/optimum.py? |
I'm just coming back to this now and find it very confusing why you don't want any of this to be implemented at runtime. I really do think that implementing this through a standalone HFOnnx pipeline to pre-generate ONNX models is the wrong approach.
I feel pretty strongly that it's the implementation I would personally want to use, and surely others would as well. As such, I think I'm going to go ahead with building it this way. I'll submit it as a PR, which you'll be perfectly welcome to reject if you don't like it. But any collaboration would be much appreciated, so I hope you'll consider the above points and be willing to at least review the PR with an open mind! At that point, if you can show me why it really is a dealbreaker, then I'll be happy to modify it to suit your needs. (As mentioned above, I'd do it in as non-obtrusive a way as possible.) |
I wasn't proposing that. The current implementation does support saving ONNX files and reloading them from storage. Perhaps you should consider making this functionality a separate package to suit your specific needs. We might need to agree to disagree on this one. |
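For reference, a sketch of what that existing flow looks like, based on the HFOnnx example in the txtai documentation (the model name is an illustrative choice and the exact arguments should be double-checked against the docs):

```python
from txtai.pipeline import HFOnnx, Labels

# Model to export (illustrative choice)
path = "distilbert-base-uncased-finetuned-sst-2-english"

# Export the model to ONNX, quantize it and save it to disk
onnx = HFOnnx()
model = onnx(path, "text-classification", "model.onnx", True)

# Reload the exported model in a txtai pipeline
labels = Labels((model, path), dynamic=False)
labels("I am happy", ["negative", "positive"])
```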
Right, but if someone wants to test/have/use different versions of a model - combinations of optimization levels, quantization methods, etc. - then the path has to be different for each. That seems very cumbersome for someone to track and implement compared to just putting args in the pipeline call and letting the code find the right folder and file. Likewise, how does one use the other execution providers (along with the other dozen onnxruntime parameters, should they so choose)? As it stands or as proposed, they can't. Anyway, I'll disappointedly respect your decision - I'll close this issue and carry on with my own implementation. I'll share the code somewhere - be it in an immediately-closed PR or another repo - for you or someone else to consider incorporating into txtai. No hard feelings though - thanks again for everything. |
Going to re-open this to keep a placeholder to implement it the way I've proposed. I think you'll see it does most of what you want. While people may test a bunch of different options, there will be a final model in most cases. The framework will be able to detect if the model is ONNX or OpenVINO, which are different formats, not different execution providers within ONNX. I consider model optimization to be in the same family of operations as compiling a static program or training a model. While you toggle the options at development time, once finished, it's a single model or a set per platform architecture. |
Fair enough. Actually, OpenVINO is its own execution provider, of which there appear to be a couple dozen. To use it, you need to install a separate onnxruntime package, onnxruntime-openvino. Perhaps I'm misunderstanding things, but given that there's a parameter for specifying the provider, surely it needs to be used. But txtai doesn't have a mechanism to specify the provider - it only supports CPU and CUDA. There are another dozen parameters as well that people might want to tweak, but they're not currently available in txtai, along with the parameters for optimization and quantization. Perhaps you want to keep txtai as simple/streamlined as possible, but what I'm building shouldn't add any confusion/hindrance while still allowing people to experiment as much as they want. So, I'll still build what I'm envisioning, because I'm quite sure that it'll be far easier for prototyping/testing (in fact, I intend to add some mechanism to leverage HF Evaluate as well). I'll submit what I end up building to use as a starting point for you/me/others to build what you are looking for. Interestingly, I'm having strange results so far. I only used the Labels pipeline though. Basic ONNX is faster than transformers, but a quantized transformers |
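For reference, this is roughly how an execution provider is selected when creating an onnxruntime session directly (the model path is illustrative, and the OpenVINO provider is only available if onnxruntime-openvino is installed):

```python
import onnxruntime as ort

# Providers are tried in order; onnxruntime falls back to the next available one
session = ort.InferenceSession(
    "model.onnx",
    providers=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
)

# Shows which providers were actually registered for this session
print(session.get_providers())
```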
Keeping this issue open, still a good issue to consider. |
This issue is still on the radar. A new pipeline called HFOptimum will be added, along with logic to detect these models in the model loading code. |
I've been away from developing for 6+ months due to life, so I didn't get very far past the initial work mentioned above. I'm hoping to get back to it in the coming weeks and will be happy to provide feedback on this if needed. |
The HF documentation says that you can now export Seq2Seq models to ONNX with the OnnxSeq2SeqConfigWithPast class.
https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/onnx#onnx-configurations
This was added with this PR back in March: huggingface/transformers#14700
Perhaps it is now mature enough to be incorporated into txtai? It would be great to be able to use ONNX versions of the various HF models for their increased performance.
Additionally, it seems to support ViT models, along with other enhancements that have been made since then. Here's the history for that class https://github.com/huggingface/transformers/commits/main/src/transformers/onnx/config.py
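For illustration, a sketch of exporting a Seq2Seq model through the transformers.onnx utilities (the model name, output path and feature name are assumptions here, not requirements):

```python
from pathlib import Path

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers.onnx import FeaturesManager, export

model_id = "t5-small"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Resolve the ONNX config class for the seq2seq-lm feature
_, onnx_config_cls = FeaturesManager.check_supported_model_or_raise(model, feature="seq2seq-lm")
onnx_config = onnx_config_cls(model.config)

# Export the model with the default opset for this configuration
export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=onnx_config.default_onnx_opset,
    output=Path("t5-small.onnx"),
)
```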