Add OpenWeb UI instructions #6
This is an excellent set of resources! We don't need IPEX-LLM to make this work; it's out of scope for now. However, supporting all things Intel is a long-term goal for the project, and I am very grateful you are contributing. I will add some docs later.

OpenVINO natively supports NPU, albeit with oneAPI toolkit dependencies. So this issue requires a lot of test code to evaluate the different high-level performance hints exposed by OpenVINO GenAI for each pipeline class, and then to decide how to build an API that makes set-and-forget style optimization testing possible.

Also, I drafted an implementation of a proxy specifically for OpenWebUI that I think will be pretty close to what's used for LM Studio, so we don't need to mess around with anything lower level than that. Leveraging other OpenWebUI features is TBD. In this way, adding an OpenAI proxy is more about integrating community tooling than changing OpenArc to use whatever tools others have discovered work when hacked together.

From a high level I want OpenArc to be self-contained, to reflect the design intention of realizing performance gains that simply aren't possible with llama.cpp and IPEX, mostly due to how the OpenVINO runtime manages memory. I'm still building my understanding of how these work. Some strategies include baked-in handling of NUMA architectures and experimental datatypes which offer better accuracy/speed tradeoffs for production deployments across the different generations of hardware most PCs have. That's why you don't see anything lower than INT4, which isn't even the lowest: support was recently merged for two FP8 datatypes, as well as MXFP4 and NF4, which are only really in the literature and benefit the latest chips like yours and the recently renamed (ugh) Xeon.

The utilities notebook contains the device query snippets; they provide some information on what's appropriate to choose for datatypes and the performance hints. Without setting arguments, the runtime automatically selects appropriate settings based on (I think) what's stored in the model XML, at least for decoder-only models, compared against hardware 'facts'.

Another example: there are hints that let you allocate between performance and efficiency cores, and it works from Python. It might even work heterogeneously across CPU, GPU, and NPU; I'm not sure yet. It works without pinning or assigning cores through Docker Compose, just as parameters exposed in ov_config. It gets tricky here though: if a property isn't exposed in the stub file which defines the Python API layer over C++, then they don't recommend using it, as the OpenVINO code which instantiates Core isn't tooled for generative AI. It will work, but performance might be terrible? Not sure.

Basically, Ollama integrations leave performance on the table, and THAT's what's out of scope. When I was a physics major it was a running joke that engineers would plug and chug.

To make some of this concrete, here are a few rough sketches (not final APIs):
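For the OpenWebUI piece, the proxy I drafted is basically this shape: a tiny OpenAI-compatible server that OpenWebUI can point at, which forwards requests to whatever OpenArc exposes internally. The OpenArc-side URL and payload below are placeholders, not the real endpoint:

```python
# Minimal OpenAI-style proxy sketch for OpenWebUI. The OPENARC_URL endpoint and
# its request/response shape are hypothetical placeholders.
import httpx
from fastapi import FastAPI, Request

OPENARC_URL = "http://localhost:8000/generate"  # placeholder, not the real OpenArc route

app = FastAPI()

@app.get("/v1/models")
async def list_models():
    # OpenWebUI calls this to populate its model picker.
    return {"object": "list", "data": [{"id": "openarc-local", "object": "model", "owned_by": "openarc"}]}

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    # Flatten the chat history into a single prompt; the real proxy would apply
    # the model's chat template instead of this naive join.
    prompt = "\n".join(f"{m['role']}: {m['content']}" for m in body.get("messages", []))
    async with httpx.AsyncClient() as client:
        resp = await client.post(OPENARC_URL, json={"prompt": prompt}, timeout=300)
    text = resp.json().get("text", "")
    return {
        "id": "chatcmpl-openarc",
        "object": "chat.completion",
        "model": body.get("model", "openarc-local"),
        "choices": [{"index": 0, "message": {"role": "assistant", "content": text}, "finish_reason": "stop"}],
    }
```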
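On the datatypes: weight-only compression to the newer formats goes through NNCF. This is just the NF4 case as an illustration (the model path is a placeholder, and which modes are available depends on your NNCF/OpenVINO versions):

```python
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("model_fp16.xml")  # placeholder path to an exported OpenVINO IR

# Weight-only compression to NF4; INT4 and the newer FP8-style modes follow the
# same call where the installed NNCF build supports them.
compressed = nncf.compress_weights(model, mode=nncf.CompressWeightsMode.NF4)
ov.save_model(compressed, "model_nf4.xml")
```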
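The device query I mean in the utilities notebook is roughly this; it just asks the runtime which devices it can see and what they call themselves:

```python
import openvino as ov

core = ov.Core()
# List every device the runtime can see (CPU, GPU, NPU, ...) and its full name.
for device in core.available_devices:
    full_name = core.get_property(device, "FULL_DEVICE_NAME")
    print(f"{device}: {full_name}")
```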
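And a sketch of the ov_config idea for the performance hints and P-core/E-core allocation. The property names are the plain OpenVINO ones, the accepted value strings may differ between runtime versions, and I'm showing it against the base runtime API rather than the GenAI pipeline for clarity, so treat it as illustrative rather than OpenArc's final API:

```python
import openvino as ov

core = ov.Core()

# Candidate ov_config: high-level performance hint plus hybrid-core scheduling.
# Exact value strings may vary by OpenVINO release; verify against your install.
ov_config = {
    "PERFORMANCE_HINT": "LATENCY",          # or "THROUGHPUT"
    "SCHEDULING_CORE_TYPE": "PCORE_ONLY",   # keep inference on performance cores
}

# The same dict would be handed to the GenAI pipeline as ov_config; here it is
# applied directly at compile time.
compiled = core.compile_model("model.xml", "CPU", ov_config)
```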
You might find this repo interesting: https://github.com/xanderlent/intel-npu-driver-rpm
By the way, feel free to create a shared Discord chat for this project; it's much better than using random GitHub issues.