initial support for concurrency #2
base: ea/stateful
Conversation
if self.compiled_model is None:
    super().compile()
Just call self.compile()
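A minimal sketch of the suggested simplification, assuming compile() itself returns early when self.compiled_model is already set (that early return is an assumption, not shown in the diff):

# Before: caller guards the parent implementation manually
if self.compiled_model is None:
    super().compile()

# After: rely on compile() to be a no-op when already compiled
self.compile()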
def forward(
    self,
    input_ids: torch.LongTensor,
    infer_request: openvino.runtime.InferRequest,
Is it truly backward compatible to add another positional argument to forward, which is exposed externally and could be used in some non-trivial examples that bypass the generate method? Would it be safer to add it to kwargs instead?
In my opinion, the safest way would be to pass it as a keyword-only argument. That is a more descriptive and clearer approach than kwargs.
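A hedged sketch of the keyword-only variant being discussed; the parameter names follow the diff, while the default value and the fallback to a model-level request are assumptions:

from typing import Optional

import torch
import openvino.runtime

def forward(
    self,
    input_ids: torch.LongTensor,
    *,  # arguments after this marker can only be passed by keyword
    infer_request: Optional[openvino.runtime.InferRequest] = None,
    **kwargs,
):
    # Fall back to the shared, model-level request when the caller does
    # not provide a per-call one, so existing call sites keep working.
    if infer_request is None:
        infer_request = self.request  # assumed model-level InferRequest
    ...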
tests/openvino/gen_threads.py (Outdated)
prompt1 = [" The weather is "]
x = threading.Thread(target=gen_thread, args=(prompt1,))
x.start()
prompt2 = [" Openvino is a "]
y = threading.Thread(target=gen_thread, args=(prompt2,))
y.start()
Use a more stressful test with beam search that would trigger a race on the beam_idx model object field update, or run multiple generates with different batch sizes.
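A hedged sketch of the kind of stress test being suggested: each thread runs generate with a different batch size and beam width, so concurrent updates to shared state such as beam_idx are more likely to collide. The model, tokenizer, and result-checking details are assumptions, not part of the test file above:

import threading

def gen_thread(prompts, num_beams, results, idx):
    # One generate() per thread, with differing batch size and beam width
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    out = model.generate(**inputs, num_beams=num_beams, max_new_tokens=32)
    results[idx] = tokenizer.batch_decode(out, skip_special_tokens=True)

results = [None, None]
threads = [
    threading.Thread(target=gen_thread,
                     args=([" The weather is "], 4, results, 0)),
    threading.Thread(target=gen_thread,
                     args=([" Openvino is a ", " Intel is a "], 2, results, 1)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Compare results against single-threaded reference runs to detect races.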
@@ -436,30 +446,29 @@ def forward(
    inputs["beam_idx"] = self.next_beam_idx
The most critical comment: self.next_beam_idx is a shared resource that is updated concurrently from multiple threads. It will lead to incorrect behavior when multiple generate calls run with different batch sizes, with beam search mode and different prompts, or with any kind of sampling. A separate version of next_beam_idx should be created for each generate invocation.
Indeed, it is failing with parallel generate calls that use different batch sizes. I'll try passing next_beam_idx in kwargs as well, along with infer_request.
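A hedged sketch of the per-invocation approach described in this thread; the keyword names mirror the discussion, while the fallbacks to the shared model-level fields (and the stateful flag) are assumptions for the legacy single-threaded path:

def forward(self, input_ids, *, infer_request=None, next_beam_idx=None, **kwargs):
    # Prefer the per-call state handed in by generate(); only fall back
    # to the shared model fields on the old single-request path.
    request = infer_request if infer_request is not None else self.request
    beam_idx = next_beam_idx if next_beam_idx is not None else self.next_beam_idx

    inputs = {"input_ids": input_ids}
    if self.stateful:  # assumed flag for stateful models
        inputs["beam_idx"] = beam_idx
    request.start_async(inputs, share_inputs=True)
    request.wait()
    ...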
Force-pushed from 50a3f22 to 70d086a
* Support AWQ models
* Add tests
* Add dependencies
* Fix tests
* enable awq export only if ov support it
* fix style (#2)
* disable awq and gptq install for old torch (#3)
* fix style
* disable autogptq and autoawq install for old transformers testing
* separate common quant models patching and gptq (#4)
* disable windows install (huggingface#5)
* separate common quant models patching and gptq
* disable awq windows
* skip logits check for quantized models (huggingface#6)
* fix test after rebase
* fix testing condition for 2024.6 and unpatch in case if failed
* Fix qwen2-vl tests (huggingface#1084)
* Skip private model loading test for external contributors (huggingface#1082)
* Fix reshaping unet if timestep is 0d tensor (huggingface#1083)
* Disable kv cache compression for fp vlm (huggingface#1080)
* add necessary packages in test_openvino_full
* fix code style after rebase (huggingface#7)

Co-authored-by: eaidova <[email protected]>
Co-authored-by: Nikita Savelyev <[email protected]>
Co-authored-by: Ella Charlaix <[email protected]>
Enable multi-concurrency in the execution.
Before submitting