
initial support for concurrency #2

Draft
wants to merge 14 commits into base: ea/stateful

Conversation

@dtrawins dtrawins commented Jan 8, 2024

enable multi concurrency in the execution?

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Comment on lines 338 to 339
if self.compiled_model is None:
    super().compile()
Collaborator

Just call self.compile()
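
A minimal sketch of the suggested simplification (assuming compile() already returns early once the model is compiled, so the explicit guard adds nothing):

    # instead of guarding explicitly:
    # if self.compiled_model is None:
    #     super().compile()
    # rely on compile() being a no-op for an already compiled model:
    self.compile()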

def forward(
    self,
    input_ids: torch.LongTensor,
    infer_request: openvino.runtime.InferRequest,
Collaborator

@slyalin slyalin Jan 8, 2024

Is it truly backward compatible to add another positional argument to forward that is exposed externally and can potentially be used in some not-so-trivial examples bypassing the generate method? Would it be safer to add it to kwargs instead?


In my opinion, the safest way would be to pass it as a keyword-only argument. It is a more descriptive and clearer approach than kwargs.
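
A minimal sketch of the keyword-only variant (everything except infer_request is illustrative, not the PR's actual signature):

    def forward(
        self,
        input_ids: torch.LongTensor,
        attention_mask: Optional[torch.LongTensor] = None,
        *,  # arguments after this marker can only be passed by keyword
        infer_request: Optional[openvino.runtime.InferRequest] = None,
        **kwargs,
    ):
        # fall back to the model's default request when the caller does not
        # supply one (hypothetical fallback, not taken from the PR)
        if infer_request is None:
            infer_request = self.request
        ...

Existing positional call sites such as model.forward(input_ids, attention_mask) keep working, while the new argument has to be spelled out explicitly wherever it is used.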

Comment on lines 32 to 37
prompt1 = [" The weather is "]
x = threading.Thread(target=gen_thread, args=(prompt1,))
x.start()
prompt2 = [" Openvino is a "]
y = threading.Thread(target=gen_thread, args=(prompt2,))
y.start()
Collaborator

Use a more stressful test with beam search that would trigger a race on the beam_idx model object field update, or run multiple generates with different batch sizes.
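
A sketch of such a stress test (model name, prompts, and generation settings are illustrative):

    import threading

    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    model_id = "hf-internal-testing/tiny-random-gpt2"  # illustrative model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    model = OVModelForCausalLM.from_pretrained(model_id, export=True)

    def gen_thread(prompts, num_beams):
        inputs = tokenizer(prompts, return_tensors="pt", padding=True)
        # beam search rewrites beam_idx on every step, so running it from
        # several threads with different batch sizes stresses shared state
        model.generate(**inputs, num_beams=num_beams, max_new_tokens=20)

    threads = [
        threading.Thread(target=gen_thread, args=([" The weather is "], 4)),
        threading.Thread(target=gen_thread, args=([" Openvino is a ", " Hello "], 2)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Comparing each thread's output against a single-threaded run of the same prompts would expose the race on beam_idx discussed in the next comment.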

@@ -436,30 +446,29 @@ def forward(
inputs["beam_idx"] = self.next_beam_idx
Collaborator

@slyalin slyalin Jan 8, 2024

The most critical comment: self.next_beam_idx is a shared resource that is updated concurrently from multiple threads. This will lead to incorrect behavior when multiple generates are called with different batch sizes, with beam search mode and different prompts, or with any kind of sampling. A separate version of next_beam_idx should be created for each generate invocation.

Author

Indeed, it fails with parallel generate calls using different batch sizes. I'll try passing next_beam_idx in kwargs as well, along with infer_request.
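
A sketch of the per-invocation approach under discussion (the kwargs plumbing and names are assumptions, not the PR's final code):

    # fragment inside forward(): read the beam index from call-local kwargs
    # instead of mutating self.next_beam_idx, which is shared across threads
    next_beam_idx = kwargs.get("next_beam_idx")
    if next_beam_idx is None:
        # default to identity reordering for the current batch (np is numpy)
        next_beam_idx = np.arange(batch_size, dtype=np.int32)
    inputs["beam_idx"] = next_beam_idx

Each generate invocation would then own its next_beam_idx (alongside its infer_request), so concurrent calls with different batch sizes or beam search settings no longer overwrite each other's state.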

@eaidova eaidova force-pushed the ea/stateful branch 2 times, most recently from 50a3f22 to 70d086a on January 8, 2024 13:40
eaidova added a commit that referenced this pull request Dec 18, 2024
eaidova added a commit that referenced this pull request Dec 20, 2024
eaidova added a commit that referenced this pull request Dec 23, 2024
* Support AWQ models

* Add tests

* Add dependencies

* Fix tests

* enable awq export only if ov support it

* fix style (#2)

* disable awq and gptq install for old torch (#3)

* fix style

* disable autogptq and autoawq install for old transformers testing

* separate common quant models patching and gptq (#4)

* disable windows install (huggingface#5)

* separate common quant models patching and gptq

* disable awq windows

* skip logits check for quantized models (huggingface#6)

* fix test after rebase

* fix testing condition for 2024.6 and unpatch in case if failed

* Fix qwen2-vl tests (huggingface#1084)

* Skip private model loading test for external contributors (huggingface#1082)

* Fix reshaping unet if timestep is 0d tensor (huggingface#1083)

* Disable kv cache compression for fp vlm (huggingface#1080)

* add necessary packages in test_openvino_full

* fix code style after rebase (huggingface#7)

---------

Co-authored-by: eaidova <[email protected]>
Co-authored-by: Nikita Savelyev <[email protected]>
Co-authored-by: Ella Charlaix <[email protected]>