Add a choice of how to end streaming from callback: STOP or CANCEL #1476
base: master
Conversation
TODO: add CANCEL for ContinuousBatching
Please add tests for the new functionality.
print(subword, end='', flush=True)
# The return flag indicates whether generation should be stopped.
# False means continue generation.
return False
BTW, should we also support a callback without a return value?
E.g. when the user doesn't care about stop / cancellation.
std::cout << word << std::flush;
// The return flag indicates whether generation should be stopped.
// false means continue generation.
return false;

return ov::genai::StreamerRunningStatus::RUNNING;
- return ov::genai::StreamerRunningStatus::RUNNING;
+ return ov::genai::StreamingStatus::CONTINUE;
@@ -30,6 +31,9 @@ struct EncodedGenerationResult {

// Status of generation
GenerationStatus m_status = GenerationStatus::RUNNING;

// Status of streaming
StreamerRunningStatus m_streaming_status = ov::genai::StreamerRunningStatus::UNDEF;
Maybe we can extend GenerationStatus? E.g. DROPPED_BY_HANDLE means STOP in its current implementation, while for CANCEL we can add a new value.
BTW, it looks like we can drop DROPPED_BY_PIPELINE as unused.
More thoughts:
- maybe we should deprecate the drop() method and introduce stop() instead
- similarly for GenerationStatus
- and extend both GenerationHandle and GenerationStatus with cancel() functionality

In this case the CB and LLM pipeline logic / naming will be aligned.
@@ -15,7 +15,7 @@
RawPerfMetrics,
PerfMetrics,
StreamerBase,
get_version,
can we keep it?
@@ -11,12 +11,17 @@ namespace genai {

class TextCallbackStreamer: public StreamerBase {
public:
    StreamerRunningStatus streaming_status = StreamerRunningStatus::UNDEF;
As far as I can see, StreamerBase already contains this field?
CANCEL = 3 // Stop generation, drop the last prompt and all generated tokens from history; the KV cache keeps the history up to, but not including, the last step
};

using CallbackTypeVariant = std::variant<bool, StreamerRunningStatus>;
- using CallbackTypeVariant = std::variant<bool, StreamerRunningStatus>;
+ using CallbackTypeVariant = std::variant<std::monostate, bool, StreamerRunningStatus>;

to support a callback which just "prints" (note: std::variant cannot hold void, so std::monostate stands in for "no return value").
@@ -22,6 +34,10 @@ class OPENVINO_GENAI_EXPORTS StreamerBase {
/// @brief end is called at the end of generation. It can be used to flush the cache if your own streamer has one
virtual void end() = 0;

virtual StreamerRunningStatus get_finish_streaming_reason() {
- virtual StreamerRunningStatus get_finish_streaming_reason() {
+ StreamingStatus get_streaming_status() {
...
@typing.overload
- def generate(self, prompts: list[str], generation_config: list[GenerationConfig], streamer: typing.Callable[[str], bool] | StreamerBase | None = None) -> list[GenerationResult]:
+ def generate(self, prompts: list[str], generation_config: list[GenerationConfig], streamer: typing.Callable[[str], bool] | typing.Callable[[str], ...] | StreamerBase | None = None) -> list[GenerationResult]:
Should we propagate StreamingStatus to the Python API, to use the enum instead of str?
/**
 * @brief Base class for streamers. In order to use, inherit from this class and implement the put() and end() methods.
 *
 * @param m_tokenizer tokenizer
 */
class OPENVINO_GENAI_EXPORTS StreamerBase {
protected:
    StreamerRunningStatus streaming_finish_status = StreamerRunningStatus::UNDEF;
- StreamerRunningStatus streaming_finish_status = StreamerRunningStatus::UNDEF;
+ StreamerRunningStatus m_streaming_finish_status = StreamerRunningStatus::UNDEF;
@@ -187,6 +187,9 @@ class ov::genai::VLMPipeline::VLMPipelineImpl {
SequenceGroup::Ptr sequence_group = std::make_shared<SequenceGroup>(request_id, prompt_ids, generation_config, block_size);
requests.push_back(sequence_group);

OPENVINO_ASSERT((!m_is_chat_conversation || !std::get_if<std::function<StreamerRunningStatus(std::string)>>(&streamer)),
    "For chat mode, please use a Streamer derived from the StreamerBase class, or a callback with a bool return value.");
Why is a StreamerRunningStatus return value prohibited for chat? You wrap it with TextCallbackStreamer below.
namespace ov {
namespace genai {

enum class StreamerRunningStatus {
    UNDEF = 0,   // Streaming has not been run
    RUNNING = 1, // Continue running inference
RUNNING and UNDEF seem to be equivalent. In that case you should keep only one of them. Moreover, the callback should never return UNDEF, so merging them fixes the API.
No description provided.