
Add a choice of how to end streaming from callback: STOP or CANCEL #1476

Open
sbalandi wants to merge 5 commits into master from callback

Conversation

sbalandi (Contributor) commented Jan 3, 2025

No description provided.
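
The title and the diff snippets quoted in the review below imply the feature: a streaming callback may now return a status (STOP or CANCEL) instead of a bool. A minimal usage sketch under that assumption (the model directory and the stop condition are hypothetical):

#include <iostream>
#include "openvino/genai/llm_pipeline.hpp"

int main() {
    ov::genai::LLMPipeline pipe("model_dir", "CPU");  // hypothetical model directory
    auto callback = [](std::string word) {
        std::cout << word << std::flush;
        bool user_interrupted = false;  // hypothetical stop condition
        // CANCEL: stop and drop the last prompt and generated tokens from history.
        // STOP: stop but keep them. RUNNING: continue generation.
        return user_interrupted ? ov::genai::StreamerRunningStatus::CANCEL
                                : ov::genai::StreamerRunningStatus::RUNNING;
    };
    pipe.generate("Why is the sky blue?", pipe.get_generation_config(), callback);
}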

github-actions bot added labels category: visual language (Visual language pipeline), category: continuous batching (Continuous batching), category: LLM (LLM pipeline: stateful, static), category: speculative decoding (Speculative decoding), category: GenAI C++ API (Changes in GenAI C++ public headers), no-match-files, and category: prompt lookup on Jan 3, 2025
sbalandi (Contributor, Author) commented Jan 3, 2025

TODO: add CANCEL for ContinuousBatching

ilya-lavrenov added this to the 2025.0 milestone on Jan 4, 2025
ilya-lavrenov self-assigned this on Jan 6, 2025
sbalandi force-pushed the callback branch 5 times, most recently from 454cdd9 to 1592ed0, on January 8, 2025 19:38
github-actions bot added labels category: Python API (Python API for GenAI) and category: samples (GenAI samples) on Jan 8, 2025
sbalandi force-pushed the callback branch 3 times, most recently from 10a755b to d18fe16, on January 8, 2025 22:19
sbalandi (Contributor, Author) commented Jan 8, 2025

> TODO: add CANCEL for ContinuousBatching

done

sbalandi marked this pull request as ready for review on January 8, 2025 22:43
sbalandi force-pushed the callback branch 3 times, most recently from 2758f6b to 03ca3ce, on January 9, 2025 21:56
ilya-lavrenov (Contributor) left a comment


Please add tests for the new functionality.

print(subword, end='', flush=True)
# The returned flag indicates whether generation should be stopped.
# False means continue generation.
return False
Contributor:

BTW, should we also support a callback w/o a return value?

E.g. when the user doesn't care about any stop / cancellation:

std::cout << word << std::flush;
// The returned flag indicates whether generation should be stopped.
// false means continue generation.
return false;
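
If the no-return-value option were adopted, usage might look like this sketch (hypothetical; it assumes the streamer variant accepts a void-returning callable):

auto print_only = [](std::string word) {
    std::cout << word << std::flush;
    // No return value: generation always runs to EOS / max_new_tokens.
};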

return ov::genai::StreamerRunningStatus::RUNNING;
Contributor:

Suggested change:
- return ov::genai::StreamerRunningStatus::RUNNING;
+ return ov::genai::StreamingStatus::CONTINUE;

@@ -30,6 +31,9 @@ struct EncodedGenerationResult {

// Status of generation
GenerationStatus m_status = GenerationStatus::RUNNING;

// Status of streaming
StreamerRunningStatus m_streaming_status = ov::genai::StreamerRunningStatus::UNDEF;
Contributor:

Maybe we can extend GenerationStatus? E.g. DROPPED_BY_HANDLE means STOP in its current implementation, while for CANCEL we can add a new value.

BTW, it looks like we can drop DROPPED_BY_PIPELINE as unused.
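
A sketch of what the extended enum could look like under this suggestion (value names and numbering beyond DROPPED_BY_HANDLE / DROPPED_BY_PIPELINE are assumptions):

enum class GenerationStatus {
    RUNNING = 0,   // generation in progress
    FINISHED = 1,  // finished normally (EOS or length limit)
    IGNORED = 2,   // request was not processed
    STOP = 3,      // replaces DROPPED_BY_HANDLE: callback asked to stop, results are kept
    CANCEL = 4     // new value: callback asked to cancel, last prompt and answer are dropped
    // DROPPED_BY_PIPELINE removed as unused, per the comment above
};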

Contributor:

More thoughts:

  • maybe we should deprecate the drop() method and introduce stop() instead
  • similarly for GenerationStatus
  • and extend both GenerationHandle and GenerationStatus with cancel() functionality

In this case the CB and LLM pipelines' logic / naming will be aligned.
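
A sketch of the proposed handle API under these points (signatures and the deprecation macro usage are assumptions):

class GenerationHandleImpl {
public:
    OPENVINO_DEPRECATED("Use stop() instead of drop()")
    void drop() { stop(); }

    void stop();    // end generation, keep generated tokens in history / KV cache
    void cancel();  // end generation, drop the last prompt and generated tokens from history
};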

@@ -15,7 +15,7 @@
RawPerfMetrics,
PerfMetrics,
StreamerBase,
get_version,
Contributor:

Can we keep it?

@@ -11,12 +11,17 @@ namespace genai {

class TextCallbackStreamer: public StreamerBase {
public:
StreamerRunningStatus streaming_status = StreamerRunningStatus::UNDEF;
Contributor:

As far as I can see, StreamerBase already contains this field?

CANCEL = 3 // Stop generation; drop the last prompt and all generated tokens from history. The KV cache keeps the history except for the last step.
};

using CallbackTypeVariant = std::variant<bool, StreamerRunningStatus>;
Contributor:

Suggested change:
- using CallbackTypeVariant = std::variant<bool, StreamerRunningStatus>;
+ using CallbackTypeVariant = std::variant<void, bool, StreamerRunningStatus>;

to support a callback which just "prints"
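
One wrinkle with the suggestion as written: std::variant cannot hold void (such a variant is ill-formed), so a compilable version of this idea would typically use std::monostate as the "returns nothing" alternative:

#include <variant>

// std::monostate stands in for a callback that returns nothing:
using CallbackTypeVariant = std::variant<std::monostate, bool, StreamerRunningStatus>;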

@@ -22,6 +34,10 @@ class OPENVINO_GENAI_EXPORTS StreamerBase {
/// @brief end is called at the end of generation. It can be used to flush cache if your own streamer has one
virtual void end() = 0;

virtual StreamerRunningStatus get_finish_streaming_reason() {
Contributor:

Suggested change:
- virtual StreamerRunningStatus get_finish_streaming_reason() {
+ StreamingStatus get_streaming_status() {

...
@typing.overload
-def generate(self, prompts: list[str], generation_config: list[GenerationConfig], streamer: typing.Callable[[str], bool] | StreamerBase | None = None) -> list[GenerationResult]:
+def generate(self, prompts: list[str], generation_config: list[GenerationConfig], streamer: typing.Callable[[str], bool] | typing.Callable[[str], ...] | StreamerBase | None = None) -> list[GenerationResult]:
Contributor:

Should we propagate StreamingStatus to the Python API, to use the enum instead of str?

andrei-kochin changed the milestone from 2025.0 to 2025.1 on Jan 13, 2025
/**
* @brief Base class for streamers. In order to use, inherit from this class and implement the put() and end() methods.
*
* @param m_tokenizer tokenizer
*/
class OPENVINO_GENAI_EXPORTS StreamerBase {
protected:
StreamerRunningStatus streaming_finish_status = StreamerRunningStatus::UNDEF;
Collaborator:

Suggested change:
- StreamerRunningStatus streaming_finish_status = StreamerRunningStatus::UNDEF;
+ StreamerRunningStatus m_streaming_finish_status = StreamerRunningStatus::UNDEF;

@@ -187,6 +187,9 @@ class ov::genai::VLMPipeline::VLMPipelineImpl {
SequenceGroup::Ptr sequence_group = std::make_shared<SequenceGroup>(request_id, prompt_ids, generation_config, block_size);
requests.push_back(sequence_group);

OPENVINO_ASSERT((!m_is_chat_conversation || !std::get_if<std::function<StreamerRunningStatus(std::string)>>(&streamer)),
"For chat mode, please, use Steamer as StreamerBase class or as callback with a bool return value.");
Collaborator:

Why is a StreamerRunningStatus return value prohibited for chat? You wrap it with TextCallbackStreamer below.


namespace ov {
namespace genai {

enum class StreamerRunningStatus {
UNDEF = 0, // Streaming has not been run
RUNNING = 1, // Continue running inference
Collaborator:

RUNNING and UNDEF seem to be equivalent. In that case you should keep only one of them. Moreover, a callback should never return UNDEF, so merging them fixes the API.
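
A sketch of the enum with UNDEF and RUNNING merged, per this comment (the StreamingStatus name comes from an earlier suggestion in this review; numbering is an assumption):

enum class StreamingStatus {
    RUNNING = 0,  // default state; also what a callback returns to continue generation
    STOP = 1,     // end generation, keep results in history / KV cache
    CANCEL = 2    // end generation, drop the last prompt and generated tokens from history
};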

sbalandi force-pushed the callback branch 8 times, most recently from f739e2a to 04bcc7f, on January 17, 2025 14:48
4 participants