
Whisper pipeline: use parallel streamer #1642

Draft: wants to merge 21 commits into master
Conversation

@as-suvorov (Contributor) commented Jan 29, 2025:

  • Add bool put(const std::vector<int64_t>& tokens) to StreamerBase.
  • Add a StreamerBase constructor to the Whisper pipeline; deprecate the ChunkStreamerBase constructor.
  • Add parallel streaming for Whisper using async/wait.
  • Add ThreadedStreamerWrapper and use it in the continuous batching (CB) pipelines.

Ticket: 160606
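For illustration, the core idea behind ThreadedStreamerWrapper could look roughly like the sketch below. This is an assumption-laden outline, not the PR's actual implementation: it reuses only the member names quoted later in this conversation (worker thread, mutex, condition variable, token queue), assumes the streamer_base.hpp header path, and ignores the bool stop flag that put() returns.

#include <condition_variable>
#include <cstdint>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

#include "openvino/genai/streamer_base.hpp"  // header path assumed

// Sketch: the generation thread queues tokens while a worker thread drains
// the queue and calls the wrapped streamer, so a slow callback does not
// block decoding.
class ThreadedStreamerSketch {
public:
    explicit ThreadedStreamerSketch(std::shared_ptr<ov::genai::StreamerBase> streamer)
        : m_streamer{std::move(streamer)} {
        // Start the worker only after all members are constructed.
        m_worker_thread = std::make_shared<std::thread>([this] { worker(); });
    }

    void put(int64_t token) {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_queue.push(token);
        }
        m_cv.notify_one();
    }

    void end() {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_stopped = true;  // graceful shutdown, no dummy token needed
        }
        m_cv.notify_one();
        if (m_worker_thread->joinable())
            m_worker_thread->join();
        m_streamer->end();
    }

private:
    void worker() {
        while (true) {
            std::unique_lock<std::mutex> lock{m_mutex};
            m_cv.wait(lock, [this] { return !m_queue.empty() || m_stopped; });
            if (m_queue.empty())
                return;  // stopped and fully drained
            int64_t token = m_queue.front();
            m_queue.pop();
            lock.unlock();
            m_streamer->put(token);  // stop flag ignored in this sketch
        }
    }

    std::shared_ptr<ov::genai::StreamerBase> m_streamer;
    std::shared_ptr<std::thread> m_worker_thread;
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<int64_t> m_queue;
    bool m_stopped = false;
};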

std::shared_ptr<std::thread> m_worker_thread = nullptr;
std::mutex m_mutex;
std::condition_variable m_cv;
std::queue<std::variant<int64_t, std::vector<int64_t>>> m_queue;
Contributor:

Why not SynchronizedQueue from openvino.genai/src/cpp/src/synchronized_queue.hpp?

Contributor Author:

There was a deadlock on squeue.pull. It happens on the streamer.end call:

  1. The streamer thread waits for a new token.
  2. The main thread calls streamer.end, so the streamer thread should be stopped, but there is no API to gracefully unblock the squeue.

One way to unblock the squeue is to push a dummy token and handle an is_stopped flag. I thought that might not be very clean.

I guess the trick with a dummy token and the sync queue could simplify the implementation. Do you want me to implement that?
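For illustration only, the dummy-token approach described above might look roughly like this fragment (member names are assumptions; SynchronizedQueue's push/pull follow the usage in this thread; tokens still queued when end() is called would be dropped by this naive version):

// end() pushes a sentinel so the blocking pull() returns, and the worker
// thread checks the stop flag before streaming.
void end() {
    m_stopped.store(true);
    m_squeue.push(0);  // dummy token, never streamed
    if (m_worker_thread->joinable())
        m_worker_thread->join();
}

void worker() {
    while (true) {
        int64_t token = m_squeue.pull();  // blocks until a push()
        if (m_stopped.load())
            return;  // sentinel consumed, exit gracefully
        m_streamer->put(token);
    }
}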

Contributor:

Looks like we handled it via:

push_empty_outputs();

Contributor Author:

Reimplemented with SynchronizedQueue

@as-suvorov as-suvorov requested a review from Wovchena February 4, 2025 13:27
@as-suvorov as-suvorov marked this pull request as ready for review February 4, 2025 13:27
@@ -18,6 +18,7 @@ class OPENVINO_GENAI_EXPORTS StreamerBase {
/// @brief put is called every time new token is decoded,
/// @return bool flag to indicate whether generation should be stopped, if return true generation stops
virtual bool put(int64_t token) = 0;
virtual bool put(const std::vector<int64_t>& tokens) = 0;
Collaborator:

That adds a requirement for child classes to override one more method. I think it must have a default implementation.
IterableStreamer only overrides the single-token version:

class IterableStreamer(openvino_genai.StreamerBase):

While this passes because the bindings handle it, it seems correct for Python to follow the C++ API and override both versions if bool put(const std::vector<int64_t>& tokens) remains pure virtual (= 0) by default.

Contributor:

A default implementation will be suboptimal, as it does not populate m_tokens_cache, which is a field of the derived class.

Do we need to have m_tokens_cache in the interface as well? It would tell users "something" about a possible implementation.

Collaborator:

> A default implementation will be suboptimal, as it does not populate m_tokens_cache, which is a field of the derived class.

A NotImplementedException is enough.

> Do we need to have m_tokens_cache in the interface as well? It would tell users "something" about a possible implementation.

No, a child class would end up with a field which may not be used. end() already suggests that caching is possible.

@ilya-lavrenov (Contributor) commented Feb 5, 2025:

> A NotImplementedException is enough.

In that case such a streamer cannot work with speculative decoding, prompt lookup, or even stop strings.

It's not backward compatible either, as in the current PR we assume that the put(vector) method is available.

Collaborator:

I see. Then the need to override IterableStreamer.put() is even more important.

Contributor:

So is the final decision to add a default sub-optimal implementation of put(vector), while our streamers override this method with an optimal implementation?

Collaborator:

Yes. Maybe with a warning print.
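A default implementation along those lines might look like the sketch below (the warning is left as a comment since the project's logging facility is not shown in this thread):

// Possible default: sub-optimal per-token fallback, as agreed above.
virtual bool put(const std::vector<int64_t>& tokens) {
    // A warning print could go here to nudge implementers to override this.
    for (int64_t token : tokens) {
        if (put(token)) {
            return true;  // the callback requested to stop generation
        }
    }
    return false;
}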

@@ -18,6 +18,7 @@ class OPENVINO_GENAI_EXPORTS StreamerBase {
/// @brief put is called every time new token is decoded,
/// @return bool flag to indicate whether generation should be stopped, if return true generation stops
virtual bool put(int64_t token) = 0;
virtual bool put(const std::vector<int64_t>& tokens) = 0;
Collaborator:

Explain in doc-strings what the new overload is for.

Contributor Author:

Doc-string added

@@ -3,21 +3,35 @@

#pragma once

#include <condition_variable>
#include <queue>
#include <thread>
Collaborator:

These headers seem to be redundant

Contributor Author:

removed

if (auto _token = std::get_if<int64_t>(&token)) {
return self.put(*_token);
} else {
auto tokens = std::get_if<std::vector<int64_t>>(&token);
Collaborator:

Suggested change:
-    auto tokens = std::get_if<std::vector<int64_t>>(&token);
+    auto tokens = std::get<std::vector<int64_t>>(&token);

Contributor Author:

applied
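Worth noting: std::get takes the variant by reference while std::get_if takes a pointer, so the applied version presumably dereferences the variant itself, e.g.:

// std::get returns a reference and throws std::bad_variant_access on a
// type mismatch, so no pointer/null check is needed:
const auto& tokens = std::get<std::vector<int64_t>>(token);
return self.put(tokens);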

}

std::lock_guard<std::mutex> lock(m_mutex);
return m_dropped;
Collaborator:

m_dropped lives its own life; it could have a separate lock. Pushing the idea further, you could simply wrap m_dropped in an atomic so no lock is needed at all.

Contributor Author:

Switched to atomic
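For reference, with an atomic flag the locked read above reduces to something like this fragment (not the PR's exact code):

// declared as: std::atomic<bool> m_dropped{false};  // requires <atomic>
return m_dropped.load();  // no mutex needed for this flag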

TextCallbackStreamer(const Tokenizer& tokenizer, std::function<bool(std::string)> callback);

std::function<bool(std::string)> on_finalized_subword_callback = [](std::string words)->bool { return false; };
bool put(int64_t token) override;
Contributor:

Should it override put(vector) to have an optimal implementation?

Contributor Author:

TextCallbackStreamer cannot handle put(std::vector<int64_t>) properly at the moment: inserting a chunk of tokens into the streamer's cache doesn't look valid to me because of the n_delay_tokens feature: https://github.com/openvinotoolkit/openvino.genai/blob/master/src/cpp/src/text_callback_streamer.cpp#L42.
I propose to fix this in follow-up PRs.

const WhisperGenerationConfig& generation_config,
const std::shared_ptr<ChunkStreamerBase>& streamer,
const py::kwargs& kwargs) -> py::typing::Union<ov::genai::WhisperDecodedResults> {
StreamerVariant _streamer = std::make_shared<ov::genai::ChunkToBaseStreamerAdapter>(streamer);
Contributor:

Maybe we can add a deprecation warning using PyErr_WarnEx(PyExc_DeprecationWarning, ...)?

Contributor Author:

Yes, will do

@as-suvorov (Contributor Author) commented Feb 7, 2025:

@ilya-lavrenov @Wovchena @sbalandi It seems that adding the put(const std::vector<int64_t>& tokens) overload will introduce a breaking change for the Python API. That's because method overloading is not possible in Python, as I understand it: a later method definition with the same name overwrites the previous one. With such a change, the Python API would have to use a union argument type, def put(token: int | list[int]) -> bool (#1642 (comment)), which is a breaking change.
It looks like @sbalandi's PR #1476 should merge first, and then we can introduce def write(token: int | list[int]) with no breaking changes.

I want to split this PR into:

  1. Use Whisper parallel streaming with async/wait.
  2. Add ThreadedStreamer for CB pipelines (with all the small fixes from this PR).
  3. Add a write(std::vector<int64_t>& tokens) method to StreamerBase (based on "Add a choice of how to end streaming from callback: STOP or CANCEL" #1476).

@as-suvorov as-suvorov marked this pull request as draft February 7, 2025 13:26
@Wovchena (Collaborator) left a comment:

This is what I noticed before the decision to split the PR reached me.

assert streamer_instance.text == result.texts[0]

config = genai_pipe.get_generation_config()
config.return_timestamps = True
Collaborator:

Parametrize the test with different instances of streamer. return_timestamps is also a good candidate to be parametrized, but I guess the intention was to also test different ways of setting it.

A function can be defined like:

def foo(val):
    foo.stored_val = [] if not hasattr(foo, 'stored_val') else foo.stored_val
    foo.stored_val.append(val)
    print(foo.stored_val)


ON_CALL(*this, end()).WillByDefault([this]() {
if (should_sleep) {
std::this_thread::sleep_for(m_sleep_for);
Collaborator:

Should end() really sleep? heavy_callback_test doesn't check anything after end().

Is it possible to parametrize MockStreamerBase's constructor with a lambda taking const std::vector<int64_t>& tokens? In that case you could define put(int64_t token) as return put({token});. Every test would provide its own lambda, which sleeps or not, drops or not. This would let you reduce the number of members, and the behavior description would be local to every test.

Should the tests also verify that no value is missed? A lambda would help with that as well.
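A lambda-parametrized test streamer in the spirit of this suggestion could look roughly like this sketch (StreamerBase's interface is assumed from the diff above; the class name is hypothetical):

#include <cstdint>
#include <functional>
#include <vector>

// Each test supplies its own behavior (sleep, drop, record) via the lambda.
struct LambdaStreamer : public ov::genai::StreamerBase {
    explicit LambdaStreamer(std::function<bool(const std::vector<int64_t>&)> on_tokens)
        : m_on_tokens{std::move(on_tokens)} {}

    bool put(int64_t token) override {
        return put(std::vector<int64_t>{token});  // route through the vector overload
    }

    bool put(const std::vector<int64_t>& tokens) override {
        return m_on_tokens(tokens);
    }

    void end() override {}

    std::function<bool(const std::vector<int64_t>&)> m_on_tokens;
};

// Example: a test that records every token to verify none is missed.
std::vector<int64_t> seen;
LambdaStreamer streamer{[&seen](const std::vector<int64_t>& tokens) {
    seen.insert(seen.end(), tokens.begin(), tokens.end());
    return false;  // do not stop generation
}};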
