
tokenizer: read simplified_chat_template #1712

Open · Wovchena wants to merge 4 commits into master
Conversation

Wovchena (Collaborator) commented:

Depends on huggingface/optimum-intel#1151

Close #1663

Ticket 161313

@github-actions bot added the category: tokenizers, category: Python API, category: GenAI C++ API, and no-match-files labels Feb 11, 2025
@ilya-lavrenov self-assigned this Feb 11, 2025
@github-actions bot added the category: continuous batching label Feb 11, 2025
@@ -195,11 +250,10 @@ class Tokenizer::TokenizerImpl {
read_special_tokens_map(models_path);
// Try to read tokenizer_config if some token ids or token str are not defined.
read_tokenizer_config_if_necessary(models_path);
parse_if_exists(models_path / "tokenizer_config.json", m_chat_template);
parse_if_exists(models_path / "processor_config.json", m_chat_template);
parse_if_exists(models_path / "chat_template.json", m_chat_template);
@ilya-lavrenov (Contributor) commented on Feb 11, 2025:
It's not easy to follow the order in which special tokens and chat templates are set.

Maybe we can have 3 separate functions:

  • setup_tokenizers (models)
  • setup_special_tokens
  • setup_chat_template

In this case the code will not look like spaghetti (IMO), and each action is done by a dedicated function, while now the logic and its order are spread across several functions. A rough sketch of that split is below.
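A minimal sketch of what that split could look like as TokenizerImpl members. The helpers (read_config, read_special_tokens_map, read_tokenizer_config_if_necessary, parse_if_exists) and m_chat_template are taken from the code under review; the grouping itself is the proposal, not existing code:

// Sketch: three dedicated setup steps instead of one interleaved sequence.
void setup_tokenizers(const std::pair<std::shared_ptr<ov::Model>, std::shared_ptr<ov::Model>>& models) {
    // Compile the tokenizer/detokenizer models and fill the infer-request queues.
}

void setup_special_tokens(const std::filesystem::path& models_path) {
    read_config(models_path);
    read_special_tokens_map(models_path);
    // Try to read tokenizer_config if some token ids or token strings are not defined.
    read_tokenizer_config_if_necessary(models_path);
}

void setup_chat_template(const std::filesystem::path& models_path) {
    parse_if_exists(models_path / "tokenizer_config.json", m_chat_template);
    parse_if_exists(models_path / "processor_config.json", m_chat_template);
    parse_if_exists(models_path / "chat_template.json", m_chat_template);
}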

@Wovchena (Collaborator, Author) replied:

I had a similar impression. But the code is already grouped by files, which forces the priority to be the same for every special token.

  1. setup_tokenizer() from the file system
    1. read_config()
    2. read_special_tokens_map()
    3. read_tokenizer_config_if_necessary()
    4. processor_config.json
    5. chat_template.json
  2. setup_tokenizer() from rt_info

The impression of spaghettiness comes from read_tokenizer_config_if_necessary(), which refuses to read its file at all if every special token is already initialized, but overrides all tokens if only some of them aren't (sketched below). That's why parse_if_exists(models_path / "tokenizer_config.json", m_chat_template); is outside of read_tokenizer_config_if_necessary().
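Roughly, the behavior in question (a simplified sketch; the member names m_pad_token_id, m_bos_token_id, m_eos_token_id are illustrative, and the real function also resolves token strings):

void read_tokenizer_config_if_necessary(const std::filesystem::path& models_path) {
    // Skip the file entirely when every special token is already initialized...
    if (m_pad_token_id != -1 && m_bos_token_id != -1 && m_eos_token_id != -1) {
        return;
    }
    // ...but if even one is missing, parse tokenizer_config.json and
    // override all of the tokens, not just the missing ones.
}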

I believe the current proposal is the most readable thing possible given the current architecture.

As for refactoring, there are indeed ways to improve tokenizer.cpp. The problem is that the exact implementation is a matter of taste, so I avoided stepping into that. One possible way is the following. Its benefits are that it enforces that every member access starts with m, and it preserves a -1 token id if one is explicitly written in one of the files, although it must still be verified that this is how transformers behaves.

struct Members {
    std::optional<CircularBufferQueue<ov::InferRequest>> tokenizers;
    std::optional<CircularBufferQueue<ov::InferRequest>> detokenizers;
    int64_t eos_token_id;
    // ...
};

struct OptionalMembers {
    std::optional<CircularBufferQueue<ov::InferRequest>> tokenizers;
    std::optional<CircularBufferQueue<ov::InferRequest>> detokenizers;
    std::optional<int64_t> eos_token_id;
    std::optional<std::string> chat_template;
    // ...

    // Fill only the members that are still nullopt from the files in dir.
    OptionalMembers from_dir_if_nullopt(const std::filesystem::path& dir) {
        read_config(dir);
        read_special_tokens_map(dir);
        // Try to read tokenizer_config if some token ids or token strings are not defined.
        read_tokenizer_config_if_necessary(dir);
        // Assumes a parse_if_exists overload that fills the optional only if it's still nullopt.
        parse_if_exists(dir / "processor_config.json", chat_template);
        parse_if_exists(dir / "chat_template.json", chat_template);
        return *this;
    }
};

// Replace whatever is still nullopt with defaults.
Members default_if_nullopt(OptionalMembers&&) { return Members{}; }

std::pair<std::shared_ptr<ov::Model>, std::shared_ptr<ov::Model>> read_models(const std::filesystem::path&);

OptionalMembers from_rt_info(const std::pair<std::shared_ptr<ov::Model>, std::shared_ptr<ov::Model>>&) { return OptionalMembers{}; }
OptionalMembers from_dir(const std::filesystem::path& dir) {
    return from_rt_info(read_models(dir)).from_dir_if_nullopt(dir);
}

class NewTokenizerImpl;
// Warm up internal caches by running encode() and decode() once.
void initialize_cache(NewTokenizerImpl&);

struct NewTokenizerImpl {
    Members m;

    NewTokenizerImpl(const std::filesystem::path& dir)
        : m(default_if_nullopt(from_dir(dir))) { initialize_cache(*this); }
    NewTokenizerImpl(const std::pair<std::shared_ptr<ov::Model>, std::shared_ptr<ov::Model>>& models)
        : m(default_if_nullopt(from_rt_info(models))) { initialize_cache(*this); }
};
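For illustration, construction would then look like this (the path is hypothetical):

NewTokenizerImpl tokenizer("/path/to/model_dir");               // priority: rt_info, then files, then defaults
NewTokenizerImpl tokenizer2(read_models("/path/to/model_dir")); // rt_info only, then defaults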

@ilya-lavrenov added this to the 2025.1 milestone Feb 11, 2025
@ilya-lavrenov (Contributor) commented:

Please merge with the latest master to ensure CI can pass.

> Depends on huggingface/optimum-intel#1151

But it's not blocked by the optimum PR, right? We can merge without waiting for that one.

@Wovchena enabled auto-merge February 12, 2025 11:04
@Wovchena (Collaborator, Author) replied:

Not blocked

@Wovchena added this pull request to the merge queue Feb 12, 2025
github-merge-queue bot pushed a commit that referenced this pull request Feb 12, 2025
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2025
@Wovchena added this pull request to the merge queue Feb 12, 2025
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2025
@ilya-lavrenov added this pull request to the merge queue Feb 12, 2025
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2025
@Wovchena added this pull request to the merge queue Feb 13, 2025
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 13, 2025
Labels
  • category: continuous batching (Continuous batching)
  • category: GenAI C++ API (Changes in GenAI C++ public headers)
  • category: Python API (Python API for GenAI)
  • category: tokenizers (Tokenizer class or submodule update)
  • no-match-files

Projects: None yet

Development
Successfully merging this pull request may close issue #1663: get_tokenizer().apply_chat_template occurs an error

3 participants