
tokenizer: read simplified_chat_template #1712

Open · Wovchena wants to merge 4 commits into master
Conversation

Wovchena (Collaborator) commented:

Depends on huggingface/optimum-intel#1151

Close #1663

Ticket 161313

@github-actions bot added the category: tokenizers, category: Python API, category: GenAI C++ API, and no-match-files labels Feb 11, 2025
@ilya-lavrenov self-assigned this Feb 11, 2025
@github-actions bot added the category: continuous batching label Feb 11, 2025
@@ -195,11 +250,10 @@ class Tokenizer::TokenizerImpl {
read_special_tokens_map(models_path);
// Try to read tokenizer_config if some token ids or token str are not defined.
read_tokenizer_config_if_necessary(models_path);
parse_if_exists(models_path / "tokenizer_config.json", m_chat_template);
parse_if_exists(models_path / "processor_config.json", m_chat_template);
parse_if_exists(models_path / "chat_template.json", m_chat_template);
@ilya-lavrenov (Contributor) commented on Feb 11, 2025:
It's not easy to follow the order in which special tokens and chat templates are set.

Maybe we can have 3 separate functions:

  • setup_tokenizers (models)
  • setup_special_tokens
  • setup_chat_template

In this case the code will not look like spaghetti (IMO), and each action is done by a dedicated function, while now the logic and its order are spread across several functions. A rough sketch of that split is below.
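A minimal sketch of what that split could look like as TokenizerImpl members. The helpers (read_config, read_special_tokens_map, read_tokenizer_config_if_necessary, parse_if_exists) and m_chat_template are taken from the code under review; the grouping itself is the proposal, not existing code:

// Sketch: three dedicated setup steps instead of one interleaved sequence.
void setup_tokenizers(const std::pair<std::shared_ptr<ov::Model>, std::shared_ptr<ov::Model>>& models) {
    // Compile the tokenizer/detokenizer models and fill the infer-request queues.
}

void setup_special_tokens(const std::filesystem::path& models_path) {
    read_config(models_path);
    read_special_tokens_map(models_path);
    // Try to read tokenizer_config if some token ids or token strings are not defined.
    read_tokenizer_config_if_necessary(models_path);
}

void setup_chat_template(const std::filesystem::path& models_path) {
    parse_if_exists(models_path / "tokenizer_config.json", m_chat_template);
    parse_if_exists(models_path / "processor_config.json", m_chat_template);
    parse_if_exists(models_path / "chat_template.json", m_chat_template);
}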

@Wovchena (Collaborator, Author) replied:

I had a similar impression. But the code is already grouped by files, which forces the priority to be the same for every special token.

  1. setup_tokenizer() from the file system
    1. read_config()
    2. read_special_tokens_map()
    3. read_tokenizer_config_if_necessary()
    4. processor_config.json
    5. chat_template.json
  2. setup_tokenizer() from rt_info

The impression of spaghettiness comes from read_tokenizer_config_if_necessary(), which refuses to read its file at all if every special token is already initialized, but overrides all tokens if only some of them aren't (sketched below). That's why parse_if_exists(models_path / "tokenizer_config.json", m_chat_template); is outside of read_tokenizer_config_if_necessary().
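Roughly, the behavior in question (a simplified sketch; the member names m_pad_token_id, m_bos_token_id, m_eos_token_id are illustrative, and the real function also resolves token strings):

void read_tokenizer_config_if_necessary(const std::filesystem::path& models_path) {
    // Skip the file entirely when every special token is already initialized...
    if (m_pad_token_id != -1 && m_bos_token_id != -1 && m_eos_token_id != -1) {
        return;
    }
    // ...but if even one is missing, parse tokenizer_config.json and
    // override all of the tokens, not just the missing ones.
}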

I believe the current proposal is the most readable thing possible given the current architecture.

As for refactoring, there are indeed ways to improve tokenizer.cpp. The problem is that the exact implementation is a matter of taste, so I avoided stepping into that. One possible way is the following. Its benefits are that it enforces that every member access starts with m, and it preserves a -1 token id if one is explicitly written in one of the files, although it must still be verified that this is how transformers behaves.

struct Members {
    std::optional<CircularBufferQueue<ov::InferRequest>> tokenizers;
    std::optional<CircularBufferQueue<ov::InferRequest>> detokenizers;
    int64_t eos_token_id;
    // ...
};

struct OptionalMembers {
    std::optional<CircularBufferQueue<ov::InferRequest>> tokenizers;
    std::optional<CircularBufferQueue<ov::InferRequest>> detokenizers;
    std::optional<int64_t> eos_token_id;
    std::optional<std::string> chat_template;
    // ...

    // Fill only the members that are still nullopt from the files in dir.
    OptionalMembers from_dir_if_nullopt(const std::filesystem::path& dir) {
        read_config(dir);
        read_special_tokens_map(dir);
        // Try to read tokenizer_config if some token ids or token strings are not defined.
        read_tokenizer_config_if_necessary(dir);
        // Assumes a parse_if_exists overload that fills the optional only if it's still nullopt.
        parse_if_exists(dir / "processor_config.json", chat_template);
        parse_if_exists(dir / "chat_template.json", chat_template);
        return *this;
    }
};

// Replace whatever is still nullopt with defaults.
Members default_if_nullopt(OptionalMembers&&) { return Members{}; }

std::pair<std::shared_ptr<ov::Model>, std::shared_ptr<ov::Model>> read_models(const std::filesystem::path&);

OptionalMembers from_rt_info(const std::pair<std::shared_ptr<ov::Model>, std::shared_ptr<ov::Model>>&) { return OptionalMembers{}; }
OptionalMembers from_dir(const std::filesystem::path& dir) {
    return from_rt_info(read_models(dir)).from_dir_if_nullopt(dir);
}

class NewTokenizerImpl;
// Warm up internal caches by running encode() and decode() once.
void initialize_cache(NewTokenizerImpl&);

struct NewTokenizerImpl {
    Members m;

    NewTokenizerImpl(const std::filesystem::path& dir)
        : m(default_if_nullopt(from_dir(dir))) { initialize_cache(*this); }
    NewTokenizerImpl(const std::pair<std::shared_ptr<ov::Model>, std::shared_ptr<ov::Model>>& models)
        : m(default_if_nullopt(from_rt_info(models))) { initialize_cache(*this); }
};
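For illustration, construction would then look like this (the path is hypothetical):

NewTokenizerImpl tokenizer("/path/to/model_dir");               // priority: rt_info, then files, then defaults
NewTokenizerImpl tokenizer2(read_models("/path/to/model_dir")); // rt_info only, then defaults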

@ilya-lavrenov added this to the 2025.1 milestone Feb 11, 2025
@ilya-lavrenov (Contributor) commented:

Please merge with the latest master to ensure CI can pass.

> Depends on huggingface/optimum-intel#1151

But it's not blocked by the optimum PR, right? We can merge without waiting for that one.

@Wovchena enabled auto-merge February 12, 2025 11:04
@Wovchena (Collaborator, Author) replied:

Not blocked

@Wovchena added this pull request to the merge queue Feb 12, 2025
github-merge-queue bot pushed a commit that referenced this pull request Feb 12, 2025
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2025
@Wovchena added this pull request to the merge queue Feb 12, 2025
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2025
@ilya-lavrenov added this pull request to the merge queue Feb 12, 2025
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2025
@Wovchena added this pull request to the merge queue Feb 13, 2025
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 13, 2025
Labels
  • category: continuous batching (Continuous batching)
  • category: GenAI C++ API (Changes in GenAI C++ public headers)
  • category: Python API (Python API for GenAI)
  • category: tokenizers (Tokenizer class or submodule update)
  • no-match-files

Projects: None yet

Development
Successfully merging this pull request may close issue #1663: get_tokenizer().apply_chat_template occurs an error

3 participants