tokenizer: read simplified_chat_template #1712
base: master
Conversation
Depends on huggingface/optimum-intel#1151
Close openvinotoolkit#1663
Ticket 161313
@@ -195,11 +250,10 @@ class Tokenizer::TokenizerImpl {
    read_special_tokens_map(models_path);
    // Try to read tokenizer_config if some token ids or token str are not defined.
    read_tokenizer_config_if_necessary(models_path);
    parse_if_exists(models_path / "tokenizer_config.json", m_chat_template);
    parse_if_exists(models_path / "processor_config.json", m_chat_template);
    parse_if_exists(models_path / "chat_template.json", m_chat_template);
It's not easy to follow the order in which special tokens and chat_templates are set.
Maybe we can have 3 separate functions:
- setup_tokenizers (models)
- setup_special_tokens
- setup_chat_template
That way the code won't look like spaghetti (IMO) and each action is done by a dedicated function, whereas now the logic / order is spread across several functions (roughly as sketched below).
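A rough sketch of that split, assuming the new functions become members of the tokenizer implementation class and the existing read_config / read_special_tokens_map / read_tokenizer_config_if_necessary / parse_if_exists helpers keep their current roles (the class name and helper declarations below are illustrative only):

#include <filesystem>
#include <string>

// Illustrative split only; the helper declarations stand in for the existing ones in tokenizer.cpp.
class TokenizerImplSketch {
    std::string m_chat_template;

    // Existing helpers, declared here only so the sketch is self-contained.
    void read_config(const std::filesystem::path&);
    void read_special_tokens_map(const std::filesystem::path&);
    void read_tokenizer_config_if_necessary(const std::filesystem::path&);
    void parse_if_exists(const std::filesystem::path&, std::string&);

    // 1. Compile tokenizer/detokenizer models (body elided).
    void setup_tokenizers(const std::filesystem::path& models_path);

    // 2. Resolve all special tokens in one place.
    void setup_special_tokens(const std::filesystem::path& models_path) {
        read_config(models_path);
        read_special_tokens_map(models_path);
        // Try to read tokenizer_config if some token ids or token str are not defined.
        read_tokenizer_config_if_necessary(models_path);
    }

    // 3. Resolve the chat template, later files overriding earlier ones.
    void setup_chat_template(const std::filesystem::path& models_path) {
        parse_if_exists(models_path / "tokenizer_config.json", m_chat_template);
        parse_if_exists(models_path / "processor_config.json", m_chat_template);
        parse_if_exists(models_path / "chat_template.json", m_chat_template);
    }
};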
I had a similar impression. But the code is already grouped by files, which forces the priority to be the same for every special token:
- setup_tokenizer() from the file system
- read_config()
- read_special_tokens_map()
- read_tokenizer_config_if_necessary()
- processor_config.json
- chat_template.json
- setup_tokenizer() from rt_info
The impression of spaghettiness comes from read_tokenizer_config_if_necessary(), which refuses to read its file at all if every special token is already initialized, yet overrides all tokens when only some of them aren't initialized (roughly the behavior sketched below). That's why parse_if_exists(models_path / "tokenizer_config.json", m_chat_template); is outside of read_tokenizer_config_if_necessary().
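For illustration, the described behavior amounts to something like the following; the member names are assumptions, not the actual code in tokenizer.cpp:

#include <cstdint>
#include <filesystem>

// Paraphrase of the described behavior, not the real implementation.
struct SpecialTokensSketch {
    int64_t m_pad_token_id = -1, m_bos_token_id = -1, m_eos_token_id = -1;

    void read_tokenizer_config_if_necessary(const std::filesystem::path& models_path) {
        // Skip the file entirely when every special token is already initialized...
        if (m_pad_token_id != -1 && m_bos_token_id != -1 && m_eos_token_id != -1)
            return;
        // ...otherwise apply every token found in tokenizer_config.json,
        // overriding even the ones that were already initialized (JSON reading elided).
    }
};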
I believe the current proposal is the most readable thing possible given the current architecture.
As for refactoring, there are indeed ways to improve tokenizer.cpp. The problem is that the exact implementation is a matter of taste, so I avoided stepping into that. One way to do it is the following. Its benefits are that it enforces starting every member with m and preserves a -1 token if it's explicitly written in one of the files, although it must be verified whether that's how transformers works.
struct Members {
    std::optional<CircularBufferQueue<ov::InferRequest>> tokenizers;
    std::optional<CircularBufferQueue<ov::InferRequest>> detokenizers;
    int64_t eos_token_id;
    // ...
};

struct OptionalMembers {
    std::optional<CircularBufferQueue<ov::InferRequest>> tokenizers;
    std::optional<CircularBufferQueue<ov::InferRequest>> detokenizers;
    std::optional<int64_t> eos_token_id;
    std::optional<std::string> chat_template;
    // ...

    // Fill only the fields that are still nullopt, reading the files in the current priority order.
    OptionalMembers from_dir_if_nullopt(std::filesystem::path dir) {
        read_config(dir);
        read_special_tokens_map(dir);
        // Try to read tokenizer_config if some token ids or token str are not defined.
        read_tokenizer_config_if_necessary(dir);
        parse_if_exists(dir / "processor_config.json", *chat_template);
        parse_if_exists(dir / "chat_template.json", *chat_template);
        return *this;
    }
};

// Replace every field that is still nullopt with its default value.
Members default_if_nullopt(OptionalMembers&&) { return Members{}; }

std::pair<std::shared_ptr<ov::Model>, std::shared_ptr<ov::Model>> read_models(std::filesystem::path) { return {}; }

// rt_info has priority; the files only fill in what rt_info didn't provide.
OptionalMembers from_rt_info(std::pair<std::shared_ptr<ov::Model>, std::shared_ptr<ov::Model>>) { return OptionalMembers{}; }
OptionalMembers from_dir(std::filesystem::path dir) {
    return from_rt_info(read_models(dir)).from_dir_if_nullopt(dir);
}

class NewTokenizerImpl;
void initialize_cache(NewTokenizerImpl&) { encode(); decode(); }

struct NewTokenizerImpl {
    Members m;
    NewTokenizerImpl(const std::filesystem::path& dir)
        : m(default_if_nullopt(from_dir(dir))) { initialize_cache(*this); }
    NewTokenizerImpl(const std::pair<std::shared_ptr<ov::Model>, std::shared_ptr<ov::Model>>& models)
        : m(default_if_nullopt(from_rt_info(models))) { initialize_cache(*this); }
};
Please merge with the latest master to ensure CI can pass.
But it's not blocked by the optimum PR, right? We can merge without waiting for that one.
Not blocked