ignore invalid UTF-8 input in the BPE tokenizer
Silently insert U+FFFD (the Unicode replacement character) for each invalid
byte until the next valid codepoint is found.

This stops `llama_tokenize` from throwing an exception across the C API
boundary or libllama's module boundary, where the caller's runtime might be
incompatible.

Returning a proper error code might be desirable; however, the signature
of `llama_tokenize` doesn't allow it, as all return values already have an
existing meaning.
cfillion committed Feb 7, 2025
1 parent b7552cf commit 6c70a3a
Showing 1 changed file with 7 additions and 1 deletion.
8 changes: 7 additions & 1 deletion src/unicode.cpp
@@ -618,7 +618,13 @@ std::vector<uint32_t> unicode_cpts_from_utf8(const std::string & utf8) {
     result.reserve(utf8.size());
     size_t offset = 0;
     while (offset < utf8.size()) {
-        result.push_back(unicode_cpt_from_utf8(utf8, offset));
+        try {
+            result.push_back(unicode_cpt_from_utf8(utf8, offset));
+        }
+        catch (const std::invalid_argument & /*ex*/) {
+            ++offset;
+            result.emplace_back(0xFFFD); // replacement character
+        }
     }
     return result;
 }
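
For illustration, here is a minimal standalone sketch of the same skip-and-replace loop. `decode_one` is a hypothetical, simplified stand-in for `unicode_cpt_from_utf8` (it only checks lead and continuation byte patterns, with no overlong or surrogate validation), and `cpts_lossy` is a hypothetical name for the patched loop. The sketch assumes, as the `++offset` in the patch implies, that the decoder leaves `offset` untouched when it throws.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified decoder: reads one codepoint starting at `offset`
// and advances `offset` only on success. Throws std::invalid_argument on a
// bad lead byte, a truncated sequence, or a bad continuation byte.
uint32_t decode_one(const std::string & s, size_t & offset) {
    const uint8_t b0 = static_cast<uint8_t>(s[offset]);
    size_t   len;
    uint32_t cpt;
    if      (b0 < 0x80)           { len = 1; cpt = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cpt = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cpt = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cpt = b0 & 0x07; }
    else { throw std::invalid_argument("invalid lead byte"); }

    if (offset + len > s.size()) {
        throw std::invalid_argument("truncated sequence");
    }
    for (size_t i = 1; i < len; ++i) {
        const uint8_t b = static_cast<uint8_t>(s[offset + i]);
        if ((b & 0xC0) != 0x80) {
            throw std::invalid_argument("invalid continuation byte");
        }
        cpt = (cpt << 6) | (b & 0x3F);
    }
    offset += len;
    return cpt;
}

// Same shape as the patched loop: on a decode error, skip exactly one byte
// and emit U+FFFD, so the function never throws on malformed input.
std::vector<uint32_t> cpts_lossy(const std::string & utf8) {
    std::vector<uint32_t> result;
    result.reserve(utf8.size());
    size_t offset = 0;
    while (offset < utf8.size()) {
        try {
            result.push_back(decode_one(utf8, offset));
        } catch (const std::invalid_argument &) {
            ++offset;
            result.emplace_back(0xFFFD); // replacement character
        }
    }
    return result;
}
```

With this sketch, `cpts_lossy("a\xFF\xC3\xA9\xE2\x82")` yields U+0061, U+FFFD, U+00E9, U+FFFD, U+FFFD: the stray `0xFF` and each byte of the truncated three-byte sequence become one replacement character apiece, while the valid `é` in between decodes normally.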