Ignore invalid UTF-8 input in the BPE tokenizer #11729

cfillion · 2025-02-07T09:02:58Z

This fixes llama_tokenize throwing an exception across the C API boundary or libllama's module boundary (the caller's C++ runtime might be incompatible, similar to #11727). Invalid UTF-8 input is now handled by silently inserting U+FFFD (the Unicode replacement character) for each invalid byte until the next valid codepoint is found.

llama_tokenize(vocab, "\xE2\x80", 2, nullptr, 0, false, false);

// terminate called after throwing an instance of 'std::invalid_argument'
//   what():  invalid character

#7  __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7ffff7876d90 <typeinfo for std::invalid_argument>, dest=0x7ffff76c5b30 <std::invalid_argument::~invalid_argument()>)
#8  unicode_cpt_from_utf8 (utf8=<incomplete sequence \342\200>, offset=@0x7fffffffd520: 0)
#9  unicode_cpts_from_utf8 (utf8=<incomplete sequence \342\200>)
#10 unicode_regex_split (text=<incomplete sequence \342\200>, regex_exprs=std::vector of length 1, capacity 1 = {...})
#11 llm_tokenizer_bpe_session::tokenize (this=0x7fffffffdab0, text=<incomplete sequence \342\200>, output=std::vector of length 0, capacity 0)
#12 llama_vocab::impl::tokenize (this=0x55555557c5b0, raw_text=<incomplete sequence \342\200>, add_special=false, parse_special=false)
#13 llama_vocab::tokenize (this=0x555556081440, raw_text=<incomplete sequence \342\200>, add_special=false, parse_special=false)
#14 llama_vocab::tokenize (this=0x555556081440, text=0x555555559074 <incomplete sequence \342\200>, text_len=2, tokens=0x0, n_tokens_max=0, add_special=false, parse_special=false)
#15 llama_tokenize (vocab=0x555556081440, text=0x555555559074 <incomplete sequence \342\200>, text_len=2, tokens=0x0, n_tokens_max=0, add_special=false, parse_special=false)

Non-C++ callers cannot reasonably catch that, C++ callers across the DLL boundary might invoke undefined behavior, and the documentation doesn't state that the input must be fully valid UTF-8 or else face unavoidable std::terminate().

Returning a proper error code might be desirable; however, llama_tokenize already assigns meaning to all return values.
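To illustrate why no return value is left over for signalling an error, here is a minimal sketch of the return convention as I understand it from llama.h: a non-negative result is the token count, and a negative result is the negated count that would have been needed. `mock_tokenize` and `tokenize_two_pass` are hypothetical stand-ins written for this example, not llama.cpp functions, and the one-token-per-byte behavior is an assumption made purely to keep the sketch self-contained.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for llama_tokenize (NOT the real function): it emits one
// "token" per input byte, but mirrors the assumed return convention:
//   >= 0 : number of tokens written
//   <  0 : buffer too small; the magnitude is the number of tokens required
static int32_t mock_tokenize(const char * text, int32_t text_len,
                             int32_t * tokens, int32_t n_tokens_max) {
    const int32_t needed = text_len;
    if (needed > n_tokens_max) {
        return -needed;
    }
    for (int32_t i = 0; i < needed; ++i) {
        tokens[i] = static_cast<uint8_t>(text[i]);
    }
    return needed;
}

// Usual two-pass pattern: query the size with a null buffer, then fill it.
static std::vector<int32_t> tokenize_two_pass(const std::string & text) {
    const int32_t len = static_cast<int32_t>(text.size());
    const int32_t n   = -mock_tokenize(text.data(), len, nullptr, 0);
    std::vector<int32_t> tokens(n);
    if (n > 0) {
        mock_tokenize(text.data(), len, tokens.data(), n);
    }
    return tokens;
}
```

Under this convention every integer already means something to existing callers, so an in-band error code would be indistinguishable from a size query result.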

Other areas of llama.cpp already behave similarly when encountering invalid input:

llama.cpp/src/llama-vocab.cpp

Lines 1041 to 1048 in b7552cf

        try {
            // if yes, return this sequence unmodified
            size_t prefix_offset = input_offset;
            unicode_cpt_from_utf8(input, prefix_offset);
            return { &input[input_offset], prefix_offset - input_offset, prefix_offset - input_offset };
        } catch (std::invalid_argument & /*ex*/) {
            // if no, consume 1 byte and return U+FFFD - REPLACEMENT CHARACTER
            return { "\xEF\xBF\xBD", 3, 1 };

slaren · 2025-02-07T10:32:11Z

src/unicode.cpp

+        try {
+            result.push_back(unicode_cpt_from_utf8(utf8, offset));
+        }
+        catch (const std::invalid_argument & /*ex*/) {
+            ++offset;
+            result.emplace_back(0xFFFD); // replacement character
+        }


Please add a comment explaining this logic. This is a hack and we will need to deal with this properly at some point.
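For reference, the recovery loop in the diff above can be sketched in isolation. This is a simplified illustration, not the code in src/unicode.cpp: `decode_one` is a hypothetical stand-in for unicode_cpt_from_utf8 that only checks lead bytes, sequence length, and continuation bytes (it does not reject overlong encodings or surrogates), and `decode_lossy` is a hypothetical name for the surrounding loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified decoder: reads one codepoint starting at `offset`
// and advances `offset` past it. Throws std::invalid_argument on a bad lead
// byte, a truncated sequence, or a bad continuation byte. `offset` is left
// unchanged when it throws.
static uint32_t decode_one(const std::string & s, size_t & offset) {
    const auto byte = [&](size_t i) { return static_cast<uint8_t>(s[i]); };
    const uint8_t b0 = byte(offset);
    size_t   len;
    uint32_t cpt;
    if      (b0 < 0x80)           { len = 1; cpt = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cpt = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cpt = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cpt = b0 & 0x07; }
    else throw std::invalid_argument("invalid character");
    if (offset + len > s.size()) {
        throw std::invalid_argument("truncated sequence");
    }
    for (size_t i = 1; i < len; ++i) {
        const uint8_t b = byte(offset + i);
        if ((b & 0xC0) != 0x80) {
            throw std::invalid_argument("invalid continuation byte");
        }
        cpt = (cpt << 6) | (b & 0x3F);
    }
    offset += len;
    return cpt;
}

// Decode the whole string without letting an exception escape: each byte that
// cannot start a valid sequence is skipped and replaced with U+FFFD.
static std::vector<uint32_t> decode_lossy(const std::string & s) {
    std::vector<uint32_t> result;
    size_t offset = 0;
    while (offset < s.size()) {
        try {
            result.push_back(decode_one(s, offset));
        } catch (const std::invalid_argument & /*ex*/) {
            ++offset;                  // consume exactly one byte
            result.push_back(0xFFFD);  // replacement character
        }
    }
    return result;
}
```

With this sketch, the two-byte input from the PR description (`"\xE2\x80"`) yields two U+FFFD codepoints, one per rejected byte, and decoding resumes normally at the next valid lead byte.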

Silently insert U+FFFD(s) (Unicode replacement character) instead until the next valid codepoint can be found. This fixes `llama_tokenize` throwing an exception across the C API boundary or libllama's module boundary (the caller's runtime might be incompatible!). Returning a proper error code might be desirable; however, the signature of `llama_tokenize` doesn't allow it, as all return values already have existing meaning.


ggerganov approved these changes Feb 7, 2025


slaren reviewed Feb 7, 2025


cfillion force-pushed the bpe-tokenize-mojibake-noexcept branch from 6c70a3a to cff1c3b Compare February 7, 2025 10:36

slaren approved these changes Feb 7, 2025


ggerganov merged commit 2d219b3 into ggerganov:master Feb 7, 2025
46 checks passed
