Add support for checking hash of downloaded files before use. #230
We are using tiktoken in various production scenarios and sometimes have the problem that the download of `.tiktoken` files (e.g., `cl100k_base.tiktoken`) gets interrupted or fails, leaving the cached file corrupted in some way. In those cases, the results returned from the encoder will be incorrect and could be damaging to our production instances.

More often, when this happens, `Encoder.encode()` will throw an exception, which turns out to be quite hard to track down.

In an effort to make tiktoken more robust for production use, this PR adds the `sha256` hash of each of the downloaded files to `openai_public.py` and augments `read_file` to check for the hash, if provided, when the file is accessed from the cache or downloaded directly. This causes errors to be flagged at file load time, rather than when the files are used, and provides a more meaningful error message indicating what might have gone wrong.

This also protects users of tiktoken from scenarios where a network issue or MITM attack could have corrupted these files in transit.
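
For readers who want a concrete picture of the check described above, here is a minimal sketch of verifying a file's SHA-256 digest at load time. It is not the code from this PR: the names `check_hash` and `read_cached_checked` and the wording of the error message are hypothetical, and the sketch only covers reading from a local cache path, not the download step.

```python
import hashlib
from typing import Optional


def check_hash(data: bytes, expected_hash: str) -> bool:
    """Return True if the SHA-256 hex digest of `data` equals `expected_hash`."""
    return hashlib.sha256(data).hexdigest() == expected_hash


def read_cached_checked(path: str, expected_hash: Optional[str] = None) -> bytes:
    """Hypothetical helper: read a cached file and verify it when a hash is given.

    If `expected_hash` is None the contents are returned unchecked, mirroring
    the "if provided" behavior described in the PR text.
    """
    with open(path, "rb") as f:
        data = f.read()
    if expected_hash is not None and not check_hash(data, expected_hash):
        # Fail at load time with a message that points at the likely cause.
        raise ValueError(
            f"Hash mismatch for {path} (expected {expected_hash}). "
            "The file may be corrupted or truncated; delete it and retry."
        )
    return data
```

With a check like this, a truncated cache file raises at load time with the path and expected hash in the message, rather than surfacing later as an unrelated encoding error.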