Add support for checking hash of downloaded files before use. #230

Merged: 2 commits, Jan 30, 2024


@mdwelsh (Contributor) commented Dec 21, 2023

We use tiktoken in various production scenarios, and the download of .tiktoken files (e.g., cl100k_base.tiktoken) is sometimes interrupted or fails, leaving a corrupted file in the cache. In those cases, the results returned from the encoder can be incorrect and could be damaging to our production instances.

More often, when this happens, Encoder.encode() will throw an exception such as

pyo3_runtime.PanicException: no entry found for key

which turns out to be quite hard to track down.

In an effort to make tiktoken more robust for production use, this PR adds the sha256 hash of each of the downloaded files to openai_public.py and augments read_file to check for the hash, if provided, when the file is accessed from the cache or downloaded directly. This causes errors to be flagged at file load time, rather than when the files are used, and provides a more meaningful error message indicating what might have gone wrong.

This also protects users of tiktoken from scenarios where a network issue or MITM attack could have corrupted these files in transit.
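
For readers who want a picture of the check described above, here is a minimal sketch. It is not the code merged in this PR: the names `check_hash` and `verify_or_raise` and the wording of the error message are assumptions made for illustration. The idea is only that the bytes read from the cache or from a download are hashed with SHA-256 and compared against an expected hex digest, if one was provided, before they are used.

```python
import hashlib
from typing import Optional


def check_hash(data: bytes, expected_hash: str) -> bool:
    """Return True if the SHA-256 hex digest of `data` equals `expected_hash`."""
    return hashlib.sha256(data).hexdigest() == expected_hash


def verify_or_raise(data: bytes, expected_hash: Optional[str], source: str) -> bytes:
    """Hypothetical helper: validate file contents at load time.

    `data` stands for bytes that came from the cache or a fresh download.
    If no expected hash is given, the data is returned unchecked.
    """
    if expected_hash is not None and not check_hash(data, expected_hash):
        # Fail at load time with a message that points at the likely cause,
        # instead of a confusing failure later during encoding.
        raise ValueError(
            f"Hash mismatch for data from {source}: the file may be "
            "truncated or corrupted. Try deleting the cached copy and "
            "downloading it again."
        )
    return data
```

With a helper like this, a truncated cache file is rejected as soon as it is loaded, so the caller can discard the cached copy and retry the download rather than hitting a panic inside the encoder.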

@hauntsaninja (Collaborator) left a comment

Thank you, this is a nice check to have!

@mdwelsh (Contributor, Author) commented Dec 22, 2023

Thanks! Anything I need to do to merge this?

@hauntsaninja merged commit 3ee6c35 into openai:main on Jan 30, 2024 (21 checks passed)
@hauntsaninja (Collaborator) commented

Thank you!
