Deterministic mode #501

Open
wants to merge 1 commit into
base: master


anivegesana
Contributor

Implements #19 (comment)

For now, this is what I have chosen for the behavior of "deterministic" mode:

  • set and frozenset will be sorted before being pickled.
    • Subclasses of set and frozenset will not be affected (and will remain nondeterministic) because they can implement their own __reduce__ methods, which don't have to follow the conventions of set's pickling procedure.
    • If the elements cannot be ordered (e.g. complex), they will be sorted by their hash instead. The resulting order is not a natural one that is easy to understand, but it will be deterministic as long as the class's __hash__ function doesn't depend on id.
    • If using the faster cPickle based pickler outlined in Discussion: cPickle #485, this feature may be disabled.
  • dict and subclasses will remain pickled in insertion order.
    • Entries in global variable dictionaries will be in order for each function. The dictionary as a whole, however, will follow the order in which functions are visited and will not be sorted alphabetically. The globals dictionaries will therefore be deterministic provided that the visitation order of functions is deterministic.
    • This feature is guaranteed.
  • code objects will not have their line numbers removed and the file name and function name will not be modified.
    • I thought this would be best to leave up to the packages that use dill rather than requiring all packages that want determinism to drop line number information from code.
    • For now, huggingface datasets can override save_code like they already do, but issues like "incorrect traceback from serialized function in python 3.10" #488 would make that difficult to maintain. It would be best to find a way to emit information that helps users selectively zero out or ignore line information in code objects when hashing. That way, the information that debuggers need still remains in the pickle, but the hash remains consistent.
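
As a rough illustration of the set ordering rule described above (not the code in this PR), the sort-then-fall-back-to-hash behavior could look like the sketch below. The helper name `deterministic_order` is hypothetical.

```python
def deterministic_order(items):
    """Return the elements of a set or frozenset as a list in a repeatable order."""
    try:
        # Natural ordering works when every pair of elements is comparable.
        return sorted(items)
    except TypeError:
        # Fallback for unorderable elements such as complex numbers.
        # Elements with equal hashes keep their iteration order, so hash
        # collisions can still leak nondeterminism. str and bytes hashes
        # also vary between processes unless PYTHONHASHSEED is fixed.
        return sorted(items, key=hash)


print(deterministic_order({"b", "c", "a"}))      # ['a', 'b', 'c']
print(deterministic_order({3 + 1j, 1 + 2j, 2j}))  # ordered by hash value
```

The fallback only gives a stable result when the element hashes themselves are stable, which matches the caveat about `__hash__` not depending on `id`.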

@anivegesana
Contributor Author

I am having a difficult time testing this because non-determinism is an implementation-specific detail: CPython and PyPy use different data structures for sets, so it is hard to find a single test case that exposes the nondeterminism on both of them.

@AaronFriel

@anivegesana what's the status of this PR? Is the primary obstacle still testing - could we just have separate tests for CPython and PyPy and use runtime detection to test-the-test?
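
One possible shape for the per-implementation tests suggested here, as a minimal sketch using only the standard library. The inputs chosen for each interpreter are assumptions about set internals and are not asserted. The "test-the-test" step skips the case when the input does not actually expose order dependence. `build_set`, `SetOrderTests`, and `run_tests` are hypothetical names, and the final assertion is a stand-in for comparing deterministic pickles.

```python
import platform
import unittest

IMPLEMENTATION = platform.python_implementation()  # e.g. "CPython" or "PyPy"


def build_set(values):
    """Insert values one at a time so the insertion order is explicit."""
    result = set()
    for value in values:
        result.add(value)
    return result


class SetOrderTests(unittest.TestCase):
    def check(self, values):
        forward = build_set(values)
        backward = build_set(reversed(values))
        # Test the test: skip when this input shows no order dependence here.
        if list(forward) == list(backward):
            self.skipTest("iteration order did not depend on insertion order")
        # Stand-in for comparing the deterministic pickles of both sets.
        self.assertEqual(sorted(forward), sorted(backward))

    @unittest.skipUnless(IMPLEMENTATION == "CPython", "CPython-specific input")
    def test_cpython(self):
        # Assumption: 0 and 8 collide in a small CPython set table.
        self.check([0, 8])

    @unittest.skipUnless(IMPLEMENTATION == "PyPy", "PyPy-specific input")
    def test_pypy(self):
        # Assumption: PyPy sets iterate in insertion order.
        self.check([1, 2])


def run_tests():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SetOrderTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```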

@anivegesana
Contributor Author

Hey @AaronFriel,

I haven't looked at this in forever and will need some time to figure out what this did and tried to accomplish. With a couple of pass-throughs, I think most of the functionality can be obtained by registering custom pickling functions for set. Is there a particular use case you are looking to use this for?
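
For reference, a minimal sketch of what registering a custom pickling function for set might look like. It assumes the pure-Python pickler from the standard library (`pickle._Pickler`, a private name) and its `dispatch` table rather than dill's own registration mechanism, so it illustrates the idea and is not this PR's implementation. `SortedSetPickler` and `dumps` are hypothetical names.

```python
import io
import pickle


class SortedSetPickler(pickle._Pickler):
    """Hypothetical pickler that writes exact set/frozenset objects in sorted order."""

    # Copy the table so the stdlib pickler itself is left untouched.
    dispatch = pickle._Pickler.dispatch.copy()

    def _save_sorted(self, obj):
        try:
            items = sorted(obj)
        except TypeError:
            # Unorderable elements: fall back to ordering by hash.
            items = sorted(obj, key=hash)
        # Emit the equivalent of type(obj)(items); obj=obj memoizes the set.
        self.save_reduce(type(obj), (items,), obj=obj)

    # dispatch is keyed by exact type, so subclasses are not affected.
    dispatch[set] = _save_sorted
    dispatch[frozenset] = _save_sorted


def dumps(obj, protocol=4):
    """Pickle obj with SortedSetPickler and return the bytes."""
    buffer = io.BytesIO()
    SortedSetPickler(buffer, protocol).dump(obj)
    return buffer.getvalue()


first = set()
for name in ["pear", "apple", "fig"]:
    first.add(name)
second = set()
for name in ["fig", "pear", "apple"]:
    second.add(name)

print(dumps(first) == dumps(second))        # True
print(pickle.loads(dumps(first)) == first)  # True
```

The output is an ordinary pickle that the stock `pickle.loads` can read. Only the writing side changes.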

@AaronFriel

Pulumi, as an infrastructure as code tool, computes diffs between cloud provider resources declared in code, such as an AWS Lambda function, to determine whether to update the resource. A deterministic pickle ensures that we don't see spurious updates.
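
To make the use case concrete, here is a small sketch of the general pattern: fingerprint a pickled value and treat a changed fingerprint as a reason to update. This is not Pulumi's actual code. `fingerprint`, `needs_update`, and the sample configuration are hypothetical. The comparison is only meaningful when the pickle bytes are deterministic.

```python
import hashlib
import pickle


def fingerprint(obj, protocol=4):
    """Hash the pickled bytes of obj; stable only if the pickle is deterministic."""
    return hashlib.sha256(pickle.dumps(obj, protocol=protocol)).hexdigest()


def needs_update(stored_fingerprint, candidate):
    """Report whether the candidate differs from what was previously stored."""
    return stored_fingerprint != fingerprint(candidate)


# Hypothetical resource description; dicts pickle in insertion order.
config = {"runtime": "python3.11", "handler": "app.main", "memory": 256}
stored = fingerprint(config)

unchanged = {"runtime": "python3.11", "handler": "app.main", "memory": 256}
changed = dict(config, memory=512)

print(needs_update(stored, unchanged))  # False
print(needs_update(stored, changed))    # True
```

If the value contained a set of strings, the same logical value could pickle to different bytes in different processes, which is the kind of spurious update a deterministic mode is meant to avoid.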

@N-Demir

N-Demir commented Mar 9, 2024

+1, would really love to see a deterministic pickle because so many libraries' caching strategies depend on it (like prefect)
