Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stable_hash(::Regex) #66

Open
rasmushenningsson opened this issue Jun 18, 2024 · 13 comments
Open

stable_hash(::Regex) #66

rasmushenningsson opened this issue Jun 18, 2024 · 13 comments

Comments

@rasmushenningsson
Copy link
Contributor

Regex is a mutable struct which causes it to stable_hash differently each time.

julia> bytes2hex(stable_hash(r"hello"; version=3))
"d9bdaff2d67b5f0d6890fafddde31b70678837711d5352fbc848a436b47d8e22"

julia> bytes2hex(stable_hash(r"hello"; version=3))
"665ac86a45a852990551faa2462d395029a029336935e12be6bfaf1ecabcc76d"

I think it would make sense to make it hash in a stable manner by default. Do you agree?
(The reason why it is a mutable struct is so it can interact with the GC to free the pointer to the compiled regex.)

Here's the corresponding hash function in Base, indicating that we should take pattern, compile_options and match_options into account.

function hash(r::Regex, h::UInt)
    h += hashre_seed
    h = hash(r.pattern, h)
    h = hash(r.compile_options, h)
    h = hash(r.match_options, h)
end
@haberdashPI
Copy link
Member

I think it would make sense to make it hash in a stable manner by default. Do you agree?

Yes, that sounds useful to me. I am wrapping up some work in #55 to improve the stability of hashes with a new API design for customizing how types get hashed. I think this improvement makes sense to make after that merges.

@haberdashPI
Copy link
Member

haberdashPI commented Jul 1, 2024

In the meantime, I believe the following should be an effective workaround. (Though I haven't tested it).

StableHashTraits.hash_method(::Regex) = FnHash(x -> (x.pattern, x.compile_options, r.match_options))

Or, for a specific context, you could do

struct HashRegex{T}
   parent::T
end
StableHashTraits.parent_context(x::HashRegex) = x.parent
StableHashTraits.hash_method(::Regex, ::HashRegex) = FnHash(x -> (x.pattern, x.compile_options, r.match_options))
stable_hash(r"a*b*", HashRegex(HashVersion{3}()))

@haberdashPI
Copy link
Member

Resolved by #58

@haberdashPI
Copy link
Member

I don't think #58 actually fixes this, it's just now feasible to fix. Need to add tests and methods of transformer.

@rasmushenningsson
Copy link
Contributor Author

I've looked into this again and it seems to working for HashVersion{4}.

Here's what the Regex definition looks like:

mutable struct Regex <: AbstractPattern
    pattern::String
    compile_options::UInt32
    match_options::UInt32
    regex::Ptr{Cvoid} # Cvoid===Nothing
end

For stable_hash, we want to ignore the regex::Ptr{Nothing}.

But actually, Ptr already hashes the same regardless of contents:

julia> bytes2hex(stable_hash(Ptr{Nothing}(); version=4))
"17cadb739b5ed376bb7c939f139debd2cba4530a93cb7ad7bbc1e61dbd3387eb"

julia> bytes2hex(stable_hash(Ptr{Nothing}(1234); version=4))
"17cadb739b5ed376bb7c939f139debd2cba4530a93cb7ad7bbc1e61dbd3387eb"

julia> bytes2hex(stable_hash(Ptr{Int}(1234); version=4))
"17cadb739b5ed376bb7c939f139debd2cba4530a93cb7ad7bbc1e61dbd3387eb"

so my conclusion is that it works now.

@rasmushenningsson
Copy link
Contributor Author

The natural follow-up question is if Ptr should hash differently depending on the pointer?
My initial feeling is that this perhaps should error. Because there is no way to know how to hash pointers in general. Sometimes (like here) the value doesn't matter, but another time it could be critical that they are different.

I'm fine with any handling of Ptr though.

@rasmushenningsson
Copy link
Contributor Author

Here's a simple transformer for Regex in case we want it:

transform_type(::Type{Regex}) = "Base.Regex"
function transformer(::Type{Regex}, ::HashVersion{4})
    # This skips the compiled regex which is stored as a Ptr{Nothing}
    return Transformer(pick_fields(:pattern, :compile_options, :match_options))
end

@haberdashPI
Copy link
Member

Ah... interesting. Sorry, I somehow missed your updates in the issue here.

I think we probably need some unit tests for Ptr. I don't really understand why Ptrs hash the same, probably because I don't really understand what the representation and StructType of Ptr is.

Ptr is a good sign that a value should not be serialized, so the handling where it simply always has the same value does probably make sense, but I want to make sure that is a defined (and documented) behavior.

@rasmushenningsson
Copy link
Contributor Author

After digging a little further, it looks like it has to do with how bitstypes (primitive types) are handled.

Here's an example showing what's going on for Ptr, by defining a new primitive type that hashes exactly the same way.

primitive type FakePtr 64 end
FakePtr(x) = Base.bitcast(FakePtr, x)
StableHashTraits.transform_type(::Type{FakePtr}) = "Ptr"

As we can see, the FakePtr and Ptr hashes the same. (And the value stored in the bitstype is ignored.)

julia> p = Ptr{Int}(1)
Ptr{Int64} @0x0000000000000001

julia> fp = FakePtr(2)
FakePtr(0x0000000000000002)

julia> bytes2hex(stable_hash(p; version=4))
"17cadb739b5ed376bb7c939f139debd2cba4530a93cb7ad7bbc1e61dbd3387eb"

julia> bytes2hex(stable_hash(fp; version=4))
"17cadb739b5ed376bb7c939f139debd2cba4530a93cb7ad7bbc1e61dbd3387eb"

Also note that

julia> StableHashTraits.hash_trait(p)
StructTypes.UnorderedStruct()

julia> fieldtypes(typeof(p))
()

In conclusion, bitstypes are treated as structs without any fields.

Is there any policy on how to handle bitstypes in StableHashTraits? I guess we would like to actually hash the bits in most cases?
And then possibly change Ptr if it should be treated differently from standard bitstypes.

@haberdashPI
Copy link
Member

haberdashPI commented Oct 30, 2024

Is there any policy on how to handle bitstypes in StableHashTraits? I guess we would like to actually hash the bits in most cases?
And then possibly change Ptr if it should be treated differently from standard bitstypes.

Here I think the issue is that by default StructTypes falls back to UnorderedStruct. That makes sense to me: we don't really know how some custom bitstype that isn't <: Number should be serialized.

🤔 maybe if a type has a StructType of UnorderedStruct with no fields, StableHashTraits should error by default? (And redirect the user to SingletonType if that was really their intent).

@rasmushenningsson
Copy link
Contributor Author

Erroring by default sounds good to me. These are edge cases where the user should take action that is suitable for their situation, e.g. by defining their own hash context.

@rasmushenningsson
Copy link
Contributor Author

I have just realized that the Regex hashing isn't related to Ptr. I'm learning more and more about StableHashTraits as I go...

This is because Regex is considered a StringType in StructTypes:

julia> StructTypes.StructType(r"a")
StructTypes.StringType()

The regex will get hashed using

function stable_hash_helper(str, hash_state, context, ::StructTypes.StringType)
    nested_hash_state = start_nested_hash!(hash_state)
    update_hash!(nested_hash_state, str isa AbstractString ? str : string(str), context)
    return end_nested_hash!(hash_state, nested_hash_state)
end

where the important part is string(str).

I discovered this accidentally when running the tests on Julia 1.9 (and earlier).
Because the implementation of string(::Regex) changed in Julia 1.10.

This feels less than ideal. The string representation of Regex can potentially change again, if someone decides it can be printed more nicely somehow.

What do you think is the best course of action?

  • If we want to keep the current Regex hash (version 4) and make it stable across previous and future Julia versions, we need to copy the implementation of show(io::IO, re::Regex) from Base in Julia 1.11 into StableHashTraits.
  • If we accept that the hash changes, I suggest using the Transformer-based implementation above.

@haberdashPI
Copy link
Member

haberdashPI commented Oct 31, 2024

I agree that relying on the string output of Regex is probably not a good course of action. The transformer representation seems like the "right" way to do it, long term.

Given that 1.3 was only recently released, it seems relatively safe to switch to this behavior.

It also isn't obvious to me which would be the right choice if we go with a "string" solution: using the show output for ≥ 1.10 or for < 1.9? (I can see an argument for either: 1.9 because it's older, and so arguably 1.10 "broke" this hash, or 1.10 because its the new LTS, and because it is the release that was available when 1.3 was released). So it isn't so clear that either yields greater "stability".

So at the moment I am leaning towards simply using the transform approach, and declaring the old behavior for Regex a bug (because it wasn't stable).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants