Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix VersionType pattern #41

Open
cdanger opened this issue Dec 8, 2024 · 39 comments
Open

Fix VersionType pattern #41

cdanger opened this issue Dec 8, 2024 · 39 comments
Labels
bug Something isn't working

Comments

@cdanger
Copy link

cdanger commented Dec 8, 2024

The VersionType pattern is currently (\d+\.)*\d+ which allows meaningless versions with multiple leading / trailing zeros, such as :

0000000.1

0.000000

Fix: change the pattern to ((0|[1-9]\d*)\.)*(0|[1-9]\d*) .

See also SemVer pattern for more elaborate semantic versioning patterns.

@cdanger cdanger added this to the xacml-core-3.1 milestone Dec 8, 2024
@humantypo
Copy link

I just tested this on pythex and these are considered valid:

0000.
.01
123.
09.83
9.800000

Playing around I came up with this:

\b(0|[1-9]{1,})\.(0|0*[1-9]{1,})(\.(0|0*[1-9]{1,}))?\b

Seems to work better.

You can play around with it here:

https://pythex.org/?regex=%5Cb(0%7C%5B1-9%5D%7B1%2C%7D)%5C.(0%7C0*%5B1-9%5D%7B1%2C%7D)(%5C.(0%7C0*%5B1-9%5D%7B1%2C%7D))%3F%5Cb&test_string=0000.%0A0.01%0A.01%0A.%0A1.1%0A1.005%0A123.%0A2.3.6%0A09.83%0A9.800000&ignorecase=0&multiline=1&dotall=0&verbose=1

@cdanger
Copy link
Author

cdanger commented Dec 14, 2024

The pattern is fine but beware that this pattern goes into the XSD <xs:pattern value="((0|[1-9]\d*)\.)*(0|[1-9]\d*)"/> in which case the pattern is applied to the whole string using regex delimiters ^ and $ implicitly (to match the start and end of the string). So if you want to test it with pythex or any average regex parser, make sure you add these delimiters, so the actual regex for pythex would be: ^((0|[1-9]\d*)\.)*(0|[1-9]\d*)$.

Also 9.80000 should be considered valid, I don't see a problem with that one (it's just the 80000 nth minor version of major version 9).

@humantypo
Copy link

OK, “^$” can be added to ensure it’s an isolated string. The core regex still holds from what I’ve tested. IMO 9.0 is OK, 9.000000 is not because it’s unbounded. We can arbitrarily bind it. How many chars?

@cdanger
Copy link
Author

cdanger commented Dec 15, 2024

IMO 9.0 is OK, 9.000000 is not because it’s unbounded.

Not sure what you are trying to say. If you test the regex with '^$' , i.e. ^((0|[1-9]\d*)\.)*(0|[1-9]\d*)$ , you'll see that 9.0 is indeed valid but 9.0000 is not (9.00 is invalid, 9.000 is invalid, 9.0000 is invalid, 9.00000 is invalid, etc.). So it agrees with you :-) . But again, in the XSD pattern syntax, you don't need the regex delimiters ^$, it's implicit.

@humantypo
Copy link

I was referring to this: Also 9.80000 should be considered valid, I don't see a problem with that one

It would be very easy to modify the regex to allow for some (arbitrary) number of consecutive 0s. While I realize that there might be some sort of system out there that would track thousands of "minor" versions, I was suggesting we cap the number so as to have "completeness" in the formula. 1.0000000000000000000000000000000000000 is not realistic, even though 1.000 might be. Yes, it will be arbitrary, but in this space I suspect that we could cover all real world situations with something like this without leaving it unbounded:

https://pythex.org/?regex=%5Cb(0%7C%5B1-9%5D%7B1%2C%7D)%5C.(0%7B1%2C4%7D%7C0*%5B1-9%5D%7B1%2C%7D)(%5C.(0%7B1%2C4%7D%7C0*%5B1-9%5D%7B1%2C%7D))%3F%5Cb&test_string=1.0%0A1.00%0A1.0000%0A1.00000000000000000%0A0000.%0A0.01%0A.01%0A.%0A1.1%0A1.005%0A123.%0A2.3.6%0A09.83%0A9.800000%0A765.8765%0A0%0A1&ignorecase=0&multiline=1&dotall=0&verbose=1

On a side note, this regex assumes there is a decimal. Is that something we wish to enforce? Seems reasonable to me.

@cdanger
Copy link
Author

cdanger commented Dec 15, 2024

OK, I need to clarify. The dot character '.' here is not a decimal in a versioning scheme, but a delimiter between the different versioning parts: Major version, Minor version, Patch version, etc. So it can be repeated more than once.

In fact, my goal is to follow SemVer (semantic versioning) to some extent, as suggested in my first comment. SemVer is a well-known best practice for versioning, especially in software. Quote from SemVer: A normal version number MUST take the form X.Y.Z where X, Y, and Z are non-negative integers, and MUST NOT contain leading zeroes. X is the major version, Y is the minor version, and Z is the patch version. Each element MUST increase numerically.

I simply reused the SemVer regex (the second one), but removed the Build metadata part which may be a bit overly complex for policy versioning I think, and allowed extra versioning parts after the Patch version (X.Y.Z1.Z2.Z3...). (Although I wouldn't mind sticking with pure SemVer with only 3 versioning parts, because I am not convinced we need more than that for just policy versioning.)

Now going back to your example, 1.000 would not be valid, but instead 1.0.0.0 is valid for the proposed regex, i.e. Major version = 1, Minor version = 0, Patch version = 0, Sub-version (could be called build version) = 0.

@steven-legg
Copy link

This also applies to VersionMatchType. For matching numbers the standard just says "a direct numeric match". I looked at what my implementation is doing, which is to ignore leading zeros for each number. So 0, 00 and 000 are the same number and 8, 08 and 008 are the same number, but 8, 80 and 800 are different numbers.

The matching needs to be better specified, either by saying that leading zeroes are ignored (or significant) or by changing the patterns to outlaw leading zeroes. Outlawing leading zeroes means that the equality of two versions can be determined by a string match.

@humantypo
Copy link

I can easily update it to allow leading zeros if that’s the direction we want to go. I’m not a fan of it but trivial change on the regex side. Agree 100% on improving the specification.

@cdanger
Copy link
Author

cdanger commented Dec 17, 2024

This also applies to VersionMatchType. For matching numbers the standard just says "a direct numeric match". I looked at what my implementation is doing, which is to ignore leading zeros for each number. So 0, 00 and 000 are the same number and 8, 08 and 008 are the same number, but 8, 80 and 800 are different numbers.

The matching needs to be better specified, either by saying that leading zeroes are ignored (or significant) or by changing the patterns to outlaw leading zeroes. Outlawing leading zeroes means that the equality of two versions can be determined by a string match.

Yes, like in the VersionType pattern I propose, leading zeroes are outlawed, each part (separated by a dot) must be a valid integer representation in decimal.

But I propose we deal with VersionMatchType in a separate issue and reuse the resolution of this one.

@cdanger
Copy link
Author

cdanger commented Dec 17, 2024

\b(0|[1-9]{1,}).(0|0*[1-9]{1,})(.(0|0*[1-9]{1,}))?\b

Played with it and since \b is just a word boundary, x 0.0 (or xx 0.0, xxx 0.0) is considered valid per this regex :-( . So I'd suggest replacing \b with ^$, so that means
^(0|[1-9]{1,})\.(0|0*[1-9]{1,})(\.(0|0*[1-9]{1,}))?$
but then 10.1.1 is no longer valid.

So... any benefit compared to ^((0|[1-9]\d*)\.)*(0|[1-9]\d*)$ ?

@steven-legg
Copy link

So... any benefit compared to ^((0|[1-9]\d*)\.)*(0|[1-9]\d*)$ ?

No.

@steven-legg
Copy link

allows for:

1
2
0
9.800000

Seems okay to me. I don't see a problem with anyone choosing just a single number that is monotonically increasing as the version. We don't need to impose both a major version number and minor version number on policy versions.

80000 is exceedingly unlikely for a version number assigned by a human, but I don't discount the possibility of some highly-automated, policy-revising PAP that increments a minor version number to very large values. If we impose a limit then what do we choose?

We need a determination on leading zeroes because it affects the implementation of version matching and interoperability. Apart from that we don't have a technical reason to be more restrictive.

@cdanger
Copy link
Author

cdanger commented Dec 18, 2024

^((0|[1-9]\d*)\.)*(0|[1-9]\d*)$

allows for:

1
2
0
9.800000

That seems incorrect.

I agree with you in theory it's differing a bit from SemVer as it allows to omit the Minor and Patch versions. But like @steven-legg , I considered this would be a bit tedious for Policy writers to always specify a Minor and Patch versions for policies. Still, to make everybody happy, we could specify a simplified versioning scheme where version X is considered equivalent to X.0.0 and X.Y is equivalent to X.Y.0. So that means...
1 = 1.0.0 ,
2 = 2.0.0,
9.800000 = 9.800000.0

800000 is still a valid integer for a Minor version, however big it is, so not an issue for me. Fixing any maximum value for a version would be arbitrary.

@humantypo
Copy link

Assuming that

1 = 1.0 = 1.0.0

What is the suggested matching mechanism?

I assume thar SemVer took the « minor version » requirement approach to help facilitate equality testing.

@cdanger
Copy link
Author

cdanger commented Dec 18, 2024

Well, currently the matching mechanism in XACML 3.0 says (VersionMatchType):

A '*' means that any single number is valid. A '+' means that any number, and any subsequent numbers, are valid. In this manner, the following four patterns would all match the version string '1.2.3': '1.2.3', '1.*.3', '1.2.*' and ‘1.+'.

So I propose two alternatives:

  1. Either we just add a new pattern to the current mechanism: Pattern X+ (without the dot) will match X or any version starting with X.
  2. Or we say: If the version is X (without dots), apply the matching mechanism as if it was X.0

(We don't need to worry about the patch version - third part - actually here.)

@humantypo
Copy link

Or we say: If the version is X (without dots), apply the matching mechanism as if it was X.0

I think we are getting closer with this reference but should we add something that makes references to "0" versions in particular so that 1 = 1.0 = 1.0.0 = 1.0.0.0, etc.

Perhaps something like "Trailing 0" version notations are to be ignored for for purposes of comparison

1.0 -> 1
4.0.0 -> 4
2.0.0.0 -> 2

Maybe with an example of:

1 = 1
3 = 3.0
2 = 2.0.0.0.0
1 != 1.0.0.0.1

Oddly pedantic, I know, but 0s are a unique case and it seems like a good idea to call out the fact that raw string matching isn't likely going to work with them.

@steven-legg
Copy link

Considering that #16 started with the proposition to dispense with versions and version matching because they weren't used in practice we are probably over engineering this. Anyone depending on version matching is already dealing with 1 != 1.0 and anyone who isn't doesn't care (except the need for implementations to be conformant). Let's fix ambiguities and move on.

@humantypo
Copy link

How do you suggest we fix the ambiguities? Seems like we either ignore them or specify in detail.

@steven-legg
Copy link

steven-legg commented Dec 19, 2024

What to do about numbers with leading zeroes isn't spelled out. The matching procedure is otherwise deterministic. We can either:

  • change the patterns to exclude leading zeroes (01 is illegal),
  • say leading zeroes are significant (01 and 1 are different numbers),
  • say leading zeroes are not significant (01 and 1 are the same number).

The first option leads to the easiest matching algorithm to implement.

The current matching procedure clearly decides that 1 != 1.0 and there is no single match string that can match all and only 1, 1.0 and 1.0.0. That doesn't bother me.

@humantypo
Copy link

Sounds like we're moving to simple string match. If 1 != 1.0 and 1.0. != 1.0.0 then 01 != 1 seems equally logical. In that case is regex needed at all...? We don't appear to have any semantic requirements (2 > 1.0?) other than simple character equality.

Also, if we can envision the case of some automated process generating a version like 0.8000000, perhaps there could be a case where a UUID is used to force uniqueness with each iteration? Seems unlikely, but I've always assumed that the ultimate purpose of this was to determine antecedence :)

@steven-legg
Copy link

If 1 != 1.0 and 1.0. != 1.0.0 then 01 != 1 seems equally logical. In that case is regex needed at all...?

It isn't. That would be the second option: say leading zeroes are significant.

We don't appear to have any semantic requirements (2 > 1.0?) other than simple character equality.

The IdReferenceType requires that the versions can be ordered from earliest to latest but an ordering isn't explicitly spelled out anywhere, so it's effectively implementation defined. There isn't even a requirement that two policies with the same ID must have different versions, though IdReferenceType needs this to be so in order to choose "the most recent one". Neither of two policies with the same ID and version can be considered more recent.

We need to specify that policies with the same ID cannot have equal versions, whatever way version equality is defined.

@humantypo
Copy link

humantypo commented Dec 19, 2024

This all makes sense. The question I now have is if there is value defining antecedent versioning? The use case that comes to mind is situations where “version > #” (vs “version = string”) could be determined from the value. To do that would require us to be more specific in our definition, to not do it would require the implementer to maintain and ordered list (which would be cumbersome).

Of course this depends on the validity of the use case. Is it something we think has merit?

@steven-legg
Copy link

In that case is regex needed at all...?

On reflection you were probably asking if we should impose no pattern on a version string and just use lexicographic string matching. That would be okay for equality matching but would not work for conventional version ordering, e.g., 1.10 < 1.9.

@humantypo
Copy link

Yes. And to do that I believe that we need a structure that clearly defines sequence which, in turn, requires logical equality specification. For example, if “2” is allowed as a version number there should be a mechanism for determining that is greater than 1.0, 1.9 and 1.0.0.8000000000.

@steven-legg
Copy link

Of course this depends on the validity of the use case. Is it something we think has merit?

We decided with #16 that we didn't have a use case but historically there were people that did so we kept versions. We can try to improve the version matching but we are flying blind without a solid use case.

Maybe it's not even fit for purpose. The current IdReferenceType allows some oddities, like an EarliestVersion or LatestVersion of "*.3". The LatestVersion is inclusive but I suspect that making it exclusive would be more useful. EarliestVersion="1", LatestVersion="2" includes every version starting with 1 but also includes 2, but not 2.0.1. So 1.9 and 2 are okay but 2.0.1 isn't? Versions 1 and 1.9 are presumably more different than 2 and 2.0.1. If LatestVersion were exclusive the range would only contain versions with major version 1 and would support what Cyril was looking for with "1+".

@humantypo
Copy link

Based on this conversation my read is that we are looking for a flexible solution that allows for a wide array of (reasonable) operational values (1, 0.1, 1.2.3.4, 1.80000), while being able to deterministically compare values for equality and antedence.

Here is my proposal:

We establish a basic version pattern of Major.Minor.Patch.Build, where each period separated number is an integer and only Major is required in a Rule/Policy. (Regex should not be needed given the general acceptance of what an integer is 🙂 )

When a comparison is performed, they candidate values are "expanded" using 0s for each of the missing elements, where applicable. More levels of versioning are not supported. Using the versions above we have:

1 -> 1.0.0.0
0.1 -> 0.1.0.0
1.2.3.4 unchanged
1.80000 -> 1.80000.0.0
1.2.3.4.5 -> invalid

Version order is determined by first comparing the integer values in this order: Major, Minor, Patch, Build. Where [pedantically] the smaller number is antecedent. Using the examples above yields:

1.8000 > 1.2.3.4 > 1 > 0.1

I tried a number of different ideas and kept coming back to realization I was overthinking this. "Use" and "Comparison" being treated independently was the clearest and simplest idea I could come up with.

@steven-legg
Copy link

Regex should not be needed given the general acceptance of what an integer is

I wouldn't count on that. It doesn't hurt to specify a pattern to remove any doubt, especially since this is a subset of what XACML 3.0 allows.

@humantypo
Copy link

…where an integer is defined as /(0|[1-9]{1,}/

Done. 🙂

@steven-legg
Copy link

Done.

Not quite, unless you meant to exclude version 10. The full pattern would be something like: "(0|[1-9]\d*)(.(0|[1-9]\d*)){0,3}".

@humantypo
Copy link

Nice catch 🙂

My primary goal was to exclude “.” because it makes the complicated regex mess that we’ve been wading through. By calling out version levels being separated by a period in the text I’m hoping we can reduce the regex to the simplest syntax possible (which is enlightening me as to why the SemVer people use “^$”).

This is the cleanest I could come up with: ^(0|[1-9][0-9]*)$

This matches:

0
10
1
11
12345

This doesn’t match:

00
01
0-1
1-0
A
bc

@steven-legg
Copy link

A regex for a non-negative integer in the text is easier for an implementor to disregard than a regex for a version string in the XML Schema or JSON Schema.

@humantypo
Copy link

ergo my suggestion for using the term “integer” 🙂

@steven-legg
Copy link

Interestingly, the XML Schema integer type allows leading zeroes.

@humantypo
Copy link

…and XPATH doesn’t match “01” and “1” unless you cast using number() which, I believe, will convert 01.1 to 1.1 and 01.1.1 to NaN, correct?

Fortunately, if we go with the “parse, cast, test” model it’s straightforward to cast “01” as 1. Of course this also means 1.01 = 1.1 or is invalid.

My suggestion is that we provide lots of examples. 🤣

@cdanger
Copy link
Author

cdanger commented Dec 24, 2024

Here is my proposal:

We establish a basic version pattern of Major.Minor.Patch.Build, where each period separated number is an integer and only Major is required in a Rule/Policy. (Regex should not be needed given the general acceptance of what an integer is 🙂 )

In that case, for matching Major.Minor.Patch.Build (where each Major / Minor / Patch / Build is a non-negative integer), the regex would be:
^((0|[1-9]\d*)\.){0,3}(0|[1-9]\d*)$

(The dot must be escaped because it is a special character in regex that otherwise represents any character, and \d is just an equivalent for [0-9], but more compact. The regex for a non-negative integer without useless leading zeros is (0|[1-9]\d*) ).

@humantypo
Copy link

^((0|[1-9]\d*)\.){0,3}(0|[1-9]\d*)$

Works for me. I'm OK with "0[0-9]" being invalid.

@steven-legg
Copy link

^((0|[1-9]\d*)\.){0,3}(0|[1-9]\d*)$

We don't need the anchors for an XML Schema pattern.

(0|[1-9]\d*)(\.(0|[1-9]\d*)){0,3} is probably more efficient to parse than
((0|[1-9]\d*)\.){0,3}(0|[1-9]\d*) since it doesn't require backtracking.

@cdanger
Copy link
Author

cdanger commented Dec 30, 2024

^((0|[1-9]\d*)\.){0,3}(0|[1-9]\d*)$

We don't need the anchors for an XML Schema pattern.

Yes, already mentioned that in a previous comment, but it was necessary for @humantypo to test with online regex checkers like pythex. Indeed, the ^$ are to be removed in the XSD.

(0|[1-9]\d*)(\.(0|[1-9]\d*)){0,3} is probably more efficient to parse than ((0|[1-9]\d*)\.){0,3}(0|[1-9]\d*) since it doesn't require backtracking.

All right. Fine with me.

@cdanger cdanger added the bug Something isn't working label Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants