Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add opt-in strict string format validation #451

Merged
merged 11 commits into from
Dec 5, 2024

Conversation

zzstoatzz
Copy link
Contributor

@zzstoatzz zzstoatzz commented Nov 22, 2024

adds pydantic validation per #406 (comment). see the usage example

format example pattern/constraints
did did:plc:z72i7hdynmk6r22z27h6tvur did:[a-z]+:[a-zA-Z0-9._%-]+ (max 2KB, no /?#[]@)
handle alice.bsky.social Domain name: 2+ segments, ASCII alphanumeric/hyphens, 1-63 chars per segment, max 253 total. Last segment no leading digit.
at-uri at://alice.bsky.social/app.bsky.feed.post/3jxtb5w2hkt2m at:// + handle/DID + optional /collection/rkey. Max 8KB. No query/fragment in Lexicon.
datetime 2024-11-24T06:02:00Z ISO 8601/RFC 3339 with required 'T', seconds, timezone (Z/±HH:MM). No -00:00.
nsid app.bsky.feed.post 3+ segments: reversed domain (lowercase alphanum+hyphen) + name (letters only). Max 317 chars.
tid 3jxtb5w2hkt2m 13 chars of [2-7a-z]. First byte's high bit (0x40) must be 0.
record-key 3jxtb5w2hkt2m 1-512 chars of [A-Za-z0-9._:~-]. "." and ".." forbidden.
uri https://example.com/path RFC-3986 URI with letter scheme + netloc/path/query/fragment. Max 8KB, no spaces.
did:plc did:plc:z72i7hdynmk6r22z27h6tvur Method identifier must be 24 chars of base32 ([a-z2-7]). Cannot include underscore.

does not address open questions regarding the language types or re2. happy to chat on those if they'd be blocking here

may close #406

notes

I ran this to generate models

uvx poetry run atproto gen models

and this to update docs

uvx poetry run make -s -C docs gen

and this to run the new tests

uvx poetry run pytest tests/test_atproto_client/models/tests/test_string_formats.py

if desirable, I can follow up this PR to add contribution instructions to the current README section (for uv at least, and perhaps a TODO for proper poetry usage, since I don't really use that) - let me know!


in/valid examples for tests are sourced from https://github.com/bluesky-social/atproto/tree/main/interop-test-files/syntax

@zzstoatzz zzstoatzz force-pushed the strict-string-format branch from 47c3aab to 571edfa Compare November 23, 2024 04:41
pyproject.toml Outdated Show resolved Hide resolved
@zzstoatzz zzstoatzz changed the title [wip] add opt-in strict string format add opt-in strict string format Nov 23, 2024
@zzstoatzz zzstoatzz marked this pull request as ready for review November 23, 2024 04:49
Copy link
Owner

@MarshalX MarshalX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome hard work! thank you! speaking of your gotten knowledges about how to contribute we need to write CONTRIBUTING.md someday. but this will be separated PR for sure

packages/atproto_codegen/models/generator.py Outdated Show resolved Hide resolved
packages/atproto_codegen/models/generator.py Outdated Show resolved Hide resolved
packages/atproto_codegen/models/generator.py Outdated Show resolved Hide resolved
packages/atproto_codegen/models/generator.py Show resolved Hide resolved
packages/atproto_codegen/models/generator.py Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
@zzstoatzz zzstoatzz force-pushed the strict-string-format branch from 6ad722a to 6de9cac Compare November 23, 2024 16:49
@zzstoatzz
Copy link
Contributor Author

thanks for the review @MarshalX! comments should be addressed in 6ad722a

@zzstoatzz zzstoatzz force-pushed the strict-string-format branch 3 times, most recently from 7bcd93e to eb1d28b Compare November 24, 2024 22:49
@zzstoatzz zzstoatzz requested a review from MarshalX November 25, 2024 00:43
@zzstoatzz zzstoatzz force-pushed the strict-string-format branch 3 times, most recently from 8d7deee to 1c82c73 Compare November 25, 2024 13:18
@zzstoatzz zzstoatzz force-pushed the strict-string-format branch 2 times, most recently from 34444be to 9807a81 Compare November 29, 2024 23:12
@zzstoatzz
Copy link
Contributor Author

zzstoatzz commented Nov 30, 2024

hi @MarshalX - ok all of the cases are passing with no skipping!

I found that the spec and test cases in the interop valid/invalid .txt examples were not entirely covered by the types in atproto_core, so for now I have tended towards duplicating regex in string_formats.py and documenting with comments why I included the logic I did.

For example, looking at the AT-URI validation, I noticed what appear to be few differences between the regex I added and atproto_core's AtUri that seem important for passing the interop tests:

  1. The spec requires "at://" prefix (see tests like "a://did:plc:asdf123" and "at:/did:plc:asdf123" being invalid)
  2. The path must be a valid NSID format (invalid like "at://did:plc:asdf123/12345" and "at://did:plc:asdf123/short/stuff" show this)
  3. The handle authority section has some idiosyncrasies - tests fail cases like "at://name", "at://name.0", "at://bsky" that a general URI parser might accept

let me know if you think i'm missing something here or if you think I should move some of these requirements into the types in atproto_core instead

also, just in case it comes up: i had a tricky problem with loading the test cases where I was calling .strip() on each line of the .txt files, but at least one of the test cases was checking for whitespace 🙃

Copy link
Owner

@MarshalX MarshalX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

such an amazing hard work you did! let's move that discussion about different of the AtUri to different issue; for now we are okay to have 2 validation logic.

i have nothing to add except comments that i left and question about documentation. since we support .format of the LexString do we want to show it in the documentation? Right now fields are not show their format at the atproto.blue. We could create separate issue for it to finally merge your great PR

The link to built docs from this PR is https://atproto--451.org.readthedocs.build/en/451/

packages/atproto_client/models/utils.py Outdated Show resolved Hide resolved
packages/atproto_client/models/string_formats.py Outdated Show resolved Hide resolved
packages/atproto_client/models/string_formats.py Outdated Show resolved Hide resolved
@zzstoatzz zzstoatzz force-pushed the strict-string-format branch from abda096 to a1f3e29 Compare December 2, 2024 04:10
@zzstoatzz
Copy link
Contributor Author

let's move that discussion about different of the AtUri to different issue

We could create separate issue for it to finally merge your great PR

will do! thanks for all the help thus far 🙂

@zzstoatzz zzstoatzz requested a review from MarshalX December 2, 2024 15:05
@MarshalX
Copy link
Owner

MarshalX commented Dec 3, 2024

i did the last check before merging and found something around TypedDict. could you pls check? probably we should not use new strict string formats for TypedDict?

image

@zzstoatzz
Copy link
Contributor Author

i did the last check before merging and found something around TypedDict. could you pls check? probably we should not use new strict string formats for TypedDict?

I will take a look as soon as I can!

@zzstoatzz zzstoatzz force-pushed the strict-string-format branch from bad5512 to 1ea9021 Compare December 5, 2024 00:47
@zzstoatzz
Copy link
Contributor Author

hi @MarshalX - hrmm can you share more on how you ended up with that? I may be missing something silly but can't directly reproduce that (with an example like this)

example
from typing import TypedDict

from atproto_client.models import base, string_formats
from pydantic import TypeAdapter, ValidationError


class Params(base.ParamsModelBase):
    actor: string_formats.Handle


class ParamsDict(TypedDict):
    actor: string_formats.Handle


# Test data
some_good_handle = 'test.bsky.social'
some_bad_handle = 'invalid@ @handle'

# Test both with and without strict validation
strict_context = {'strict_string_format': True}
non_strict_context = {'strict_string_format': False}

# Initialize adapters with explicit types
HandleTypeAdapter: TypeAdapter[str] = TypeAdapter(string_formats.Handle)
ParamsTypeAdapter: TypeAdapter[ParamsDict] = TypeAdapter(ParamsDict)
ParamsModelAdapter: TypeAdapter[Params] = TypeAdapter(Params)


def run_validation_test(context: dict[str, bool]) -> None:
    mode = 'strict' if context['strict_string_format'] else 'non-strict'
    print(f'\n=== Testing {mode} mode ===')

    print('\n1. Direct Handle validation:')
    try:
        handle_result: str = HandleTypeAdapter.validate_python(some_bad_handle, context=context)
        print(f'Validation passed: {handle_result=}')
    except ValidationError as e:
        print(f'Validation failed: {e!s}')

    print('\n2. TypedDict validation:')
    try:
        params_result: ParamsDict = ParamsTypeAdapter.validate_python({'actor': some_bad_handle}, context=context)
        print(f'Validation passed: {params_result=}')
    except ValidationError as e:
        print(f'Validation failed: {e!s}')

    print('\n3. Pydantic Model validation:')
    try:
        model_result: Params = ParamsModelAdapter.validate_python({'actor': some_bad_handle}, context=context)
        print(f'Validation passed: {model_result=}')
    except ValidationError as e:
        print(f'Validation failed: {e!s}')


run_validation_test(non_strict_context)
run_validation_test(strict_context)
stdout
» python examples/advanced_usage/validate_string_formats.py

=== Testing non-strict mode ===

1. Direct Handle validation:
Validation passed: handle_result='invalid@ @handle'

2. TypedDict validation:
Validation passed: params_result={'actor': 'invalid@ @handle'}

3. Pydantic Model validation:
Validation passed: model_result=Params(actor='invalid@ @handle')

=== Testing strict mode ===

1. Direct Handle validation:
Validation failed: 1 validation error for function-before[wrapper(), str]
  Value error, Invalid handle: must be a domain name (e.g. user.bsky.social) with max length 253 [type=value_error, input_value='invalid@ @handle', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error

2. TypedDict validation:
Validation failed: 1 validation error for typed-dict
actor
  Value error, Invalid handle: must be a domain name (e.g. user.bsky.social) with max length 253 [type=value_error, input_value='invalid@ @handle', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error

3. Pydantic Model validation:
Validation failed: 1 validation error for Params
actor
  Value error, Invalid handle: must be a domain name (e.g. user.bsky.social) with max length 253 [type=value_error, input_value='invalid@ @handle', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error

@MarshalX
Copy link
Owner

MarshalX commented Dec 5, 2024

@zzstoatzz i mean does it rly need to be complex type with pydantic validation for TypedDict? Speaking of the highlighted problem on the screenshot, probably PyCharm (Intellij + Python plugin in my case) is just dumb xD

I do not say that it won't work. It will. But IDE`s analyzers struggle with it.

We use TypedDict only to help the users with field suggestions in IDE. And for static type analyzers (for someone who cares). When the user uses dict-based params instead of proper model. But under the hood, we will use proper model. So this is just a syntax sugar for the users who want to write it fast in unsafe dict-driven Python manure

tl;dr no problems with it; we just do not use TypedDict directly (only for type annotating); my IDE could not handle it

@MarshalX MarshalX changed the title add opt-in strict string format Add opt-in strict string format validation Dec 5, 2024
@MarshalX MarshalX merged commit f062df4 into MarshalX:main Dec 5, 2024
22 of 23 checks passed
@zzstoatzz
Copy link
Contributor Author

zzstoatzz commented Dec 5, 2024

oh ok! I was using pylance strict but I probably missed something and I'm happy to take a stab at any further type-smoothing that would be helpful!

thanks again for all your help 🙂

@zzstoatzz zzstoatzz deleted the strict-string-format branch December 5, 2024 23:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement pydantic validators for string format checks
2 participants