Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More robust reading of filenames #20

Merged
merged 3 commits into from
Oct 7, 2024

Conversation

Leonidas-from-XIV
Copy link
Contributor

This adds a lexer based on ocamllex to read the filenames of the patches.

Fixes #17.

@hannesm
Copy link
Owner

hannesm commented Sep 27, 2024

Thanks a lot, I'm curious about error conditions:

  • --- a/foo "bar" baz
  • --- "a/foo" bar baz
  • --- /dev/null10

Should these error out? Or is such a filename legal? I'd be happy to have them as part of the test suite (and I'm fine to consider them illegal and error out).

I'm asking since you clearly have more of the diff file format in your head than me. Thanks again.

@Leonidas-from-XIV
Copy link
Contributor Author

Thanks for the edge cases. Coming up with these is sometimes tricky :-)

  • --- a/foo "bar" baz should be legal, I just ran the three tools and added their output to the tests. Looks like I have some patching to do.
  • --- "a/foo" bar baz I think that would imply that a file has a / in its name which is not possible on most filesystems so I'd say this is just strict wrong and should be rejected.
  • --- /dev/null10 is a legal filename, but the parser will probably return DevNull by error here. I'll fix it.

@hannesm
Copy link
Owner

hannesm commented Sep 27, 2024

Thanks for your additional commits, what I had in mind is something with a starting quote:

diff --git a/test/test.ml b/test/test.ml
index 94f9e83..cb726a1 100644
--- a/test/test.ml
+++ b/test/test.ml
@@ -880,6 +880,17 @@ let busybox_diff_quotes = {|\
 let busybox_diff_quotes =
   operations [Patch.Edit ({|foo "bar" baz|}, {|foo bar "baz"|})] busybox_diff_quotes
 
+let busybox_diff_quotes2 = {|\
+--- "foo bar" baz
++++ "foo" bar "baz"
+@@ -1 +1 @@
+-This is wrong.
++This is right.
+|}
+
+let busybox_diff_quotes2 =
+  operations [Patch.Edit ({|"foo bar" baz|}, {|"foo" bar "baz"|})] busybox_diff_quotes2
+
 let filename_diffs =
   [
     "unified diff with spaces", `Quick, unified_diff_spaces;
@@ -888,6 +899,7 @@ let filename_diffs =
     "unified diff with quotes", `Quick, unified_diff_quotes;
     "git diff with quotes", `Quick, git_diff_quotes;
     "busybox diff with quotes", `Quick, busybox_diff_quotes;
+    "busybox diff with quotes 2", `Quick, busybox_diff_quotes2;
   ]
 
 let tests = [

And that fails at the moment with:

│ [FAIL]        filename                  3   unified diff with quotes.                                 │
└───────────────────────────────────────────────────────────────────────────────────────────────────────┘
ASSERT File "test/test.ml", line 811, characters 34-41
FAIL File "test/test.ml", line 811, characters 34-41

   Expected: `[--- a/foo "bar" baz
+++ b/foo bar "baz"
]'
   Received: `[--- a/foo \
+++ b/foo bar \
]'

EDIT: I'm happy to fail in this case, but the parser dropping parts of the data and applying the wrong diff is not super-nice.

@Leonidas-from-XIV
Copy link
Contributor Author

True, there's an additional error case when the file starts with quotes. I've updated the tests so the destination filename starts with a quote.

I have a feeling that it might not be possible to read the filenames with spaces and quotes with just the lexer reading a string, given busybox and git diff do the opposite things when they encounter quotes and spaces (git diff quotes spaces and escapes quotes, busybox always uses literal quotes). Might need to properly tokenize and postprocess the outcome.

@hannesm
Copy link
Owner

hannesm commented Sep 27, 2024

Dear @Leonidas-from-XIV,

True, there's an additional error case when the file starts with quotes. I've updated the tests so the destination filename starts with a quote.

Thanks. I added some comments on what I'd expect the outcome to be.

I have a feeling that it might not be possible to read the filenames with spaces and quotes with just the lexer reading a string, given busybox and git diff do the opposite things when they encounter quotes and spaces (git diff quotes spaces and escapes quotes, busybox always uses literal quotes). Might need to properly tokenize and postprocess the outcome.

I have no file with a " on my system. I'd as well be fine to error out, and basically have the lexer check whether the entire string is consumed -- and if not, error.

If you strive for supporting both git diff and busybox, please also feel free to do so :)

My intuition is only to "better fail than use incomplete data" -- i.e. in the "foo " bar case. I'm happy to have a not 100% compatible implementation, and the first issue report where someone has a quote in their filename could serve as motivation to finish such an implementation.

@Leonidas-from-XIV
Copy link
Contributor Author

Leonidas-from-XIV commented Sep 27, 2024 via email

@Leonidas-from-XIV
Copy link
Contributor Author

Leonidas-from-XIV commented Sep 27, 2024

Making it pass diff -u, git diff and busybox diff output was a bit tricky.

There is still edge case: if the file is called "foo", then git diff will generate "a/\"foo\"" but busybox diff will generate "foo" and the unquoting logic does not know whether it needs to remove the outer quotes or not.

@kit-ty-kate kit-ty-kate self-requested a review September 27, 2024 23:12
@kit-ty-kate
Copy link
Collaborator

Thanks a lot for this work! I have several ideas on how to improve this PR, i'll take over this work if that's ok with you, and i hope to be done on Monday or Tuesday night if all goes well

@kit-ty-kate kit-ty-kate marked this pull request as draft September 28, 2024 01:17
@kit-ty-kate kit-ty-kate self-assigned this Sep 28, 2024
@kit-ty-kate kit-ty-kate marked this pull request as ready for review October 3, 2024 15:02
@kit-ty-kate
Copy link
Collaborator

I think this is ready to review. Don't mind the commit history, i'll squash and clean things before merging.

I strongly believe this supports everything GNU diff can throw at it and every other diff. Some OS have diffs that can output files that cannot be parsed by their own patch utility (e.g. if a filename contains a tab or a newline on any diff utility other than GNU or git it's pretty much game over) and for that we can't do anything. There is still the ambiguity between "some quoted filename" between GNU/git and the others but for that there isn't anything we can do either apart from having proper modes which i think would probably be a waste of time.

Once this PR is merged i'm hoping to:

What do you think @hannesm?

@hannesm
Copy link
Owner

hannesm commented Oct 3, 2024

Dear Kate (and Marek),

I'm very happy of the work you're doing on this package. Incredibly happy. I briefly looked at the API change of this PR, and am mostly fine with it. The minor thing I find slightly annoying is the "exception Error" - I'd appreciate something that is more self-explanatory (i.e. 'exception Filename_parse_error of string' -- to include the "offending" data -- so if someone hits it, we easily get that information.

* continue [Add a ~p parameter to Patch.to_diffs mimicking the behaviour of patch -p<num> #9](https://github.com/hannesm/patch/pull/9)

Awesome!

* fix the printer when confronted with unprintable characters

Nice!

* open a new PR to use the new `exception Error` in a lot more places because the current error modes are a bit all over the place (asserts, failure, invalid_args, …) and should probably be cleaned up before use in opam

I agree that deciding on a single error mechanism is good, and assert false is not a good choice.

* test the new code in [Apply patches when dealing with repositories via an OCaml implementation of patch ocaml/opam#5892](https://github.com/ocaml/opam/pull/5892) and for every opam-repository packages with patches

\o/

* fix things if needed

yes, that'd be fantastic :D

* do a release

👯 👍🏾

What do you think @hannesm?

I'm happy. I've limited time for reviewing at the moment. If you're stuck / undecided on some decision, feel free to reach out to me.

Otherwise, feel free to push stuff forward - I think you have full permissions to this repository. This PR is great, adds tests and fixes issues that exist for real.

@kit-ty-kate
Copy link
Collaborator

The minor thing I find slightly annoying is the "exception Error" - I'd appreciate something that is more self-explanatory (i.e. 'exception Filename_parse_error of string' -- to include the "offending" data -- so if someone hits it, we easily get that information.

how does the last commit look?

@kit-ty-kate
Copy link
Collaborator

I have a version cleanly rebased at https://github.com/kit-ty-kate/patch/commits/filename-spaces/
@Leonidas-from-XIV does that look good for you?

@hannesm
Copy link
Owner

hannesm commented Oct 4, 2024

how does the last commit look?

looks good, thanks!

Copy link
Contributor Author

@Leonidas-from-XIV Leonidas-from-XIV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finishing and refactoring the PR. I have some small suggestions but it looks good overall and I think this gets as close to parsing a filename as we can get without getting extra info about the format to parse.

@@ -2,8 +2,9 @@
(name patch)
(synopsis "Patch purely in OCaml")
(public_name patch)
(modules patch)
(wrapped false))
(modules patch patch_lib fname))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion would be dropping this and just moving the executable into a different folder, thus new modules would be picked up automatically.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be done later

(** [parse s] parses [s] and returns a filename or [None] if the filename
is equivalent to [/dev/null].

Returns [Error msg] in case of error. *)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is a clear API, when I get Ok None I am not sure I would think of /dev/null.

I know I started with this idea originally, but I think returning Ok "/dev/null" is better and then other tools on top can map "/dev/null" to whatever logic they find reasonable.

Copy link
Collaborator

@kit-ty-kate kit-ty-kate Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module is not accessible outside of the patch library so i think we should rather make the internal API consistent with the rest of the code which expects an option (until someone proposes something better, although i'm personally fine with the current state of things)

@kit-ty-kate
Copy link
Collaborator

Thanks!

@kit-ty-kate kit-ty-kate merged commit 81d31bd into hannesm:main Oct 7, 2024
1 check was pending
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Robust filename parser
3 participants