Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

match beginning and end of line correctly #3575

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

matthias314
Copy link
Contributor

@matthias314 matthias314 commented Dec 14, 2024

When searching with buf:FindNext, the start of the search region currently always matches ^ and the end of the region $, no matter if they are the indeed the start or end of a line. Similar problems exist with other empty-string pattern like \b. The goal of this PR is to fix this. This is done by modifying the search pattern for the first and last lines (prepending and/or appending ..

Code common to findDown and findUp has been moved to separate functions. One could tweak the specific way it's done, but maybe it's better to first get some feedback if you are willing to go in this direction at all.

@matthias314 matthias314 force-pushed the m3/findnext-start-end branch from 6a8e479 to c0ba33e Compare December 19, 2024 14:15
@dmaluka
Copy link
Collaborator

dmaluka commented Jan 15, 2025

This "padded regex" approach feels hacky. Maybe it is reliable (I haven't analyzed it carefully yet), but how about a straightforward approach instead: search not just within the given range (start..end) but within the extended range from the beginning of the first line to the end of the last line, and filter out those matches that are outside start..end?

@matthias314
Copy link
Contributor Author

That doesn't work because FindAll and friends return only non-overlapping matches, see here. For example, for s := "abcd" and re := regexp.MustCompile(".."), the call re.FindAllStringIndex(s, -1) returns [[0 2] [2 4]]. Hence the method you are suggesting would claim that there is no match in s[1:3].

@dmaluka
Copy link
Collaborator

dmaluka commented Jan 15, 2025

Ok, fair point.

internal/buffer/search.go Outdated Show resolved Hide resolved
internal/buffer/search.go Outdated Show resolved Hide resolved
internal/buffer/search.go Outdated Show resolved Hide resolved
internal/buffer/search.go Outdated Show resolved Hide resolved
internal/buffer/search.go Outdated Show resolved Hide resolved
internal/buffer/search.go Outdated Show resolved Hide resolved
@matthias314 matthias314 force-pushed the m3/findnext-start-end branch from 926e411 to 2e3bc77 Compare January 16, 2025 02:06
@matthias314
Copy link
Contributor Author

I've force-pushed a new version:

  • the first commit is the old PR,
  • the second commit should address all your comments,
  • the third commit is a variant where I only use the two bitmasks padStart and padEnd instead of an enum with four elements. You may or may nor find this more elegant.

I've also rebased the PR so that one can test it together with the changes to the search functionality done in previous PRs.

internal/buffer/search.go Outdated Show resolved Hide resolved
@dmaluka
Copy link
Collaborator

dmaluka commented Jan 16, 2025

Thanks, LGTM (apart from my above comment about comments, and the fact that it's better to squash such fix-up changes into one commit, but we can squash them later).

internal/buffer/search.go Outdated Show resolved Hide resolved
internal/buffer/search.go Outdated Show resolved Hide resolved
@dmaluka
Copy link
Collaborator

dmaluka commented Jan 16, 2025

Side note: looks like we can deduplicate the code in findUp()/findDown() even more. Looks like both functions are identical except that one uses FindIndex() while the other one uses FindAllIndex() and takes the last match. So we can merge them into one function, e.g. find() or findNext(), with down argument.

@matthias314
Copy link
Contributor Author

matthias314 commented Jan 16, 2025

I've combined findUp and findDown to findNext, with an additional argument down. However, I wonder if down is necessary at all. Could we replace it by start.LessEqual(end)? If we keep down, why don't we change the calls of findNext so that start is always less than or equal to end?

Moreover, the code at the beginning of findNext (copied from the old functions) is not fully clear to me. Why may start.X and end.X be modified (to different values!) before we test if start is greater than end?

One might think of making findNext even more general so that it can be used for ReplaceRegex, too (and also for the function FindNextSubmatch that I proposed in #3552).

EDIT: We might also want the first argument to findNext to be the full set of four regexps we're using.

internal/buffer/search.go Outdated Show resolved Hide resolved
@dmaluka
Copy link
Collaborator

dmaluka commented Jan 17, 2025

Moreover, the code at the beginning of findNext (copied from the old functions) is not fully clear to me. Why may start.X and end.X be modified (to different values!) before we test if start is greater than end?

When start or end is beyond the last line of the buffer, i.e. its Y is greater than b.LinesNum()-1, we clamp it to be exactly at the end of the buffer, so that we don't try to search in non-existent lines. (Even though LineBytes() returns an empty string for non-existent lines so in most cases search would probably work correctly even if we didn't clamp, but e.g. if the search pattern is just ^ or $, we would get unwanted matches in those non-existent lines.) So we set Y to the last line, and that's not enough, we also need to set X to the end of this last line. For example, if the buffer has 100 lines and the last line has 50 characters, and end passed to findDown() is 0, 200, we should change it to 50, 99, not to 0, 99 (so that we search up to the very end of the buffer, including the last line).

@matthias314 matthias314 force-pushed the m3/findnext-start-end branch from 8bd0cac to 3e16d5b Compare January 17, 2025 12:48
@dmaluka
Copy link
Collaborator

dmaluka commented Jan 17, 2025

LGTM, but I've just noticed that it still doesn't fix the issue in the case of replace -a.

So I guess we also need a similar fix in ReplaceRegex().

(Unless we treat this as the intended behavior, but then it's inconsistent with replace without -a...)

@matthias314
Copy link
Contributor Author

matthias314 commented Jan 17, 2025

I was thinking of modifying ReplaceRegex in a separate PR. But of course we could also do it here. Whatever you prefer.

@dmaluka dmaluka requested a review from JoeKar January 19, 2025 11:50
@dmaluka
Copy link
Collaborator

dmaluka commented Jan 19, 2025

@JoeKar what do you think about this PR? Maybe any ideas of an alternative approach?

I was thinking of modifying ReplaceRegex in a separate PR. But of course we could also do it here. Whatever you prefer.

I'm fine either way. Although it seems reasonable to address it in this PR, since it's rather a part of the same issue, and the way to fix it is probably just to adopt the same approach as in FindNext().

@JoeKar
Copy link
Collaborator

JoeKar commented Jan 19, 2025

@JoeKar what do you think about this PR? Maybe any ideas of an alternative approach?

This "padded regex" approach feels hacky.

Indeed it is hacky, but it seems to fit the needs via the help of regular expressions on its own. Otherwise we've to solve it with additional code dropping (parts of) results.
The only thing which bothers me right now is, that we (Must)Compile resp. parse the regular expressions twice, while we should be able to prevent this and compile it only once...with the same padding approach (e.g. padRegexp(s string)).

@dmaluka
Copy link
Collaborator

dmaluka commented Jan 19, 2025

The only thing which bothers me right now is, that we (Must)Compile resp. parse the regular expressions twice, while we should be able to prevent this and compile it only once...with the same padding approach (e.g. padRegexp(s string)).

Well, yes, we could compile the padded regexps in FindNext() and pass all 4 regexps to findUp() and findDown()... I'm not sure that's a big deal though. That is not the only waste we have here. For instance, we always compile all 4 regexps, while we never use them all. For example, rPadBoth is never used if the search range consists of more than one line (which we could easily check, and avoid compiling rPadBoth in this case, which would complicate the code, with doubtful usefulness).

Not even mentioning the fact that we compile the same regex every time when calling FindNext() for the same pattern multiple times.

@dmaluka
Copy link
Collaborator

dmaluka commented Jan 19, 2025

Hmm... Throughout the entire loop, fineLineParams() returns padStart at most once (for the first line), and padEnd at most once (for the last line), right? So we could both simplify and optimize the code by removing padRegexp() and instead letting fineLineParams() itself compile and return the padded regexp - the one exact padded regexp that is needed, exactly in the case when it is needed?

That should also address your concern about compiling twice, since even when we call findUp() or findDown() twice, we do that because we wrap, so we still have padStart or padEnd at most once throughout FindNext(), thus findLineParams() will still compile the corresponding padded regex at most once.

@matthias314
Copy link
Contributor Author

Several questions and comments about the extension to ReplaceRegex():

  • ReplaceRegex() uses re.ReplaceAllFunc() instead of re.ReplaceAll(). As far as I can tell, the reason is to get the correct value for netrunes (which counts characters, not runes). This is only used in ReplaceCmd() for interactive replacements. There one could probably get the right count separately, which would simplify ReplaceRegex(). Would that be a good idea? Or rather not because it would change the signature of the function and therefore be breaking?
  • One new difficulty is that one cannot use re.ReplaceAllFunc() inside ReplaceRegex() if the regexp is padded. (The reason is again that one would miss matches where the padding leads to an overlap.) Instead, one would have to call FindNext() (or findDown()) to find all matches in such a line and then do the replacement separately for each match. The whole thing would be easier if we had a function like FindAll. I think such a function would be of general use. Would you be interested in that?
  • Actually, it would be even more helpful to have something like FindNextSubmatch: return submatches when searching #3552 or FindAllSubmatch. Otherwise one would have to call re.ReplaceAll() instead of re.Expand() for each match on the first/last line because the submatch information is not available. The same currently applies to ReplaceCmd() (where ReplaceRegex() is called instead of re.Expand()). Again, I think a submatch functionality would be useful in general. Interested?

@JoeKar
Copy link
Collaborator

JoeKar commented Jan 19, 2025

So we could both simplify and optimize the code by removing padRegexp() and instead letting fineLineParams() itself compile and return the padded regexp - the one exact padded regexp that is needed, exactly in the case when it is needed?

Sounds good. 👍

@matthias314
Copy link
Contributor Author

As a first step, I've implemented @dmaluka's suggestion to remove padRegexp.

@dmaluka
Copy link
Collaborator

dmaluka commented Jan 21, 2025

ReplaceRegex() uses re.ReplaceAllFunc() instead of re.ReplaceAll(). As far as I can tell, the reason is to get the correct value for netrunes (which counts characters, not runes). This is only used in ReplaceCmd() for interactive replacements. There one could probably get the right count separately, which would simplify ReplaceRegex(). Would that be a good idea?

Maybe... Hard to say without seeing an actual implementation.

IIUC it is not directly related to the issue addressed by this PR, it would be just an extra improvement?

Or rather not because it would change the signature of the function and therefore be breaking?

I wouldn't worry too much about that. We already changed ReplaceRegex() interface once, in #2954. No one seemed to complain. We are not really strict about preserving compatibility of interfaces that are primarily for micro's internal use and just happen to be exposed to plugins (by virtue of being public methods/fields of objects that are returned to plugins by some other functions). Dogmatically preserving them all would be counter-productive. (Although it's IMO quite unfortunate that we have such vagueness of the APIs for plugins and their stability guarantees and so on, but that's how the plugin system is designed, unfortunately.)

One new difficulty is that one cannot use re.ReplaceAllFunc() inside ReplaceRegex() if the regexp is padded. (The reason is again that one would miss matches where the padding leads to an overlap.) Instead, one would have to call FindNext() (or findDown()) to find all matches in such a line and then do the replacement separately for each match.

Yep, and that seems inevitable. And I imagine we wouldn't need to do that per line, we can just call FindNext() in a loop for the entire range, similarly to what ReplaceCmd() does for interactive replace?

The whole thing would be easier if we had a function like FindAll. I think such a function would be of general use. Would you be interested in that?

I'm not sure. A file may be large, the number of matches may be huge, do we really want to store and return them all and then process them, if we can instead just call FindNext() in a loop and thus process them one by one on the fly?

Actually, it would be even more helpful to have something like #3552 or FindAllSubmatch. Otherwise one would have to call re.ReplaceAll() instead of re.Expand() for each match on the first/last line because the submatch information is not available. The same currently applies to ReplaceCmd() (where ReplaceRegex() is called instead of re.Expand()). Again, I think a submatch functionality would be useful in general. Interested?

Maybe. Again, hard to discuss without an actual implementation.

@dmaluka
Copy link
Collaborator

dmaluka commented Jan 21, 2025

As a first step, I've implemented @dmaluka's suggestion to remove padRegexp.

LGTM.

@matthias314
Copy link
Contributor Author

matthias314 commented Jan 21, 2025

Here is a modification of ReplaceRegex. I didn't bother too much about optimizations. For example, one could treat the first and last lines separately only if they are not complete lines. Also, the padded regexps may be compiled several times for these lines, and findAll could be implemented with a callback function, analogously to ReplaceAllFunc. The code assumes that the replacement text has no newlines; I think this assumption was implicitly made before already.

EDIT: Calls of the form util.CharacterCount(b.LineBytes(i)) occur several times in ReplaceRegex and even more so elsewhere in the code. What about a dedicated CharacterCount function for LineArray?

@matthias314 matthias314 force-pushed the m3/findnext-start-end branch from e91bc2e to 7943952 Compare January 22, 2025 23:15
@matthias314
Copy link
Contributor Author

I've force-pushed a slightly modified version. It works the same, but the code is cleaner now.

@dmaluka
Copy link
Collaborator

dmaluka commented Jan 25, 2025

I've force-pushed a slightly modified version. It works the same, but the code is cleaner now.

So please squash the last 2 commits into one (to let reviewers review the relevant version, not waste time on the outdated one).

@matthias314 matthias314 force-pushed the m3/findnext-start-end branch from 7943952 to 3822986 Compare January 25, 2025 20:42

from := Loc{0, i}.Clamp(start, end)
to := Loc{charCount, i}.Clamp(start, end)
matches := b.findAll(search, from, to)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, what is the advantage of using findAll() to remember all matches in some temporary storage to handle them afterwards, instead of just using findDown() and handling each match right away?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could do that. I wrote the code with the idea in mind to have some public FindAll function at a later point. That code would be based on the current findAll. One could change findAll to some findAllFunc that takes a callback function as argument (simialr to ReplaceAllFunc). This way the problem of creating temporary lists would disappear. (That would also be a good idea for a public function.) Currently findAll is only called for the first and last lines, so that performance doesn't really matter. That would change of course if we used this function for all lines.

result = search.Expand(result, replace, in, submatches)
l := b.LineBytes(i)
charCount = util.CharacterCount(l)
if i == start.Y || i == end.Y {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just do this findDown() based logic (instead of the old search.ReplaceAllFunc() based logic) for all lines, not just for the first and last line?

And considering my other comment, I imagine we don't even need to deal with individual lines here, we can simply call findDown() in a loop for the entire range? Let findDown() itself care about individual lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can do that. We just have to make sure that we know how the length of the last line changes.

Regarding performance: Currently we create one delta for each "middle" line, even if there is no replacement in that line. If we use the other code, we would have one delta for each replacement. I think that in most cases that would be more efficient. However, if there are many changes in each line, then this would become very inefficient. Also note that without remembering the padded regexps, we would compile the front-padded regexps once for each replacement, so we should probably store it somewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants