Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misc sort improvements #1706

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

q3cpma
Copy link

@q3cpma q3cpma commented May 7, 2024

Hello, first contribution and first time touching Go here, so don't be nice and roast me well if I did some dumb things.

I was looking at the sort function because I want to implement some features and noticed some less than optimal code, be it in performance or "looking nice to me".
I didn't add any dependency (though text isn't indirect anymore), but https://github.com/jeandeaual/go-locale might be a good idea to use the system locale instead of hardcoding English. The other improvement that might be a good idea is to use slices.SortStableFunc, as recommended by Go; though it's only builtin starting with 1.22.

Here are the performance improvements I measured by timing sort and logging
Before:

sort duration for directory /home (numfiles: 1): 8.4µs
sort duration for directory /home/q3cpma (numfiles: 21): 445.601µs
sort duration for directory /home/q3cpma/Programming/externe/lf/build (numfiles: 1): 6.141µs
sort duration for directory /home/q3cpma/Programming/externe/lf/etc (numfiles: 19): 22.72µs
sort duration for directory /home/q3cpma/Programming/externe/lf/gen (numfiles: 4): 14.62µs
sort duration for directory /home/q3cpma/Programming/externe/lf (numfiles: 42): 156.22µs
sort duration for directory /home/q3cpma/Programming/externe/lf/test (numfiles: 100000): 55.838499ms
sort duration for directory /home/q3cpma/Programming/externe (numfiles: 25): 109.811µs
sort duration for directory /home/q3cpma/Programming (numfiles: 44): 286.64µs
sort duration for directory / (numfiles: 18): 31.15µs

After:

sort duration for directory /home (numfiles: 1): 41.52µs
sort duration for directory /home/q3cpma (numfiles: 21): 95.641µs
sort duration for directory /home/q3cpma/Programming/externe/lf/build (numfiles: 1): 12.6µs
sort duration for directory /home/q3cpma/Programming/externe/lf/etc (numfiles: 19): 33.01µs
sort duration for directory /home/q3cpma/Programming/externe/lf/gen (numfiles: 4): 25.06µs
sort duration for directory /home/q3cpma/Programming/externe/lf (numfiles: 45): 103.98µs
sort duration for directory /home/q3cpma/Programming/externe/lf/test (numfiles: 100000): 31.911716ms
sort duration for directory /home/q3cpma/Programming/externe (numfiles: 25): 49.76µs
sort duration for directory /home/q3cpma/Programming (numfiles: 44): 65.4µs
sort duration for directory / (numfiles: 18): 32.79µs

The large test directory contains touch test/{1..100000} as a more throughput oriented benchmark, since the other directories are basically instantaneously sorted.

Copy link
Collaborator

@joelim-work joelim-work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for your PR submission. Setting aside performance, I think it is a good idea to use this collate package anyway as it makes the code more readable.

That being said, I tried your branch very briefly and found that if dirfirst is disabled then the sorting is skipped entirely since sort.SliceStable is never called. Please do more extensive testing to ensure the various sorting options are covered.

nav.go Outdated Show resolved Hide resolved
@q3cpma
Copy link
Author

q3cpma commented May 8, 2024

Thanks for the quick review, and you're right, I was a bit tired when I made this.

nav.go Outdated
Comment on lines 229 to 277
case naturalSort:
sort.SliceStable(dir.files, func(i, j int) bool {
s1, s2 := normalize(dir.files[i].Name(), dir.files[j].Name(), dir.ignorecase, dir.ignoredia)
if !dir.reverse {
return naturalLess(s1, s2)
} else {
return naturalLess(s2, s1)
}
})
lessfun = func(i, j int) bool {
return coll.CompareString(dir.files[i].Name(), dir.files[j].Name()) == -1
}
case nameSort:
sort.SliceStable(dir.files, func(i, j int) bool {
s1, s2 := normalize(dir.files[i].Name(), dir.files[j].Name(), dir.ignorecase, dir.ignoredia)
if !dir.reverse {
return s1 < s2
} else {
return s2 < s1
}
})
lessfun = func(i, j int) bool {
return coll.CompareString(dir.files[i].Name(), dir.files[j].Name()) == -1
}
Copy link
Collaborator

@joelim-work joelim-work May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two cases have the same code, so you should be able to combine them:

	case nameSort, naturalSort:
		lessfun = func(i, j int) bool {
			return coll.CompareString(dir.files[i].Name(), dir.files[j].Name()) == -1
		}

It's probably also worth putting a comment stating that naturalSort is already handled above.

This actually raises the question of whether natural sorting should have its own option, instead of being lumped under sortby, since it is really a sorting modifier and not a property (e.g. name, size, time) that is used as a basis for sorting. But changing the configuration options is a breaking change, so I think it is not worth worrying about for now.

nav.go Outdated
Comment on lines 348 to 327
// Finally sort after filtering it all
sort.SliceStable(dir.files, lessfun)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't thought this through, but does it make sense to do all the filtering (i.e. dironly, hidden, filter) first before doing any sorting? It would make the logic cleaner if all of the filtering could be moved to the top of this function, but I don't know if it will cause any issues.

Since you are pretty much rewriting the entire function, it's probably a good time to clean it up as much as possible.

@joelim-work
Copy link
Collaborator

I bumped some dependencies yesterday, looks like it causes a merge conflict.

@joelim-work
Copy link
Collaborator

OK so I decided to play around with this a bit more, and I wrote the following code in a rush, didn't test it too much yet. I wonder if this might be more desirable:

func (dir *dir) sort() {
	dir.sortby = getSortBy(dir.path)
	dir.dirfirst = getDirFirst(dir.path)
	dir.dironly = getDirOnly(dir.path)
	dir.hidden = getHidden(dir.path)
	dir.reverse = getReverse(dir.path)
	dir.hiddenfiles = gOpts.hiddenfiles
	dir.ignorecase = gOpts.ignorecase
	dir.ignoredia = gOpts.ignoredia

	dir.files = dir.allFiles

	filterFiles := func(files []*file, f func(*file) bool) (result []*file) {
		for _, file := range files {
			if f(file) {
				result = append(result, file)
			}
		}
		return result
	}

	if dir.dironly {
		dir.files = filterFiles(dir.files, func(file *file) bool {
			return file.IsDir()
		})
	}

	if !dir.hidden {
		dir.files = filterFiles(dir.files, func(file *file) bool {
			return !isHidden(file, dir.path, dir.hiddenfiles)
		})
	}

	if len(dir.filter) != 0 {
		dir.files = filterFiles(dir.files, func(file *file) bool {
			return !isFiltered(file, dir.filter)
		})
	}

	collopts := []collate.Option{}
	if dir.ignorecase {
		collopts = append(collopts, collate.IgnoreCase)
	}
	if dir.ignoredia {
		collopts = append(collopts, collate.IgnoreDiacritics)
	}
	if dir.sortby == naturalSort {
		collopts = append(collopts, collate.Numeric)
	}
	coll := collate.New(language.Und, collopts...)

	var lessfun func(i, j int) bool

	switch dir.sortby {
	case nameSort, naturalSort:
		lessfun = func(i, j int) bool {
			return coll.CompareString(dir.files[i].Name(), dir.files[j].Name()) == -1
		}
	case sizeSort:
		lessfun = func(i, j int) bool {
			return dir.files[i].TotalSize() < dir.files[j].TotalSize()
		}
	case timeSort:
		lessfun = func(i, j int) bool {
			return dir.files[i].ModTime().Before(dir.files[j].ModTime())
		}
	case atimeSort:
		lessfun = func(i, j int) bool {
			return dir.files[i].accessTime.Before(dir.files[j].accessTime)
		}
	case ctimeSort:
		lessfun = func(i, j int) bool {
			return dir.files[i].changeTime.Before(dir.files[j].changeTime)
		}
	case extSort:
		lessfun = func(i, j int) bool {
			cmp := coll.CompareString(dir.files[i].ext, dir.files[j].ext)
			return cmp == -1 || cmp == 0 && coll.CompareString(dir.files[i].Name(), dir.files[j].Name()) == -1
		}
	}

	if dir.reverse {
		oldlessfun := lessfun
		lessfun = func(i, j int) bool {
			return oldlessfun(j, i)
		}
	}

	if dir.dirfirst {
		oldlessfun := lessfun
		lessfun = func(i, j int) bool {
			if dir.files[i].IsDir() == dir.files[j].IsDir() {
				return oldlessfun(i, j)
			}
			return dir.files[i].IsDir()
		}
	}

	sort.SliceStable(dir.files, lessfun)

	dir.ind = max(dir.ind, 0)
	dir.ind = min(dir.ind, len(dir.files)-1)
}

@q3cpma
Copy link
Author

q3cpma commented May 9, 2024

Thanks, this was actually in my head at some point, but I was a bit wary of doing changes too big with my lack of knowledge of Go and its slices. Good thinking about language.Und, didn't see it.

I included your changes except that I made filtering go through the file list only once.

@q3cpma q3cpma force-pushed the sort-misc-improvements branch 3 times, most recently from f138d49 to abfb1fd Compare May 9, 2024 21:52
@joelim-work
Copy link
Collaborator

I didn't realize it was possible to combine filters in the way you have done, nice.

Anyway the code looks quite good now There's a minor issue with formatting that is causing the CI build to fail, but otherwise it is probably close to being in a state where it can be merged.

Just out of interest how much of an increase in performance is there with the new changes?

@q3cpma
Copy link
Author

q3cpma commented May 10, 2024

Not very different:

Before:

/                                          18      30.99µs
/home                                      2       7.24µs
/home/q3cpma                               73      416.421µs
/home/q3cpma/Programming                   44      369.791µs
/home/q3cpma/Programming/externe           30      133.94µs
/home/q3cpma/Programming/externe/lf        46      101.62µs
/home/q3cpma/Programming/externe/lf/build  1       7.11µs
/home/q3cpma/Programming/externe/lf/etc    19      42.27µs
/home/q3cpma/Programming/externe/lf/gen    4       19.31µs
/home/q3cpma/Programming/externe/lf/test   100000  58.616475ms

After:

/                                          18      26.2µs
/home                                      2       34.72µs
/home/q3cpma                               73      29.87µs
/home/q3cpma/Programming                   44      75.88µs
/home/q3cpma/Programming/externe           30      53.9µs
/home/q3cpma/Programming/externe/lf        45      41.05µs
/home/q3cpma/Programming/externe/lf/build  1       17.97µs
/home/q3cpma/Programming/externe/lf/etc    19      33.35µs
/home/q3cpma/Programming/externe/lf/gen    4       18.66µs
/home/q3cpma/Programming/externe/lf/test   100000  26.947215ms

Basically, a little overhead for small directories, but in the absolute, nothing. Significant gains for the big ones (not even counting the dirfirst skipped with dironly trick or less to sort if a lot is filtered).

@joelim-work
Copy link
Collaborator

Sorry to trouble you, but I did some more testing and I found that the ignorecase and ignoredia options no longer work when they're disabled, at least on my machine. I did some investigation and it seems like collate.New seems to ignore case and diacritics by default.

The following examples all print -1 (i.e. left < right):

c := collate.New(language.Und)
fmt.Println(c.CompareString("a", "B")) // "B" should come first since it is uppercase
c := collate.New(language.Und, collate.IgnoreCase)
fmt.Println(c.CompareString("a", "B"))
c := collate.New(language.Und)
fmt.Println(c.CompareString("é", "f"))  // "é" should come last since it is a special character
c := collate.New(language.Und, collate.IgnoreDiacritics)
fmt.Println(c.CompareString("é", "f"))

I'm not sure how to go about solving this though, I'm haven't used the collate package all that much.

@q3cpma
Copy link
Author

q3cpma commented May 10, 2024

Not troubling me, I'm quite happy you're doing so much testing (I should be doing!).

Filed golang/go#67296 for now, because it sure as hell looks like a bug to me.

@q3cpma
Copy link
Author

q3cpma commented Jun 25, 2024

Seeing the helpful comments in the issue I opened, I'm thinking about reverting all collate changes and doing something like this instead:

  • Remove diacritics via:
package main

import (
	"fmt"
	"unicode"

	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
	"golang.org/x/text/runes"
)

func main() {
	t := transform.Chain(norm.NFD, runes.Remove(runes.In(unicode.Mn)), norm.NFC)
	result, _, _ := transform.String(t, "žůžo")
	fmt.Println(result)
}

The goal being to avoid the massive consing of the current solution.

Does it make sense to still do it in that PR?

@joelim-work
Copy link
Collaborator

For ignorecase, converting the string to lowercase ToLower is probably readable enough anyway - I don't see any reason to change the code there. Besides strings.EqualFold merely checks if the strings are equal, but here the strings need to be ordered.

As for ignoredia, there are unit tests for this, please check your implementation (from https://stackoverflow.com/a/65981868) satisfies the them.

@q3cpma
Copy link
Author

q3cpma commented Aug 31, 2024

Hello and sorry for the (very) long delay, my energy is a bit low these days and that text/collate thing (and Go in general not even providing natural string comparison) kinda took the wind out of my sails.

I reverted to our own removeDiacritics/strings.ToLower solution, even if that much allocation isn't pretty when you know Go's GC isn't generational. But well, that can be a later improvement.

Tested extSort a bit more (because I didn't remember why the "\x00" thing got removed) and everything seems to work. I get the following in a test dir:

b/
c
a.c
a.txt
b.txt
c.txt
a.zip

In fact, sort after all the filtering is done, makes more sense
And add the following optimizations:
* Hoist the branching out of normalize()
* In extSort, normalize the names only when extensions are equal
* In extSort, use strings.Compare that does a proper 3-way comparison in Go 1.23
@joelim-work
Copy link
Collaborator

So it's been a while since I last looked at this too.

TBH, given that the collate package isn't being used anymore and nothing new is being added in terms of functionality, I have somewhat lost interest in this change. Now it just looks like the code has been rewritten for the sake of using a different programming paradigm. Overall this feels subjective to me, so at this point in time I'm not inclined to merge it, but I will leave it open in case anyone else is interested.

In the meantime I think there are smaller things that can be improved, and I would accept them if each are submitted separately:

  • The "\x00" code is not needed when sorting by extension, it is leftover cruft from make sortby ext work as expected #539
  • The normalize function can be changed to operate on only one input string at a time
  • The code for handling filtering (dironly, hidden, filter) is clearly copy/pasted and can be extracted out into a separate function
  • Filtering can be performed before sorting

@q3cpma
Copy link
Author

q3cpma commented Aug 31, 2024

It's a bit understandable, but I wouldn't call it merely different, personally; especially since ~40 lines are removed and performance certainly improved. But oh well, that's how it is, and I agree that I did let my functional side bleed a bit too much into a very imperative language.

@joelim-work
Copy link
Collaborator

I'm not completely against functional programming - for instance I can see that reversing the sort comparator function at the end saves the need to specify if !dir.reverse { ... } for every type of sort. It's just that I'm uninclined to change the style of the existing code unless it is actively causing maintenance problems.

Maybe dir.reverse could be handled like this instead, while keeping the original style:

case sizeSort:
	sort.SliceStable(dir.files, func(i, j int) bool {
		if dir.reverse {
			i, j = j, i
		}
		return dir.files[i].TotalSize() < dir.files[j].TotalSize()
	})

As for performance, I don't notice any difference, and I don't recall seeing any other issues raised about sorting speed. I haven't done any benchmarking myself, but if it's generally in the nanoseconds then I don't think it's an important factor compared to the actual code.

@q3cpma
Copy link
Author

q3cpma commented Sep 1, 2024

Well, the entire point of it was to avoid a branch per comparison and to copy paste code for each sort; but knowing how bad Go's inliner is with closures, who knows...

Sorry for wasting your time, I guess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants