Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use natural sorting in imread(...) when globbing multiple files #265

Merged
merged 4 commits into from
Jul 25, 2022

Conversation

VolkerH
Copy link
Contributor

@VolkerH VolkerH commented Jul 21, 2022

dask_image.imread.imread can include glob strings in the filename argument.

Currently, the results of the glob operation are sorted using sorted, a behviour that is not described in the doctring.

sorted sorts alphabetically which causes files such as 10.tif, 9.tif to be read in this order, which is counter-intuitive.
A common strategy adopted by users to circumenvent such file name problems is to zero-pad all numbers in a string.

This PR adds a sortfunc argument to to imread, which allows to pass in a custom sort function. A use case for this is to pass in a function that performs natural sort, which will sort strings of numbers numerically. Such a use-case is shown in the included test.

The docstring for imread has been updated to include the new argument, which also remedies the current undocumented behaviour.
Files have been checked with flake8.

@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

@GenevieveBuckley
Copy link
Collaborator

add to allowlist

@GenevieveBuckley GenevieveBuckley mentioned this pull request Jul 22, 2022
4 tasks
@GenevieveBuckley
Copy link
Collaborator

Do you anticipate needing to use a sort function besides natural sort?

I'd be happy to straight up replace sorted with natsort. It's an extra dependency, but I think the benefit would be more than worth it.

I'm less keen to change the API to allow users to pass in arbitrary sorting functions, although I see why you suggest it. Instead, I think it would be cleaner to allow the function to accept a list of strings or Paths. Then the user could modify this list in any arbitrary way before providing it to dask-image, and it's much easier for the user to inspect.

Ideally, we'd do both these things. Then there would be a sensible default with glob and natsort, plus a more flexible alternative if necessary.

@GenevieveBuckley
Copy link
Collaborator

I'd be happy to straight up replace sorted with natsort. It's an extra dependency, but I think the benefit would be more than worth it.

Oh, even easier! I didn't realize tifffile had a natural_sorted function (from tifffile import natural_sorted). We already require tifffile as a dependency, so that's even easier to switch to using. (Thanks for teaching me something new today 😄 )

@VolkerH
Copy link
Contributor Author

VolkerH commented Jul 22, 2022

I agree that natural sort would be the most sensible default.
I just did not want to change the existing behaviour in case any legacy code is relying on alphabetical sort (unlikely, most of those legacy cases would probably be bugs).
Passing in a list of files would be another option (I also thought of that).

Yes, tifffile has a lot of hidden gems. Most of the documentation is in the actual tifffile python source code, so it is a bit hidden.

@GenevieveBuckley
Copy link
Collaborator

How about you edit this PR to remove the API change, and instead do a straight replacement of the current behaviour with tifffile.natural_sort. I'd be happy to accept that more or less as-is.

The current behaviour is clearly a bug, and it is very unlikely anybody is relying on it for legacy code.

I'd also be happy to see a PR allowing users to pass in a list of filenames, but that should be a separate PR. (You're welcome to try making that too if you like, but I also don't expect you to. Your problem will likely be solved after this first PR and you'll want to get back to what you were doing earlier)

Yes, tifffile has a lot of hidden gems. Most of the documentation is in the actual tifffile python source code, so it is a bit hidden.

I have seen that documentation in the source code, perhaps it's time to go back and see what other cool stuff I missed!

@VolkerH
Copy link
Contributor Author

VolkerH commented Jul 22, 2022

I made natural sorting the default and removed the ability to pass in a custom sort function as expected.

As side notes (probably better placed in two separate issues):

  • The documentation should explain what the difference between dask_image.imread.imread and dask.array.image.imread is and when to use which. I know that there must a difference because ...
  • .... I noticed that dask_image.imread.imread fails for certain .tif files where dask.array.image.imread works just fine. The source of the error isn't in dask_image but in pims and has to do with a datetime string formatting in the tif metadata that is not expected by pims. I will try and make a self-contained example and raise an issue in pims.

@VolkerH VolkerH changed the title Add sortfunc argument to imread(...) to customize glob order Use natural sorting in imread(...) when globbing multiple files Jul 22, 2022
Copy link
Collaborator

@GenevieveBuckley GenevieveBuckley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @VolkerH, this is a useful improvement

@GenevieveBuckley GenevieveBuckley merged commit 5cf77bf into dask:main Jul 25, 2022
@GenevieveBuckley
Copy link
Collaborator

  • The documentation should explain what the difference between dask_image.imread.imread and dask.array.image.imread is and when to use which. I know that there must a difference because ...

Yes, this is perhaps better discussed in #229
I agree, we should have one reliable way to do it. The current situation is confusing.

Minor differences include:

  • dask.array.image.imread will make one chunk per filename, which is not always a great choice (eg: you might have a single massive file with multiple timepoints in it, etc.)
  • dask.array.image.imread uses the scikit-image imread by default (although you can pass in whatever reader you like with a keyword argument, which is an excellent feature). Whereas dask-image uses pims, and you don't have a choice in the matter.
    *For the future, I think imageio now allows you to read file metadata without reading the entire file, so it would be good to make dask-image depend on that instead of pims (partly also because that's a larger and more active project)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants