# General framework for dataset, format-agnostic and compatible with NGFF #292
This was referenced Jan 9, 2025
melisande-c added a commit that referenced this issue on Feb 10, 2025:
…ata storage (#373)

## Description

Opening this draft PR to discuss the work so far on the `PatchExtractor` class for the new dataset development. There are no docs or tests for the time being, until we have agreed on the final structure! I have deviated slightly from what was discussed in our meeting, but I hope the design makes sense; otherwise I am very happy to take suggestions! This will be an iterative process, so I'm sure changes will be made to these classes in the future; I just hope this is a decent starting point!

## Structure

### `ImageStack` class

- An interface to extract patches from an array-type object. Must implement the `extract_patch` method, and must have a `data_shape` attribute and a `source` attribute. (The `source` attribute is not used yet, but I thought it might be handy to keep track of where the data is coming from.)
- **Concrete implementations**:
  - `InMemoryImageStack`: the underlying array is a NumPy array. Has constructor class methods to initialise from a TIFF file, an existing NumPy array, or a custom file type.
  - `ZarrImageStack`: the underlying array should point to a zarr array stored on disk. (Not implemented yet!)

### `PatchExtractor` class

- Contains multiple `ImageStack` instances and can extract patches from each of them.

### `RandomPatchSpecsGenerator`

- Not settled on the structure of this; it is implemented more as a proof of concept. It should probably be kept decoupled from the `ImageStack` classes and the `PatchExtractor`.
- Ported the existing random patching logic to the `generate` method, but now only the `PatchSpecs` ("patch specifications") are generated. `PatchSpecs` is a typed dict that can be unpacked in the `PatchExtractor.extract_patch` method, or a sequence of `PatchSpecs` can be passed to the `PatchExtractor.extract_patches` method.

### Patch extractor factory

Added a `dataset_ng/patch_extractor/patch_extractor_factory.py` file which contains a `create_patch_extractors` function.

- Chooses the correct `PatchExtractor` constructor according to information in the `DataConfig`.
- Takes the train data and optional val data, train target data and val target data, and returns the corresponding `PatchExtractor`s.
- Probably subject to change (we probably want to separate the training set creation from the validation set creation), but some version of this will be called within the new dataset initialisation.

I have also included two demo notebooks:

- `dataset_ng/demo_data_reader.ipynb`
- `dataset_ng/patch_extractor_factory_demo.ipynb`

### Related Issues

Part of the development of #292.

### Additional Notes and Examples

- A note on organisation: I have put everything in a new directory `dataset_dev` so it wouldn't get too confusing, if people agree? EDIT: it has been moved to `dataset_ng`.
- Obviously pre-commit does not pass at the moment because there are no docs, but I made sure mypy passes. EDIT: ruff now ignores docstrings for files in the `dataset_ng` directory. The numpy docstring hook has been disabled for `dataset_ng` entirely.

EDIT: name updates and *Patch extractor factory* section.

---

**Please ensure your PR meets the following requirements:**

- [x] Code builds and passes tests locally, including doctests
- [ ] New tests have been added (for bug fixes/features)
- [x] Pre-commit passes
- [ ] PR to the documentation exists (for bug fixes / features)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Joran Deschamps <[email protected]>
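To make the design above concrete, here is a minimal sketch of the `ImageStack` interface, the `PatchSpecs` typed dict, and the `PatchExtractor`. The class and attribute names follow the PR text, but the exact signatures, dict keys, and axis layout (`(S, C, Y, X)`) are assumptions for illustration, not the actual CAREamics implementation:

```python
from typing import Protocol, Sequence, TypedDict

import numpy as np
from numpy.typing import NDArray


class PatchSpecs(TypedDict):
    """Assumed shape of the "patch specifications" typed dict."""

    data_idx: int                # which ImageStack the patch comes from
    sample_idx: int              # index along the sample (S) dimension
    coords: tuple[int, ...]      # top-left corner of the patch (spatial axes)
    patch_size: tuple[int, ...]  # patch extent along each spatial axis


class ImageStack(Protocol):
    """Interface to extract patches from an array-type object."""

    source: str
    data_shape: tuple[int, ...]

    def extract_patch(
        self, sample_idx: int, coords: tuple[int, ...], patch_size: tuple[int, ...]
    ) -> NDArray: ...


class InMemoryImageStack:
    """Concrete implementation backed by an in-memory NumPy array."""

    def __init__(self, source: str, data: NDArray):
        self.source = source
        self._data = data
        self.data_shape = data.shape

    def extract_patch(self, sample_idx, coords, patch_size):
        # Slice the spatial axes; keep all channels of the chosen sample.
        slices = tuple(slice(c, c + e) for c, e in zip(coords, patch_size))
        return self._data[(sample_idx, ...) + slices]


class PatchExtractor:
    """Holds multiple ImageStacks and extracts patches from each of them."""

    def __init__(self, image_stacks: Sequence[ImageStack]):
        self.image_stacks = list(image_stacks)

    def extract_patch(self, data_idx, sample_idx, coords, patch_size):
        return self.image_stacks[data_idx].extract_patch(sample_idx, coords, patch_size)

    def extract_patches(self, patch_specs: Sequence[PatchSpecs]):
        # Each PatchSpecs dict unpacks directly into extract_patch.
        return [self.extract_patch(**spec) for spec in patch_specs]
```

A `ZarrImageStack` would satisfy the same `Protocol` by slicing a zarr array lazily instead of holding a NumPy array in memory.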
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Description
Important
Summary: replace the current datasets with a general framework compatible with NGFF file formats.
Currently, we have accumulated a lot of technical debt in our datasets for the sake of simplifying the code. We have ended up with two different dataset types (InMemory and PathIterable) with different patching strategies (sequential vs random). Furthermore, the train and prediction datasets have been split because of the tiling during prediction.
This creates a lot of complexity and different behaviours depending on hyperparameters. The fact that every time we discuss datasets we are confused about which one does what is a bad sign!
To allow additional file formats, we also provide parameters to pass extensions and file reader callables. While this is a nice idea, it is very limiting and will not scale well.
As we are moving towards compatibility with NGFF, we need a better and more general dataset system that can be compatible with multiple back-ends. In addition, this would help third-parties implement their own file format.
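One way to make the dataset format-agnostic, instead of passing extensions and reader callables as parameters, is a back-end registry that third parties can extend. This is purely an illustrative sketch of the idea, not existing code; all names here are hypothetical:

```python
from pathlib import Path
from typing import Callable, Dict

import numpy as np

# Registry mapping file suffixes to reader callables. The core code only
# knows about this table; new back-ends register themselves into it.
READERS: Dict[str, Callable[[Path], np.ndarray]] = {}


def register_reader(suffix: str):
    """Decorator registering a reader for files with the given suffix."""

    def decorator(func: Callable[[Path], np.ndarray]):
        READERS[suffix] = func
        return func

    return decorator


@register_reader(".npy")
def read_npy(path: Path) -> np.ndarray:
    return np.load(path)


def read_any(path: Path) -> np.ndarray:
    """Dispatch to the registered reader for the file's suffix."""
    try:
        return READERS[path.suffix](path)
    except KeyError:
        raise ValueError(f"No reader registered for '{path.suffix}'") from None
```

A third party supporting a new format would then only call `register_reader` with its own callable, without any change to the core dataset code.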
Features
Which part of the code?
Potential solutions
@veegalinova and @CatEek have worked out a potential architecture; please post the schematic here for reference once we are more confident this is the way to go.
Relevant issues
#293: Exclude background patches during training.
#282: Why the difference between patching (random vs sequential)?
#297: Should we save train/val patches info for reproducibility?
#315: cryoCARE compatibility, it uses memory-mapped data
#303: Quantile normalization
#342: Add validation dataloader parameters
#343: Validation patches are random with PathIterable
#358: Stitching impossible if tiles not ordered by image
#364: Turn off normalization if data already normalized
#167: Adapt TTA from training configuration