General framework for dataset, format-agnostic and compatible with NGFF #292

Open
jdeschamps opened this issue Dec 6, 2024 · 0 comments
Labels: feature (New feature or request), refactoring (Streamline and improve source code)


jdeschamps commented Dec 6, 2024

Description

Important

Summary: replace the current datasets with a general framework compatible with NGFF file formats.

We have accumulated a lot of technical debt in our datasets for the sake of simplifying the code. We have ended up with two different dataset types (InMemory and PathIterable), which use different patching strategies (sequential vs random). Furthermore, the train and prediction datasets have been split because of the tiling needed at prediction time.

This creates a lot of complexity, with behaviour that differs depending on hyperparameters. The fact that we are confused about which dataset does what every time we discuss them is a bad sign!

To support additional file formats, we also provide parameters to pass extensions and file-reader callables. While this is a nice idea, it is very limiting and will not scale well.

As we move towards NGFF compatibility, we need a better and more general dataset system that can work with multiple back-ends. In addition, this would help third parties implement support for their own file formats.

Features

  • Dataset
    • Tiff, Zarr, numpy array
    • Optional target
  • Patcher
    • Exclude background coordinates (here or in sampler?)
    • Random patching for training
    • Save train / val coordinates for reproducibility?
    • Seed-able
    • Possibility to fix coordinates (sequential?) for validation
    • multi-file patching
  • Sampler
  • Data statistics
    • Welford mean/std (see the sketch after this list)
    • Quantile normalization
    • No normalization
    • Pre-computed
  • Metadata extractor
    • Returns the shape of each file for a set of files
    • If metadata is unavailable, defaults to reading the file
  • Augmentations
    • Normalization handling
    • Random rotate 90 degrees
    • Random X or Y flip
    • Add noise
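
For reference, a minimal sketch of the Welford-style running statistics mentioned in the list above. `WelfordStats` and its method names are illustrative, not an existing CAREamics API:

```python
import numpy as np


class WelfordStats:
    """Running mean/std accumulated patch by patch (Welford-style update)."""

    def __init__(self) -> None:
        self.count = 0   # number of pixels seen so far
        self.mean = 0.0  # running mean
        self.m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, patch: np.ndarray) -> None:
        """Fold the pixels of one patch into the running statistics."""
        values = patch.astype(np.float64).ravel()
        n_new = values.size
        mean_new = float(values.mean())
        m2_new = float(((values - mean_new) ** 2).sum())

        total = self.count + n_new
        delta = mean_new - self.mean
        self.mean += delta * n_new / total
        self.m2 += m2_new + delta**2 * self.count * n_new / total
        self.count = total

    def finalize(self) -> tuple[float, float]:
        """Return the (mean, std) over everything seen so far."""
        if self.count == 0:
            return 0.0, 0.0
        return self.mean, float(np.sqrt(self.m2 / self.count))
```

Calling `update` once per patch and `finalize` at the end gives the same mean/std as a single pass over the full data, without ever holding it all in memory.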

Which part of the code?

Potential solutions

@veegalinova and @CatEek have worked out a potential architecture; please post the schematic here for reference once we are more confident this is the way to go.

Relevant issues

#293: Exclude background patches during training.
#282: Why the difference between patching (random vs sequential)?
#297: Should we save train/val patches info for reproducibility?
#315: cryoCARE compatibility, it uses memory-mapped data
#303: Quantile normalization
#342: Add validation dataloader parameters
#343: Validation patches are random with PathIterable
#358: Stitching impossible if tiles not ordered by image
#364: Turn off normalization if data already normalized
#167: Adapt TTA from training configuration

@jdeschamps added the refactoring (Streamline and improve source code) label Dec 6, 2024
@jdeschamps added the feature (New feature or request) label Dec 6, 2024
@jdeschamps changed the title from "Format-agnostic dataset, compatible with random-access (e.g. Zarr, czi, memory-map)" to "General framework for dataset, format-agnostic and compatible with NGFF" Dec 6, 2024
@jdeschamps added this to the v0.1.0 milestone Dec 17, 2024
@melisande-c self-assigned this Jan 21, 2025
@jdeschamps moved this to Backlog in v0.1.0 Jan 27, 2025
@jdeschamps moved this from Backlog to In Progress in v0.1.0 Jan 27, 2025
melisande-c added a commit that referenced this issue Feb 10, 2025
…ata storage (#373)

## Description

Opening this draft PR to discuss work so far on the `PatchExtractor`
class for the new dataset development. There are no docs or tests for
the time being until we have agreed on the final structure!

I have deviated slightly from what was discussed in our meeting, but I
hope the design makes sense, otherwise very happy to take suggestions!

This will be an iterative process, so I'm sure changes will be made to
these classes in the future, just hope this is a decent starting point!

## Structure

### `ImageStack` class 
- An interface to extract patches from an array-type object. Must
implement the `extract_patch` method. Must have a `data_shape` attribute
and a `source` attribute. (The `source` attribute is not used yet, but I
thought it might be handy to keep track of where the data is coming from.)

- **Concrete implementations**:
  - `InMemoryImageStack`: The underlying array is a numpy array. Has
    constructor class methods to initialise from a tiff file, an existing
    numpy array, or a custom file type.
  - `ZarrImageStack`: The underlying array should point to a zarr array
    stored on disk. (Not implemented yet!)
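
For reference, a rough sketch of what this interface could look like; the exact `extract_patch` signature below is an assumption, not necessarily the one implemented in `dataset_ng/patch_extractor`:

```python
from pathlib import Path
from typing import Protocol, Sequence, Union

import numpy as np


class ImageStack(Protocol):
    """Anything patches can be extracted from, regardless of storage back-end."""

    # where the data comes from (a file path, or e.g. "array" for in-memory data)
    source: Union[Path, str]
    # full shape of the underlying array, e.g. (S, C, (Z), Y, X)
    data_shape: Sequence[int]

    def extract_patch(
        self, sample_idx: int, coords: Sequence[int], patch_size: Sequence[int]
    ) -> np.ndarray:
        """Return the patch of size `patch_size` at `coords` in sample `sample_idx`."""
        ...
```

`InMemoryImageStack` and `ZarrImageStack` would then only differ in how they index into their underlying storage.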

### `PatchExtractor` class
- Contains multiple `ImageStack` instances and can extract patches from
each of them.

### `RandomPatchSpecsGenerator`
- Not settled on the structure of this; it is implemented more as a
proof of concept. Should probably be kept decoupled from the
`ImageStack` classes and the `PatchExtractor`.
- Ported the existing random patching logic to the `generate` method,
but now only the `PatchSpecs` ("patch specifications") are generated.

`PatchSpecs` is a typed dict that can be unpacked into the
`PatchExtractor.extract_patch` method, or a sequence of `PatchSpecs` can
be passed to the `PatchExtractor.extract_patches` method.
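
A rough illustration of the idea; the field names in this `PatchSpecs` sketch are assumptions, not the final ones:

```python
from typing import Sequence, TypedDict


class PatchSpecs(TypedDict):
    """Description of a single patch, decoupled from the data itself."""

    data_idx: int            # which ImageStack within the PatchExtractor
    sample_idx: int          # which sample within that stack
    coords: Sequence[int]    # top-left corner of the patch
    patch_size: Sequence[int]


# specs as they might be produced by RandomPatchSpecsGenerator.generate(...)
specs: list[PatchSpecs] = [
    {"data_idx": 0, "sample_idx": 0, "coords": (0, 0), "patch_size": (64, 64)},
    {"data_idx": 0, "sample_idx": 0, "coords": (64, 64), "patch_size": (64, 64)},
]

# a single spec can then be unpacked into extract_patch ...
# patch = patch_extractor.extract_patch(**specs[0])
# ... or the whole sequence passed to extract_patches
# patches = patch_extractor.extract_patches(specs)
```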

### Patch extractor factory

Added a `dataset_ng/patch_extractor/patch_extractor_factory.py` file
which contains a `create_patch_extractors` function.
- Chooses the correct `PatchExtractor` constructor according to
information in the `DataConfig`.
- Takes the train data and optional val data, train target data, and val
target data, and returns the corresponding `PatchExtractor`s.
- Probably subject to change (we likely want to separate the training set
creation from the validation set creation).
- But some version of this will be called within the new dataset
initialisation.
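
Very roughly, the dispatch described above could look like the sketch below. The `data_type` values and the `from_array`/`from_tiff` constructor names are assumptions for illustration, not the actual `DataConfig` or `InMemoryImageStack` API:

```python
from pathlib import Path
from typing import Sequence, Union

import numpy as np

# InMemoryImageStack and ZarrImageStack refer to the concrete ImageStack
# implementations described above; the classmethod names are assumed.


def build_image_stacks(
    data: Sequence[Union[np.ndarray, Path]], data_type: str
) -> list:
    """Pick the matching ImageStack constructor per input (illustrative only)."""
    if data_type == "array":
        return [InMemoryImageStack.from_array(array) for array in data]
    if data_type == "tiff":
        return [InMemoryImageStack.from_tiff(path) for path in data]
    if data_type == "zarr":
        return [ZarrImageStack(path) for path in data]  # not implemented yet
    raise ValueError(f"Unsupported data type: {data_type}")
```

A `create_patch_extractors` call would then do something like this once per data split (train, val, targets) and wrap the resulting stacks in `PatchExtractor`s.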

I have also included two demo notebooks: 
- `dataset_ng/demo_data_reader.ipynb`
- `dataset_ng/patch_extractor_factory_demo.ipynb`

### Related Issues

Part of the development of #292

### Additional Notes and Examples

- A note on organisation: I have put everything in a new directory
`dataset_dev` so it wouldn't get too confusing, if people agree?
EDIT: has been moved to `dataset_ng`

- Obviously pre-commit does not pass atm because there are no docs, but
I made sure mypy passes.
EDIT: ruff now ignores docstrings for files in the `dataset_ng`
directory. The numpy docstring hook has been disabled for `dataset_ng`
entirely.

EDIT: Name updates and *Patch extractor factory* section.

---

**Please ensure your PR meets the following requirements:**

- [x] Code builds and passes tests locally, including doctests
- [ ] New tests have been added (for bug fixes/features)
- [x] Pre-commit passes
- [ ] PR to the documentation exists (for bug fixes / features)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Joran Deschamps <[email protected]>