General framework for dataset, format-agnostic and compatible with NGFF #292

Open
jdeschamps opened this issue Dec 6, 2024 · 0 comments
Labels: feature (New feature or request), refactoring (Streamline and improve source code)


jdeschamps commented Dec 6, 2024

Description

Important

Summary: replace the current datasets with a general framework compatible with NGFF file formats.

We have accumulated a lot of technical debt in our datasets for the sake of simplifying the code. We have ended up with two different dataset types (InMemory and PathIterable), which use different patching strategies (sequential vs random). Furthermore, the train and prediction datasets have been split because of the tiling needed at prediction time.

This creates a lot of complexity, with behaviour that differs depending on hyperparameters. The fact that we are confused about which dataset does what every time we discuss them is a bad sign!

To support additional file formats, we also provide parameters to pass extensions and file-reader callables. While this is a nice idea, it is very limiting and will not scale well.

As we move towards NGFF compatibility, we need a better and more general dataset system that can work with multiple back-ends. In addition, this would help third parties implement support for their own file formats.

Features

  • Dataset
    • Tiff, Zarr, numpy array
    • Optional target
  • Patcher
    • Exclude background coordinates (here or in sampler?)
    • Random patching for training
    • Save train / val coordinates for reproducibility?
    • Seed-able
    • Possibility to fix coordinates (sequential?) for validation
    • multi-file patching
  • Sampler
  • Data statistics
    • Welford mean/std (see the sketch after this list)
    • Quantile normalization
    • No normalization
    • Pre-computed
  • Metadata extractor
    • Returns the shape of each file for a set of files
    • If metadata is unavailable, defaults to reading the file
  • Augmentations
    • Normalization handling
    • Random rotate 90 degrees
    • Random X or Y flip
    • Add noise
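
For reference, a minimal sketch of the Welford-style running statistics mentioned in the list above. `WelfordStats` and its method names are illustrative, not an existing CAREamics API:

```python
import numpy as np


class WelfordStats:
    """Running mean/std accumulated patch by patch (Welford-style update)."""

    def __init__(self) -> None:
        self.count = 0   # number of pixels seen so far
        self.mean = 0.0  # running mean
        self.m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, patch: np.ndarray) -> None:
        """Fold the pixels of one patch into the running statistics."""
        values = patch.astype(np.float64).ravel()
        n_new = values.size
        mean_new = float(values.mean())
        m2_new = float(((values - mean_new) ** 2).sum())

        total = self.count + n_new
        delta = mean_new - self.mean
        self.mean += delta * n_new / total
        self.m2 += m2_new + delta**2 * self.count * n_new / total
        self.count = total

    def finalize(self) -> tuple[float, float]:
        """Return the (mean, std) over everything seen so far."""
        if self.count == 0:
            return 0.0, 0.0
        return self.mean, float(np.sqrt(self.m2 / self.count))
```

Calling `update` once per patch and `finalize` at the end gives the same mean/std as a single pass over the full data, without ever holding it all in memory.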

Which part of the code?

Potential solutions

@veegalinova and @CatEek have worked out a potential architecture; please post the schematic here for reference once we are more confident this is the way to go.

Relevant issues

#293: Exclude background patches during training.
#282: Why the difference between patching (random vs sequential)?
#297: Should we save train/val patches info for reproducibility?
#315: cryoCARE compatibility, it uses memory-mapped data
#303: Quantile normalization
#342: Add validation dataloader parameters
#343: Validation patches are random with PathIterable
#358: Stitching impossible if tiles not ordered by image
#364: Turn off normalization if data already normalized
#167: Adapt TTA from training configuration

@jdeschamps added the refactoring (Streamline and improve source code) label Dec 6, 2024
@jdeschamps added the feature (New feature or request) label Dec 6, 2024
@jdeschamps changed the title from "Format-agnostic dataset, compatible with random-access (e.g. Zarr, czi, memory-map)" to "General framework for dataset, format-agnostic and compatible with NGFF" Dec 6, 2024
@jdeschamps added this to the v0.1.0 milestone Dec 17, 2024
@melisande-c self-assigned this Jan 21, 2025
@jdeschamps moved this to Backlog in v0.1.0 Jan 27, 2025
@jdeschamps moved this from Backlog to In Progress in v0.1.0 Jan 27, 2025
melisande-c added a commit that referenced this issue Feb 10, 2025
…ata storage (#373)

## Description

Opening this draft PR to discuss work so far on the `PatchExtractor`
class for the new dataset development. There are no docs or tests for
the time being until we have agreed on the final structure!

I have deviated slightly from what was discussed in our meeting, but I
hope the design makes sense, otherwise very happy to take suggestions!

This will be an iterative process, so I'm sure changes will be made to
these classes in the future, just hope this is a decent starting point!

## Structure

### `ImageStack` class 
- An interface to extract patches from an array-type object. Must
implement the `extract_patch` method. Must have a `data_shape` attribute
and a `source` attribute. (The `source` attribute is not used yet, but I
thought it might be handy to keep track of where the data is coming from.)

- **Concrete implementations**:
  - `InMemoryImageStack`: The underlying array is a numpy array. Has
    constructor class methods to initialise from a tiff file, an existing
    numpy array, or a custom file type.
  - `ZarrImageStack`: The underlying array should point to a zarr array
    stored on disk. (Not implemented yet!)
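
For reference, a rough sketch of what this interface could look like; the exact `extract_patch` signature below is an assumption, not necessarily the one implemented in `dataset_ng/patch_extractor`:

```python
from pathlib import Path
from typing import Protocol, Sequence, Union

import numpy as np


class ImageStack(Protocol):
    """Anything patches can be extracted from, regardless of storage back-end."""

    # where the data comes from (a file path, or e.g. "array" for in-memory data)
    source: Union[Path, str]
    # full shape of the underlying array, e.g. (S, C, (Z), Y, X)
    data_shape: Sequence[int]

    def extract_patch(
        self, sample_idx: int, coords: Sequence[int], patch_size: Sequence[int]
    ) -> np.ndarray:
        """Return the patch of size `patch_size` at `coords` in sample `sample_idx`."""
        ...
```

`InMemoryImageStack` and `ZarrImageStack` would then only differ in how they index into their underlying storage.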

### `PatchExtractor` class
- Contains multiple `ImageStack` instances and can extract patches from
each of them.

### `RandomPatchSpecsGenerator`
- Not settled on the structure of this; it is implemented more as a
proof of concept. Should probably be kept decoupled from the
`ImageStack` classes and the `PatchExtractor`.
- Ported the existing random patching logic to the `generate` method,
but now only the `PatchSpecs` ("patch specifications") are generated.

`PatchSpecs` is a typed dict that can be unpacked into the
`PatchExtractor.extract_patch` method, or a sequence of `PatchSpecs` can
be passed to the `PatchExtractor.extract_patches` method.
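
A rough illustration of the idea; the field names in this `PatchSpecs` sketch are assumptions, not the final ones:

```python
from typing import Sequence, TypedDict


class PatchSpecs(TypedDict):
    """Description of a single patch, decoupled from the data itself."""

    data_idx: int            # which ImageStack within the PatchExtractor
    sample_idx: int          # which sample within that stack
    coords: Sequence[int]    # top-left corner of the patch
    patch_size: Sequence[int]


# specs as they might be produced by RandomPatchSpecsGenerator.generate(...)
specs: list[PatchSpecs] = [
    {"data_idx": 0, "sample_idx": 0, "coords": (0, 0), "patch_size": (64, 64)},
    {"data_idx": 0, "sample_idx": 0, "coords": (64, 64), "patch_size": (64, 64)},
]

# a single spec can then be unpacked into extract_patch ...
# patch = patch_extractor.extract_patch(**specs[0])
# ... or the whole sequence passed to extract_patches
# patches = patch_extractor.extract_patches(specs)
```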

### Patch extractor factory

Added a `dataset_ng/patch_extractor/patch_extractor_factory.py` file
which contains a `create_patch_extractors` function.
- Chooses the correct `PatchExtractor` constructor according to
information in the `DataConfig`.
- Takes the train data and optional val data, train target data, and val
target data, and returns the corresponding `PatchExtractor`s.
- Probably subject to change (we likely want to separate the training set
creation from the validation set creation).
- But some version of this will be called within the new dataset
initialisation.
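
Very roughly, the dispatch described above could look like the sketch below. The `data_type` values and the `from_array`/`from_tiff` constructor names are assumptions for illustration, not the actual `DataConfig` or `InMemoryImageStack` API:

```python
from pathlib import Path
from typing import Sequence, Union

import numpy as np

# InMemoryImageStack and ZarrImageStack refer to the concrete ImageStack
# implementations described above; the classmethod names are assumed.


def build_image_stacks(
    data: Sequence[Union[np.ndarray, Path]], data_type: str
) -> list:
    """Pick the matching ImageStack constructor per input (illustrative only)."""
    if data_type == "array":
        return [InMemoryImageStack.from_array(array) for array in data]
    if data_type == "tiff":
        return [InMemoryImageStack.from_tiff(path) for path in data]
    if data_type == "zarr":
        return [ZarrImageStack(path) for path in data]  # not implemented yet
    raise ValueError(f"Unsupported data type: {data_type}")
```

A `create_patch_extractors` call would then do something like this once per data split (train, val, targets) and wrap the resulting stacks in `PatchExtractor`s.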

I have also included two demo notebooks: 
- `dataset_ng/demo_data_reader.ipynb`
- `dataset_ng/patch_extractor_factory_demo.ipynb`

### Related Issues

Part of the development of #292

### Additional Notes and Examples

- A note on organisation: I have put everything in a new directory
`dataset_dev` so it wouldn't get too confusing, if people agree?
EDIT: has been moved to `dataset_ng`

- Obviously pre-commit does not pass atm because there are no docs, but
I made sure mypy passes.
EDIT: ruff now ignores docstrings for files in the `dataset_ng`
directory. The numpy docstring hook has been disabled for `dataset_ng`
entirely.

EDIT: Name updates and *Patch extractor factory* section.

---

**Please ensure your PR meets the following requirements:**

- [x] Code builds and passes tests locally, including doctests
- [ ] New tests have been added (for bug fixes/features)
- [x] Pre-commit passes
- [ ] PR to the documentation exists (for bug fixes / features)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Joran Deschamps <[email protected]>