Functionality questions for ome-zarr formatted data #181
Thanks for summarizing your attempts @Christianfoley. Some notes for context:
Regarding the simple augment and the reversed channels: …

We don't have a source for reading from OME Zarr. Our `ZarrSource` reads plain zarr datasets, so each array has to be addressed by its full path inside the container.

It is! You can create as many zarr sources as you like, pointing to different zarr containers and/or datasets. You can then combine them using `RandomProvider`.
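For example, a minimal sketch (the container and dataset paths below are hypothetical):

```python
import gunpowder as gp

raw = gp.ArrayKey("RAW")

# one source per zarr container/dataset (paths are placeholders)
source_a = gp.ZarrSource("row_a.zarr", {raw: "Pos_000/arr_0"})
source_b = gp.ZarrSource("row_b.zarr", {raw: "Pos_000/arr_0"})

# each batch request is answered by one randomly selected upstream source
pipeline = (source_a, source_b) + gp.RandomProvider()
```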
That would be a cleaner solution, agreed. For that, it would be helpful to know how you plan to use the OME zarr datasets during training. Is the idea to select a random row/column for each sample in a batch? Or will it be necessary to request specific rows/columns?

Ah, I may have misunderstood. I thought the Row/Col/Pos related the arrays spatially and you essentially wanted to tile them together to create one larger array, which would be difficult to do with our existing nodes. Pairing many raw datasets with their associated gt and then randomly sampling is well supported.
Hi Jan and William! Thank you so much for your quick and informative responses.
We should be able to consider all arrays in all rows/columns for random selection. That said, we only want to load tiles (256x256) whose field of view meets certain qualifications: much of our actual image data is simply background that lacks significant features, and we've shown that selecting only regions with some percentage of foreground in them is very valuable to network performance. Ideally, we would like to do this without making another copy of the data. We could do this on the fly by loading a region and computing whether it meets the qualifications, but I think it may be more efficient to precompute these regions and keep some metadata specifying which regions contain foreground features. That way, regions don't get loaded and then dumped when they aren't up to par.

On the topic of loading smaller regions, I noticed that it took significantly more time to load a 256x256 ROI from a larger 2048x2048 image than it did to load the same region (256x256) from a pre-saved tile of that region.
Yes, we have also found avoiding excessive sampling from background regions to be helpful. Gunpowder supports both lazy and precomputed methods for doing this.
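One common way to do this is a sketch along these lines (dataset names are hypothetical; this mirrors the `RandomLocation` usage that appears later in this thread), where `min_masked` rejects sampled locations whose mask coverage falls below a threshold:

```python
import gunpowder as gp

raw = gp.ArrayKey("RAW")
mask = gp.ArrayKey("MASK")

# raw data plus a binary foreground mask stored alongside it
source = gp.ZarrSource("data.zarr", {raw: "arr_0", mask: "arr_0_mask"})

# sample random locations, rejecting any ROI that is < 30% foreground
pipeline = source + gp.RandomLocation(min_masked=0.3, mask=mask)
```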
Hmm, without more details of your specific files I'm not sure I can determine the cause of the slowdown. If you're loading from a chunked data storage format such as zarr, it is unlikely that you will be reading exactly along block boundaries, so you will have to read and discard some extra data. Reading from a zarr will therefore always come with some small penalty, but I don't know of any format that could avoid this penalty and still allow you to read random crops from a large volume with reasonable performance. If performance is a problem, I recommend looking at the `PreCache` node.
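A minimal sketch of adding `PreCache` (the parameter values are arbitrary, and `pipeline` is assumed to be an already-assembled upstream pipeline):

```python
import gunpowder as gp

# spawn worker processes that keep a cache of pre-fetched batches,
# hiding chunk-read latency behind the training loop
pipeline = pipeline + gp.PreCache(cache_size=40, num_workers=10)
```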
Thanks! I was able to use this node to get the random sampling working, but only after providing my sources as a tuple. Ex:

```python
pipeline = (source_1 + source_2 + source_3) + gp.RandomProvider()  # does not work, only returns from last provider
pipeline = (source_1, source_2, source_3) + gp.RandomProvider()    # does work, chooses randomly from providers
```

In the API reference, the example usage is: …
Ah yes. That was recently fixed, but we have not yet made a new release to propagate the changes.
Hi @funkey @pattonw, thanks again for your help so far. I have been running into an issue with building pipelines, and it would be very helpful if you could help me understand a few things. Currently I am creating multiple pipelines, one each for the training, test, and validation data. I would generally draw samples from the validation set every x epochs to generate a loss value for the learning rate scheduler. However, when I do this, I get the error: …

Additionally, I don't quite understand the functionality of `gp.build`: …
This should be possible. Here's a quick working example:

```python
import zarr
import gunpowder as gp
import numpy as np

# create a small test dataset
container = zarr.open("test.zarr")
x, y = np.meshgrid(range(10), range(10))
container["test"] = x + y

# two independent pipelines reading from the same container
test_key = gp.ArrayKey("TEST")
pipeline1 = gp.ZarrSource("test.zarr", {test_key: "test"})
pipeline2 = gp.ZarrSource("test.zarr", {test_key: "test"})

request = gp.BatchRequest()
request.add(test_key, (5, 5))

with gp.build(pipeline1):
    for _ in range(10):
        batch1 = pipeline1.request_batch(request)
        assert test_key in batch1

with gp.build(pipeline2):
    batch2 = pipeline2.request_batch(request)
    assert test_key in batch2
```

You should only run into problems if you attempt to build the same pipeline multiple times, or maybe if you share nodes between both pipelines.
The problem is that `gp.build` sets up the pipeline on entering the context and tears it down again on exit. To keep each pipeline built across batches, you can wrap it in a generator:

```python
import zarr
import gunpowder as gp
import numpy as np
from typing import Iterator

# create a small test dataset
container = zarr.open("test.zarr")
x, y = np.meshgrid(range(10), range(10))
container["test"] = x + y

test_key = gp.ArrayKey("TEST")
pipeline1 = gp.ZarrSource("test.zarr", {test_key: "test"})
pipeline2 = gp.ZarrSource("test.zarr", {test_key: "test"})

request = gp.BatchRequest()
request.add(test_key, (5, 5))

def generate_batches(pipeline: gp.Pipeline, request: gp.BatchRequest) -> Iterator[gp.Batch]:
    # the pipeline stays built for the lifetime of the generator
    with gp.build(pipeline):
        while True:
            yield pipeline.request_batch(request)

batch_generator1 = generate_batches(pipeline1, request)
batch_generator2 = generate_batches(pipeline2, request)

for _ in range(10):
    batch1 = next(batch_generator1)
    assert test_key in batch1
    batch2 = next(batch_generator2)
    assert test_key in batch2
```

In this form the pipeline should not be too difficult to incorporate into a data loader.
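If it helps, here is a minimal sketch of one way to wrap such a generator for a PyTorch data loader (assuming PyTorch as the training framework; `GunpowderDataset` and the names below are hypothetical, not part of gunpowder):

```python
import gunpowder as gp
import torch

class GunpowderDataset(torch.utils.data.IterableDataset):
    """Iterates over samples drawn from a gunpowder pipeline."""

    def __init__(self, pipeline: gp.Pipeline, request: gp.BatchRequest, key: gp.ArrayKey):
        self.pipeline = pipeline
        self.request = request
        self.key = key

    def __iter__(self):
        # the pipeline is built once per worker and stays built while iterating
        with gp.build(self.pipeline):
            while True:
                batch = self.pipeline.request_batch(self.request)
                yield torch.as_tensor(batch[self.key].data)

# usage:
# loader = torch.utils.data.DataLoader(
#     GunpowderDataset(pipeline1, request, test_key), batch_size=4
# )
```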
Thanks so much for the quick reply! I will try these approaches.

Should I be worried about this impacting data loading performance? For example, will this create a lot of overhead from repeated setup and teardown if I include a PreCache node with many workers?
It should not impact performance at all.
The generator method seems to work well with our dataloaders, thanks! We have previously dealt with bottlenecks in our data IO slowing down large training sessions. Because of the way our data is formatted (multi-channel, high-depth zarr arrays, chunked by 2D slice), only a small subsection of the data returned by current batch requests is necessary to us at one time. Example:

```python
import os
import numpy as np
import gunpowder as gp

path_to_zarr = os.path.join("/data/VeroCell_40X_11NA/VeroPhaseFL_40X_pos4.zarr")
spec = gp.ArraySpec(interpolatable=True, voxel_size=gp.Coordinate((1, 1, 1)))

raw = gp.ArrayKey('RAW')   # array with 8 channels: [1, 8, 41, 2048, 2048]
mask = gp.ArrayKey('MASK')
source = gp.ZarrSource(
    filename=path_to_zarr,
    datasets={raw: "arr_0", mask: "arr_0_mask"},
    array_specs={raw: spec, mask: spec},
)
random_location = gp.RandomLocation(min_masked=0.3, mask=mask)  # reject background
pipeline = source + random_location

new_request = gp.BatchRequest()
new_request[raw] = gp.Roi((0, 0, 0), (5, 256, 256))

with gp.build(pipeline):
    sample = pipeline.request_batch(request=new_request)

data = sample[raw].data
print(data.shape)  # (1, 8, 5, 256, 256)
input_we_need = np.expand_dims(data[0, 3], (0, 1))
target_we_need = np.expand_dims(data[0, 7, 3], (0, 1, 2))
print(input_we_need.shape)   # (1, 1, 5, 256, 256)
print(target_we_need.shape)  # (1, 1, 1, 256, 256)
```

It seems to me that in this process we need to read (num_channels) x (maximum z-depth) x (width) x (height) elements, when we are using only a fraction of that data.
Hello @funkey , @pattonw! We are at a stage where we need to improve the training speed. One of the bottlenecks is that with the current implementation of the data loading pipeline, we read more data than we need to. More specifically, Christian's question above has become timely again:
We are currently working with CZYX arrays, where we need to draw patches/slices along the C and Z dimensions. In a month or so, we will start working with TCZYX, where we need to draw patches/slices along the T, C, and Z dimensions. Does gunpowder's structure allow users to draw ROIs and slices along the time, channel, and Z dimensions? Can you point us to some examples?
Thanks @mattersoflight ! The updated key method might be helpful for contextualizing our structure. |
Hi @mattersoflight and @Christianfoley, there is minimal built-in support for processing channel dimensions, since most of the gunpowder operations are focused on spatial queries and augmentations. There are a couple of approaches to handling this sort of thing.

If you want a …

Here is an example with a CTZYX dataset where you read all the channels, or add a custom node to filter out a subset of the channels. You may need a slightly more complicated customized zarr source node if you want to transpose the first two dimensions to let gunpowder handle time as spatial:

```python
import zarr
import gunpowder as gp
import numpy as np

class TakeChannels(gp.BatchFilter):
    """Keeps only the requested channels (along the first, non-spatial axis)."""

    def __init__(self, array_key: gp.ArrayKey, channels: list[int]):
        self.array_key = array_key
        self.channels = channels

    def process(self, batch, request):
        outputs = gp.Batch()
        array = batch[self.array_key]
        # fancy-index the channel axis to keep only the selected channels
        array.data = array.data[list(self.channels), ...]
        outputs[self.array_key] = array
        return outputs

# create a small CTZYX test dataset
container = zarr.open("test.zarr")
c, t, z, y, x = np.meshgrid(range(10), range(10), range(10), range(10), range(10), indexing="ij")
container["test"] = c + t + z + y + x

test_key = gp.ArrayKey("TEST")

# treat T, Z, Y, X as spatial and leave C as a channel dimension
pipeline1 = gp.ZarrSource(
    "test.zarr",
    {test_key: "test"},
    {test_key: gp.ArraySpec(voxel_size=gp.Coordinate(1, 1, 1, 1))},
)

request = gp.BatchRequest()
request.add(test_key, (5, 5, 5, 5))

with gp.build(pipeline1):
    batch = pipeline1.request_batch(request)
    assert batch[test_key].data.shape == (10, 5, 5, 5, 5)

# same source, but keep only channels 2 and 3
pipeline2 = gp.ZarrSource(
    "test.zarr",
    {test_key: "test"},
    {test_key: gp.ArraySpec(voxel_size=gp.Coordinate(1, 1, 1, 1))},
) + TakeChannels(test_key, [2, 3])

with gp.build(pipeline2):
    batch = pipeline2.request_batch(request)
    assert batch[test_key].data.shape == (2, 5, 5, 5, 5)
```
Hi everyone, I also need an …
Greetings from the Mehta Lab and apologies in advance for the long post!
I am attempting to use gunpowder as a dataloader for (float32) data in the ome-zarr format. I have run into a few issues trying to get it working with some of our data, and have enumerated my questions below.
Support for multiple zarr stores in the OME-HCS Zarr format
If I have data stored in ome-zarr format as a series of hierarchical groups (row > col > position > data_arrays), when I create datasets inside of a source node, they need to be specified by inputting the full hierarchy path to the dataset source:
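A minimal sketch of what this looks like (the store name and row/column/position group names below are hypothetical):

```python
import gunpowder as gp

raw = gp.ArrayKey("RAW")

# the dataset must be addressed by its full row/col/position path
source = gp.ZarrSource(
    "plate.zarr",
    {raw: "Row_0/Col_0/Pos_000/arr_0"},
)
```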
Because of this format, we store arrays containing data in different rows that are all part of one 'dataset' in different zarr stores. Is it possible to create a single source that can access multiple zarr stores?
Inconsistent behavior of BatchRequest objects
When applying some augmentations (for example, the SimpleAugment node), reusing a BatchRequest without redefining the request or pipelines will randomly result in data returned with the wrong indices.
For example, I define a dataset and a pipeline with and without a simple augmentation node: …

Then I define a batch request: …

Then I use that request to generate two batches from each pipeline in sequence: …

The result is the following:

[Visualization of batch from loop 1] [Visualization of batch from loop 2]
I am confused as to why the behavior changes when the data, pipeline, and batch request haven't changed. Is there a reason that the second augmentation batch returns with reversed channels?
Thanks!!