Automatic serialization/deserialization of Numpy arrays #13

Open
jtc42 opened this issue Feb 7, 2020 · 7 comments
Labels
critical On the critical path enhancement New feature or request

Comments

@jtc42
Member

jtc42 commented Feb 7, 2020

See https://gitlab.com/openflexure/openflexure-microscope-server/-/blob/master/openflexure_microscope/utilities.py#L25

We should introduce into the spec common scientific data types, and sensible ways to (de)serialize them.

Start with Numpy arrays (as we've already done this, see link above).

Discussion: What other scientific data types might be useful?

@jtc42 jtc42 added critical On the critical path enhancement New feature or request labels Feb 7, 2020
@ChasNelson1990

So JSON is definitely a good way to deal with dicts and even pd.DataFrame objects, but is it the right way to deal with np.ndarrays?

Also, many of our arrays might be more suitable in an xarray object? In which case using hdf5 as a base type makes sense. xarray recommends netCDF (based on hdf5): http://xarray.pydata.org/en/stable/io.html.

Pandas has hdf5 support (it might be an additional package, can't remember) and there's https://www.h5py.org/ for more general use.

Worth considering compression though, which is not covered by most existing python hdf5 packages (I think).

@jtc42
Member Author

jtc42 commented Mar 20, 2020

So the logic here is that, because of the structure of the API, and especially the websocket-based data event stuff, we should be able to include data within a JSON object.

This is similar to how things like OME-XML work, in that they have XML metadata which contains a binary blob.

I'm more than happy to have suggestions on encoding formats, but whatever we choose, it would be nice if it could be sensibly embedded within a JSON object.

np.ndarrays are nice because, by default at least, they're just C-contiguous arrays. The encoding we use in the OFM software gives you type information, array dimensions, and a base64-encoded binary blob of the array, which means it can be used outside of Python if you want.
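For readers who want something concrete, here is a minimal sketch of an encoding along those lines. It is not the OFM implementation: the function names and the `dtype` / `shape` / `base64` keys are assumptions chosen to mirror the description above, and it only targets plain numeric dtypes.

```python
import base64

import numpy as np


def ndarray_to_json(arr) -> dict:
    """Pack an array into a JSON-friendly dict (hypothetical helper)."""
    arr = np.asarray(arr)
    return {
        # dtype.str includes the byte order, e.g. "<f4" for little-endian float32
        "dtype": arr.dtype.str,
        "shape": list(arr.shape),
        # tobytes() emits C-order bytes by default, even for strided views
        "base64": base64.b64encode(arr.tobytes()).decode("ascii"),
    }


def ndarray_from_json(obj: dict) -> np.ndarray:
    """Rebuild the array from the dict produced by ndarray_to_json."""
    raw = base64.b64decode(obj["base64"])
    flat = np.frombuffer(raw, dtype=np.dtype(obj["dtype"]))
    # frombuffer gives a read-only view of the bytes; copy to get a writable array
    return flat.reshape(obj["shape"]).copy()
```

Because the payload is just a dtype string, a shape list, and raw C-order bytes, a non-Python client should be able to decode it with a base64 decoder and a typed array.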

That said, I'm not married to the idea.

We could move to a model of JSON linking to a separate binary file, but especially for small data sets, there's something appealing about the data and its metadata all being contained within a single object.

As usual, open to suggestions.

@ChasNelson1990

Hmm... that makes sense I guess... a quick google on why not to use hdf5 came up with https://cyrille.rossant.net/moving-away-hdf5/
This, and other links, do seem to suggest storing big blobs separately, with JSON holding the metadata, and, for small things, just using JSON.

For reference, this matches the xarray system too, I believe: the actual array is just a numpy array, and the xarray object basically wraps that with a 'metadata' layer, i.e. like column names in a pd.Series.

@jtc42
Member Author

jtc42 commented Mar 20, 2020

Oh that's super useful, thanks!

So it might be that we just add in xarray support (it does seem really sensible) which would serialise like:

Ndarray:

{
        "@type": "ndarray",
        "dtype": << data type >>,
        "shape": << array shape >>,
        "base64": << base 64 encoded blob >>
}
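As a rough illustration of how the ndarray form above could be made automatic, the standard `json` module allows a custom encoder plus an `object_hook`. This is only a sketch under assumptions: the class and function names, and the example message, are hypothetical and not part of any existing LabThings or OFM API.

```python
import base64
import json

import numpy as np


class NumpyJSONEncoder(json.JSONEncoder):
    """Hypothetical encoder: turns ndarrays into the tagged dict shown above."""

    def default(self, o):
        if isinstance(o, np.ndarray):
            return {
                "@type": "ndarray",
                "dtype": o.dtype.str,
                "shape": list(o.shape),
                "base64": base64.b64encode(o.tobytes()).decode("ascii"),
            }
        # Anything else falls through to the normal TypeError
        return super().default(o)


def numpy_object_hook(obj: dict):
    """Hypothetical hook: rebuilds an ndarray from any dict tagged as one."""
    if obj.get("@type") == "ndarray":
        raw = base64.b64decode(obj["base64"])
        return np.frombuffer(raw, dtype=obj["dtype"]).reshape(obj["shape"]).copy()
    return obj


# The array can sit anywhere inside an ordinary JSON message
message = {"event": "frame", "data": np.eye(2, dtype=np.uint8)}
text = json.dumps(message, cls=NumpyJSONEncoder)
restored = json.loads(text, object_hook=numpy_object_hook)
```

The `@type` tag is what lets the decoder tell an encoded array apart from an ordinary dict that happens to have similar keys.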

Xarray:

{
        "@type": "xarray",
        "dtype": << data type >>,
        "shape": << array dims >>,
        "coords": << xarray coords dict >>,
        "attrs": << xarray attrs dict >>,
        "base64": << base 64 encoded blob >>
}
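A sketch of what the xarray form might look like in code, with several assumptions stated up front. The helpers only rely on the `values`, `dims`, `coords` and `attrs` attributes of a DataArray-like object, so the block does not import xarray itself. The extra `dims` key is my addition (dimension names are needed to rebuild the object), only coordinates named after a dimension are kept, and coordinates and attrs are assumed to be JSON-friendly (numbers or strings).

```python
import base64

import numpy as np


def dataarray_to_json(da) -> dict:
    """Encode a DataArray-like object (hypothetical helper)."""
    values = np.asarray(da.values)
    return {
        "@type": "xarray",
        "dtype": values.dtype.str,
        "shape": list(values.shape),
        "dims": list(da.dims),  # assumption: not in the schema above
        # Keep only coordinates that label a dimension, as plain lists
        "coords": {
            d: np.asarray(da.coords[d]).tolist() for d in da.dims if d in da.coords
        },
        "attrs": dict(da.attrs),
        "base64": base64.b64encode(values.tobytes()).decode("ascii"),
    }


def dataarray_kwargs_from_json(obj: dict) -> dict:
    """Decode into keyword arguments suitable for xarray.DataArray(**kwargs)."""
    raw = base64.b64decode(obj["base64"])
    data = np.frombuffer(raw, dtype=obj["dtype"]).reshape(obj["shape"]).copy()
    return {
        "data": data,
        "dims": obj["dims"],
        "coords": obj["coords"],
        "attrs": obj["attrs"],
    }
```

With xarray installed, `xarray.DataArray(**dataarray_kwargs_from_json(obj))` should then give back a labelled array.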

Then in cases where the data is being stored separately, we just return a link to the object's binary file (.npz for numpy, netCDF for xarray).
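One way the embed-or-link decision could look for the numpy case is sketched below. The size threshold, the `href` key, and the function name are all assumptions for illustration; nothing in this thread fixes them. The compressed `.npz` writer is used here simply because compression was raised earlier in the discussion.

```python
import base64
from pathlib import Path

import numpy as np

INLINE_LIMIT_BYTES = 64 * 1024  # hypothetical cut-off between "small" and "big"


def describe_array(arr, store_dir: Path, name: str,
                   limit: int = INLINE_LIMIT_BYTES) -> dict:
    """Return JSON metadata with the data either embedded or linked."""
    arr = np.asarray(arr)
    meta = {"@type": "ndarray", "dtype": arr.dtype.str, "shape": list(arr.shape)}
    if arr.nbytes <= limit:
        # Small data: keep data and metadata together in one object
        meta["base64"] = base64.b64encode(arr.tobytes()).decode("ascii")
    else:
        # Large data: write a compressed .npz and reference it instead
        path = Path(store_dir) / f"{name}.npz"
        np.savez_compressed(path, data=arr)
        meta["href"] = path.name  # hypothetical key; a server would map this to a URL
    return meta
```

A client would check for `base64` first and fall back to fetching `href`, loading the linked file with `np.load(...)["data"]`.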

@ChasNelson1990

Looks sensible.

@glyg

glyg commented Nov 26, 2021

Hi! For now, is it OK to copy/paste the openflexure serialization code into one's own code base?
For smallish data, it seems sufficient to base64 encode an array directly in the response.

@jtc42
Member Author

jtc42 commented Nov 26, 2021

> Hi! For now, is it OK to copy/paste the openflexure serialization code into one's own code base?
> For smallish data, it seems sufficient to base64 encode an array directly in the response.

Yeah, you should be able to use whatever code you like as long as it's within the terms of the GPL license (https://gitlab.com/openflexure/openflexure-microscope-server/-/blob/master/LICENSE)

rwb27 added a commit to rwb27/python-labthings that referenced this issue May 15, 2024