Automatic serialization/deserialization of Numpy arrays #13
So JSON is definitely a good way to deal with dicts and even pd.DataFrame objects, but is it the right way to deal with np.ndarrays? Also, many of our arrays might be more suitable as xarray objects, in which case using HDF5 as a base type makes sense. Xarray recommends netCDF (based on HDF5): http://xarray.pydata.org/en/stable/io.html. Pandas has HDF5 support (it might need an additional package, can't remember), and there's https://www.h5py.org/ for more general use. Worth considering compression though, which I don't think is covered by most existing Python HDF5 packages.
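(For reference, h5py and xarray's netCDF writer do expose basic gzip/zlib compression, e.g. in this untested sketch, assuming the netCDF4 backend is installed:)

```python
import numpy as np
import xarray as xr
import h5py

arr = np.random.rand(64, 64)

# xarray: per-variable zlib compression via the netCDF4 backend.
da = xr.DataArray(arr, dims=("y", "x"), name="image")
da.to_dataset().to_netcdf(
    "image.nc", encoding={"image": {"zlib": True, "complevel": 4}}
)

# h5py: gzip-compressed dataset.
with h5py.File("image.h5", "w") as f:
    f.create_dataset("image", data=arr, compression="gzip")
```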
So the logic here is that, because of the structure of the API, and especially the websocket-based data event stuff, we should be able to include data within a JSON object. This is similar to how something like OME-XML works, in that it has XML metadata which contains a binary blob. I'm more than happy to have suggestions on encoding formats, but whatever we choose, it would be nice if it could be sensibly embedded within a JSON object. np.ndarrays are nice because they're just C-contiguous arrays. The encoding we use in the OFM software gives you type information, array dimensions, and a base64-encoded binary blob of the array, which means it can be used outside of Python if you want. That said, I'm not married to the idea. We could move to a model of the JSON linking to a separate binary file, but especially for small data sets, there's something appealing about the data and its metadata all being contained within a single object. As usual, open to suggestions.
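For concreteness, a rough sketch of that sort of encoding (untested; function names are illustrative, not the actual utilities.py implementation):

```python
import base64
import numpy as np

def encode_ndarray(arr: np.ndarray) -> dict:
    """Pack a numpy array into a JSON-serialisable dict."""
    arr = np.ascontiguousarray(arr)  # ensure C-contiguous bytes
    return {
        "dtype": arr.dtype.str,  # e.g. "<f8", includes byte order
        "shape": list(arr.shape),
        "base64": base64.b64encode(arr.tobytes()).decode("ascii"),
    }

def decode_ndarray(obj: dict) -> np.ndarray:
    """Inverse of encode_ndarray."""
    data = base64.b64decode(obj["base64"])
    # note: frombuffer gives a read-only view; add .copy() if mutation is needed
    return np.frombuffer(data, dtype=np.dtype(obj["dtype"])).reshape(obj["shape"])
```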
Hmm... that makes sense, I guess. A quick Google on why not to use HDF5 came up with https://cyrille.rossant.net/moving-away-hdf5/. For reference, this matches the xarray system too, I believe: the actual array is just a numpy array, and the xarray object basically wraps it with a 'metadata' layer, like column names in a pd.Series.
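(i.e. something like this, as an untested sketch:)

```python
import numpy as np
import xarray as xr

arr = np.zeros((2, 3))
da = xr.DataArray(
    arr,
    dims=("y", "x"),
    coords={"x": [0, 1, 2]},
    attrs={"units": "counts"},  # the 'metadata' layer
)
print(da.data is arr)  # True: xarray wraps the numpy array without copying
```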
Oh that's super useful, thanks! So it might be that we just add in xarray support (it does seem really sensible), which would serialise along the lines of the sketch below: one representation for a plain ndarray, and one for an xarray that wraps it.
Then in cases where the data is being stored separately, we just return a link to the binary object file (.npz for numpy, netCDF for xarray).
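(The example serialisations from the original comment didn't survive; the following is an illustrative guess at what was proposed, with all field names hypothetical:)

```python
# Illustrative guess only; field names are hypothetical, not taken from
# the original comment or the OFM implementation.
ndarray_json = {
    "@type": "ndarray",
    "dtype": "<f8",   # numpy dtype string
    "shape": [2, 3],
    "base64": "...",  # C-contiguous bytes, base64-encoded
}

xarray_json = {
    "@type": "xarray",
    "dims": ["y", "x"],
    "coords": {"y": [0, 1], "x": [0, 1, 2]},
    "attrs": {"units": "counts"},  # the 'metadata' layer
    "data": ndarray_json,          # the wrapped numpy array, as above
}
```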
Looks sensible.
Hi! For now, is it OK to copy/paste the openflexure serialization code into one's own code base?
Yeah, you should be able to use whatever code you like, as long as you comply with the GPL license (https://gitlab.com/openflexure/openflexure-microscope-server/-/blob/master/LICENSE).
See https://gitlab.com/openflexure/openflexure-microscope-server/-/blob/master/openflexure_microscope/utilities.py#L25
We should introduce common scientific data types into the spec, along with sensible ways to (de)serialize them.
Start with Numpy arrays (we've already done this; see the link above). A sketch of what a spec entry could look like follows below.
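A hypothetical sketch of how the spec might describe the ndarray type (JSON Schema written as a Python dict; field names follow the OFM-style encoding discussed above, but are assumptions, not taken from the spec):

```python
ndarray_schema = {
    "type": "object",
    "properties": {
        "dtype": {"type": "string"},  # numpy dtype string, e.g. "<f8"
        "shape": {"type": "array", "items": {"type": "integer"}},
        "base64": {"type": "string"},  # raw C-contiguous bytes, base64
    },
    "required": ["dtype", "shape", "base64"],
}
```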
Discussion: What other scientific data types might be useful?