[v3] Fixed-width unicode string support in zarr v3 #2347
Comments
I'm not sure if I'll be able to work on this, but here's some notes on the V2 behavior, and some things:
>>> import numpy as np
>>> import zarr
>>> import json
>>> b = np.array([b'a', b'bb', b'ccc'])
>>> u = np.array(['a', 'bb', 'ccc'])
>>> store = {}
>>> zarr.array(b, store=store, path="bytes", compressor=None)
>>> zarr.array(u, store=store, path="unicode", compressor=None)
>>> print(json.loads(store['bytes/.zarray'])['dtype'])
# |S3
>>> print(json.loads(store['unicode/.zarray'])['dtype'])
# <U3
assert store['bytes/0'] == b.tobytes()
assert store['unicode/0'] == u.tobytes()
NumPy uses 32-bit UCS-4 codepoints for Unicode data ref. (I think that
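To make the layout behind that last assert concrete, here is a minimal pure-Python sketch of what a `<U3` chunk looks like as bytes, assuming little-endian UCS-4 with NUL padding as described above. The helper name `encode_fixed_width_unicode` is hypothetical and is not part of zarr or NumPy; on a little-endian machine its output should match `u.tobytes()` from the snippet above.

```python
def encode_fixed_width_unicode(strings, width, byteorder="<"):
    """Encode strings as NUL-padded UCS-4 code points, `width` characters each.

    Hypothetical helper for illustration only (not zarr or NumPy API).
    """
    # The "-le"/"-be" codecs do not emit a byte order mark.
    codec = "utf-32-le" if byteorder == "<" else "utf-32-be"
    out = bytearray()
    for s in strings:
        if len(s) > width:
            raise ValueError(f"{s!r} is longer than {width} characters")
        # Pad with NUL code points so every element occupies width * 4 bytes.
        out += s.ljust(width, "\x00").encode(codec)
    return bytes(out)


chunk = encode_fixed_width_unicode(["a", "bb", "ccc"], width=3)
print(len(chunk))   # 36 = 3 elements * 3 characters * 4 bytes
print(chunk[:12])   # first element: 'a' followed by two NUL code points
```

So every element costs `width * 4` bytes regardless of its actual length, which is the trade-off against variable-length strings.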
Given it doesn't look like this functionality will get into 3.0, it looks like this breaking change is something else to add to #2596
Leaving a link to this here as I did not find this on the page: https://hackmd.io/@ivirshup/SkdO2szas
Today I tried to save an xarray dataset with fixed-width strings on a dimension and they were automatically converted at write time to variable-length string objects:
import xarray as xr
import numpy as np
import zarr
x_size = 10
y_size = 5
band_size = 3
x = np.linspace(0, 1, x_size, dtype=np.float64)
y = np.linspace(0, 1, y_size, dtype=np.float64)
bands = np.array(['B1', 'B2', 'B3'], dtype='<U2') # Fixed-length string dtype
data = np.random.rand(band_size, y_size, x_size)
ds = xr.Dataset(
{
"data_variable": (("band", "y", "x"), data),
},
coords={
"x": x,
"y": y,
"band": bands,
},
)
store = zarr.storage.LocalStore('test_data')
ds.to_zarr(store, zarr_format=3, consolidated=False)
root = zarr.open_group(store)
root['band'].metadata
# ArrayV3Metadata(shape=(3,), data_type=<DataType.string: 'string'>, chunk_grid=RegularChunkGrid(chunk_shape=(3,)), chunk_key_encoding=DefaultChunkKeyEncoding(name='default', separator='/'), fill_value='', codecs=(VLenUTF8Codec(), ZstdCodec(level=0, checksum=False)), attributes={}, dimension_names=('band',), zarr_format=3, node_type='array', storage_transformers=())
Ideally this conversion would not happen.
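For anyone wanting to check on the NumPy side whether an array is still fixed-width after a round trip, a rough sketch using only `dtype.kind` and `dtype.itemsize` might look like the following. The function name `describe_string_dtype` is made up for this example, and the object-dtype case is simulated with `astype(object)` rather than an actual zarr read.

```python
import numpy as np


def describe_string_dtype(arr):
    """Classify how an array represents strings, based on its dtype kind.

    Hypothetical helper for illustration only.
    """
    kind = arr.dtype.kind
    if kind == "U":
        # Fixed-width unicode: 4 bytes (one UCS-4 code point) per character.
        return f"fixed-width unicode, {arr.dtype.itemsize // 4} chars"
    if kind == "S":
        return f"fixed-width bytes, {arr.dtype.itemsize} bytes"
    if kind == "O":
        return "object (variable-length Python objects)"
    return f"other ({arr.dtype})"


bands = np.array(["B1", "B2", "B3"], dtype="<U2")
print(describe_string_dtype(bands))                 # fixed-width unicode, 2 chars
print(describe_string_dtype(bands.astype(object)))  # object (variable-length Python objects)
```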
Zarr version
v3
Numcodecs version
na
Python Version
na
Operating System
na
Installation
na
Description
Mentioned in #2323 (comment), right now we can't create a fixed-width string dtype in zarr v3.
We would want the NumPy dtype of that array to be U3, a fixed-width unicode string dtype. We'd want to support this in addition to the variable width strings being used currently. Some initial questions I don't know the answer to:
data_type shows up in the metadata?
Steps to reproduce
.
Additional output
No response