Nan values when saving parq files with virtualize.to_kerchunk() #339
Hey there @QuentinMaz! Thanks for trying out VirtualiZarr and opening up a clear MRE. Definitely seems like an issue. On some initial digging it seems like:

I'll try to dig into this further. In the meantime, if you're open to working a bit on the bleeding edge, you could try writing the references to Icechunk. It might take a bit of environment-fu, since Kerchunk doesn't yet support Zarr V3. To keep a single environment, you could use the new Zarr-V3-compliant HDF5 reader, then write to Icechunk:

```python
from virtualizarr import open_virtual_dataset
from virtualizarr.readers.hdf import HDFVirtualBackend

vds = open_virtual_dataset('file.nc', backend=HDFVirtualBackend)
```
Thanks for the comments and answer @norlandrhagen! Even though I am pretty sure I am not skilful enough to help with your investigations, I will keep an eye on the status of the issue to (try to) follow your progress :) Good luck!
I just ran into this bug (latest version, virtualizarr 1.3.0) and can pin it down to different behavior when inlining variables. See this slightly modified example, using floats for depth and adding a time dimension (confirming #352 is indeed the same issue):

```python
import numpy as np
import pandas as pd
import xarray as xr
from virtualizarr import open_virtual_dataset

filename = "test"
temperature = 15 + 8 * np.random.randn(2, 2, 3, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
depths = np.arange(150.0, step=50)
times = pd.date_range('2018-01-01', '2018-01-03')

da = xr.DataArray(
    data=temperature,
    dims=["x", "y", "depth", "time"],
    coords=dict(
        lon=(["x", "y"], lon),
        lat=(["x", "y"], lat),
        depth=depths,
        time=times,
    ),
    attrs=dict(
        description="Ambient temperature.",
        units="degC",
    ),
)
ds = da.to_dataset(name="temperature")
ds.to_netcdf(f"{filename}.nc")
```

Case 1: No inlining, no problem

```python
vds = open_virtual_dataset(
    f"{filename}.nc",
    indexes={},
)
# depth/.zarray -> "fill_value": NaN
# depth/.zattrs -> n/a
# time/.zarray  -> "fill_value": -9223372036854775806
# time/.zattrs  -> n/a
vds.virtualize.to_kerchunk(f"{filename}.json", format='json')
loaded_ds = xr.open_dataset(f"{filename}.json", engine="kerchunk", chunks={})
print("depths loaded vds:\t", loaded_ds.depth.to_numpy())
print("times loaded vds:\t", loaded_ds.time.to_numpy())
```

Case 2: Inline depth & time (which "start with 0"), giving incorrect NaNs

```python
vds = open_virtual_dataset(
    f"{filename}.nc",
    indexes={},
    loadable_variables=['depth', 'time'],
)
# depth/.zarray -> "fill_value": 0.0
# depth/.zattrs -> "_FillValue": NaN
# time/.zarray  -> "fill_value": 0
# time/.zattrs  -> n/a
vds.virtualize.to_kerchunk(f"{filename}.json", format='json')
loaded_ds = xr.open_dataset(f"{filename}.json", engine="kerchunk", chunks={})
print("depths loaded vds:\t", loaded_ds.depth.to_numpy())
print("times loaded vds:\t", loaded_ds.time.to_numpy())

# NOTE: a non-optimal workaround for this is to set mask_and_scale=False when reading!
print('-----')
loaded_ds = xr.open_dataset(f"{filename}.json", engine="kerchunk", chunks={}, mask_and_scale=False)
print("depths loaded vds:\t", loaded_ds.depth.to_numpy())
print("times loaded vds:\t", loaded_ds.time.to_numpy())
```

I didn't really follow the logic in assigning self.fill_value ... Might this be properly fixed by #414?
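The mechanism above can be illustrated without VirtualiZarr or Kerchunk at all. This is a minimal sketch of the assumed cause: the inlined `depth` data starts at 0.0 while the written references record a fill value of 0, so xarray's CF decoding masks that first legitimate 0.0 to NaN. The same call with `mask_and_scale=False` shows the workaround.

```python
import numpy as np
import xarray as xr

# Mimic the collision: real data whose first value equals the fill value.
raw = xr.Dataset({"depth": ("depth", np.array([0.0, 50.0, 100.0]))})
raw["depth"].attrs["_FillValue"] = 0.0

# Default CF decoding masks values equal to _FillValue -> first depth becomes nan.
decoded = xr.decode_cf(raw)
print(decoded["depth"].to_numpy())

# The workaround from the comment above: skip masking entirely.
undecoded = xr.decode_cf(raw, mask_and_scale=False)
print(undecoded["depth"].to_numpy())
```

This is only a model of the behavior, not the VirtualiZarr code path itself; the real bug is in which fill value ends up in the `.zarray` metadata for inlined variables.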
Hi,

I have used `virtualizarr` to concatenate several `.nc` files into one `parq` one. I noticed that when I then open the saved dataset, the first value of its index is replaced with `nan`. I thus suspect that `virtualize.to_kerchunk()` might have a bug. Here is how to replicate the issue:

I am a beginner and have therefore no idea of the cause...