Nan values when saving parq files with virtualize.to_kerchunk() #339
Hey there @QuentinMaz! Thanks for trying out VirtualiZarr and opening up a clear MRE. Definitely seems like an issue. On some initial digging it seems like:

I'll try to dig into this further. In the meantime, if you're open to working a bit on the bleeding edge, you could try writing the references to Icechunk. It might take a bit of environment-fu, since Kerchunk doesn't yet support Zarr V3. To keep a single environment, you could use the new Zarr-V3-compliant HDF5 reader, then write to Icechunk:

```python
from virtualizarr import open_virtual_dataset
from virtualizarr.readers.hdf import HDFVirtualBackend

vds = open_virtual_dataset('file.nc', backend=HDFVirtualBackend)
```
Thanks for the comments and answer @norlandrhagen! Even though I am pretty sure I am not skilful enough to help with your investigations, I will keep an eye on the status of the issue to (try to) follow your progress :) Good luck!
I just ran into this bug (latest version, virtualizarr 1.3.0) and can pin it down to different behavior when inlining variables. See this slightly modified example, using floats for depth and adding a time dimension (confirming #352 is indeed the same issue):

```python
import numpy as np
import pandas as pd
import xarray as xr
from virtualizarr import open_virtual_dataset

filename = "test"
temperature = 15 + 8 * np.random.randn(2, 2, 3, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
depths = np.arange(150.0, step=50)
times = pd.date_range('2018-01-01', '2018-01-03')

da = xr.DataArray(
    data=temperature,
    dims=["x", "y", "depth", "time"],
    coords=dict(
        lon=(["x", "y"], lon),
        lat=(["x", "y"], lat),
        depth=depths,
        time=times,
    ),
    attrs=dict(
        description="Ambient temperature.",
        units="degC",
    ),
)
ds = da.to_dataset(name="temperature")
ds.to_netcdf(f"{filename}.nc")
```

Case 1: No inlining, no problem

```python
vds = open_virtual_dataset(
    f"{filename}.nc",
    indexes={},
)
# depth/.zarray -> "fill_value": NaN
# depth/.zattrs -> n/a
# time/.zarray  -> "fill_value": -9223372036854775806
# time/.zattrs  -> n/a
vds.virtualize.to_kerchunk(f"{filename}.json", format='json')
loaded_ds = xr.open_dataset(f"{filename}.json", engine="kerchunk", chunks={})
print("depths loaded vds:\t", loaded_ds.depth.to_numpy())
print("times loaded vds:\t", loaded_ds.time.to_numpy())
```

Case 2: Inline depth & time (which "start with 0"), giving incorrect NaNs

```python
vds = open_virtual_dataset(
    f"{filename}.nc",
    indexes={},
    loadable_variables=['depth', 'time'],
)
# depth/.zarray -> "fill_value": 0.0
# depth/.zattrs -> "_FillValue": NaN
# time/.zarray  -> "fill_value": 0
# time/.zattrs  -> n/a
vds.virtualize.to_kerchunk(f"{filename}.json", format='json')
loaded_ds = xr.open_dataset(f"{filename}.json", engine="kerchunk", chunks={})
print("depths loaded vds:\t", loaded_ds.depth.to_numpy())
print("times loaded vds:\t", loaded_ds.time.to_numpy())

# NOTE: a non-optimal workaround for this is to set mask_and_scale=False when reading!
print('-----')
loaded_ds = xr.open_dataset(f"{filename}.json", engine="kerchunk", chunks={}, mask_and_scale=False)
print("depths loaded vds:\t", loaded_ds.depth.to_numpy())
print("times loaded vds:\t", loaded_ds.time.to_numpy())
```

I didn't really follow the logic in assigning self.fill_value ... Might this be properly fixed by #414?
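The mechanism above can be illustrated without VirtualiZarr or Kerchunk at all. This is a minimal sketch of the assumed cause: the inlined `depth` data starts at 0.0 while the written references record a fill value of 0, so xarray's CF decoding masks that first legitimate 0.0 to NaN. The same call with `mask_and_scale=False` shows the workaround.

```python
import numpy as np
import xarray as xr

# Mimic the collision: real data whose first value equals the fill value.
raw = xr.Dataset({"depth": ("depth", np.array([0.0, 50.0, 100.0]))})
raw["depth"].attrs["_FillValue"] = 0.0

# Default CF decoding masks values equal to _FillValue -> first depth becomes nan.
decoded = xr.decode_cf(raw)
print(decoded["depth"].to_numpy())

# The workaround from the comment above: skip masking entirely.
undecoded = xr.decode_cf(raw, mask_and_scale=False)
print(undecoded["depth"].to_numpy())
```

This is only a model of the behavior, not the VirtualiZarr code path itself; the real bug is in which fill value ends up in the `.zarray` metadata for inlined variables.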
Hi,

I have used `virtualizarr` to concatenate several `.nc` files into one `parq` one. I noticed that when I then open the saved dataset, the first value of its index is replaced with `nan`. I thus suspect that `virtualize.to_kerchunk()` might have a bug. Here is how to replicate the issue:

I am a beginner and have therefore no idea of the cause...