TypeError: ufunc 'isnan' not supported for the input types #356

Open
TomNicholas opened this issue Dec 17, 2024 · 4 comments
Labels: bug (Something isn't working), readers

Comments

@TomNicholas (Member)

Trying to open a local copy of one of the files comprising the CWorthy OAE atlas (xref #132) (stored in the cloud here) with @sharkinsspatial's HDF reader raises an error:

In [4]: vds = open_virtual_dataset('../experimentation/virtualizarr/oae/alk-forcing.000-1999-01.pop.h.347.nc', backend=HDFVirtualBackend)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 vds = open_virtual_dataset('../experimentation/virtualizarr/oae/alk-forcing.000-1999-01.pop.h.347.nc', backend=HDFVirtualBackend)

File ~/Documents/Work/Code/virtualizarr/virtualizarr/backend.py:217, in open_virtual_dataset(filepath, filetype, group, drop_variables, loadable_variables, decode_times, cftime_variables, indexes, virtual_array_class, virtual_backend_kwargs, reader_options, backend)
    214 if backend_cls is None:
    215     raise NotImplementedError(f"Unsupported file type: {filetype.name}")
--> 217 vds = backend_cls.open_virtual_dataset(
    218     filepath,
    219     group=group,
    220     drop_variables=drop_variables,
    221     loadable_variables=loadable_variables,
    222     decode_times=decode_times,
    223     indexes=indexes,
    224     virtual_backend_kwargs=virtual_backend_kwargs,
    225     reader_options=reader_options,
    226 )
    228 return vds

File ~/Documents/Work/Code/virtualizarr/virtualizarr/readers/hdf/hdf.py:64, in HDFVirtualBackend.open_virtual_dataset(filepath, group, drop_variables, loadable_variables, decode_times, indexes, virtual_backend_kwargs, reader_options)
     55 drop_variables, loadable_variables = check_for_collisions(
     56     drop_variables,
     57     loadable_variables,
     58 )
     60 filepath = validate_and_normalize_path_to_uri(
     61     filepath, fs_root=Path.cwd().as_uri()
     62 )
---> 64 virtual_vars = HDFVirtualBackend._virtual_vars_from_hdf(
     65     path=filepath,
     66     group=group,
     67     drop_variables=drop_variables + loadable_variables,
     68     reader_options=reader_options,
     69 )
     71 loadable_vars, indexes = open_loadable_vars_and_indexes(
     72     filepath,
     73     loadable_variables=loadable_variables,
   (...)
     78     decode_times=decode_times,
     79 )
     81 attrs = HDFVirtualBackend._get_group_attrs(
     82     path=filepath, reader_options=reader_options, group=group
     83 )

File ~/Documents/Work/Code/virtualizarr/virtualizarr/readers/hdf/hdf.py:354, in HDFVirtualBackend._virtual_vars_from_hdf(path, group, drop_variables, reader_options)
    352 if key not in drop_variables:
    353     if isinstance(g[key], Dataset):
--> 354         variable = HDFVirtualBackend._dataset_to_variable(path, g[key])
    355         if variable is not None:
    356             variables[key] = variable

File ~/Documents/Work/Code/virtualizarr/virtualizarr/readers/hdf/hdf.py:284, in HDFVirtualBackend._dataset_to_variable(path, dataset)
    282 if isinstance(fill_value, np.ndarray):
    283     fill_value = fill_value[0]
--> 284 if np.isnan(fill_value):
    285     fill_value = float("nan")
    286 if isinstance(fill_value, np.generic):

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

The error message is not very helpful. NumPy could at least have told me what type it received, and the HDF reader could have added context about which variable it was trying to parse when it failed (so that I could choose to load that variable instead).
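For reference, this failure mode can be reproduced in isolation: `np.isnan` raises exactly this `TypeError` for non-numeric scalars, e.g. fixed-width byte strings, which HDF5 attributes can legitimately contain (illustrative only; the actual offending type in this issue is not shown by the error). A minimal guard would check the type first:

```python
import numpy as np

# np.isnan is fine for floats...
print(np.isnan(np.float32("nan")))  # True

# ...but raises the TypeError from the traceback for types it cannot
# safely cast, such as a fixed-width byte string.
try:
    np.isnan(np.bytes_(b""))
except TypeError as e:
    print(type(e).__name__, e)

# A defensive guard: only test for NaN on floating-point values.
def is_nan_fill(value):
    return isinstance(value, (float, np.floating)) and bool(np.isnan(value))

print(is_nan_fill(float("nan")))    # True
print(is_nan_fill(np.bytes_(b"")))  # False
```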

@TomNicholas added the bug (Something isn't working) and readers labels on Dec 17, 2024
@sharkinsspatial (Collaborator)

@TomNicholas This will be fixed in an upcoming PR. Based on our discussions with @rabernat, I have been misinterpreting the semantic relationship between the HDF definition of fillvalue and the Zarr definition of fill_value (which, as @rabernat elaborated, has changed over time and has been further complicated by v3 changes; see pydata/xarray#5475 for an excellent discussion of the topic). To repeat @rabernat's advice:

  • The HDF fillvalue definition is reserved for the return value for uninitialized chunks, i.e. in the case of partially written chunks or
    when a dataset is created but not yet populated with actual data.

  • In the VirtualiZarr context we should be using the CF convention _FillValue (if present) to populate our zarray metadata fill value.

So this block will be completely removed: https://github.com/zarr-developers/VirtualiZarr/blob/main/virtualizarr/readers/hdf/hdf.py#L282-L287. I have the changes made during the hack day and I'll try to submit a PR tomorrow.
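The resulting selection logic might look roughly like this (a hedged sketch of the advice above, not the actual PR; `choose_zarr_fill_value` is a hypothetical helper, not part of VirtualiZarr's API):

```python
import numpy as np

# Prefer the CF _FillValue attribute when populating the zarray
# metadata fill value; the HDF5-level fillvalue only describes
# uninitialized chunks, so it is not treated as a missing-data marker.
def choose_zarr_fill_value(attrs, hdf_fillvalue):
    """attrs: the variable's HDF5 attributes as a dict;
    hdf_fillvalue: the h5py Dataset.fillvalue (ignored here)."""
    if "_FillValue" in attrs:
        return attrs["_FillValue"]
    # No CF fill value present: leave the Zarr fill value unset
    # rather than reusing hdf_fillvalue.
    return None

print(choose_zarr_fill_value({"_FillValue": np.float32(-9999.0)}, 0.0))  # -9999.0
print(choose_zarr_fill_value({}, 0.0))  # None
```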

@TomNicholas (Member, Author)

Wonderful, thank you @sharkinsspatial! Does that mean there is a pre-existing issue for this?

the HDF reader could have added context about which variable it was trying to parse when it failed (so that I could choose to load it instead).

Separately this might still be useful, but could be tracked in another issue.

@sharkinsspatial (Collaborator)

@TomNicholas This is a good point, which we also hit when dealing with the other null fillvalue case. In a separate PR I'll try to wrap exceptions with additional per-variable information so that users can diagnose failures more easily.
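One possible shape for that wrapping (hypothetical names, not the actual VirtualiZarr code): re-raise any per-variable parsing failure with the variable name attached, so users know what to add to loadable_variables.

```python
# Hypothetical sketch of per-variable error wrapping; parse_variables
# and parse_one are illustrative names, not VirtualiZarr's API.
def parse_variables(group, parse_one):
    variables = {}
    for name, dataset in group.items():
        try:
            variables[name] = parse_one(dataset)
        except Exception as err:
            # Chain the original exception so the traceback survives,
            # but name the variable that failed.
            raise RuntimeError(
                f"Failed to parse variable {name!r}; consider adding it "
                "to loadable_variables or drop_variables"
            ) from err
    return variables

# Toy demonstration with a parser that fails on one variable.
def parse_one(ds):
    if ds is None:
        raise TypeError("ufunc 'isnan' not supported for the input types")
    return ds

try:
    parse_variables({"ok": 1.0, "bad": None}, parse_one)
except RuntimeError as e:
    print(e)  # the error message names the failing variable 'bad'
```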

@abarciauskas-bgse (Collaborator)

Based on our discussion in the VirtualiZarr meeting on Friday, I read through some documentation and pydata/xarray#5475 and would like to summarize my understanding here. I do have 2 minor questions, but I think most of it is otherwise clear to me.

Fill value use cases

As discussed here and in pydata/xarray#5475, there are 2 use cases for the generic concept of fill value:

  1. uninitialized data: the fill value is used to "pre-fill disk space allocated to the variable"
  2. missing data: the fill value represents undefined or missing values and can be used as a mask

Fill value representation

And how does each format represent a fill value for each use case?

  1. HDF: This fill value is unambiguously defined for HDF as being for the first use case, uninitialized data (HDF5 docs, h5py Dataset.fillvalue).
  2. NetCDF: From what I can tell, the NetCDF attribute _FillValue is used for both use cases. From the docs:

    The _FillValue attribute specifies the fill value used to pre-fill disk space allocated to the variable...
    (later) Generic applications often need to write a value to represent undefined or missing values. The fill value provides an appropriate value for this purpose because it is normally outside the valid range and therefore treated as missing when read by generic applications.

  3. Zarr V3: in pydata/xarray#5475 (Is _FillValue really the same as zarr's fill_value?), Ryan Abernathey suggests using the zarr.array.fill_value property to store the value to be used for uninitialized data and the _FillValue attribute to store the value to be used for missing data.
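The dual representation in (3) can be sketched as metadata construction (a hedged sketch; `to_zarr_v3_metadata` is a hypothetical helper, not a real zarr API):

```python
# The missing-data value is recorded in both places: as the
# array-level fill_value (returned for uninitialized chunks) and as
# the _FillValue attribute (used by CF-aware readers such as xarray
# for masking).
def to_zarr_v3_metadata(data_type, cf_fill_value):
    return {
        "data_type": data_type,
        "fill_value": cf_fill_value,
        "attributes": {"_FillValue": cf_fill_value},
    }

meta = to_zarr_v3_metadata("float32", -9999.0)
print(meta["fill_value"] == meta["attributes"]["_FillValue"])  # True
```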

Questions

  1. Is it ever the case that HDF5 or NetCDF-4 could have different values for uninitialized data and missing data, and how would that be represented in those formats?
  2. My understanding now is that when parsing NetCDF-4 or HDF5 to zarr V3 arrays (either virtual or native), the fillvalue (in the case of HDF) and _FillValue (in the case of NetCDF) would be stored as the fill_value property AND the _FillValue attribute of the corresponding zarr V3 array so that xarray can infer that the _FillValue attribute could be used as a mask. Is that correct?
  3. I'm still trying to parse xarray's encoding/decoding logic around fill values, so I don't have a specific question there yet. But it seems clear that xarray, in creating an xarray.Variable from a NetCDF backend, takes the _FillValue attribute of the NetCDF dataset and stores it as a key/value pair in the Variable's encoding, not in the Variable's attributes. I just can't tell exactly where the _FillValue addition to the encoding dict happens yet, apart from seeing where it is removed from attrs in https://github.com/pydata/xarray/blob/main/xarray/backends/h5netcdf_.py. So if anyone wants to shed light on this, I would be interested.
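For what it's worth, the attrs-to-encoding move can be sketched conceptually (loosely modeled on what xarray's CFMaskCoder does during decoding; this is a simplification, not the actual xarray code):

```python
import numpy as np

# _FillValue is popped out of attrs into encoding, and matching data
# values are masked with NaN, so the attribute round-trips on write
# but does not appear in the decoded Variable's attrs.
def decode_cf_fill_value(data, attrs, encoding):
    attrs = dict(attrs)
    encoding = dict(encoding)
    if "_FillValue" in attrs:
        fv = attrs.pop("_FillValue")
        encoding["_FillValue"] = fv
        data = np.where(data == fv, np.nan, data)
    return data, attrs, encoding

data, attrs, encoding = decode_cf_fill_value(
    np.array([1.0, -9999.0, 3.0]), {"_FillValue": -9999.0}, {}
)
print(attrs)     # {} -- _FillValue moved out of attrs...
print(encoding)  # {'_FillValue': -9999.0} -- ...into encoding
```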
