feature: Using an xarray coordinate dimension as string for station_id, timestamps for time #2
Comments
Yeah we could move to strings in station_name for sure. The reason we didn't at the time was lack of compatibility with the Fortran netCDF library. For this reason (and for compatibility with older scripts, many of which are still in use), it might be better to move to a new variable name: perhaps station_str or similar.
It would be ideal to preserve lazy loading if possible for the main data variables - time, lead_time, realization, station don't necessarily use large amounts of memory on their own, but the main data variables can.
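A minimal sketch of what this could look like, assuming a hypothetical file name and a `station_id` coordinate (neither is taken from the spec): the small coordinate variables are converted eagerly, while the main data variables stay dask-backed and lazy.

```python
import xarray as xr

# Open lazily: with chunks given, data variables are dask-backed and nothing
# large is read into memory yet.
ds = xr.open_dataset("stf_forecasts.nc", chunks={})

# Coordinate variables are small, so converting them eagerly is cheap;
# the main data variables remain lazy.
ds = ds.assign_coords(station_id=ds["station_id"].astype(str))
```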
@derob I don't know about making station_id a string. Isn't swift partly predicated on using station_id as integers to construct node-link networks? If anything, station_name would be a good candidate to convert to a string (as was originally intended, but couldn't be done due to Fortran issues), but as I said you may run into backwards compatibility issues.
We tend to use integers for station_id but they are referenced as strings, in my understanding... The onboarding python example uses a single subarea model with the use of subarea.Subarea
Thank you for your inputs, much appreciated. To be clear, I was not suggesting any change to the STF netCDF convention, only to the "python, in-memory xarray" representation. Writing an STF-2.0 compatible netCDF file to disk is a non-negotiable feature. @derob indeed swift2 nodes, links and subareas are all identified with strings, which can also cater for a legacy integer representation of course.
I have been defining
So loading station_id as strings as @jmp75 has suggested is no problem; saving them as strings in the .nc files, which I think is what Andrew is doing, implies a change in spec and might cause issues with backwards compatibility (at least if we keep those variable names). Not necessarily saying I'm against it; this might be the preferable way to go with v3.0 (or whatever) of the spec.
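A hedged sketch of that split, with made-up file names and assuming `station_id` is already the dimension coordinate: strings are used in memory only, and the integer representation is restored before writing so the file still matches the current spec.

```python
import xarray as xr

ds = xr.open_dataset("stf_forecasts.nc")  # station_id stored as integers, per the current spec

# Strings in memory only, for convenient label-based selection
ds_mem = ds.assign_coords(station_id=ds["station_id"].astype(str))
subset = ds_mem.sel(station_id="123456")

# When persisting, fall back to the integer dtype so the on-disk file
# still follows the existing convention.
ds_out = ds_mem.assign_coords(station_id=ds_mem["station_id"].astype("int32"))
ds_out.to_netcdf("stf_forecasts_out.nc")
```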
Very happy to improve the spec for both usability and CF compatibility. Unfortunately a lot of those conventions are dominated by the meteorological community, who tend to favour spatial coverage over time when saving data (so you tend to get large spatial coverage in files with only a single forecast). Keeping many forecasts in one file is very beneficial for us though: both for storage and ease of verification. Suspect it ultimately may be a losing fight.
@jbcsiro one of the really nice things about xarray, which we should probably think about when developing the package, is the ability to automagically read multiple files on disk into a single in-memory object... There are issues, of course, with managing large numbers of small files, but there is probably a balance, and maybe single-forecast files are ok in the future from an analysis perspective.
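For illustration only (the file pattern and layout are hypothetical), xarray's multi-file reading looks roughly like this:

```python
import xarray as xr

# Hypothetical layout: one netCDF file per forecast issue time
ds = xr.open_mfdataset(
    "forecasts/stf_*.nc",
    combine="by_coords",  # align the files on their coordinate values
    parallel=True,        # open files in parallel via dask
)
```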
That is an interesting topic: when to separate into individual files, and when not to separate and instead persist all associated data variables into one file. Moving one large file about has challenges, but keeping order with separate files is also difficult. At the moment I feel safer and am happy to take the hit with one large file stored on a central network drive.
There are pros and cons of large and small files. I generally prefer large files, but sometimes it is more efficient computationally to generate many small files. There can be limits on file numbers on some storage systems, which means that many small files aren't a good idea. zarr 'files' are actually just a folder of many small files that chunk data in a particular way, so there needs to be some careful thought about the most appropriate chunking on disk for zarr.
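As a sketch only (chunk sizes, dimension names and paths are placeholders, not recommendations), writing a dataset to zarr with explicit on-disk chunking could look like:

```python
import xarray as xr

ds = xr.open_dataset("stf_forecasts.nc")

# Pick chunks to match the dominant access pattern, e.g. all stations for a
# block of forecast times; -1 means one chunk spanning the whole dimension.
ds.chunk({"time": 100, "station": -1}).to_zarr("stf_forecasts.zarr", mode="w")
```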
From https://docs.dask.org/en/stable/array-chunks.html:
So I don't think the files on disk need to be chunked along the same lines as the in-memory chunks, although it might be better/advantageous. If the number of files becomes limiting, it could be reduced by increasing the file size while keeping the in-memory chunking the same. See also https://docs.dask.org/en/stable/array-chunks.html#loading-chunked-data
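A small illustration of that point, with made-up paths and chunk sizes: the dask chunking used in memory does not have to match the file boundaries.

```python
import xarray as xr

# Many small single-forecast files on disk...
ds = xr.open_mfdataset("forecasts/stf_*.nc", combine="by_coords")

# ...processed with larger in-memory dask chunks than the per-file boundaries
ds = ds.chunk({"time": 1000})
```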
Hydrological forecasts are often produced at the catchment scale, so saving individual forecasts produces too many files relative to file size IMO. When it comes time to move or archive those forecasts this is very annoying, especially when you can just save e.g. a month's worth of forecasts in a single file (assuming you have the spec to do it) and make all of that a lot easier. This is generally not the case with meteorological forecasts, especially for global models.
Is your feature request related to a problem? Please describe.
Loading an STF netCDF file from disk via xarray, as-is and without further processing, leads to the following in-memory dataset:
These days, such a dataset would arguably be designed using strings wherever possible for coordinates. In particular, slicing through data to retrieve data for a given station can be done with the statement `rain.sel(station_id="123456A")`. Trying to do a similar selection on the data set as loaded, `rain.sel(station_id=123456)`, would lead to `KeyError: "'station_id' is not a valid dimension or coordinate"`.
A similar observation can be made for the `time` dimension needing to use a date/time representation, perhaps even more so than for `station_id`. Note also that having coordinates that start at one (i.e. Fortran indexing) rather than zero (C/Python) is not inconsequential in practice; this is notoriously fertile ground for "off by one" bugs when slicing data for use in modelling.
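A minimal sketch of the in-memory conversion described above, assuming `station_id` is stored as an integer variable along a `station` dimension and that times decode under CF conventions (the names and the identifier "123456" are assumptions, not taken from the spec):

```python
import xarray as xr

rain = xr.open_dataset("stf_rain.nc")  # CF time decoding happens by default when units allow

# Promote station_id to the dimension coordinate, then convert it to strings
rain = rain.set_coords("station_id").swap_dims({"station": "station_id"})
rain = rain.assign_coords(station_id=rain["station_id"].astype(str))

rain.sel(station_id="123456")                     # label-based station selection
rain.sel(time=slice("2020-01-01", "2020-01-31"))  # label-based time selection
```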
Describe the solution you'd like
The in-memory representation of STF via xarray has the following coordinates (as `int` in Python anyway, likely 64 bits; `int32` retained on-disk).
Additional context
Possible downsides:
Other considerations:
`xarray` backed by `netcdf` anyway does not offer native capabilities for this so far as I know. One would need to use zarr as an xarray backend, or perhaps other options (time series DBs?).