Using keepbits along the time dimension? #128
I used the CDSAPI to extract snow depth and soil moisture from the ERA5-Land simulation and examined them with xbitinfo. The fields look like this:

I looked at the keepbits along the latitude, longitude and time dimensions for my two variables, and they were:
The question: I'm guessing that data like snow_depth and soil-moisture, the keepbits along the spatial dimensions @milankl or @xbitinfo folks, what do you think?

If you want to try this workflow out, I copied a sample file to a public S3 location, so here is a reproducible notebook.

When I used the most conservative keepbits along the time dimension, I still obtained more than a factor of 3 in space savings over the original NetCDF file, so I was happy. (The original NetCDF file was NetCDF-3, which has no compression but is "packed" using short integers with |
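To make the "most conservative keepbits" step concrete, here is a minimal sketch in plain Python. It reads "most conservative" as the largest keepbits value across the analysed dimensions, which is an interpretation, not something the post states. The per-dimension numbers are hypothetical placeholders (they are not the values from this thread), and `bitround` is an illustrative helper, not the xbitinfo API. It rounds a float32 to a given number of mantissa bits with round-to-nearest, ties-to-even.

```python
import struct

def bitround(value: float, keepbits: int) -> float:
    """Round a float32 value to `keepbits` mantissa bits (0..23)."""
    if not 0 <= keepbits <= 23:
        raise ValueError("keepbits must be between 0 and 23 for float32")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    ui = struct.unpack("<I", struct.pack("<f", value))[0]
    shift = 23 - keepbits  # number of trailing mantissa bits to discard
    if shift > 0:
        # Add just under half a unit in the last kept place, plus the last
        # kept bit, so that exact ties round to the even neighbour.
        ui += ((1 << (shift - 1)) - 1) + ((ui >> shift) & 1)
        # Zero the discarded bits and stay within 32 bits.
        ui &= 0xFFFFFFFF & ~((1 << shift) - 1)
    return struct.unpack("<f", struct.pack("<I", ui))[0]

# Hypothetical keepbits per dimension (placeholders, not results from the thread).
keepbits_per_dim = {"latitude": 5, "longitude": 6, "time": 9}

# Keeping the largest value discards the fewest bits.
conservative = max(keepbits_per_dim.values())

for sample in (0.0, 0.1234567, 1.1):
    print(sample, "->", bitround(sample, conservative))
```

Rounding with the largest keepbits compresses less than rounding with the smallest, but it never discards bits that any of the per-dimension analyses flagged as informative.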
Replies: 1 comment 2 replies
Snow depth: What I'd probably suggest doing in this case is to analyse the bitwise information content on a subarray of your field, e.g. the rectangle between 35°N, -120°E and 50°N, -82°E. You could use the time dimension instead, but in general it may not always be available, and it also includes those constant grid points, which differ from the regions that actually have a varying snow depth.

The underlying problem is this: there is information in the sense that if there's no snow in some US states, then you basically also know there's no snow further south. However, for many states there's also no entropy to begin with (is there ever snow in Florida?). This means the statistics of entropy and mutual information change throughout the field and are not actually coherent. In such a situation it's often possible and advisable to focus on the subarray which has the highest information content. While that choice is somewhat subjective, it's often intuitive, and I'm not sure that the exact choice matters much.

Here, many values are 0 or missing, so bitrounding shouldn't affect them anyway. Hence, given that you will only round the non-zero values, ideally we should also analyse the bitwise information over those.

Soil moisture: I believe soil moisture still has a land-sea mask, but the 9,10,13 keepbits suggest that there are fewer regions with constant values? I reckon if the dataset encloses a desert, similar arguments apply. But from the high number of keepbits I assume that's not actually an issue here.
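The subarray argument can be illustrated with a small, self-contained sketch in plain Python. It computes the unfiltered mutual information between the same bit position of adjacent float32 values, which is the quantity the bitwise information analysis builds on. It does not apply the significance filtering that xbitinfo applies on top, so the numbers are not comparable to xbitinfo output. The transect is synthetic and purely illustrative: a constant, snow-free stretch followed by a smoothly varying one. The function and variable names are hypothetical.

```python
import math
import struct

def float32_bits(x: float) -> int:
    """Return the float32 bit pattern of x as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bitwise_mutual_information(values):
    """Mutual information (in bits) between each bit position of adjacent
    values, ordered from the sign bit to the last mantissa bit."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ints = [float32_bits(v) for v in values]
    pairs = list(zip(ints[:-1], ints[1:]))
    n = len(pairs)
    info = []
    for pos in range(31, -1, -1):
        # Joint histogram of (bit in value k, bit in value k+1).
        counts = [[0, 0], [0, 0]]
        for a, b in pairs:
            counts[(a >> pos) & 1][(b >> pos) & 1] += 1
        mi = 0.0
        for i in (0, 1):
            for j in (0, 1):
                if counts[i][j] == 0:
                    continue
                p_joint = counts[i][j] / n
                p_i = (counts[i][0] + counts[i][1]) / n
                p_j = (counts[0][j] + counts[1][j]) / n
                mi += p_joint * math.log2(p_joint / (p_i * p_j))
        info.append(mi)
    return info

# Synthetic transect: 200 snow-free points, then 200 smoothly varying depths.
n_zero = 200
transect = [0.0] * n_zero + [0.5 + 0.4 * math.sin(k / 20) for k in range(200)]

full_field_info = bitwise_mutual_information(transect)
subarray_info = bitwise_mutual_information(transect[n_zero:])
nonzero_info = bitwise_mutual_information([v for v in transect if v != 0.0])

print("full field:", [round(v, 2) for v in full_field_info])
print("subarray:  ", [round(v, 2) for v in subarray_info])
```

In the full transect, the boundary between the constant and the varying region shapes the bit statistics, while the subarray and the non-zero values reflect only the part of the field that would actually be rounded. With real data, the same selection could be made before passing the array to the analysis.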