
calendar_average for CESM / CAM data #676

Open
brianpm opened this issue Jan 23, 2025 · 3 comments
Assignees
Labels
feature (a new feature that is going to be developed), support (Support request opened by outside user/collaborator)

Comments

@brianpm

brianpm commented Jan 23, 2025

Version

2024.4.0

How did you install geocat-comp?

conda-forge

Operating System

linux and MacOS

Summary

calendar_average does not calculate annual averages correctly from monthly data if the time coordinate is at the mid-month value.

When the times are at mid-month, the function fails because it thinks the times are not evenly spaced.

This is problematic because until recently, CESM (at least CAM) output put the timestamp at the end of the averaging interval, meaning that monthly averages were timestamped in the following month (e.g., a January average is timestamped as midnight on 1 February). A common way to correct for this is to use the time bounds to get the center of the averaging interval, but the centered timestamps then fall on different days within each month.

Also note that development CESM versions have changed the behavior to make the timestamp be at the center point of the averaging interval, so this becomes an issue for both old and new versions of CESM.

If I do not correct the time coordinate, then GeoCAT will try to do the annual average, but the result is wrong because all the months are shifted by one. (See example below.)

Expected behavior

When the data is monthly, the function should not care whether the time coordinate is evenly spaced (but it should still weight by days in month).

Steps to reproduce

(See /glade/u/home/brianpm/Code/gc_calendar.ipynb for this example.)

import geocat.comp as gcomp
import xarray as xr
import numpy as np
# Example 1:
# development CAM with
# monthly data with time stamp at center point
ds0 = xr.open_mfdataset("/glade/derecho/scratch/hannay/archive/f.e30_beta04.FLTHIST.ne30.rhmini0.5.001/atm/hist/f.e30_beta04.FLTHIST.ne30.rhmini0.5.001.cam.h0a.*.nc")

x0 = ds0["TREFHT"].compute()

g_result = gcomp.climatologies.calendar_average(x0, 'year', time_dim='time', keep_attrs=True)

#--> ValueError: Data needs to be uniformly spaced in the 'time' dimension.


# Example 2:
# data has time stamp at end of averaging interval, placing it in wrong month
ds1a = xr.open_dataset("/glade/work/brianpm/b.e21.BHIST.f09_g17.CMIP6-historical.010/b.e21.BHIST.f09_g17.CMIP6-historical.010.cam.h0.FLUT.200001-201412.nc")

# this is how I typically correct for the timestamp being in the wrong month
ds1b = xr.open_dataset("/glade/work/brianpm/b.e21.BHIST.f09_g17.CMIP6-historical.010/b.e21.BHIST.f09_g17.CMIP6-historical.010.cam.h0.FLUT.200001-201412.nc", decode_times=False)

# replace the time coordinate with the midpoint of the time bounds,
# whichever bounds variable name is present
for bounds_name in ('time_bnds', 'time_bounds'):
    if bounds_name in ds1b:
        t = ds1b[bounds_name].mean(dim='nbnd')
        t.attrs = ds1b['time'].attrs
        ds1b = ds1b.assign_coords({'time': t})
        break
ds1b = xr.decode_cf(ds1b)

x1a = ds1a['FLUT']
x1b = ds1b['FLUT']

print(x1a.time[0])
print(x1b.time[0])
print("---")
print(x1a.time[-1])
print(x1b.time[-1])

g_result = gcomp.climatologies.calendar_average(x1a, 'year', time_dim='time', keep_attrs=True)
#--> g_result has 16 time entries INSTEAD OF 15

g_exception =  gcomp.climatologies.calendar_average(x1b, 'year', time_dim='time', keep_attrs=True)
# --> ValueError: Data needs to be uniformly spaced in the 'time' dimension.

# Correct annual average:
days_gb = x1b.time.dt.daysinmonth.groupby('time.year').map(lambda x: x / x.sum())
# weighted average with normalized weights: <x> = SUM x_i * w_i  (implied division by SUM w_i)
result =  (x1b * days_gb).groupby('time.year').sum(dim='time')
result.attrs['units'] = x1b.attrs.get("units",None)
# result has 15 time entries (called year), and each of them used 12 months, weighted by days per month
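The reproduction above depends on files on NCAR's glade filesystem. A self-contained sketch of the same days-in-month weighted annual average, using synthetic monthly data timestamped on the 15th of each month (all names and values here are illustrative, not the original data):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of synthetic monthly values, timestamped on the 15th of each month
time = pd.date_range("2000-01-01", periods=24, freq="MS") + pd.Timedelta(days=14)
data = xr.DataArray(np.arange(24.0), coords={"time": time}, dims="time")

# Normalize days-in-month weights so they sum to 1 within each calendar year
weights = data.time.dt.daysinmonth.groupby("time.year").map(lambda x: x / x.sum())

# Weighted annual average: one value per calendar year
annual = (data * weights).groupby("time.year").sum(dim="time")
```

Each annual value uses all 12 months of that year, weighted by days per month, regardless of where within the month the timestamps sit.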

@brianpm brianpm added bug Something isn't working support Support request opened by outside user/collaborator triage Needs triage labels Jan 23, 2025
@kafitzgerald
Contributor

Thanks for reporting this and for the helpful examples!

This seems like something we should be able to handle and may be easier now as well with some of the groupby enhancements in Xarray.

@kafitzgerald kafitzgerald self-assigned this Jan 29, 2025
@kafitzgerald kafitzgerald changed the title [🐛]: calendar_average Does not calculate annual average correctly (under some conditions) calendar_average for CESM / CAM data Jan 29, 2025
@kafitzgerald
Contributor

kafitzgerald commented Jan 29, 2025

Thanks again for the nice examples - these were super helpful!

To reiterate, it seems like the calendar_average function is working roughly as described in the documentation, but there are a couple of cases that arise with CESM/CAM data that would be helpful to support:

  1. Handling of offset timestamps (from timestamps being assigned as the end of the averaging interval)
  2. Handling of non-uniform time dimension spacing (from centered timestamps on months of differing lengths)

For the first case, I think handling this with pre-processing might make more sense than in geocat-comp since this is both a special case and not particularly intuitive. Admittedly, the pre-processing is a bit clunky (example below) because the calendar requires using cftime so Pandas timedeltas don't work here, but it's not terrible and hopefully will become cleaner at some point (see: pydata/xarray#5687). Open to discussion here as well though.

Example subtracting a month from the time index:
x1a.time - xr.coding.cftime_offsets.MonthBegin(1)

For the second case, I think it's probably more reasonable to expect calendar_average and perhaps some of the other climatology functions to accommodate the non-uniform spacing of the centered time stamps in CESM / CAM. As you noted, this currently isn't supported (i.e. the function explicitly requires and checks for uniform spacing in the time dimension). We'll have to adapt some of those checks and adjust the code a bit, but I think that should be fairly straightforward.

Let me know if there's anything here that sounds off and happy to chat too if that'd be easier.

In the meantime, I'll start looking into adapting the climatology functions for non-uniform spacing.
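One hypothetical way such a check could be relaxed, purely as a sketch (this is not geocat-comp's actual implementation, and `is_monthly` is an invented name): rather than requiring equal time deltas, verify that consecutive timestamps fall in consecutive calendar months.

```python
import numpy as np
import pandas as pd

def is_monthly(times: pd.DatetimeIndex) -> bool:
    """Accept any index whose timestamps land in consecutive calendar months."""
    months = times.year * 12 + times.month
    return bool(np.all(np.diff(months) == 1))

# Mid-month stamps pass despite unequal deltas; daily data does not
mid_month = pd.date_range("2000-01-01", periods=12, freq="MS") + pd.Timedelta(days=14)
daily = pd.date_range("2000-01-01", periods=12, freq="D")
```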

@kafitzgerald kafitzgerald added feature a new feature that is going to be developed and removed bug Something isn't working triage Needs triage labels Jan 29, 2025
@brianpm
Author

brianpm commented Jan 30, 2025

@kafitzgerald -- I agree. The offset time stamp should be dealt with separately. The way I do it is to change the time coordinate to the average of the time bounds data. This uses the CF-compliant metadata to put the time stamp at the middle of the averaging interval. A side effect is that the time stamps aren't evenly spaced. But this is probably not a super common situation, so users should deal with it separately.

In terms of the non-uniform spacing, this is definitely not specific to CESM data. As far as I can tell, any monthly data will trigger this issue. For example, if the time stamp is always on the 15th of the month (or any other day) the times aren't evenly spaced. And I think since xarray decodes time coordinates into some kind of time stamp object, you can't have just "year-month" (maybe it is possible, but I'm not sure how to do it).
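The point about any monthly data is easy to demonstrate: timestamps pinned to the 15th of each month are separated by the varying month lengths, so no single time step fits. For example (synthetic dates, leap year):

```python
import numpy as np
import pandas as pd

# Timestamps on the 15th of each month in a leap year
times = pd.date_range("2000-01-01", periods=6, freq="MS") + pd.Timedelta(days=14)

# Gaps between consecutive timestamps vary with month length
gaps_days = (np.diff(times) / np.timedelta64(1, "D")).astype(int)
# -> [31, 29, 31, 30, 31]: not uniform, so an equal-spacing check fails
```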
