Enhancements to GEMPAK reader(s) #2067
Comments
Plenty of GEMPAK files have GRIB2 packing these days, so I'd like to second that list item, which is a particularly high priority for my own process of converting things over to MetPy.
A variation of this would be to allow pressures with units for the level, so that

```python
p = 70000 * units('Pa')
hght = gem_file.gdxarray(parameter='HGHT', level=p)
```

would return the 700-mb heights rather than producing a …
I've become more interested in combining multiple …
I think this is the crux of the required functionality, but with the extra complication that some variables are only provided on a subset of levels. For instance, you might have geopotential height every 50 mb, but absolute vorticity only at 850, 700, 500, and 300 mb. xarray's backend thus considers geopotential height and absolute vorticity to have two distinct vertical coordinates, and this leads to the explosion of coordinates that you see in, say, the xarray Dataset that corresponds to GFS model output pulled from a TDS. To avoid the coordinate explosion, I suppose the alternative would be to set the unprovided levels (e.g., the 800-mb absolute vorticity in this example) to all NaN or some other "missing" value. I'm not sure if xarray is smart enough to use one byte to store the NaN, or if that would waste a bunch of memory (a separate NaN stored for each grid point at that level).
There are probably ways using lazy arrays to leave these missing chunks uninstantiated until they're needed (e.g. for plotting or calculations). Having said that, this would introduce a bunch of NaNs in e.g. profiles, and it implies that data are available for these levels, only for them to turn out missing. Handling this in code introduces a bunch of complications around missing data. I'd personally opt for just having multiple coordinates and letting e.g. …
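To make the trade-off discussed above concrete, here is a small sketch in plain xarray with hypothetical data: reindexing vorticity onto the fuller height coordinate (the "shared coordinate" option) fills the unprovided levels with NaN, while keeping the arrays separate preserves two distinct `pressure` coordinates:

```python
import numpy as np
import xarray as xr

# Hypothetical data: height on six levels, vorticity on only three of them
# (850 mb is not even among the height levels, mirroring the example above).
hght_levels = [1000., 900., 800., 700., 600., 500.]
avor_levels = [850., 700., 500.]

hght = xr.DataArray(np.arange(6.), dims='pressure', coords={'pressure': hght_levels})
avor = xr.DataArray(np.arange(3.), dims='pressure', coords={'pressure': avor_levels})

# Option A: keep the arrays separate -- each retains its own 'pressure' axis,
# which is the "multiple coordinates" approach.
# Option B: one shared coordinate -- reindex onto the union of levels,
# which fills every unprovided level with NaN.
union = np.union1d(hght_levels, avor_levels)  # 7 levels total
avor_full = avor.reindex(pressure=union)
print(int(avor_full.isnull().sum()))  # 4 of the 7 union levels have no vorticity
```

The NaNs here are materialized eagerly; avoiding that allocation is exactly where a lazy `BackendArray` implementation would come in.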
I was taking a closer look at decoding GRIB2 packing recently. It's definitely more complicated than the others to implement. That being said, it seems doable. GEMPAK has what looks like a fully functional GRIB2 decoder. It's possible that it could be pared down a bit if all we care about is just unpacking the data. The GEMPAK decoder also handles PNG and JPEG compressed data. I don't work with GRIB2 packed GEMPAK files, so I do not know how often those appear in the wild. Dealing with those compressions would introduce another layer of complication and perhaps some additional dependencies.
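For the PNG case mentioned above, the decode step itself is small if Pillow is available. A sketch, assuming the data section payload is a complete PNG stream (how GRIB2's PNG data representation template stores it); `unpack_png_grid` is a hypothetical helper name, and the round-trip below uses synthetic data in place of a real data section:

```python
from io import BytesIO

import numpy as np
from PIL import Image


def unpack_png_grid(payload):
    """Decode a PNG byte stream (e.g. a PNG-packed GRIB2 data section) to an array."""
    return np.asarray(Image.open(BytesIO(payload)))


# Round-trip demo with synthetic data standing in for a real data section:
grid = np.arange(12, dtype=np.uint8).reshape(3, 4)
buf = BytesIO()
Image.fromarray(grid).save(buf, format='PNG')
assert np.array_equal(unpack_png_grid(buf.getvalue()), grid)
```

Note that for real GRIB2 data the decoded integers are only the packed values; the template's reference value and scale factors still have to be applied to recover physical quantities.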
Dealing with PNG/JPEG compression (assuming it's not JPEG2k) should be doable using Pillow (which is already in our dependency chain thanks to Matplotlib). The hard part here really isn't decoding the grid, but interpreting all the other bits and bytes that amount to metadata. I've got an old start on something to walk through GRIB messages:

```python
import logging
from datetime import datetime

from metpy.io._tools import IOBuffer, NamedStruct

# bufr_table_b and bufr_table_d are BUFR table lookups assumed to be
# loaded elsewhere; the Descriptor classes below are incomplete stubs.

log = logging.getLogger("metpy.io.grib")
# log.setLevel(logging.WARNING)


def _make_datetime(s):
    s = bytearray(s)
    year = (s[0] << 8) | s[1]
    return datetime(year, *s[2:])


section0 = NamedStruct([('grib', '4s'), ('reserved', 'H'), ('discipline', 'B'),
                        ('edition', 'B'), ('total_length', 'Q')], '>', 'Section0')
section_header = NamedStruct([('section_len', 'I'), ('section_num', 'B')], '>', 'section_head')


class Descriptor(object):
    def __init__(self, fxy):
        pass

    def read(self, buf):
        pass


class Element(Descriptor):
    def __init__(self, fxy):
        self._id = fxy
        self._info = bufr_table_b[fxy]
        self.bits = int(self._info['BUFR_DataWidth_Bits'])

    def read(self, src):
        return src.read(self.bits)


class Replication(Descriptor):
    def __init__(self, descs, repeats):
        pass

    def read(self, src):
        return src.read(self.bits)


class Operator(Descriptor):
    def __init__(self, fxy):
        pass


class Sequence(Descriptor):
    def __init__(self, fxy):
        self._id = fxy
        self._info = bufr_table_d[fxy]
        self.bits = [int(bufr_table_b[int(i['FXY2'])]['BUFR_DataWidth_Bits'])
                     for i in self._info]

    def read(self, src):
        return [src.read(b) for b in self.bits]


class GRIBMessage(object):
    def __init__(self, fobj):
        hdr = section0.unpack_file(fobj)
        if hdr.grib != b'GRIB':
            logging.error('Bad section 0: %s', hdr)
            raise RuntimeError('Bad section 0: %s' % str(hdr))
        self._buffer = IOBuffer(fobj.read(hdr.total_length - section0.size))
        self.sections = [hdr]
        # Walk sections 1-7 (section 2 raises if present)
        for _ in range(6):
            sec = self._read_section()
            # print(sec)
        end_bytes = self._buffer.read(4)
        if end_bytes != b'7777':
            logging.warning('Last bytes not correct: %s', end_bytes)

    def _read_section(self, jump_end=True):
        sec_start = self._buffer.set_mark()
        sec_hdr = self._buffer.read_struct(section_header)
        sec = getattr(self, '_read_section{:d}'.format(sec_hdr.section_num))()
        self.sections.append(sec)
        if jump_end:
            self._buffer.jump_to(sec_start, sec_hdr.section_len)
        return sec

    section1 = NamedStruct([('center', 'H'), ('subcenter', 'H'),
                            ('master_table_version', 'B'), ('local_table_version', 'B'),
                            ('ref_time_significance', 'B'), ('dt', '7s', _make_datetime),
                            ('production_status', 'B'), ('data_type', 'B')], '>', 'Section1')
    # ('template_id', 'H')], '>', 'Section1')

    def _read_section1(self):
        return self._buffer.read_struct(self.section1)

    def _read_section2(self):
        raise NotImplementedError('Local section not handled')

    section3 = NamedStruct([('grid_def_source', 'B'), ('num_data_points', 'I'),
                            ('num_octets', 'B'), ('interp', 'B'),
                            ('template_num', 'H')], '>', 'Section3')

    def _read_section3(self):
        return self._buffer.read_struct(self.section3)

    section4 = NamedStruct([('num_values', 'H'), ('prod_template', 'H')], '>', 'Section4')

    def _read_section4(self):
        return self._buffer.read_struct(self.section4)

    section5 = NamedStruct([('num_values', 'I'), ('data_template', 'H')], '>', 'Section5')

    def _read_section5(self):
        return self._buffer.read_struct(self.section5)

    section6 = NamedStruct([('indicator', 'B')], '>', 'Section6')

    def _read_section6(self):
        return self._buffer.read_struct(self.section6)

    def _read_section7(self):
        self._data_ptr = self._buffer._offset
```
It does appear to be JPEG2k.
I kind of figured, since that's why we were dealing with the JJ2000 library over on the netcdf-java side. It looks like Pillow can support JPEG2000; I'm not sure how regularly it's included in builds. We can cross that bridge when we get to it, I guess.
Just documenting some desired enhancements to the GEMPAK io submodule brought up in the original PR, other issues, and side conversations, as well as some ideas I had while working on recent PRs:

- Speed up `snxarray` (snxarray is slow #2062, resolved by Improve some thermo calculation bottlenecks by refactoring how units are handled #2064)
- Decode GRIB2 packing; maybe https://github.com/ecmwf/cfgrib can be leveraged?
- Assign units in the `*xarray()` methods (gdxarray() should assign units to x and y coordinates #2118, referenced in GEMPAK reader assigns grid mapping information incorrectly #2046)
- Combine vertical coordinates in `gdxarray()` output
- Implement `xarray.backends.BackendEntrypoint` (actual backend) and `xarray.backends.BackendArray` (lazy loaded arrays)
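The backend item can be sketched against xarray's documented plugin API. Everything GEMPAK-specific below is hypothetical placeholder logic (`GempakBackendEntrypoint`, the `.gem` extension, the `hght` variable); a real implementation would parse the file and wrap each grid in a `BackendArray` for lazy loading:

```python
import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint


class GempakBackendEntrypoint(BackendEntrypoint):
    """Sketch of a GEMPAK backend; the parsing below is a placeholder."""

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # A real implementation would parse the GEMPAK file here and wrap
        # each grid in a BackendArray for lazy loading; we fake the data.
        data = np.zeros((2, 3))
        return xr.Dataset({'hght': (('y', 'x'), data)})

    def guess_can_open(self, filename_or_obj):
        return str(filename_or_obj).endswith('.gem')  # hypothetical extension


# Usage would be: xr.open_dataset('file.gem', engine=GempakBackendEntrypoint)
ds = GempakBackendEntrypoint().open_dataset('file.gem')
print(list(ds.data_vars))  # ['hght']
```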