WIP expand Variables class to handle s3 urls from NSIDC #434

Closed
74 commits
57ac538
wip inheritance method for modularizing authentication
rwegener2 Jul 31, 2023
703d832
add nsidc_s3 option to Variables class
rwegener2 Aug 1, 2023
9d09ff9
mvp remove intake from Read
rwegener2 Aug 1, 2023
4564b3b
outline of mixin method of authentication
rwegener2 Aug 3, 2023
16b8d3f
add s3 credential timer and auth check
rwegener2 Aug 8, 2023
54aeda0
Update icepyx/core/variables.py
rwegener2 Aug 9, 2023
083426c
Update icepyx/core/query.py
rwegener2 Aug 9, 2023
76d3c96
add docstrings to auth.py
rwegener2 Aug 9, 2023
edfa362
Merge branch 'move_auth' of https://github.com/icesat2py/icepyx into …
rwegener2 Aug 9, 2023
e8c9060
add comment to stop tests from running docstring on build
rwegener2 Aug 14, 2023
72c5347
fix user warning for giving an email parameter
rwegener2 Aug 14, 2023
434bbf2
add tests for auth module
rwegener2 Aug 14, 2023
7e8bf0f
add warning message for use of earthdata_login
rwegener2 Aug 14, 2023
e06765a
remove .netrc creation and update existing tests to new auth method
rwegener2 Aug 14, 2023
9e1f745
undo changes to troubleshoot build
rwegener2 Aug 14, 2023
a2a455f
another baby commit to figure out what is breaking travis
rwegener2 Aug 14, 2023
1aacb1d
remove duplicate netrc creation
rwegener2 Aug 14, 2023
50db05e
update documentation for new auth procedure
rwegener2 Aug 15, 2023
f44ead3
remove earthdata_login function from docstrings
rwegener2 Aug 15, 2023
12e21ba
remove missed instance of earthdata_login in docs
rwegener2 Aug 15, 2023
10bc734
attempt add auth to API reference
rwegener2 Aug 15, 2023
8033ed4
Update icepyx/core/auth.py
rwegener2 Aug 22, 2023
f1fa0df
alphabetize ordering
rwegener2 Aug 22, 2023
6cdddbf
add warning to dev log
rwegener2 Aug 22, 2023
b7b8b7e
add auth to components docs
rwegener2 Aug 22, 2023
97fda07
add more detail to auth string
rwegener2 Aug 22, 2023
1172e9e
add authentication explainer
rwegener2 Aug 22, 2023
abd950a
add internals to index.rst
rwegener2 Aug 22, 2023
8c3545b
match formatting
rwegener2 Aug 22, 2023
90354e6
update code block formatting
rwegener2 Aug 22, 2023
7fedc2d
continue to update formatting
rwegener2 Aug 22, 2023
e26c9a1
add an additional example line
rwegener2 Aug 22, 2023
a31b092
Update icepyx/tests/test_behind_NSIDC_API_login.py
rwegener2 Aug 23, 2023
ab3de31
Update doc/source/contributing/icepyx_internals.rst
rwegener2 Aug 23, 2023
3aab79a
Update doc/source/contributing/icepyx_internals.rst
rwegener2 Aug 23, 2023
ade942f
move auth description text to EarthdataAuthMixin class
rwegener2 Aug 23, 2023
f052b76
fix typo and double admonition in data access notebook
JessicaS11 Aug 24, 2023
63b0275
minor updates to auth module docstrings
JessicaS11 Aug 24, 2023
1796a2c
combine auth warning messages
rwegener2 Aug 24, 2023
4d7687b
switch s3token argument default to None
rwegener2 Aug 24, 2023
3838044
fix json error in data_access example notebook
rwegener2 Aug 24, 2023
e6caa70
merge move_auth
rwegener2 Aug 24, 2023
61b2006
Merge branch 'development' into s3_variables
rwegener2 Aug 24, 2023
e5458a1
Merge branch 'development' into refactor_intake
rwegener2 Aug 29, 2023
24f6a42
delete is2cat and references
rwegener2 Aug 29, 2023
b13b847
remove extra comments
rwegener2 Aug 30, 2023
0779b80
update doc strings
rwegener2 Aug 30, 2023
1cfbf72
update tests
rwegener2 Aug 30, 2023
de61d87
update documentation for removing intake
rwegener2 Aug 30, 2023
9f06611
update approach paragraph
rwegener2 Aug 30, 2023
d019b9a
remove one more instance of catalog from the docs
rwegener2 Aug 30, 2023
156ea89
clear jupyter history
rwegener2 Aug 30, 2023
b26ca4e
Update icepyx/core/read.py
rwegener2 Sep 1, 2023
ce1ca76
remove intake and related modules
rwegener2 Sep 1, 2023
fd00aeb
Merge branch 'development' into read_arguments
rwegener2 Sep 4, 2023
431af78
mvp with new read parameters
rwegener2 Sep 5, 2023
612662e
clean up remainder of file and remove extraneous comments
rwegener2 Sep 5, 2023
c16a003
maintain backward compatibility and combine arguments
rwegener2 Sep 5, 2023
7648078
update to new error message
rwegener2 Sep 5, 2023
4cfbfdb
update docs
rwegener2 Sep 8, 2023
f7f823b
glob kwargs and list error
rwegener2 Sep 8, 2023
203f3ad
formatting updates
rwegener2 Sep 8, 2023
10d1591
Apply suggestions from code review
rwegener2 Sep 12, 2023
0b23d1e
remove num_files
rwegener2 Sep 12, 2023
6f5bead
fix docs test typo
rwegener2 Sep 12, 2023
035ee5a
trying again to fix the build
rwegener2 Sep 12, 2023
903c351
add feedback to docs page
rwegener2 Sep 12, 2023
d842bde
Merge branch 'development' into read_arguments
rwegener2 Sep 13, 2023
5e06de9
fix typo
rwegener2 Sep 14, 2023
9ca29f1
Merge branch 'development' into read_arguments
rwegener2 Sep 14, 2023
e8e35ad
Merge branch 'development' into read_arguments
rwegener2 Sep 18, 2023
d26a194
Merge branch 'development' into s3_variables
rwegener2 Sep 18, 2023
af79818
Merge branch 'read_arguments' into s3_variables
rwegener2 Sep 18, 2023
ba52c55
resolve merge conflicts from development
rwegener2 Oct 23, 2023
45 changes: 35 additions & 10 deletions icepyx/core/read.py
@@ -269,10 +269,6 @@ class Read:
data_source : string, List
A string or list which specifies the files to be read. The string can be either: 1) the path of a single file 2) the path to a directory or 3) a [glob string](https://docs.python.org/3/library/glob.html).
The List must be a list of strings, each of which is the path of a single file.

product : string
ICESat-2 data product ID, also known as "short name" (e.g. ATL03).
Available data products can be found at: https://nsidc.org/data/icesat-2/data-sets
**Deprecation warning:** This argument is no longer required and will be deprecated in version 1.0.0. The dataset product is read from the file metadata.

filename_pattern : string, default None
@@ -289,6 +285,9 @@ class Read:
glob_kwargs : dict, default {}
Additional arguments to be passed into the [glob.glob()](https://docs.python.org/3/library/glob.html#glob.glob) function

glob_kwargs : dict, default {}
Additional arguments to be passed into the [glob.glob()](https://docs.python.org/3/library/glob.html#glob.glob) function

out_obj_type : object, default xarray.Dataset
The desired format for the data to be read in.
Currently, only xarray.Dataset objects (default) are available.
@@ -320,10 +319,10 @@ class Read:

# ----------------------------------------------------------------------
# Constructors

def __init__(
self,
data_source=None,
data_source=None, # DevNote: Make this a required arg when catalog is removed
product=None,
filename_pattern=None,
catalog=None,
@@ -336,7 +335,7 @@ def __init__(
"The `catalog` argument has been deprecated and intake is no longer supported. "
"Please use the `data_source` argument to specify your dataset instead."
)

if data_source is None:
raise ValueError("data_source is a required argument")

@@ -381,7 +380,6 @@ def __init__(
product_dict = {}
for file_ in self._filelist:
product_dict[file_] = self._extract_product(file_)

# Raise warnings or errors for multiple products or products not matching the user-specified product
all_products = list(set(product_dict.values()))
if len(all_products) > 1:
@@ -425,7 +423,6 @@ def __init__(
" metadata {self._product}",
stacklevel=2,
)

if out_obj_type is not None:
print(
"Output object type will be an xarray DataSet - "
@@ -461,6 +458,20 @@ def vars(self):
)

return self._read_vars

@property
def filelist(self):
"""
Return the list of files represented by this Read object.
"""
return self._filelist

@property
def product(self):
"""
Return the product associated with the Read object.
"""
return self._product

@property
def filelist(self):
@@ -478,7 +489,21 @@ def product(self):

# ----------------------------------------------------------------------
# Methods


@staticmethod
def _extract_product(filepath):
"""
Read the product type from the metadata of the file. Return the product as a string.
"""
with h5py.File(filepath, 'r') as f:
try:
product = f.attrs['short_name'].decode()
product = is2ref._validate_product(product)
# TODO test that this is the proper error
except KeyError:
raise KeyError('Unable to parse the product name from file metadata')
return product

@staticmethod
def _extract_product(filepath):
"""
Expand Down
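The `data_source` docstring in this diff describes four accepted inputs: a single file path, a directory, a glob string, or a list of file paths. As a rough illustration of how those cases could be reduced to one file list, here is a minimal sketch using only the standard library. The function name `resolve_data_source` and its exact behavior (sorting, directory expansion with `*`) are assumptions for illustration and are not taken from icepyx.

```python
import glob
import os


def resolve_data_source(data_source, glob_kwargs=None):
    """Hypothetical helper: turn a data_source value into a list of file paths."""
    glob_kwargs = glob_kwargs or {}

    # A list is assumed to already contain one path per file.
    if isinstance(data_source, list):
        return list(data_source)

    if not isinstance(data_source, str):
        raise TypeError("data_source must be a string or a list of strings")

    # A directory is expanded to the regular files directly inside it.
    if os.path.isdir(data_source):
        pattern = os.path.join(data_source, "*")
        matches = glob.glob(pattern, **glob_kwargs)
        return sorted(p for p in matches if os.path.isfile(p))

    # A single file path or a glob string: glob returns existing matches only.
    return sorted(glob.glob(data_source, **glob_kwargs))
```

Passing a plain file path through `glob.glob` returns that path when the file exists and an empty list otherwise, so the caller can still raise a clear error when nothing matches.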
13 changes: 10 additions & 3 deletions icepyx/core/variables.py
@@ -1,4 +1,3 @@
import numpy as np
import os
import pprint

@@ -27,7 +26,7 @@ class Variables(EarthdataAuthMixin):
Parameters
----------
vartype : string
One of ['order', 'file'] to indicate the source of the input variables.
One of ['order', 'file', 'nsidc-s3'] to indicate the source of the input variables.
This field will be auto-populated when a variable object is created as an
attribute of a query object.
avail : dictionary, default None
@@ -75,6 +74,14 @@ def __init__(
elif self._vartype == "file":
# DevGoal: check that the list or string are valid dir/files
self.path = path
elif self._vartype == "nsidc-s3":
# Grab metadata from s3 path
template = ('s3://nsidc-cumulus-prod-protected/ATLAS/{product}/{version}/'
weiji14 (Member), Aug 8, 2023:

This template appears to be hardcoded for the non-gridded ATLAS datasets such as ATL06. Gridded products (e.g. ATL14, ATL16, ATL20) use a different template, such as `s3://nsidc-cumulus-prod-protected/ATLAS/{product}/{version}/{year}/{filename}`. For example, from https://search.earthdata.nasa.gov/search?ff=Available%20in%20Earthdata%20Cloud&fi=ATLAS&gdf=HDF&fst0=Cryosphere&lat=65.08299573518599&long=-25.69921875&zoom=5:

| Dataset | Sample path |
| --- | --- |
| ATL06 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2023/04/16/ATL06_20230416235213_04061911_006_02.h5 |
| ATL07 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL07/005/2022/10/12/ATL07-02_20221012220720_03391701_005_01.h5 |
| ATL10 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL10/005/2022/10/12/ATL10-01_20221012220720_03391701_005_01.h5 |
| ATL11 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL11/005/2022/03/27/ATL11_006305_0315_005_03.h5 |
| ATL14 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL14/002/2019/ATL14_IS_0314_100m_002_01.nc |
| ATL16 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL16/004/2022/ATL16_20220722003637_04601601_004_01.h5 |
| ATL20 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL20/003/2022/ATL20-01_20220901002201_10861601_003_01.h5 |

Would it be possible to generalize this code to both non-gridded and gridded products?
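One possible way to cover both layouts in the sample paths is to treat the month and day directories as optional. The sketch below uses the standard-library `re` module instead of the `parse` package used in the diff. The pattern and the function name `parse_nsidc_s3_path` are illustrative assumptions, not part of icepyx, and only the sample paths listed in this comment were considered.

```python
import re

# Hypothetical pattern: month/day are optional so that both the
# {year}/{month}/{day}/{filename} and {year}/{filename} layouts match.
S3_PATH_RE = re.compile(
    r"^s3://nsidc-cumulus-prod-protected/ATLAS/"
    r"(?P<product>[^/]+)/(?P<version>[^/]+)/(?P<year>\d{4})/"
    r"(?:(?P<month>\d{2})/(?P<day>\d{2})/)?"
    r"(?P<filename>[^/]+)$"
)


def parse_nsidc_s3_path(path):
    """Return a dict of path components; month and day are None when absent."""
    match = S3_PATH_RE.match(path)
    if match is None:
        raise ValueError(f"Unrecognized NSIDC s3 path: {path}")
    return match.groupdict()


info = parse_nsidc_s3_path(
    "s3://nsidc-cumulus-prod-protected/ATLAS/ATL14/002/2019/"
    "ATL14_IS_0314_100m_002_01.nc"
)
print(info["product"], info["version"], info["month"])
```

A path-based approach like this still depends on the bucket layout staying stable, which is the limitation the rest of this thread discusses.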

rwegener2 (Contributor Author), Aug 24, 2023:

Ah, thanks for catching this @weiji14! I'll work on a fix and let you know when I'm ready for another review!

rwegener2 (Contributor Author):

A question for @JessicaS11 and @weiji14 -- What do we think of getting the version and product name from inside the file instead of parsing them from the filename? I've only checked a handful of products, but those fields seem to be consistently available in the top-level metadata. I've been trying to parse them out of the filename, which I believe is how it is done elsewhere in the module, but this limits the files icepyx can process to those named in a very specific way. If we read the product/version from inside the file, we can process more files (e.g. cloud ICESat-2 files not in the NSIDC bucket, or local files that have been renamed). Thoughts?
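As a rough illustration of the metadata-based approach, the sketch below reads the product from a mapping of top-level attributes. The `short_name` key is the one used by `_extract_product` in this PR's diff; the function name `product_from_attrs` and the bytes/str handling are assumptions, and products that store the attribute in another form (for example an array) are not handled here.

```python
def product_from_attrs(attrs, key="short_name"):
    """Hypothetical helper: return the product short name from file attributes.

    `attrs` can be any mapping, e.g. a plain dict or the `.attrs` of an
    open h5py.File object.
    """
    try:
        value = attrs[key]
    except KeyError as err:
        raise ValueError(f"No '{key}' attribute found in file metadata") from err

    # HDF5 string attributes are often returned as bytes.
    if isinstance(value, bytes):
        value = value.decode()
    return str(value).strip()


print(product_from_attrs({"short_name": b"ATL06"}))
```

With h5py, the same function could be called as `product_from_attrs(f.attrs)` inside a `with h5py.File(path, "r") as f:` block, so renamed local files would be handled the same way as files in the NSIDC bucket.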

rwegener2 (Contributor Author):

I realized that the last place I was thinking about this was in the branch to remove intake from icepyx. I just pushed a WIP PR (#438) so there is a place to discuss questions. Hopefully whatever we decide there about accessing the product/version can be used for this PR later.

Member:

The discussion summarized in #438 (comment) indicates our intention to move away from requiring the user to provide the product as input (unless they are also feeding in a directory containing files from multiple products). This should address the template issues noted here.
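To make the "product is only needed for mixed directories" idea concrete, here is a minimal sketch that takes a mapping of file path to product (like the `product_dict` built in `Read.__init__` in this PR) and decides which files to keep. The function name `filter_by_product` and the exact error policy are assumptions for illustration; the PR itself may warn rather than raise in some of these cases.

```python
def filter_by_product(product_by_file, requested=None):
    """Hypothetical helper: return (product, files) for a path -> product mapping."""
    products = sorted(set(product_by_file.values()))
    if not products:
        raise ValueError("No files were provided")

    # Single product and no conflicting request: nothing for the user to specify.
    if len(products) == 1 and requested in (None, products[0]):
        return products[0], sorted(product_by_file)

    # Several products: the user has to say which one they want.
    if requested is None:
        raise ValueError(f"Multiple products found ({products}); please specify one")

    files = sorted(f for f, p in product_by_file.items() if p == requested)
    if not files:
        raise ValueError(f"No files match the requested product {requested}")
    return requested, files


print(filter_by_product({"a.h5": "ATL06", "b.h5": "ATL06"}))
```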

'{year}/{month}/{day}/{filename}')
s3_pathinfo = parse.parse(template, path)
self._version = s3_pathinfo['version']
self._product = s3_pathinfo['product']
self.path = path

# @property
# def wanted(self):
@@ -101,7 +108,7 @@ def avail(self, options=False, internal=False):
# return self._avail
# else:
if not hasattr(self, "_avail") or self._avail is None:
if self._vartype == "order":
if self._vartype in ["order", "nsidc-s3"]:
self._avail = is2ref._get_custom_options(
self.session, self.product, self._version
)["variables"]