-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding marble specific fields to item properties #30
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to merge #28 first.
On follow PRs, tests will have to pass.
I add to patch some tests. Expected features/support were still not working.
@@ -6,7 +6,7 @@ CATALOG = https://daccs.cs.toronto.edu/twitcher/ows/proxy/thredds/catalog/datase | |||
# CATALOG = https://daccs.cs.toronto.edu/twitcher/ows/proxy/thredds/catalog/datasets/CMIP6/CMIP/AS-RCEC/catalog.html | |||
|
|||
testcmip6: | |||
python $(IMP_DIR)/CMIP6_UofT/add_CMIP6.py $(STAC_HOST) $(CATALOG) | |||
python $(IMP_DIR)/CMIP6_UofT/add_CMIP6.py $(STAC_HOST) $(CATALOG) PAVICS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this command is becoming too cluttered with optional arguments.
Can you add a --host-node
option or similar?
I think it should use the URL of the STAC_HOST
by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can streamline this.
item_data["attributes"]["host_node"] = self.host_node | ||
|
||
item = STAC_item_from_metadata(iid, item_data, self.item_properties_model, self.item_geometry_model) | ||
# try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove debugging code
@@ -173,6 +187,7 @@ def create_stac_item(self, item_name: str, item_data: MutableMapping[str, Any]) | |||
parser = argparse.ArgumentParser(prog="CMIP6 STAC populator") | |||
parser.add_argument("stac_host", type=str, help="STAC API address") | |||
parser.add_argument("thredds_catalog_URL", type=str, help="URL to the CMIP6 THREDDS catalog") | |||
parser.add_argument("host_node", type=str, help="Name of the Marble node on which the data resides") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @fmigneault that this could be a flagged option. If we don't make this optional then we can never use this implementation to populate a stac instance that is not part of the Marble network.
However, I think that we should add some validation checks here as well if --host_node
is specified:
- check that there exists a node with the same name as "host_node" in the registry
- check that the registry contains both a thredds and stac service for that node
- check that the "stac_host" argument matches the url of the stac service in the registry
- check that the "thredds_catalog_host" argument matches the url of the thredds service in the registry
If any of these conditions are false, we should raise an error with an appropriate message.
is_replica: bool = Field(..., serialization_alias="mrbl:is_replica") | ||
host_node: str = Field(..., serialization_alias="mrbl:host_node") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These attributes would apply to every dataset hosted on the node, so I suggest they should be defined inside a Parent class that every collection inherits from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could also avoid the mrbl
shortcut, and simply have marble
explicitly.
It is not really that much a long name to justify shortening it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we should use marble
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These attributes would apply to every dataset hosted on the node, so I suggest they should be defined inside a Parent class that every collection inherits from.
I understood your intention and I think the proposal can be useful. But I didn't understand (or can't quite visualize) your proposal for how to do that. Could you please elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class inherits from STACItemProperties
, which would be common to all implementations, correct ? So I'd add those two attributes in that class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right I see.
# with the data residing on that node. Replica STAC items on other nodes in the Marble network will be | ||
# populated using a different app, and there the is_replica flag will be True. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what is meant exactly by replica here. In the ESGF world, this has to do with who mirrors from whom. Will MARBLE include replication of datasets?
Also, more generally, if you are going to curate copies of ESGF files that you aggregate yourself, how are you going to determine whether one of the files making the aggregation has been updated, and this update needs to be propagated to your own version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you say that replica items would be populated by another app. Why would that be? The replica is going to be a netCDF file sitting on a THREDDS server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or do you mean the replica of a STAC item (just the catalog entry, not the file) ? In that case, I think we need to work out how STAC catalogs are going to be shared from one node to another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what is meant exactly by replica here. In the ESGF world, this has to do with who mirrors from whom. Will MARBLE include replication of datasets?
Not, datasets will not be replicated. We don't have enough space on the network. What I meant is that we have plans to ensure that STAC entries from one node are visible in the STAC browser of another node. As an example, at some point the Ouranos STAC catalog will also have a "CMIP6-UofT" collection whose entries be (near) duplicates of the entries on UofT's STAC catalog for the "CMIP6-UofT" (all the properties will be same and crucially, access links will be same), but marble:host_node
property will be UofT
and marble:is_replica
property will be true
. Similarly, UofT will have replica entries for catalogs on Ouranos. This way, a user can query any single node on the network to get info on data across the network.
Also, more generally, if you are going to curate copies of ESGF files that you aggregate yourself, how are you going to determine whether one of the files making the aggregation has been updated, and this update needs to be propagated to your own version?
It is my understanding that when files are updated on ESGF, they are marked with a new version number, existing files are never overridden. We should follow the same approach. The aggregations will be for all files in a single version. If, for example, one of the files in say an EC-Earth
dataset (which come in 165 individual files) are updated, then it will be up to the manager of the node to decide if he/she wants to create a new aggregation using the updated file. My approach for UofT node would be to not touch do that unless the changes are substantial. Let me know if I didn't understand your question correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or do you mean the replica of a STAC item (just the catalog entry, not the file) ? In that case, I think we need to work out how STAC catalogs are going to be shared from one node to another.
Yes just STAC items. We will be doing this by completing the implementation of STACLoader. This will crawl a STAC catalog, get the JSON data, modify it as needed (for example to change marble:is_replica
) and post it to another STAC catalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something to consider for mirrors or "replica": https://github.com/stac-extensions/alternate-assets
This PR adds two properties to a STAC item, and in doing so helps describe aspects of the data or the STAC catalog that are useful on the Marble network:
mrbl:is_replica
andmrbl:host_node
.mrbl:host_node
, specifies the name of the Marble node on which the data is stored. Valid values for this (validity is not checked in this PR) are the values of the keys in the Marble node registry.mrbl:is_replica
is set tofalse
in the STAC entry for an item when both the item and the catalog are onmrbl:host_node
, and istrue
in any replicated STAC entries on any other node in the network (the purpose is very similar to the ESGFreplica
attribute, except on Marble the data is not replicated, only the catalog).The purpose of this PR is to get a sense of what others think about this, and to solicit useful feedback. "Implementation" itself is not critical for this PR, ideally, down the line, we would write a "Marble STAC extension" to add these properties and/or other future Marble related properties. Right now, we can just do it in a hack-ish way.