Process and archive complete IWP dataset (high, medium, & low) #6
Data Processing Methods:
To Do:
|
Questions for PDG team: Please provide either answers or links to resources where I can find the following information:
IWP data
and the slightly longer filename, with an inserted
Separate data package for coastal water, inland water, & glacier mask
|
Either option would be fine by me. I suggest you implement the most effective one. As per the description, that would be Option 2. |
Generally, the temporal coverage of the images falls between 2001 and 2021, but the majority are post-2008 or 2010. The spatial coverage of ALL processed files generally falls within the Arctic tundra region and is confined to low-, medium-, and high-ice areas within the tundra. These terms (high, medium, low) are from Brown et al. 1998. |
@ChandiWitharana Thank you for the feedback. The Abstract and Methods were drafted for option 1, rather than option 2. We can split the package to go with option 2 if @mbjones thinks option 2 makes more sense as well. That would mean submitting 2 more tickets for a total of 3 repositories published by Monday. |
I searched for the Brown et al. 1998 paper you mentioned, and found this: https://nsidc.org/sites/default/files/heginbottometal_1993.pdf Please let me know if you are referring to a different publication, and we can include a formal citation for it in the metadata. |
It would be better to use this link: https://nsidc.org/data/ggd318/versions/2 |
From Anna:
|
⭐️ Note ⭐️ @julietcohen: @dvirlar2 pre-issued a DOI for the IWP dataset once it is published: |
@robyngit @julietcohen I'm happy to publish the dataset with the above DOI once things are ready to go, just let me know! Our code has changed a little since Juliet was on the curation team, and I wouldn't want y'all to use old code 🙂 |
@robyngit Thanks for the DOI. For now, I think we could manually configure the DOI to point at a manually-created landing page for the dataset. Once it is published in the ADC, the DOI would then be updated to point at the ADC landing page. Does that sound reasonable? |
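For reference, repointing a pre-issued DOI at an interim landing page amounts to updating the DOI's target URL with the registrar. The sketch below assumes the DataCite REST API; the endpoint, payload shape, DOI, landing-page URL, and credentials are placeholders that should be checked against the DataCite documentation rather than taken as the ADC's actual tooling:

```python
# Minimal sketch: repoint a pre-issued DOI at an interim landing page.
# The DOI, URL, and credentials are placeholders; verify the endpoint and
# payload shape against the DataCite REST API documentation before use.
import requests

doi = "10.18739/XXXXXXX"                  # placeholder, not the real pre-issued DOI
landing_page = "https://example.org/iwp"  # hypothetical interim landing page

resp = requests.put(
    f"https://api.datacite.org/dois/{doi}",
    json={"data": {"type": "dois", "attributes": {"url": landing_page}}},
    headers={"Content-Type": "application/vnd.api+json"},
    auth=("REPOSITORY_ID", "PASSWORD"),   # placeholder credentials
)
resp.raise_for_status()
print(resp.json()["data"]["attributes"]["url"])  # confirm the new target URL
```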
Overview of package entity relationships, with the processing steps we associate with each:

```mermaid
flowchart LR
A[A. Maxar]-->|MAPLE| B(B. IWP Shapefiles)
B --> |Staging| C(C. IWP Geopackages)
C --> |Rasterization| D(D. IWP Geotiffs)
D --> |Web tiling| E[E. IWP PNGs]
C --> |3dTiling| F(F. IWP 3DTiles)
Note[Square boxes\n likely not\n to be archived]
```
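As a generic illustration of the "Web tiling" step in the diagram above (GeoTIFF tiles to PNG web tiles), one tile could be converted roughly as follows. This is not the PDG viz-workflow code; the tile path and the simple grayscale stretch are placeholder assumptions:

```python
# Generic sketch of GeoTIFF -> PNG web tiling; paths and styling are hypothetical.
import numpy as np
import rasterio
from PIL import Image

with rasterio.open("WGS1984Quad/11/330/155.tif") as src:  # hypothetical GeoTIFF tile
    band = src.read(1).astype("float64")

# Stretch the band to 0-255 and save an 8-bit grayscale PNG alongside the GeoTIFF.
lo, hi = np.nanmin(band), np.nanmax(band)
scaled = np.zeros_like(band) if hi == lo else (band - lo) / (hi - lo) * 255.0
scaled = np.nan_to_num(scaled)  # treat any nodata pixels as 0
Image.fromarray(scaled.astype("uint8"), mode="L").save("155.png")
```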
|
For the initial release of the IWP layer we are aiming for mid-July or later, to correspond with other announcements. Since this might not leave sufficient time to get all of the metadata in order, we discussed initially publishing a minimal version of the data package so that at least we have a DOI in place that points to relevant information, in case anyone needs to cite or reference the data. We envisioned that this MVP data package would comprise just 1) citation info, 2) an abstract, and 3) a link to the file tree for downloads. However, more fields are required in order to publish a package on the ADC, so I think the package should contain all the fields that are marked as mandatory in the editor, but exclude all of the entity information for now. I created a test version of this minimal package that is mostly a copy of what @julietcohen already created, but without any files (so no python scripts and no data object descriptions). 📑 The MVP test version is available here. There are some outstanding issues with the metadata we have:
|
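A very rough sketch of what such a stripped-down record could look like, assuming the usual EML skeleton (title, creator, abstract, contact) and leaving out all entity-level metadata; the EML version, packageId, and every value below are placeholders rather than the ADC editor's actual output:

```python
# Rough sketch of a minimal EML record; all values are placeholders.
import xml.etree.ElementTree as ET

EML_NS = "https://eml.ecoinformatics.org/eml-2.2.0"  # assumed EML version
ET.register_namespace("eml", EML_NS)

eml = ET.Element(f"{{{EML_NS}}}eml",
                 {"packageId": "urn:uuid:PLACEHOLDER", "system": "knb"})
dataset = ET.SubElement(eml, "dataset")
ET.SubElement(dataset, "title").text = "Ice-wedge polygon detections ... (placeholder title)"

creator = ET.SubElement(dataset, "creator")
creator_name = ET.SubElement(creator, "individualName")
ET.SubElement(creator_name, "givenName").text = "Given"
ET.SubElement(creator_name, "surName").text = "Surname"

abstract = ET.SubElement(dataset, "abstract")
ET.SubElement(abstract, "para").text = (
    "Placeholder abstract, including a link to the file tree for downloads."
)

contact = ET.SubElement(dataset, "contact")
contact_name = ET.SubElement(contact, "individualName")
ET.SubElement(contact_name, "surName").text = "Surname"

ET.ElementTree(eml).write("iwp_minimal_eml.xml", encoding="UTF-8", xml_declaration=True)
```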
High, Medium, and Low ice regions are a categorization adapted from Brown et al. 2002 for image selection and processing purposes. (Brown, J., O. Ferrians, J. A. Heginbottom, and E. Melnikov. (2002). Circum-Arctic Map of Permafrost and Ground-Ice Conditions, Version 2 [Data Set]. Boulder, Colorado USA. National Snow and Ice Data Center. https://doi.org/10.7265/skbg-kf16. Date Accessed 12-08-2022.) |
From the UConn side, the team would be: |
Location: What's given is fine. |
Metadata should be fine |
Thanks @ChandiWitharana! What can we include to define what is meant by "high", "medium", and "low" ice regions? |
In @robyngit 's to-do list above, methods steps 3-5 require a release for the PDG packages. I will take care of this so we can include it in the metadata. |
Preliminary thoughts after viewing the latest version of the package: Title, Abstract, and Keywords:
I have ideas for how to flesh out the methods section, but I'll get to that later when I have more time. |
@dvirlar2 Thank you for the feedback 👍🏼 Regarding your first suggestion: The title does already include "high ice Arctic regions". Would you suggest wording it in a different way or is that sufficient? Regarding your second suggestion: The Sampling section does already include a short description of what "high ice" means and mentions that there are also medium and low ice regions. "The geographic area sampled is the "high ice" regions of the Arctic, which are those the dataset authors identified to contain a relatively high proportion of ice. The study extent encompasses all high ice regions masked for coastal oceans, glaciers, and surface water. Further additions to this dataset will include "medium ice" and "low ice" regions of the Arctic as well. These regions were classified by less ice content." Perhaps this is not sufficient, or we could put it in a different section so it's more obvious? |
I can move the high ice / medium ice / low ice descriptions from the Sampling section to the Abstract, since it seems that is what you are suggesting, given that section is all you had time to review so far. |
The high/med/low ice distinctions are not that critical; they essentially just signal the order in which different spatial regions were processed. It's helpful for people to know that the spatial extent of the dataset will grow over time, but the results in each region are the same, and the divisions between the regions are pretty arbitrary. |
Given Chandi's earlier comment and link to the NSIDC dataset, I found this information about the designations between high/med/low ice. From the user guide:
Medium ice is characterized by 10-20%, and low ice is 0-10%, with no internal breakdown in terrain like the high ice. From the Heginbottom paper, under the "ATDBs" section:
Given the above descriptions, I think it's reasonable to include some combination of the above content to explain the difference between the High, Medium, and Low ice datasets to users. I think these specific descriptions should go in the Sampling Description section of the dataset like Juliet mentioned above, but there should also be a sentence in the abstract mentioning that an explanation is provided further on in the dataset. In the Sampling Description section, we should also link to the NSIDC dataset and provide brief direction for users to view the User Guide and Heginbottom paper for more in-depth information. |
Given my own confusion reading the dataset title earlier in this thread, and my experience of having a harder time distinguishing between datasets with very similar titles, I would recommend changing the title to something along the lines of
That way, the High, Medium, and Low distinctions are more immediately clear to users. Food for thought |
List of Orcid IDs:
Need to verify:
Still need to include, if desired by person:
|
@dvirlar2 The plan is to update the dataset with new version releases to include all of the high, med, and low ice regions, and we plan to do that soon. So I think the title should not include that distinction. A proposed title:
|
The ORCiD Daphne suggested for Amal is correct |
(1) An explanation of the file naming template for shapefiles.
(2) What is the NSF award number (such as "NSF Award 2240912"), and is there any non-NSF funding info?
(3) What are the ORCiDs for Mahendra R. Udawalpola and Amit Hasan?
@mbjones thank you for clarifying that! I got confused between this ticket and our meeting last week on how the datasets were going to be broken up. I agree with the title you proposed 👍🏽 |
I've added @julietcohen 's test version onto the production site. ADC people can view it here. I haven't checked yet who has access to the test version, but I can do that at a later date. From a curation standpoint, I've:
To-Do:
|
The following is the placeholder filename for the dummy shapefile in the package:
After reading the structure provided by @amalshehan, I have a few questions:
|
Regarding Daphne's to-do items above:
I would also add Kastan Day to the list of dataset contributors for the geopackages and rasters, and Anna Liljedahl. |
Regarding the discipline choices, let's ask @amalshehan and @ChandiWitharana to review the proposal, with a pointer to the ADCAD vocabulary for choices.
To Do for the IWP metadata package:
|
Access to dataset added via ORCIDs:
I also added Chandi to the list of editors. His ORCID is already in the dataset. I'll follow up with Howard and Ronald. Also, I'd like to re-emphasize that if any of the above people should be listed in the dataset citation, they should be listed under the Data Set Creator section, and not under "people and associated parties" 🙂 If not, then no worries |
Thanks Daphne! I emailed Kastan to confirm that's his ORCiD. You are correct about distinguishing between the Data Set Creator section and the "people and associated parties", sorry to cause confusion there. |
ORCID Updates:
|
Kastan also confirmed that the ORCiD listed above is his |
@dvirlar2 Based on my discussions with @ChandiWitharana, we would like to point to the PGC data docs for extra details on file naming, as the original data was acquired from PGC and the names were maintained as-is. If you think that we should document (archive) this, I can respond to the specific details you requested above. The PGC data doc I am referring to is this PDF: PGC Commercial Satellite Imagery Documentation (umn.edu)
There is no difference. The original time stamp is given by the vendor, and the acquisition time stamp is added by PGC.
Georectified images are corrected for any geometric distortions that may be present in the original/standard image due to the approach used to acquire the image.
No |
@dvirlar2, it would be good to also have Earth Science, Computer Vision, Geo AI, and Big Data. |
Should verify at some point how Torre Jorgenson wants to be identified in the dataset. Seems he goes by Torre among peers, but is professionally known as Mark. For now I'm putting him down as "M. Torre" in this dataset, and including information based on this recent dataset |
The dataset has been finalized from my POV, and I've sent it to Matt and Juliet to review before sending it off to others. You can view things here: Also, I thought I had sent the Academic Ontology for the dataset annotations, but I see that I did not! @amalshehan For context, this ontology is where we pull our "dataset annotations" from. Earlier I had mentioned cryology, soil science, and data science as possible choices. I ended up going with data science and cryology based on your earlier comments! Let me know if you have any questions 🙂 |
@julietcohen I rearranged the IWP dataset to streamline the directory structure as we discussed. Here's what I did, and the final file layout:

```bash
cd /var/data/10.18739/A2KW57K57/
cd iwp_geopackage_high/
# Promote the geopackage tile tree and the staging summary up out of staged/
mv staged/gpub020/WGS1984Quad .
mv staged/staging_summary.csv .
# Move what is left of staged/ out of the dataset directory
mv staged /var/data/submission/pdg/ice-wedge-polygon-data/
cd ../iwp_geotiff_high/
# Flatten the geotiff/ subdirectory, then remove the now-empty directory
mv geotiff/WGS1984Quad .
mv geotiff/raster_events.csv .
mv geotiff/raster_summary.csv .
mv geotiff/raster_summary_duplicate.csv .
rmdir geotiff
cd ..
tree -L 2 .
.
├── cleaning_materials
│ ├── add_date_attribute_footprints.py
│ └── cleaning_data
├── iwp_geopackage_high
│ ├── staging_summary.csv
│ └── WGS1984Quad
├── iwp_geotiff_high
│ ├── raster_events.csv
│ ├── raster_summary.csv
│ ├── raster_summary_duplicate.csv
│ └── WGS1984Quad
├── iwp_shapefile_detections
│ ├── high
│ ├── low
│ └── medium
└── iwp_shapefile_footprints
├── high
├── low
└── medium
```

I also revised the Mermaid diagram to reflect these changes, and worked a bit on the wording in that diagram:

```mermaid
flowchart LR
A["Maxar <br> (satellite images)"] -->|MAPLE| B("`**/iwp_shapefile_detections/**
Format: Shapefile
Irregularly shaped vector files, one per image`")
B -->|Create Tiles and <br> Identify Duplicates| C("`**/iwp_geopackage_high/**
Format: GeoPackage
Evenly-spaced vector tiles, with duplicates flagged`")
C -->|Rasterize and remove <br> flagged duplicates| D("`**/iwp_geotiff_high/**
Format: GeoTIFF
Evenly-spaced raster tiles, with duplicates removed`")
```
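As a generic illustration of the "Rasterize and remove flagged duplicates" step in the diagram above: read one GeoPackage tile, drop rows carrying a duplicate flag, and burn the remaining polygons into a GeoTIFF. This is not the actual PDG viz-workflow code; the file paths, the duplicate-flag column name, the tile size, and the burn value are all assumptions for illustration:

```python
# Generic sketch of GeoPackage tile -> GeoTIFF tile with duplicates dropped.
import geopandas as gpd
import rasterio
from rasterio import features
from rasterio.transform import from_bounds

gdf = gpd.read_file("iwp_geopackage_high/WGS1984Quad/11/330/155.gpkg")  # hypothetical tile
gdf = gdf[~gdf["staging_duplicated"]]  # assumed name of the duplicate-flag column

width = height = 256  # assumed pixels per tile
transform = from_bounds(*gdf.total_bounds, width, height)

# Burn a value of 1 into every pixel covered by an ice-wedge polygon.
raster = features.rasterize(
    ((geom, 1) for geom in gdf.geometry),
    out_shape=(height, width),
    transform=transform,
    fill=0,
    dtype="uint8",
)

with rasterio.open(
    "iwp_geotiff_high/WGS1984Quad/11/330/155.tif", "w",
    driver="GTiff", height=height, width=width, count=1,
    dtype="uint8", crs=gdf.crs, transform=transform,
) as dst:
    dst.write(raster, 1)
```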
|
Edits for next release:
|
@dvirlar2 For the IWP mapping the HPC resources used are from TACC allocation DPP20001 and ACCESS allocation DPP190001. Do you need any other details such as the specific systems used? |
Kenton provided the following to help fill in the ACCESS / TACC grant info:
National Science Foundation - Leadership Resource Allocation (LRAC): Harnessing big satel-
National Science Foundation - ACCESS Explore: Permafrost Discovery Gateway Pan-Arctic
And based on the format of the above ACCESS awards, the new allocation info is:
National Science Foundation - ACCESS Discover: Permafrost Discovery Gateway Pan-Arctic |
More info from Kenton, the IBM acknowledgement:
IBM-Illinois Discovery Accelerator Institute - Scaling Data-Intensive Discovery Workflows on
IBM-Illinois Discovery Accelerator Institute - HDC: A Full-Stack Solution for the Hybrid Cloud |
Thanks @amalshehan and @julietcohen! I think that's all the info I need, but I'll let you know if that changes. |
Since this issue has been stagnant for some time, an update: one run on Delta processed the high ice (more than half the data), and the other run processed the low and medium ice, so deduplication between those 2 tilesets was not executed. This is because the merging step executes deduplication for gpkg files that were staged on different nodes. Because merging so many files takes days and depletes our Delta credits, it would be best to finish developing the kubernetes and parsl workflow to run on the NCEAS server (or another server, such as Google Cloud Platform) so we can take advantage of fast and powerful hardware without the run time, credit, and memory limitations we experience on Delta. Tickets that describe the progress of the kubernetes workflow are documented in the viz-workflow repo. |
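As a tiny sketch of the parsl side of that idea (not the actual viz-workflow code): wrap the per-file staging step in a parsl app so files run concurrently, and swap the executor configuration (local threads here; a Kubernetes or HPC provider in production) without changing the workflow logic. The staging function body and input paths below are placeholders:

```python
# Minimal parsl sketch: run the per-file staging step concurrently.
import parsl
from parsl import python_app
from parsl.configs.local_threads import config  # illustration only; swap for a k8s/HPC config

parsl.load(config)

@python_app
def stage_one(shapefile_path):
    # Placeholder for the real staging step (shapefile -> geopackage tiles).
    return f"staged {shapefile_path}"

shapefiles = [
    "iwp_shapefile_detections/high/example_a.shp",  # hypothetical inputs
    "iwp_shapefile_detections/high/example_b.shp",
]

futures = [stage_one(p) for p in shapefiles]
print([f.result() for f in futures])  # block until all staging tasks finish
```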
- Also make minor changes to the clip_to_footprint deduplication method. Relates to #6
@ChandiWitharana and Elias, I'd like your opinions regarding how to archive the PDG data and metadata for the IWP and water/glacier clipped datasets. Elias processed the .shp files last week, and Kastan is running the workflow to stage, rasterize, and create the web tiles. We're processing both these datasets in advance of NNA (prioritizing the IWP dataset), and we'll have the .shp, .gpkg, and .tif data files archived on the Arctic Data Center.

Matt suggested 2 ways to archive the data:
Which option is best?
As I document each file type, I'll be checking in with you about metadata and authorship questions.