Add torben conversion script along with the output data #5

jayrbolton · 2021-01-20T00:30:30Z

No description provided.

jayrbolton · 2021-01-20T22:42:54Z

@realmarcin I just updated this PR to separate data into its own directory, and added some minimal documentation.

realmarcin · 2021-01-21T07:56:17Z

transform/torben_transform_samples.py

+    and _OUT_EDGE_HEADERS.
+    """
+    noext = os.path.splitext(_SOURCE_PATH)[0].replace('/source/', '/out/')
+    node_path = noext + '.biolink-nodes.tsv'


Can you align this file naming to:
IMGVR_extra_KGX_edges.tsv
IMGVR_extra_KGX_nodes.tsv

So the pattern is subKG (Torben), then dataset (sample), then _KGX_edges.tsv, and the same for nodes.

Also, the format is KGX (which overlaps with many graph formats); we are only using some Biolink terms to annotate our nodes/edges.
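
A minimal sketch of how the output paths could be built to follow this naming pattern. The helper name `kgx_output_paths`, the `data/out` directory, and the exact spelling of the subKG and dataset parts (`Torben`, `sample`) are assumptions for illustration, not names taken from the script under review.

```python
import os


def kgx_output_paths(out_dir, sub_kg, dataset):
    """Build (nodes_path, edges_path) as <subKG>_<dataset>_KGX_{nodes,edges}.tsv.

    Hypothetical helper: the argument names and the output directory are
    assumptions, only the file name pattern comes from the review comment.
    """
    stem = f"{sub_kg}_{dataset}_KGX"
    nodes_path = os.path.join(out_dir, stem + "_nodes.tsv")
    edges_path = os.path.join(out_dir, stem + "_edges.tsv")
    return nodes_path, edges_path


# The spelling of the two name parts below is an assumption.
nodes_path, edges_path = kgx_output_paths("data/out", "Torben", "sample")
nodes_name = os.path.basename(nodes_path)
edges_name = os.path.basename(edges_path)
```

Called with `"IMGVR"` and `"extra"`, the same helper produces the two file names quoted in the comment above.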

realmarcin · 2021-01-25T22:21:06Z

A few notes for the edges file:

Please leave out lat/long for now, since we can't ingest the values directly (we have pairwise sample distance data that we could try to bin, but a separate ingest is fine for that).
All the IMG: prefixes should just be GOLD: (for now; we can deal with the few IMG ids later).
There is a stray asterisk here: GOLD:*Microbiome?
Your primary (subject) id should start with Ga (for GOLD analysis sample id), so it may currently be coming from the wrong column, e.g. IMG:403934.
'located_in' is strictly geography for now, so all other environment labels just get biolink:has_attribute.
Assembly_size and Gene_count values need to be log10-transformed to get the bins.
Not sure where these are coming from, but they can possibly be removed, unless it's the same field as in IMGVR_to_tsv.py:
IMG:403936 biolink:has_attribute GOLD:DOE Joint Genome Institute (JGI) biolink:has_attribute
I think here you are using TaxonOID? But that's not an actual taxon id, so these also need to be removed:
IMG:403936 biolink:has_taxonomic_rank IMG:3300021138 biolink:has_taxonomic_rank
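
One possible reading of the log10 and prefix notes above, as a rough sketch. The function names `log10_bin` and `to_gold_curie` are hypothetical, and a bin width of one log10 unit is an assumption; the IMGVR script may bin differently.

```python
import math


def log10_bin(value):
    """Return the integer log10 bin of a positive number, or None if unusable.

    Assumption: one bin per order of magnitude (floor of log10).
    """
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    if math.isnan(number) or math.isinf(number) or number <= 0:
        return None
    return math.floor(math.log10(number))


def to_gold_curie(curie):
    """Rewrite an IMG: prefix to GOLD:, leaving every other id untouched."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix == "IMG":
        return "GOLD:" + local
    return curie


# An Assembly_size of 2,500,000 lands in bin 6, a Gene_count of 999 in bin 2.
assembly_bin = log10_bin("2500000")
gene_bin = log10_bin(999)
```

Returning `None` for empty, zero, or non-numeric fields lets the caller skip the edge rather than emit a meaningless bin.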

realmarcin · 2021-01-25T22:24:18Z

A few notes for the nodes file:

The edges comments apply here too, for most of the issues I think.
This one looks odd: no prefix? Not sure what this field is, but there is also a mismatch between id and label:
WI - Practice 03JUN2009 epilimnion biolink:NamedThing
Looks like all nodes are of type biolink:NamedThing? If you follow the IMGVR sample script, there are more specific categories for all but two cases, I think. Take a look.
Also note that the field regex to clean them up (e.g. removing white space) applies to both nodes and edges. I just do it on the entire data frame:
df = df.replace(',', '', regex=True)
df = df.replace(' ', '', regex=True)
df = df.replace(':', '', regex=True)
df = df.replace('__', '', regex=True)
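
For reference, a plain-Python sketch of the same four replacements, applied in the same order to a single field. One thing to watch for: run over a whole data frame, the `':'` replacement removes every colon in every string cell, including the separator in a prefixed id such as `GOLD:...` if prefixes have already been attached. Whether the IMGVR script cleans before or after adding prefixes is not shown in this thread, so `clean_curie` below is a hypothetical variant that cleans only the part after the prefix.

```python
def clean_field(value):
    """Apply the four replacements from the comment above, in the same order."""
    for pattern in (",", " ", ":", "__"):
        value = value.replace(pattern, "")
    return value


def clean_curie(curie):
    """Hypothetical variant: keep the first ':' as the prefix separator."""
    prefix, sep, local = curie.partition(":")
    if not sep:
        return clean_field(curie)
    return prefix + ":" + clean_field(local)


label = clean_field("WI - Practice 03JUN2009 epilimnion")
whole = clean_field("GOLD:DOE Joint Genome Institute (JGI)")
kept = clean_curie("GOLD:DOE Joint Genome Institute (JGI)")
```

With these inputs, `whole` loses its prefix separator while `kept` retains it.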

realmarcin · 2021-01-21T07:59:01Z

data/README.md

@@ -0,0 +1,2 @@
+* `/data/source` are source files that scripts in `/transform` take as input


here let's align to the KGX dir structure:
ls -1 data/
raw
transform
merged

and then a subdirectory for each source, as in merge.yaml:
https://github.com/kbaseIncubator/KE_KG/blob/main/merge.yaml
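
A small sketch of what that layout could look like when created from a script. The helper name `make_kgx_dirs` and the source name `torben` are hypothetical, and placing per-source subdirectories under `raw` and `transform` but not under `merged` is an assumption; merge.yaml is the authority on the actual source names.

```python
import tempfile
from pathlib import Path


def make_kgx_dirs(root, sources):
    """Create raw/<source>, transform/<source>, and merged under root.

    Assumption: only raw and transform get one subdirectory per source.
    """
    root = Path(root)
    created = []
    for stage in ("raw", "transform"):
        for source in sources:
            path = root / stage / source
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    merged = root / "merged"
    merged.mkdir(parents=True, exist_ok=True)
    created.append(merged)
    return created


# Demonstrate in a throwaway directory so nothing in the repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    layout = make_kgx_dirs(Path(tmp) / "data", ["torben"])
    created = sorted(p.relative_to(tmp).as_posix() for p in layout)
    all_exist = all(p.is_dir() for p in layout)
```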

Add torben conversion script along with the output data

202a6eb

jayrbolton requested a review from realmarcin January 20, 2021 00:30

Organize data into a separate directory; add some minimal docs

52155b9

realmarcin reviewed Jan 21, 2021

realmarcin reviewed Jan 26, 2021