Add torben conversion script along with the output data #5

Open: wants to merge 2 commits into base: main

Conversation

jayrbolton

No description provided.

@jayrbolton jayrbolton requested a review from realmarcin January 20, 2021 00:30
@jayrbolton
Author

@realmarcin I just updated this PR to separate data into its own directory, and added some minimal documentation.

and _OUT_EDGE_HEADERS.
"""
noext = os.path.splitext(_SOURCE_PATH)[0].replace('/source/', '/out/')
node_path = noext + '.biolink-nodes.tsv'
Collaborator


can you align this file naming to:
IMGVR_extra_KGX_edges.tsv
IMGVR_extra_KGX_nodes.tsv

So the pattern is: subKG (Torben), then dataset (sample), then _KGX_edges.tsv, etc.

Also the format is KGX (and overlaps with many graph formats) and we are using some biolink terms to annotate our nodes/edges.
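One way the requested naming could be wired into the script is sketched below. It keeps the existing `/source/` to `/out/` substitution from the diff; the function name `kgx_output_paths` and the `subkg` and `dataset` parameters are hypothetical and not part of the PR.

```python
import os


def kgx_output_paths(source_path, subkg, dataset):
    """Return (nodes_path, edges_path) using <subKG>_<dataset>_KGX_*.tsv.

    Hypothetical helper: the name and parameters are assumptions made
    for illustration, not code from this PR.
    """
    # Mirror the existing script: write under /out/ instead of /source/.
    out_dir = os.path.dirname(source_path.replace('/source/', '/out/'))
    stem = f"{subkg}_{dataset}_KGX"
    nodes_path = os.path.join(out_dir, stem + "_nodes.tsv")
    edges_path = os.path.join(out_dir, stem + "_edges.tsv")
    return nodes_path, edges_path
```

Calling it with `subkg="IMGVR"` and `dataset="extra"` yields the two file names listed in the comment above.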

@realmarcin
Collaborator

realmarcin commented Jan 25, 2021

A few notes for the edges file:

  • Please leave out lat/long for now since we can't ingest values directly (we have pairwise sample distance data; we could try to bin that, but a separate ingest is OK there)
  • All the IMG: prefixes should just be GOLD: (for now, we can deal with the few IMG ids later)
  • There is a stray asterisk here: GOLD:*Microbiome
  • Your primary (subject) id should start with Ga (for GOLD analysis sample id), so maybe it's currently not the right column, e.g. IMG:403934
  • 'located_in' for now is strictly geography, so all other environment labels just get biolink:has_attribute
  • Assembly_size and Gene_count values need to be log10-transformed to get the bins
  • Not sure where these are coming from, but they can possibly be removed, unless it's the same field as in IMGVR_to_tsv.py:
    IMG:403936 biolink:has_attribute GOLD:DOE Joint Genome Institute (JGI) biolink:has_attribute
  • I think here you are using TaxonOID? But that's not an actual taxon id, so these also need to be removed:
    IMG:403936 biolink:has_taxonomic_rank IMG:3300021138 biolink:has_taxonomic_rank
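Two of the edge notes above (the IMG: to GOLD: prefix change with the stray asterisk, and the log10 binning) could be sketched roughly as follows. The helper names are hypothetical, and the choice of the floor of log10 as the bin label is an assumption, since the thread does not say how the bins are defined.

```python
import math


def normalize_id(curie):
    """Drop stray asterisks and rewrite an IMG: prefix to GOLD:."""
    cleaned = curie.replace('*', '')
    if cleaned.startswith('IMG:'):
        cleaned = 'GOLD:' + cleaned[len('IMG:'):]
    return cleaned


def log10_bin(value):
    """Return an integer log10 bin for a positive count, else None.

    Assumption: the bin is the floor of log10(value); the actual bin
    edges used by the ingest are not stated in this thread.
    """
    if value is None or value <= 0:
        return None
    return math.floor(math.log10(value))
```

For example, an Assembly_size of 2500 would fall into bin 3 under this assumption, and zero or missing values are left unbinned.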

@realmarcin
Collaborator

A few notes for the nodes file:

  • The edge comments apply here too, for most of the issues I think.

  • This one looks odd, no prefix? Not sure what this field is but also mismatch between id and label:
    WI - Practice 03JUN2009 epilimnion biolink:NamedThing

  • Looks like all nodes are type biolink:NamedThing? If you follow the IMGVR sample script there are more specific categories for all but two cases I think, take a look.

  • Also note the field regex to clean them up, eg remove white spaces, applies to both nodes and edges -- I just do it on the entire data frame:
    df = df.replace(',', '', regex=True)
    df = df.replace(' ', '_', regex=True)
    df = df.replace(':', '', regex=True)
    df = df.replace('__', '_', regex=True)
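The same cleanup can be sketched for a single field without pandas, which may help when checking individual labels. This assumes the intended replacement for spaces and double underscores is a single underscore; `clean_field` is a hypothetical name.

```python
def clean_field(value):
    """Apply the four replacements, in order, to one string field.

    Assumption: spaces and double underscores are both replaced with a
    single underscore, matching the data frame calls in the comment.
    """
    value = value.replace(',', '')
    value = value.replace(' ', '_')
    value = value.replace(':', '')
    value = value.replace('__', '_')
    return value
```

Note that the colon rule also strips the separator from CURIE-style ids such as `GOLD:...`, so a per-field version like this is probably best kept to label-like columns.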

@@ -0,0 +1,2 @@
* `/data/source` are source files that scripts in `/transform` take as input
Collaborator


here let's align to the KGX dir structure:
ls -1 data/
raw
transform
merged

and then subdirs for each source as in the merge.yaml
https://github.com/kbaseIncubator/KE_KG/blob/main/merge.yaml
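A minimal sketch of creating that layout is below. It assumes the per-source subdirectories live under `raw` and `transform` while `merged` stays flat, which the comment does not spell out; `make_kgx_layout` and the source names passed to it are hypothetical.

```python
from pathlib import Path


def make_kgx_layout(root, sources):
    """Create data/raw/<source>, data/transform/<source>, data/merged.

    Assumption: only raw and transform get one subdirectory per source;
    the thread does not confirm where the subdirectories belong.
    """
    root = Path(root)
    created = []
    for stage in ("raw", "transform"):
        for source in sources:
            path = root / stage / source
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    merged = root / "merged"
    merged.mkdir(parents=True, exist_ok=True)
    created.append(merged)
    return created
```

The source names would come from the entries in merge.yaml rather than being hard-coded.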
