How to access the taxon IDs if the NeXML source contains them #129
A couple of ways, depending on what you prefer. You can always query the S4 object structure, as described in the S4 vignette (https://cran.r-project.org/web/packages/RNeXML/vignettes/S4.html), which is the natural R way. You can also query by XPath, but that's less easy in RNeXML, since we assumed few users would know XPath (and those who did would just parse the file directly with the XML library). Um, stupid question just to be clear: in your example, which is the id? The value of the id attribute on the otu element, or the href of the subsequent meta element? (Though of course these are related.) For RDFa meta elements there's more tooling, including first generating the corresponding RDF-XML document and then running full SPARQL queries if you like (as well as XML/XPath-based queries of the RDF-XML). HTH, |
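For readers who want to try the direct XPath route mentioned above, here is a minimal sketch using the xml2 package rather than RNeXML itself. The NeXML fragment, the otu ids, the labels, and the example.org IRIs are all made up for illustration; only the general shape (an otu element with an id attribute and a nested meta element carrying rel="dwc:taxonID" and an href) follows what the comment describes.

```r
library(xml2)

# Hypothetical NeXML fragment; ids, labels and IRIs are invented for this sketch.
nexml_txt <- '
<nexml xmlns="http://www.nexml.org/2009"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <otus id="otus1">
    <otu id="otu_a" label="Taxon A">
      <meta xsi:type="ResourceMeta" rel="dwc:taxonID"
            href="http://example.org/taxon/A"/>
    </otu>
    <otu id="otu_b" label="Taxon B"/>
  </otus>
</nexml>'

doc <- read_xml(nexml_txt)
xml_ns_strip(doc)  # drop the default namespace so plain XPath names match

otus <- xml_find_all(doc, "//otu")

# xml_find_first() returns one result per otu, with a missing node (and hence
# an NA attribute) where an otu carries no dwc:taxonID annotation.
taxon_meta <- xml_find_first(otus, "./meta[@rel='dwc:taxonID']")

otu_table <- data.frame(
  otu     = xml_attr(otus, "id"),
  label   = xml_attr(otus, "label"),
  taxonID = xml_attr(taxon_meta, "href"),
  stringsAsFactors = FALSE
)
print(otu_table)
```

Because each row is built from a single otu element, the id, the label, and the annotation stay tied together regardless of document order.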
Isn't the S4 way to use API methods rather than accessing the object structure directly? (Otherwise, why even use an S4 object?)
Yes, but I find this rather dissatisfying as an answer for how to get at one of the most important pieces of information about an OTU. I do think that there should be an API method for it. |
Hi @hlapp, Yes, as I mentioned there are several ways, most of which are in the documentation. For instance, you can get the metadata from Sorry to be disappointing; without understanding the use case and the user's preferences better, it's hard to know which way is best for what. E.g., someone who really wants to make semantic SPARQL queries might find the R-level API functions dissatisfying. I don't have much intuition about what is the "most important information about X", but we have tried to document the different methods. It seemed silly to just write S4 accessor methods for everything, so these focus on things like the metadata elements. Suggestions and PRs always welcome. |
(Whoops, forgot the link to the metadata vignette: https://cran.r-project.org/web/packages/RNeXML/vignettes/metadata.html) |
The use case is to extract a table that maps OTU labels (which are the row labels in the data.frame returned by Does |
@hlapp not sure, example to play with? |
It looks like @hlapp can you give us a bit more detail as to what would be the most desired behavior for the return objects of |
Looks like
x <- nexml_read("http://kb.phenoscape.org/api/ontotrace?taxon=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FVTO_0036217%3E&entity=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050%3E%20some%20%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0008897%3E")
get_characters(x)
#> pelvic splint anterior dentation of pectoral fin spine anterior distal serration of pectoral fin spine
#> Ictalurus pricei 1 1 1
#> Ictalurus lupus 1 1 1
#> Ictalurus balsanus 1 0 <NA>
#> Ictalurus furcatus 1 0 1
#> Ictalurus punctatus <NA> 1 1
#> Ictalurus australis 1 1 1
#> Ictalurus sp. (Mo 1991) 0 <NA> <NA>
#> Ictalurus dugesii 1 <NA> 1
#> Ictalurus mexicanus 1 <NA> 1
get_metadata(x, level = "otu")
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036225"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0061498"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0061495"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036221"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036218"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036223"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036220"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0061497"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0061496"
get_taxa(x)
#> [1] "Ictalurus punctatus" "Ictalurus mexicanus" "Ictalurus australis" "Ictalurus balsanus" "Ictalurus pricei" "Ictalurus furcatus" "Ictalurus lupus" "Ictalurus dugesii" "Ictalurus sp. (Mo 1991)" |
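As a small illustration of what can and cannot be done with the output above, the sketch below flattens such a list into a character vector and shortens the IRIs. It assumes the returned value behaves like a plain named list of strings (the first three entries are copied from the printout); note that nothing in the list itself says which OTU each identifier belongs to, which is the gap this thread is about.

```r
# First three entries copied from the printed output above; assumption: the
# object is an ordinary named list of character strings.
meta <- list(
  `dwc:taxonID` = "http://purl.obolibrary.org/obo/VTO_0036225",
  `dwc:taxonID` = "http://purl.obolibrary.org/obo/VTO_0061498",
  `dwc:taxonID` = "http://purl.obolibrary.org/obo/VTO_0061495"
)

# Drop the repeated (and therefore uninformative) names.
taxon_iri <- unlist(meta, use.names = FALSE)

# Shorten the OBO PURLs to compact identifiers for display.
taxon_curie <- sub("^.*/obo/VTO_", "VTO:", taxon_iri)
print(taxon_curie)
```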
So is the real issue that the rows in the matrix need to be reordered to be consistent with what the other methods return? |
Does the link to the issue on the rphenoscape tracker help establish enough context as to where this is coming from? Essentially the use case is to create mapping tables from name to identifier(s). To apply these painlessly and with the fewest gotchas, the names in the mapping table should be in the same order as everywhere else. Does that make sense? The alternative is to get metadata back directly linked to what they annotate. One way or another, there needs to be a good way to establish that mapping. |
@hlapp yeah, the rphenoscape example is helpful, but I'm still processing the details here. It's not clear to me why the order in which the characters are returned in the table matters, since we generally expect methods that operate on data.frames to be agnostic of the order of the rows. I thought NeXML had the same philosophy: in general it encodes all data explicitly in fields, rather than implicitly through structure (such as ordering or nestedness). Do I have this wrong? It seems like the problem with the above methods is not so much the order as the lack of an additional id column. My reading of the phenoscape issue is that we want three data.frames as the return objects. I think they should return data.frames, and it sounds like they need another column that contains the data you refer to as being represented by the ordering, but I'm not quite sure what that is. (E.g. I suppose it is what they are annotating, but be really explicit for me: is it the id of the element, or the label, or something else?) Anyway, I completely agree that we need a good way to establish the mapping and that the current return objects are failing to preserve that information. |
Isn't it pretty common in R to use the index for subsetting? Say I wanted to subset the matrix to keep all rows for taxa that have an identifier:
data <- get_characters(nexml)
ids <- get_metadata(nexml, level="otu") # this won't work this way right now, of course
data <- data[!is.na(ids), ]
Say you have a function
data <- get_characters(nexml)
ids <- get_metadata(nexml, level="otu") # this won't work this way right now, of course
data <- data[pk_is_descendant("Mammalia", ids), ]
Does that make sense? |
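To make the ordering concern concrete, here is a minimal base-R sketch with invented data (the taxon names, character values, and identifiers are hypothetical stand-ins, not RNeXML output). It shows how positional subsetting pairs the wrong rows when the identifier vector is in a different order than the matrix rows, and how aligning by name with match() avoids that, assuming the identifiers come back named by taxon label.

```r
# Hypothetical character matrix with taxon labels as row names.
data <- data.frame(
  char1 = c(1, 0, 1),
  row.names = c("Taxon B", "Taxon C", "Taxon A")
)

# Hypothetical identifiers, named by taxon label but in a different order;
# Taxon B has no identifier.
ids <- c("Taxon A" = "ID:1", "Taxon B" = NA, "Taxon C" = "ID:3")

# Positional subsetting: keeps rows 1 and 3, i.e. Taxon B (which has no id)
# and Taxon A, and silently drops Taxon C.
wrong <- data[unname(!is.na(ids)), , drop = FALSE]

# Align the identifiers to the row order first, then subset.
aligned <- ids[match(rownames(data), names(ids))]
right <- data[!is.na(aligned), , drop = FALSE]

print(rownames(wrong))
print(rownames(right))
```

The positional version raises no error, which is what makes a mismatch in ordering easy to miss.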
The piece that is being used elsewhere for row and column labels. For example, the data matrix uses apparently the taxon labels for its row labels, and the character labels for its column labels. |
Yup. I think that was a bad choice though. The row labels should be a column, and the rows should be unlabelled. This is more consistent with database design and more robust for manipulation. Yes, people subset in R with index vectors, but typically only when the index vector is constructed from a column of the data.frame you are subsetting (where clearly you cannot have the problem of different orderings). Your example is very helpful, but it sounds like |
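A minimal sketch of the "labels as a column" layout argued for above, again with invented data: once the taxon label is an ordinary column, the identifier table can be joined with base R's merge(), and row order stops mattering. The column names taxa and taxonID are assumptions chosen for this example, not the names RNeXML returns.

```r
# Hypothetical characters table with the taxon label as a regular column.
chars <- data.frame(
  taxa  = c("Taxon B", "Taxon C", "Taxon A"),
  char1 = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical identifier table, in a different order and missing Taxon B.
ids <- data.frame(
  taxa    = c("Taxon A", "Taxon C"),
  taxonID = c("ID:1", "ID:3"),
  stringsAsFactors = FALSE
)

# Left join on the key column; taxa without an identifier get NA.
joined <- merge(chars, ids, by = "taxa", all.x = TRUE)
print(joined)

# Subsetting now uses a column of the same data.frame, so ordering is moot.
with_id <- joined[!is.na(joined$taxonID), ]
```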
Hi @cboettig,
@hlapp Do you agree? |
Yes, I agree, that seems more natural. That said, it would be easy enough to splice out the ID column in rphenoscape (and to construct a separate data.frame with metadata mapped to taxon labels) before returning the result to the user, in case @cboettig would rather put it into the data.frame returned by |
Thanks for the advice here, very helpful. I'm still a tad leery of using labels instead of ids as keys for indexing and joining tables (I'm guessing labels can have more weird UTF-8 chars than ids, and thus cause trouble if a user has not configured locales sensibly). Isn't that why we have ids in the first place? Are we always guaranteed to have both id and label elements available (i.e. are they both required by the schema? Guess I should know that...) |
No! The label attribute is optional. |
@rvosa very good, that makes much more sense. I'll return ids as the key column for each of the data.frames |
Which IDs? The ones in the |
Right, they are local to the document, but they'd still be better than On Sat, Oct 17, 2015, 3:55 PM Hilmar Lapp [email protected] wrote:
|
Also, there is no requirement that all otu attributes, even if they're there, are unique. @hlapp do you think it would ever be problematic that the ids are ephemeral? I mean, in practice? They are unique keys for managing referential integrity within the document, anything else you should use an annotation for (example: any kind of database ID). |
Exactly (and @cboettig, this is the answer to your question: they are in essence local primary keys, which are never really useful to expose or export to anything else, including from XML documents). So you might as well use a sequential numbering: it's unique for each row, and will obviously be local to a data matrix (whereas that fact might be much less obvious from the XML doc's primary keys). So I know there's been concern with and objection to using the row order of the data matrix, but in essence we're back to that. |
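The trade-off between ids, labels, and a sequential row number can be checked mechanically. The sketch below uses a hypothetical helper, is_valid_key(), and invented OTU data to test whether a column can serve as a join key (no missing values, no empty strings, no duplicates), then adds a sequential key as suggested above.

```r
# Hypothetical helper: TRUE if a character vector can act as a unique key.
is_valid_key <- function(x) {
  !anyNA(x) && all(nzchar(x)) && anyDuplicated(x) == 0L
}

# Invented OTU table: ids are unique, labels are optional and may repeat.
otu <- data.frame(
  id    = c("otu1", "otu2", "otu3"),
  label = c("Taxon A", "", "Taxon A"),
  stringsAsFactors = FALSE
)

id_ok    <- is_valid_key(otu$id)     # document-local ids qualify
label_ok <- is_valid_key(otu$label)  # empty and duplicated labels do not

# Sequential numbering: always unique, and visibly local to this table.
otu$row <- seq_len(nrow(otu))
print(otu)
```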
I must apologize for not having read the thread closely enough. If @hlapp's The consequence is that the names column may legally have empty or Programmatically we therefore can't rely on them to act like primary keys
|
Just as an FYI, now that @balhoff implemented identifier annotations for character definitions (see phenoscape/phenoscape-kb-services#20), we can see that the order in which So right now, Perhaps this is a good point to arrange a conference call to move this issue forward? |
Sounds good to me. I've implemented a new version of get_metadata now on the
No idea if this is wise or not, but shows what I am thinking. I'd like |
I guess I'm curious what you mean by similar data.frame. ID, label, and? And for |
Good questions, and please push back if I'm saying something silly; you, @xu-hong and @balhoff have a better idea than me about the actual use cases here. For For the characters matrix, I would probably only add the value of the @rvosa does the |
@hlapp @xu-hong others what do you think of this approach (now implemented on the Note that I've left the |
Perhaps there is an easy way in the API already - how does one get at the taxon IDs as annotated, for example, in the form of dwc:taxonID metadata? Like here in the NeXML produced by the Phenoscape API:
cc @xu-hong.