Use struct for node metadata #642
Conversation
Force-pushed from 3786aea to 3fb7b94.
This has the new metadata scheme, where the tsinfer-specific metadata is in a "tsinfer" key.
Codecov Report

@@            Coverage Diff            @@
##              main     #642   +/-   ##
=========================================
  Coverage    93.08%   93.09%
=========================================
  Files           17       17
  Lines         5136     5140     +4
  Branches       954      954
=========================================
+ Hits          4781     4785     +4
  Misses         235      235
  Partials       120      120
This is ready for review. I reckon if we merge it (or whatever we end up with) at the same time as #607, then both format changes land at roughly the same time.
Looks good, but I'm not sure about the schema. I think we're overcomplicating it.
The metadata column should be interpretable as a 2D numpy array of int32s, so all the unused values are explicitly filled in as (-1, -1). There are no defaults then. I think this is much more useful in practice than using nulls and saving a few bytes.
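For concreteness, a minimal sketch of that idea, assuming tskit's struct metadata codec and borrowing the ancestor_data_id / sample_data_id field names from later in this thread (the schema below is illustrative, not the one in this PR):

```python
import numpy as np
import tskit

# Illustrative only: two fixed int32 fields per node, with -1 as the explicit
# "not set" value rather than nulls/defaults.
schema = tskit.MetadataSchema({
    "codec": "struct",
    "type": "object",
    "properties": {
        "ancestor_data_id": {"type": "integer", "binaryFormat": "i"},
        "sample_data_id": {"type": "integer", "binaryFormat": "i"},
    },
    "required": ["ancestor_data_id", "sample_data_id"],
    "additionalProperties": False,
})

tables = tskit.TableCollection(sequence_length=1)
tables.nodes.metadata_schema = schema
tables.nodes.add_row(time=0, metadata={"ancestor_data_id": 7, "sample_data_id": -1})
tables.nodes.add_row(time=1, metadata={"ancestor_data_id": -1, "sample_data_id": -1})

# Every record is exactly 8 bytes, so the raw metadata column can be viewed as an
# (n_nodes, 2) int32 array with no per-row decoding, assuming the codec's packed
# byte order matches the platform's int32 layout.
ids = tables.nodes.metadata.view(np.int32).reshape(-1, 2)
print(ids)  # [[ 7 -1]
            #  [-1 -1]]
```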
"codec": "struct", | ||
"type": ["object", "null"], | ||
"properties": { | ||
"tsinfer": { |
I don't see the point in the "tsinfer" top key here.
I thought it was cleaner, because then when we e.g. add values inferred from tsdate, we can store them in a tsdate property. That way it is clear where the different metadata values have come from, and they don't really risk overwriting each other etc.
Addendum: I see this as a nice advantage of struct: we can be verbose in the explanation (and nesting) of the properties in the metadata, and it doesn't take up extra space in the encoding of the metadata for each row. That's my understanding, anyway.
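As a sketch of what that might look like (illustrative only, extending the struct-codec fragment quoted above, and assuming the codec handles nested objects as that fragment implies): the tsinfer grouping exists only in the schema, so each row still encodes to the same fixed-size packed values.

```python
import tskit

# Illustrative sketch, not the PR's actual schema: fields grouped under a
# "tsinfer" object. The key names and nesting live in the schema itself,
# so the per-row encoding is still just the packed values.
schema = tskit.MetadataSchema({
    "codec": "struct",
    "type": "object",
    "properties": {
        "tsinfer": {
            "type": "object",
            "properties": {
                "ancestor_data_id": {"type": "integer", "binaryFormat": "i"},
                "sample_data_id": {"type": "integer", "binaryFormat": "i"},
            },
            "required": ["ancestor_data_id", "sample_data_id"],
            "additionalProperties": False,
        },
    },
    "required": ["tsinfer"],
    "additionalProperties": False,
})

row = {"tsinfer": {"ancestor_data_id": 7, "sample_data_id": -1}}
print(len(schema.validate_and_encode_row(row)))  # 8 bytes, same as a flat layout
```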
The nice thing is that there is no storage cost to the additional level.
> The nice thing is that there is no storage cost to the additional level.

Yes, that's what I was trying to say (but you said it better).
As I say, the main point of using struct here for me would be to enable direct access to the metadata as numpy arrays. Let's get some info on doing that first before discussing the other details any further.
Ah, that's not the main reason I would be using a struct, TBH. I didn't even think about wanting to get the entire array out. I can see that would be useful sometimes, though. BTW, and not that it matters, but the […]
If the info is sparse and you mainly plan on leaving things blank then there's no point at all in struct, I think. But being able to process these things efficiently in Python seems attractive to me.
The […]

Edit: that's at least half a gigabyte in duplicated text in the tree sequence, I think?
Maybe some concrete numbers for chr1 would be helpful.
OK, for the p-arm of chr1, which is 123.4 MB (i.e. about 1/25th of the whole), we are storing 30 MB in node metadata, of which we have just over a million copies of the byte string "ancestor_data_id" (= 16 MB), and no copies of the string "sample_data_id" (because we only use this for historically inserted samples, which we don't do in the unified ts).
So what would that be for 8 bytes per node? Around 8 MB, I guess?
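Roughly, using the "just over a million" figure quoted above (counts are approximate, so this is only back-of-envelope arithmetic):

```python
# Back-of-envelope only; the row count is the rough figure quoted above.
n_rows = 1.0e6
key = "ancestor_data_id"            # 16 bytes of key text per row in the JSON encoding
print(n_rows * len(key) / 1e6)      # ~16.0 MB of repeated key text alone
print(n_rows * 8 / 1e6)             # ~8.0 MB for two packed int32s per row under struct
```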
Force-pushed from 3fb7b94 to eb4fce8.
Yep, 8.9 MB. Note that the released unified genealogy tree sequences also don't contain the tsdate mean and variance values; adding those to the JSON would, I estimate, multiply the metadata storage requirements in the JSON format by ~2.5 times, and would either increase the struct version by another 8.9 MB if we store mean and var as 4-byte floats, or 17.8 MB if we use doubles.
So if we store tsdate mean and var, for all chromosomes in total, we use up roughly 1.9 GB in a JSON encoding, and either 445 MB (if storing floats) or 667 MB (if storing doubles) for the struct encoding. If we don't store tsdate means and variances, we use up ~750 MB for JSON and 220 MB for struct.
I've come and gone on the "top-level key with the name of the generating software" concept several times, but I think I've convinced myself that it is a bad idea. Here's the thinking: if downstream software wants to use a metric stored in metadata, it will have to know which software wrote it, likely hard-coding the name of that software. This would prevent a different upstream tool from replacing the first, or, even worse, cause it to write metrics to a key named after the first piece of software in order to stay compatible!
Thanks @benjeffery. I guess there might be a difference between metadata that is generalisable, and metadata that is very tightly linked to the software package. In the case of the […] Having said which, I'm perfectly happy to label them simply […]
Yes, good point. What downstream uses are there of […]?
Some of the rationale is at #301 and #604 (comment). Basically, it is useful for manipulating tree sequences when matching historical individuals into a tree sequence, inserting them, etc. This could either be done in the standard internal tsinfer routines, or as part of a post-processing step.
Closing, as we decided simply to use permissive JSON, at least for the time being.
Fixes #416