Addition of a manifest object. #548
An array of data set manifests, where a data set manifest is a set of single files of different types of AIRR objects that are related.
Basic iReceptor Gateway manifest for files in a download would look like this:
This currently has one file of each object type per data set. I think the main question is do we need an array of each file type per dataset? I suspect we do. So a manifest is an array of data sets, and each data set has an array of files of each type. |
So an iReceptor Gateway download would look like:
A typical study as a single data set might look like:
Hmm, yeah, that's an interesting question if we want the manifest to represent a single AIRR data set, or if it should represent multiple data sets. I guess I was imagining it to represent just a single one, but multiple data sets shouldn't be that hard to support, as you say, it's just an array of arrays. Certainly, I think we want to support multiple files of the same type within a single data set, e.g. multiple rearrangement files |
I think I would definitely prefer that, as without it we can't really represent an iReceptor Gateway download (or a download from the ADC) as a single manifest file. Which seems kind of silly? A data set in this context is a set of files whose cross-referencing ID fields are expected to be unique within that set. It is a grouping of files that can be processed together without further manipulation. That is, in a data set, if there is a repertoire_id in a repertoire file, any of the other files (rearrangement, clone, cell) that have the same repertoire_id would be considered to be talking about the same Repertoire. I kind of think of the Manifest as mapping to a uniqueness criterion similar to what we have for a repository. From an ADC context, I think we want a manifest to be able to represent more than one data set, where that uniqueness criterion holds within each data set but not necessarily across all data sets within the manifest. For example, we want to have data from N repositories, where the uniqueness criterion is valid within each of the N data sets but the N data sets might have a repertoire_id in common. But that is OK, because it is an error to process a rearrangement file with a repertoire file that is not part of the same data set... Similarly for processing files in a manifest. |
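To make the scoping rule concrete, here is a minimal Python sketch, assuming a manifest is held in memory as a list of data sets. The key names (`name`, `repertoire_ids`) and the helper functions are hypothetical and not taken from the proposed spec; the point is only that a repertoire_id becomes unambiguous once it is paired with the data set it came from.

```python
from collections import defaultdict

# Hypothetical in-memory manifest: one entry per data set, each listing the
# repertoire_ids found in its repertoire file(s). Key names are illustrative.
manifest = [
    {"name": "repository_a", "repertoire_ids": ["1", "2"]},
    {"name": "repository_b", "repertoire_ids": ["2", "3"]},
]


def scoped_ids(manifest):
    """Return (data set index, repertoire_id) pairs, unique across the manifest."""
    return [
        (index, repertoire_id)
        for index, data_set in enumerate(manifest)
        for repertoire_id in data_set["repertoire_ids"]
    ]


def colliding_ids(manifest):
    """Return the repertoire_ids that appear in more than one data set."""
    owners = defaultdict(set)
    for index, data_set in enumerate(manifest):
        for repertoire_id in data_set["repertoire_ids"]:
            owners[repertoire_id].add(index)
    return sorted(rid for rid, seen_in in owners.items() if len(seen_in) > 1)
```

A tool that crosses data sets could call `colliding_ids` first and fall back to the scoped pairs only when a collision is actually present.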
That's right, you never really were on board with making repertoire_ids be globally unique... Sigh, tool developers are going to be annoyed with us. If I do a query for all cancer data, now I (the tool) am going to have to manage all of the conflicting IDs if I try creating repertoire groups and do analysis that crosses "data sets", what a pain... |
I don't understand the reason for having Anyway, iReceptor should be returning everything as part of a single |
I'm not against it, we just have to figure out how to do it. In #246 (comment) we kind of came to the conclusion that we would probably use another field to get uniqueness globally and that repertoire_id would be unique in the context of a data set (repository, study, etc) being considered (eg. a DataSetManifest above). In #347 we talk about global uniqueness and persistence. So this is still an open question. |
Are you concerned about having separate YAML objects or the structure of having an array of DataSetManifests? They certainly don't have to be separate; I did that for clarity of definition. My rationale for the structure is that: A A |
I don't think it can, and I am not sure it wants to. It can't because of the issue raised by Scott above. The iReceptor Gateway pulls data from many repositories, so repertoire_id is not unique across repositories. So we can't have a single RepertoireGroup with a list of repertoire_ids because there could be conflicts/collisions. It doesn't want to because it doesn't know how to group the Repertoires. The data that is eventually downloaded from a query can be generated from a very complex process of data refinement. So the Repertoires can be grouped in a widely varying set of ways and the Gateway has no clue as to how the researcher might want to group the Repertoires from that query. The only grouping that really makes sense is to group at the repository level. Now one could create a RepertoireGroup at the repository level, but that is pretty well identical to the list of Repertoires that you get in Repertoire JSON file for the repository, so it is redundant. So a manifest for an iReceptor Gateway download would look like this: #548 (comment). It is this Manifest that the consumer of the iReceptor Download would use to process the data. The user can then determine if they want to split out the data for their own uses. At that point they might want to create RepertoireGroups (create RepertoireGroups per subject) and perhaps a more detailed Manifest that captures the relationship between the files in those RepertoireGroups. |
OK good point. So I guess one
Well, ok for now, but the long term goal is precisely for |
I don't see how those groups are particularly useful.
My understanding is that you want the Regardless, that is an interesting enhancement for API V2. The |
For me, I think a RepertoireGroup file would be redundant for the iReceptor Gateway. Essentially an iReceptor Gateway download comes from a /repertoire query generating a set of repertoire_ids, which is then used to download all the AIRR rearrangement data for those repertoire_ids. You get a repertoire JSON file that contains the list of Repertoires and an AIRR TSV file that contains the rearrangements from those repertoires (unless you filter at the sequence/rearrangement level). The manifest is simple:
Recall that the Repertoire JSON file is simply an array of Repertoire information, one per repertoire_id. So this is essentially the same as the RepertoireGroup file that would be used to represent the data. The RepertoireGroup would be the same array of repertoire_ids as that in the Repertoire JSON file, without all the metadata for the Repertoire. So in this case, it would be redundant. A RepertoireGroup file for us would essentially be a subset of the full Repertoire JSON file with just the fields Now if the user wanted to split the data that they downloaded to do comparative analyses (say they downloaded all Homo sapiens COVID-19 IGH data) and wanted to split the data to look at gender differences. For that analysis I could see how you would want to slice the data based on gender and generate two RepertoireGroup files, one for each gender. You can slice that data in hundreds of different ways, but that is part of the downstream analysis, where RepertoireGroup files would be extremely important. But that isn't the role of the Gateway. |
A function that would split a Repertoire JSON file based on a specific Repertoire field into a set of RepertoireGroup files would be VERY useful IMHO. To do the gender split I mention above, you just give the utility a JSON file and a field (e.g. subject.sex) and it would go through the JSON file, gather the possible field values from the file, construct a set of repertoire_ids for each, and then generate a RepertoireGroup file for each possible value.
Would generate N files where N is the number of distinct values that existed in the subject.sex field. For example, if the file contained "male", "female", and "not collected" you would get three RepertoireGroup files as follows: covid-19-1-metadata_male.json You could then use the repertoire_id fields in the above RepertoireGroup files to easily analyze rearrangement data from male and female subjects from covid-19-1.tsv |
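A rough Python sketch of such a splitting utility follows. It is not an existing AIRR library function. It assumes the Repertoire JSON file has a top-level `Repertoire` array, and that a RepertoireGroup can be written as an object with `repertoire_group_id` and a `repertoires` list of `repertoire_id` entries; the function names, the `null` label for missing values, and the replacement of spaces in file names are all choices made for illustration.

```python
import json
from collections import defaultdict
from pathlib import Path


def get_nested(obj, dotted_field):
    """Follow a dotted path such as 'subject.sex'; return None if absent."""
    for key in dotted_field.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def group_repertoires(repertoires, field):
    """Map each distinct value of `field` to the repertoire_ids that have it."""
    groups = defaultdict(list)
    for repertoire in repertoires:
        value = get_nested(repertoire, field)
        label = "null" if value is None else str(value)
        groups[label].append(repertoire["repertoire_id"])
    return dict(groups)


def split_repertoire_file(path, field):
    """Write one RepertoireGroup-style JSON file per distinct value of `field`."""
    path = Path(path)
    with path.open() as handle:
        repertoires = json.load(handle)["Repertoire"]  # assumed top-level key
    written = []
    for label, ids in group_repertoires(repertoires, field).items():
        group = {
            "repertoire_group_id": f"{path.stem}_{label}",
            "repertoires": [{"repertoire_id": rid} for rid in ids],
        }
        # e.g. covid-19-1-metadata.json -> covid-19-1-metadata_male.json
        out_path = path.with_name(f"{path.stem}_{label.replace(' ', '_')}.json")
        with out_path.open("w") as handle:
            json.dump({"RepertoireGroup": [group]}, handle, indent=2)
        written.append(out_path)
    return written
```

Calling `split_repertoire_file("covid-19-1-metadata.json", "subject.sex")` would then write one file per distinct value found, named along the lines of the example above.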
From the call:
FYI we went ahead and implemented something like this for downloads from the iReceptor Gateway. If you do a download from the Gateway you get a manifest.json file that describes the download from each repository. It follows the idea of the manifest object here but it might not match the spec as we have it - it is our own internal implementation. We use it to explicitly describe how Repertoire files are linked to Rearrangement/Clone/Cell/GEX data you get from a repository. Our manifest file looks like this:
Thanks @bcorrie, this is definitely a useful starting point. However, in rereading the discussion above, it still feels like we might need to resolve Two comments on what you have, though:
to a local repertoire. Any complications that you see?
Thoughts after the call today, and based on the above:
If I understand correctly, I don't think so. If you had a local data set you were describing, you would just have a null (or totally left out)
Yes, no objections to other files/file types, the ones I listed above map directly to the AIRR Objects such as |
Also, I am pretty sure we want to go back to having an array of file names for each file type, more like what we had here: I would want to be able to do something like this:
I think the manifest I provided above from the Gateway might be from a bug in our manifest generation code - I think the intent was always to have an array of files, but our code possibly drops the array when there is only one file (and in our downloads there is always only one file) per dataset. |
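Until the generation code is fixed, a consumer could tolerate both shapes. The following is a small sketch, assuming a file-type entry may arrive as a single string, a list of strings, or be missing entirely; the helper name `as_file_list` is hypothetical.

```python
def as_file_list(value):
    """Return a file entry as a list, whether it was a string, a list, or missing."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)
```

With this in place, code that reads a manifest can always iterate over the result, regardless of whether the producer dropped the array for the single-file case.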
Another comment, we have This also follows into have an array of file names, as all of the Another thing if you are considering ease of use and computational efficiency in your design. Without some link between the repertoires and the rearrangements (or any objects actually), one might have to search all of the rearrangement data, to find the entries for a specific repertoire. It might be nice if the structure kept inter-related objects together. |
Yes, we could use the AIRR Object name as the tag (I think that is what you are suggesting). We could have:
Two potential issues:
To me that is what the
All data from a repository is in a single But you could easily represent that same download grouped as DataSets in other ways. The download from VDJServer in this download has two different Repertoires, so you could change the VDJServer DataSet to look like:
In this case all of the Repertoire data is in one file but the rearrangements are spread across two files. Or you can:
In this case all of the Repertoire and Rearrangement data for each Repertoire is described by a single DataSet. All are valid manifests for the same data, just organized differently, but in all cases all related data is grouped together. The granularity of how the data is split is just different. In the case above, the download has 2 Repertoires from VDJServer, 28 from VDJBase, and 44 from covid19-1. So the Gateway chooses to represent this data as 3 DataSets (one per repository), but it could be represented as 74 DataSets, one per Repertoire. The Manifest here supports any of those. The user of the manifest knows that if they find a repertoire_id in a Repertoire JSON file (e.g. repertoire2.json) then all of the Rearrangements for that repertoire_id should be found in the related Rearrangement file (e.g. repertoire2.tsv) and vice versa. |
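As a sketch of how a consumer might rely on that guarantee, the Python below models two of the groupings described above and looks up the files related to a given Repertoire file. The key names (`repertoire`, `rearrangement`) and most file names are assumptions made for illustration; only repertoire2.json and repertoire2.tsv come from the discussion.

```python
# Coarse grouping: one DataSet holding all VDJServer files (hypothetical names).
per_repository = [
    {
        "repertoire": ["vdjserver.json"],
        "rearrangement": ["repertoire1.tsv", "repertoire2.tsv"],
    },
]

# Fine grouping: one DataSet per Repertoire.
per_repertoire = [
    {"repertoire": ["repertoire1.json"], "rearrangement": ["repertoire1.tsv"]},
    {"repertoire": ["repertoire2.json"], "rearrangement": ["repertoire2.tsv"]},
]


def related_files(manifest, repertoire_file, file_type="rearrangement"):
    """Return the files of `file_type` in the same DataSet as `repertoire_file`."""
    for data_set in manifest:
        if repertoire_file in data_set.get("repertoire", []):
            return list(data_set.get(file_type, []))
    return []
```

The same lookup works at either granularity; what changes is how many rearrangement files may need to be scanned for a given repertoire_id.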
Looks good to me. Should |
I think the key remaining issue might be which objects are missing. There are also no Node, Tree, or Alignment objects... Mostly because I wasn't sure of their status (are they going to move from experimental to production or disappear? 8-) I agree GermlineSet should be there for sure... |
Right - fwiw my intent is for Tree and Node to come out, though obviously that's not finalized... |
The object names are backwards now. :)
Even if they stay in, it might make sense for them to live in the same file as On another topic, I pushed a few suggestions to FileObject:
Also, |
If I understand correctly, this is just renaming the YAML objects, correct? To me, a Manifest lists a group of DataSets where the DataSet describes a set of related files. That is how we have it now. You are suggesting a ManifestGroup lists a group of Manifests where a Manifest describes a set of related files. To me the current naming logic makes the most sense but it is just object naming so I could be convinced otherwise... |
Works for me... |
It would also mean changing the recommended use (assuming I understand the current intended use) such that the primary usage will be a standalone How it's set up now implies to me that you will always have groups (arrays) of DataSets even if the group size is 1. |
Personally, I would prefer to not have two different entities that I have to deal with. In your scenario users producing data and code consuming data would have to deal with both It would be equally correct and valid to represent a single "data set" as a In the current case there is only one thing that one deals with when talking about data on disk, it is a |
This is exactly what our Analysis Pipeline on the iReceptor Gateway does (based on our current implementation of
It basically searches In each repertoire directory, I have a
Note I am no longer using the AIRR GEX/JSON files, I am using a custom file type that is an H5AD file that is generated when I split out the GEX/JSON by repertoire. I then run whatever the analysis tool that was chosen by the iReceptor Gateway user on this At the end, the output you get is a Manifest that describes the files used as well as the output from CellTypist as it was run on the data described by the Manifest. All of the processing is guided by the Manifest file, and in each case there is one |
I ...agree with @bcorrie?? =P
This is already the case for other parts of the schema, like |
To add another point, we likely want to be able to validate that a |
should there be a |
From the call:
Add GermlineSet and GenotypeSet to DataSet Sync specs.
For discussion about format and changes.
Closes #426