Addition of a manifest object. #548

Merged
merged 34 commits into from
Mar 26, 2024

Conversation

bcorrie
Contributor

@bcorrie bcorrie commented Sep 14, 2021

For discussion about format and changes.

Closes #426

An array of data set manifests, where a data set manifest is a set of single files of different types of AIRR objects that are related.
@bcorrie
Contributor Author

bcorrie commented Sep 14, 2021

Basic iReceptor Gateway manifest for files in a download would look like this:

{
"Info":{},
"DataSets":
[
    {"repertoire_file":"covid-19-1-metadata.json", "rearrangement_file":"covid-19-1.tsv"},
    {"repertoire_file":"covid-19-2-metadata.json", "rearrangement_file":"covid-19-2.tsv"},
    {"repertoire_file":"covid-19-3-metadata.json", "rearrangement_file":"covid-19-3.tsv"},
    {"repertoire_file":"covid-19-4-metadata.json", "rearrangement_file":"covid-19-4.tsv"},
    {"repertoire_file":"ireceptor-public-archive-3-metadata.json", "rearrangement_file":"ireceptor-public-archive-3.tsv"},
    {"repertoire_file":"vdjserver-metadata.json", "rearrangement_file":"vdjserver.tsv"}
]
}

@bcorrie
Contributor Author

bcorrie commented Sep 14, 2021

This currently has one file of each object type per data set. I think the main question is do we need an array of each file type per dataset?

I suspect we do. So a manifest is an array of data sets, and each data set has an array of files of each type.

@bcorrie
Contributor Author

bcorrie commented Sep 14, 2021

So an iReceptor Gateway download would look like:

{
"Info":{},
"DataSets":
[
    {"repertoire_file":["covid-19-1-metadata.json"], "rearrangement_file":["covid-19-1.tsv"]},
    {"repertoire_file":["covid-19-2-metadata.json"], "rearrangement_file":["covid-19-2.tsv"]},
    {"repertoire_file":["covid-19-3-metadata.json"], "rearrangement_file":["covid-19-3.tsv"]},
    {"repertoire_file":["covid-19-4-metadata.json"], "rearrangement_file":["covid-19-4.tsv"]},
    {"repertoire_file":["ireceptor-public-archive-3-metadata.json"], "rearrangement_file":["ireceptor-public-archive-3.tsv"]},
    {"repertoire_file":["vdjserver-metadata.json"], "rearrangement_file":["vdjserver.tsv"]}
]
}

@bcorrie
Contributor Author

bcorrie commented Sep 14, 2021

A typical study as a single data set might look like:

{
"Info":{},
"DataSets":
[
  {    
        "repertoire_file":["PRJNA42.json"],
        "rearrangement_file":
        [
            "subject1.tsv","subject2.tsv","subject3.tsv","subject4.tsv","subject5.tsv"
        ]
  }
]
}
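
A minimal sketch of how a consumer might walk a manifest shaped like the examples above. It assumes the `DataSets` key and the `*_file` fields used in this thread, which are not a finalized spec; the function name `iter_manifest_files` is hypothetical.

```python
import json


def iter_manifest_files(manifest):
    """Yield (data_set_index, field_name, file_name) for every listed file."""
    for index, data_set in enumerate(manifest.get("DataSets", [])):
        for field, files in data_set.items():
            if not field.endswith("_file"):
                continue  # skip fields that are not file lists
            if isinstance(files, str):
                files = [files]  # tolerate a single file name instead of an array
            for name in files:
                yield index, field, name


# Hypothetical manifest modelled on the single-study example above.
manifest = json.loads("""
{
  "Info": {},
  "DataSets": [
    {
      "repertoire_file": ["PRJNA42.json"],
      "rearrangement_file": ["subject1.tsv", "subject2.tsv"]
    }
  ]
}
""")

for index, field, name in iter_manifest_files(manifest):
    print(index, field, name)
```

Because every file field is treated as an array, the same loop handles both the one-file-per-type Gateway download and the multi-file study layout.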

@schristley
Member

schristley commented Sep 14, 2021

This currently has one file of each type per data set. I think the main question is do we need an array of each file type per dataset?

I suspect we do. So a manifest is an array of data sets, and each data set has an array of files of each type.

Hmm, yeah, that's an interesting question if we want the manifest to represent a single AIRR data set, or if it should represent multiple data sets. I guess I was imagining it to represent just a single one, but multiple data sets shouldn't be that hard to support, as you say, it's just an array of arrays.

Certainly, I think we want to support multiple files of the same type within a single data set, e.g. multiple rearrangement files.

@bcorrie
Contributor Author

bcorrie commented Sep 14, 2021

This currently has one file of each type per data set. I think the main question is do we need an array of each file type per dataset?
I suspect we do. So a manifest is an array of data sets, and each data set has an array of files of each type.

Hmm, yeah, that's an interesting question if we want the manifest to represent a single AIRR data set, or if it should represent multiple data sets. I guess I was imagining it to represent just a single one, but multiple data sets shouldn't be that hard to support, as you say, it's just an array of arrays.

I think I would definitely prefer that - as without it we can't really represent an iReceptor Gateway download (or a download from the ADC) as a single manifest file. Which seems kind of silly?

A data set in this context is a group of files whose cross-referencing ID fields are expected to be unique within that group. It is a grouping of files that can be processed together without further manipulation. That is, in a data set, if there is a repertoire_id in a repertoire file, any of the other files (rearrangement, clone, cell) that have the same repertoire_id would be considered to be talking about the same Repertoire. I kind of think of the Manifest as having uniqueness criteria similar to what we have for a repository.

From an ADC context, I think we want a manifest to be able to represent more than one data set where that uniqueness criteria is true but it isn't necessarily true across all data sets within the manifest. For example, we want to have a data set from N repositories, where the uniqueness criteria is valid within each of the N data sets but the N data sets might have a repertoire_id in common. But that is OK, because it is an error to process a rearrangement file with a repertoire file that is not part of the same data set... Similarly for processing files in a manifest.
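
One way a tool could honour this scoping rule is to key every Repertoire by both its data set and its `repertoire_id`. The sketch below is only an illustration under that assumption; `scoped_repertoire_ids` is a hypothetical helper, and the input is assumed to be the already-loaded Repertoire lists, one list per data set.

```python
def scoped_repertoire_ids(data_sets):
    """Map (data_set_index, repertoire_id) -> repertoire.

    IDs may repeat across data sets, but a repeat inside one data set
    violates the uniqueness criteria and is reported as an error.
    """
    scoped = {}
    for index, repertoires in enumerate(data_sets):
        for rep in repertoires:
            key = (index, rep["repertoire_id"])
            if key in scoped:
                raise ValueError(
                    f"duplicate repertoire_id {rep['repertoire_id']!r} "
                    f"within data set {index}"
                )
            scoped[key] = rep
    return scoped


# Two hypothetical repositories that both use repertoire_id "1".
data_sets = [
    [{"repertoire_id": "1"}],
    [{"repertoire_id": "1"}, {"repertoire_id": "2"}],
]
scoped = scoped_repertoire_ids(data_sets)
print(sorted(scoped))
```

The collision on `"1"` is harmless here because the two entries live under different data set indexes.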

@schristley
Member

A data set in this context is a group of files whose cross-referencing ID fields are expected to be unique within that group. It is a grouping of files that can be processed together without further manipulation. That is, in a data set, if there is a repertoire_id in a repertoire file, any of the other files (rearrangement, clone, cell) that have the same repertoire_id would be considered to be talking about the same Repertoire. I kind of think of the Manifest as having uniqueness criteria similar to what we have for a repository.

That's right, you never really were on board with making repertoire_ids be globally unique... Sigh, tool developers are going to be annoyed with us. If I do a query for all cancer data, now I (the tool) am going to have to manage all of the conflicting IDs if I try creating repertoire groups and do analysis that crosses "data sets", what a pain...

@scharch
Contributor

scharch commented Sep 17, 2021

I don't understand the reason for having DataSetManifest and Manifest be separate objects...

Anyway, iReceptor should be returning everything as part of a single RepertoireGroup. But the slots for file names still have to be arrays, because we have no requirements on keeping things together (or how to organize them if broken up).

@bcorrie
Contributor Author

bcorrie commented Sep 17, 2021

That's right, you never really were on board with making repertoire_ids be globally unique... Sigh, tool developers are going to be annoyed with us. If I do a query for all cancer data, now I (the tool) am going to have to manage all of the conflicting IDs if I try creating repertoire groups and do analysis that crosses "data sets", what a pain...

I'm not against it, we just have to figure out how to do it.

In #246 (comment) we kind of came to the conclusion that we would probably use another field to get uniqueness globally and that repertoire_id would be unique in the context of a data set (repository, study, etc) being considered (eg. a DataSetManifest above). In #347 we talk about global uniqueness and persistence. So this is still an open question.

@bcorrie
Contributor Author

bcorrie commented Sep 17, 2021

I don't understand the reason for having DataSetManifest and Manifest be separate objects...

Are you concerned about having separate YAML objects or the structure of having an array of DataSetManifests? They certainly don't have to be separate; I did that for clarity of definition.

My rationale for the structure is that:

A DataSet consists of a group of logically related files and the DataSetManifest describes their relationship. It groups a set of repertoire_files that describes the metadata about the rearrangement_files and clone_files.

A Manifest simply holds an array of DataSetManifests. This allows a manifest to describe N data sets that might have different file groupings.

@bcorrie
Contributor Author

bcorrie commented Sep 17, 2021

Anyway, iReceptor should be returning everything as part of a single RepertoireGroup. But the slots for file names still have to be arrays, because we have no requirements on keeping things together (or how to organize them if broken up).

I don't think it can, and I am not sure it wants to.

It can't because of the issue raised by Scott above. The iReceptor Gateway pulls data from many repositories, so repertoire_id is not unique across repositories. So we can't have a single RepertoireGroup with a list of repertoire_ids because there could be conflicts/collisions.

It doesn't want to because it doesn't know how to group the Repertoires. The data that is eventually downloaded from a query can be generated from a very complex process of data refinement. So the Repertoires can be grouped in a widely varying set of ways and the Gateway has no clue as to how the researcher might want to group the Repertoires from that query. The only grouping that really makes sense is to group at the repository level. Now one could create a RepertoireGroup at the repository level, but that is pretty well identical to the list of Repertoires that you get in Repertoire JSON file for the repository, so it is redundant.

So a manifest for an iReceptor Gateway download would look like this: #548 (comment). It is this Manifest that the consumer of the iReceptor Download would use to process the data. The user can then determine if they want to split out the data for their own uses. At that point they might want to create RepertoireGroups (create RepertoireGroups per subject) and perhaps a more detailed Manifest that captures the relationship between the files in those RepertoireGroups.

@scharch
Contributor

scharch commented Sep 20, 2021

The iReceptor Gateway pulls data from many repositories, so repertoire_id is not unique across repositories.

OK good point. So I guess one RepertoireGroup from each repository (each with its own Manifest) then?

It doesn't want to because it doesn't know how to group the Repertoires. The data that is eventually downloaded from a query can be generated from a very complex process of data refinement.

Well, ok for now, but the long term goal is precisely for RepertoireGroup to capture that refinement.

@schristley
Member

The iReceptor Gateway pulls data from many repositories, so repertoire_id is not unique across repositories.

OK good point. So I guess one RepertoireGroup from each repository (each with its own Manifest) then?

I don't see how those groups are particularly useful.

It doesn't want to because it doesn't know how to group the Repertoires. The data that is eventually downloaded from a query can be generated from a very complex process of data refinement.

Well, ok for now, but the long term goal is precisely for RepertoireGroup to capture that refinement.

My understanding is that you want the RepertoireGroup to reflect the query results, i.e. here's all the repertoires that were returned from the query? I guess that's primarily a convenience because you can also walk through the list of repertoires returned and construct that group yourself.

Regardless, that is an interesting enhancement for API V2. The repertoire end point would return repertoires as normal from a query but also return a RepertoireGroup with that query. We could also have a simple function in the reference libraries that constructs a RepertoireGroup given a metadata file.

@bcorrie
Contributor Author

bcorrie commented Sep 20, 2021

The iReceptor Gateway pulls data from many repositories, so repertoire_id is not unique across repositories.

OK good point. So I guess one RepertoireGroup from each repository (each with its own Manifest) then?

I don't see how those groups are particularly useful.

For me, I think a RepertoireGroup file would be redundant for the iReceptor Gateway. Essentially an iReceptor Gateway download comes from a /repertoire query generating a set of repertoire_ids, which is then used to download all the AIRR rearrangement data for those repertoire_ids. You get a repertoire JSON file that contains the list of Repertoires and an AIRR TSV file that contains the rearrangements from those repertoires (unless you filter at the sequence/rearrangement level).

The manifest is simple:

{
"Info":{},
"DataSets":
[
    {"repertoire_file":["covid-19-1-metadata.json"], "rearrangement_file":["covid-19-1.tsv"]}
]
}

Recall that the Repertoire JSON file is simply an array of Repertoire information, one per repertoire_id. So this is essentially the same as the RepertoireGroup file that would be used to represent the data. The RepertoireGroup would be the same array of repertoire_ids as that in the Repertoire JSON file, without all the metadata for the Repertoire. So in this case, it would be redundant. A RepertoireGroup file for us would essentially be a subset of the full Repertoire JSON file with just the fields repertoire_id, repertoire_description, and subject.sample.collection_time_point_relative.

Now suppose the user downloaded all Homo sapiens COVID-19 IGH data and wanted to split it for comparative analyses, for example to look at gender differences. For that analysis I could see how you would want to slice the data based on gender and generate two RepertoireGroup files, one for each gender. You can slice that data in hundreds of different ways, but that is part of the downstream analysis, where RepertoireGroup files would be extremely important. But that isn't the role of the Gateway.

@bcorrie
Contributor Author

bcorrie commented Sep 20, 2021

Regardless, that is an interesting enhancement for API V2. The repertoire end point would return repertoires as normal from a query but also return a RepertoireGroup with that query. We could also have a simple function in the reference libraries that constructs a RepertoireGroup given a metadata file.

A function that would split a Repertoire JSON file based on a specific Repertoire field into a set of RepertoireGroup files would be VERY useful IMHO.

To do the gender split I mention above, you just give the utility a JSON file and a field (e.g. subject.sex) and it would go through the JSON file, gather the possible field values from the file, construct a set of repertoire_ids for each, and then generate a RepertoireGroup file for each possible value.

repertoire_split subject.sex covid-19-1-metadata.json

Would generate N files where N is the number of distinct values that existed in the subject.sex field. For example, if the file contained "male", "female", and "not collected" you would get three RepertoireGroup files as follows:

covid-19-1-metadata_male.json
covid-19-1-metadata_female.json
covid-19-1-metadata_not_collected.json

You could then use the repertoire_id fields in the above RepertoireGroup files to easily analyze rearrangement data from male and female subjects from covid-19-1.tsv
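
A rough sketch of what such a `repertoire_split` utility might do internally. No such function exists in the reference libraries as far as this thread says, so every name here (`get_nested`, `repertoire_split`, `group_file_name`) is hypothetical. The sketch works on an already-loaded list of Repertoire objects, only follows dotted paths through nested objects (not through arrays such as `sample`), and returns the `repertoire_id` lists rather than writing RepertoireGroup files.

```python
import os
import re
from collections import defaultdict


def get_nested(obj, dotted_field):
    """Follow a dotted path such as 'subject.sex' through nested dicts."""
    for part in dotted_field.split("."):
        obj = obj[part]
    return obj


def repertoire_split(repertoires, field):
    """Group repertoire_ids by the distinct values found in `field`."""
    groups = defaultdict(list)
    for rep in repertoires:
        groups[get_nested(rep, field)].append(rep["repertoire_id"])
    return dict(groups)


def group_file_name(metadata_file, value):
    """Derive an output name like 'covid-19-1-metadata_not_collected.json'."""
    stem, ext = os.path.splitext(metadata_file)
    suffix = re.sub(r"\s+", "_", str(value))
    return f"{stem}_{suffix}{ext}"


# Hypothetical, heavily trimmed Repertoire objects.
repertoires = [
    {"repertoire_id": "r1", "subject": {"sex": "male"}},
    {"repertoire_id": "r2", "subject": {"sex": "female"}},
    {"repertoire_id": "r3", "subject": {"sex": "male"}},
    {"repertoire_id": "r4", "subject": {"sex": "not collected"}},
]

for value, ids in repertoire_split(repertoires, "subject.sex").items():
    print(group_file_name("covid-19-1-metadata.json", value), ids)
```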

@bcorrie bcorrie added this to the AIRR v1.4.0 milestone Jan 17, 2022
@bussec bussec modified the milestones: AIRR v1.4.0, AIRR v2.0.0 Mar 21, 2022
@javh
Contributor

javh commented Oct 16, 2023

From the call:

  • We still want to do this.
  • @scharch will meditate on this concept and the strategy.

@bcorrie
Contributor Author

bcorrie commented Oct 17, 2023

FYI we went ahead and implemented something like this for downloads from the iReceptor Gateway. If you do a download from the Gateway you get a manifest.json file that describes the download from each repository. It follows the idea of the manifest object here but it might not match the spec as we have it - it is our own internal implementation. We use it to explicitly describe how Repertoire files are linked to Rearrangement/Clone/Cell/GEX data you get from a repository.

Our manifest file looks like this:

{
    "Info": {
        "title": "AIRR Manifest",
        "version": "3.0",
        "description": "List of files for each repository",
        "contact": {
            "name": "iReceptor Gateway",
            "url": "https://gateway-staging.ireceptor.org",
            "email": "[email protected]"
        }
    },
    "DataSets": [
        {
            "repository": "IPA 6",
            "repository_url": "https://ipa6.ireceptor.org/airr/v1/",
            "repertoire_file": "ireceptor-public-archive-metadata.json",
            "rearrangement_file": "ireceptor-public-archive-rearrangement.tsv",
            "cell_file": "ireceptor-public-archive-cell.json",
            "expression_file": "ireceptor-public-archive-gex.json"
        }
    ]
}

@scharch
Contributor

scharch commented Nov 13, 2023

FYI we went ahead and implemented something like this for downloads from the iReceptor Gateway.

Thanks @bcorrie, this is definitely a useful starting point. However, in rereading the discussion above, it still feels like we might need to resolve RepertoireGroup before coming back around to this.

Two comments on what you have, though:

  1. It looks like it should be relatively easy to adapt
        "repository": "IPA 6",
        "repository_url": "https://ipa6.ireceptor.org/airr/v1/",
        "repertoire_file": "ireceptor-public-archive-metadata.json",

to a local repertoire. Any complications that you see?

  2. I feel like we might want to make the schema a little more flexible/less pre-specified. In the current PR, there's no slot for GEX, because it wasn't on our radar two years ago. Or raw fastqs (Define a manifest mechanism #426 (comment)). Or splitting different parts of the schema across multiple files (Define a manifest mechanism #426 (comment)).

@bcorrie
Contributor Author

bcorrie commented Nov 13, 2023

Thoughts after the call today, and based on the above:

Two comments on what you have, though:

  1. It looks like it should be relatively easy to adapt
        "repository": "IPA 6",
        "repository_url": "https://ipa6.ireceptor.org/airr/v1/",
        "repertoire_file": "ireceptor-public-archive-metadata.json",

to a local repertoire. Any complications that you see?

If I understand correctly, I don't think so. If you had a local data set you were describing, you would just have a null (or totally left out) repository_url and repository field. In fact, repository as a field should probably be renamed dataset_name or something like that. So in that case the manifest is describing files on disk, and nothing to do with a repository at all.

  2. I feel like we might want to make the schema a little more flexible/less pre-specified. In the current PR, there's no slot for GEX, because it wasn't on our radar two years ago. Or raw fastqs (Define a manifest mechanism #426 (comment)). Or splitting different parts of the schema across multiple files (Define a manifest mechanism #426 (comment)).

Yes, no objections to other files/file types, the ones I listed above map directly to the AIRR Objects such as Repertoire, Rearrangement, Cell and CellExpression but the manifest format should probably be more flexible/extensible. Indeed, the iReceptor Gateway downloads now also provide query JSON files that I would like to be able to add to the manifest.

@bcorrie
Contributor Author

bcorrie commented Nov 13, 2023

Also, I am pretty sure we want to go back to having an array of file names for each file type, more like what we had here:

#548 (comment)

I would want to be able to do something like this:

{
"Info":{},
"DataSets":
[
    {"repertoire_file":["study-metadata.json"], 
     "rearrangement_file":["subject1.tsv","subject2.tsv", ... "subjectN.tsv"]}
]
}

I think the manifest I provided above from the Gateway might be from a bug in our manifest generation code - I think the intent was always to have an array of files, but our code possibly drops the array when there is only one file (and in our downloads there is always only one file) per dataset.
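
A consumer that wants to tolerate both shapes (a bare file name and an array of file names) could normalize on read. This is a small sketch under the assumption that file fields are the ones ending in `_file`, as in the examples in this thread; `as_list` and `normalize_data_set` are hypothetical helper names.

```python
def as_list(value):
    """Return `value` as a list: None -> [], 'a' -> ['a'], ['a'] -> ['a']."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def normalize_data_set(data_set):
    """Coerce every *_file field to an array, leaving other fields untouched."""
    return {
        key: as_list(value) if key.endswith("_file") else value
        for key, value in data_set.items()
    }


# Mixed shapes, as produced by the suspected bug described above.
data_set = {
    "repository": "IPA 6",
    "repertoire_file": "ireceptor-public-archive-metadata.json",
    "rearrangement_file": ["ireceptor-public-archive-rearrangement.tsv"],
}
print(normalize_data_set(data_set))
```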

@schristley
Member

Another comment, we have DataFile defined which specifies the format of the AIRR JSON. It isn't restricted to Repertoire but can hold many of the AIRR objects (rearrangements excluded). This presents a slight conundrum because that data might be put all in one AIRR JSON file, or it might be spread across multiple AIRR JSON files. For example, germline sets coming from one source, repertoire and repertoire groups from another, and a third providing clone data, etc. The repertoire_file tag/type points to the AIRR JSON that holds the Repertoire data but then that means you need tags for all the other ones as well. If you want to be a bit self-reflective, you can use the same name for the tag/type as the object itself, essentially how DataFile is.

This also feeds into having an array of file names, as all of the Repertoires might not be in the same file either but spread across multiple.

Another thing if you are considering ease of use and computational efficiency in your design. Without some link between the repertoires and the rearrangements (or any objects actually), one might have to search all of the rearrangement data, to find the entries for a specific repertoire. It might be nice if the structure kept inter-related objects together.

@bcorrie
Contributor Author

bcorrie commented Nov 14, 2023

The repertoire_file tag/type points to the AIRR JSON that holds the Repertoire data but then that means you need tags for all the other ones as well. If you want to be a bit self-reflective, you can use the same name for the tag/type as the object itself, essentially how DataFile is.

Yes, we could use the AIRR Object name as the tag (I think that is what you are suggesting). We could have:

{
"Info":{},
"DataSets":
[ {
     "Repertoire":["study-metadata.json"], 
     "Rearrangement":["subject1.tsv","subject2.tsv", ... "subjectN.tsv"]
    }
] }

Two potential issues:

  • I used an explicit tag with _file because I wanted to make it clear that the field was a file name since not all fields in a Manifest are files. In our schemas we use tags like "Repertoire" when we are describing the full object (like in DataFile) but this is different.
  • If we are considering having files with non-AIRR objects in them that might be a bit complicated/confusing???

@bcorrie
Contributor Author

bcorrie commented Nov 14, 2023

Another thing if you are considering ease of use and computational efficiency in your design. Without some link between the repertoires and the rearrangements (or any objects actually), one might have to search all of the rearrangement data, to find the entries for a specific repertoire. It might be nice if the structure kept inter-related objects together.

To me that is what the DataSets in a Manifest are for. In the iReceptor Gateway case, if a download comes from 3 different repositories then there are 3 DataSets in the Manifest for that download, and it would look like this:

{
    "Info": {
        "title": "AIRR Manifest",
        "version": "3.0",
        "description": "List of files for each repository",
        "contact": {
            "name": "iReceptor Gateway",
            "url": "https://gateway.ireceptor.org",
            "email": "[email protected]"
        }
    },
    "DataSets": [
        {
            "repository": "VDJServer",
            "repository_url": "https://vdjserver.org/airr/v1/",
            "repertoire_file": "vdjserver-metadata.json",
            "rearrangement_file": ["vdjserver.tsv"]
        },
        {
            "repository": "VDJbase",
            "repository_url": "https://airr-seq.vdjbase.org/airr/v1/",
            "repertoire_file": "vdjbase-metadata.json",
            "rearrangement_file": ["vdjbase.tsv"]
        },
        {
            "repository": "COVID 19-1",
            "repository_url": "https://covid19-1.ireceptor.org/airr/v1/",
            "repertoire_file": "airr-covid-19-metadata.json",
            "rearrangement_file": ["airr-covid-19.tsv"]
        }
    ]
}

All data from a repository is in a single DataSet and all files for that data set are listed so all of the related objects are linked.

But you could easily represent that same download grouped as DataSets in other ways.

The download from VDJServer in this download has two different Repertoires, so you could change the VDJServer DataSet to look like:

[Stuff deleted]
    "DataSets": [
        {
            "repository": "VDJServer",
            "repository_url": "https://vdjserver.org/airr/v1/",
            "repertoire_file": "vdjserver-metadata.json",
            "rearrangement_file": ["repertoire1.tsv", "repertoire2.tsv"]
        },
[Stuff deleted]

In this case all of the Repertoire data is in one file but the rearrangements are spread across two files.

Or you can:

[Stuff deleted]
    "DataSets": [
        {
            "repository": "VDJServer",
            "repository_url": "https://vdjserver.org/airr/v1/",
            "repertoire_file": "repertoire1.json",
            "rearrangement_file": ["repertoire1.tsv"]
        },
        {
            "repository": "VDJServer",
            "repository_url": "https://vdjserver.org/airr/v1/",
            "repertoire_file": "repertoire2.json",
            "rearrangement_file": ["repertoire2.tsv"]
        },
[Stuff deleted]

In this case all of the Repertoire and Rearrangement data for each Repertoire is described by a single DataSet.

All are valid manifests for the same data, just organized differently, but in all cases all related data is grouped together. The granularity of how the data is split is just different.

In the case above, the download has 2 Repertoires from VDJServer, 28 from VDJbase, and 44 from covid19-1. So the Gateway chooses to represent this data as 3 DataSets (one per repository), but it could be represented as 74 DataSets, one per Repertoire. The Manifest here supports any of those.

The user of the manifest knows that if they find a repertoire_id in a Repertoire JSON file (e.g. repertoire2.json) then all of the Rearrangements for that repertoire_id should be found in the related Rearrangement file (e.g. repertoire2.tsv) and vice versa.
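
That lookup could be sketched roughly as follows. The helper name `related_rearrangement_files` and the `ids_by_repertoire_file` argument (a mapping from Repertoire file name to the set of `repertoire_id` values found in it, which a real tool would build by reading the files) are assumptions made for illustration. Results are reported per data set, since the same `repertoire_id` may legitimately appear in more than one.

```python
def _as_list(value):
    """Accept either a single file name or an array of file names."""
    return [value] if isinstance(value, str) else list(value)


def related_rearrangement_files(manifest, ids_by_repertoire_file, repertoire_id):
    """Return [(data_set_index, rearrangement_files)] for each data set whose
    Repertoire file(s) contain `repertoire_id`."""
    matches = []
    for index, data_set in enumerate(manifest["DataSets"]):
        rep_files = _as_list(data_set.get("repertoire_file", []))
        if any(repertoire_id in ids_by_repertoire_file.get(f, ()) for f in rep_files):
            matches.append((index, _as_list(data_set.get("rearrangement_file", []))))
    return matches


# The per-Repertoire grouping from above, with hypothetical repertoire_ids.
manifest = {
    "DataSets": [
        {"repertoire_file": "repertoire1.json", "rearrangement_file": ["repertoire1.tsv"]},
        {"repertoire_file": "repertoire2.json", "rearrangement_file": ["repertoire2.tsv"]},
    ]
}
ids_by_repertoire_file = {"repertoire1.json": {"101"}, "repertoire2.json": {"102"}}

print(related_rearrangement_files(manifest, ids_by_repertoire_file, "102"))
```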

@scharch
Contributor

scharch commented Mar 15, 2024

Looks good to me. Should GermlineSet also be in the list...?

@bcorrie
Contributor Author

bcorrie commented Mar 15, 2024

Looks good to me. Should GermlineSet also be in the list...?

I think the key remaining issue might be which objects are missing. There are also no Node, Tree, or Alignment objects... Mostly because I wasn't sure of their status (are they going to move from experimental to production or disappear? 8-)

I agree GermlineSet should be there for sure...

@scharch
Contributor

scharch commented Mar 15, 2024

Right - fwiw my intent is for Tree and Node to come out, though obviously that's not finalized...

@javh
Contributor

javh commented Mar 15, 2024

If you look at DataFile, you should see a similarity, both are type-keyed, though in the case of DataFile the contents are the actual data while with DataSet it's a file object.

The object names are backwards now. :)

Right - fwiw my intent is for Tree and Node to come out, though obviously that's not finalized...

Even if they stay in, it might make sense for them to live in the same file as Clone, so we could dodge the need for separate keys by setting that convention if they remain.

On another topic, I pushed a few suggestions to FileObject:

  • Added checksum and version fields.
  • Renamed file_type to format.

Also, Manifest feels rather empty to me now. And I'm wondering about the typical use case. What about having what is currently DataSet be the manifest and having a ManifestGroup object for the collections of data sets? Same number of objects, but seems more amenable to the structure where you might have one dataset per folder with a manifest in that folder.

@bcorrie
Contributor Author

bcorrie commented Mar 15, 2024

Also, Manifest feels rather empty to me now. And I'm wondering about the typical use case. What about having what is currently DataSet be the manifest and having a ManifestGroup object for the collections of data sets? Same number of objects, but seems more amenable to the structure where you might have one dataset per folder with a manifest in that folder.

If I understand correctly, this is just renaming the YAML objects, correct?

To me, a Manifest lists a group of DataSets where the DataSet describes a set of related files. That is how we have it now.

You are suggesting a ManifestGroup lists a group of Manifests where a Manifest describes a set of related files.

To me the current naming logic makes the most sense but it is just object naming so I could be convinced otherwise...

@bcorrie
Contributor Author

bcorrie commented Mar 15, 2024

On another topic, I pushed a few suggestions to FileObject:

  • Added checksum and version fields.
  • Renamed file_type to format.

Works for me...
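
For illustration only, a file entry carrying the fields mentioned here might be built like this. The thread names `checksum`, `version`, and `format` but does not say which checksum algorithm is intended or what the file-name key is called, so SHA-256 and the `name` key are assumptions, and `file_object` is a hypothetical helper rather than part of any AIRR library.

```python
import hashlib
import os
import tempfile


def file_object(path, file_format, version=None, chunk_size=1 << 20):
    """Describe a file on disk, hashing it in chunks to bound memory use."""
    digest = hashlib.sha256()  # assumed algorithm; the thread does not fix one
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "name": os.path.basename(path),  # assumed key name
        "format": file_format,
        "version": version,
        "checksum": digest.hexdigest(),
    }


# Describe a small throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "subject1.tsv")
    with open(path, "wb") as handle:
        handle.write(b"sequence_id\tsequence\n")
    example = file_object(path, "tsv")

print(example)
```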

@javh
Contributor

javh commented Mar 15, 2024

If I understand correctly, this is just renaming the YAML objects, correct?

It would also mean changing the recommended use (assuming I understand the current intended use) such that the primary usage will be a standalone Manifest (DataSet) that is not nested in a ManifestGroup (analogous to Repertoire and RepertoireGroup).

How it's set up now implies to me that you will always have groups (arrays) of DataSets even if the group size is 1.

@bcorrie
Contributor Author

bcorrie commented Mar 15, 2024

Personally, I would prefer to not have two different entities that I have to deal with. In your scenario users producing data and code consuming data would have to deal with both ManifestGroup and Manifest. In the code they would need to be handled differently. For users producing data, the user would be confused as to what they should produce.

It would be equally correct and valid to represent a single "data set" as a ManifestGroup with one Manifest or just as a Manifest. That is confusing to me...

In the current case there is only one thing that one deals with when talking about data on disk: it is a Manifest. If you are describing files with a single DataSet, the Manifest has a data set list of size 1. No confusion.

@bcorrie
Contributor Author

bcorrie commented Mar 15, 2024

This is exactly what our Analysis Pipeline on the iReceptor Gateway does (based on our current implementation of Manifest). It takes a Manifest that is a download from a repository (see below) and processes it as follows:

{
    "Info": {
        "title": "AIRR Manifest",
        "version": "3.0",
        "description": "List of files for each repository",
        "contact": {
            "name": "iReceptor Gateway",
            "url": "https://gateway.ireceptor.org",
            "email": "[email protected]"
        }
    },
    "DataSets": [
        {
            "repository": "COVID 19-1",
            "repository_url": "https://covid19-1.ireceptor.org/airr/v1/",
            "repertoire_file": "airr-covid-19-metadata.json",
            "rearrangement_file": "airr-covid-19-rearrangement.tsv",
            "cell_file": "airr-covid-19-cell.json",
            "expression_file": "airr-covid-19-gex.json"
        }
    ]
}

It basically searches airr-covid-19-metadata.json for all repertoires in the DataSet and splits the data into a directory per repertoire. Ideally I would be able to use RepertoireGroup for this rather than search a file for all of the Repertoires.

In each repertoire directory, I have a DataSet that represents a single Repertoire with a Manifest that looks like this:

{"Info":{},"DataSets":[{
"cell_file":["64d2a2028ab142c7479d0ee1-cell.json"],
"expression_file":["64d2a2028ab142c7479d0ee1-gex.h5ad"],
"rearrangement_file":["64d2a2028ab142c7479d0ee1-rearrangement.tsv"]
}]}

Note I am no longer using the AIRR GEX/JSON files, I am using a custom file type that is an H5AD file that is generated when I split out the GEX/JSON by repertoire.

I then run whatever analysis tool was chosen by the iReceptor Gateway user on this Manifest file; in the case of this analysis it was CellTypist. I have a simple tool that reads a Manifest file and uses that to run CellTypist on the data. Ideally, CellTypist would be able to read AIRR Manifest files and not need that conversion.

At the end, the output you get is a Manifest that describes the files used as well as the output from CellTypist as it was run on the data described by the Manifest.

All of the processing is guided by the Manifest file, and in each case there is one DataSet per Manifest. This is trivial to handle and makes data processing quite simple.
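
As a sketch of the per-repertoire step described above, the Manifest written into each repertoire directory could be generated along these lines. This is not the Gateway's actual code: `per_repertoire_manifests` is a hypothetical name, the `<repertoire_id>-<suffix>` file naming is inferred from the example manifest, and the actual splitting of the data files is left out.

```python
def per_repertoire_manifests(repertoire_ids, suffixes):
    """Build one single-DataSet Manifest per repertoire_id.

    `suffixes` maps a file field (e.g. 'cell_file') to the file-name suffix
    used for that type (e.g. 'cell.json').
    """
    manifests = {}
    for rid in repertoire_ids:
        data_set = {
            field: [f"{rid}-{suffix}"]
            for field, suffix in sorted(suffixes.items())
        }
        manifests[rid] = {"Info": {}, "DataSets": [data_set]}
    return manifests


suffixes = {
    "cell_file": "cell.json",
    "expression_file": "gex.h5ad",
    "rearrangement_file": "rearrangement.tsv",
}
manifests = per_repertoire_manifests(["64d2a2028ab142c7479d0ee1"], suffixes)
print(manifests["64d2a2028ab142c7479d0ee1"])
```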

@scharch
Contributor

scharch commented Mar 15, 2024

Personally, I would prefer to not have two different entities that I have to deal with. In your scenario users producing data and code consuming data would have to deal with both ManifestGroup and Manifest. In the code they would need to be handled differently. For users producing data, the user would be confused as to what they should produce.

I ...agree with @bcorrie?? =P

you will always have groups (arrays) of DataSets even if the group size is 1.

This is already the case for other parts of the schema, like sample and data_processing within Repertoire or diagnosis within Subject. I assume that the most common use for all of those is an array of length 1...

@javh
Contributor

javh commented Mar 15, 2024

Good points. Thanks, @bcorrie and @scharch. I'm sold on leaving Manifest and DataSet as is.

@schristley
Member

To add another point, we likely want to be able to validate that a Manifest is correct, so I think that involves adding an additional parameter to airr-tools to load the manifest and run it through validate_object.
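
Independently of how airr-tools ends up doing it, the kind of structural check meant here can be sketched by hand. The code below does not use validate_object or the merged schema; it checks only the `DataSets` / `*_file` shape used in the earlier examples of this thread, and `manifest_errors` is a hypothetical name.

```python
def manifest_errors(manifest):
    """Return a list of human-readable problems; an empty list means OK."""
    if not isinstance(manifest, dict):
        return ["manifest must be a JSON object"]
    data_sets = manifest.get("DataSets")
    if not isinstance(data_sets, list):
        return ["'DataSets' must be an array"]

    errors = []
    for i, data_set in enumerate(data_sets):
        if not isinstance(data_set, dict):
            errors.append(f"DataSets[{i}] must be an object")
            continue
        for key, value in data_set.items():
            if not key.endswith("_file"):
                continue  # only file fields are checked in this sketch
            is_name_list = isinstance(value, list) and all(
                isinstance(name, str) for name in value
            )
            if not is_name_list:
                errors.append(f"DataSets[{i}].{key} must be an array of file names")
    return errors


good = {"Info": {}, "DataSets": [{"repertoire_file": ["a.json"]}]}
bad = {"Info": {}, "DataSets": [{"repertoire_file": "a.json"}]}
print(manifest_errors(good))
print(manifest_errors(bad))
```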

@schristley
Member

Should there be a load_manifest function, or set of functions, in interface.py?

@schristley
Member

If you look at DataFile, you should see a similarity, both are type-keyed, though in the case of DataFile the contents are the actual data while with DataSet it's a file object.

The object names are backwards now. :)

FileSets ? bleh ;-)

@javh
Contributor

javh commented Mar 25, 2024

From the call:

  • Add example to FileObject.compression (zip, gz, bz2).
  • Add GermlineSet and GenotypeSet to DataSet.
  • Sync specs (/specs seems to be current).
  • Then merge.
  • Handle validation routines for Manifest in separate PR.

Add GermlineSet and GenotypeSet to DataSet
Sync specs.
@bcorrie bcorrie merged commit e93d902 into master Mar 26, 2024
5 checks passed
Development

Successfully merging this pull request may close these issues.

Define a manifest mechanism
5 participants