Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugs in airr-schema.yaml / airr-schema-openapi3.yaml #813

Open
8 of 15 tasks
LonnekeScheffer opened this issue Nov 12, 2024 · 18 comments · May be fixed by #773, #814 or #815
Open
8 of 15 tasks

bugs in airr-schema.yaml / airr-schema-openapi3.yaml #813

LonnekeScheffer opened this issue Nov 12, 2024 · 18 comments · May be fixed by #773, #814 or #815

Comments

@LonnekeScheffer
Copy link

LonnekeScheffer commented Nov 12, 2024

Some issues encountered when auto-generating LinkML from airr-schema.yaml / airr-schema-openapi3.yaml
These are all relatively easy/small fixes.

Issues for both AIRR version 1.5 and 2.0

  • Property release_version has inconsistent type. In AlleleDescription, type=integer; in GermlineSet, type=number. Should be integer in both cases.
  • Ontology species is missing source node in some places. This ontology is correctly defined in Subject, but in AlleleDescription and GermlineSet, the expected field x-airr/ontology/top_node/id is missing.
  • (does not need to be handled here, see comments) Some enums are defined several times, and differ in that they contain 'null' in some places, but not in others: locus (contains 'null' in Rearrangement, but not in AlleleDescription, GermlineSet, Genotype); mhc_class (contains 'null' in Reactivity, not in MHCGenotype)
  • For Ontology 'Property' in class 'CellExpression' (v1.5) / 'Expression' (v2.0) the expected field x-airr/ontology/top_node/id is missing.
  • For ontology antigen in class. 'ReceptorReactivity' (v1.5) / 'Reactivity' (v2.0) the expected field x-airr/ontology/top_node/id is missing.
  • (does not need to be handled here, see comments) Property nodes of class Tree should have $ref: '#/Node' directly under nodes, now it is placed under a field called additionalProperties which is not used anywhere else
  • RepertoireGroup contains property repertoires. This is not simply an array of Repertoire objects, but rather, an array of items which are unnamed (!) objects, where each object contains a repertoire_id, repertoire_description, time_point. There are no other comparable occurrences in the schema files where new objects are defined in such a way. Since this object is listed under 'items', I cannot infer a name for this object when auto-generating LinkML. Therefore, I would suggest creating a new named class, e.g., RepertoireReference, which contains these three fields, and is used as a 'type' for this array.
  • New issue identified: the field 'name' of 'Acknowledgement' clashes with the AKC (means a 'Full name of individual' in AIRR but a 'name for a thing' in AKC). I would recommend changing 'name' to something like 'person_name'
  • For class TimePoint, field 'unit' is an Ontology (UO:0000033 time unit). This name is too generic and clashes with the AKC. Should be renamed time_unit instead.
  • Similarly to the above point, TimePoint has a field called 'value' which clashes in meaning with the AKC. Should be renamed time_value instead.
  • field 'sequencing_run_date' of class SequencingRun has format: date, whereas all other dates have format: date-time. While this does not cause direct issues, I would recommend changing this format to date-time for consistency.

Issues for AIRR version 2.0 only

  • Property unit occurs as an ontology across different classes. This ontology does not have the same meaning. E.g.: TimePoint (unit=UO:0000003), TimeInterval (unit=UO:0000033), PhysicalQuantity (unit=UO:0000024), TimeQuantity (unit=UO:0000033). The field 'unit' should be renamed to something unique (time_unit, time_interval_unit, physical_quantity_unit)
  • For ontology orcid_id in class Contributor the expected field x-airr/ontology/top_node/id is null.
  • For ontology affiliation in class Contributor the expected field x-airr/ontology/top_node/id is null.
  • Property reactivity_ref is defined with two different types. Inside class Reactivity, the type is array. Inside class Rearrangement, the type is string, but the description explicitly states that this string should be a comma separated list. I would propose changing the type to array. (and besides, Rearrangement also contains reactivity_id which is a comma separated list as string, the type for that should then probably change as well.)
@schristley
Copy link
Member

  • RepertoireGroup contains property repertoires. This is not simply an array of Repertoire objects, but rather, an array of items which are unnamed (!) objects, where each object contains a repertoire_id, repertoire_description, time_point. There are no other comparable occurrences in the schema files where new objects are defined in such a way. Since this object is listed under 'items', I cannot infer a name for this object when auto-generating LinkML. Therefore, I would suggest creating a new named class, e.g., RepertoireReference, which contains these three fields, and is used as a 'type' for this array.

Yes, this is a good point that we haven't addressed in #773 yet, and is a criteria we applied to the germline gene set schema. Adding a new class is the solution, though I'm not sure the best name. Maybe RepertoireDetail as its details about how the repertoire is organized in the group.

@schristley
Copy link
Member

Hi @LonnekeScheffer , we need to differentiate these issues between v1.5 and v2.0. If v2.0 ends up being broken for awhile, that's ok, all the data is currently in v1.5 so that's the priority. Maybe some of these issues are only present in v2.0 so we can ignore them for now.

@LonnekeScheffer
Copy link
Author

Hi @schristley, I have now updated the list to separate the airr v1.5 issues (which also occur in v2.0) from the v2.0-only issues

This was linked to pull requests Nov 26, 2024
@schristley
Copy link
Member

  • Some enums are defined several times, and differ in that they contain 'null' in some places, but not in others: locus (contains 'null' in Rearrangement, but not in AlleleDescription, GermlineSet, Genotype); mhc_class (contains 'null' in Reactivity, not in MHCGenotype)

For locus, this is the difference between being required or not, as it is required for the Germline classes but not for Rearrangement. We will need to handle this as an exception in the AKC convert script.

Likewise for mhc_class

@schristley
Copy link
Member

schristley commented Nov 26, 2024

  • For Ontology 'Property' in class 'CellExpression' (v1.5) / 'Expression' (v2.0) the expected field x-airr/ontology/top_node/id is missing.
  • For ontology antigen in class. 'ReceptorReactivity' (v1.5) / 'Reactivity' (v2.0) the expected field x-airr/ontology/top_node/id is missing.

@bussec @bcorrie This seems to be incorrect usage by AIRR standards. Gene names are not ontology-based exactly, i.e. each individual gene is not an ontology term. This is more like a controlled vocabulary from an external source. I don't think we have a similar situation elsewhere in the AIRR schema, so I think we need to define this differently.

Likewise for antigen, each antigen is not an ontology term.

@schristley
Copy link
Member

  • Property nodes of class Tree should have $ref: '#/Node' directly under nodes, now it is placed under a field called additionalProperties which is not used anywhere else

@LonnekeScheffer this is actually correct, it is just uncommon. What this means is that instead of the keys/properties of the dictionary being pre-defined, they are dynamic. The additionalProperties then indicates the schema for those dynamic properties.

For LinkML, this gets translated into a multivalued range

@schristley schristley linked a pull request Jan 28, 2025 that will close this issue
@bcorrie
Copy link
Contributor

bcorrie commented Jan 29, 2025

  • For Ontology 'Property' in class 'CellExpression' (v1.5) / 'Expression' (v2.0) the expected field x-airr/ontology/top_node/id is missing.[ ] For ontology antigen in class. 'ReceptorReactivity' (v1.5) / 'Reactivity' (v2.0) the expected field x-airr/ontology/top_node/id is missing.

@bussec @bcorrie This seems to be incorrect usage by AIRR standards. Gene names are not ontology-based exactly, i.e. each individual gene is not an ontology term. This is more like a controlled vocabulary from an external source. I don't think we have a similar situation elsewhere in the AIRR schema, so I think we need to define this differently.

Likewise for antigen, each antigen is not an ontology term.

@bussec any thoughts on this. These are CURIEs with a label, not Ontologies.

Should we create a LabeledCURIE top level object as follows:

# Properties that are based upon a CURIE with an associated label
LabeledCURIE:
    type: object
    properties:
        id:
            type: string
            description: CURIE of the concept, encoding the ontology and the local ID
        label:
            type: string
            description: Label of the concept in the respective ontology

The format is identical to the Ontology object which is not ideal.

The only difference between an Ontology and Labeled Curie is that in the AIRR attributes the format is ontology and an ontology field is expected to have a AIRR attribute ontology which defines the top node. For a labeled curie the format is CURIE.

So we could get rid of Ontology, just use LabeledCURIE, and rely on the AIRR attributes to define whether the field is an Ontology or a LabeledCURIE by using format: Ontology or format: CURIE respectively

@bcorrie
Copy link
Contributor

bcorrie commented Jan 29, 2025

  • Property unit occurs as an ontology across different classes. This ontology does not have the same meaning. E.g.: TimePoint (unit=UO:0000003), TimeInterval (unit=UO:0000033), PhysicalQuantity (unit=UO:0000024), TimeQuantity (unit=UO:0000033). The field 'unit' should be renamed to something unique (time_unit, time_interval_unit, physical_quantity_unit)
  • New issue identified: the field 'name' of 'Acknowledgement' clashes with the AKC (means a 'Full name of individual' in AIRR but a 'name for a thing' in AKC). I would recommend changing 'name' to something like 'person_name'[x]
  • For class TimePoint, field 'unit' is an Ontology (UO:0000033 time unit). This name is too generic and clashes with the AKC. Should be renamed time_unit instead.[x]
  • Similarly to the above point, TimePoint has a field called 'value' which clashes in meaning with the AKC. Should be renamed time_value instead.

I don't think this is a problem in the AIRR Standard. I think this is a problem/limitation with LinkML and the way it represents the slots. In my opinion at least, it is incredibly limiting to have to have a slot named unit be required to globally have the same characteristics across many classes. I would suggest the solution is to have the AIRR to AKC class conversion convert the AIRR TimePoint.unit to a LinkML slot time_point_unit rather than change the AIRR Standard. Similar for the other fields.

@bcorrie
Copy link
Contributor

bcorrie commented Jan 29, 2025

  • For ontology orcid_id in class Contributor the expected field x-airr/ontology/top_node/id is null.[ ] For ontology affiliation in class Contributor the expected field x-airr/ontology/top_node/id is null.

These are both LabeledCURIES not ontologies - see: #813 (comment)

Once we resolve how to handle LabeledCURIE and Ontology this should be resolved.

@bcorrie
Copy link
Contributor

bcorrie commented Jan 29, 2025

  • Property reactivity_ref is defined with two different types. Inside class Reactivity, the type is array. Inside class Rearrangement, the type is string, but the description explicitly states that this string should be a comma separated list. I would propose changing the type to array. (and besides, Rearrangement also contains reactivity_id which is a comma separated list as string, the type for that should then probably change as well.)

We thought about this, but... Rearrangement.reactivity_ref is a Rearrangement field, and the main file format is a TSV file. Arrays are difficult to serialize to a TSV structured format. This field is similar to v_call, d_call, j_call, which can also have multiple, comma separated values. So we decided that this field should follow the same convention as the the VDJ call fields and be a string with comma separated CURIEs. The same is true for Rearrangement.reactivity_id

The two different definitions for the same field in different objects is the same problem as here: #813 (comment) and I think needs to be resolved at the AKC/LinkML end. The AIRR Standard doesn't have a problem with having two field names the same with different definitions in different objects.

@bcorrie bcorrie linked a pull request Jan 29, 2025 that will close this issue
@LonnekeScheffer
Copy link
Author

LonnekeScheffer commented Jan 29, 2025

  • Property unit occurs as an ontology across different classes. This ontology does not have the same meaning. E.g.: TimePoint (unit=UO:0000003), TimeInterval (unit=UO:0000033), PhysicalQuantity (unit=UO:0000024), TimeQuantity (unit=UO:0000033). The field 'unit' should be renamed to something unique (time_unit, time_interval_unit, physical_quantity_unit)
  • New issue identified: the field 'name' of 'Acknowledgement' clashes with the AKC (means a 'Full name of individual' in AIRR but a 'name for a thing' in AKC). I would recommend changing 'name' to something like 'person_name'[x]
  • For class TimePoint, field 'unit' is an Ontology (UO:0000033 time unit). This name is too generic and clashes with the AKC. Should be renamed time_unit instead.[x]
  • Similarly to the above point, TimePoint has a field called 'value' which clashes in meaning with the AKC. Should be renamed time_value instead.

I don't think this is a problem in the AIRR Standard. I think this is a problem/limitation with LinkML and the way it represents the slots. In my opinion at least, it is incredibly limiting to have to have a slot named unit be required to globally have the same characteristics across many classes. I would suggest the solution is to have the AIRR to AKC class conversion convert the AIRR TimePoint.unit to a LinkML slot time_point_unit rather than change the AIRR Standard. Similar for the other fields.

I could implement a workaround such as a list with 'illegal' AKC slot names (which would for now only contain 'unit' and 'value'), where if the new slot has such an illegal name, the class name is concatenated to it ('time_point_unit'). My only concern with this is that in any current or future conversion to/from AKC and AIRR, this needs to be explicitly taken into account and some kind of similar or reverse workaround needs to be implemented. The difficulty is not the implementation itself, but to remember (/be aware that it's necessary) to do it.

  • Property reactivity_ref is defined with two different types. Inside class Reactivity, the type is array. Inside class Rearrangement, the type is string, but the description explicitly states that this string should be a comma separated list. I would propose changing the type to array. (and besides, Rearrangement also contains reactivity_id which is a comma separated list as string, the type for that should then probably change as well.)

We thought about this, but... Rearrangement.reactivity_ref is a Rearrangement field, and the main file format is a TSV file. Arrays are difficult to serialize to a TSV structured format. This field is similar to v_call, d_call, j_call, which can also have multiple, comma separated values. So we decided that this field should follow the same convention as the the VDJ call fields and be a string with comma separated CURIEs. The same is true for Rearrangement.reactivity_id

The two different definitions for the same field in different objects is the same problem as here: #813 (comment) and I think needs to be resolved at the AKC/LinkML end. The AIRR Standard doesn't have a problem with having two field names the same with different definitions in different objects.

The same thing goes here. I can implement this, but it is a hack, and the more such explicit workarounds I implement for specific cases, the more likely that it will cause conversion problems in the future.

@schristley
Copy link
Member

  • Property unit occurs as an ontology across different classes. This ontology does not have the same meaning. E.g.: TimePoint (unit=UO:0000003), TimeInterval (unit=UO:0000033), PhysicalQuantity (unit=UO:0000024), TimeQuantity (unit=UO:0000033). The field 'unit' should be renamed to something unique (time_unit, time_interval_unit, physical_quantity_unit)
  • New issue identified: the field 'name' of 'Acknowledgement' clashes with the AKC (means a 'Full name of individual' in AIRR but a 'name for a thing' in AKC). I would recommend changing 'name' to something like 'person_name'[x]
  • For class TimePoint, field 'unit' is an Ontology (UO:0000033 time unit). This name is too generic and clashes with the AKC. Should be renamed time_unit instead.[x]
  • Similarly to the above point, TimePoint has a field called 'value' which clashes in meaning with the AKC. Should be renamed time_value instead.

I don't think this is a problem in the AIRR Standard. I think this is a problem/limitation with LinkML and the way it represents the slots.

I don't consider it a problem either with the AIRR Standard, just a discussion about conventions. I feel like we've always had this informal guideline to make field names unique, thus why we have explicitly named id fields instead of just calling them all id, and we've seemed to do that almost everywhere in the schema.

Also, these are new fields, so we aren't breaking anything by changing the names. If we want them unique then now is the best time to change them.

In my opinion at least, it is incredibly limiting to have to have a slot named unit be required to globally have the same characteristics across many classes. I would suggest the solution is to have the AIRR to AKC class conversion convert the AIRR TimePoint.unit to a LinkML slot time_point_unit rather than change the AIRR Standard. Similar for the other fields.

I would argue that we still do this in AIRR Standards, regardless of AKC and linkml. We call the field repertoire_id in all the objects that have it, which the intention that field (ok don't call it a "slot" ;-) is the same everywhere throughout the standards.

I get your point about AKC and linkml, but setting that aside. Is there any reason why we cannot do this for AIRR Standards? What will break or why is bad to change the names?

@bcorrie
Copy link
Contributor

bcorrie commented Jan 30, 2025

I don't think this is a problem in the AIRR Standard. I think this is a problem/limitation with LinkML and the way it represents the slots.

I don't consider it a problem either with the AIRR Standard, just a discussion about conventions.

I have always thought this was a problem with LinkML - not sure if it is just the way we are using it or a fundamental issue. Having a global list of slots that are used throughout all classes is like have a global list of variables in a large complex piece of software. It just seems wrong. The fact that within the AKC Schema we can't have two classes A and B with slots named the same unless they are of identical type seems ridiculous to me. If this problem went away then there would be no problems with the schema transformation.

@bcorrie
Copy link
Contributor

bcorrie commented Jan 30, 2025

I would argue that we still do this in AIRR Standards, regardless of AKC and linkml. We call the field repertoire_id in all the objects that have it, which the intention that field (ok don't call it a "slot" ;-) is the same everywhere throughout the standards.

I don't think that is a good example - repertoire_id is always talking about the same thing. It is an ID for a repertoire, and it appears in many objects, so the name of field identifies what type of ID it is. The only place where there is an extra qualifier on the _id is in the Repertoire object where we could have Repertoire.id but we chose to have Repertoire.repertoire_id

I would suggest that we gave up on unique field names in the AIRR Standard as soon as we started having complex object structures beyond just having Repertoire and Rearrangement. We now have complex time, quantity, person objects all of which represent similar but different things. As such they have similar fields that are sometime subtly different.

@bcorrie
Copy link
Contributor

bcorrie commented Jan 30, 2025

I get your point about AKC and linkml, but setting that aside. Is there any reason why we cannot do this for AIRR Standards? What will break or why is bad to change the names?

I don't think it is practical or logical to have unique field names across all objects. Objects encapsulate concepts, and those concepts have attributes. The attributes should be named such that they make sense in the context of the Object, not globally.

It makes no sense to me that if we want to add a unit attribute to an object in the AIRR Standard that we can't just say Object.unit and instead we have to use Object.object_unit. In addition, we only have to do this if there is another object in the AIRR Standard that has a unit attribute that causes a global attribute conflict, otherwise it is OK to use Object.unit. Even worse, when at time T we only have object A in the AIRR Standard with a A.unit field, and then at some point later we add a new object B with a unit field - we then need to use B.b_unit because we have a conflict? So some units have qualifiers and others don't. Do we go back and change A.unit to A.a_unit to be consistent? That seems like a mess to me...

@schristley
Copy link
Member

I would argue that we still do this in AIRR Standards, regardless of AKC and linkml. We call the field repertoire_id in all the objects that have it, which the intention that field (ok don't call it a "slot" ;-) is the same everywhere throughout the standards.

I don't think that is a good example - repertoire_id is always talking about the same thing. It is an ID for a repertoire, and it appears in many objects, so the name of field identifies what type of ID it is.

sequence, junction, junction_aa, pub_ids, locus, species, etc. We use the same names for the same concepts all over the place in the AIRR Standards.

The only place where there is an extra qualifier on the _id is in the Repertoire object where we could have Repertoire.id but we chose to have Repertoire.repertoire_id

??? but we do have them, there's contributor_id, sequence_id, germline_set_id, receptor_genotype_id, and on and on. Pretty much every object that has an ID has a uniquely named field for it.

@schristley
Copy link
Member

I get your point about AKC and linkml, but setting that aside. Is there any reason why we cannot do this for AIRR Standards? What will break or why is bad to change the names?

I don't think it is practical or logical to have unique field names across all objects. Objects encapsulate concepts, and those concepts have attributes. The attributes should be named such that they make sense in the context of the Object, not globally.

It makes no sense to me that if we want to add a unit attribute to an object in the AIRR Standard that we can't just say Object.unit and instead we have to use Object.object_unit. In addition, we only have to do this if there is another object in the AIRR Standard that has a unit attribute that causes a global attribute conflict, otherwise it is OK to use Object.unit. Even worse, when at time T we only have object A in the AIRR Standard with a A.unit field, and then at some point later we add a new object B with a unit field - we then need to use B.b_unit because we have a conflict? So some units have qualifiers and others don't. Do we go back and change A.unit to A.a_unit to be consistent? That seems like a mess to me...

This is an abstract and theoretical concern. In practicality, we haven't run into this situation. Partially, I would argue, because we've been using unique names for semantically distinct concepts, and not using polymorphism such that the same name is used across objects to refer to completely different concepts.

All of the other objects avoid this ambiguity, except for these new objects where unit has different meanings based upon the object they are in (either its a time unit or its a physical unit)

@bcorrie
Copy link
Contributor

bcorrie commented Jan 30, 2025

I will let others comment... 8-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment