Encoding specified after the encoded data #9

Juul · 2017-03-11T03:04:58Z

In the currently available SBOL examples, the encoding tag within the sequence tag appears after the end of the elements tag. This is problematic for streaming parsers, since they have to buffer the entire contents of each elements tag before it can be decoded.

If the elements tag contains a lot of data, e.g. if a user of SBOL-compliant software decides to save a whole unannotated genome in SBOL format, then such a parser would have to load the entire genome into memory.

Possibly something to improve for future SBOL versions?
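
To make the buffering problem concrete, here is a minimal sketch using Python's standard `xml.sax`. It is not taken from any SBOL library. It assumes the SBOL 2 namespace `http://sbols.org/v2#`; the class `SequenceHandler`, the helper `parse`, and the `example.org` URI are made up for illustration. The handler passes sequence text downstream as soon as the encoding is known, and has to hold it in memory otherwise.

```python
import io
import xml.sax
import xml.sax.handler

SBOL_NS = "http://sbols.org/v2#"  # assumed SBOL 2 namespace
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


class SequenceHandler(xml.sax.ContentHandler):
    """Streams <elements> text if the encoding is already known,
    and buffers it if <encoding> has not been seen yet."""

    def __init__(self):
        super().__init__()
        self.encoding = None    # encoding URI of the current Sequence
        self.in_elements = False
        self.pending = []       # chunks held until the encoding is known
        self.held = 0           # characters currently held in `pending`
        self.peak_buffered = 0  # largest number of characters ever held
        self.emitted = []       # (encoding, chunk) pairs passed downstream

    def startElementNS(self, name, qname, attrs):
        if name == (SBOL_NS, "Sequence"):
            self.encoding, self.pending, self.held = None, [], 0
        elif name == (SBOL_NS, "elements"):
            self.in_elements = True
        elif name == (SBOL_NS, "encoding"):
            self.encoding = attrs.get((RDF_NS, "resource"))
            # The encoding arrived late: flush whatever had to be buffered.
            self.emitted.extend((self.encoding, c) for c in self.pending)
            self.pending, self.held = [], 0

    def characters(self, content):
        if not self.in_elements:
            return
        if self.encoding is None:
            self.pending.append(content)
            self.held += len(content)
            self.peak_buffered = max(self.peak_buffered, self.held)
        else:
            self.emitted.append((self.encoding, content))

    def endElementNS(self, name, qname):
        if name == (SBOL_NS, "elements"):
            self.in_elements = False


def parse(document: str) -> SequenceHandler:
    """Run the handler over an XML string and return it for inspection."""
    handler = SequenceHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(document.encode("utf-8")))
    return handler


SEQ = "attaaagaggagaaa"
TEMPLATE = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:sbol="http://sbols.org/v2#">'
    '<sbol:Sequence rdf:about="http://example.org/seq/1">{body}</sbol:Sequence>'
    '</rdf:RDF>'
)
ELEMENTS = "<sbol:elements>" + SEQ + "</sbol:elements>"
ENCODING = ('<sbol:encoding rdf:resource='
            '"http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html"/>')

late = parse(TEMPLATE.format(body=ELEMENTS + ENCODING))   # encoding after data
early = parse(TEMPLATE.format(body=ENCODING + ELEMENTS))  # encoding before data
print(late.peak_buffered, early.peak_buffered)
```

When the encoding follows the elements tag, `peak_buffered` equals the full sequence length; when it comes first, nothing is held. A Sequence with no encoding child at all would never be flushed in this sketch.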

cjmyers · 2017-03-11T06:26:08Z

Thanks for the suggestion. However, it is not actually prescribed that it comes before or after. In XML, there is no order to the tags. I don’t think we can prescribe one. We could make our libraries serialize in one order, but we can never be sure some other implementation did not swap them, so we would need to be equipped for both. Keep in mind that for DNA, the encoding is nearly always going to be the simple IUPAC code, and it really does not require any “decoding”.

Juul · 2017-03-12T07:33:46Z

Yes, I am proposing that it should not be a tag at all but rather an attribute of either the sequence or elements tags. The fact that you can currently encounter the encoding tag after the elements tag is causing issues with my streaming processor.

The reason why I need to know the encoding is that I don't even know if it's DNA, amino acid or SMILES data before I get to the encoding tag. I could look at the data itself, but you can have amino acid or SMILES data that consists only of characters that are legal in either format.

My streaming processor is building a BLAST database from a large number of user-uploaded files and it needs to discard the SMILES data (and sometimes DNA or amino acid sequence data depending on parameters) or the BLAST database command will exit with an error. I cannot even easily pre-categorize the SBOL files on user upload, since a single SBOL file could contain sequences with different encodings, so I'm left with no option but to buffer an unknown and potentially very large amount of sequence data.
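
The filtering step described above could look roughly like this once each sequence has been paired with its encoding. This is only a sketch: `write_fasta` and the record layout `(display_id, encoding_uri, elements)` are hypothetical, the only real encoding URI used is the nucleic-acid one quoted elsewhere in this thread, and the `example.org` URI stands in for any encoding that should be discarded.

```python
import io
from typing import FrozenSet, Iterable, TextIO, Tuple

# Encoding URI for IUPAC nucleic acid codes, as quoted in this thread.
NUCLEIC_ACID = "http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html"

Record = Tuple[str, str, str]  # (display_id, encoding_uri, elements)


def write_fasta(
    records: Iterable[Record],
    out: TextIO,
    keep: FrozenSet[str] = frozenset({NUCLEIC_ACID}),
    width: int = 60,
) -> int:
    """Write records whose encoding is in `keep` as FASTA; return how many."""
    written = 0
    for display_id, encoding, elements in records:
        if encoding not in keep:
            continue  # e.g. SMILES data that the BLAST tooling would reject
        out.write(f">{display_id}\n")
        # Wrap the sequence at `width` characters per line.
        for start in range(0, len(elements), width):
            out.write(elements[start:start + width] + "\n")
        written += 1
    return written


records = [
    ("BBa_B0030_sequence", NUCLEIC_ACID, "attaaagaggagaaa"),
    ("some_small_molecule", "http://example.org/hypothetical-smiles", "CCO"),
]
fasta = io.StringIO()
kept = write_fasta(records, fasta)
print(kept)
print(fasta.getvalue(), end="")
```

The point of the thread still applies: this filter can only run record by record if the encoding is known before the sequence text arrives.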

Regardless it's always good practice to keep metadata before the actual data, rather than leaving that decision to the implementors.

graik · 2017-03-12T08:04:48Z

I think this is a pretty typical example of why we need more tightly specified formats for everyday applications (i.e. "fully specified sequence"). It also shows that the sbol:type field should really be an rdf:type field defining specific sub-classes for DNA, protein, RNA and chemicals. The fact that one needs to parse all the fields of a ComponentDefinition before knowing whether it is the expected DNA or RNA or protein or even an abstract ray of light really complicates everyday use. A DNAComponentDefinition then could be tightly specified to guarantee a certain encoding.

Juul · 2017-03-12T08:45:23Z

Hm, yes @graik that would definitely solve the problem. I don't know enough about SBOL to say if that might prevent some legitimate use-cases that mix DNA, protein and RNA.

cjmyers · 2017-03-12T10:15:31Z

Ah, I understand you now. You would like something like this:

    <sbol:elements rdf:datatype="http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html">attaaagaggagaaa</sbol:elements>

I just tested this with libSBOLj, and including the datatype in this way does not cause any problems. Currently, libSBOLj ignores this datatype field, meaning it gets dropped. However, I believe it should be preserving it, and we should fix it to do so. Not sure how libSBOL/pySBOL handle it. Would be worth a test.

In any case, I believe that even with SBOL today, you should be allowed to do this in your files, and it is, in my opinion, still legal SBOL serialization. I will log an issue to libSBOLj’s tracker to preserve this information. Hopefully, this will address your issue.

graik · 2017-03-12T11:34:30Z

This would indeed be a pretty straightforward solution but is it also valid RDF? From the OWL reference:

> NOTE: It is not illegal, although not recommended, for applications to define their own datatypes by defining an instance of rdfs:Datatype. Such datatypes are "unrecognized", but are treated in a similar fashion as "unsupported datatypes" (see Sec. 6.3 <https://www.w3.org/TR/owl-ref/#DatatypeSupport> for details about how these should be treated by OWL tools).

I don't know whether this applies only to ontology definitions but I doubt it. It could create a problem if "elements" receives a data type that is unknown to normal RDF tools instead of the "String" that it really is. Some parsers may decide to skip the entry at the low level. Instead we could use the standard rdf:type field on the level of "Sequence" to point to something like DNA sequence, Protein sequence, etc. This would essentially mean we define sub-classes of Sequence in the SBOL data model, which is still a pretty minimal solution. Sub-classing ComponentDefinition would be much better, IMO, but is a larger change.

cjmyers · 2017-03-12T11:57:52Z

I think the datatype should be fine, since in my experience, when it is ignored, the value is treated as a string. It certainly creates no issues with SBOL tools. I will check how Virtuoso handles it.

Chris

graik · 2017-03-12T12:06:47Z

"rdf:datatype" pointing to an .html address looks very wrong... sub-classing Sequence would be the much cleaner solution. Probably with additional benefits if we, further down the road, also implement it in the library data model.

cjmyers · 2017-03-12T12:22:39Z

The problem with sub-typing is that all current SBOL tools using our libraries will not treat the object as a Sequence but rather as a GenericTopLevel, so the tools will no longer work. The sub-typing solution requires a change to SBOL and all libraries and ultimately all software. So, this is a really heavy solution. Here is a better one:

    <sbol:Sequence rdf:about="http://synbiohub.org/public/igem/BBa_B0030_sequence/1">
      <sbol:persistentIdentity rdf:resource="http://synbiohub.org/public/igem/BBa_B0030_sequence"/>
      <sbol:displayId>BBa_B0030_sequence</sbol:displayId>
      <sbol:version>1</sbol:version>
      <prov:wasDerivedFrom rdf:resource="http://parts.igem.org/Part:BBa_B0030"/>
      <prov:wasGeneratedBy rdf:resource="http://synbiohub.org/public/igem/igem2sbol/1"/>
      <sbh:ownedBy rdf:resource="http://synbiohub.org/user/james"/>
      <sbh:ownedBy rdf:resource="http://synbiohub.org/user/myers"/>
      <sbol:elements sbol:encoding="http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html">attaaagaggagaaa</sbol:elements>
      <sbol:encoding rdf:resource="http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html"/>
    </sbol:Sequence>

I just checked this with SBOL Validator. The encoding is duplicated, so tools using our libraries will find it. It is also included in the elements field as was requested. I see no problem with Marc taking this approach. If this is useful, we can modify the library serialization to include this attribute. All existing tools will work, and new tools using a streaming parser should also be able to take advantage of this.
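
For a streaming consumer, the duplicated attribute means the encoding is available at the start tag of the elements field, before any sequence text arrives. Below is a minimal sketch with Python's standard `xml.sax`, assuming the SBOL 2 namespace `http://sbols.org/v2#`. The names `AttributeEncodingHandler` and `stream_elements`, the `example.org` URIs, and the second Sequence are hypothetical and only there for illustration.

```python
import io
import xml.sax
import xml.sax.handler

SBOL_NS = "http://sbols.org/v2#"  # assumed SBOL 2 namespace
NASEQ = "http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html"
OTHER = "http://example.org/hypothetical-encoding"  # placeholder, not a real SBOL URI


class AttributeEncodingHandler(xml.sax.ContentHandler):
    """Calls on_chunk(encoding, text) for each piece of <elements> text,
    taking the encoding from the sbol:encoding attribute on <elements>."""

    def __init__(self, on_chunk):
        super().__init__()
        self.on_chunk = on_chunk
        self.current_encoding = None
        self.in_elements = False

    def startElementNS(self, name, qname, attrs):
        if name == (SBOL_NS, "elements"):
            self.in_elements = True
            # None if the attribute is absent; a caller would then
            # have to fall back to buffering.
            self.current_encoding = attrs.get((SBOL_NS, "encoding"))

    def characters(self, content):
        if self.in_elements:
            self.on_chunk(self.current_encoding, content)

    def endElementNS(self, name, qname):
        if name == (SBOL_NS, "elements"):
            self.in_elements = False
            self.current_encoding = None


def stream_elements(document: str, on_chunk) -> None:
    """Parse an XML string, forwarding sequence text without buffering it."""
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, True)
    parser.setContentHandler(AttributeEncodingHandler(on_chunk))
    parser.parse(io.BytesIO(document.encode("utf-8")))


DOC = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:sbol="http://sbols.org/v2#">'
    '<sbol:Sequence rdf:about="http://example.org/seq/dna">'
    f'<sbol:elements sbol:encoding="{NASEQ}">attaaagaggagaaa</sbol:elements>'
    f'<sbol:encoding rdf:resource="{NASEQ}"/>'
    '</sbol:Sequence>'
    '<sbol:Sequence rdf:about="http://example.org/seq/other">'
    f'<sbol:elements sbol:encoding="{OTHER}">CCO</sbol:elements>'
    f'<sbol:encoding rdf:resource="{OTHER}"/>'
    '</sbol:Sequence>'
    '</rdf:RDF>'
)

seen = {}
stream_elements(DOC, lambda enc, chunk: seen.setdefault(enc, []).append(chunk))
print({enc: "".join(chunks) for enc, chunks in seen.items()})
```

Each chunk reaches the callback already labelled with its encoding, so a single file with mixed encodings can be split on the fly. Files written by tools that do not emit the attribute would still need the buffering fallback.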

graik · 2017-03-12T14:33:31Z

I see. Keeping things within the sbol name space. Yes, this looks like a good fix. Perhaps Sequence sub-types can be raised again for sbol 3. Greetings Raik

> Reply to this email directly, view it on GitHub < #9# issuecomment-285940284>, or mute the thread <https://github.com/ notifications/unsubscribe-auth/ADWD97ojL6N0fU81XNBNiJCpg2koLd YLks5rk9_ZgaJpZM4MaEyb>. > — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#9>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABxs3UhGHmj5zOjts2fjGflfn5IWueVVks5rk-OQgaJpZM4MaEyb> .

--
Raik Grünberg
http://www.raiks.de/contact.html
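
For readers following the thread, here is a minimal sketch of how a streaming consumer could use the `sbol:encoding` attribute on `sbol:elements` proposed above. It is only an illustration and is not taken from any SBOL library: the `ElementsHandler` class, the `extract` helper, the `SAMPLE` document and the `http://example.org/smiles` encoding URI are all hypothetical, and it assumes the SBOL 2 namespace `http://sbols.org/v2#`. Because the attribute arrives with the start tag, the handler can decide whether to keep or discard the sequence before any of its characters are delivered, so nothing needs to be buffered.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

SBOL_NS = "http://sbols.org/v2#"  # assumed SBOL 2 namespace
IUPAC_DNA = "http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html"


class ElementsHandler(ContentHandler):
    """Forward sbol:elements text to a sink only when the encoding matches."""

    def __init__(self, wanted_encoding, sink):
        super().__init__()
        self.wanted = wanted_encoding
        self.sink = sink    # callable that receives text chunks as they arrive
        self.keep = False   # True only while inside a wanted sbol:elements tag

    def startElementNS(self, name, qname, attrs):
        if name == (SBOL_NS, "elements"):
            # The attribute is available before any sequence data is read.
            self.keep = attrs.get((SBOL_NS, "encoding")) == self.wanted

    def characters(self, content):
        if self.keep:
            self.sink(content)

    def endElementNS(self, name, qname):
        if name == (SBOL_NS, "elements"):
            self.keep = False


def extract(xml_bytes, wanted_encoding=IUPAC_DNA):
    """Hypothetical helper: return the sequence text with the wanted encoding."""
    chunks = []
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(ElementsHandler(wanted_encoding, chunks.append))
    parser.parse(io.BytesIO(xml_bytes))
    return "".join(chunks)


# Hypothetical two-sequence document; the second encoding URI is a placeholder.
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:sbol="http://sbols.org/v2#">
  <sbol:Sequence rdf:about="http://example.org/dna/1">
    <sbol:elements sbol:encoding="http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html">attaaagaggagaaa</sbol:elements>
    <sbol:encoding rdf:resource="http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html"/>
  </sbol:Sequence>
  <sbol:Sequence rdf:about="http://example.org/smiles/1">
    <sbol:elements sbol:encoding="http://example.org/smiles">CCO</sbol:elements>
    <sbol:encoding rdf:resource="http://example.org/smiles"/>
  </sbol:Sequence>
</rdf:RDF>
"""
```

In a real pipeline the sink would write straight to the downstream tool's input rather than to a list. Files that lack the attribute would still need the buffering fallback discussed earlier in the thread, since the separate `sbol:encoding` tag may appear after `sbol:elements`.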

palchicz · 2019-03-28T22:53:01Z

@cjmyers, has the change been incorporated into the library already? @Juul, does this address your concern?

jakebeal · 2020-08-24T00:39:49Z

I believe this is now moot for SBOL 3, which uses RDF as a serialization format (such that we don't have control of ordering) and which also allows genome-scale sequences to be stored as ExternalReference objects instead.

cjmyers · 2020-08-24T14:16:31Z

Not sure about this one. I think he wants the encoding to be a data type attribute. Might be worth further thought.

cjmyers · 2020-10-08T11:10:48Z

Should be dealt with by creating some genome editing use cases to ensure we do not need to store and exchange very large sequences.

Juul added the enhancement New feature or request label Mar 11, 2017

cjmyers removed the enhancement New feature or request label Oct 9, 2020

jakebeal mentioned this issue Oct 6, 2022

Develop a mechanism for practices documents #16

Open

LukasBuecherl transferred this issue from SynBioDex/SBOL-specification Oct 6, 2022

