Encoding specified after the encoded data #9
Comments
Thanks for the suggestion. However, the specification does not actually prescribe whether the encoding comes before or after the elements. In RDF/XML, the order of these property tags carries no meaning, so I don’t think we can prescribe one. We could make our libraries serialize in one order, but we can never be sure some other implementation did not swap them, so we would need to be equipped for both. Keep in mind that for DNA the encoding is nearly always going to be the simple IUPAC code, which does not really require any “decoding”.
On Mar 11, 2017, at 3:04 AM, Marc Juul wrote:
In the currently available SBOL examples the encoding tag within the sequence tag is specified after the end of the elements tag. This is problematic for streaming parsers since they then have to buffer the entire contents of each elements tag before it can be decoded.
If the elements tag contains a lot of data, e.g. if a user of SBOL-compliant software decides to save a whole unannotated genome in SBOL format, then the entire genome would have to be loaded into memory in such a parser.
Possibly something to improve for future SBOL versions?
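To make the buffering problem described above concrete, here is a minimal sketch using Python's standard `xml.sax` module. The handler class, the helper `parse_buffered`, and the `example.org` identifier are made up for illustration; only the element names and the encoding URI come from the SBOL examples discussed in this thread. Because `sbol:encoding` arrives after `sbol:elements`, the handler has no choice but to hold the whole sequence in memory until the encoding is known.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces

SBOL_NS = "http://sbols.org/v2#"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Illustrative document: the encoding tag follows the elements tag.
DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:sbol="http://sbols.org/v2#">
  <sbol:Sequence rdf:about="http://example.org/seq/1">
    <sbol:elements>attaaagaggagaaa</sbol:elements>
    <sbol:encoding rdf:resource="http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html"/>
  </sbol:Sequence>
</rdf:RDF>"""


class BufferingHandler(xml.sax.ContentHandler):
    """Hypothetical handler: must buffer elements until the encoding is seen."""

    def __init__(self):
        super().__init__()
        self.in_elements = False
        self.chunks = []          # sequence text held in memory
        self.buffered = 0         # characters currently buffered
        self.peak_buffered = 0    # largest buffer size observed
        self.encoding = None
        self.results = []         # (encoding, sequence) pairs

    def startElementNS(self, name, qname, attrs):
        if name == (SBOL_NS, "Sequence"):
            self.chunks, self.buffered, self.encoding = [], 0, None
        elif name == (SBOL_NS, "elements"):
            self.in_elements = True
        elif name == (SBOL_NS, "encoding"):
            self.encoding = attrs.get((RDF_NS, "resource"))

    def characters(self, content):
        if self.in_elements:
            # Nothing can be decided yet, so every chunk is kept.
            self.chunks.append(content)
            self.buffered += len(content)
            self.peak_buffered = max(self.peak_buffered, self.buffered)

    def endElementNS(self, name, qname):
        if name == (SBOL_NS, "elements"):
            self.in_elements = False
        elif name == (SBOL_NS, "Sequence"):
            # Only now are both the data and its encoding available.
            self.results.append((self.encoding, "".join(self.chunks)))


def parse_buffered(xml_text):
    handler = BufferingHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return handler.results, handler.peak_buffered


results, peak = parse_buffered(DOC)
print(results, peak)
```

With a 15-character sequence the buffer is trivial, but `peak_buffered` grows with the length of the longest sequence, which is the concern for genome-sized entries.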
|
Yes, I am proposing that it should not be a tag at all but rather an attribute of either the sequence or elements tag. The fact that you can currently encounter the encoding tag after the elements tag is causing issues with my streaming processor.
The reason I need to know the encoding is that I don't even know whether the data is DNA, amino acids or SMILES before I get to the encoding tag. I could look at the data itself, but you can have amino acid or SMILES data that consists only of characters that are legal in either format.
My streaming processor builds a BLAST database from a large number of user-uploaded files, and it needs to discard the SMILES data (and sometimes DNA or amino acid sequence data, depending on parameters) or the BLAST database command will exit with an error. I cannot even easily pre-categorize the SBOL files on upload, since a single SBOL file could contain sequences with different encodings, so I'm left with no option but to buffer an unknown and potentially very large amount of sequence data.
Regardless, it's always good practice to keep metadata before the actual data, rather than leaving that decision to the implementors. |
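As an illustration of the ambiguity mentioned in the comment above, the following small sketch checks a string against character-level alphabets only. The alphabets and the helper `fits` are assumptions made for this example (the 16 IUPAC nucleotide codes and the 20 standard one-letter amino acid codes); a real SMILES check would need a proper parser. The string `CCN` is a valid SMILES string (ethylamine) that also passes both sequence alphabets, so inspecting the data alone cannot settle which encoding was meant.

```python
# Character-level alphabets only; this is not a validator for any format.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def fits(alphabet, text):
    """Return True if every character of text is in the given alphabet."""
    return set(text.upper()) <= alphabet


candidate = "CCN"  # also valid SMILES (ethylamine)
print(fits(IUPAC_NUCLEOTIDES, candidate), fits(AMINO_ACIDS, candidate))
```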
I think this is a pretty typical example of why we need more tightly
specified formats for everyday applications (i.e. "fully specified
sequence"). It also shows that the sbol:type field should really be an
rdf:type field defining specific sub-classes for DNA, protein, RNA and
chemicals. The fact that one needs to parse all the fields of a
ComponentDefinition before knowing whether it is the expected DNA or RNA or
protein or even an abstract ray of light really complicates everyday use. A
DNAComponentDefinition then could be tightly specified to guarantee a
certain encoding.
--
___________________________________
Raik Grünberg
http://www.raiks.de/contact.html
___________________________________
|
Hm, yes @graik that would definitely solve the problem. I don't know enough about SBOL to say if that might prevent some legitimate use-cases that mix DNA, protein and RNA. |
Ah, I understand you now. You would like something like this:
<sbol:elements rdf:datatype="http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html">attaaagaggagaaa</sbol:elements>
I just tested this with libSBOLj, and including the datatype in this way does not cause any problems. Currently, libSBOLj ignores this datatype field, meaning it gets dropped. However, I believe it should preserve it, and we should fix it to do so. I am not sure how libSBOL/pySBOL handle it; that would be worth a test.
In any case, I believe that even with SBOL today, you should be allowed to do this in your files, and it is, in my opinion, still legal SBOL serialization. I will log an issue to libSBOLj’s tracker to preserve this information. Hopefully, this will address your issue.
|
This would indeed be a pretty straightforward solution but is it also valid
RDF? From the OWL reference:
NOTE: It is not illegal, although not recommended, for applications to
define their own datatypes by defining an instance of rdfs:Datatype. Such
datatypes are "unrecognized", but are treated in a similar fashion as
"unsupported datatypes" (see Sec. 6.3
<https://www.w3.org/TR/owl-ref/#DatatypeSupport> for details about how
these should be treated by OWL tools).
I don't know whether this applies only to ontology definitions but I doubt
it. It could create a problem if "elements" receives a data type that is
unknown to normal RDF tools instead of the "String" that it really is. Some
parsers may decide to skip the entry at the low level. Instead we could use
the standard rdf:type field on the level of "Sequence" to point to
something like DNA sequence, Protein sequence, etc. This would essentially
mean we define sub-classes of Sequence in the SBOL data model, which is
still a pretty minimal solution. Sub-classing ComponentDefinition would be
much better, IMO, but is a larger change.
|
I think the datatype should be fine: in my experience, when it is ignored the value is treated as a string. It certainly creates no issues with SBOL tools. I will check how Virtuoso handles it.
Chris
|
"rdf:datatype" pointing to an .html address looks very wrong...
sub-classing Sequence would be the much cleaner solution. Probably with
additional benefits if we, further down the road, also implement it in the
library data model.
|
The problem with sub-typing is that all current SBOL tools using our libraries will not treat the object as a Sequence but rather a GeneticTopLevel, so the tools will no longer work. The subType solution requires a change to SBOL and all libraries and ultimately all software. So, this is a really heavy solution.
Here is a better one:
<sbol:Sequence rdf:about="http://synbiohub.org/public/igem/BBa_B0030_sequence/1">
  <sbol:persistentIdentity rdf:resource="http://synbiohub.org/public/igem/BBa_B0030_sequence"/>
  <sbol:displayId>BBa_B0030_sequence</sbol:displayId>
  <sbol:version>1</sbol:version>
  <prov:wasDerivedFrom rdf:resource="http://parts.igem.org/Part:BBa_B0030"/>
  <prov:wasGeneratedBy rdf:resource="http://synbiohub.org/public/igem/igem2sbol/1"/>
  <sbh:ownedBy rdf:resource="http://synbiohub.org/user/james"/>
  <sbh:ownedBy rdf:resource="http://synbiohub.org/user/myers"/>
  <sbol:elements sbol:encoding="http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html">attaaagaggagaaa</sbol:elements>
  <sbol:encoding rdf:resource="http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html"/>
</sbol:Sequence>
I just checked this with SBOL Validator. The encoding is duplicated, so tools using our libraries will find it. It is also included in the elements field as was requested. I see no problem with Marc taking this approach. If this is useful, we can modify the library serialization to include this attribute. All existing tools will work, and new tools using a streaming parser should also be able to take advantage of this.
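A streaming consumer could take advantage of the duplicated attribute roughly as sketched below, again with Python's standard `xml.sax`. This is only an illustration under stated assumptions: the class `StreamingFilter`, the helper `filter_sequences`, the `example.org` identifiers and the second (chemical) sequence are invented for the example, and the SMILES encoding URI is assumed rather than taken from this thread. Since the encoding is available on the opening `sbol:elements` tag, the handler decides immediately whether to keep the data and forwards each chunk to a sink without buffering it.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces

SBOL_NS = "http://sbols.org/v2#"
IUPAC_DNA = "http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html"
SMILES = "http://www.opensmiles.org/opensmiles.html"  # assumed URI, for illustration

# Illustrative document: each elements tag carries its encoding as an attribute.
DOC = f"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:sbol="http://sbols.org/v2#">
  <sbol:Sequence rdf:about="http://example.org/seq/dna">
    <sbol:elements sbol:encoding="{IUPAC_DNA}">attaaagaggagaaa</sbol:elements>
    <sbol:encoding rdf:resource="{IUPAC_DNA}"/>
  </sbol:Sequence>
  <sbol:Sequence rdf:about="http://example.org/seq/chem">
    <sbol:elements sbol:encoding="{SMILES}">CCN</sbol:elements>
    <sbol:encoding rdf:resource="{SMILES}"/>
  </sbol:Sequence>
</rdf:RDF>"""


class StreamingFilter(xml.sax.ContentHandler):
    """Hypothetical handler: forwards wanted sequences chunk by chunk."""

    def __init__(self, wanted, sink):
        super().__init__()
        self.wanted = wanted  # set of encoding URIs to keep
        self.sink = sink      # any object with a write() method
        self.keep = False

    def startElementNS(self, name, qname, attrs):
        if name == (SBOL_NS, "elements"):
            # The encoding is known before any sequence data arrives.
            self.keep = attrs.get((SBOL_NS, "encoding")) in self.wanted

    def characters(self, content):
        if self.keep:
            self.sink.write(content)  # no buffering of the sequence

    def endElementNS(self, name, qname):
        if name == (SBOL_NS, "elements"):
            if self.keep:
                self.sink.write("\n")
            self.keep = False


def filter_sequences(xml_text, wanted):
    sink = io.StringIO()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(StreamingFilter(wanted, sink))
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return sink.getvalue()


print(filter_sequences(DOC, {IUPAC_DNA}))
```

Files written by tools that do not emit the attribute would still need the buffering fallback, since the attribute is an addition rather than a requirement.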
|
I see. Keeping things within the sbol name space. Yes, this looks like a
good fix. Perhaps Sequence sub-types can be raised again for sbol 3.
Greetings
Raik
…On Sun, Mar 12, 2017 at 3:22 PM, cjmyers ***@***.***> wrote:
The problem with sub-typing is that all current SBOL tools using our
libraries will not treat the object as a Sequence but rather a
GeneticTopLevel, so the tools will no longer work. The subType solution
requires a change to SBOL and all libraries and ultimately all software.
So, this is a really heavy solution.
Here is a better one:
<sbol:Sequence rdf:about="http://synbiohub.org/public/igem/BBa_B0030_
sequence/1">
<sbol:persistentIdentity rdf:resource="http://
synbiohub.org/public/igem/BBa_B0030_sequence"/>
<sbol:displayId>BBa_B0030_sequence</sbol:displayId>
<sbol:version>1</sbol:version>
<prov:wasDerivedFrom rdf:resource="http://parts.igem.org/Part:BBa_B0030"/>
<prov:wasGeneratedBy rdf:resource="http://synbiohub.org/public/igem/
igem2sbol/1"/>
<sbh:ownedBy rdf:resource="http://synbiohub.org/user/james"/>
<sbh:ownedBy rdf:resource="http://synbiohub.org/user/myers"/>
<sbol:elements sbol:encoding="http://www.chem.qmul.ac.uk/iubmb/misc/
naseq.html">attaaagaggagaaa</sbol:elements>
<sbol:encoding rdf:resource="http://www.chem.qmul.ac.uk/iubmb/misc/naseq.
html"/>
</sbol:Sequence>
I just checked this with SBOL Validator. The encoding is duplicated, so
tools using our libraries will find it. It is also included in the elements
field as was requested. I see no problem with Marc taking this approach. If
this is useful, we can modify the library serialization to include this
attribute. All existing tools will work, and new tools using a streaming
parser should also be able to take advantage of this.
> On Mar 12, 2017, at 12:06 PM, Raik Grünberg ***@***.***>
wrote:
>
> "rdf:datatype" pointing to an .html address looks very wrong...
> sub-classing Sequence would be the much cleaner solution. Probably with
> additional benefits if we, further down the road, also implement it in
the
> library data model.
>
> On Sun, Mar 12, 2017 at 2:57 PM, cjmyers ***@***.***>
wrote:
>
> > Think datatype should be fine since when ignored in my experience it is
> > treated as string. Certainly it creates no issues with SBOL tools. Will
> > check how virtuoso handles it.
> >
> > Chris
> >
> > Sent from my iPhone
> >
> > > On Mar 12, 2017, at 11:34 AM, Raik Grünberg <
***@***.***>
> > wrote:
> > >
> > > This would indeed be a pretty straightforward solution but is it also
> > valid
> > > RDF? From the OWL reference:
> > >
> > > > NOTE: It is not illegal, although not recommended, for
applications to
> > > define their own datatypes by defining an instance of rdfs:Datatype.
Such
> > > datatypes are "unrecognized", but are treated in a similar fashion as
> > > "unsupported datatypes" (see Sec. 6.3
> > > <https://www.w3.org/TR/owl-ref/#DatatypeSupport> for details about
how
> >
> > > these should be treated by OWL tools).
> > >
> > > I don't know whether this applies only to ontology definitions but I
> > doubt
> > > it. It could create a problem if "elements" receives a data type
that is
> > > unknown to normal RDF tools instead of the "String" that it really
is.
> > Some
> > > parsers may decide to skip the entry at the low level. Instead we
could
> > use
> > > the standard rdf:type field on the level of "Sequence" to point to
> > > something like DNA sequence, Protein sequence, etc. This would
> > essentially
> > > mean we define sub-classes of Sequence in the SBOL data model, which
is
> > > still a pretty minimal solution. Sub-classing ComponentDefinition
would
> > be
> > > much better, IMO, but is a larger change.
> > >
> > > On Sun, Mar 12, 2017 at 1:15 PM, cjmyers ***@***.***> wrote:
> > >
> > > > Ah, I understand you now. You would like something like this:
> > > >
> > > > <sbol:elements rdf:datatype="http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html">attaaagaggagaaa</sbol:elements>
> > > >
> > > > I just tested this with libSBOLj, and it does not cause any problems to include the datatype in this way. Currently, libSBOLj will ignore this datatype field, meaning it gets dropped. However, I believe it should be preserving it, and we should fix it to do so. Not sure how libSBOL/pySBOL handle it. Would be worth a test.
> > > >
> > > > In any case, I believe that even with SBOL today, you should be allowed to do this in your files, and it is, in my opinion, still legal SBOL serialization. I will log an issue to libSBOLj’s tracker to preserve this information. Hopefully, this will address your issue.
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > ___________________________________
> > > Raik Grünberg
> > > http://www.raiks.de/contact.html
> > > ___________________________________
> > >
> >
> >
>
>
>
> --
> ___________________________________
> Raik Grünberg
> http://www.raiks.de/contact.html
> ___________________________________
>
--
___________________________________
Raik Grünberg
http://www.raiks.de/contact.html
___________________________________
|
I believe this is now moot for SBOL 3, which uses RDF as a serialization format (such that we don't have control of ordering) and which also allows genome-scale sequences to be stored as ExternalReference objects instead. |
Not sure about this one. I think he wants the encoding to be a data type attribute. Might be worth further thought. |
Should be dealt with by creating some genome editing use cases to ensure we do not need to store and exchange very large sequences. |
In the currently available SBOL examples the encoding tag within the sequence tag is specified after the end of the elements tag. This is problematic for streaming parsers since they then have to buffer the entire contents of each elements tag before it can be decoded.

If the elements tag contains a lot of data e.g. if a user of SBOL compliant software decides to save a whole unannotated genome in SBOL format then the entire genome would have to be loaded into memory in such a parser.

Possibly something to improve for future SBOL versions?
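
To make the streaming concern concrete, here is a minimal sketch in Python of how a SAX-style parser could decide whether to forward or discard sequence text before reading any of it, assuming the encoding were carried as an `rdf:datatype` attribute on the elements tag, as suggested later in this thread. This is not how any SBOL library works: the handler class, the `stream_dna` helper, and the sample document are all hypothetical, and the namespace and encoding URIs are assumed to be the ones used in SBOL 2 files.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# Assumed namespace and encoding URIs (as commonly seen in SBOL 2 files).
SBOL_NS = "http://sbols.org/v2#"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
IUPAC_DNA = "http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html"


class ElementsStreamer(ContentHandler):
    """Forward DNA sequence text chunk by chunk; skip all other text."""

    def __init__(self, sink):
        super().__init__()
        self.sink = sink         # callable that receives each text chunk
        self.forwarding = False  # True only inside a DNA-typed elements tag

    def startElementNS(self, name, qname, attrs):
        if name == (SBOL_NS, "elements"):
            # Attributes arrive with the start tag, before any sequence text,
            # so the keep/discard decision needs no buffering.
            self.forwarding = attrs.get((RDF_NS, "datatype")) == IUPAC_DNA

    def characters(self, content):
        if self.forwarding:
            self.sink(content)

    def endElementNS(self, name, qname):
        if name == (SBOL_NS, "elements"):
            self.forwarding = False


def stream_dna(xml_bytes, sink):
    """Parse SBOL-like RDF/XML bytes and pass DNA text chunks to sink."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(ElementsStreamer(sink))
    parser.parse(io.BytesIO(xml_bytes))


# Hypothetical two-sequence document: one DNA, one SMILES.
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:sbol="http://sbols.org/v2#">
  <sbol:Sequence rdf:about="http://example.org/seq1">
    <sbol:elements rdf:datatype="http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html">attaaagaggagaaa</sbol:elements>
  </sbol:Sequence>
  <sbol:Sequence rdf:about="http://example.org/seq2">
    <sbol:elements rdf:datatype="http://www.opensmiles.org/opensmiles.html">CCO</sbol:elements>
  </sbol:Sequence>
</rdf:RDF>"""

if __name__ == "__main__":
    chunks = []
    stream_dna(SAMPLE, chunks.append)
    print("".join(chunks))
```

With the encoding in a sibling tag that may follow the elements tag, the same handler would have to hold every chunk in memory until the encoding is seen, which is the buffering cost described above.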