Fully specified vs. incomplete designs #7

jakebeal · 2016-11-03T12:30:20Z

Something that has been coming up on the SBOL developers list and in other discussions is the challenge of determining when a design is "complete and consistent."

We need to specify a simple algorithm by which an SBOL structural design can be flattened into a single nucleic acid sequence. If this algorithm can be executed, then a design is complete and consistent; if it cannot, then the design is either incomplete (if there is missing information) and/or inconsistent (if there is information that conflicts).
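As a minimal sketch of what such a flattening check could look like (not an agreed SBOL algorithm): the `Part` and `Placement` classes, the 0-based `start` offsets, and the two exception names below are all hypothetical stand-ins for illustration, not the SBOL data model or any library API. It takes the strictest reading, where any disagreement between overlapping sequences counts as a conflict.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class IncompleteDesign(Exception):
    """Some position of the flattened sequence is not specified."""


class InconsistentDesign(Exception):
    """Two sources specify different bases for the same position."""


@dataclass
class Part:  # hypothetical stand-in for a structural design node
    name: str
    sequence: Optional[str] = None
    children: List["Placement"] = field(default_factory=list)


@dataclass
class Placement:  # hypothetical: a child part at a 0-based offset on its parent
    part: Part
    start: int


def flatten(part: Part) -> str:
    """Flatten a design into one sequence, or raise if that is impossible."""
    placed = [(p.start, flatten(p.part)) for p in part.children]
    if part.sequence is None and not placed:
        raise IncompleteDesign(f"{part.name}: no sequence and no children")
    if part.sequence is not None:
        bases = list(part.sequence)
    else:
        # Length is implied by the rightmost child (an assumption of this sketch).
        bases = [None] * max(start + len(seq) for start, seq in placed)
    for start, seq in placed:
        for i, base in enumerate(seq, start):
            if i >= len(bases):
                raise InconsistentDesign(f"{part.name}: child extends past parent")
            if bases[i] is None:
                bases[i] = base
            elif bases[i] != base:
                raise InconsistentDesign(f"{part.name}: conflict at position {i}")
    if None in bases:
        raise IncompleteDesign(f"{part.name}: gap at position {bases.index(None)}")
    return "".join(bases)


promoter = Part("promoter", "ttgaca")
cds = Part("cds", "atgaaa")
device = Part("device", children=[Placement(promoter, 0), Placement(cds, 6)])
print(flatten(device))
```

Under these assumptions, a design is "complete and consistent" exactly when `flatten` returns without raising.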

mikebissell · 2016-11-03T21:46:42Z

Hmm. One ring to rule them all? I'm dubious.

First, biophysics is not constrained by bioinformatics. We might observe a compositional relationship that we cannot yet explain in terms of a program. We might encode a composition whose physical implementation details are not ready for publication. We might understand a number of possible compositional operations that the SBOL community has yet to contemplate in its deliberations. Nevertheless, we can still draw up a practically complete and consistent CAD model, even if certain op nodes are sort of black-boxed for now.

Second, abstract designs are complete; they are simply abstract. Abstract models are useful for all kinds of things: classification, constraints, generalization, templating, ...

Third, in general we cannot know, and do not seek to dictate to users, which clever recipes they might discover for implementing any arbitrary (potentially novel) compositional function that takes a number of subcomponents on input and returns an assembled component definition as output. There are multiple ways, physically and logically, to implement these genetic compositions, and many ways yet unknown.

Side note. This is why it's a good idea to encode the observed or predicted assembled sequence up on the parent node; we don't naively assume that the parent sequence can be magically inferred as the sum of its children's sequences. The [logical/physical] compositional [reaction/program] details are currently external to this particular layer of our model.

Now, if we still wish to classify known compositional modes or procedural implementations, then I suggest we implement them as a separate, extensible plug-in layer that refers to the SBOL2-structural elements. The structural layer itself should be like modeling clay, and standards typically lag well behind discovery.

jakebeal · 2016-11-03T21:51:31Z

I'd like to understand better some of the problems that you see with predicting sequence from a linked set of ComponentDefinitions. Can you give an example of a non-straightforward sequence prediction?

mikebissell · 2016-11-03T22:04:17Z

Okay, what if you don't know which assembly method I've chosen? You can't very well tell me that I've done it wrong just because I used an unfamiliar reaction.

What if we are studying an unfamiliar assembly process, and we need to encode the observed ingredient inputs and assembled outputs, but we haven't yet discovered an algorithm that captures its logic?

"Linkage" is biophysically nonspecific, right?

jakebeal · 2016-11-03T22:21:52Z

Do we actually need to know the assembly method? I'm thinking that it's better to not know the method. Rather than saying "how" (assembly method) we could say "what", and simply annotate "this part of the sequence goes away."

Can you give an example of an assembly where you think that wouldn't work?

cjmyers · 2016-11-04T10:12:19Z

This problem stems from the fact that people publish incomplete designs. This is the rule and not the exception. In virtually every supplemental to a paper we have examined, there is no complete sequence given. They give a few parts, but not the whole thing. Therefore, the paper is not reproducible. So, we are not saying this is a rule for all SBOL to be valid. We definitely want to allow people to encode anything they want to in SBOL. We are saying if you have built something and you want to publish it, you should publish at least one SBOL file that is “complete”. This is the SBOL file definition that we want to make. Note you can certainly submit more SBOL files that show more of the steps of assembly, etc., but there should be at least one SBOL file that can be flattened down to a complete annotated sequence.

Chris

graik · 2016-11-04T14:07:10Z

Side note. This is why it's a good idea to encode the observed or
predicted assembled sequence up on the parent node; we don't naively
assume that the parent sequence can be magically inferred as the sum of its
children's sequences. The [logical/physical] compositional
[reaction/program] details are currently external to this particular layer
of our model.

Interesting. That's exactly what I was thinking of as the best pragmatic
solution to the "full sequence specified?" problem. See my other mail from
some minutes ago. So we define a fully sequence-specified SBOL record as
one where the full sequence is attached to the highest parent node.
We might need some terms for saying "this is only one possible reference
sequence out of several" instead of "this is it". And we might want to say
whether conflicts are resolved in favour of the parent reference or in
favour of subpart sequences. And we need more than one parent sequence if
there is more than one molecule involved (2 plasmids, for example). And we
need some way to specify insertion points into a genome.

In any case, a definite reference sequence (if and only if already known)
will solve a lot of issues.

Raik Grünberg
http://www.raiks.de/contact.html

mikebissell · 2016-11-17T00:45:36Z

Jake clarified:
Do we actually need to know the assembly method? I'm thinking that it's better to not know the method.

Okay, I may have misunderstood Jake's initial suggestion, but this conversation has nevertheless borne fruit.

Yes, I think that a useful blueprint will always contain sequences for each of the sub-parts in the exploded diagram, plus an "imploded" or merged sequence on the apex node(s), with sub-parts positioned on parent sequence(s) as needed, plus roles tagged as they are known.

So, how do we detect "unpublishably underspecified" material?

Look for ComponentDefinitions lacking a payload (sequence)...
...recursively? Is any subcomponent with no payload intolerably vague for this purpose?
What else?
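The recursive check asked about above might be sketched as follows. This assumes a hypothetical, acyclic, in-memory `ComponentDefinition` class with `sequence` and `subcomponents` fields; it is not the pySBOL or libSBOLj API, and whether a sequence-less subcomponent should really count as "intolerably vague" is exactly the open question.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class ComponentDefinition:  # hypothetical simplified model, not a library class
    name: str
    sequence: Optional[str] = None
    subcomponents: List["ComponentDefinition"] = field(default_factory=list)


def missing_payloads(
    cd: ComponentDefinition, path: Tuple[str, ...] = ()
) -> Iterator[Tuple[str, ...]]:
    """Yield the path to every definition in the hierarchy that lacks a sequence."""
    here = path + (cd.name,)
    if not cd.sequence:
        yield here
    for sub in cd.subcomponents:
        yield from missing_payloads(sub, here)


plasmid = ComponentDefinition(
    "plasmid",
    "ttgacaatgaaa",
    [
        ComponentDefinition("promoter", "ttgaca"),
        ComponentDefinition("cds", None, [ComponentDefinition("tag", "aaa")]),
    ],
)
for p in missing_payloads(plasmid):
    print("/".join(p))
```

Reporting paths rather than a yes/no answer leaves the policy decision (reject, warn, or tolerate) to whoever consumes the result.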

I'm inclined to say that if a child component has a mergepoint where it integrates into the parent sequence, then it should be positioned on that sequence explicitly. But we can't validate that advice. So: would it be too restrictive to require that every child component have a position on the parent? I have nagging concerns:

In silico, in biocompiler toolchain logical operators for example, there may be cases where child entities in a model do not align to their parent -- because they are aggressively rewritten, or because they are consulted and discarded during an optimizer pass. Nevertheless they belong in the model (the language) because they were part of the intermediate representation which got reworked through successive passes of the compiler. People might be inclined to use the model for describing analogous physical configurations as well. Perhaps we could positively tag throwaway subcomponents as such, using a contextual role, and then require that for "publishable" material, any child that is not positioned on the parent must be tagged with a certain role on its Component node, indicating that the child was intentionally not aligned to its parent?
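A rough sketch of that "publishable" rule, under stated assumptions: the role string `intentionally-unaligned` is a made-up placeholder rather than an existing SBOL or ontology term, and `Component` here is a hypothetical simplified record, not a library class.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Hypothetical role name; no such term is defined by SBOL as far as this sketch knows.
UNALIGNED_ROLE = "intentionally-unaligned"


@dataclass
class Component:  # hypothetical: a child as it appears inside its parent
    name: str
    location: Optional[Tuple[int, int]] = None  # (start, end) on the parent, if positioned
    roles: Set[str] = field(default_factory=set)


def unexplained_children(children: List[Component]) -> List[str]:
    """Names of children that are neither positioned nor tagged as unaligned."""
    return [
        c.name
        for c in children
        if c.location is None and UNALIGNED_ROLE not in c.roles
    ]


children = [
    Component("promoter", location=(0, 6)),
    Component("scratch_template", roles={UNALIGNED_ROLE}),
    Component("orphan"),
]
print(unexplained_children(children))
```

Material would pass this particular check only when the returned list is empty.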

I realize I am recapitulating certain things that were said earlier on the mailing list. I suspect that some of these proposed roles may soon graduate into first-class attributes if they keep cropping up.

jakebeal · 2016-11-17T03:29:29Z

I think this is a very useful direction of discussion. If we take the notion of "unpublishably vague" as our target, then I would have two suggestions:

"Weakly well-specified:" such a ComponentDefinition (or each member of a Collection of CDs) must specify its entire sequence, or else every unspecified portion of its sequence must be precisely specified by one or more of its Components. Any unspecified portion covered by multiple Components must have no conflicts in the sequences they specify.
"Strongly well-specified:" such a ComponentDefinition (or each member of a Collection of CDs) must be weakly well-specified AND every Component in the CD must also be strongly well-specified and also have a precisely defined relationship between the sequence of its definition and the sequence of the CD.
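To make the two definitions concrete, here is one possible reading as a sketch. Everything in it is an assumption for illustration: the `CD` and `Component` classes, 0-based `start` offsets, a parent sequence that is either fully present or absent (no partial sequences), and "precisely defined relationship" narrowed to "positioned on the parent and identical to the parent at that position".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CD:  # hypothetical simplified ComponentDefinition
    name: str
    sequence: Optional[str] = None
    components: List["Component"] = field(default_factory=list)


@dataclass
class Component:  # hypothetical: a use of a definition inside a parent CD
    definition: CD
    start: Optional[int] = None  # 0-based offset on the parent; None = unpositioned


def resolve(cd: CD) -> Optional[str]:
    """Return cd's full sequence, or None if it is not weakly well-specified."""
    if cd.sequence is not None:
        return cd.sequence  # the CD specifies its entire sequence itself
    bases: Dict[int, str] = {}
    for comp in cd.components:
        child = resolve(comp.definition)
        if child is None or comp.start is None:
            continue  # this component specifies nothing usable
        for i, base in enumerate(child, comp.start):
            if bases.setdefault(i, base) != base:
                return None  # overlapping components disagree
    # Length is implied by the rightmost component, so a trailing gap is undetectable.
    if not bases or set(bases) != set(range(max(bases) + 1)):
        return None  # nothing specified, or a gap remains
    return "".join(bases[i] for i in range(len(bases)))


def weakly_well_specified(cd: CD) -> bool:
    return resolve(cd) is not None


def strongly_well_specified(cd: CD) -> bool:
    full = resolve(cd)
    if full is None:
        return False
    for comp in cd.components:
        child = resolve(comp.definition)
        if comp.start is None or child is None:
            return False  # no defined relationship to the parent sequence
        if full[comp.start:comp.start + len(child)] != child:
            return False  # assumed relationship: identity at the given position
        if not strongly_well_specified(comp.definition):
            return False
    return True


promoter = CD("promoter", "ttgaca")
cds = CD("cds", "atgaaa")
assembled = CD("assembled", components=[Component(promoter, 0), Component(cds, 6)])
published = CD("published", "ttgacaatgaaa", [Component(cds)])  # child not positioned
print(weakly_well_specified(assembled), strongly_well_specified(assembled))
print(weakly_well_specified(published), strongly_well_specified(published))
```

A richer notion of "relationship" (for example, annotated deletions or rewrites between child and parent) would replace the identity comparison in `strongly_well_specified`.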

graik · 2016-11-17T07:13:32Z

I think we are on an excellent path here.

Case (1) (unambiguous)

I think this is a very useful direction of discussion. If we take the
notion of "unpublishably vague" as our target, then I would have two
suggestions:

"Weakly well-specified:" such a ComponentDefinition (or each member of
a Collection of CDs) must specify its entire sequence, or else every
unspecified portion of its sequence must be precisely specified by one or
more of its Components. Any unspecified portion covered by multiple
Components must have no conflicts in the sequences they specify.

In other words, there may be gaps in the overall sequence but these gaps
are explicit and clearly accounted for and there are no sequence conflicts.

Case (2) (fully specified)

"Strongly well-specified:" such a ComponentDefinition (or each member
of a Collection of CDs) must be weakly well-specified AND every Component
in the CD must also be strongly well-specified and also have a precisely
defined relationship between the sequence of its definition and the
sequence of the CD.

This means the complete overall sequence is given and there aren't any
ambiguities. The easiest way to handle this is indeed a "parent"
ComponentDefinition with the full sequence spelled out.

jakebeal · 2016-11-17T11:40:33Z

Close, but my intent is a little different: in case (1), there might be places that are conflicting (or apparently conflicting) down in the child components, but we don't worry about them as long as the parent has a sequence covering that area. This is to handle Mike's cases of assembly and optimization (though I think those may be better handled in another way, that's an orthogonal discussion).

In case (2), we have to understand every stage of composition, as well as the final sequence.

jakebeal · 2020-08-24T00:26:07Z

I think this is now handled by the "flattening-friendly" structure of SBOL3.

cjmyers · 2020-08-24T14:21:03Z

Not sure. Think this one needs a bit further thought.

jakebeal · 2020-08-24T14:25:46Z

@cjmyers If you think it needs further thought, shouldn't it still be open?

cjmyers · 2020-08-24T14:28:35Z

Oops pressed wrong button.

cjmyers · 2020-10-08T11:12:48Z

Need an algorithm to be worked out.

jakebeal · 2022-04-27T12:19:08Z

This is generalized flattening, vs. #8, which is only about sequences.

jakebeal added the enhancement New feature or request label Nov 3, 2016

graik mentioned this issue Oct 6, 2022

sequence flattening algorithm / pseudocode in Spec #8

Open

cjmyers assigned bbartley and cjmyers Jun 18, 2018

cjmyers closed this as completed Aug 24, 2020

jakebeal reopened this Aug 24, 2020

cjmyers unassigned bbartley and cjmyers Oct 7, 2020

PrashantVaidyanathan assigned cjmyers Mar 8, 2021

cjmyers removed the enhancement New feature or request label Apr 27, 2022

jakebeal mentioned this issue Oct 6, 2022

Develop a mechanism for practices documents #16

Open

LukasBuecherl transferred this issue from SynBioDex/SBOL-specification Oct 6, 2022
