Fully specified vs. incomplete designs #7

Open
jakebeal opened this issue Nov 3, 2016 · 16 comments

@jakebeal
Contributor

jakebeal commented Nov 3, 2016

Something that has been coming up on the SBOL developers list and in other discussions is the challenge of determining when a design is "complete and consistent."

We need to specify a simple algorithm by which an SBOL structural design can be flattened into a single nucleic acid sequence. If this algorithm can be executed, then a design is complete and consistent; if it cannot, then the design is incomplete (information is missing), inconsistent (information conflicts), or both.
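To make the proposal concrete, here is a minimal sketch of what such a flattening check could look like. It is not an agreed algorithm and does not use the SBOL data model: `Part`, `Placement`, `flatten`, and the two exception names are hypothetical, each part is assumed to have a known length, and children are assumed to sit at fixed 0-based offsets on their parent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class IncompleteDesign(Exception):
    """Raised when some position of the sequence is not specified."""


class InconsistentDesign(Exception):
    """Raised when two sources disagree about the same position."""


@dataclass
class Part:
    # Hypothetical stand-in for a ComponentDefinition; not the SBOL data model.
    name: str
    length: int
    sequence: Optional[str] = None
    children: List["Placement"] = field(default_factory=list)


@dataclass
class Placement:
    # Hypothetical: the child occupies parent positions [start, start + child.length).
    start: int
    child: Part


def flatten(part: Part) -> str:
    """Return the single sequence for `part`, or raise if it cannot be built."""
    if part.sequence is not None and len(part.sequence) != part.length:
        raise InconsistentDesign(f"{part.name}: sequence length != declared length")
    # Start from the part's own sequence, or from all-unknown positions.
    bases: List[Optional[str]] = (
        list(part.sequence) if part.sequence is not None else [None] * part.length
    )
    for placement in part.children:
        child_seq = flatten(placement.child)
        if placement.start < 0 or placement.start + len(child_seq) > part.length:
            raise InconsistentDesign(f"{placement.child.name} does not fit in {part.name}")
        for offset, base in enumerate(child_seq):
            pos = placement.start + offset
            if bases[pos] is None:
                bases[pos] = base          # fill a previously unknown position
            elif bases[pos] != base:
                raise InconsistentDesign(f"{part.name}: conflict at position {pos}")
    if any(b is None for b in bases):
        raise IncompleteDesign(f"{part.name}: unspecified positions remain")
    return "".join(bases)


promoter = Part("promoter", 4, "ttga")
cds = Part("cds", 3, "atg")
device = Part("device", 7, children=[Placement(0, promoter), Placement(4, cds)])
print(flatten(device))
```

In this sketch "complete and consistent" simply means that `flatten` returns without raising. Note that it treats any disagreement between parent and child as an inconsistency, which is the strictest reading of the proposal.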

@jakebeal jakebeal added the enhancement New feature or request label Nov 3, 2016
@mikebissell

mikebissell commented Nov 3, 2016

Hmm. One ring to rule them all? I'm dubious.

First, biophysics is not constrained by bioinformatics. We might observe a compositional relationship that we cannot yet explain in terms of a program. We might encode a composition whose physical implementation details are not ready for publication. We might understand a number of possible compositional operations that the SBOL community has yet to contemplate in its deliberations. Nevertheless, we can still draw up a practically complete and consistent CAD model, even if certain op nodes are sort of black-boxed for now.

Second, abstract designs are complete; they are simply abstract. Abstract models are useful for all kinds of things: classification, constraints, generalization, templating, ...

Third, in general we cannot know, and do not seek to dictate to users, which clever recipes they might discover for implementing any arbitrary (potentially novel) compositional function that takes a number of subcomponents on input and returns an assembled component definition as output. There are multiple ways, physically and logically, to implement these genetic compositions, and many ways yet unknown.

Side note. This is why it's a good idea to encode the observed or predicted assembled sequence up on the parent node; we don't naively assume that the parent sequence can be magically inferred as the sum of its children's sequences. The [logical/physical] compositional [reaction/program] details are currently external to this particular layer of our model.

Now, if we still wish to classify known compositional modes or procedural implementations, then I suggest we implement them as a separate, extensible plug-in layer that refers to the SBOL2-structural elements. The structural layer itself should be like modeling clay, and standards typically lag well behind discovery.

@jakebeal
Contributor Author

jakebeal commented Nov 3, 2016

I'd like to understand better some of the problems that you see with predicting sequence from a linked set of ComponentDefinitions. Can you give an example of a non-straightforward sequence prediction?

@mikebissell

Okay, what if you don't know which assembly method I've chosen? You can't very well tell me that I've done it wrong just because I used an unfamiliar reaction.

What if we are studying an unfamiliar assembly process, and we need to encode the observed ingredient inputs and assembled outputs, but we haven't yet discovered an algorithm that captures its logic?

"Linkage" is biophysically nonspecific, right?

@jakebeal
Contributor Author

jakebeal commented Nov 3, 2016

Do we actually need to know the assembly method? I'm thinking that it's better to not know the method. Rather than saying "how" (assembly method) we could say "what", and simply annotate "this part of the sequence goes away."

Can you give an example of an assembly where you think that wouldn't work?
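As a rough illustration of the "what, not how" idea, the sketch below drops annotated ranges from a part's sequence without knowing which assembly method removed them. The function name `surviving_sequence` and the half-open `(start, end)` range convention are assumptions made for this example only.

```python
from typing import List, Tuple


def surviving_sequence(sequence: str, removed: List[Tuple[int, int]]) -> str:
    """Drop the half-open ranges [start, end) annotated as 'goes away'."""
    keep = [True] * len(sequence)
    for start, end in removed:
        # Clamp each range to the sequence so out-of-range annotations are harmless.
        for i in range(max(start, 0), min(end, len(sequence))):
            keep[i] = False
    return "".join(base for base, kept in zip(sequence, keep) if kept)


# The first seven bases are annotated as removed; only the rest survives.
print(surviving_sequence("ggtctcaatgc", [(0, 7)]))
```

Nothing here records why the range disappears, which is the point of the suggestion: the annotation states the outcome and leaves the reaction unspecified.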

@cjmyers
Contributor

cjmyers commented Nov 4, 2016

This problem stems from the fact that people publish incomplete designs. This is the rule and not the exception. In virtually every supplemental to a paper we have examined, there is no complete sequence given. They give a few parts, but not the whole thing. Therefore, the paper is not reproducible. So, we are not saying this is a rule for all SBOL to be valid. We definitely want to allow people to encode anything they want to in SBOL. We are saying if you have built something and you want to publish it, you should publish at least one SBOL file that is “complete”. This is the SBOL file definition that we want to make. Note you can certainly submit more SBOL files that show more of the steps of assembly, etc., but there should be at least one SBOL file that can be flattened down to a complete annotated sequence.

Chris


@graik

graik commented Nov 4, 2016

Side note. This is why it's a good idea to encode the observed or
predicted assembled sequence up on the parent node; we don't naively
assume that the parent sequence can be magically inferred as the sum of its
children's sequences. The [logical/physical] compositional
[reaction/program] details are currently external to this particular layer
of our model.

Interesting. That's exactly what I was thinking of as the best pragmatic
solution to the "full sequence specified?" problem. See my other mail from
some minutes ago. So we define a fully-sequence specified SBOL record as
one where the full sequence is attached to the highest parent node.
We might need some terms for saying "this is only one possible reference
sequence out of several" instead of "this is it". And we might want to say
whether conflicts are resolved in favour of the parent reference or in
favour of subpart sequences. And we need more than one parent sequence if
there is more than one molecule involved (2 plasmids, for example). And we
need some way to specify insertion points into a genome.

In any case, a definite reference sequence (if and only if already known)
will solve a lot of issues.
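A small sketch of how a parent reference sequence could be used to spot conflicts with subparts. `SubPart` and `conflicts_with_reference` are hypothetical names, offsets are assumed to be 0-based, and the comparison ignores case. The sketch only reports disagreements; whether the parent or the subpart wins is left open, as in the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubPart:
    # Hypothetical: a sub-part with its own sequence and a 0-based offset on the parent.
    name: str
    start: int
    sequence: str


def conflicts_with_reference(reference: str, parts: List[SubPart]) -> List[str]:
    """Names of sub-parts whose sequence differs from the reference at their position."""
    return [
        p.name
        for p in parts
        # A part that runs past the end yields a shorter slice and is reported too.
        if reference[p.start:p.start + len(p.sequence)].lower() != p.sequence.lower()
    ]


reference = "ttgaatg"
parts = [SubPart("promoter", 0, "ttga"), SubPart("cds", 4, "atc")]
print(conflicts_with_reference(reference, parts))
```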



Raik Grünberg
http://www.raiks.de/contact.html


@mikebissell

Jake clarified:
Do we actually need to know the assembly method? I'm thinking that it's better to not know the method.

Okay, I may have misunderstood Jake's initial suggestion, but this conversation has nevertheless borne fruit.

Yes, I think that a useful blueprint will always contain sequences for each of the sub-parts in the exploded diagram, plus an "imploded" or merged sequence on the apex node(s), with sub-parts positioned on parent sequence(s) as needed, plus roles tagged as they are known.

So, how do we detect "unpublishably underspecified" material?

  • Look for ComponentDefinitions lacking a payload (sequence)...
  • ...recursively? Is any subcomponent with no payload intolerably vague for this purpose?
  • What else?
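The first two checks in the list could be sketched as follows. `Definition` is a hypothetical stand-in for a ComponentDefinition, and the `recursive` flag corresponds to the open question of whether subcomponents without a payload should count.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Definition:
    # Hypothetical stand-in for a ComponentDefinition.
    name: str
    sequence: Optional[str] = None
    subcomponents: List["Definition"] = field(default_factory=list)


def missing_payload(root: Definition, recursive: bool = True) -> List[str]:
    """Sorted names of definitions that carry no sequence."""
    missing = []
    stack = [root]
    seen = set()
    while stack:
        node = stack.pop()
        if id(node) in seen:       # shared sub-parts are visited only once
            continue
        seen.add(id(node))
        if not node.sequence:
            missing.append(node.name)
        if recursive:
            stack.extend(node.subcomponents)
    return sorted(missing)


rbs = Definition("rbs")
cds = Definition("cds", "atg")
top = Definition("top", "aaatg", [rbs, cds])
print(missing_payload(top))
print(missing_payload(top, recursive=False))
```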

I'm inclined to say that if a child component has a mergepoint where it integrates into the parent sequence, then it should be positioned on that sequence explicitly. But we can't validate that advice. So: would it be too restrictive to require that every child component have a position on the parent? I have nagging concerns:

In silico, in biocompiler toolchain logical operators for example, there may be cases where child entities in a model do not align to their parent -- because they are aggressively rewritten, or because they are consulted and discarded during an optimizer pass. Nevertheless they belong in the model (the language) because they were part of the intermediate representation which got reworked through successive passes of the compiler. People might be inclined to use the model for describing analogous physical configurations as well. Perhaps we could positively tag throwaway subcomponents as such, using a contextual role, and then require that for "publishable" material, any child that is not positioned on the parent must be tagged with a certain role on its Component node, indicating that the child was intentionally not aligned to its parent?
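A minimal sketch of that rule: every child is either positioned on the parent or explicitly tagged as intentionally unaligned. The role string `intentionally-not-aligned` is invented for this example; it is not an existing SBOL or ontology term.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Hypothetical role name, made up for this sketch.
NOT_ALIGNED = "intentionally-not-aligned"


@dataclass
class Child:
    # Hypothetical stand-in for a Component node under a parent definition.
    name: str
    location: Optional[Tuple[int, int]] = None   # (start, end) on the parent, if positioned
    roles: Set[str] = field(default_factory=set)


def unexplained_children(children: List[Child]) -> List[str]:
    """Names of children that are neither positioned nor tagged as throwaway."""
    return [
        c.name
        for c in children
        if c.location is None and NOT_ALIGNED not in c.roles
    ]


children = [
    Child("promoter", (0, 4)),
    Child("scratch", roles={NOT_ALIGNED}),
    Child("mystery"),
]
print(unexplained_children(children))
```

Under this rule, "publishable" material would be material for which the returned list is empty.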

I realize I am recapitulating certain things that were said earlier on the mailing list. I suspect that some of these proposed roles may soon graduate into first-class attributes if they keep cropping up.

@jakebeal
Contributor Author

I think this is a very useful direction of discussion. If we take the notion of "unpublishably vague" as our target, then I would have two suggestions:

  • "Weakly well-specified:" such a ComponentDefinition (or each member of a Collection of CDs) must specify its entire sequence, or else every unspecified portion of its sequence must be precisely specified by one or more of its Components. Any unspecified portion covered by multiple Components must have no conflicts in the sequences they specify.
  • "Strongly well-specified:" such a ComponentDefinition (or each member of a Collection of CDs) must be weakly well-specified AND every Component in the CD must also be strongly well-specified and also have a precisely defined relationship between the sequence of its definition and the sequence of the CD.
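One possible reading of the two definitions, written as a sketch rather than a specification. The assumptions are: every definition has a known length, unspecified positions are marked with a placeholder character (`GAP`, invented here), a "precisely defined relationship" means the Component has a 0-based offset that fits inside the parent, and the hierarchy has no cycles. The names `CD`, `Component`, and `resolved` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

GAP = "."  # hypothetical marker for a position the definition leaves unspecified


@dataclass
class CD:
    # Hypothetical stand-in for a ComponentDefinition with a known length.
    name: str
    sequence: str                      # bases, with GAP at unspecified positions
    components: List["Component"] = field(default_factory=list)


@dataclass
class Component:
    definition: CD
    start: Optional[int] = None        # offset on the parent; None if not located


def resolved(cd: CD) -> List[Optional[str]]:
    """Best-known base per position; None where unknown or where Components disagree."""
    bases: List[Optional[str]] = [None if b == GAP else b for b in cd.sequence]
    filled: Dict[int, str] = {}
    conflicted = set()
    for comp in cd.components:
        if comp.start is None:
            continue
        for offset, base in enumerate(resolved(comp.definition)):
            pos = comp.start + offset
            if base is None or not 0 <= pos < len(bases):
                continue
            if cd.sequence[pos] != GAP:
                continue               # the parent's own sequence takes precedence
            if pos in filled and filled[pos] != base:
                conflicted.add(pos)    # two Components disagree inside a gap
            filled[pos] = base
    for pos, base in filled.items():
        bases[pos] = None if pos in conflicted else base
    return bases


def weakly_well_specified(cd: CD) -> bool:
    return all(b is not None for b in resolved(cd))


def strongly_well_specified(cd: CD) -> bool:
    if not weakly_well_specified(cd):
        return False
    for comp in cd.components:
        if comp.start is None or comp.start < 0:
            return False
        if comp.start + len(comp.definition.sequence) > len(cd.sequence):
            return False
        if not strongly_well_specified(comp.definition):
            return False
    return True


promoter = CD("promoter", "ttga")
cds = CD("cds", "atg")
device = CD("device", "....atg", [Component(promoter, 0), Component(cds)])
print(weakly_well_specified(device), strongly_well_specified(device))
```

Here `device` is weakly but not strongly well-specified: its gap is covered by the located promoter, but the `cds` Component has no stated position on the parent.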

@graik

graik commented Nov 17, 2016

I think we are on an excellent path here.

Case (1) (unambiguous)

I think this is a very useful direction of discussion. If we take the
notion of "unpublishably vague" as our target, then I would have two
suggestions:

"Weakly well-specified:" such a ComponentDefinition (or each member of
a Collection of CDs) must specify its entire sequence, or else every
unspecified portion of its sequence must be precisely specified by one or
more of its Components. Any unspecified portion covered by multiple
Components must have no conflicts in the sequences they specify.

In other words, there may be gaps in the overall sequence but these gaps
are explicit and clearly accounted for and there are no sequence conflicts.

Case (2) (fully specified)

"Strongly well-specified:" such a ComponentDefinition (or each member
of a Collection of CDs) must be weakly well-specified AND every Component
in the CD must also be strongly well-specified and also have a precisely
defined relationship between the sequence of its definition and the
sequence of the CD.

This means the complete overall sequence is given and there aren't any
ambiguities. The easiest way to handle this is indeed a "parent"
ComponentDefinition with the full sequence spelled out.

@jakebeal
Contributor Author

Close, but my intent is a little different: in case (1), there might be places that are conflicting (or apparently conflicting) down in the child components, but we don't worry about them as long as the parent has a sequence covering that area. This is to handle Mike's cases of assembly and optimization (though I think those may be better handled in another way, that's an orthogonal discussion).

In case (2), we have to understand every stage of composition, as well as the final sequence.

@jakebeal
Contributor Author

I think this is now handled by the "flattening-friendly" structure of SBOL3.

@cjmyers
Contributor

cjmyers commented Aug 24, 2020

Not sure. Think this one needs a bit further thought.

@cjmyers cjmyers closed this as completed Aug 24, 2020
@jakebeal
Contributor Author

@cjmyers If you think it needs further thought, shouldn't it still be open?

@jakebeal jakebeal reopened this Aug 24, 2020
@cjmyers
Contributor

cjmyers commented Aug 24, 2020

Oops pressed wrong button.

@cjmyers
Contributor

cjmyers commented Oct 8, 2020

Need an algorithm to be worked out.

@jakebeal
Contributor Author

This is generalized flattening, vs. #8, which is only about sequences.

@cjmyers cjmyers removed the enhancement New feature or request label Apr 27, 2022
@LukasBuecherl LukasBuecherl transferred this issue from SynBioDex/SBOL-specification Oct 6, 2022