SMILES property #392
Some initial comments below!
Co-authored-by: Matthew Evans <[email protected]>
Queries MUST treat the value of this property as a raw string, without SMILES-specific semantics.
That is, providers MUST NOT perform substructure search, just regular string comparison.
A molecule can have hundreds of valid SMILES descriptors. To determine whether a particular molecule is present in the database, a client would have to include all of them in a query.
I can imagine that such a query would be slow to execute.
A more efficient way would be to convert the query's SMILES string into a structure and then back into a SMILES string, using the same method that was used to generate the SMILES strings in the database.
These lines, however, explicitly forbid databases from implementing this method.
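To illustrate the concern, here is a minimal Python sketch of what raw string comparison means for a client. The `ENTRIES` list, the entry IDs, and both helper functions are made up for illustration and are not part of OPTIMADE or of any provider's API; the three spellings used are all valid SMILES for ethanol.

```python
# Toy in-memory "database"; the IDs and contents are hypothetical.
ENTRIES = [
    {"id": "example-1", "smiles": "CCO"},  # ethanol, written carbon-first
    {"id": "example-2", "smiles": "C=O"},  # formaldehyde
]


def match_smiles_raw(entries, query):
    """Return IDs whose stored SMILES equals the query character for character."""
    return [e["id"] for e in entries if e["smiles"] == query]


def match_any_spelling(entries, spellings):
    """Client-side workaround: match if any of the listed spellings is stored."""
    wanted = set(spellings)
    return [e["id"] for e in entries if e["smiles"] in wanted]


# The stored spelling is found...
print(match_smiles_raw(ENTRIES, "CCO"))
# ...but an equivalent spelling of the same molecule is not.
print(match_smiles_raw(ENTRIES, "OCC"))
# The client has to enumerate spellings and hope one of them is the stored one.
print(match_any_spelling(ENTRIES, ["OCC", "C(O)C", "CCO"]))
```

With only three spellings this is cheap; the worry raised here is that the list of equivalent spellings grows quickly for larger molecules.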
@JPBergsma, are you OK with leaving these two lines intact and marking the conversation as resolved?
From what I understand from the discussions in #392, it was agreed to implement the complex structure search functionality in a different way (e.g. by using SMARTS).
I would like to see the two main conversations resolved before approving.
Co-authored-by: Antanas Vaitkus <[email protected]>
smiles
~~~~~~

- **Description**: The SMILES (Simplified Molecular Input Line Entry System) representation of the structure.
I think we need a bit more clarification of the expected use.
How "much" of the structure should be described by the SMILES string for it to be valid here (e.g., that it should appear in the results when someone searches for it?) Do we need to require that every "site" in the OPTIMADE structure is present in the SMILES string? Obviously for nperiodic_dimensions=0 and a single molecule this makes sense, same for an nperiodic_dimensions=3 molecular crystal, but what about:
- co-crystal with two distinct molecules (does SMILES do something fancy for this already?)
- an inorganic surface with adsorbed molecule
- a hybrid perovskite structure with molecular unit as a cation
> How "much" of the structure should be described by the SMILES string for it to be valid here (e.g., that it should appear in the results when someone searches for it?) Do we need to require that every "site" in the OPTIMADE structure is present in the SMILES string?

This is a good point. I would say that every "site" has to be represented in SMILES. There surely will be situations where this is not attainable (i.e., OpenSMILES cannot express polymers and there will be difficulties in depicting mixture sites). Maybe at this point it would be easier to say that only the structures that are "expressible" using OpenSMILES should have smiles, that is, no nonstandard approximations should be done.

> Obviously for nperiodic_dimensions=0 and a single molecule this makes sense, same for an nperiodic_dimensions=3 molecular crystal, but what about:
> - co-crystal with two distinct molecules (does SMILES do something fancy for this already?)

SMILES can contain many distinct molecules; disconnected components are joined with . (if I get the question right)

> - an inorganic surface with adsorbed molecule
> - a hybrid perovskite structure with molecular unit as a cation

I would say these two fall under class "polymer", thus inexpressible.
> co-crystal with two distinct molecules (does SMILES do something fancy for this already?)

This can be described using a "dot bond", e.g. CuSO4.O.O
I suppose this was hashed through long ago (apologies), but honestly, this
makes no sense, and I think you would
find users quite dissatisfied.
Q: What's the use case here?
The whole idea of SMILES is that it doesn't matter how the user chooses to
format the SMILES. If this is implemented at a service, they should be
expected to treat it as all standard SMILES-accepting services do (PubChem,
NCI Chemical Identity Resolver, maybe COD?) -- with full SMILES semantics
and the ability to locally canonicalize the request (typically) so that on
their end they can do a regular string search. But that is an internal
choice of the service. For example, they might transform the SMILES to a
molecular graph and do the search internally that way (starting with
molecular formula, for instance). If it were a regular string search I
would have to have gone to this service previously, cached their SMILES
variant string and then used it for a search later. Why would I ever do
that?
"substructure searching" in the SMILES business means something different.
Substructure searching is done using SMARTS, not SMILES. Perhaps someday
SMARTS substructure searching could be implemented in OPTIMADE, but that is
a separate issue.
If you want to refer to substructure searching, perhaps: "SMARTS
substructure searching..."
Bob
(in Mumbai, GMT+5:30)
On Tue, Jul 5, 2022 at 2:43 PM Matthew Evans ***@***.***> wrote:
In optimade.rst:
> @@ -2439,6 +2439,22 @@ chemical\_formula\_anonymous
- A filter that matches an exactly given formula is :filter:`chemical_formula_anonymous="A2B"`.
+smiles
+~~~~~~
+
+- **Description**: The SMILES (Simplified Molecular Input Line Entry System) representation of the structure.
+- **Type**: string
+- **Requirements/Conventions**:
+
+ - **Support**: OPTIONAL support in implementations, i.e., MAY be :val:`null`.
+ - **Query**: Support for queries on this property is OPTIONAL.
+ Queries MUST treat the value of this property as a raw string, without SMILES-specific semantics.
+ That is, providers MUST NOT perform substructure search, just regular string comparison.
+ - Value MUST adhere to the `OpenSMILES specification v1.0 <http://opensmiles.org/opensmiles.html>`__.
+ - When structures or their parts cannot be unambiguously represented in SMILES according to OpenSMILES recommendations, using the guidelines from `Quirós et al. 2018 <https://doi.org/10.1186/s13321-018-0279-6>`__ is RECOMMENDED.
+ - Providers MAY canonicalize (i.e., use rules to establish stable order of atoms) produced SMILES representations, but this is not mandatory.
+ Generally, providers SHOULD NOT change the representation more frequently than the structure itself is modified.
Can we provide a couple of examples for people who are unfamiliar with
SMILES (without needing them to click out of the spec and read the
paper/OpenSMILES spec)? Below is taken from wikipedia, so please check if
the string is actually OpenSMILES compliant...
⬇️ Suggested change
- Generally, providers SHOULD NOT change the representation more frequently than the structure itself is modified.
+ Generally, providers SHOULD NOT change the representation more frequently than the structure itself is modified.
+
+ - **Examples**:
+ - caffeine: `CN1C=NC2=C1C(=O)N(C(=O)N2C)C`
--
Robert M. Hanson
Professor of Chemistry
St. Olaf College
Northfield, MN
http://www.stolaf.edu/people/hansonr
The use case is just to allow databases to include a SMILES representation of a structure in whatever SMILES format the database likes (edit: whatever format is compatible with the OpenSMILES specification v1.0). The field isn't really meant to allow any useful form of "search": the discussion in #368 seemed to conclude that there is too little standardization of SMILES to support such a search in a consistent, standardized way, except possibly via SMARTS. Hence, there is a separate PR for adding SMARTS in #398. I think any thoughts on how a standardized SMILES-based search could work are very welcome in the discussion in #368.

If I understand you correctly, you want the user to be able to give a SMILES string in "any format" and have the server internally handle the interpretation/conversion of that string, returning the entries whose SMILES is equivalent to the one given? Is there a benefit of doing the search this way, instead of formulating it as a SMARTS search where the full structure is the substructure?

(From the technical side I don't think the right way to express this kind of search is an expression that looks exactly like a string equality comparison. However, that is a technical discussion I think can be sorted out once we know more precisely how we would want a useful standardized SMILES search to work.)
On Mon, Aug 8, 2022 at 6:27 AM Rickard Armiento ***@***.***> wrote:
> @BobHanson
>
> > Q: What's the use case here?
>
> The use case is just to allow databases to include a SMILES representation
> of a structure in whatever SMILES format the database likes. The field
> isn't really meant to allow any useful form of "search": the discussion
> in #368 seemed to conclude that there is too little standardization of
> SMILES to support such a search in a consistent, standardized way, except
> possibly via SMARTS. Hence, there is a separate PR for adding SMARTS
> in #398. I think any thoughts on how a standardized SMILES-based search
> could work are very welcome in the discussion in #398 ( #368 ). If I
> understand you correctly, you want the user to be able to give a SMILES
> string in "any format" and have the server internally handle the
> interpretation/conversion of that SMILES string, returning the entries
> whose SMILES is equivalent to the one given?

That's right. This is the standard procedure. What a service does is to run
a very quick algorithm that transforms the queried SMILES to their
canonical form (that is, using whatever software was used to create their
saved SMILES strings). Then for them it is a straight string match. Very
simple. Really nothing to it.
I query "CC1=CC=CC=C1O" and you convert that to "c1(C)ccccc1O" because that
is how that is saved on your system. These are extremely simple algorithms
-- just create the molecular graph from the SMILES and then generate the
particular variant of SMILES from that that you need. It's just a quick
pass through a library method.
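
The canonicalize-then-match procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `build_index`, `find_by_smiles`, and the entry ID are hypothetical names, and `toy_canonicalize` is a hard-coded stand-in for whatever toolkit call a real service would use to rebuild the molecular graph and write out its own SMILES variant.

```python
from typing import Callable, Dict, List

Canonicalizer = Callable[[str], str]


def build_index(records: Dict[str, str], canonicalize: Canonicalizer) -> Dict[str, List[str]]:
    """Map each locally canonical SMILES to the IDs of the entries that have it."""
    index: Dict[str, List[str]] = {}
    for entry_id, smiles in records.items():
        index.setdefault(canonicalize(smiles), []).append(entry_id)
    return index


def find_by_smiles(query: str, index: Dict[str, List[str]], canonicalize: Canonicalizer) -> List[str]:
    """Canonicalize the query the same way as the stored data, then do a plain lookup."""
    return index.get(canonicalize(query), [])


# Stand-in only: a real service would call its cheminformatics toolkit here.
# The table covers just the two spellings quoted in the message above.
_TOY_TABLE = {
    "CC1=CC=CC=C1O": "c1(C)ccccc1O",
    "c1(C)ccccc1O": "c1(C)ccccc1O",
}


def toy_canonicalize(smiles: str) -> str:
    return _TOY_TABLE.get(smiles, smiles)


index = build_index({"hypothetical-id-1": "c1(C)ccccc1O"}, toy_canonicalize)
print(find_by_smiles("CC1=CC=CC=C1O", index, toy_canonicalize))
```

The canonicalizer is passed in as a parameter so that the same function is applied when building the index and when handling a query, which is the "locally canonical" property the procedure relies on.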
> Is there a benefit of doing the search this way, instead of formulating it
> as a SMARTS search where the full structure is the substructure?

Yes, certainly. SMARTS searching would be fantastic, but this is a more
specialized capability that takes more sophisticated cheminformatics tools
to do efficiently. So it is less likely that a service would have that
capability.
Consider the following four queries to PubChem:
[Four screenshots of PubChem queries; images not preserved.]
It would be much MUCH less useful if I had to already know that their
canonicalization gave "CC1=CC=CC=C1O". How would I ever know that? Pretty
sure they just did a quick conversion of those four SMILES variants to
their "canonical" (meaning "the version our software creates") form and
then, most probably, just did a straight string match. Milliseconds at most.
> (From the technical side I don't think the right way to express this kind
> of search is an expression that looks exactly like a string equality
> comparison. However, that is a technical discussion I think can be sorted
> out once we know more precisely how we would want a useful standardized
> SMILES search to work.)

Sure. The key here is that there is no "standard" necessary. Every service
chooses some particular toolkit to create SMILES strings. The
"canonicalization" is with respect to the fact that, given a molecular
graph, their software will always spit out the same SMILES string -- thus
"locally" canonical. There is no such thing as "universally" canonical. Too
many toolkits out there with their own idea of how to do this and what to
call "aromatic" and how to represent that.
Bob
|
I agree with Andrius. Just pointing out that the "." in SMILES may or may
not indicate multiple components. It all depends upon if there are
connecting links creating a bond between what is on the left of the period
and what is on the right.
CCCO.O two components, one of them propanol, the other water
C1CCO.O1 one component, propane-1,3-diol
Bob
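
The two examples above can be checked mechanically. The following is a minimal Python sketch, assuming well-formed input: `count_components` is a hypothetical helper that only tracks dots and ring-closure labels, and it neither validates the SMILES nor parses atoms and bonds.

```python
def count_components(smiles: str) -> int:
    """Count connected components of a SMILES string.

    Dot-separated fragments are merged whenever a ring-closure label
    opened in one fragment is closed in another.
    """
    parent = {0: 0}  # union-find over fragment indices

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    fragment = 0
    open_rings = {}  # ring-closure label -> fragment in which it was opened
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms: digits inside are isotopes, H counts or charges.
            i = smiles.index("]", i) + 1
            continue
        if ch == ".":
            fragment += 1
            parent[fragment] = fragment
        elif ch in "0123456789" or ch == "%":
            if ch == "%":
                # Two-digit ring-closure label, written as %nn.
                label = int(smiles[i + 1:i + 3])
                i += 2
            else:
                label = int(ch)
            if label in open_rings:
                # Closing the ring bonds the two fragments together.
                parent[find(open_rings.pop(label))] = find(fragment)
            else:
                open_rings[label] = fragment
        i += 1
    return len({find(f) for f in parent})


print(count_components("CCCO.O"))    # propanol and water
print(count_components("C1CCO.O1"))  # propane-1,3-diol
```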
|
I tend to disagree, conversion between different aromaticity depiction conventions is not straightforward. Richard L. Apodaca wrote a nice blog post summarizing the issue and another one proposing an algorithm for conversion, which I tried to implement and gave up due to its complexity. There surely are libraries for this task, but there is no guarantee they correctly process various corner cases.

As for SMARTS, I am not aware of a single specification. Different libraries understand SMARTS queries quite differently. Time for OpenSMARTS? 😅

Edit: There actually is a specification for OpenSMARTS!
…SMILES data type.
To me it seems like a tricky issue. Just comparing SMILES as strings is of little use if the client and the server do not agree on a canonical representation. Reconstructing the graph and then querying from it is doable but definitely more complicated than just passing a call to a database back-end. InChI and InChI key are supposed to be more standard (and InChI keys must be queryable as strings, otherwise they make no sense...); but in our hands InChI conversion also gives artefacts. What about "Inchified SMILES"? IMHO, to be useful for string searches, the SMILES string MUST be canonicalised in a reliable way, and this canonicalisation MUST be standard in OPTIMADE.
To my knowledge, Inchified SMILES is only implemented in Open Babel. Thus putting Inchified SMILES in the standard would likely push towards unified usage of Open Babel, and likely tie it to one particular version. In addition, I would personally like to avoid InChI, as recent versions of the InChI library are not free software, at least not as understood by the Debian Free Software Guidelines.
IMHO:
Agreed? Probably more options.

For structure matching, canonicalization is primarily valuable within a local context, because canonicalization only means that the particular algorithm used guarantees that regardless of how the structure's atoms and bonds are organized, the same string will be created -- provided that the same input options have been used (and there are many options!). And generally only within a local context do we know exactly which algorithm was used and which options were used with it. Furthermore, algorithms and implementations of algorithms are prone to multiple versioning. So one can never require any specifics regarding SMILES. Just to say, for example, "InChIfied SMILES" is not nearly enough. What version? What options? Would I somehow track down some old version and use it? Probably not. It's a classic rat's nest.

So, I am not in favor of anything more than "smiles" here. It is a very narrow use case where we need to know exactly what algorithm+options were used. If people feel that is necessary, then I suggest we follow the lead of PubChem (1,2-dimethylbenzene here) and allow for a second field that indicates at least something about the algorithm and options used:

InChI=1S/C8H10/c1-7-5-3-4-6-8(7)2/h3-6H,1-2H3
CC1=CC=CC=C1C

(Interesting that they do not indicate the options there -- here we see a Kekulé form of the SMILES, but we could have also seen Cc1ccccc1C, so perhaps the "canonical" OEChem option requires that. Probably. Maybe. Or it was an option.)

Just to make the point, if we go to ChEMBL, alas, we find that for them, the "Canonical SMILES" is, in fact, Cc1ccccc1C. ChEMBL does not specify what algorithm+options were used.

My personal preference is noncanonical Kekulé SMILES, which is the basis for SMILES searching targets (actual molecules), rather than aromatic SMILES, which are more useful for the pattern used to find the target, since it covers multiple Kekulé varieties.

Saulius, I'm guessing that at COD, when I type in a SMILES string, you immediately canonicalize it to match your database, right? Or do you just consider everything entered to be a SMARTS search?

Bob
I feel like approving this PR, since SMILES is undoubtedly a useful feature and easy to implement with just this spec. I am a bit wary about comparing SMILES as strings; this will only be useful if we know in advance which canonicalisation the server uses (if any). This limits the usefulness of such queries. But finding a canonical SMILES or other string representation of chemical structures is ongoing research, and we should not hold up merging the SMILES PR, provided that we can later change the mode of comparison.
And yes, we have another, more sophisticated search mechanism using SMARTS in the PR #398.
In #368, a SMILES property for structures was proposed. After some discussion the following consensus emerged:
Fixes #368.