SMILES property #392

merkys · 2021-12-03T16:01:10Z

In #368, a SMILES property for structures was proposed. After some discussion, the following consensus emerged:

The OpenSMILES specification is to be used for SMILES.
The type of this property is string. Thus all queries on this property have to treat it as a string, without analyzing the underlying chemical structure.
For inorganic structures/parts, the recommendations by Quirós et al. 2018 are suggested. (Disclosure: I am a co-author of this paper.)
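
To illustrate what "treat it as a string" implies, here is a minimal sketch of how a server might evaluate string filters on the property. The function name `raw_string_match` is hypothetical, the operator names follow the OPTIMADE filter language's string comparisons, and treating a null value as non-matching is an assumption of this sketch.

```python
from typing import Optional


def raw_string_match(value: Optional[str], operator: str, operand: str) -> bool:
    """Evaluate a string filter on a stored SMILES value as plain text.

    No chemistry is involved: the value is never parsed as a molecule.
    """
    if value is None:
        # The property is OPTIONAL and may be null; assumed non-matching here.
        return False
    if operator == "=":
        return value == operand
    if operator == "CONTAINS":
        return operand in value
    if operator == "STARTS WITH":
        return value.startswith(operand)
    if operator == "ENDS WITH":
        return value.endswith(operand)
    raise ValueError(f"unsupported operator: {operator}")


# Ethanol stored as "CCO": the equivalent spelling "OCC" does not match.
print(raw_string_match("CCO", "=", "CCO"))        # True
print(raw_string_match("CCO", "=", "OCC"))        # False

# Text containment is not substructure search: "[Os]" (osmium) contains the
# character "O" but no oxygen atom.
print(raw_string_match("[Os]", "CONTAINS", "O"))  # True
```

The last call shows why such queries carry no SMILES-specific semantics: a textual match says nothing about the chemical structure.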

Fixes #368.

ml-evs

Some initial comments below!

Co-authored-by: Matthew Evans <[email protected]>

JPBergsma · 2021-12-05T13:18:14Z

optimade.rst

+    Queries MUST treat the value of this property as a raw string, without SMILES-specific semantics.
+    That is, providers MUST NOT perform substructure search, just regular string comparison.


Suggested change

Queries MUST treat the value of this property as a raw string, without SMILES-specific semantics.

That is, providers MUST NOT perform substructure search, just regular string comparison.

A molecule can have hundreds of valid SMILES descriptors. To determine whether a particular molecule is present in the database, a client would have to include all of them in a query.
I can imagine that such a query would be slow to execute.
A more efficient way would be to convert the SMILES string of the query into a structure and then back into a SMILES string, using the same method that was used to generate the SMILES strings in the database.
These lines, however, explicitly forbid databases from implementing this method.
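
As a rough sketch of the client-side burden described above, the snippet below builds one filter that ORs together several spellings of the same molecule. The helper names are hypothetical; the filter syntax assumes the OPTIMADE filter language's double-quoted string comparison and `OR`, with backslashes and quotes escaped.

```python
def quote_filter_string(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def smiles_variants_filter(variants):
    """Join raw string comparisons with OR, one per SMILES spelling."""
    return " OR ".join(f"smiles={quote_filter_string(v)}" for v in variants)


# Four of the many valid spellings of ethanol.
ethanol_variants = ["CCO", "OCC", "C(O)C", "C(C)O"]
print(smiles_variants_filter(ethanol_variants))
# smiles="CCO" OR smiles="OCC" OR smiles="C(O)C" OR smiles="C(C)O"
```

Even for a three-atom molecule the list is incomplete, and it grows quickly with molecule size, which is the concern raised here.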

I have responded to these concerns already.

@JPBergsma, are you OK with leaving these two lines intact and marking the conversation as resolved?

From what I understand from the discussions in #392, it was agreed to implement the complex structure search functionality in a different way (e.g. by using SMARTS).

vaitkus

I would like to see the two main conversations resolved before approving.

Co-authored-by: Antanas Vaitkus <[email protected]>

ml-evs · 2022-07-05T09:12:57Z

optimade.rst

+smiles
+~~~~~~
+
+- **Description**: The SMILES (Simplified Molecular Input Line Entry System) representation of the structure.


I think we need a bit more clarification of the expected use.

How "much" of the structure should be described by the SMILES string for it to be valid here (e.g., that it should appear in the results when someone searches for it?) Do we need to require that every "site" in the OPTIMADE structure is present in the SMILES string? Obviously for nperiodic_dimensions=0 and a single molecule this makes sense, same for an nperiodic_dimensions=3 molecular crystal, but what about:

co-crystal with two distinct molecules (does SMILES do something fancy for this already?)

an inorganic surface with adsorbed molecule

a hybrid perovskite structure with molecular unit as a cation

How "much" of the structure should be described by the SMILES string for it to be valid here (e.g., that it should appear in the results when someone searches for it?) Do we need to require that every "site" in the OPTIMADE structure is present in the SMILES string?

This is a good point. I would say that every "site" has to be represented in SMILES. There surely will be situations where this is not attainable (e.g., OpenSMILES cannot express polymers, and there will be difficulties in depicting mixture sites). Maybe at this point it would be easier to say that only the structures that are "expressible" using OpenSMILES should have smiles, that is, no nonstandard approximations should be made.

Obviously for nperiodic_dimensions=0 and a single molecule this makes sense, same for an nperiodic_dimensions=3 molecular crystal, but what about:

co-crystal with two distinct molecules (does SMILES do something fancy for this already?)

SMILES can contain many distinct molecules; disconnected components are joined with "." (if I get the question right).

an inorganic surface with adsorbed molecule

a hybrid perovskite structure with molecular unit as a cation

I would say these two fall under class "polymer", thus inexpressible.

co-crystal with two distinct molecules (does SMILES do something fancy for this already?)

This can be described using a "dot bond", schematically CuSO4.O.O (in valid OpenSMILES the metal needs a bracket atom, e.g. [Cu+2].[O-]S(=O)(=O)[O-].O.O).

BobHanson · 2022-07-24T01:02:36Z

I suppose this was hashed through long ago (apologies), but honestly, this makes no sense, and I think you would find users quite dissatisfied.

Q: What's the use case here?

The whole idea of SMILES is that it doesn't matter how the user chooses to format the SMILES. If this is implemented at a service, they should be expected to treat it as all standard SMILES-accepting services do (PubChem, NCI Chemical Identity Resolver, maybe COD?) -- with full SMILES semantics and the ability to locally canonicalize the request (typically) so that on their end they can do a regular string search. But that is an internal choice of the service. For example, they might transform the SMILES to a molecular graph and do the search internally that way (starting with molecular formula, for instance).

If it were a regular string search I would have to have gone to this service previously, cached their SMILES variant string and then used it for a search later. Why would I ever do that?

"substructure searching" in the SMILES business means something different. Substructure searching is done using SMARTS, not SMILES. Perhaps someday SMARTS substructure searching could be implemented in OPTIMADE, but that is a separate issue. If you want to refer to substructure searching, perhaps: "SMARTS substructure searching..."

Bob (in Mumbai, GMT+5:30)

On Tue, Jul 5, 2022 at 2:43 PM Matthew Evans wrote:

In optimade.rst:

@@ -2439,6 +2439,22 @@ chemical\_formula\_anonymous
   - A filter that matches an exactly given formula is :filter:`chemical_formula_anonymous="A2B"`.
+smiles
+~~~~~~
+
+- **Description**: The SMILES (Simplified Molecular Input Line Entry System) representation of the structure.
+- **Type**: string
+- **Requirements/Conventions**:
+
+  - **Support**: OPTIONAL support in implementations, i.e., MAY be :val:`null`.
+  - **Query**: Support for queries on this property is OPTIONAL.
+    Queries MUST treat the value of this property as a raw string, without SMILES-specific semantics.
+    That is, providers MUST NOT perform substructure search, just regular string comparison.
+  - Value MUST adhere to the `OpenSMILES specification v1.0 <http://opensmiles.org/opensmiles.html>`__.
+  - When structures or their parts cannot be unambiguously represented in SMILES according to OpenSMILES recommendations, using the guidelines from `Quirós et al. 2018 <https://doi.org/10.1186/s13321-018-0279-6>`__ is RECOMMENDED.
+  - Providers MAY canonicalize (i.e., use rules to establish stable order of atoms) produced SMILES representations, but this is not mandatory.
+    Generally, providers SHOULD NOT change the representation more frequently than the structure itself is modified.

Can we provide a couple of examples for people who are unfamiliar with SMILES (without needing them to click out of the spec and read the paper/OpenSMILES spec)? Below is taken from Wikipedia, so please check if the string is actually OpenSMILES compliant...

Suggested change

-    Generally, providers SHOULD NOT change the representation more frequently than the structure itself is modified.
+    Generally, providers SHOULD NOT change the representation more frequently than the structure itself is modified.
+
+  - **Examples**:
+    - caffeine: `CN1C=NC2=C1C(=O)N(C(=O)N2C)C`

-- Robert M. Hanson, Professor of Chemistry, St. Olaf College, Northfield, MN, http://www.stolaf.edu/people/hansonr

rartino · 2022-08-08T11:27:04Z

@BobHanson

Q: What's the use case here?

The use case is just to allow databases to include a SMILES representation of a structure in whatever SMILES format the database likes (edit: whatever format is compatible with the OpenSMILES specification v1.0). The field isn't really meant to allow any useful form of "search": the discussion in #368 seemed to conclude that there is too little standardization of SMILES to support such a search in a consistent, standardized way, except possibly via SMARTS. Hence, there is a separate PR for adding SMARTS in #398.

I think any thoughts on how a standardized SMILES-based search could work are very welcome in the discussion in #368. If I understand you correctly, you want the user to be able to give a SMILES string in "any format" and have the server internally handle the interpretation/conversion of that SMILES string to return entries whose SMILES is equivalent to the one given? Is there a benefit of doing the search this way, instead of formulating it as a SMARTS search where the full structure is the substructure?

(From the technical side I don't think the right way to express this kind of search is an expression that looks exactly like a string equality comparison. However, that is a technical discussion I think can be sorted out once we know more precisely how we would want a useful standardized SMILES search to work.)

BobHanson · 2022-08-08T11:59:24Z

On Mon, Aug 8, 2022 at 6:27 AM Rickard Armiento wrote:

If I understand you correctly, you want the user to be able to give a SMILES string in "any format" and have the server internally handle the interpretation/conversion of that SMILES string to return entries whose SMILES is equivalent to the one given?

That's right. This is the standard procedure. What a service does is to run a very quick algorithm that transforms the queried SMILES to their canonical form (that is, using whatever software was used to create their saved SMILES strings). Then for them it is a straight string match. Very simple. Really nothing to it. I query "CC1=CC=CC=C1O" and you convert that to "c1(C)ccccc1O" because that is how that is saved on your system. These are extremely simple algorithms -- just create the molecular graph from the SMILES and then generate the particular variant of SMILES from that that you need. It's just a quick pass through a library method.
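
The procedure described above can be sketched as follows. This is only an illustration under stated assumptions: `find_entries`, `toy_canonicalize`, and the entry ID are hypothetical, and the lookup table stands in for whatever toolkit routine a service actually uses to rebuild the molecular graph and write its own SMILES flavour.

```python
from typing import Callable, Dict, List


def find_entries(query_smiles: str,
                 index: Dict[str, List[str]],
                 canonicalize: Callable[[str], str]) -> List[str]:
    """Return IDs of entries whose stored SMILES equals the locally
    canonical form of the query (a plain dictionary lookup)."""
    return index.get(canonicalize(query_smiles), [])


# Stand-in canonicalizer: a lookup table of known spellings of 2-methylphenol.
# A real service would call its cheminformatics toolkit here instead.
TOY_CANONICAL = {
    "CC1=CC=CC=C1O": "c1(C)ccccc1O",
    "Cc1ccccc1O": "c1(C)ccccc1O",
    "c1(C)ccccc1O": "c1(C)ccccc1O",
}


def toy_canonicalize(smiles: str) -> str:
    return TOY_CANONICAL.get(smiles, smiles)


# The database stores only its own "locally canonical" strings.
index = {"c1(C)ccccc1O": ["entry-1"]}

print(find_entries("CC1=CC=CC=C1O", index, toy_canonicalize))  # ['entry-1']
print(find_entries("CCO", index, toy_canonicalize))            # []
```

The client never needs to know the stored spelling; only the server's own canonicalizer has to be consistent with how the index was built.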

Is there a benefit of doing the search this way, instead of formulating it as a SMARTS search where the full structure is the substructure?

Yes, certainly. SMARTS searching would be fantastic, but it is a more specialized capability that takes more sophisticated cheminformatics tools to do efficiently, so it is less likely that a service would have that capability. Consider the following four queries to PubChem:

[four images of the queries, not preserved]

It would be much, MUCH less useful if I had to already know that their canonicalization gave "CC1=CC=CC=C1O". How would I ever know that? Pretty sure they just did a quick conversion of those four SMILES variants to their "canonical" (meaning "the version our software creates") form and then, most probably, just did a straight string match. Milliseconds at most.

(From the technical side I don't think the right way to express this kind of search is an expression that looks exactly like a string equality comparison. However, that is a technical discussion I think can be sorted out once we know more precisely how we would want a useful standardized SMILES search to work.)

Sure. The key here is that there is no "standard" necessary. Every service chooses some particular toolkit to create SMILES strings. The "canonicalization" is with respect to the fact that, given a molecular graph, their software will always spit out the same SMILES string -- thus "locally" canonical. There is no such thing as "universally" canonical. Too many toolkits out there with their own idea of how to do this and what to call "aromatic" and how to represent that. Bob

BobHanson · 2022-08-19T17:37:11Z

I agree with Andrius. Just pointing out that the "." in SMILES may or may not indicate multiple components. It all depends on whether there are connecting links creating a bond between what is on the left of the period and what is on the right.

CCCO.O: two components, one of them propanol, the other water
C1CCO.O1: one component, propane-1,3-diol

Bob
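
The point about "." can be checked with a small sketch that counts connected components while following ring-closure labels across dots. It is a simplified scanner rather than a full OpenSMILES parser: it assumes valid input, skips bracket atoms entirely, and the function name `count_components` is hypothetical.

```python
def count_components(smiles: str) -> int:
    """Count connected components, honouring ring-closure bonds that span a '.'."""
    parent = [0]  # union-find over dot-separated fragments

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    open_labels = {}  # ring-closure label -> fragment in which it was opened
    fragment = 0
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        label = None
        if ch == "[":
            # Digits inside a bracket atom are isotopes, H counts or charges,
            # not ring labels, so jump to the closing bracket.
            i = smiles.index("]", i)
        elif ch == ".":
            fragment += 1
            parent.append(fragment)
        elif ch == "%":
            # Two-digit ring label such as %12.
            label = int(smiles[i + 1:i + 3])
            i += 2
        elif ch in "0123456789":
            label = int(ch)
        if label is not None:
            if label in open_labels:
                # Closing the label: the bond joins the two fragments.
                parent[find(fragment)] = find(open_labels.pop(label))
            else:
                open_labels[label] = fragment
        i += 1
    return len({find(k) for k in range(len(parent))})


print(count_components("CCCO.O"))    # 2
print(count_components("C1CCO.O1"))  # 1
```

Simply splitting the string on "." would report two components for both inputs, which is wrong for the second one.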

merkys · 2022-09-23T09:34:17Z

Very simple. Really nothing to it. I query "CC1=CC=CC=C1O" and you convert that to "c1(C)ccccc1O" because that is how that is saved on your system. These are extremely simple algorithms -- just create the molecular graph from the SMILES and then generate the particular variant of SMILES from that that you need. It's just a quick pass through a library method.

I tend to disagree: conversion between different aromaticity depiction conventions is not straightforward. Richard L. Apodaca wrote a nice blog post summarizing the issue and another one proposing an algorithm for conversion, which I tried to implement but gave up on due to its complexity. There surely are libraries for this task, but there is no guarantee that they correctly process various corner cases.

As for SMARTS, I am not aware of a single specification. Different libraries understand SMARTS queries quite differently. Time for OpenSMARTS? 😅

Edit: There actually is a specification for OpenSMARTS!

…SMILES data type.

sauliusg · 2023-01-27T15:44:22Z

To me it seems like a tricky issue.

Just comparing SMILES as strings is of little use if the client and the server do not agree on canonical representation.

Reconstructing graph and then querying from it is doable but definitely more complicated than just passing a call to a database back-end.

InChI and InChI key are supposed to be more standard (and InChI keys must be queryable as strings, otherwise they make no sense...); but in our hands InChI conversion also gives artefacts. What about "Inchified SMILES"?

IMHO, to be useful for string searches, the SMILES string MUST be canonicalised in a reliable way, and this canonicalisation MUST be standard in OPTIMADE.

merkys · 2023-01-27T15:56:03Z

InChI and InChI key are supposed to be more standard (and InChI keys must be queryable as strings, otherwise they make no sense...); but in our hands InChI conversion also gives artefacts. What about "Inchified SMILES"?

To my knowledge, Inchified SMILES is only implemented in Open Babel. Thus putting Inchified SMILES in the standard would likely push towards unified usage of Open Babel, and likely tie to one particular version of it.

In addition, I would personally like to avoid InChI, as recent versions of InChI library are not free software, at least not as understood by the Debian Free Software Guidelines.

BobHanson · 2023-01-27T16:41:02Z

IMHO:

SMILES are fundamentally valuable with or without canonicalization.
The extent to which a SMILES is valuable depends upon the context.
One context is structure matching.
Another context is substructure searching.
Another context is 2D- or 3D-structure creation from 1D representation (SMILES or InChI, name, etc.)

Agreed? Probably more options.

For structure matching, canonicalization is primarily valuable within a local context, because canonicalization only means that the particular algorithm used guarantees that regardless of how the structure's atoms and bonds are organized, the same string will be created -- provided that the same input options have been used (and there are many options!). And generally only within a local context do we know what exact algorithm was used and what options were used with it.

Furthermore, algorithms and implementations of algorithms are prone to multiple versioning. So one can never require any specifics regarding SMILES. Just to say, for example, "InChIfied SMILES" is not nearly enough. What version? What options? Would I somehow track down some old version and use it? Probably not.

It's a classic rat's nest.

So, I am not in favor of anything more than "smiles" here. It is a very narrow use-case where we need to know exactly what algorithm+options were used. If people feel that is necessary, then I suggest we follow the lead of PubChem (shown here for 1,2-dimethylbenzene) and allow for a second field that indicates at least something about the algorithm and options used:

InChI=1S/C8H10/c1-7-5-3-4-6-8(7)2/h3-6H,1-2H3
Computed by InChI 1.0.6 (PubChem release 2021.05.07)

CC1=CC=CC=C1C
Computed by OEChem 2.3.0 (PubChem release 2021.05.07)

(Interesting that they do not indicate the options there -- Here we see a Kekulé form of the SMILES, but we could have also seen Cc1ccccc1C, so perhaps the "canonical" OEChem option requires that. Probably. Maybe. Or it was an option.)

Just to make the point, if we go to ChEMBL, alas, we find that for them, the "Canonical SMILES" is, in fact, Cc1ccccc1C.

ChEMBL does not specify what algorithm+options were used.
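
One way to picture the "second field" idea is a record that carries the SMILES together with whatever is known about how it was generated. This is only a sketch: the class and field names are hypothetical, not proposed OPTIMADE properties, and the comparison rule is deliberately crude.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SmilesRecord:
    smiles: str
    generated_by: Optional[str] = None  # hypothetical: toolkit name and version
    options: Optional[str] = None       # hypothetical: flavour/options, if known

    def comparable_with(self, other: "SmilesRecord") -> bool:
        """Plain string comparison is only meaningful when both strings are
        known to come from the same toolkit, version and options."""
        return (self.generated_by is not None
                and self.generated_by == other.generated_by
                and self.options == other.options)


# 1,2-dimethylbenzene as listed above: PubChem names the toolkit,
# ChEMBL does not say how its string was produced.
pubchem = SmilesRecord("CC1=CC=CC=C1C", generated_by="OEChem 2.3.0")
chembl = SmilesRecord("Cc1ccccc1C")

print(pubchem.smiles == chembl.smiles)   # False
print(pubchem.comparable_with(chembl))   # False
```

The two strings describe the same molecule, yet nothing in the records would let a client predict that a string match between them fails.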

My personal preference is noncanonical Kekulé SMILES, which is the basis for SMILES searching targets (actual molecules), rather than aromatic SMILES, which is more useful for the pattern used to find the target, since it covers multiple Kekulé varieties.

Saulius, I'm guessing that at COD, when I type in a SMILES string, you immediately canonicalize it to match your database, right? Or do you just consider everything entered to be a SMARTS search?

Bob

sauliusg

I feel like approving this PR, since SMILES is undoubtedly a useful feature and easy to implement with just this spec. I am a bit wary about comparing SMILES as strings; this will only be useful if we know in advance what canonicalisation the server uses (if any). This limits the usefulness of such queries. But finding a canonical SMILES or other string representation of chemical structures is ongoing research, and we should not stop merging the SMILES PR, provided that we can later change the mode of comparison.

And yes, we have another, more sophisticated search mechanism using SMARTS in the PR #398.

merkys added 2 commits December 3, 2021 17:48

Initial proposal for SMILES property (Materials-Consortia#368).

55be0a4

Describing canonicalization.

6f03d44

merkys mentioned this pull request Dec 3, 2021

Add SMILES property #368

Open

ml-evs reviewed Dec 3, 2021

merkys and others added 2 commits December 4, 2021 10:38

Update optimade.rst

75d1bba

Co-authored-by: Matthew Evans <[email protected]>

Update optimade.rst

8072ca8

Co-authored-by: Matthew Evans <[email protected]>

JPBergsma reviewed Dec 5, 2021

Explaining what "canonicalize" means.

72a5409

JPBergsma added the type/proposal label Dec 24, 2021

Merge branch 'develop' into SMILES-property

8485be4

JPBergsma mentioned this pull request May 17, 2022

Preview version of the OPTIMADE standard #403

Closed

Merge branch 'develop' into SMILES-property

408ce29

merkys requested review from rartino and sauliusg June 1, 2022 05:31

ml-evs added the PR/requires-discussion label Jun 1, 2022

merkys requested a review from ml-evs June 2, 2022 11:58

vaitkus reviewed Jun 3, 2022
