SMILES property #392

Open
wants to merge 16 commits into
base: develop

merkys
Member

@merkys merkys commented Dec 3, 2021

In #368, a SMILES property for structures was proposed. After some discussion, the following consensus emerged:

  • The OpenSMILES specification is to be used for SMILES.
  • The type of this property is String. Thus all queries on this property have to treat it as a String, without analyzing the underlying chemical structure.
  • For inorganic structures/parts, the recommendations by Quirós et al. 2018 are suggested. (Disclosure: I am a co-author of this paper.)

Fixes #368.

@merkys merkys mentioned this pull request Dec 3, 2021
Member

@ml-evs ml-evs left a comment

Some initial comments below!

merkys and others added 2 commits December 4, 2021 10:38
Co-authored-by: Matthew Evans <[email protected]>
Co-authored-by: Matthew Evans <[email protected]>
Comment on lines +2000 to +2001
Queries MUST treat the value of this property as a raw string, without SMILES-specific semantics.
That is, providers MUST NOT perform substructure search, just regular string comparison.
Contributor

Suggested change
Queries MUST treat the value of this property as a raw string, without SMILES-specific semantics.
That is, providers MUST NOT perform substructure search, just regular string comparison.

A molecule can have hundreds of valid SMILES descriptors. To determine whether a particular molecule is present in the database, a client would have to include all of them in a query.
I can imagine that such a query would be slow to execute.
A more efficient way would be to convert the SMILES string of the query into a structure and then back into a SMILES string, using the same method that was used to generate the SMILES strings in the database.
These lines, however, explicitly forbid databases from implementing this method.
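
To make the trade-off concrete, here is a minimal Python sketch contrasting plain string comparison with the "normalize the query first" approach described above. It is only an illustration: `toy_canonicalize` is a hypothetical lookup table standing in for whatever routine a database actually uses (a real service would call a cheminformatics toolkit), and the function names are assumptions, not part of the proposal.

```python
from typing import Callable, Iterable, List


def exact_match(query: str, stored: Iterable[str]) -> List[str]:
    """Plain string comparison, with no SMILES-specific semantics."""
    return [s for s in stored if s == query]


def normalized_match(
    query: str,
    stored: Iterable[str],
    canonicalize: Callable[[str], str],
) -> List[str]:
    """Rewrite the query with the database's own routine, then compare strings."""
    key = canonicalize(query)
    return [s for s in stored if s == key]


# Hypothetical stand-in for a real canonicalizer: a lookup table of a few
# equivalent spellings of o-cresol. This is NOT a canonicalization algorithm.
_TOY_FORMS = {
    "CC1=CC=CC=C1O": "Cc1ccccc1O",
    "c1(C)ccccc1O": "Cc1ccccc1O",
    "Cc1ccccc1O": "Cc1ccccc1O",
}


def toy_canonicalize(smiles: str) -> str:
    # Unknown strings are passed through unchanged.
    return _TOY_FORMS.get(smiles, smiles)


database = ["Cc1ccccc1O", "CCO"]

# The Kekulé spelling is not found by plain string comparison...
print(exact_match("CC1=CC=CC=C1O", database))
# ...but is found once the query is rewritten into the stored form.
print(normalized_match("CC1=CC=CC=C1O", database, toy_canonicalize))
```

The second call only works because client and server agree on the stored form, which is exactly the point under discussion.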

Member Author

Contributor

@JPBergsma, are you OK with leaving these two lines intact and marking the conversation as resolved?

From what I understand from the discussions in #392, it was agreed to implement the complex structure search functionality in a different way (e.g. by using SMARTS).

@JPBergsma JPBergsma added the type/proposal label Dec 24, 2021
@merkys merkys requested a review from vaitkus June 20, 2022 11:11
Contributor

@vaitkus vaitkus left a comment

I would like to see the two main conversations resolved before approving.

merkys and others added 3 commits June 27, 2022 10:11
Co-authored-by: Antanas Vaitkus <[email protected]>
Co-authored-by: Antanas Vaitkus <[email protected]>
Co-authored-by: Antanas Vaitkus <[email protected]>
smiles
~~~~~~

- **Description**: The SMILES (Simplified Molecular Input Line Entry System) representation of the structure.
Member

I think we need a bit more clarification of the expected use.

How "much" of the structure should be described by the SMILES string for it to be valid here (e.g., that it should appear in the results when someone searches for it?) Do we need to require that every "site" in the OPTIMADE structure is present in the SMILES string? Obviously for nperiodic_dimensions=0 and a single molecule this makes sense, same for an nperiodic_dimensions=3 molecular crystal, but what about:

  • co-crystal with two distinct molecules (does SMILES do something fancy for this already?)
  • an inorganic surface with adsorbed molecule
  • a hybrid perovskite structure with molecular unit as a cation

Member Author

> How "much" of the structure should be described by the SMILES string for it to be valid here (e.g., that it should appear in the results when someone searches for it?) Do we need to require that every "site" in the OPTIMADE structure is present in the SMILES string?

This is a good point. I would say that every "site" has to be represented in SMILES. There surely will be situations where this is not attainable (e.g., OpenSMILES cannot express polymers, and there will be difficulties in depicting mixture sites). Maybe at this point it would be easier to say that only structures that are "expressible" using OpenSMILES should have smiles, that is, no nonstandard approximations should be made.

> Obviously for nperiodic_dimensions=0 and a single molecule this makes sense, same for an nperiodic_dimensions=3 molecular crystal, but what about:
>
>   • co-crystal with two distinct molecules (does SMILES do something fancy for this already?)

SMILES can contain many distinct molecules; disconnected components are joined with "." (if I get the question right).

>   • an inorganic surface with adsorbed molecule
>   • a hybrid perovskite structure with molecular unit as a cation

I would say these two fall under the class "polymer", and are thus inexpressible.

Contributor

> co-crystal with two distinct molecules (does SMILES do something fancy for this already?)

This can be described using a "dot bond", e.g. CuSO4.O.O
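
As a rough illustration of the dot notation, the Python sketch below splits a SMILES string into its dot-separated parts. It rests on a simplifying assumption: no ring-closure digits pair up across a dot (in OpenSMILES, a string such as `C1.C1` still describes one connected molecule), so the result should be read as "written fragments", not as a reliable count of molecules. The function name is hypothetical.

```python
from typing import List


def split_written_fragments(smiles: str) -> List[str]:
    """Split a SMILES string on '.' into its written fragments.

    Assumption: ring-closure digits never pair up across a dot. If they do,
    the fragments returned here are still bonded to each other.
    """
    return [part for part in smiles.split(".") if part]


# Acetic acid written together with two water molecules.
print(split_written_fragments("CC(=O)O.O.O"))
```

A toolkit that builds the molecular graph is needed to count connected components reliably.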

@BobHanson

BobHanson commented Jul 24, 2022 via email

@rartino
Contributor

rartino commented Aug 8, 2022

@BobHanson

Q: What's the use case here?

The use case is just to allow databases to include a SMILES representation of a structure in whatever SMILES format the database likes (edit: whatever format is compatible with the OpenSMILES specification v1.0). The field isn't really meant to allow any useful form of "search" - the discussion in #368 seemed to conclude that there is too little standardization of SMILES to support such a search in a consistent, standardized way, except possibly via SMARTS. Hence, there is a separate PR for adding SMARTS in #398.

I think any thoughts on how a standardized SMILES-based search could work are very welcome in the discussion in #368. If I understand you correctly, you want the user to be able to give a SMILES string in "any format" and have the server internally handle the interpretation/conversion of that SMILES string to return entries whose SMILES is equivalent to the one given? Is there a benefit to doing the search this way, instead of formulating it as a SMARTS search where the full structure is the substructure?

(From the technical side I don't think the right way to express this kind of search is an expression that looks exactly like a string equality comparison. However, that is a technical discussion I think can be sorted out once we know more precisely how we would want a useful standardized SMILES search to work.)
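
For reference, the plain string comparison this PR does allow would look like any other OPTIMADE string filter. The Python sketch below builds such a query URL. It assumes the property is exposed under the name `smiles`, as proposed here, and that string literals in the filter are double-quoted with backslash escapes; the base URL and the function name are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://example.org/optimade/v1/structures"


def smiles_equality_url(smiles: str) -> str:
    """Build a query URL that compares the `smiles` property as a raw string."""
    # SMILES may contain backslashes (cis/trans bonds), so escape them and any
    # double quotes before embedding the value in the filter's string literal.
    escaped = smiles.replace("\\", "\\\\").replace('"', '\\"')
    filter_expr = f'smiles="{escaped}"'
    return f"{BASE_URL}?{urlencode({'filter': filter_expr})}"


print(smiles_equality_url("CC1=CC=CC=C1O"))
```

Such a query matches only entries whose stored string is character-for-character identical to the one given.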

@BobHanson

BobHanson commented Aug 8, 2022 via email

@BobHanson

BobHanson commented Aug 19, 2022 via email

Co-authored-by: Matthew Evans <[email protected]>
@merkys
Member Author

merkys commented Sep 23, 2022

> Very simple. Really nothing to it. I query "CC1=CC=CC=C1O" and you convert that to "c1(C)ccccc1O" because that is how that is saved on your system. These are extremely simple algorithms -- just create the molecular graph from the SMILES and then generate the particular variant of SMILES from that that you need. It's just a quick pass through a library method.

I tend to disagree: conversion between different aromaticity depiction conventions is not straightforward. Richard L. Apodaca wrote a nice blog post summarizing the issue and another one proposing an algorithm for conversion, which I tried to implement and gave up on due to its complexity. There surely are libraries for this task, but there is no guarantee that they correctly process various corner cases.

As for SMARTS, I am not aware of a single specification. Different libraries understand SMARTS queries quite differently. Time for OpenSMARTS? 😅

Edit: There actually is a specification for OpenSMARTS!

merkys added a commit to merkys/OPTIMADE that referenced this pull request Nov 25, 2022
@merkys merkys mentioned this pull request Dec 7, 2022
@sauliusg
Contributor

To me it seems like a tricky issue.

Just comparing SMILES as strings is of little use if the client and the server do not agree on canonical representation.

Reconstructing graph and then querying from it is doable but definitely more complicated than just passing a call to a database back-end.

InChI and InChI key are supposed to be more standard (and InChI keys must be queryable as strings, otherwise they make no sense...); but in our hands InChI conversion also gives artefacts. What about "Inchified SMILES"?

IMHO, to be useful for string searches, the SMILES string MUST be canonicalised in a reliable way, and this canonicalisation MUST be standard in OPTIMADE.

@merkys
Member Author

merkys commented Jan 27, 2023

> InChI and InChI key are supposed to be more standard (and InChI keys must be queryable as strings, otherwise they make no sense...); but in our hands InChI conversion also gives artefacts. What about "Inchified SMILES"?

To my knowledge, Inchified SMILES is only implemented in Open Babel. Thus putting Inchified SMILES in the standard would likely push towards unified usage of Open Babel, and likely tie it to one particular version of Open Babel.

In addition, I would personally like to avoid InChI, as recent versions of the InChI library are not free software, at least not as understood by the Debian Free Software Guidelines.

@BobHanson

IMHO:

  1. SMILES are fundamentally valuable with or without canonicalization.
  2. The extent to which a SMILES is valuable depends upon the context.
  3. One context is structure matching.
  4. Another context is substructure searching.
  5. Another context is 2D- or 3D-structure creation from 1D representation (SMILES or InChI, name, etc.)

Agreed? Probably more options.

For structure matching, canonicalization is primarily valuable within a local context, because canonicalization only means that the particular algorithm used guarantees that regardless of how the structure's atoms and bonds are organized, the same string will be created -- provided that the same input options have been used (and there are many options!). And generally only within a local context do we know what exact algorithm was used and what options were used with it.

Furthermore, algorithms and implementations of algorithms are prone to multiple versioning. So one can never require any specifics regarding SMILES. Just to say, for example, "InChIfied SMILES" is not nearly enough. What version? What options? Would I somehow track down some old version and use it? Probably not.

It's a classic rat's nest.

So, I am not in favor of anything more than "smiles" here. It is a very narrow use case where we need to know exactly what algorithm+options were used. If people feel that is necessary, then I suggest we follow the lead of PubChem (see its entry for 1,2-dimethylbenzene) and allow for a second field that indicates at least something about the algorithm and options used:

InChI=1S/C8H10/c1-7-5-3-4-6-8(7)2/h3-6H,1-2H3
Computed by InChI 1.0.6 (PubChem release 2021.05.07)

CC1=CC=CC=C1C
Computed by OEChem 2.3.0 (PubChem release 2021.05.07)

(Interesting that they do not indicate the options there -- Here we see a Kekulé form of the SMILES, but we could have also seen Cc1ccccc1C, so perhaps the "canonical" OEChem option requires that. Probably. Maybe. Or it was an option.)

Just to make the point, if we go to ChEMBL, alas, we find that for them, the "Canonical SMILES" is, in fact, Cc1ccccc1C.

ChEMBL does not specify what algorithm+options were used.
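
One way to picture the "second field" idea is the Python sketch below, which pairs a SMILES string with an optional note on how it was generated. The class, field and function names are hypothetical and not part of this PR. The rule it encodes is an assumption drawn from the discussion above: string comparison is only trusted when both strings declare the same generator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnnotatedSmiles:
    smiles: str
    # Free-text provenance, e.g. "OEChem 2.3.0"; None when the source is silent.
    generator: Optional[str] = None


def comparable_as_strings(a: AnnotatedSmiles, b: AnnotatedSmiles) -> bool:
    """True only if both strings declare the same generator."""
    return a.generator is not None and a.generator == b.generator


# Values quoted in the comment above; ChEMBL states no generator.
pubchem = AnnotatedSmiles("CC1=CC=CC=C1C", "OEChem 2.3.0")
chembl = AnnotatedSmiles("Cc1ccccc1C")

print(comparable_as_strings(pubchem, chembl))
```

Even a matching generator string says nothing about the options used, so this is at best a necessary condition for a meaningful comparison.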

My personal preference is noncanonical Kekulé SMILES, which is the basis for SMILES searching targets (actual molecules), rather than aromatic SMILES, which is more useful for the pattern used to find the target, since it covers multiple Kekulé varieties.

Saulius, I'm guessing that at COD, when I type in a SMILES string, you immediately canonicalize it to match your database, right? Or do you just consider everything entered to be a SMARTS search?

Bob

Contributor

@sauliusg sauliusg left a comment

I feel like approving this PR, since SMILES is undoubtedly a useful feature and easy to implement with just this spec. I am a bit wary about comparing SMILES as strings; this will only be useful if we know in advance what canonicalisation the server uses (if any). This limits the usefulness of such queries. But finding a canonical SMILES or other string representation of chemical structures is ongoing research, and we should not hold up merging the SMILES PR, provided that we can later change the mode of comparison.

And yes, we have another, more sophisticated search mechanism using SMARTS in PR #398.

merkys added a commit to Materials-Consortia/namespace-cheminformatics that referenced this pull request Nov 15, 2024
Labels
PR/requires-discussion type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus.
Development

Successfully merging this pull request may close these issues.

Add SMILES property
7 participants