Initial version
samoturk committed Nov 16, 2017
1 parent 0910e47 commit d9cb3cc
Showing 22 changed files with 298,956 additions and 2 deletions.
7 changes: 7 additions & 0 deletions .gitignore
@@ -99,3 +99,10 @@ ENV/

# mypy
.mypy_cache/


# GhostWriter
*.backup

# PyCharm
.idea/
90 changes: 88 additions & 2 deletions README.md
@@ -1,2 +1,88 @@
# mol2vec
Mol2vec - an unsupervised machine learning approach to learn vector representations of molecular substructures
# Mol2vec
[Mol2vec](https://chemrxiv.org/articles/Mol2vec_Unsupervised_Machine_Learning_Approach_with_Chemical_Intuition/5513581) - an unsupervised machine learning approach to learn vector representations of molecular substructures



## Requirements
* **Python 3** (Python 2.x is [not supported](http://www.python3statement.org/))
* [NumPy](http://www.numpy.org/)
* [matplotlib](https://matplotlib.org/)
* [seaborn](https://seaborn.pydata.org/)
* [pandas](http://pandas.pydata.org/)
* [IPython](https://ipython.org/)
* [RDKit](http://www.rdkit.org/docs/Install.html)
* [scikit-learn](http://scikit-learn.org/stable/)
* [gensim](https://radimrehurek.com/gensim/)
* [tqdm](https://pypi.python.org/pypi/tqdm)
* [joblib](https://pythonhosted.org/joblib/)

## Installation
`pip install git+https://github.com/samoturk/mol2vec`

#### Documentation
To build the documentation, install `sphinx`, `numpydoc` and `sphinx_rtd_theme`, then run `make html` in the `docs` directory.

## Usage
### As a Python module
```python
from mol2vec import features
from mol2vec import helpers
```
The first line imports functions to generate "sentences" from molecules and train the model; the second line imports helper functions for depictions. See the [examples](https://github.com/samoturk/mol2vec/notebooks) directory for more details.

### Command line tool
Mol2vec is an unsupervised machine learning approach to learn vector representations of molecular substructures.
The command line application provides subcommands to prepare a corpus from molecular data (SDF or SMILES), train a Mol2vec model,
and featurize new samples.

#### Subcommand 'corpus'

Generates a corpus for training a Mol2vec model. Morgan identifiers (up to the selected radius) serve as words, and molecules as sentences. Words are ordered in the sentence according to the atom order in the canonical SMILES (generated while preparing the corpus), and for each atom the identifiers are listed starting at radius 0.
The corpus subcommand can also replace rare identifiers with a selected string (e.g. UNK), which can later be used to represent completely new substructures (i.e. at the featurization step). NOTE: The corpus with replaced uncommon identifiers is saved in a separate file whose name ends with "_{selected string to replace uncommon}". Since this is an unsupervised method, we recommend using as many molecules as possible (e.g. the complete ZINC database).
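
The word ordering described above can be pictured with a small, dependency-free sketch. It is not the package's implementation: the function name `build_sentence`, the nested-dictionary input and the identifier values are all hypothetical, and a real run would obtain the identifiers from RDKit rather than from a hand-written table.

```python
from typing import Dict, List


def build_sentence(
    identifiers: Dict[int, Dict[int, int]],
    atom_order: List[int],
    radius: int,
) -> List[str]:
    """Order identifiers atom by atom, listing radius 0 first for each atom.

    `identifiers` maps atom index -> {radius: identifier} (hypothetical layout).
    """
    sentence = []
    for atom in atom_order:
        for r in range(radius + 1):
            ident = identifiers.get(atom, {}).get(r)
            if ident is not None:  # an atom may have no identifier at a larger radius
                sentence.append(str(ident))
    return sentence


# Made-up identifiers for a three-atom molecule (assumption, for illustration only).
toy_identifiers = {
    0: {0: 2246728737, 1: 3542456614},
    1: {0: 2245384272, 1: 1173125914},
    2: {0: 864662311},
}
sentence = build_sentence(toy_identifiers, atom_order=[0, 1, 2], radius=1)
print(" ".join(sentence))
```

Each molecule then contributes one such space-separated line of "words" to the corpus.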

##### Performance:
Corpus generation using 20M compounds with replacement of uncommon identifiers takes 6 hours on 4 cores.

##### Example:
To prepare a corpus using radius 1 and 4 cores, replacing uncommon identifiers that appear <= 3 times with 'UNK', run:
`mol2vec corpus -i mols.smi -o mols.cp -r 1 -j 4 --uncommon UNK --threshold 3`
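
As a rough illustration of what `--uncommon UNK --threshold 3` means for the corpus, the sketch below counts every identifier and replaces those that occur 3 times or fewer. The function name `replace_uncommon` and the toy corpus are assumptions made for this example; the actual tool may implement the step differently.

```python
from collections import Counter
from typing import List


def replace_uncommon(
    corpus: List[List[str]],
    threshold: int = 3,
    uncommon: str = "UNK",
) -> List[List[str]]:
    """Replace every word that appears <= threshold times in the whole corpus."""
    counts = Counter(word for sentence in corpus for word in sentence)
    return [
        [word if counts[word] > threshold else uncommon for word in sentence]
        for sentence in corpus
    ]


# Toy corpus: "a" occurs 5 times, "b" twice, "c" once.
toy_corpus = [["a", "b", "a"], ["a", "c", "a"], ["b", "a"]]
cleaned = replace_uncommon(toy_corpus, threshold=3, uncommon="UNK")
print(cleaned)
```

Because rare identifiers collapse into a single token, the trained model ends up with one vector that can stand in for substructures it has never seen.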


#### Subcommand 'train'

Trains a Mol2vec model using a previously prepared corpus.

##### Performance:
Training the model on 20M sentences takes ~2 hours on 4 cores.

##### Example:
To train a Mol2vec model on a corpus with replaced uncommon identifiers using Skip-gram, a window size of 10, 300-dimensional vectors and 4 cores, run:
`mol2vec train -i mols.cp_UNK -o model.pkl -d 300 -w 10 -m skip-gram --threshold 3 -j 4`
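
To give a feel for what the window size controls in Skip-gram, the sketch below enumerates the (target, context) pairs that a fixed window produces for one sentence. It is a simplified illustration with a hypothetical function name (`skipgram_pairs`); real Word2vec implementations typically add sampling tricks that are omitted here, and no training takes place.

```python
from typing import Iterator, List, Tuple


def skipgram_pairs(sentence: List[str], window: int) -> Iterator[Tuple[str, str]]:
    """Yield (target, context) pairs for every word within `window` positions."""
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                yield target, sentence[j]


# Placeholder words standing in for Morgan identifiers.
pairs = list(skipgram_pairs(["w0", "w1", "w2", "w3"], window=1))
print(pairs)
```

A larger window (such as the 10 used in the command above) pairs each identifier with more distant neighbours in the sentence.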


#### Subcommand 'featurize'

Featurizes new samples using a pre-trained Mol2vec model. The result is saved in a CSV file with columns for molecule identifiers, canonical SMILES (generated during featurization), any SD fields from the input SDF file, and finally mol2vec-{0 to n-1}, where n is the dimensionality of the embeddings in the model.
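
The sketch below illustrates one way such a feature row could be assembled. It assumes that a molecule vector is the sum of its substructure vectors and that unseen identifiers fall back to the `UNK` vector; neither detail is stated in this README, and the function names (`featurize_sentence`, `column_names`) and toy vectors are hypothetical.

```python
from typing import Dict, List


def featurize_sentence(
    sentence: List[str],
    vectors: Dict[str, List[float]],
    uncommon: str = "UNK",
) -> List[float]:
    """Sum word vectors, using the `uncommon` vector for unseen words (assumption)."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in sentence:
        vec = vectors.get(word, vectors.get(uncommon))
        if vec is None:  # no fallback vector available: skip the word
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


def column_names(dim: int) -> List[str]:
    """Column headers mol2vec-0 ... mol2vec-(n-1), as described above."""
    return [f"mol2vec-{i}" for i in range(dim)]


# Toy 3-dimensional embeddings (made-up values).
toy_vectors = {
    "111": [1.0, 0.0, 2.0],
    "222": [0.5, 0.5, 0.5],
    "UNK": [0.0, 1.0, 0.0],
}
molecule_vector = featurize_sentence(["111", "222", "999"], toy_vectors)
print(dict(zip(column_names(3), molecule_vector)))
```

Here the identifier "999" is not in the toy model, so the `UNK` vector is used in its place.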

##### Example:
To featurize new samples using pre-trained embeddings, with the vector trained on uncommon identifiers representing new substructures, run:
`mol2vec featurize -i new.smi -o new.csv -m model.pkl -r 1 --uncommon UNK`


For more details on an individual subcommand, run:
`mol2vec $sub-command --help`

### How to cite?
```bib
@article{Jaeger2017,
author = "Sabrina Jaeger and Simone Fulle and Samo Turk",
title = "{Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition}",
year = "2017",
month = "10",
url = "https://chemrxiv.org/articles/Mol2vec_Unsupervised_Machine_Learning_Approach_with_Chemical_Intuition/5513581",
doi = "10.26434/chemrxiv.5513581.v1"
}
```

### Sponsor info
Initial development was supported by [BioMed X Innovation Center](https://bio.mx), Heidelberg.
2 changes: 2 additions & 0 deletions TODO.md
@@ -0,0 +1,2 @@
- [ ] Sphinx
- [ ] pytest
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = python -msphinx
SPHINXPROJ = mol2vec
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
181 changes: 181 additions & 0 deletions docs/conf.py
@@ -0,0 +1,181 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# mol2vec documentation build configuration file, created by
# sphinx-quickstart on Mon Oct 30 21:41:26 2017.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('.'))


# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.autodoc',
'sphinx.ext.mathjax',
'sphinx.ext.viewcode',
'sphinx.ext.autosummary',
'numpydoc']

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = 'mol2vec'
copyright = '2017, Sabrina Jaeger, Simone Fulle, Samo Turk'
author = 'Sabrina Jaeger, Simone Fulle, Samo Turk'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.

from mol2vec import __version__ as ver
# The short X.Y version.
version = ver
# The full version, including alpha/beta/rc tags.
release = ver

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# These patterns also affect html_static_path and html_extra_path
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False


# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# This is required for the alabaster theme
# refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars
html_sidebars = {
'**': [
'navigation.html',
'relations.html', # needs 'show_related': True theme option to display
'searchbox.html',
]
}


# -- Options for HTMLHelp output ------------------------------------------

# Output file base name for HTML help builder.
htmlhelp_basename = 'mol2vecdoc'


# -- Options for LaTeX output ---------------------------------------------

latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',

# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',

# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',

# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'mol2vec.tex', 'mol2vec Documentation',
'Sabrina Jaeger, Simone Fulle, Samo Turk', 'manual'),
]


# -- Options for manual page output ---------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'mol2vec', 'mol2vec Documentation',
[author], 1)
]


# -- Options for Texinfo output -------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'mol2vec', 'mol2vec Documentation',
author, 'mol2vec', 'One line description of project.',
'Miscellaneous'),
]

# -- Numpydoc

numpydoc_use_plots = True
numpydoc_show_class_members = True
numpydoc_show_inherited_class_members = True
numpydoc_class_members_toctree = True

autosummary_generate = True