Initial version

comocheng · Nov 16, 2017 · d9cb3cc
1 parent 0910e47
commit d9cb3cc
Showing 22 changed files with 298,956 additions and 2 deletions.
diff --git a/.gitignore b/.gitignore
@@ -99,3 +99,10 @@ ENV/
 
 # mypy
 .mypy_cache/
+
+
+# GhostWriter
+*.backup
+
+# PyCharm
+.idea/
diff --git a/README.md b/README.md
@@ -1,2 +1,88 @@
-# mol2vec
-Mol2vec - an unsupervised machine learning approach to learn vector representations of molecular substructures
+# Mol2vec
+[Mol2vec](https://chemrxiv.org/articles/Mol2vec_Unsupervised_Machine_Learning_Approach_with_Chemical_Intuition/5513581) - an unsupervised machine learning approach to learn vector representations of molecular substructures
+
+
+
+## Requirements
+* **Python 3** (Python 2.x is [not supported](http://www.python3statement.org/))
+* [NumPy](http://www.numpy.org/)
+* [matplotlib](https://matplotlib.org/)
+* [seaborn](https://seaborn.pydata.org/)
+* [pandas](http://pandas.pydata.org/)
+* [IPython](https://ipython.org/)
+* [RDKit](http://www.rdkit.org/docs/Install.html)
+* [scikit-learn](http://scikit-learn.org/stable/)
+* [gensim](https://radimrehurek.com/gensim/)
+* [tqdm](https://pypi.python.org/pypi/tqdm)
+* [joblib](https://pythonhosted.org/joblib/)
+
+## Installation
+`pip install git+https://github.com/samoturk/mol2vec`
+
+#### Documentation
+To build the documentation, install `sphinx`, `numpydoc` and `sphinx_rtd_theme`, then run `make html` in the `docs` directory.
+
+## Usage
+### As a Python module
+```python
+from mol2vec import features
+from mol2vec import helpers
+```
+The first line imports functions to generate "sentences" from molecules and train the model; the second line imports functions useful for depictions. Check the [examples](https://github.com/samoturk/mol2vec/notebooks) directory for more details.
+
+### Command line tool
+Mol2vec is an unsupervised machine learning approach to learn vector representations of molecular substructures.
+The command line application has subcommands to prepare a corpus from molecular data (SDF or SMILES), train a Mol2vec model,
+and featurize new samples.
+
+#### Subcommand 'corpus'
+
+Generates a corpus to train a Mol2vec model. It generates Morgan identifiers (up to the selected radius), which represent words (molecules are sentences). Words are ordered in the sentence according to the atom order in the canonical SMILES (generated when generating the corpus), and at each atom starting with the identifier at radius 0.  
+    The corpus subcommand also optionally replaces rare identifiers with a selected string (e.g. UNK), which can later be used to represent completely new substructures (i.e. at the featurization step). NOTE: It saves the corpus with replaced uncommon identifiers in a separate file whose name ends with "_{selected string to replace uncommon}". Since this is an unsupervised method, we recommend using as many molecules as possible (e.g. the complete ZINC database).
+
+##### Performance:  
+Corpus generation using 20M compounds with replacement of uncommon identifiers takes 6 hours on 4 cores.  
+
+##### Example:  
+To prepare a corpus using radius 1 and 4 cores, replacing uncommon identifiers that appear <= 3 times with 'UNK', run:
+        `mol2vec corpus -i mols.smi -o mols.cp -r 1 -j 4 --uncommon UNK --threshold 3`
+
+
+#### Subcommand 'train'
+
+Trains a Mol2vec model using a previously prepared corpus.
+
+##### Performance:
+Training the model on 20M sentences takes ~2 hours on 4 cores.
+
+##### Example:
+To train a Mol2vec model on a corpus with replaced uncommon identifiers using Skip-gram, window size 10, generating 300-dimensional vectors and using 4 cores, run:
+        `mol2vec train -i mols.cp_UNK -o model.pkl -d 300 -w 10 -m skip-gram --threshold 3 -j 4`
+
+
+#### Subcommand 'featurize'
+
+Featurizes new samples using a pre-trained Mol2vec model. It saves the result in a CSV file with columns for molecule identifiers, canonical SMILES (generated during featurization), any SD fields from the input SDF file, and finally mol2vec-{0 to n-1}, where n is the dimensionality of the embeddings in the model.  
+
+##### Example:
+To featurize new samples using pre-trained embeddings, with the vector trained on uncommon identifiers representing new substructures, run:
+        `mol2vec featurize -i new.smi -o new.csv -m model.pkl -r 1 --uncommon UNK`
+
+
+For more detail on an individual subcommand, run:
+    `mol2vec $sub-command --help`
+
+### How to cite?
+```bib
+@article{Jaeger2017,
+author = "Sabrina Jaeger and Simone Fulle and Samo Turk",
+title = "{Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition}",
+year = "2017",
+month = "10",
+url = "https://chemrxiv.org/articles/Mol2vec_Unsupervised_Machine_Learning_Approach_with_Chemical_Intuition/5513581",
+doi = "10.26434/chemrxiv.5513581.v1"
+}
+```
+
+### Sponsor info
+Initial development was supported by [BioMed X Innovation Center](https://bio.mx), Heidelberg.
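
The rare-identifier replacement described in the README's `corpus` subcommand section can be illustrated with a minimal sketch. This is not the mol2vec implementation: the function name `replace_uncommon` and the toy identifiers are hypothetical, and the rule "replace identifiers that appear `threshold` times or fewer" is an assumption taken from the README example ("appear <= 3 times").

```python
from collections import Counter


def replace_uncommon(sentences, threshold=3, uncommon="UNK"):
    """Replace identifiers seen `threshold` times or fewer with `uncommon`.

    Hypothetical helper for illustration only; the real mol2vec code may
    count and replace identifiers differently.
    """
    # Count every identifier ("word") across all molecules ("sentences").
    counts = Counter(word for sentence in sentences for word in sentence)
    # Keep frequent identifiers, swap rare ones for the placeholder string.
    return [
        [word if counts[word] > threshold else uncommon for word in sentence]
        for sentence in sentences
    ]


# Toy corpus: each inner list is one molecule written as identifier "words".
corpus = [
    ["2246728737", "864942730", "2246728737"],
    ["2246728737", "2246728737", "3537119515"],
]
cleaned = replace_uncommon(corpus, threshold=3, uncommon="UNK")
print(cleaned)
```

Because the placeholder then appears in the training corpus, it gets its own embedding, which is what allows the `featurize` step to map previously unseen identifiers to that vector.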
diff --git a/TODO.md b/TODO.md
@@ -0,0 +1,2 @@
+- [ ] Sphinx
+- [ ] pytest
diff --git a/docs/Makefile b/docs/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line.
+SPHINXOPTS    =
+SPHINXBUILD   = python -msphinx
+SPHINXPROJ    = mol2vec
+SOURCEDIR     = .
+BUILDDIR      = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/conf.py b/docs/conf.py
@@ -0,0 +1,181 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+#
+# mol2vec documentation build configuration file, created by
+# sphinx-quickstart on Mon Oct 30 21:41:26 2017.
+#
+# This file is execfile()d with the current directory set to its
+# containing dir.
+#
+# Note that not all possible configuration values are present in this
+# autogenerated file.
+#
+# All configuration values have a default; values that are commented out
+# serve to show the default.
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+import os
+import sys
+sys.path.insert(0, os.path.abspath('.'))
+
+
+# -- General configuration ------------------------------------------------
+
+# If your documentation needs a minimal Sphinx version, state it here.
+#
+# needs_sphinx = '1.0'
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = ['sphinx.ext.autodoc',
+    'sphinx.ext.mathjax',
+    'sphinx.ext.viewcode',
+    'sphinx.ext.autosummary',
+    'numpydoc']
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffixes as a list of strings:
+#
+# source_suffix = ['.rst', '.md']
+source_suffix = '.rst'
+
+# The master toctree document.
+master_doc = 'index'
+
+# General information about the project.
+project = 'mol2vec'
+copyright = '2017, Sabrina Jaeger, Simone Fulle, Samo Turk'
+author = 'Sabrina Jaeger, Simone Fulle, Samo Turk'
+
+# The version info for the project you're documenting, acts as replacement for
+# |version| and |release|, also used in various other places throughout the
+# built documents.
+
+from mol2vec import __version__ as ver
+# The short X.Y version.
+version = ver
+# The full version, including alpha/beta/rc tags.
+release = ver
+
+# The language for content autogenerated by Sphinx. Refer to documentation
+# for a list of supported languages.
+#
+# This is also used if you do content translation via gettext catalogs.
+# Usually you set "language" from the command line for these cases.
+language = None
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# These patterns also affect html_static_path and html_extra_path
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+# The name of the Pygments (syntax highlighting) style to use.
+pygments_style = 'sphinx'
+
+# If true, `todo` and `todoList` produce output, else they produce nothing.
+todo_include_todos = False
+
+
+# -- Options for HTML output ----------------------------------------------
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'sphinx_rtd_theme'
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further.  For a list of options available for each theme, see the
+# documentation.
+#
+# html_theme_options = {}
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+# Custom sidebar templates, must be a dictionary that maps document names
+# to template names.
+#
+# This is required for the alabaster theme
+# refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars
+html_sidebars = {
+    '**': [
+        'navigation.html',
+        'relations.html',  # needs 'show_related': True theme option to display
+        'searchbox.html',
+    ]
+}
+
+
+# -- Options for HTMLHelp output ------------------------------------------
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'mol2vecdoc'
+
+
+# -- Options for LaTeX output ---------------------------------------------
+
+latex_elements = {
+    # The paper size ('letterpaper' or 'a4paper').
+    #
+    # 'papersize': 'letterpaper',
+
+    # The font size ('10pt', '11pt' or '12pt').
+    #
+    # 'pointsize': '10pt',
+
+    # Additional stuff for the LaTeX preamble.
+    #
+    # 'preamble': '',
+
+    # Latex figure (float) alignment
+    #
+    # 'figure_align': 'htbp',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title,
+#  author, documentclass [howto, manual, or own class]).
+latex_documents = [
+    (master_doc, 'mol2vec.tex', 'mol2vec Documentation',
+     'Sabrina Jaeger, Simone Fulle, Samo Turk', 'manual'),
+]
+
+
+# -- Options for manual page output ---------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [
+    (master_doc, 'mol2vec', 'mol2vec Documentation',
+     [author], 1)
+]
+
+
+# -- Options for Texinfo output -------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+#  dir menu entry, description, category)
+texinfo_documents = [
+    (master_doc, 'mol2vec', 'mol2vec Documentation',
+     author, 'mol2vec', 'One line description of project.',
+     'Miscellaneous'),
+]
+
+# -- Numpydoc
+
+numpydoc_use_plots = True
+numpydoc_show_class_members = True
+numpydoc_show_inherited_class_members = True
+numpydoc_class_members_toctree = True
+
+autosummary_generate = True