diff --git a/index.html b/index.html
index 5e98bb7..3f14c6a 100644
--- a/index.html
+++ b/index.html
@@ -147,7 +147,8 @@

@@ -421,12 +425,14 @@

- Results for name conversion (NC) and property prediction (PP)
+ The following table shows the results for name conversion (NC)
+ and property prediction (PP)
  tasks. The metrics include exact match (EM), validity (Valid),
- root mean square error (RMSE), and accuracy (Acc), where EM and Valid are in percentage.
+ root mean square error (RMSE), and accuracy (Acc), where EM, Valid, and Acc are in percentage.
  results table 1
- Results for molecule captioning (MC), molecule generation (MG),
+ The following table shows results for molecule captioning (MC),
+ molecule generation (MG),
  forward synthesis (FS), and retrosynthesis (RS). The metrics include METEOR score (METEOR), exact match (EM), Morgan fingerprint-based tanimoto similarity
@@ -436,16 +442,23 @@

  Main takeaways:
  (1) LlaSMol models significantly outperform the existing LLMs on all the tasks,
- underscoring the effectiveness of the proposed SMolInstruct dataset and the benefits of fine-
- tuning.
+ underscoring the effectiveness of the proposed SMolInstruct dataset and the benefits of fine-tuning.
  (2) Our four LlaSMol models show substantial differences in their performance, and LlasMolMistral achieves the best, emphasizing the significant impact of base models on downstream tasks
- (3) Our LlaSMol models exhibit comparable performance to SoTA models even with
- only a small proportion of parameters being tuned (40M, 0.59%),
- showing great potential to surpass task-specific models and work as universal models capable of
- addressing
- multiple chemistry tasks.
+
+ (3) Although LlaSMol models do not outperform SoTA models on all the tasks, they demonstrate considerable
+ potential for further improvements.
+ Compared to previous efforts, they greatly narrowed the gap between LLMs and SoTA task-specific models.
+ Remarkably, LlaSMolMistral attains such performance with only a small proportion of its parameters
+ fine-tuned (41.9M, 0.58\%). Our further experiments suggest its immense
+ potential to surpass task-specific models through more extensive fine-tuning and serve as a strong
+ foundation model for chemistry applications.
+
+
+
+ Please check out our paper for findings regarding SMILES vs. SELFIES, the benefits of SMILES canonicalization, multi-task synergies, and more.
+

@@ -465,7 +478,7 @@

- If our paper or related resources prove valuable to your research, we kindly ask for citation. Please feel free
+ If our paper or related resources are valuable to your research/applications, we kindly ask for citation. Please feel free
  to contact us with any inquiries.

@article{yu2024llasmol,
   title={LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset},
diff --git a/static/images/ChemLLMFig.png b/static/images/ChemLLMFig.png
deleted file mode 100644
index a5d3e63..0000000
Binary files a/static/images/ChemLLMFig.png and /dev/null differ
diff --git a/static/images/ChemLLMFig.svg b/static/images/ChemLLMFig.svg
deleted file mode 100644
index 86030b0..0000000
--- a/static/images/ChemLLMFig.svg
+++ /dev/null
@@ -1 +0,0 @@
-Molecule Description
-Molecule Captioning (MC)
-Query: Describe this molecule: <SMILES> CCOC(=O)C1=CN=CN1[C@H](C)C1=CC=CC=C1</SMILES>
-Response: The molecule is an imidazole derivative with short-acting sedative, hypnotic, and general anesthetic properties. Etomidate appears to have gamma-aminobutyric acid (GABA) like effects, mediated through GABA-A receptor. The action enhances the inhibitory effect of GABA on the central nervous system by causing chloride channel opening events which leads to membrane hyperpolarization.
-Molecule Generation (MG)
-Query: Give me a molecule that satisfies the conditions outlined in the description: The molecule is a member of the class of tripyrroles that is a red-coloured pigment with antibiotic properties produced by Serratia marcescens. It has a role as an antimicrobial agent, a biological pigment, a bacterial metabolite, an apoptosis inducer and an antineoplastic agent. It is a tripyrrole, an aromatic ether and a ring assembly.
-Response: Here is a potential molecule: <SMILES> CCCCCC1=C(C)NC(/C=C2\N=C(C3=CC=CN3)C=C2OC)=C1 </SMILES>
-Property Prediction
-ESOL (PP-ESOL)
-Query: How soluble is <SMILES> CC(C)Cl </SMILES> ?
-Response: Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L.
-LIPO (PP-LIPO)
-Query: Predict the octanol/water distribution coefficient logD under the circumstance of pH 7.4 for <SMILES> NC(=O)C1=CC=CC=C1O </SMILES>.
-Response: <NUMBER> 1.090 </NUMBER>
-BBBP (PP-BBBP)
-Query: Is blood-brain barrier permeability (BBBP) a property of <SMILES> CCNC(=O)/C=C/C1=CC=CC(Br)=C1 </SMILES>?
-Response: <BOOLEAN> Yes </BOOLEAN>
-ClinTox (PP-ClinTox)
-Query: Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?
-Response: <BOOLEAN> No </BOOLEAN>
-HIV (PP-HIV)
-Query: Can <SMILES> CC1=CN(C2C=CCCC2O)C(=O)NC1=O </SMILES> serve as an inhibitor of HIV replication?
-Response: <BOOLEAN> No </BOOLEAN>
-SIDER (PP-SIDER)
-Query: Are there any known side effects of <SMILES> CC1=CC(C)=C(NC(=O)CN(CC(=O)O)CC(=O)O)C(C)=C1Br</SMILES> affecting the heart?
-Response: <BOOLEAN> No </BOOLEAN>
-Name Conversion
-IUPAC to Molecular Formula (NC-I2F)
-Query: What is the molecular formula of the compound with this IUPAC name <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC>?
-Response: <MOLFORMULA> C15H11NO </MOLFORMULA>
-IUPAC to SMILES (NC-I2S)
-Query: Could you provide the SMILES for <IUPAC> 4-ethyl-4-methyloxolan-2-one </IUPAC>?
-Response: Of course. It's <SMILES> CCC1(C)COC(=O)C1</SMILES>
-SMILES to Molecular Formula (NC-S2F)
-Query: Given the SMILES representation <SMILES> S=P1(N(CCCl)CCCl)NCCCO1 </SMILES>, what would be its molecular formula?
-Response: It is <MOLFORMULA> C7H15Cl2N2OPS </MOLFORMULA> .
-SMILES to IUPAC (NC-S2I)
-Query: Translate the given SMILES formula of a molecule <SMILES> CCC(C)C1CNCCCNC1 </SMILES> into its IUPAC name.
-Response: <IUPAC> 3-butan-2-yl-1,5-diazocane </IUPAC>
-Chemical Reaction
-Forward Prediction (FS)
-Query: <SMILES> NC1=CC=C2OCOC2=C1.O=CO</SMILES> Based on the reactants and reagents given above, suggest a possible product.
-Response: A possible product can be <SMILES> O=CNC1=CC=C2OCOC2=C1 </SMILES>
-Retrosynthesis (RS)
-Query: Identify possible reactants that could have been used to create the specified product. <SMILES> CC1=CC=C(N)N=C1N </SMILES>
-Response: <SMILES> CC(C#N)CCC#N.N </SMILES>
-O=C=OCO2
\ No newline at end of file
diff --git a/static/images/tables/o_1.png b/static/images/tables/o_1.png
index d85c43d..c3c49b0 100644
Binary files a/static/images/tables/o_1.png and b/static/images/tables/o_1.png differ
diff --git a/static/images/tables/o_2.png b/static/images/tables/o_2.png
index a9416a9..73adaad 100644
Binary files a/static/images/tables/o_2.png and b/static/images/tables/o_2.png differ
diff --git a/static/images/tables/statistics.png b/static/images/tables/statistics.png
new file mode 100644
index 0000000..c987849
Binary files /dev/null and b/static/images/tables/statistics.png differ
diff --git a/static/images/tables/tasks.png b/static/images/tables/tasks.png
deleted file mode 100644
index 4fe9e65..0000000
Binary files a/static/images/tables/tasks.png and /dev/null differ
diff --git a/static/images/task_overview.svg b/static/images/task_overview.svg
new file mode 100644
index 0000000..90bb380
--- /dev/null
+++ b/static/images/task_overview.svg
@@ -0,0 +1 @@
+Name Conversion
+IUPAC to Molecular Formula (NC-I2F)
+Query: What is the molecular formula of the compound with this IUPAC name <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC>?
+Response: <MOLFORMULA> C15H11NO </MOLFORMULA>
+IUPAC to SMILES (NC-I2S)
+Query: Could you provide the SMILES for <IUPAC> 4-ethyl-4-methyloxolan-2-one </IUPAC>?
+Response: Of course. It's <SMILES> CCC1(C)COC(=O)C1</SMILES>
+SMILES to Molecular Formula (NC-S2F)
+Query: Given the SMILES representation <SMILES> S=P1(N(CCCl)CCCl)NCCCO1 </SMILES>, what would be its molecular formula?
+Response: It is <MOLFORMULA> C7H15Cl2N2OPS </MOLFORMULA> .
+SMILES to IUPAC (NC-S2I)
+Query: Translate the given SMILES formula of a molecule <SMILES> CCC(C)C1CNCCCNC1 </SMILES> into its IUPAC name.
+Response: <IUPAC> 3-butan-2-yl-1,5-diazocane </IUPAC>
+Chemical Reaction
+Forward Synthesis (FS)
+Query: <SMILES> NC1=CC=C2OCOC2=C1.O=CO</SMILES> Based on the reactants and reagents given above, suggest a possible product.
+Response: A possible product can be <SMILES> O=CNC1=CC=C2OCOC2=C1 </SMILES>
+Retrosynthesis (RS)
+Query: Identify possible reactants that could have been used to create the specified product. <SMILES> CC1=CC=C(N)N=C1N </SMILES>
+Response: <SMILES> CC(C#N)CCC#N.N </SMILES>
+Molecule Description
+Molecule Captioning (MC)
+Query: Describe this molecule: <SMILES> CCOC(=O)C1=CN=CN1[C@H](C)C1=CC=CC=C1</SMILES>
+Response: The molecule is an imidazole derivative with short-acting sedative, hypnotic, and general anesthetic properties. Etomidate appears to have gamma-aminobutyric acid (GABA) like effects, mediated through GABA-A receptor. The action enhances the inhibitory effect of GABA on the central nervous system by causing chloride channel opening events which leads to membrane hyperpolarization.
+Molecule Generation (MG)
+Query: Give me a molecule that satisfies the conditions outlined in the description: The molecule is a member of the class of tripyrroles that is a red-coloured pigment with antibiotic properties produced by Serratia marcescens. It has a role as an antimicrobial agent, a biological pigment, a bacterial metabolite, an apoptosis inducer and an antineoplastic agent. It is a tripyrrole, an aromatic ether and a ring assembly.
+Response: Here is a potential molecule: <SMILES> CCCCCC1=C(C)NC(/C=C2\N=C(C3=CC=CN3)C=C2OC)=C1 </SMILES>
+Property Prediction
+ESOL (PP-ESOL)
+Query: How soluble is <SMILES> CC(C)Cl </SMILES> ?
+Response: Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L.
+LIPO (PP-LIPO)
+Query: Predict the octanol/water distribution coefficient logD under the circumstance of pH 7.4 for <SMILES> NC(=O)C1=CC=CC=C1O </SMILES>.
+Response: <NUMBER> 1.090 </NUMBER>
+BBBP (PP-BBBP)
+Query: Is blood-brain barrier permeability (BBBP) a property of <SMILES> CCNC(=O)/C=C/C1=CC=CC(Br)=C1 </SMILES>?
+Response: <BOOLEAN> Yes </BOOLEAN>
+ClinTox (PP-ClinTox)
+Query: Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?
+Response: <BOOLEAN> No </BOOLEAN>
+HIV (PP-HIV)
+Query: Can <SMILES> CC1=CN(C2C=CCCC2O)C(=O)NC1=O </SMILES> serve as an inhibitor of HIV replication?
+Response: <BOOLEAN> No </BOOLEAN>
+SIDER (PP-SIDER)
+Query: Are there any known side effects of <SMILES> CC1=CC(C)=C(NC(=O)CN(CC(=O)O)CC(=O)O)C(C)=C1Br</SMILES> affecting the heart?
+Response: <BOOLEAN> No </BOOLEAN>
+O=C=OCO2
\ No newline at end of file
diff --git a/static/video/LlaSMol.mp4 b/static/video/LlaSMol.mp4
index 677d7cb..987d2d8 100644
Binary files a/static/video/LlaSMol.mp4 and b/static/video/LlaSMol.mp4 differ
diff --git a/test_generation.ipynb b/test_generation.ipynb
deleted file mode 100644
index b71532a..0000000
--- a/test_generation.ipynb
+++ /dev/null
@@ -1,174 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "\n",
-      "===================================BUG REPORT===================================\n",
-      "Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues\n",
-      "================================================================================\n",
-      "CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...\n",
-      "CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so\n",
-      "CUDA SETUP: Highest compute capability among GPUs detected: 8.6\n",
-      "CUDA SETUP: Detected CUDA version 118\n",
-      "CUDA SETUP: Loading binary /home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/yu.3737/Software/miniconda3/envs/lora did not contain libcudart.so as expected! Searching further paths...\n",
-      "  warn(msg)\n",
-      "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('FILE')}\n",
-      "  warn(msg)\n",
-      "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/modulefiles')}\n",
-      "  warn(msg)\n",
-      "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/share/lmod/lmod/share/man')}\n",
-      "  warn(msg)\n",
-      "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('() {  eval $($LMOD_DIR/ml_cmd \"$@\")\\n}')}\n",
-      "  warn(msg)\n",
-      "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('vs/workbench/api/node/extensionHostProcess')}\n",
-      "  warn(msg)\n",
-      "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//matplotlib_inline.backend_inline'), PosixPath('module')}\n",
-      "  warn(msg)\n"
-     ]
-    }
-   ],
-   "source": [
-    "from generation import LlaSMolGeneration, canonicalize_smiles_in_text"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "91482b88007548129672708d278f7f09",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "Loading checkpoint shards:   0%|          | 0/2 [00:00 S=P1(N(CCCl)CCCl)NCCCO1 , what would be its molecular formula?',\n",
-       "  'real_input_text': '[INST] Given the SMILES representation <SMILES> S=P1(N(CCCl)CCCl)NCCCO1 </SMILES> , what would be its molecular formula? [/INST]',\n",
-       "  'output': ['It is <MOLFORMULA> C7H15Cl2N2OPS </MOLFORMULA> . ']}]"
-      ]
-     },
-     "execution_count": 10,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "generator.generate(\n",
-    "    'Given the SMILES representation <SMILES> S=P1(N(CCCl)CCCl)NCCCO1 </SMILES>, what would be its molecular formula?'\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from utils.smiles_canonicalization import canonicalize_molecule_smiles"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 7,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "'CCC1(C)CCOC1=O'"
-      ]
-     },
-     "execution_count": 7,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "canonicalize_molecule_smiles('CCC1(C)CCOC1=O')"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "'CCC1(C)COC(=O)C1'"
-      ]
-     },
-     "execution_count": 8,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "canonicalize_molecule_smiles('CCC1(C)COC(=O)C1')"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "lora",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.9.18"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}