diff --git a/index.html b/index.html index 5e98bb7..3f14c6a 100644 --- a/index.html +++ b/index.html @@ -147,7 +147,8 @@
- TL;DR: SMolInstruct is an instruction dataset for chemistry that focuses on small - molecules. It contains 14 meticulously selected tasks and over 3M carefully curated - samples. Based on this dataset, we train LlaSMol, a series of large language models that - significantly outperform GPT-4 and achieve the best performance among existing - LLMs for chemistry. + TL;DR: We propose SMolInstruct, an instruction dataset for chemistry that focuses on small + molecules; and LlaSMol, a series of large language models that + substantially outperform existing LLMs on chemistry tasks.
- Abstract: Chemistry plays a crucial role in many domains, such as drug discovery and - material science. - While large language models (LLMs) such as GPT-4 exhibit remarkable capabilities on natural language - processing tasks, - existing work shows their performance on chemistry tasks is discouragingly low. In this paper, however, we - demonstrate - that our developed LLMs can achieve very strong results on a comprehensive set of chemistry tasks, - outperforming the most advanced GPT-4 across all the tasks by a substantial margin - (e.g., 94.5% EM for converting SMILES to Formula vs. GPT-4's 16.4%; - 32.9% EM for Retrosynthesis vs. GPT-4's ~0%) and approaching the SoTA task-specific - models. - The key to our success is a large-scale, comprehensive, high-quality dataset for instruction tuning named - SMolInstruct. - It contains 14 meticulously selected chemistry tasks and over three million high-quality samples, laying a - solid foundation - for training and evaluating LLMs for chemistry. Based on SMolInstruct, we fine-tune a set of open-source - LLMs, among which, - we find that Mistral serves as the best base model for chemistry tasks. We further conduct analysis on the - impact of - trainable parameters, providing insights for future research. + Chemistry plays a crucial role in many domains, + such as drug discovery and material science. + While large language models (LLMs) such as + GPT-4 exhibit remarkable capabilities on natural + language processing tasks, existing research indicates that their performance on chemistry tasks + is discouragingly low. In this paper, however, + we demonstrate that our developed LLMs can + achieve very strong results on a comprehensive + set of chemistry tasks, outperforming the most + advanced GPT-4 and Claude 3 Opus by a substantial margin + (e.g., 93.2% EM for converting SMILES to Formula vs. GPT-4's 4.8% and Claude 3 Opus's 9.2%; 32.9% EM + for Retrosynthesis vs. GPT-4's ~0.0% and Claude 3 Opus's 1.1%). To accomplish this, we propose + SMolInstruct, a large-scale, comprehensive, + and high-quality dataset for instruction tuning. + It contains 14 selected chemistry tasks and over + three million samples, laying a solid foundation + for training and evaluating LLMs for chemistry. + Using SMolInstruct, we fine-tune a set of + open-source LLMs, among which we find that + Mistral serves as the best base model for chemistry tasks. Our analysis further demonstrates the + critical role of the proposed dataset in driving the + performance improvements.
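As a concrete reference for the SMILES-to-Formula numbers quoted above, here is a minimal sketch (assuming RDKit; this is not the paper's evaluation code) that derives a molecular formula from a SMILES string. The example molecule and its formula come from the repo's demo notebook.

```python
# Minimal sketch of the SMILES-to-Formula task scored by exact match (EM).
# Assumes RDKit is installed; NOT the paper's evaluation pipeline.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

smiles = "S=P1(N(CCCl)CCCl)NCCCO1"  # example from the repo's demo notebook
mol = Chem.MolFromSmiles(smiles)

# CalcMolFormula emits the Hill-order formula; under a strict string
# comparison (our assumption of how EM is computed), a prediction must
# equal this output exactly to count.
print(rdMolDescriptors.CalcMolFormula(mol))  # C7H15Cl2N2OPS
```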
@@ -321,29 +324,30 @@
The merits of SMolInstruct:
- (1) Large-Scale. SMolInstruct consists of 3.4M distinct samples and 1.6M distinct - molecules, - with a diverse range of sizes, structures, and properties, showcasing an - extensive coverage of diverse chemical knowledge. + Large-Scale. SMolInstruct consists of 3.3M samples and 1.6M distinct molecules, with a + diverse range of + sizes, structures, and properties, showcasing extensive coverage of diverse chemical knowledge.
- (2) Comprehensive. SMolInstruct contains 4 types of chemical tasks (14 tasks in total), - emerging - as the most comprehensive instruction tuning dataset for small molecules. Notably, the tasks are - meticulously selected to build a strong chemistry foundation. + Comprehensive. SMolInstruct contains 4 types of + chemical tasks (14 tasks in total), emerging as the most comprehensive instruction tuning dataset for + small molecules. + Notably, the tasks are meticulously selected to build a strong + chemistry foundation model and to adapt to real-world applications.
- (3) High-Quality. Rigorous processing steps have been implemented to exclude - problematic and low- - quality samples. Along with careful data splitting and canonicalization of SMILES representations - SMolInstruct stands as a high-quality resource valuable for future research. + High-Quality. Rigorous processing steps have been + implemented to exclude problematic and low-quality samples. Along with careful data splitting and + canonicalization + of SMILES representations, SMolInstruct stands as a + high-quality resource valuable for future research.
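A note on canonicalization: two syntactically different SMILES strings can denote the same molecule, so normalizing them is what keeps data splits leak-free and exact-match scoring meaningful. Below is a minimal sketch using RDKit; the repo ships its own utils.smiles_canonicalization helper, so treat this as an illustrative stand-in, not the project's exact routine.

```python
# Illustrative SMILES canonicalization with RDKit (an assumed stand-in for
# the repo's utils.smiles_canonicalization helper).
from rdkit import Chem

def canonicalize(smiles: str) -> str:
    """Parse a SMILES string and re-emit it in RDKit's canonical form."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return Chem.MolToSmiles(mol)  # canonical by default

# Two spellings of benzene collapse to one canonical string.
assert canonicalize("C1=CC=CC=C1") == canonicalize("c1ccccc1")
```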
- Results for name conversion (NC) and property prediction (PP) + The following table shows the results for name conversion (NC) and property prediction (PP) tasks. The metrics include exact match (EM), validity (Valid), - root mean square error (RMSE), and accuracy (Acc), where EM and Valid are in percentage. + root mean square error (RMSE), and accuracy (Acc), where EM, Valid, and Acc are in percentage.
- Results for molecule captioning (MC), molecule generation (MG), + The following table shows results for molecule captioning (MC), molecule generation (MG), forward synthesis (FS), and retrosynthesis (RS). The metrics include METEOR score (METEOR), exact match (EM), Morgan fingerprint-based Tanimoto similarity @@ -436,16 +442,23 @@
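To make the fingerprint metric concrete, here is a hedged sketch of a Morgan fingerprint-based Tanimoto similarity in RDKit; the radius and bit-vector size are illustrative assumptions, not the paper's settings, and the two example SMILES are taken from the repo's demo notebook.

```python
# Assumed implementation of a Morgan fingerprint Tanimoto similarity;
# fingerprint parameters are illustrative, not the paper's configuration.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_tanimoto(smiles_a: str, smiles_b: str) -> float:
    """Tanimoto similarity between Morgan bit-vector fingerprints."""
    fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
        for s in (smiles_a, smiles_b)
    ]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])

# Two isomeric lactones from the repo's demo notebook.
print(morgan_tanimoto("CCC1(C)CCOC1=O", "CCC1(C)COC(=O)C1"))
```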
Main takeaways:
(1) LlaSMol models significantly outperform the existing LLMs on all the tasks, - underscoring the effectiveness of the proposed SMolInstruct dataset and the benefits of fine- - tuning. + underscoring the effectiveness of the proposed SMolInstruct dataset and the benefits of fine-tuning.
(2) Our four LlaSMol models show substantial differences in their performance, and LlaSMolMistral achieves the best performance, emphasizing the significant impact of base models on downstream tasks.
-(3) Our LlaSMol models exhibit comparable performance to SoTA models even with - only a small proportion of parameters being tuned (40M, 0.59%), - showing great potential to surpass task-specific models and work as universal models capable of - addressing - multiple chemistry tasks.
+ (3) Although LlaSMol models do not outperform SoTA models on all the tasks, they demonstrate considerable + potential for further improvements. + Compared to previous efforts, they greatly narrowed the gap between LLMs and SoTA task-specific models. + Remarkably, LlaSMolMistral attains such performance with only a small proportion of its parameters + fine-tuned (41.9M, 0.58%). Our further experiments suggest its immense + potential to surpass task-specific models through more extensive fine-tuning and serve as a strong + foundation model for chemistry applications.
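For readers curious what "only a small proportion of parameters fine-tuned" looks like in practice, below is a minimal LoRA-style sketch using Hugging Face PEFT; the rank, scaling factor, and target modules are illustrative assumptions, not the recipe behind LlaSMolMistral's 41.9M trainable parameters.

```python
# Hedged LoRA sketch with Hugging Face PEFT; hyperparameters here are
# assumptions, not the paper's recipe.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
config = LoraConfig(
    r=16,                                 # assumed adapter rank
    lora_alpha=32,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # reports the small trainable fraction
```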
+ Please check out our paper for findings regarding SMILES vs. SELFIES, the benefits of SMILES canonicalization, multi-task synergies, and more.
- If our paper or related resources prove valuable to your research, we kindly ask for citation. Please feel free
+ If our paper or related resources are valuable to your research/applications, we kindly ask for citation. Please feel free
to contact us with any inquiries.
@article{yu2024llasmol,
title={LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset},
diff --git a/static/images/ChemLLMFig.png b/static/images/ChemLLMFig.png
deleted file mode 100644
index a5d3e63..0000000
Binary files a/static/images/ChemLLMFig.png and /dev/null differ
diff --git a/static/images/ChemLLMFig.svg b/static/images/ChemLLMFig.svg
deleted file mode 100644
index 86030b0..0000000
--- a/static/images/ChemLLMFig.svg
+++ /dev/null
@@ -1 +0,0 @@
-
\ No newline at end of file
diff --git a/static/images/tables/o_1.png b/static/images/tables/o_1.png
index d85c43d..c3c49b0 100644
Binary files a/static/images/tables/o_1.png and b/static/images/tables/o_1.png differ
diff --git a/static/images/tables/o_2.png b/static/images/tables/o_2.png
index a9416a9..73adaad 100644
Binary files a/static/images/tables/o_2.png and b/static/images/tables/o_2.png differ
diff --git a/static/images/tables/statistics.png b/static/images/tables/statistics.png
new file mode 100644
index 0000000..c987849
Binary files /dev/null and b/static/images/tables/statistics.png differ
diff --git a/static/images/tables/tasks.png b/static/images/tables/tasks.png
deleted file mode 100644
index 4fe9e65..0000000
Binary files a/static/images/tables/tasks.png and /dev/null differ
diff --git a/static/images/task_overview.svg b/static/images/task_overview.svg
new file mode 100644
index 0000000..90bb380
--- /dev/null
+++ b/static/images/task_overview.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/static/video/LlaSMol.mp4 b/static/video/LlaSMol.mp4
index 677d7cb..987d2d8 100644
Binary files a/static/video/LlaSMol.mp4 and b/static/video/LlaSMol.mp4 differ
diff --git a/test_generation.ipynb b/test_generation.ipynb
deleted file mode 100644
index b71532a..0000000
--- a/test_generation.ipynb
+++ /dev/null
@@ -1,174 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "===================================BUG REPORT===================================\n",
- "Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues\n",
- "================================================================================\n",
- "CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...\n",
- "CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so\n",
- "CUDA SETUP: Highest compute capability among GPUs detected: 8.6\n",
- "CUDA SETUP: Detected CUDA version 118\n",
- "CUDA SETUP: Loading binary /home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/yu.3737/Software/miniconda3/envs/lora did not contain libcudart.so as expected! Searching further paths...\n",
- " warn(msg)\n",
- "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('FILE')}\n",
- " warn(msg)\n",
- "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/modulefiles')}\n",
- " warn(msg)\n",
- "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/share/lmod/lmod/share/man')}\n",
- " warn(msg)\n",
- "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('() { eval $($LMOD_DIR/ml_cmd \"$@\")\\n}')}\n",
- " warn(msg)\n",
- "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('vs/workbench/api/node/extensionHostProcess')}\n",
- " warn(msg)\n",
- "/home/yu.3737/Software/miniconda3/envs/lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//matplotlib_inline.backend_inline'), PosixPath('module')}\n",
- " warn(msg)\n"
- ]
- }
- ],
- "source": [
- "from generation import LlaSMolGeneration, canonicalize_smiles_in_text"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
- ]
- },
- {
- "data": {
- "application/vnd.jupyter.widget-view+json": {
- "model_id": "91482b88007548129672708d278f7f09",
- "version_major": 2,
- "version_minor": 0
- },
- "text/plain": [
- "Loading checkpoint shards: 0%| | 0/2 [00:00, ?it/s]"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "generator = LlaSMolGeneration('osunlp/LlaSMol-Mistral-7B')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[{'input_text': 'Given the SMILES representation S=P1(N(CCCl)CCCl)NCCCO1 , what would be its molecular formula?',\n",
- " 'real_input_text': '[INST] Given the SMILES representation S=P1(N(CCCl)CCCl)NCCCO1 , what would be its molecular formula? [/INST]',\n",
- " 'output': ['It is C7H15Cl2N2OPS . ']}]"
- ]
- },
- "execution_count": 10,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "generator.generate(\n",
- " 'Given the SMILES representation S=P1(N(CCCl)CCCl)NCCCO1 , what would be its molecular formula?'\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "from utils.smiles_canonicalization import canonicalize_molecule_smiles"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "'CCC1(C)CCOC1=O'"
- ]
- },
- "execution_count": 7,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "canonicalize_molecule_smiles('CCC1(C)CCOC1=O')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "'CCC1(C)COC(=O)C1'"
- ]
- },
- "execution_count": 8,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "canonicalize_molecule_smiles('CCC1(C)COC(=O)C1')"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "lora",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.9.18"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}