[Paper] (NeurIPS, Track on Datasets and Benchmarks, 2023)
QH9 provides precise DFT-calculated Hamiltonian matrices for 999 or 2,998 molecular dynamics trajectories and 130,831 stable molecular geometries, based on the QM9 dataset.
In this repo, we provide both the QH9 dataset and the benchmark code, which can be highly valuable for developing machine learning methods and accelerating molecular and materials design for scientific and technological applications.
- We have released QH9-dyn-300k with 2,998 molecular trajectories. The time step is 50 a.u. (~1.2 fs), and each trajectory has 100 different geometries.
- The code downloads datasets and checkpoints automatically. To load the pretrained model parameters, please refer to the function `load_pretrained_model_parameters`.
- Example code for generating both the stable and dynamic datasets with PySCF is provided.
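As a quick check of the time step quoted above, the conversion from atomic units of time to femtoseconds can be reproduced in a few lines. The constant is the CODATA value of the atomic unit of time; the function and variable names are illustrative and not part of the repository.

```python
# Atomic unit of time expressed in femtoseconds (CODATA: 2.4188843265857e-17 s).
AU_TIME_IN_FS = 2.4188843265857e-2

def au_to_fs(t_au: float) -> float:
    """Convert a time given in atomic units to femtoseconds."""
    return t_au * AU_TIME_IN_FS

time_step_fs = au_to_fs(50.0)
print(f"{time_step_fs:.3f} fs")  # prints "1.209 fs"
```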
To comprehensively evaluate the quantum Hamiltonian prediction performance, we define the following tasks based on the obtained stable and dynamic geometries in the QH9 dataset. Please refer to our paper for details of these task setups.
- QH9-stable-id
- QH9-stable-ood
- QH9-dynamic-300k-geo
- QH9-dynamic-300k-mol
- QH9-dynamic-100k-geo
- QH9-dynamic-100k-mol
Task | # Total geometries | # Total molecules | # Training/validation/testing geometries |
---|---|---|---|
QH9-stable-id | 130,831 | 130,831 | 104,664/13,083/13,084 |
QH9-stable-ood | 130,831 | 130,831 | 104,001/17,495/9,335 |
QH9-dynamic-100k-geo | 99,900 | 999 | 79,920/9,990/9,990 |
QH9-dynamic-100k-mol | 99,900 | 999 | 79,900/9,900/10,100 |
QH9-dynamic-300k-geo | 299,800 | 2,998 | 239,840/29,980/29,980 |
QH9-dynamic-300k-mol | 299,800 | 2,998 | 239,800/29,900/30,100 |
Note that training on QH9-dynamic-300k costs about the same as training on QH9-dynamic-100k, while QH9-dynamic-300k contains more data and achieves higher performance on the molecule-wise split. We therefore recommend using QH9-dynamic-300k. The QH9-dynamic trajectories, including molecular geometries and forces, can be released upon request.
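To make the difference between the two dynamic split types concrete, the sketch below reproduces the QH9-dynamic-100k subset sizes from the table on index pairs only. It assumes that the geometry-wise split takes 80/10/10 geometries from every trajectory (inferred from 79,920 = 999 × 80) and that the molecule-wise split assigns 799/99/101 whole trajectories (the table values divided by 100). The function names, the seed, and the shuffling are hypothetical; the actual benchmark splits are the ones shipped with the dataset classes.

```python
import random

N_MOLECULES, N_GEOMS = 999, 100  # QH9-dynamic-100k sizes from the table

def geometry_wise_split(n_mol, n_geom, counts=(80, 10, 10), seed=0):
    """Split the geometries of every trajectory, so each molecule appears in all subsets."""
    rng = random.Random(seed)
    subsets = ([], [], [])
    for mol in range(n_mol):
        geoms = list(range(n_geom))
        rng.shuffle(geoms)
        bounds = (0, counts[0], counts[0] + counts[1], n_geom)
        for subset, lo, hi in zip(subsets, bounds[:-1], bounds[1:]):
            subset.extend((mol, g) for g in geoms[lo:hi])
    return subsets

def molecule_wise_split(n_mol, n_geom, mol_counts=(799, 99, 101), seed=0):
    """Split whole trajectories, so test molecules are never seen during training."""
    rng = random.Random(seed)
    mols = list(range(n_mol))
    rng.shuffle(mols)
    bounds = (0, mol_counts[0], mol_counts[0] + mol_counts[1], n_mol)
    return tuple(
        [(mol, g) for mol in mols[lo:hi] for g in range(n_geom)]
        for lo, hi in zip(bounds[:-1], bounds[1:])
    )

geo_train, geo_val, geo_test = geometry_wise_split(N_MOLECULES, N_GEOMS)
mol_train, mol_val, mol_test = molecule_wise_split(N_MOLECULES, N_GEOMS)

print(len(geo_train), len(geo_val), len(geo_test))  # 79920 9990 9990
print(len(mol_train), len(mol_val), len(mol_test))  # 79900 9900 10100
```

In the geometry-wise case every molecule contributes to the test subset, whereas in the molecule-wise case the test molecules are disjoint from the training molecules, which makes the latter the harder generalization setting.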
We include key dependencies below. The versions we used are in parentheses.
- PyTorch (1.11.0)
- PyG (2.0.4)
- e3nn (0.5.1)
- pyscf (2.2.1) (QH9-Stable, QH9-Dynamic-300k)
- pyscf (2.3.0) (QH9-Dynamic-100k)
- hydra-core (1.1.2)
We also provide an installation script; you can build the environment by running `source install.sh`.
We provide the datasets as commonly used PyG datasets. Here are simple examples that load our datasets with a few lines of code. Prior to that, you can download the `datasets` folder, which includes the raw data files `QH9Stable.db` and `QH9Dynamic.db`, via this Google Drive link or OneDrive link. We also provide zip files of the datasets in this Google Drive link.
from torch_geometric.loader import DataLoader
from datasets import QH9Stable, QH9Dynamic
### Use one of the following lines to load the specific dataset
dataset = QH9Stable(split='random') # QH9-stable-id
dataset = QH9Stable(split='size_ood') # QH9-stable-ood
dataset = QH9Dynamic(split='geometry', version='300k') # QH9-dynamic-300k-geo
dataset = QH9Dynamic(split='mol', version='300k') # QH9-dynamic-300k-mol
### Get the training/validation/testing subsets
train_dataset = dataset[dataset.train_mask]
valid_dataset = dataset[dataset.val_mask]
test_dataset = dataset[dataset.test_mask]
### Get the dataloaders
train_data_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
valid_data_loader = DataLoader(valid_dataset, batch_size=32, shuffle=False)
test_data_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
The equivariant quantum tensor network QHNet is currently the main baseline method in the QH9 benchmark. QHNet has an extendable expansion module built upon intermediate full orbital matrices, which allows it to handle the variety of molecules in the QH9 benchmark.
- Train the QHNet model
### Modify the configurations in config/config.yaml (or pass the configurations as args) as needed, and then run
python main.py datasets=QH9-stable datasets.split=random # QH9-stable-id
python main.py datasets=QH9-stable datasets.split=size_ood # QH9-stable-ood
python main.py datasets=QH9-dynamic datasets.split=geometry datasets.version=300k # QH9-dynamic-300k-geo
python main.py datasets=QH9-dynamic datasets.split=mol datasets.version=300k # QH9-dynamic-300k-mol
Trained models: our trained QHNet models on the defined tasks are available via this Google Drive link.
- Evaluate the trained model (in terms of MAE on the Hamiltonian matrix, MAE on occupied orbital energies, and cosine similarity of orbital coefficients). Note that the eigendecomposition takes a long time to run.
### Modify the configurations in config/config.yaml (or pass the configurations as args) as needed (including the trained_model arg), and then run
python test.py
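The three metrics named above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the code in `test.py`: the function names are hypothetical, the overlap matrix `S` is assumed to be available, the orbitals are obtained from the generalized eigenvalue problem H C = S C ε, and the orbital-coefficient similarity is taken as the mean absolute cosine similarity over occupied orbitals (the absolute value removes the sign ambiguity of eigenvectors).

```python
import numpy as np

def solve_orbitals(H, S):
    """Solve H C = S C eps via Loewdin orthogonalization; returns (eps, C)."""
    s_vals, s_vecs = np.linalg.eigh(S)
    X = s_vecs @ np.diag(s_vals ** -0.5) @ s_vecs.T  # S^(-1/2)
    eps, C_ortho = np.linalg.eigh(X @ H @ X)
    return eps, X @ C_ortho

def hamiltonian_metrics(H_pred, H_true, S, n_occ):
    """Return (MAE on H, MAE on occupied orbital energies, mean |cosine| of coefficients)."""
    mae_h = np.abs(H_pred - H_true).mean()
    eps_p, C_p = solve_orbitals(H_pred, S)
    eps_t, C_t = solve_orbitals(H_true, S)
    mae_eps = np.abs(eps_p[:n_occ] - eps_t[:n_occ]).mean()
    cp, ct = C_p[:, :n_occ], C_t[:, :n_occ]
    cos = np.abs((cp * ct).sum(axis=0)) / (
        np.linalg.norm(cp, axis=0) * np.linalg.norm(ct, axis=0)
    )
    return mae_h, mae_eps, cos.mean()

# Toy example with random symmetric matrices (not real DFT data).
rng = np.random.default_rng(0)
n, n_occ = 6, 3
A = rng.normal(size=(n, n))
H_true = (A + A.T) / 2
B = rng.normal(size=(n, n))
S = np.eye(n) + 0.05 * B @ B.T  # symmetric positive definite stand-in for the overlap
noise = rng.normal(size=(n, n))
H_pred = H_true + 1e-3 * (noise + noise.T) / 2
print(hamiltonian_metrics(H_pred, H_true, S, n_occ))
```

The two eigendecompositions per molecule dominate the cost of this computation, which is consistent with the evaluation being slow.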
- Evaluate the performance of accelerating DFT calculations. This requires running DFT for 50 molecules, which has a high computational cost.
### Modify the configurations in config/config.yaml (or pass the configurations as args) as needed (including the trained_model arg), and then run
# PySCF version 2.2.1 for QH9-Stable and QH9-Dynamic-300k; PySCF version 2.3.0 for QH9-Dynamic-100k
python test_dft_acceleration.py
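One way a predicted Hamiltonian can speed up DFT is by supplying the initial density matrix for the self-consistent field (SCF) loop. The sketch below builds a closed-shell density matrix from a Hamiltonian `H` and an overlap matrix `S` using NumPy only. The function name and the toy matrices are hypothetical, and `test_dft_acceleration.py` may proceed differently.

```python
import numpy as np

def density_from_hamiltonian(H, S, n_electrons):
    """Closed-shell density matrix D = 2 * C_occ @ C_occ.T from H C = S C eps."""
    s_vals, s_vecs = np.linalg.eigh(S)
    X = s_vecs @ np.diag(s_vals ** -0.5) @ s_vecs.T  # S^(-1/2)
    _, C_ortho = np.linalg.eigh(X @ H @ X)
    C_occ = (X @ C_ortho)[:, : n_electrons // 2]  # lowest-energy orbitals
    return 2.0 * C_occ @ C_occ.T

# Toy example with random symmetric matrices (not real DFT data).
rng = np.random.default_rng(0)
n_basis, n_electrons = 6, 4
A = rng.normal(size=(n_basis, n_basis))
H = (A + A.T) / 2
B = rng.normal(size=(n_basis, n_basis))
S = np.eye(n_basis) + 0.05 * B @ B.T

D = density_from_hamiltonian(H, S, n_electrons)
print(np.trace(D @ S))  # approximately 4, the number of electrons
```

With PySCF, such a matrix could be passed as the initial guess through the `dm0` argument of an SCF object's `kernel` method; comparing the number of SCF iterations with and without it gives one measure of the acceleration.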
Below we briefly describe how to customize this benchmark to run a model on your own dataset.
If you plan to generate your own datasets, note that our current dataset classes, such as `QH9Stable` and `QH9Dynamic`, fetch data from an `apsw` (SQLite) database. We therefore recommend saving your data in an `apsw` database.
MUST HAVE:
- `pos`: The coordinates of the atomic 3D positions.
- `atoms`: The atomic number.
- `Ham`: The Hamiltonian matrix for molecular geometries.
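A minimal sketch of storing and reading back the three required fields is shown below. It uses Python's built-in `sqlite3` module as a stand-in for `apsw` (both operate on ordinary SQLite files). The file name, table name, column names, and blob data types are assumptions for illustration and must be matched to what the dataset classes actually expect.

```python
import os
import sqlite3
import tempfile

import numpy as np

db_path = os.path.join(tempfile.mkdtemp(), "MyDataset.db")  # hypothetical file name

# Toy three-atom molecule; the Hamiltonian is a random symmetric placeholder.
atoms = np.array([8, 1, 1], dtype=np.int32)
pos = np.array([[0.0, 0.0, 0.0], [0.0, 0.76, 0.59], [0.0, -0.76, 0.59]], dtype=np.float64)
n_orb = 6  # arbitrary placeholder size
A = np.random.default_rng(0).normal(size=(n_orb, n_orb))
ham = (A + A.T) / 2

con = sqlite3.connect(db_path)
# Assumed schema: arrays are stored as raw byte blobs.
con.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, N INTEGER, atoms BLOB, pos BLOB, Ham BLOB)")
con.execute(
    "INSERT INTO data VALUES (?, ?, ?, ?, ?)",
    (0, len(atoms), atoms.tobytes(), pos.tobytes(), ham.tobytes()),
)
con.commit()

# Read the row back and restore dtypes and shapes.
row = con.execute("SELECT N, atoms, pos, Ham FROM data WHERE id = 0").fetchone()
con.close()
n_atoms = row[0]
atoms_back = np.frombuffer(row[1], dtype=np.int32)
pos_back = np.frombuffer(row[2], dtype=np.float64).reshape(n_atoms, 3)
ham_flat = np.frombuffer(row[3], dtype=np.float64)
side = int(round(ham_flat.size ** 0.5))
ham_back = ham_flat.reshape(side, side)
```

Because blobs carry no dtype or shape information, the reader has to know both in advance; storing the number of atoms alongside the arrays, as done here, is one simple way to recover the shapes.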
For the Hamiltonian matrix, pay attention to the atomic orbital order and the magnetic order.
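Changing the orbital order amounts to applying the same permutation to the rows and columns of the Hamiltonian. The sketch below does this for a single atom with one s shell and one p shell. The source order (s, px, py, pz) and the target order (s, py, pz, px) are assumptions chosen for illustration; check the conventions of your DFT code and your model before reusing them, and note that some conventions additionally require sign flips.

```python
import numpy as np

def reorder_orbitals(H, perm):
    """Reorder rows and columns of H so that new index i corresponds to old index perm[i]."""
    perm = np.asarray(perm)
    return H[np.ix_(perm, perm)]

labels_src = ["s", "px", "py", "pz"]  # assumed source order
perm = [0, 2, 3, 1]                   # assumed target order: s, py, pz, px

H_src = np.arange(16, dtype=float).reshape(4, 4)
H_src = (H_src + H_src.T) / 2  # symmetric toy matrix
H_tgt = reorder_orbitals(H_src, perm)

labels_tgt = [labels_src[i] for i in perm]
print(labels_tgt)  # ['s', 'py', 'pz', 'px']
```

For a full molecule, the per-atom permutations would be concatenated with the appropriate offsets into one index array before being applied to the whole matrix.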
Add the model file to the corresponding directory `AIRS/OpenDFT/QHBench/QH9/models/`, and then add the corresponding configuration information in `AIRS/OpenDFT/QHBench/QH9/config/`.
@article{yu2023qh9,
title={{QH9}: A Quantum Hamiltonian Prediction Benchmark for QM9 Molecules},
author={Haiyang Yu and Meng Liu and Youzhi Luo and Alex Strasser and Xiaofeng Qian and Xiaoning Qian and Shuiwang Ji},
journal={arXiv Preprint, arXiv:2306.09549},
year={2023}
}
This work was supported in part by National Science Foundation grants IIS-2006861, CCF-1553281, DMR-2119103, DMR-1753054, DMR-2103842, and IIS-2212419. Acknowledgment is also made to the donors of the American Chemical Society Petroleum Research Fund for partial support of this research.