This repository contains an implementation of symmetry-adapted Gaussian Process Regression for equivariant predictions of the electron density of both molecular and condensed-phase systems, as expanded on an atom-centered spherical harmonics basis.
A quick-start guide is provided below; full documentation is also available.
- Andrea Grisafi, Alberto Fabrizio, David M. Wilkins, Benjamin A. R. Meyer, Clemence Corminboeuf, Michele Ceriotti, "Transferable Machine-Learning Model of the Electron Density", ACS Central Science 5, 57 (2019) [https://pubs.acs.org/doi/10.1021/acscentsci.8b00551]
- Alberto Fabrizio, Andrea Grisafi, Benjamin A. R. Meyer, Michele Ceriotti, Clemence Corminboeuf, "Electron density learning of non-covalent systems", Chemical Science 10, 9424 (2019) [https://pubs.rsc.org/en/content/articlelanding/2019/sc/c9sc02696g]
- Alan M. Lewis, Andrea Grisafi, Michele Ceriotti, Mariana Rossi, "Learning electron densities in the condensed-phase", Journal of Chemical Theory and Computation 17, 7203 (2021) [https://pubs.acs.org/doi/10.1021/acs.jctc.1c00576]
- Andrea Grisafi, Alan M. Lewis, Mariana Rossi, Michele Ceriotti, "Electronic-Structure Properties from Atom-Centered Predictions of the Electron Density", Journal of Chemical Theory and Computation 19, 4451 (2023) [https://pubs.acs.org/doi/10.1021/acs.jctc.2c00850]
In the SALTED directory, simply run `make`, followed by `pip install .`
--> rascaline: installing rascaline requires a Rust compiler. To install one, run:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh && source "$HOME/.cargo/env"
rascaline can then be installed using
pip install git+https://github.com/Luthaf/rascaline.git
--> mpi4py: mpi4py is required to use MPI parallelisation; SALTED can nonetheless be run without this.
A parallel h5py installation is required to use MPI parallelisation. This can be installed by running:
HDF5_MPI="ON" CC=mpicc pip install --no-cache-dir --no-binary=h5py h5py
provided HDF5 has been compiled with MPI support.
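Before enabling MPI, it can help to check what the current Python environment actually provides. The following is a minimal sketch, not part of SALTED: the function name `mpi_support_report` is hypothetical, and the only library call relied upon is `h5py.get_config().mpi`, which reports whether h5py was built against a parallel HDF5.

```python
import importlib.util


def mpi_support_report():
    """Report which optional MPI-related packages are usable (hypothetical helper)."""
    report = {"mpi4py": False, "h5py": False, "h5py_mpi": False}

    # mpi4py is only looked up, not imported, so MPI is never initialised here
    report["mpi4py"] = importlib.util.find_spec("mpi4py") is not None

    if importlib.util.find_spec("h5py") is not None:
        try:
            import h5py
        except ImportError:
            return report  # h5py is present but cannot be loaded
        report["h5py"] = True
        # True only when h5py was compiled with parallel (MPI) HDF5 support
        report["h5py_mpi"] = bool(h5py.get_config().mpi)

    return report


if __name__ == "__main__":
    for name, available in mpi_support_report().items():
        print(f"{name}: {'yes' if available else 'no'}")
```

If `h5py_mpi` is reported as `no`, the serial h5py build is in use and the reinstall command above would presumably be needed before running in parallel.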
SALTED input is provided in an `inp.yaml` file, which is structured in the following sections:
- `salted` (required): define root storage directory and workflow label
- `system` (required): define system parameters
- `qm` (required): define information about quantum-mechanical reference
- `descriptor` (required): define parameters of symmetry-adapted descriptors
- `gpr` (required): define Gaussian Process Regression parameters
- `prediction` (optional): manage predictions on unseen datasets
Input structures are required in extXYZ format; the corresponding filename must be specified in `inp.system.filename`.
Electron density training data consist of the expansion coefficients of the scalar field over atom-centered basis functions made of radial functions and spherical harmonics. These coefficients are computed following density-fitting (DF), a.k.a. resolution of the identity, approximations, commonly applied in electronic-structure codes. We assume orthonormalized real spherical harmonics defined with the Condon-Shortley phase convention; no restriction is imposed on the nature of the radial functions. Because the basis functions are non-orthogonal, the 2-center electronic integral matrices associated with the given density-fitting approximation are also required as input.
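To see why the 2-center matrices matter, consider a minimal sketch in plain Python. It assumes, for illustration only, that the relevant 2-center matrix is the overlap $S_{ij} = \int \phi_i \phi_j$ of a hypothetical two-function basis; the matrices actually required depend on the density-fitting scheme of the interfaced code. With a non-orthogonal basis, the integrated squared difference between two densities is $\Delta c^T S \Delta c$, which generally differs from the plain Euclidean norm of the coefficient difference.

```python
def quadratic_form(vec, metric):
    """Return vec^T * metric * vec for a small dense matrix given as nested lists."""
    n = len(vec)
    return sum(vec[i] * metric[i][j] * vec[j] for i in range(n) for j in range(n))


# Hypothetical basis of two normalized but overlapping functions (S_12 = 0.5)
overlap = [[1.0, 0.5], [0.5, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical reference and predicted expansion coefficients
c_ref = [1.0, 0.0]
c_pred = [0.9, 0.1]
delta = [p - r for p, r in zip(c_pred, c_ref)]

# Error measured with the overlap metric versus ignoring the overlap
err_metric = quadratic_form(delta, overlap)
err_naive = quadratic_form(delta, identity)

print(f"with overlap metric: {err_metric:.3f}")
print(f"ignoring overlap:    {err_naive:.3f}")
```

In this toy case the two numbers differ by a factor of two, so comparing coefficients without the 2-center matrix would misjudge the quality of the density.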
The electronic-structure codes currently interfaced with SALTED are:
- FHI-aims
- CP2K
- PySCF
We refer to the code-specific examples for how to produce the required quantum-mechanical data.
The root directory used for storing SALTED data is specified in `inp.salted.saltedpath`. Depending on the chosen input parameters, a SALTED workflow can be labelled by setting a consistent string in the `inp.salted.saltedname` variable; this, in turn, defines the name of the output folders that are automatically generated during program execution. SALTED functions can be run either by importing the corresponding modules in Python or directly from the command line.
MPI parallelization can be activated by setting `inp.system.parallel` to `True`, and can be used, whenever applicable, to parallelize the calculation of SALTED functions over training data.
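As a rough illustration of the input layout, the following Python sketch prints a skeleton `inp.yaml`. It uses only the section and key names that appear in this guide. Every value (paths, filenames, numbers) is a hypothetical placeholder, the `qm` section is left empty because its keys depend on the chosen electronic-structure code, and a real input will very likely need further keys, so the full documentation should be consulted before use. The small `to_yaml` helper is also hypothetical and handles only nested dictionaries of scalars.

```python
# Skeleton of inp.yaml built from the section and key names used in this guide.
# All values are hypothetical placeholders, not recommended settings.
config = {
    "salted": {"saltedname": "test", "saltedpath": "./"},
    "system": {"filename": "./structures.xyz", "parallel": False},
    "qm": {},  # code-specific keys intentionally omitted
    "descriptor": {"sparsify": {"ncut": 1000, "nsamples": 100}},
    "gpr": {
        "Menv": 100,
        "z": 2,
        "Ntrain": 50,
        "trainsel": "random",
        "trainfrac": 1.0,
        # kept as text with a decimal point so YAML 1.1 parsers read a float
        "regul": "1.0e-8",
    },
    "prediction": {"filename": "./new_structures.xyz", "predname": "prediction"},
}


def to_yaml(mapping, indent=0):
    """Render a nested dict of scalars as a list of block-style YAML lines."""
    lines = []
    pad = "  " * indent
    for key, value in mapping.items():
        if isinstance(value, dict) and value:
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 1))
        elif isinstance(value, dict):
            lines.append(f"{pad}{key}: {{}}")  # empty section
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


yaml_text = "\n".join(to_yaml(config)) + "\n"
print(yaml_text)
```

Redirecting the printed text to a file named `inp.yaml` gives a starting point that can then be completed by hand.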
In what follows, we report an example of a general command line workflow:
1. Initialize structural features defined from 3-body symmetry-adapted descriptors, $P^L$, as computed following PRL 120, 036002 (2018): `python3 -m salted.initialize`

   An optional `sparsify` subsection can be added to the `inp.descriptor` input section in order to reduce the feature space size down to `ncut` sparse features selected using a "farthest point sampling" (FPS) algorithm. To facilitate this procedure, it is possible to perform the FPS selection over a subset of `nsamples` configurations, selected at random from the entire training dataset.

2. Find a sparse set of `inp.gpr.Menv` atomic environments in order to recast the SALTED problem into a low-dimensional space. The degree of non-linearity of the model must be defined at this stage by setting the variable `inp.gpr.z` to a positive integer; `z=1` corresponds to a linear model. `python3 -m salted.sparse_selection`

3. Compute sparse vectors of descriptors $P^L_M$ for each atomic type and angular momentum: `python3 -m salted.sparse_descriptor` (MPI parallelizable)

4. Compute sparse equivariant kernels $k^L_{MM}$ and find projector matrices over the Reproducing Kernel Hilbert Space (RKHS): `python3 -m salted.rkhs_projector`

5. Compute equivariant kernels $k^L_{NM}$ over the entire dataset and project them on the RKHS to obtain the final SALTED input vectors: `python3 -m salted.rkhs_vector` (MPI parallelizable)

6. Build the Hessian matrix of the quadratic RKHS problem over a maximum of `inp.gpr.Ntrain` training structures selected from the entire dataset; these can be selected either at random (`inp.gpr.trainsel: random`) or sequentially (`inp.gpr.trainsel: sequential`). The remaining structures are automatically retained for validation. The variable `inp.gpr.trainfrac` defines the fraction of the total training data to be used: it can range from 0 to 1, so that learning curves can be produced while keeping the validation set fixed. `python3 -m salted.hessian_matrix` (MPI parallelizable)

7. Solve the regression problem with a given regularization parameter `inp.gpr.regul`: `python3 -m salted.solve_regression`

   NB: when the dimensionality of the learning problem exceeds $10^5$, it is recommended to perform a direct minimization of the SALTED loss function in place of an explicit matrix inversion (points 6 and 7). This can be run as follows: `python3 -m salted.minimize_loss` (MPI parallelizable)

8. Validate predictions over the structures that have not been retained for training by computing the root mean square error, consistent with the definition of the SALTED loss function: `python3 -m salted.validation` (MPI parallelizable)

9. Once the SALTED model has been trained and validated, SALTED predictions for a new unseen dataset can be handled according to the `inp.prediction` section. For that, an `inp.prediction.filename` must be specified in XYZ format, while an `inp.prediction.predname` string can be defined to label the prediction directories. Equivariant predictions can then be run as follows: `python3 -m salted.prediction` (MPI parallelizable)
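The steps above can be chained in a small driver script. The sketch below is only one possible way to do this and is not shipped with SALTED: the helper names are hypothetical, and the MPI launcher prefix (for example `mpirun -np 4`) is an assumption, since the appropriate launcher depends on the local MPI installation. The launcher is prepended only to the steps marked as MPI parallelizable, and `inp.system.parallel` would still need to be set to `True` in `inp.yaml`.

```python
import shlex
import subprocess

# (module, MPI parallelizable) in the order of the workflow above.
# For very large problems, the hessian_matrix and solve_regression entries
# would be swapped for a single ("salted.minimize_loss", True) entry.
STAGES = [
    ("salted.initialize", False),
    ("salted.sparse_selection", False),
    ("salted.sparse_descriptor", True),
    ("salted.rkhs_projector", False),
    ("salted.rkhs_vector", True),
    ("salted.hessian_matrix", True),
    ("salted.solve_regression", False),
    ("salted.validation", True),
    ("salted.prediction", True),
]


def build_commands(stages=STAGES, mpi_launcher=None):
    """Return one argument list per stage, prefixing the launcher where applicable."""
    commands = []
    for module, parallelizable in stages:
        cmd = ["python3", "-m", module]
        if mpi_launcher and parallelizable:
            cmd = list(mpi_launcher) + cmd
        commands.append(cmd)
    return commands


def run_workflow(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the workflow at the first failing stage
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: only prints the commands (launcher shown is an assumed example)
    run_workflow(build_commands(mpi_launcher=["mpirun", "-np", "4"]))
```

Running the script as is only prints the commands; passing `dry_run=False` to `run_workflow` would execute them in order from the directory containing `inp.yaml`.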
Andrea Grisafi, Alan Lewis, Zekun Lou, Mariana Rossi