About • Background • Techniques • Usage
This repository contains both Jupyter notebooks for each materials informatics technique and the source code for my personal portfolio.
Materials informatics is the application of data science and informatics techniques, including artificial intelligence and machine learning, to better understand and design materials. Many aspects of materials science and engineering make the application of these techniques unique, and in many cases different from mainstream data science. Much of this uniqueness stems from the nature of materials-related datasets, which tend to be some combination of small, sparse, biased, clustered, and low quality.
Transfer learning is a machine learning (ML) approach where knowledge gained from a pre-trained model is used in the training of another model. Usually the pre-trained model is trained on a higher quality or more general dataset, and the target model is trained on a lower quality or more specific dataset. The goal is usually to speed up and/or improve learning on the target task. There are many transfer learning methods; two examples are: i) a latent variable (LV) approach, where the output of the pre-trained model is used as an input feature to the target model, and ii) a fine-tuning (FT) approach, where some optimized parameters of the pre-trained model are used to initialize the parameters of the target model.
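As a generic illustration of the FT idea (a hypothetical PyTorch network for sketching purposes only; as noted below, this repository currently contains code for the LV approach, not FT):

```python
import torch

# Hypothetical pre-trained network (imagine it was trained on the source dataset).
pretrained = torch.nn.Sequential(
    torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)

# FT approach: initialize the target model with the pre-trained parameters...
target_model = torch.nn.Sequential(
    torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
target_model.load_state_dict(pretrained.state_dict())

# ...then optionally freeze the early layers and re-train only the remaining
# parameters on the (smaller, more specific) target dataset.
for param in target_model[0].parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(
    [p for p in target_model.parameters() if p.requires_grad], lr=1e-3
)
```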
I currently only have code up for the LV approach; see here for the corresponding Jupyter notebook and here for an overview on my portfolio page. All models were built with scikit-learn's `RandomForestRegressor`.
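A minimal sketch of the LV approach, using randomly generated stand-in data rather than the datasets from the notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in data: a large, general source dataset and a small target dataset.
rng = np.random.default_rng(0)
X_source, y_source = rng.normal(size=(1000, 10)), rng.normal(size=1000)
X_target, y_target = rng.normal(size=(50, 10)), rng.normal(size=50)

# Pre-train a model on the source task.
source_model = RandomForestRegressor(random_state=0).fit(X_source, y_source)

# LV approach: append the pre-trained model's prediction as an extra input feature.
latent = source_model.predict(X_target).reshape(-1, 1)
X_target_aug = np.hstack([X_target, latent])

# Train the target model on the augmented feature set.
target_model = RandomForestRegressor(random_state=0).fit(X_target_aug, y_target)
```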
Active learning is a machine learning (ML) approach where the ML model's predictions and uncertainties are used to decide which data point to evaluate next in a search space. In the context of materials and chemicals, active learning is often used to guide design and optimization. To decide which material to make or experiment to run next, we use acquisition functions to either explore or exploit the search space. For example, if we seek to optimize a particular chemical property, we would use an exploitative acquisition function that prioritizes compounds predicted to have property values close to our target value. On the other hand, if we want to explore the search space and diversify our training data, we would use an explorative acquisition function that prioritizes compounds for which the model is most uncertain. Both are sketched in the example below. For more information on acquisition functions, see here and here.
See here for the corresponding Jupyter notebook and here for an overview on my portfolio page. All models were built with `GPflow`'s `GPR`.
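A minimal sketch of how these two acquisition functions can be scored over a candidate search space; the data, kernel, and target value here are placeholders rather than the notebook's:

```python
import numpy as np
import gpflow

# Placeholder training data and candidate search space (GPflow expects float64).
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(20, 3))
y_train = rng.normal(size=(20, 1))
X_candidates = rng.uniform(size=(500, 3))

# Gaussian process regression model fit to the current training data.
model = gpflow.models.GPR((X_train, y_train), kernel=gpflow.kernels.SquaredExponential())
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

# Predictive mean and uncertainty over all candidates.
mean, var = model.predict_f(X_candidates)
mean, std = mean.numpy().ravel(), np.sqrt(var.numpy().ravel())

target = 1.0  # hypothetical target property value

# Exploitative acquisition: candidate predicted closest to the target value.
next_exploit = int(np.argmin(np.abs(mean - target)))

# Explorative acquisition: candidate with the largest predictive uncertainty.
next_explore = int(np.argmax(std))
```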
Physics-informed learning is an ML approach where physical knowledge is embedded into the learning process, usually via a differential equation. Most often, this is achieved by incorporating an additional loss into the training of a neural network, where the loss is defined by the differential equation. Here, I have adapted this example on cooling to solve the time dependence of Newton's law of cooling at various environmental temperatures:

$$\frac{dT}{dt} = k \, (T_{\mathrm{env}} - T)$$

To incorporate a physics-based loss into the network, we simply move all terms to one side of the equation:

$$\frac{dT}{dt} - k \, (T_{\mathrm{env}} - T) = 0$$

The mean squared value of this left-hand side, evaluated at sampled time points, serves as the physics loss.
See here for the corresponding Jupyter notebook and here for an overview on my portfolio page. All models were built with `pytorch`.
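A condensed sketch of such a training loop; the cooling rate, environmental temperature, initial condition, and network architecture below are illustrative choices, not the notebook's:

```python
import torch

# Illustrative constants: cooling rate k, environmental temperature T_env,
# and initial temperature T0 (not the notebook's values).
k, T_env, T0 = 0.5, 25.0, 100.0

# Small network mapping time t to temperature T(t).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(5000):
    optimizer.zero_grad()

    # Physics loss: mean squared residual of dT/dt - k (T_env - T) = 0,
    # evaluated at randomly sampled collocation times.
    t = (10.0 * torch.rand(100, 1)).requires_grad_()
    T = net(t)
    dT_dt = torch.autograd.grad(T, t, torch.ones_like(T), create_graph=True)[0]
    physics_loss = torch.mean((dT_dt - k * (T_env - T)) ** 2)

    # Data loss: enforce the initial condition T(0) = T0.
    data_loss = (net(torch.zeros(1, 1)) - T0).pow(2).mean()

    (physics_loss + data_loss).backward()
    optimizer.step()
```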
To install the necessary packages for running the Jupyter notebooks, use `pip` to install the packages listed in `requirements_notebooks.txt`:

```
pip install -r requirements_notebooks.txt
```
Note that the notebooks often import data from `data`, as well as modules from Python scripts located in `notebooks`.
To install the necessary packages for running the `streamlit` app, use `pip` to install the packages listed in `requirements_streamlit.txt`:

```
pip install -r requirements_streamlit.txt
```
Note that the app connects to a Google Cloud bucket to access data; the code in `src` will need to be significantly modified in order to run the app locally, with or without a Google Cloud bucket.