DEEPScreen: Virtual Screening with Deep Convolutional Neural Networks Using Compound Images (with test scripts)
M. Volkan Atalay
DEEPScreen is a large-scale drug-target interaction (DTI) prediction system for early-stage drug discovery that uses deep convolutional neural networks.
One of the main advantages of DEEPScreen is that it employs readily available 2-D structural representations of compounds at the input level instead of conventional descriptors. DEEPScreen learns complex features inherently from these 2-D representations, thus producing highly accurate predictions for virtual screening. DEEPScreen was developed using the PyTorch framework.
More information can be obtained from the DEEPScreen journal article.
In the original DEEPScreen code, the model is trained and tested using a single input file; that is, all of the training data and test data are stored in the same file, and training has to be performed every time the script is executed.
This new version allows tests (prediction/virtual screening) to be performed separately from training, using an already trained model.
Here, I explain the newly added functionalities and functions.
DEEPScreen is a command-line prediction tool written in Python. The original repository came with a bundle of data and code, which I recently extended to include separate testing of a trained model. Here is the current directory structure:
- bin: source code including original and new script files (main_test.py and test_DEEPScreen.py)
- test_files: input test file(s) (this is the newly added directory)
- training_files: files used for training and testing
- result_files: results of various tests/analyses
- trained_models: already trained models
Training is explained in the original repository under the title How to train DEEPScreen models and get performance results.
Note that after training, the trained model is stored (serialized) in a file named
targetid_best_val-targetid-<hyperparameters_separated_by_dash>-<experiment_name>-state_dict.pth
under trained_models/<experiment_name>/
The following is an example call of the main_training.py script to perform training for CHEMBL210 as the target protein.
python main_training.py --targetid CHEMBL210 --model CNNModel1 --fc1 256 --fc2 128 --lr 0.01 --bs 64 --dropout 0.25 --epoch 100 --en my_chembl210_training
This command generates a file (trained_models/my_chembl210_training/CHEMBL210_best_val-CHEMBL210-CNNModel1-256-128-0.01-64-0.25-100-my_chembl210_training-state_dict.pth) that contains a serialized PyTorch state dictionary. It is a Python dictionary that contains the state of a PyTorch model, including the model's weights, biases, and other parameters.
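To make the contents of this file concrete, the following is a minimal sketch (not part of the repository) that loads and inspects such a state dictionary with PyTorch; the path is the one produced by the example command above.

```python
# Minimal sketch: inspecting the serialized state dictionary produced by training.
# The layer names printed here depend on the actual CNNModel1 architecture.
import torch

state_dict_path = (
    "trained_models/my_chembl210_training/"
    "CHEMBL210_best_val-CHEMBL210-CNNModel1-256-128-0.01-64-0.25-100-"
    "my_chembl210_training-state_dict.pth"
)

# torch.load deserializes the dictionary; map_location="cpu" allows it to be
# opened on a machine without a GPU.
state_dict = torch.load(state_dict_path, map_location="cpu")

# Each entry maps a parameter name to a tensor of weights or biases.
for name, tensor in state_dict.items():
    print(f"{name}: {tuple(tensor.shape)}")
```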
- Clone this Git Repository
- Download the compressed file for the chemical representations of compounds in ChEMBLv32 from here
- Move the compressed file under test_files/ and unzip it
- Prepare a file containing ChEMBL identifiers of compounds to be tested as explained below
- Run the main_test.py script as shown below
By executing main_test.py, the model for a target protein is restored and can be used to screen (test or make a prediction for) a compound or a list of compounds.
main_test.py calls the test_DEEPScreen function, which first parses the input test file and generates 2-D images of the compounds listed in it. The trained model is then restored, and predictions for the test compounds are obtained.
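As an illustration of the image-generation step, here is a minimal standalone sketch that draws a single compound with RDKit; the SMILES string, output file name, and image size below are illustrative assumptions, and the repository obtains SMILES from the downloaded ChEMBLv32 file using its own drawing settings.

```python
# Minimal sketch: rendering a 2-D depiction of a compound, assuming RDKit is installed.
from rdkit import Chem
from rdkit.Chem import Draw

smiles = "CC(=O)Oc1ccccc1C(=O)O"                      # aspirin, as an illustrative compound
mol = Chem.MolFromSmiles(smiles)                      # parse the SMILES string into a molecule
Draw.MolToFile(mol, "CHEMBL25.png", size=(200, 200))  # write the 2-D drawing to a PNG file
```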
The following is an example call of the main_test.py script to perform tests against CHEMBL210, using the model generated by the training example above.
python main_test.py --targetid CHEMBL210 --modelfile DEEPScreen/trained_models/my_chembl210_training/CHEMBL210_best_val-CHEMBL210-CNNModel1-256-128-0.01-64-0.25-100-my_chembl210_training-state_dict.pth --testfile CHEMBL210_compounds.tsv
Here is the explanation of the parameters:
- --targetid: ChEMBL identifier of the target protein
- --modelfile: path to the trained model file (serialized state dictionary)
- --testfile: name of the file containing the compounds/drugs to be tested
The file containing the compounds/drugs to be tested should be placed under the test_files directory, and its format is as follows.
A line starts with targetid_act, followed by a tab delimiter and a comma-separated list of ChEMBL identifiers of active compounds. Similarly, inactive compounds should be given in a separate line starting with targetid_inact, followed by a tab delimiter and a comma-separated list of ChEMBL identifiers of inactive compounds. If no activity information is known a priori, the user can place the compounds in either of the two lists (in this case, the performance evaluation scores should be ignored).
An example for CHEMBL210 is given below.
CHEMBL210_act CHEMBL2111083,CHEMBL1084173,CHEMBL1095607,CHEMBL521589,CHEMBL3039518,CHEMBL1240967,CHEMBL1291,CHEMBL1290,CHEMBL471,CHEMBL27810,CHEMBL4297483,CHEMBL1095777,CHEMBL1002,CHEMBL926,CHEMBL605846,CHEMBL1363,CHEMBL1198857,CHEMBL649,CHEMBL1201295,CHEMBL714,CHEMBL1094785,CHEMBL776,CHEMBL160519,CHEMBL88055,CHEMBL1094966,CHEMBL2012520,CHEMBL546,CHEMBL1201237,CHEMBL1201273,CHEMBL83063,CHEMBL1760,CHEMBL49080,CHEMBL1197051,CHEMBL434394,CHEMBL768,CHEMBL27193,CHEMBL16476,CHEMBL1201213,CHEMBL500,CHEMBL32800,CHEMBL1263,CHEMBL499,CHEMBL1159717,CHEMBL321582,CHEMBL631,CHEMBL27,CHEMBL1940832,CHEMBL3039530,CHEMBL1256786
CHEMBL210_inact
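For clarity, the following is a minimal sketch of how a file in this format can be parsed; it is based only on the description above, and the function name parse_test_file is hypothetical (not part of the repository).

```python
# Minimal sketch: parsing the test file format described above into a
# {chembl_id: label} dictionary, with 1 for active and 0 for inactive.
def parse_test_file(path, target_id):
    labels = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            key, _, compounds = line.partition("\t")
            label = 1 if key == f"{target_id}_act" else 0
            for chembl_id in compounds.split(","):
                if chembl_id:  # skip empty entries (e.g. an empty inactive list)
                    labels[chembl_id] = label
    return labels

# Example usage with the file shown above:
# labels = parse_test_file("test_files/CHEMBL210_compounds.tsv", "CHEMBL210")
```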
The output is displayed on the screen. First, the values of the performance measures are displayed; then the list of compounds is given with their actual labels (if available) and predicted activity (1 for active and 0 for inactive).
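For reference, here is a minimal sketch of how such performance measures can be computed from actual and predicted labels; scikit-learn is assumed to be available, the label lists are illustrative, and the exact set of metrics reported by DEEPScreen is described in the journal article.

```python
# Minimal sketch: computing common classification metrics from actual and
# predicted activity labels (1 = active, 0 = inactive).
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

actual    = [1, 1, 0, 0, 1]   # illustrative ground-truth labels
predicted = [1, 0, 0, 0, 1]   # illustrative model predictions

print("Accuracy:", accuracy_score(actual, predicted))
print("F1-score:", f1_score(actual, predicted))
print("MCC     :", matthews_corrcoef(actual, predicted))
```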
If you use DEEPScreen please consider citing:
Rifaioglu, A. S., Nalbat, E., Atalay, V., Martin, M. J., Cetin-Atalay, R., & Doğan, T. (2020). DEEPScreen: high performance drug–target interaction prediction with convolutional neural networks using 2-D structural compound representations. Chemical Science, 11(9), 2531-2557.
There is also a Medium article entitled A Deep Learning-based Tutorial for the Early Stages of Drug Discovery.
DEEPScreenWithTest Copyright (C) 2023 CanSyL and M Volkan Atalay
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.