Deep learning models for unbiased sequence-based PPI prediction plateau at an accuracy of 0.65

Data

We used the data leakage-free PPI dataset from Bernett et al., available on figshare (DOI: 10.6084/m9.figshare.21591618.v3). For the proteins in the dataset, we generated ESM-2 per-token and per-protein embeddings with data/extract_esm.py for three model sizes: esm2_t33_650M_UR50D, esm2_t36_3B_UR50D, and esm2_t48_15B_UR50D.
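To illustrate how the two embedding types relate, the sketch below derives a per-protein vector by averaging per-token vectors over the sequence. Mean pooling is an assumption here (it is a common choice for ESM-2); data/extract_esm.py may pool differently. `mean_pool` is a hypothetical helper, and the toy 3-dimensional vectors stand in for the much larger ESM-2 outputs (1280, 2560, and 5120 dimensions for the 650M, 3B, and 15B models).

```python
from typing import List


def mean_pool(per_token: List[List[float]]) -> List[float]:
    """Average per-token embeddings (L x D) into one per-protein vector (D).

    Hypothetical helper: assumes mean pooling, which may not match the
    repository's extraction script.
    """
    if not per_token:
        raise ValueError("sequence has no tokens")
    length = len(per_token)
    dim = len(per_token[0])
    return [sum(tok[d] for tok in per_token) / length for d in range(dim)]


# Toy protein with 4 residues and 3-dimensional token embeddings.
tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [1.0, 4.0, 2.0],
    [3.0, 4.0, 2.0],
]
protein_vector = mean_pool(tokens)
print(protein_vector)  # one value per embedding dimension
```

The per-protein vector has a fixed size regardless of sequence length, which is why it can feed models such as a random forest directly, while per-token embeddings keep residue-level detail at the cost of a variable-length input.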

Models

Per-protein models

As a baseline, we used a random forest classifier, trained on the full embeddings and on PCA-reduced embeddings (400 and 40 components). The associated code is in models/baselineRFC.py.

As a more advanced model, we re-implemented the fully connected model by Richoux et al. (models/fc2_20_2_dense.py). A version that includes a Transformer encoder is in models/attention.py#L595.

Per-token models

The 2d-baseline model is implemented in models/baseline2d.py.

We extended the 2d-baseline by inserting a Transformer encoder with either self-attention (models/attention.py#L90) or cross-attention (models/attention.py#L15).
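The difference between the two attention variants can be sketched with plain scaled dot-product attention. This is a minimal illustration, not the code in models/attention.py: it omits learned projections, multiple heads, and normalization layers, and the function names and toy inputs are hypothetical. In self-attention, queries, keys, and values all come from the same protein; in cross-attention, the queries come from one protein and the keys and values from its partner.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    dim_out = len(values[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim_out)]
        )
    return out


# Toy per-token embeddings (hypothetical values, 2 dimensions per residue).
protein_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 residues
protein_b = [[0.5, 0.5], [1.0, 0.0]]              # 2 residues

# Self-attention: residues of A attend to residues of A.
self_out = attention(protein_a, protein_a, protein_a)

# Cross-attention: residues of A attend to residues of B.
cross_out = attention(protein_a, protein_b, protein_b)
```

Both outputs have one row per residue of protein A, but the cross-attention rows are weighted averages of protein B's embeddings, so they carry information about the partner protein.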

Further, we re-implemented D-SCRIPT (models/dscript_like.py) and implemented a variant that includes a Transformer encoder (models/attention.py#L157).

We also re-implemented TUnA (models/attention.py#L321).

Tests

Hyperparameter tuning was done with Weights & Biases (wandb). All code needed to repeat the analyses is in main.py.

Distance maps

PPIs with structures in the PDB were identified with data/get_contact.py. These were then filtered for confident predictions made by the models (data/find_confpreds_with_structure.py).

Finally, data/get_cmap.py calculates the distance maps from the structures and their correlation with the predicted distance maps.
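As a rough sketch of this last step, the example below builds an inter-chain distance map from residue coordinates and computes its Pearson correlation with a predicted map. The choice of Pearson correlation, the use of one coordinate per residue, and all names and values are assumptions for illustration; data/get_cmap.py may use a different distance definition or correlation measure.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def distance_map(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> List[List[float]]:
    """Euclidean distance between every residue of chain A and chain B."""
    return [[math.dist(a, b) for b in coords_b] for a in coords_a]


def flatten(matrix: List[List[float]]) -> List[float]:
    return [value for row in matrix for value in row]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (norm_x * norm_y)


# Hypothetical coordinates (in angstroms), one point per residue.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
chain_b = [(0.0, 4.0, 0.0), (3.8, 4.0, 3.0)]
true_map = distance_map(chain_a, chain_b)

# Hypothetical model output with the same shape as true_map.
predicted_map = [[4.5, 6.0], [5.0, 5.5]]

r = pearson(flatten(true_map), flatten(predicted_map))
print(f"Pearson r = {r:.3f}")
```

Flattening both maps before correlating treats every residue pair as one observation, which is a simple design choice; a rank-based measure could be swapped in if the predicted map is only expected to preserve the ordering of distances.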

Visualizations

All other visualizations can be found in plots/.

Contact

Timo Reim T.Reim@campus.lmu.de (Developer)
Judith Bernett judith.bernett@tum.de
Markus List markus.list@tum.de
