zoeyxyang/BioDolphin

BioDolphin

Automatically curate new entries from PDB

Set up the conda environment

cd BioDolphin
conda env create -f environment.yaml
conda activate biodolphin_env
pip install biopython

Make sure the files are placed correctly

Place the previous version of the dataset (a .txt file, for example BioDolphin_vr1.0.txt) in /data.
Make sure the lipid annotation file (lipid_annotations.txt) is in /data.
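
A quick check along the following lines can confirm that both inputs are in place before running anything. This is a minimal sketch and not part of the repository: the helper name missing_inputs is hypothetical, and it assumes that /data refers to the data directory inside the repository and that the script is run from the repository root.

```python
from pathlib import Path


def missing_inputs(data_dir, filenames):
    """Return the names in `filenames` that are not regular files in `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in filenames if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # Assumption: inputs live under ./data relative to the current directory.
    missing = missing_inputs("data", ["BioDolphin_vr1.0.txt", "lipid_annotations.txt"])
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All input files found.")
```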

Run the code

Generate a report file

python main.py --report
This generates a report file listing which PDB entries are missing for each lipid.
Note that some of the listed PDB entries may not contain any protein, and those will not be curated.
The generated file is named with the current date at the end, for example: Report_MissingEntry_2024-09-06.txt
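
Because the report name depends on the day it was generated, it can help to compute the expected name when scripting the later steps. The sketch below assumes the date suffix is the local date in ISO format (YYYY-MM-DD), which is inferred from the example name rather than confirmed from main.py; report_filename is a hypothetical helper.

```python
from datetime import date


def report_filename(day=None):
    """Return the expected report file name for `day` (defaults to today)."""
    day = day or date.today()
    # Assumption: the suffix is an ISO date, as in Report_MissingEntry_2024-09-06.txt
    return f"Report_MissingEntry_{day.isoformat()}.txt"


if __name__ == "__main__":
    print(report_filename(date(2024, 9, 6)))  # Report_MissingEntry_2024-09-06.txt
```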

Step1 for main.py

python main.py -d BioDolphin_vr1.0.txt -l lipid_annotations.txt -o BioDolphin_vr1.1 -r Report_MissingEntry_2024-09-06.txt --step1

-d: BioDolphin_vr1.0.txt (the current full BioDolphin dataset)
-l: lipid_annotations.txt (the lipid annotation file that maps each lipid's CCD code to its annotation)
-o: BioDolphin_vr1.1 (the name of the next version of the BioDolphin dataset)
-r: Report_MissingEntry_2024-09-06.txt (the report file generated in the previous step)
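
When the version numbers and report date change between releases, assembling the step 1 command programmatically avoids retyping it. The following is only a sketch: step1_command is a hypothetical helper, and it uses only the flags listed above (-d, -l, -o, -r, --step1).

```python
import shlex


def step1_command(dataset, annotations, output, report):
    """Build the argument list for the step 1 invocation of main.py."""
    return [
        "python", "main.py",
        "-d", dataset,      # current full BioDolphin dataset
        "-l", annotations,  # lipid annotation file
        "-o", output,       # name of the next dataset version
        "-r", report,       # report file from the --report run
        "--step1",
    ]


if __name__ == "__main__":
    cmd = step1_command(
        "BioDolphin_vr1.0.txt",
        "lipid_annotations.txt",
        "BioDolphin_vr1.1",
        "Report_MissingEntry_2024-09-06.txt",
    )
    # Print the shell-quoted command so it can be reviewed before running.
    print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True) from the repository root once the printed command looks right.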

This step generates FASTA files, which must be uploaded manually to the DeepLoc web server to obtain protein subcellular location predictions.
Follow the instructions printed in the terminal output to prepare for the next step.

Get residue numbers

source run_resnum.sh
This submits 9 Slurm scripts (run_resnum.slurm), each of which runs parse_resnum.py.

Step2 for main.py

python main.py -o BioDolphin_vr1.1 --step2

This step merges the DeepLoc results and residue numbers into the dataset and writes the final updated dataset to /result.
The dataset currently contains 107849 entries.
