Automatically curate new entries from PDB
cd BioDolphin
conda env create -f environment.yaml
conda activate biodolphin_env
pip install biopython
Place the previous version of the dataset (a txt file, for example BioDolphin_vr1.0.txt) in /data.
Make sure the lipid annotation file (lipid_annotations.txt) is in /data.
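Before generating the report, it can help to confirm that both input files are in place. The sketch below is a hypothetical helper and not part of BioDolphin: the function name `missing_inputs` and the `data_dir` argument are illustrative, and the location of the data directory (here `data` in the repository root) is an assumption to adjust for your setup.

```python
from pathlib import Path


def missing_inputs(data_dir, filenames):
    """Return the names in `filenames` that are not regular files in `data_dir`."""
    base = Path(data_dir)
    return [name for name in filenames if not (base / name).is_file()]


if __name__ == "__main__":
    # Assumption: the data directory is "data" relative to the current directory.
    required = ["BioDolphin_vr1.0.txt", "lipid_annotations.txt"]
    missing = missing_inputs("data", required)
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All input files found.")
```

An empty result means both files were found and the report step can be run.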
python main.py --report
This generates a report file listing the PDB entries that are missing for each lipid.
Note that some of the listed entries may contain no protein, and those will not be curated.
The report file is named with the current date at the end, for example: Report_MissingEntry_2024-09-06.txt
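If the next step is scripted, the report name has to be reconstructed from the date. The following is a minimal sketch, assuming the name always follows the `Report_MissingEntry_YYYY-MM-DD.txt` pattern seen in the example above; the pattern is inferred from that single example, and `report_filename` is a hypothetical helper rather than a function from the repository.

```python
from datetime import date


def report_filename(day=None):
    """Build the report name for `day`, defaulting to today's date."""
    # Assumption: the date suffix is ISO formatted (YYYY-MM-DD).
    day = day or date.today()
    return f"Report_MissingEntry_{day.isoformat()}.txt"


if __name__ == "__main__":
    print(report_filename(date(2024, 9, 6)))
```

Checking the actual file written by `--report` remains the safest way to get the name.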
python main.py -d BioDolphin_vr1.0.txt -l lipid_annotations.txt -o BioDolphin_vr1.1 -r Report_MissingEntry_2024-09-06.txt --step1
-d: BioDolphin_vr1.0.txt (the current full BioDolphin dataset)
-l: lipid_annotations.txt (the lipid annotation file that maps each lipid CCD code to its annotation)
-o: BioDolphin_vr1.1 (the name of the next version of the BioDolphin dataset)
-r: Report_MissingEntry_2024-09-06.txt (the report file generated in the previous step)
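When version names change between releases, assembling the step 1 command programmatically avoids typos. This is only a sketch: it uses the flags documented above (`-d`, `-l`, `-o`, `-r`, `--step1`) and nothing else, and `step1_command` is a hypothetical helper that is not part of the repository.

```python
import shlex


def step1_command(dataset, annotations, output, report):
    """Assemble the step 1 invocation of main.py as an argument list."""
    return [
        "python", "main.py",
        "-d", dataset,
        "-l", annotations,
        "-o", output,
        "-r", report,
        "--step1",
    ]


if __name__ == "__main__":
    cmd = step1_command(
        "BioDolphin_vr1.0.txt",
        "lipid_annotations.txt",
        "BioDolphin_vr1.1",
        "Report_MissingEntry_2024-09-06.txt",
    )
    # Print the command for review; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting issues.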
After this step, FASTA files are generated. Upload them manually to the DeepLoc web server to obtain protein subcellular location predictions.
Follow the instructions printed in the terminal to prepare for the next step.
source run_resnum.sh
This submits 9 Slurm scripts (run_resnum.slurm), each of which runs parse_resnum.py.
python main.py -o BioDolphin_vr1.1 --step2
This step merges the DeepLoc results and residue numbers into the dataset and writes the final updated dataset to /result
Current number of data is: 107849