Merge pull request #26 from calico/revision-upd-4
Revision update
johli authored Oct 8, 2024
2 parents eeee3fe + b900127 commit 65d71da
Showing 137 changed files with 18,746 additions and 6,625 deletions.
89 changes: 75 additions & 14 deletions README.md
@@ -9,7 +9,7 @@ Code repository for Borzoi models, which are convolutional neural networks train

[https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1](https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1).

Borzoi was trained on a large set of RNA-seq experiments from ENCODE and GTEx, as well as re-processed versions of the original Enformer training data (including ChIP-seq and DNase data from ENCODE, ATAC-seq data from CATlas, and CAGE data from FANTOM5). Click [here](https://raw.githubusercontent.com/calico/borzoi/main/examples/targets_human.txt) for a list of trained-on experiments.
Borzoi was trained on a large set of RNA-seq experiments from ENCODE and GTEx, as well as re-processed versions of the original Enformer training data (including ChIP-seq and DNase data from ENCODE, ATAC-seq data from CATlas, and CAGE data from FANTOM5). Here is a list of trained-on experiments: [human](https://raw.githubusercontent.com/calico/borzoi/main/examples/targets_human.txt) / [mouse](https://raw.githubusercontent.com/calico/borzoi/main/examples/targets_mouse.txt).

The repository contains example usage code (including Jupyter notebooks for predicting and visualizing genetic variants) as well as links for downloading model weights, training data, QTL benchmark tasks, etc.

@@ -30,20 +30,67 @@ cd borzoi
pip install -e .
```

These repositories further depend on a number of python packages (which are automatically installed with borzoi). See **setup.cfg** for a complete list. The most important version dependencies are:
- Python == 3.9
- Tensorflow == 2.12.x (see [https://www.tensorflow.org/install/pip](https://www.tensorflow.org/install/pip))
To train new models, the [westminster repository](https://github.com/calico/westminster.git) is also required and can be installed with these commands (*this repo is not yet available, but will be made public soon*):
```sh
git clone https://github.com/calico/westminster.git
cd westminster
pip install -e .
```

These repositories further depend on a number of Python packages (which are automatically installed with borzoi). See **pyproject.toml** for a complete list. The most important version dependencies are:
- Python == 3.10
- TensorFlow == 2.15.x (see [https://www.tensorflow.org/install/pip](https://www.tensorflow.org/install/pip))

*Note*: The example notebooks require Jupyter, which can be installed with `pip install notebook`.<br/>
A new conda environment can be created with `conda create -n borzoi_py39 python=3.9`.
A new conda environment can be created with `conda create -n borzoi_py310 python=3.10`.<br/>
Some of the scripts in this repository start multi-process jobs and require [Slurm](https://slurm.schedmd.com/).

Finally, the code base relies on a number of environment variables. For convenience, these can be configured in the active conda environment with the 'env_vars.sh' script. First, open up 'env_vars.sh' in each repository folder and change the few lines of code at the top to your local paths. Then, issue these commands:
```sh
cd borzoi
conda activate borzoi_py310
./env_vars.sh
cd ../baskerville
./env_vars.sh
cd ../westminster
./env_vars.sh
```

Alternatively, the environment variables can be set manually:
```sh
export BORZOI_DIR=/home/<user_path>/borzoi
export PATH=$BORZOI_DIR/src/scripts:$PATH
export PYTHONPATH=$BORZOI_DIR/src/scripts:$PYTHONPATH

export BASKERVILLE_DIR=/home/<user_path>/baskerville
export PATH=$BASKERVILLE_DIR/src/baskerville/scripts:$PATH
export PYTHONPATH=$BASKERVILLE_DIR/src/baskerville/scripts:$PYTHONPATH

export WESTMINSTER_DIR=/home/<user_path>/westminster
export PATH=$WESTMINSTER_DIR/src/westminster/scripts:$PATH
export PYTHONPATH=$WESTMINSTER_DIR/src/westminster/scripts:$PYTHONPATH

export BORZOI_CONDA=/home/<user>/anaconda3/etc/profile.d/conda.sh
export BORZOI_HG38=$BORZOI_DIR/examples/hg38
export BORZOI_MM10=$BORZOI_DIR/examples/mm10
export BASKERVILLE_CONDA=$BORZOI_CONDA
```

*Note*: The *baskerville* and *westminster* variables are only required for data processing and model training.
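
As a quick sanity check after either setup route, the sketch below reports which of the variables listed above are unset in the current shell. The function name `check_borzoi_env` is hypothetical and not part of the repository; it only reads the environment and changes nothing.

```sh
# Hypothetical helper (not part of the repository): report which of the
# named environment variables are unset or empty in the current shell.
check_borzoi_env() {
  _missing=0
  for _var in "$@"; do
    # look up the value of the variable whose name is stored in $_var
    eval "_val=\${$_var:-}"
    if [ -z "$_val" ]; then
      echo "missing: $_var"
      _missing=1
    else
      echo "ok: $_var=$_val"
    fi
  done
  return "$_missing"
}

# Variables used for prediction; add the BASKERVILLE_* and WESTMINSTER_*
# variables when processing data or training models.
check_borzoi_env BORZOI_DIR BORZOI_HG38 BORZOI_MM10 BORZOI_CONDA || true
```

If a variable is reported as missing right after running `./env_vars.sh`, re-activating the conda environment (`conda deactivate`, then `conda activate borzoi_py310`) should load the exports written to `activate.d`.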

### Model Availability
The model weights can be downloaded as .h5 files from the URLs below. We trained a total of 4 model replicates with identical train, validation and test splits (test = fold3, validation = fold4 from [sequences_human.bed.gz](https://github.com/calico/borzoi/blob/main/data/sequences_human.bed.gz)).

[Borzoi V2 Replicate 0](https://storage.googleapis.com/seqnn-share/borzoi/f0/model0_best.h5)<br/>
[Borzoi V2 Replicate 1](https://storage.googleapis.com/seqnn-share/borzoi/f1/model0_best.h5)<br/>
[Borzoi V2 Replicate 2](https://storage.googleapis.com/seqnn-share/borzoi/f2/model0_best.h5)<br/>
[Borzoi V2 Replicate 3](https://storage.googleapis.com/seqnn-share/borzoi/f3/model0_best.h5)<br/>
[Borzoi Replicate 0 (human)](https://storage.googleapis.com/seqnn-share/borzoi/f0/model0_best.h5) | [(mouse)](https://storage.googleapis.com/seqnn-share/borzoi/f0/model1_best.h5)<br/>
[Borzoi Replicate 1 (human)](https://storage.googleapis.com/seqnn-share/borzoi/f1/model0_best.h5) | [(mouse)](https://storage.googleapis.com/seqnn-share/borzoi/f1/model1_best.h5)<br/>
[Borzoi Replicate 2 (human)](https://storage.googleapis.com/seqnn-share/borzoi/f2/model0_best.h5) | [(mouse)](https://storage.googleapis.com/seqnn-share/borzoi/f2/model1_best.h5)<br/>
[Borzoi Replicate 3 (human)](https://storage.googleapis.com/seqnn-share/borzoi/f3/model0_best.h5) | [(mouse)](https://storage.googleapis.com/seqnn-share/borzoi/f3/model1_best.h5)<br/>

Users can run the script *download_models.sh* to download all model replicates and annotations into the 'examples/' folder.
```sh
cd borzoi
./download_models.sh
```
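
When only one replicate is needed, the download can be narrowed down. The following is a minimal sketch, assuming the URL pattern of the links above and the local folder layout used by *download_models.sh* (`examples/saved_models/f3c<N>/train/`). The function names `borzoi_model_url` and `download_borzoi_replicate` are hypothetical, and storing the mouse weights next to the human weights is an assumption rather than something the script does.

```sh
# Hypothetical helpers (not part of the repository).

# borzoi_model_url REP SPECIES -> print the URL of one replicate's weights.
# REP is 0-3; SPECIES is "human" (model0) or "mouse" (model1), as in the links above.
borzoi_model_url() {
  case "$2" in
    human) _head=0 ;;
    mouse) _head=1 ;;
    *) echo "unknown species: $2" >&2; return 1 ;;
  esac
  echo "https://storage.googleapis.com/seqnn-share/borzoi/f$1/model${_head}_best.h5"
}

# download_borzoi_replicate REP SPECIES -> fetch the weights unless already present.
download_borzoi_replicate() {
  _url=$(borzoi_model_url "$1" "$2") || return 1
  _dir="examples/saved_models/f3c$1/train"   # assumed layout, mirrors download_models.sh
  _file="$_dir/${_url##*/}"                  # keep the remote file name
  mkdir -p "$_dir"
  if [ -f "$_file" ]; then
    echo "$_file already exists."
  else
    wget --progress=bar:force "$_url" -O "$_file"
  fi
}

# Example (commented out so that nothing is downloaded by accident):
# download_borzoi_replicate 0 human
```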

#### Mini Borzoi Models
We have trained a collection of (smaller) model instances on various subsets of data modalities (or on all data modalities but with architectural changes compared to the original architecture). For example, some models are trained only on RNA-seq data while others are trained on DNase-, ATAC- and RNA-seq data. Similarly, some model instances are trained on human-only data while others are trained on both human and mouse data. The models were trained with either 2- or 4-fold cross-validation and are available at the following URL:
@@ -60,9 +107,12 @@ For example, here are the weights, targets, and parameter file of a model traine
### Data Availability
The training data for Borzoi can be downloaded from the following URL:

[Borzoi V2 Training Data](https://storage.googleapis.com/borzoi-paper/data/)<br/>
[Borzoi Training Data](https://storage.googleapis.com/borzoi-paper/data/)<br/>

*Note*: This data bucket is very large and thus set to "Requester Pays".
*Note*: This data bucket is large (multiple TB) and thus set to "Requester Pays". To access the bucket, you must have a billable user project set up on the Google Cloud Platform (GCP) and included with the "-u" flag when issuing gsutil commands. For example, to list the contents of "gs://borzoi-paper/data", issue this command:
```sh
gsutil -u <user_project> ls gs://borzoi-paper/data
```

### QTL Availability
The curated e-/s-/pa-/ipaQTL benchmarking data can be downloaded from the following URLs:
@@ -72,10 +122,21 @@ The curated e-/s-/pa-/ipaQTL benchmarking data can be downloaded from the follow
[paQTL Data](https://storage.googleapis.com/borzoi-paper/qtl/paqtl/)<br/>
[ipaQTL Data](https://storage.googleapis.com/borzoi-paper/qtl/ipaqtl/)<br/>

### Paper Replication
To replicate the results presented in the paper, visit the [borzoi-paper repository](https://github.com/calico/borzoi-paper.git). This repository contains scripts for **training**, **evaluating**, and **analyzing** the published model, and for processing the **training data**.

### Tutorials
The following directories contain *minimal* tutorials regarding model training, variant scoring, and interpretation. The 'legacy' tutorials use data transformations that are similar to those used in the manuscript, while 'latest' use updated (and simpler) transformations. Note that these tutorials are only intended to showcase core functionality on sample data (such as processing an RNA-seq experiment, or training a simple model). For advanced analyses, we recommend studying the results presented in the manuscript (see [Paper Replication](https://github.com/calico/borzoi/tree/main?tab=readme-ov-file#paper-replication)).

- **Data Processing** [latest](https://github.com/calico/borzoi/tree/main/tutorials/latest/make_data) | [legacy](https://github.com/calico/borzoi/tree/main/tutorials/legacy/make_data)<br/>
- **Model Training** [latest](https://github.com/calico/borzoi/tree/main/tutorials/latest/train_model) | [legacy](https://github.com/calico/borzoi/tree/main/tutorials/legacy/train_model)<br/>
- **Variant Scoring** [latest](https://github.com/calico/borzoi/tree/main/tutorials/latest/score_variants) | [legacy](https://github.com/calico/borzoi/tree/main/tutorials/legacy/score_variants)<br/>
- **Sequence Interpretation** [latest](https://github.com/calico/borzoi/tree/main/tutorials/latest/interpret_sequence) | [legacy](https://github.com/calico/borzoi/tree/main/tutorials/legacy/interpret_sequence)<br/>

### Example Notebooks
The following notebooks contain example code for predicting and interpreting genetic variants.

[Notebook 1a: Interpret eQTL SNP (expression)](https://github.com/calico/borzoi/blob/main/examples/borzoi_example_eqtl_chr10_116952944_T_C.ipynb)<br/>
[Notebook 1b: Interpret sQTL SNP (splicing)](https://github.com/calico/borzoi/blob/main/examples/borzoi_example_sqtl_chr9_135548708_G_C.ipynb)<br/>
[Notebook 1c: Interpret paQTL SNP (polyadenylation)](https://github.com/calico/borzoi/blob/main/examples/borzoi_example_paqtl_chr1_236763042_A_G.ipynb)<br/>
[Notebook 1a: Interpret eQTL SNP (expression)](https://github.com/calico/borzoi/blob/main/examples/borzoi_example_eqtl_chr10_116952944_T_C.ipynb) [(fancy)](https://github.com/calico/borzoi/blob/main/examples/borzoi_example_eqtl_chr10_116952944_T_C_fancy.ipynb)<br/>
[Notebook 1b: Interpret paQTL SNP (polyadenylation)](https://github.com/calico/borzoi/blob/main/examples/borzoi_example_paqtl_chr1_236763042_A_G.ipynb) [(fancy)](https://github.com/calico/borzoi/blob/main/examples/borzoi_example_paqtl_chr1_236763042_A_G_fancy.ipynb)<br/>
[Notebook 1c: Interpret sQTL SNP (splicing)](https://github.com/calico/borzoi/blob/main/examples/borzoi_example_sqtl_chr9_135548708_G_C.ipynb)<br/>
[Notebook 1d: Interpret ipaQTL SNP (splicing and polyadenylation)](https://github.com/calico/borzoi/blob/main/examples/borzoi_example_ipaqtl_chr10_116664061_G_A.ipynb)<br/>
68 changes: 68 additions & 0 deletions download_models.sh
@@ -0,0 +1,68 @@
#!/bin/bash

# download model weights (data fold 3, 4 replicates)
for rep in f3c0,f0 f3c1,f1 f3c2,f2 f3c3,f3; do
  # split "local_name,remote_name" into $1 and $2
  IFS=","; set -- $rep
  mkdir -p "examples/saved_models/$1/train"
  local_model="examples/saved_models/$1/train/model0_best.h5"
  if [ -f "$local_model" ]; then
    echo "$1 model already exists."
  else
    wget --progress=bar:force "https://storage.googleapis.com/seqnn-share/borzoi/$2/model0_best.h5" -O "$local_model"
  fi
done

# download and uncompress annotation files
mkdir -p examples/hg38/genes/gencode41
mkdir -p examples/hg38/genes/polyadb

if [ -f examples/hg38/genes/gencode41/gencode41_basic_nort.gtf ]; then
  echo "Gene annotation already exists."
else
  wget -O - https://storage.googleapis.com/seqnn-share/helper/gencode41_basic_nort.gtf.gz | gunzip -c > examples/hg38/genes/gencode41/gencode41_basic_nort.gtf
fi

if [ -f examples/hg38/genes/gencode41/gencode41_basic_nort_protein.gtf ]; then
  echo "Gene annotation (no read-through, protein-coding) already exists."
else
  wget -O - https://storage.googleapis.com/seqnn-share/helper/gencode41_basic_nort_protein.gtf.gz | gunzip -c > examples/hg38/genes/gencode41/gencode41_basic_nort_protein.gtf
fi

if [ -f examples/hg38/genes/gencode41/gencode41_basic_protein.gtf ]; then
  echo "Gene annotation (protein-coding) already exists."
else
  wget -O - https://storage.googleapis.com/seqnn-share/helper/gencode41_basic_protein.gtf.gz | gunzip -c > examples/hg38/genes/gencode41/gencode41_basic_protein.gtf
fi

if [ -f examples/hg38/genes/gencode41/gencode41_basic_tss2.bed ]; then
  echo "TSS annotation already exists."
else
  wget -O - https://storage.googleapis.com/seqnn-share/helper/gencode41_basic_tss2.bed.gz | gunzip -c > examples/hg38/genes/gencode41/gencode41_basic_tss2.bed
fi

if [ -f examples/hg38/genes/gencode41/gencode41_basic_protein_splice.csv.gz ]; then
  echo "Splice site annotation (csv) already exists."
else
  wget https://storage.googleapis.com/seqnn-share/helper/gencode41_basic_protein_splice.csv.gz -O examples/hg38/genes/gencode41/gencode41_basic_protein_splice.csv.gz
fi

if [ -f examples/hg38/genes/gencode41/gencode41_basic_protein_splice.gff ]; then
  echo "Splice site annotation (gff) already exists."
else
  wget -O - https://storage.googleapis.com/seqnn-share/helper/gencode41_basic_protein_splice.gff.gz | gunzip -c > examples/hg38/genes/gencode41/gencode41_basic_protein_splice.gff
fi

if [ -f examples/hg38/genes/polyadb/polyadb_human_v3.csv.gz ]; then
  echo "PolyA site annotation already exists."
else
  wget https://storage.googleapis.com/seqnn-share/helper/polyadb_human_v3.csv.gz -O examples/hg38/genes/polyadb/polyadb_human_v3.csv.gz
fi

# download and index hg38 genome
mkdir -p examples/hg38/assembly/ucsc

if [ -f examples/hg38/assembly/ucsc/hg38.fa ]; then
  echo "Human genome FASTA already exists."
else
  wget -O - http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz | gunzip -c > examples/hg38/assembly/ucsc/hg38.fa
  python src/scripts/idx_genome.py examples/hg38/assembly/ucsc/hg38.fa
fi
38 changes: 38 additions & 0 deletions env_vars.sh
@@ -0,0 +1,38 @@
#!/bin/bash

# set these variables before running the script
LOCAL_BORZOI_PATH="/home/jlinder/borzoi"
LOCAL_CONDA_PATH="/home/jlinder/anaconda3/etc/profile.d/conda.sh"

# create env_vars sh scripts in local conda env
mkdir -p "$CONDA_PREFIX/etc/conda/activate.d"
mkdir -p "$CONDA_PREFIX/etc/conda/deactivate.d"

file_vars_act="$CONDA_PREFIX/etc/conda/activate.d/env_vars.sh"
if ! [ -e "$file_vars_act" ]; then
  echo '#!/bin/sh' > "$file_vars_act"
fi

file_vars_deact="$CONDA_PREFIX/etc/conda/deactivate.d/env_vars.sh"
if ! [ -e "$file_vars_deact" ]; then
  echo '#!/bin/sh' > "$file_vars_deact"
fi

# append env variable exports to /activate.d/env_vars.sh
echo "export BORZOI_DIR=$LOCAL_BORZOI_PATH" >> "$file_vars_act"
echo 'export PATH=$BORZOI_DIR/src/scripts:$PATH' >> "$file_vars_act"
echo 'export PYTHONPATH=$BORZOI_DIR/src/scripts:$PYTHONPATH' >> "$file_vars_act"

echo 'export BORZOI_HG38=$BORZOI_DIR/examples/hg38' >> "$file_vars_act"
echo 'export BORZOI_MM10=$BORZOI_DIR/examples/mm10' >> "$file_vars_act"

echo "export BORZOI_CONDA=$LOCAL_CONDA_PATH" >> "$file_vars_act"

# append env variable unsets to /deactivate.d/env_vars.sh
echo 'unset BORZOI_DIR' >> "$file_vars_deact"
echo 'unset BORZOI_HG38' >> "$file_vars_deact"
echo 'unset BORZOI_MM10' >> "$file_vars_deact"
echo 'unset BORZOI_CONDA' >> "$file_vars_deact"

# finally activate env variables
# (this reaches the calling shell only if this script is itself sourced;
# otherwise the variables take effect the next time the environment is activated)
source "$file_vars_act"