This is the pipeline used in the Stenglein lab to taxonomically classify sequences in NGS datasets.
The goal of this document is to describe how to set up a computer to run this pipeline.
The dependencies for this pipeline (see below) can be installed via conda. The conda recipe file for 64-bit Linux is here. After downloading this recipe file, create a conda environment with this command:
conda create --name taxonomy --file taxo_recipe.yaml
This pipeline has several main dependencies (included in the conda environment described above), including:
In addition to these, the pipeline uses some Perl modules and the scripts in this repository.
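After creating and activating the conda environment, it can be useful to confirm that the tools the pipeline relies on are actually on your PATH. The following is a minimal sketch, not part of this repository: `check_tools` is a hypothetical helper name, and you would pass it whichever command names apply to your installation.

```shell
# Hypothetical helper (not part of this repository):
# report which of the given commands are missing from PATH.
# Returns 0 if every command is found, 1 if any is missing.
check_tools() {
    ct_missing=0
    for ct_tool in "$@"; do
        if ! command -v "$ct_tool" >/dev/null 2>&1; then
            # Print each missing command to stderr
            echo "missing: $ct_tool" >&2
            ct_missing=1
        fi
    done
    return "$ct_missing"
}
```

For example, `check_tools conda perl` would print a `missing:` line for each of the two commands that cannot be found, and return a non-zero status if any is missing.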
The pipeline also expects local installations of the NCBI nt/nr databases, as well as the NCBI taxonomy database for accession->taxid mapping.
The pipeline needs databases of nucleotide (nt) and protein (nr) sequences from NCBI. These databases are large: expect them to take up roughly 1 TB of disk space, and they continue to grow.
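Because of this size, it may be worth checking free disk space before starting a download. The sketch below is an illustration rather than part of this repository: `has_enough_space` is a hypothetical helper, and the threshold you pass is your own estimate. It relies only on the POSIX `df -P -k` output format, in which the fourth column of the second line is the available space in 1024-byte blocks.

```shell
# Hypothetical helper (not part of this repository):
# succeed if the filesystem holding directory $1 has at least $2 GiB free.
# Returns 0 if there is enough space, 1 if not, 2 if df output is unusable.
has_enough_space() {
    hes_dir=$1
    hes_required_gb=$2
    # -P gives one line per filesystem; -k reports 1024-byte blocks
    hes_avail_kb=$(df -Pk "$hes_dir" 2>/dev/null | awk 'NR==2 {print $4}')
    [ -n "$hes_avail_kb" ] || return 2
    # Convert KiB to GiB (integer division rounds down)
    hes_avail_gb=$((hes_avail_kb / 1024 / 1024))
    echo "available in $hes_dir: ${hes_avail_gb} GiB (need ${hes_required_gb} GiB)"
    [ "$hes_avail_gb" -ge "$hes_required_gb" ]
}
```

For example, `has_enough_space /path/to/databases 1000` (the path here is a placeholder) would report the free space on that filesystem and fail if less than about 1 TB is available. Since the databases keep growing, leaving some headroom above the current size is a reasonable choice.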
We download and set up the NCBI databases using the scripts in the server directory of this repository. The main scripts to run are: