Tutorial: annotation and interpretation of novel mammalian reference genomes

This is a tutorial on how to annotate a newly sequenced mammalian genome. It takes the user from a genome FASTA sequence to a high-quality GFF file annotated with gene symbols. We break the process into five main steps:

  1. Repeat masking
  2. Generating protein-coding RNA (mRNA) gene models
  3. Combining and filtering gene models
  4. Adding non-coding RNA (ncRNA) gene models
  5. Assigning gene symbols

We recommend tools and best practices for each step, providing code to help the user execute each task. It is up to the user to install the tools that we recommend for this pipeline; however, we note the challenging installation processes that we have encountered with certain tools and how we overcame those challenges.

Genome annotation generally lacks a ground truth to compare against, so we use different sources of evidence to annotate the most likely gene models. These gene models are hypotheses about where genes are located on the genome, and false positives and false negatives will always exist. This pipeline uses existing tools and quality-checking software to try to minimise both error rates and create a high-quality annotation.

Our main recommendations in brief

In this tutorial, we discuss in depth which processes and tools we recommend depending on your data availability and quality. However, here is a brief overview of our recommendations:

  1. Soft-mask your genome using Earl Grey; use the soft-masked genome for all of the following steps
  2. Find two or three closely related genomes on RefSeq (or Ensembl) by searching through their genome database
  3. Lift over the annotations using LiftOff (works best for more closely related species); see the command sketch after this list
  4. Lift over the annotations using TOGA (works best for more distantly related species)
  5. If you have access to high-quality RNA-seq data (2x100bp or longer) from the species you wish to annotate, align RNA-seq with HISAT2 and perform annotation with StringTie2; do this for as many tissues as possible
  6. Perform annotation with BRAKER3 using any or no RNA-seq data from your species
  7. Use Mikado to combine and filter any annotations generated in steps 3-6
  8. Use Infernal + Rfam and MirMachine, and extract outputs from Earl Grey, to annotate ncRNAs
  9. Predict gene symbols using the OrthoFinder script or by intersecting gene liftover results with BedTools
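
As an illustration of steps 3 and 5, below is a hedged sketch of what the liftover and RNA-seq alignment/assembly commands might look like. The file names (target.fa, reference.fa, reference.gff, tissue1_R1.fq.gz, tissue1_R2.fq.gz) and thread counts are placeholders, and the flags you need may differ for your data, so treat this as a starting point rather than a definitive command line:

    # Step 3: lift over an existing annotation from a related species with LiftOff
    # (target.fa = your new genome; reference.fa + reference.gff = the related species)
    liftoff -g reference.gff -o liftoff_annotation.gff -p 8 target.fa reference.fa

    # Step 5: align RNA-seq reads from your species with HISAT2, then assemble
    # transcripts with StringTie2 (repeat for each tissue)
    hisat2-build target.fa target_index
    hisat2 -x target_index -1 tissue1_R1.fq.gz -2 tissue1_R2.fq.gz -p 8 --dta \
        | samtools sort -@ 8 -o tissue1.bam
    stringtie tissue1.bam -o tissue1.gtf -p 8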

For each annotation generated in steps 3, 4, 6, and 7, we recommend converting the GTF/GFF file to a protein FASTA file using GFFRead and generating a BUSCO score to assess the quality of the annotation (not possible for StringTie-generated annotations as these lack CDS features which are required for the conversion to a protein FASTA file). The combined and filtered annotation generated by Mikado in step 7 should have a higher score than any input annotation. If not, rerun Mikado with modified settings.
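
Below is a hedged sketch of that quality check; the file names are placeholders, and mammalia_odb10 is just one plausible BUSCO lineage (choose whichever lineage is appropriate for your species):

    # Convert an annotation to a protein FASTA file (requires CDS features)
    gffread -y proteins.fa -g target.fa annotation.gff

    # Assess annotation completeness with BUSCO in protein mode
    busco -i proteins.fa -m protein -l mammalia_odb10 -o busco_annotation -c 8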

List of tools

Notes on computational requirements

We expect the user to be familiar with installing and running command-line tools, as genome annotation relies on such tools. Additional familiarity with R may be helpful for some more advanced tasks. Many tools can be run on a desktop, but some are very computationally intensive and require the use of a high-performance compute cluster (e.g. Compute Canada). The speed of many tools will improve if they have access to multiple threads and can therefore run tasks in parallel. To check how many threads you can specify when running tools, check the documentation of your compute cluster or run nproc on your local desktop.
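
For example, on a local desktop you can check the available threads with nproc, while on a cluster you request them from the scheduler. The sketch below assumes a SLURM-based cluster (which Compute Canada uses); the tool name and its --threads flag are placeholders, since the exact flag varies by tool:

    # On a local desktop: how many threads are available?
    nproc

    # In a SLURM job script: request CPUs and pass the same number to the tool
    #SBATCH --cpus-per-task=16
    some_annotation_tool --threads "$SLURM_CPUS_PER_TASK" input.fa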

Because genome annotation relies on a number of different command-line tools, we recommend creating or using a separate environment for each tool on your machine, if possible. One tool may rely on one version of a piece of software, while another tool may rely on a more recent or older version; if the tools share the same environment, such conflicts may prevent each tool from running properly. Isolated environments can be created with tools such as Docker, Conda, or Python's venv/virtualenv.

Virtual environments

Docker containers exist for certain tools, and typically mean that a tool is packaged with all of its requirements and is ready for you to use; you can check for them by running docker search name_of_tool. If a Docker container is listed, you can typically use it by running docker run -v "$(pwd)":/tmp name_of_container command. Here, docker run launches the container; -v mounts a volume, mapping your current working directory ($(pwd)) to a directory inside the container (the placeholder /tmp here) so that the container can read your input files and write its output back to your machine; name_of_container is replaced by the name of whatever container you want to try; and command is the command of the tool you want to use.
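
For example (name_of_tool and name_of_container are placeholders, to be substituted with the image name reported by docker search; the mount point inside the container, /data here, is arbitrary):

    # Look for a published image of the tool on Docker Hub
    docker search name_of_tool

    # Run the tool inside the container, mounting the current directory to
    # /data inside the container so it can read your inputs and write outputs
    docker run -v "$(pwd)":/data name_of_container name_of_tool --help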

Conda environments can be created using the command conda create -n name_of_environment required_package_1 required_package_2 ..., where additional packages separated by spaces can replace the ellipsis. Conda considers all the packages required by the user and creates an isolated environment in which all package versions should be compatible. If the user only needs one particular package, Conda will automatically find and install all of its dependencies into the environment when only that package is specified. Environments can be activated by running conda activate name_of_environment and deactivated with conda deactivate.
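
As a sketch, assuming the tool is available from the Bioconda channel (as many genome-annotation tools are); the environment and package names below are placeholders:

    # Create an isolated environment containing the tool and its dependencies
    conda create -n annotation-env -c bioconda -c conda-forge name_of_tool

    # Use the tool inside the environment, then leave it
    conda activate annotation-env
    name_of_tool --help
    conda deactivate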

Python environments can be created with the built-in venv module or with virtualenv, and they work similarly to Conda environments. One can create a virtual Python environment, activate it, and then install all the Python packages required for a particular tool; those packages won't conflict with anything else on the machine once the environment is deactivated. To create a virtual environment, run virtualenv /path/to/virtual/env (or python -m venv /path/to/virtual/env), where the path is wherever you would like to store the environment. Then activate the environment with source /path/to/virtual/env/bin/activate and deactivate it with deactivate.
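
A minimal sketch (the environment path and package name are placeholders):

    # Create and activate a virtual environment
    python3 -m venv /path/to/virtual/env    # or: virtualenv /path/to/virtual/env
    source /path/to/virtual/env/bin/activate

    # Packages installed now stay isolated inside this environment
    pip install name_of_python_package

    # Leave the environment when finished
    deactivate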

Running long jobs

Some of the tools we recommend will take a long time to run: sometimes hours, sometimes days. If this is the case, you probably want the job to run in the background so that you can do other things on your computer, and to keep running if you exit the terminal (just don't turn your computer off). You can do this with the nohup command, which wraps around any command you want to execute and makes it ignore the hangup signal sent when you log out; appending & to the end of the line sends the job to the background. Nohup can be run using nohup command argument1 argument2 > nohup.out 2>&1 &. Just replace the command and arguments with whatever you would run normally and wrap it with nohup. The output that would normally be written to the screen will now be stored in nohup.out.
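
For example (the command and its arguments are placeholders):

    # Run a long job immune to hangups, send it to the background with '&',
    # and capture both stdout and stderr in nohup.out
    nohup command argument1 argument2 > nohup.out 2>&1 &

    # Check on its progress later
    tail -f nohup.out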
