
Quick start

Graham Larue edited this page May 26, 2020 · 31 revisions

Quick start/testing

First, clone the repo to your local machine:

$ git clone https://github.com/glarue/intronIC.git
$ cd intronIC

You may also wish to add intronIC to your system path (how you do this is platform-dependent).
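As one possible sketch for bash or zsh on Linux or macOS, the clone directory can be appended to PATH. The location $HOME/intronIC is an assumption; substitute wherever you actually cloned the repository.

```shell
# Assumed clone location; change this to match your machine.
INTRONIC_DIR="$HOME/intronIC"

# Append the directory to PATH for the current shell session.
export PATH="$PATH:$INTRONIC_DIR"

# Report whether the shell can now find the script.
command -v intronIC || echo "intronIC not found on PATH; check INTRONIC_DIR"
```

To keep the change across sessions, the export line would typically go in your shell startup file (e.g. ~/.bashrc).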

Dependencies

To install all dependencies using pip, run:

python3 -m pip install numpy scipy matplotlib scikit-learn biogl

intronIC was built and tested on Linux, but should run on Windows or macOS without much trouble (I say that now...).
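To confirm that the pip install above worked, a quick check along the following lines may help. It simply tries to import each dependency and reports the result. Note the assumptions: scikit-learn is imported under the name sklearn, and the other packages are assumed to import under their package names.

```shell
# Try importing each dependency and report which ones are available.
# Assumption: scikit-learn imports as "sklearn"; the rest import under
# the same name used to install them.
CHECKED=0
for mod in numpy scipy matplotlib sklearn biogl; do
  if python3 -c "import $mod" 2>/dev/null; then
    echo "ok       $mod"
  else
    echo "MISSING  $mod"
  fi
  CHECKED=$((CHECKED + 1))
done
```

Any module reported as MISSING can be reinstalled with the same pip command.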

Useful arguments

Every classification run requires a name (-n), along with either:

  1. Genome (-g) and annotation or BED (-a or -b) files, or
  2. An intron sequences file (-q) (see Training data and PWMS for formatting information, which matches the reference sequence format)

By default, intronIC includes non-canonical introns, and considers only the longest isoform of each gene. Helpful arguments may include:

  • -p | parallel processes, which can significantly reduce runtime

  • -f cds | use only CDS features to identify introns (by default, uses both CDS and exon features)

  • --no_nc | exclude introns with non-canonical (non-GT-AG/GC-AG/AT-AC) boundaries

  • -i | include introns from multiple isoforms of the same gene (default: longest isoform only)

  • -v | include introns with overlapping boundaries (e.g. alt. 5'/3' boundaries)
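As an illustration of how these options combine, the sketch below assembles a command for the bundled test data that uses only CDS features, excludes non-canonical introns, and runs in parallel. It assumes that -p is followed by the number of processes (4 is arbitrary) and that you are in the test_data subdirectory. The command is printed rather than executed so it can be inspected first.

```shell
# Input files shipped in the test_data subdirectory.
GENOME=Homo_sapiens.Chr19.Ensembl_91.fa.gz
ANNOTATION=Homo_sapiens.Chr19.Ensembl_91.gff3.gz

# Assumption: -p takes a process count as its value.
PROCS=4

# CDS features only (-f cds), no non-canonical introns (--no_nc).
CMD="../intronIC -g $GENOME -a $ANNOTATION -n homo_sapiens -p $PROCS -f cds --no_nc"
echo "$CMD"
```

Once the printed command looks right, it can be pasted into the terminal and run as-is.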

Running on test dataset

To test the installation, change to the test_data subdirectory, which contains Ensembl 91 genome sequence and annotation files for chromosome 19 of the human genome.

Classify annotated introns

  • ../intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens

Extract all annotated intron sequences

If you just want to retrieve all annotated intron sequences, add the -s flag:

  • ../intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens -s

See the rest of the Wiki for more extensive details about output files, usage info, etc.

Resource usage

For genomes with a large number of annotated introns, memory usage can be on the order of gigabytes. This should rarely be a problem, even on most modern personal computers. For reference, the Ensembl 95 release of the human genome requires ~5 GB of memory.

For many non-model genomes, intronIC should run fairly quickly (e.g. tens of minutes). For human and other very well annotated genomes, runtime may be longer (the human Ensembl 95 release takes ~20-35 minutes in testing). Runtime scales roughly linearly with the total number of annotated introns, and can be reduced by using parallel processes via -p.
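One way to choose a value for -p is to match the number of CPUs on the machine. The sketch below reads that count with getconf, which is available on Linux and macOS, and prints a command template. It assumes -p is followed by a process count. The bracketed names are placeholders, not real file names.

```shell
# Number of online CPUs; falls back to 1 if getconf cannot report it.
PROCS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Print a command template with the detected process count filled in.
echo "../intronIC -g <genome> -a <annotation> -n <name> -p $PROCS"
```

Since memory usage can already reach gigabytes for large annotations, it may be worth starting with fewer processes on memory-constrained machines.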
