Quick start
intronIC is a program that can be used to classify intron sequences as minor (U12-type) or major (U2-type), using a genome and annotation or the sequences themselves.
If you have (or can get) pip, the easiest way to install intronIC is
$ python -m pip install intronIC
If successful, intronIC should now be callable from the command line.
Otherwise, you can clone this repository to your local machine using git:
$ git clone https://github.com/glarue/intronIC.git
$ cd intronIC/intronIC
If you clone the repo, you may also wish to add intronIC/intronIC to your system PATH (how best to do this depends on your platform).
intronIC depends on the following:
- Python >=3.3
- numpy & scipy
- scikit-learn >=0.20.1
- biogl
- matplotlib (optional, required for plotting)
To install dependencies separately using pip, run
python3 -m pip install numpy scipy matplotlib scikit-learn biogl
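To confirm that the dependencies are visible to the interpreter you plan to use, something like the following sketch may help. It is not part of intronIC: the helper names are hypothetical, and the import names are assumptions (scikit-learn is imported as sklearn; biogl is assumed to import under its package name). It checks only that each module can be found, not which version is installed.

```python
import importlib.util
import sys

# Assumed import names for the packages listed above.
# scikit-learn installs under the import name "sklearn".
REQUIRED = ["numpy", "scipy", "sklearn", "biogl"]
OPTIONAL = ["matplotlib"]  # only needed for plotting


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info, minimum=(3, 3)):
    """Return True if the (major, minor) version meets the stated minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    print("Python version OK:", python_is_supported())
    print("Missing required modules:", missing_modules(REQUIRED) or "none")
    print("Missing optional modules:", missing_modules(OPTIONAL) or "none")
```

Run it with the same interpreter you used for pip (e.g. python3), since packages installed for one interpreter are not necessarily visible to another.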
intronIC was built and tested on Linux, but should run on Windows or macOS without too much trouble (I say that now...).
The required arguments for any classification run include a name (-n; see note below), along with:
- Genome (-g) and annotation/BED (-a, -b) files, or
- Intron sequences file (-q) (see Training data and PWMs for formatting information, which matches the reference sequence format)
By default, intronIC includes non-canonical introns, and considers only the longest isoform of each gene. Helpful arguments may include:
- -p: parallel processes, which can significantly reduce runtime
- -f cds: use only CDS features to identify introns (by default, uses both CDS and exon features)
- --no_nc: exclude introns with non-canonical (non-GT-AG/GC-AG/AT-AC) boundaries
- -i: include introns from multiple isoforms of the same gene (default: longest isoform only)
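To make the "non-canonical" distinction concrete, here is a minimal sketch of a boundary check based only on the three dinucleotide pairs named above. It is an illustration, not intronIC's own code: the function name is hypothetical, and it assumes each sequence is the intron alone, in 5'-to-3' orientation, with no flanking exon sequence.

```python
# Terminal dinucleotide pairs (5' end, 3' end) treated as canonical,
# taken from the --no_nc description above.
CANONICAL_BOUNDARIES = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def is_canonical(intron_seq):
    """Return True if the intron starts and ends with a canonical pair.

    Assumes `intron_seq` is the intron only, 5'->3', without exon flanks.
    """
    seq = intron_seq.upper()
    if len(seq) < 4:
        # Too short to have distinct 5' and 3' dinucleotides.
        return False
    return (seq[:2], seq[-2:]) in CANONICAL_BOUNDARIES


# Example: keep only introns with canonical boundaries.
introns = ["GTAAGTTTTCAG", "ATATCCTTTTAC", "GTAAGTTTTTAC"]
canonical_only = [s for s in introns if is_canonical(s)]
```

In this example the third sequence (GT...AC) is dropped, because GT-AC is not one of the three listed pairs.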
To test the installation, change to the test_data subdirectory, which contains Ensembl annotations for chromosome 19 of the human genome, and run:
../intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens
If you just want to retrieve all annotated intron sequences, add the -s flag:
../intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens -s
See the rest of the Wiki for more extensive details about output files, usage info, etc.
By default, intronIC expects names in binomial (genus, species) form separated by a non-alphanumeric character, e.g. 'homo_sapiens', 'homo.sapiens', etc. intronIC then formats that name internally into a tag that it uses to label all output intron IDs, ignoring anything past the second non-alphanumeric character.
Output files, on the other hand, are named using the full name supplied via -n. If you'd prefer to have it leave whatever argument you supply to -n unmodified, use the --na flag.
If you are running multiple versions of the same species and would like to keep the same species abbreviations in the output intron data, simply add a tag to the end of the name, e.g. "homo_sapiens.v2"; the tags within files will be consistent ("HomSap"), but the file names across runs will be distinct.
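The naming behavior described above can be approximated with the sketch below. This is a guess at the rule, not intronIC's actual implementation: the function name is hypothetical, and the "first three letters of each of the first two parts" rule is inferred solely from the "homo_sapiens" / "HomSap" example. Runs of consecutive separators are treated as one, which may differ from what intronIC does.

```python
import re


def species_tag(name):
    """Approximate the internal tag derived from a -n style name.

    Assumption: the tag is the first three letters of the genus and of the
    species, each capitalized; anything after the second part is ignored.
    """
    # Split on any run of non-alphanumeric characters and drop empty pieces.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", name) if p]
    genus_species = parts[:2]  # ignore trailing tags such as "v2"
    return "".join(p[:3].capitalize() for p in genus_species)


# Names that differ only after the species share a tag under this rule.
tags = {n: species_tag(n) for n in ("homo_sapiens", "homo_sapiens.v2")}
```

Under this assumed rule, both names map to the same tag, while the output file names would still differ because they use the full -n argument.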
For genomes with a large number of annotated introns, memory usage can be on the order of gigabytes. This should rarely be a problem even for most modern personal computers, however. For reference, the Ensembl 95 release of the human genome requires ~5 GB of memory.
For many non-model genomes, intronIC should run fairly quickly (e.g. tens of minutes). For human and other very well annotated genomes, runtime may be longer (the human Ensembl 95 release takes ~20-35 minutes in testing). Runtime scales roughly linearly with the total number of annotated introns, and can be reduced by using parallel processes via -p.