This package takes a (minimal) bed file with positive (1) and negative (0) labeled genomic coordinates, calculates their gc content and balances positive and negative labeled entries per chromosome based on their respective GC content.
The negative_training_sampler uses pybedtools to calculate the GC content for the provided regions. Pybedtools depends on bedtools, so make sure bedtools is installed. Otherwise install it manually: bedtools installation guide
Other Dependencies:
dependencies:
- click
- pandas
- pybedtools
- dask
Installing the package:
pip install -e /path/to/repo
The package needs at least a minimal bed file with positive (1) and negative (0) labeled regions in one file, a reference genome in fasta format and the corresponding genome file. The output will be written to stdout or to an output file specified with the '-o' option. For more details see input section.
General use:
negative_input_sampler -i LABEL_FILE -r REFERENCE_FILE -g GENOME_FILE -o OUTPUT_FILE
More advanced use:
negative_input_sampler -i LABEL_FILE -r REFERENCE_FILE -g GENOME_FILE -o OUTPUT_FILE --cores INT --memory [INT]GB
The help message of the command line interface can be accesssed by typing:
negative_training_sampler --help
Help output:
Usage: negative_training_sampler [OPTIONS]
A simple script that takes a tsv file with positive and negative labels
and a reference file. Generates negative samples with the same GC
distribution as the positive samples per chromosome.
Options:
-i, --label-file PATH Input bed file with labeled regions. [required]
-r, --reference-file PATH Input genome reference in fasta format.
[required]
-g, --genome-file PATH Input genome file of reference [required]
-o, --output_file PATH Path to output file.
-n, --label_num INTEGER Number of separate label columns. Default=1
--precision INTEGER Precision of decimals when computing the
attributes like GC content.
-c, --bgzip Output will be bgzipped.
--log PATH Write logging to this file.
--verbose Will print verbose messages.
--seed INTEGER Sets the seed for sampling.
--cores INTEGER Number of used cores default: 1
--memory TEXT Amount of memory per core (e.g. 2 cores * 2GB =
4GB) default: 2GB
--help Show this message and exit.
Input files are in bed format with a minimum of the three required bed field (chrom, chromStart, chromEnd) and up to all 12 bed field. Additionally, at least one label column containing numerical binary labels (0 and 1) is required.
Note: It is recommend to have genomic windows of the same size to minimize sampling bias. One way to generate labeled windows of equal size, is to use Seqdataloader.
Minimal input example:
chr1 191200 191500 0.0
chr1 205950 206250 0.0
chr1 296600 296900 0.0
chr1 354950 355250 0.0
chr1 356950 357250 0.0
chr1 362800 363100 1.0
chr1 362850 363150 1.0
chr1 362900 363200 1.0
Note: Columns are chrom, chromStart, chromEnd, label.
Output contains information from Input (bed fields) and an additional column containing information on the GC content of the region.
example output:
CHR START END label_1 gc
chr1 362800 363100 1.0 38.666702
chr1 362850 363150 1.0 35.666702
chr1 362900 363200 1.0 41.3333
chr1 381450 381750 1.0 74.3333
chr1 381500 381800 1.0 74.6667
chr1 381550 381850 1.0 73.0
chr1 381600 381900 1.0 72.6667
chr1 381650 381950 1.0 71.3333
chr1 381700 382000 1.0 68.6667
Note: The output is written to stdout by default or to a file, if an output files is specified by the -o option.
MIT license
negative_training_sampler
was written by Sebastian Röner.