Description of workflow for the PaintSHOP pipeline.
The pipeline consists of three stages:
- First, several reference files are created, including cleaned input files and various indices.
- Second, the probe design pipeline is run on each chromosome, making use of the reference files.
- Finally, a set of final output files is generated; these constitute the pipeline endpoints.
The first step in the workflow is to parse the raw assembly fasta file and discover chromosome names. Based on these names, some records are excluded from `bowtie2` and `jellyfish` indices to prevent wrongly eliminating probes as non-specific based on alignment and k-mer count results. All but presumptive canonical chromosomes are excluded from probe design, according to the following table:
Type | Identifier | Example | Included in indices | Probes designed |
---|---|---|---|---|
Canonical | * | chr1 | ✔️ | ✔️ |
Unplaced | Un_ | chrUn_KI270386v1 | ✔️ | ❌ |
Unlocalized | _random | chr1_KI270708v1_random | ✔️ | ❌ |
Novel sequence | _alt | chr1_KZ115747v1_alt | ❌ | ❌ |
Alt. haplotype | _hap | chr6_ssto_hap7 | ❌ | ❌ |
Fix patch | _fix | chr1_KN538361v1_fix | ❌ | ❌ |
NOTE: Chromosomes not identified as one of these exceptions are presumed to be canonical chromosomes and treated as such. A record of observed chromosome names and their classifications is generated in the pipeline output directory at 01_reference_files/01_chrom_names/.
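The classification in the table can be sketched as a simple substring lookup. This is a minimal illustration, not the pipeline's actual code: the function name `classify_chrom`, the `RULES` list, and the use of plain substring matching are assumptions; only the identifier strings and the included/designed flags come from the table above.

```python
# Hypothetical sketch: marker strings and flags are taken from the table,
# everything else (names, substring matching) is an illustrative assumption.
RULES = [
    # (type, identifier, included in indices, probes designed)
    ("unplaced", "Un_", True, False),
    ("unlocalized", "_random", True, False),
    ("novel sequence", "_alt", False, False),
    ("alt. haplotype", "_hap", False, False),
    ("fix patch", "_fix", False, False),
]


def classify_chrom(name):
    """Return (type, included_in_indices, probes_designed) for a record name."""
    for chrom_type, marker, in_indices, design_probes in RULES:
        if marker in name:
            return chrom_type, in_indices, design_probes
    # Names matching none of the exceptions are presumed canonical.
    return "canonical", True, True


for example in ["chr1", "chrUn_KI270386v1", "chr1_KZ115747v1_alt"]:
    print(example, classify_chrom(example))
```

Because canonical is the fallback case, any naming scheme not covered by the listed identifiers would be treated as canonical, which matches the behavior described in the note above.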
This step creates a filtered multi-fasta file for building the `bowtie2` and `jellyfish` indices, as well as an individual fasta file for each canonical chromosome so that probes can be designed for each chromosome in parallel.
With canonical chromosomes discovered in the previous step, the provided annotations file is loaded and filtered to include only records where:
- the `seqid` field is identical to one of the presumptive canonical chromosomes
- the `feature` field is equal to `exon`
The remaining records constitute the set of annotations that will be intersected with DNA probes to design the isoform-resolved RNA FISH probes. These annotations are split by chromosome for parallel processing during downstream steps.
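The filter and per-chromosome split described above could look roughly like the following. This is a sketch under stated assumptions: it assumes a tab-delimited, nine-column GTF/GFF-style file in which `seqid` is the first column and `feature` is the third, and the function name `filter_exons` is hypothetical.

```python
def filter_exons(lines, canonical):
    """Keep exon records on canonical chromosomes, grouped by chromosome.

    Assumes GTF/GFF-style lines: nine tab-separated columns with the
    seqid in column 1 and the feature type in column 3.
    """
    by_chrom = {}
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # malformed record
        seqid, feature = fields[0], fields[2]
        if seqid in canonical and feature == "exon":
            by_chrom.setdefault(seqid, []).append(fields)
    return by_chrom
```

Grouping by `seqid` while filtering yields one record list per chromosome, which is the shape needed for per-chromosome parallel processing.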
For RNA FISH probe design, when it is known which isoform(s) should be targeted, the annotation file is useful as is. However, it is often desirable to obtain RNA FISH probes for a particular target without specifying isoform information.
For instance, probes designed against an exon that appears only in a very rare isoform are unlikely to be useful against most of the transcripts for this target. To remedy this, this step implements an algorithm that collapses exon annotations to those segments shared by the maximal number of isoforms, when possible.
This step generates an additional annotation file with isoforms flattened to shared segments. Each of these annotation files is intersected with the DNA probe set to produce the corresponding (isoform-resolved or isoform-flattened) RNA probe set.
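One way to picture the flattening is a sweep over exon boundaries that counts how many isoforms cover each segment and keeps the segments with the highest count. The sketch below illustrates that idea only; the pipeline's actual algorithm may differ. The function name `shared_segments`, the half-open `(start, end)` coordinates, and the assumption that exons within a single isoform do not overlap are all illustrative.

```python
def shared_segments(isoforms):
    """Find segments covered by the largest number of isoforms.

    isoforms: dict mapping isoform id -> list of (start, end) exon
    intervals, half-open, non-overlapping within each isoform.
    Returns (max_count, [(start, end), ...]).
    """
    events = []
    for exons in isoforms.values():
        for start, end in exons:
            events.append((start, 1))   # an isoform starts covering here
            events.append((end, -1))    # an isoform stops covering here
    # At equal positions, -1 sorts before +1, so touching half-open
    # intervals are not counted as overlapping.
    events.sort()

    segments = []  # (start, end, number of isoforms covering the segment)
    depth = 0
    prev = None
    for pos, delta in events:
        if prev is not None and pos > prev and depth > 0:
            segments.append((prev, pos, depth))
        depth += delta
        prev = pos

    if not segments:
        return 0, []
    best = max(count for _, _, count in segments)
    return best, [(s, e) for s, e, count in segments if count == best]


example = {
    "isoform_a": [(100, 200), (300, 400)],
    "isoform_b": [(150, 200), (300, 350)],
    "isoform_c": [(300, 400), (500, 600)],
}
print(shared_segments(example))
```

When no segment is shared (the maximum count is 1), this sketch simply returns every exon segment, which is one possible reading of "when possible" in the description above.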
After candidate probe sequences are mined from the genome, they are analyzed and scored for efficiency and specificity. As part of this process, candidate probe sequences are aligned to the reference genome using the Bowtie2 NGS aligner with very sensitive parameters. A k-mer frequency analysis is also performed using the jellyfish k-mer counter. Both of these tools require building an index from the genome before querying with candidate probe sequences.
Both indices are built from the filtered multi-fasta file generated upstream. When the pipeline is executed on a computing cluster, or when multiple cores are provided, these index-building steps run in parallel with the mining of candidate probe sequences.
Candidate probe sequences are mined from the genome using OligoMiner with "newBalance" parameter values. For more information on the mining of candidate probe sequences, see the OligoMiner publication.
After mining candidate probe sequences, a series of steps is performed to score candidate sequences for specificity. These steps are described in depth in the PaintSHOP pre-print.
Briefly, probes are aligned to the genome using bowtie2, pairwise alignments are reconstructed using sam2pairwise, and k-mer frequency is determined using jellyfish. A gradient boosting regression model implemented with XGBoost generates quantitative predictions about the likelihood of the candidates hybridizing with sequences other than their intended target in the genome using a thermodynamic partition function. These scores are aggregated into an on-target and off-target score for each probe in the set. Here is a schematic overview of the machine learning pipeline:
After the pipeline is run on each chromosome, DNA FISH probes exist as per-chromosome .tsv files. These files are merged into a single file which constitutes the DNA FISH probe .tsv output file. This file is also used in subsequent steps.
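The merge itself amounts to concatenating the per-chromosome files. A minimal sketch follows, assuming the per-chromosome files carry no header row (if they did, headers after the first would need to be skipped); the function name `merge_tsvs` and the file names in the usage comment are hypothetical.

```python
def merge_tsvs(paths, out_path):
    """Concatenate per-chromosome .tsv files into one file, in the given order.

    Assumes the input files have no header row.
    """
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    # Guard against a missing newline at the end of a file.
                    out.write(line if line.endswith("\n") else line + "\n")


# Example usage with hypothetical file names:
# merge_tsvs(["chr1.tsv", "chr2.tsv"], "merged_dna_probes.tsv")
```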
The merged DNA probes are intersected with both the isoform-resolved and isoform-flattened annotation files, which generates the two RNA probe sets. For more information on these probe sets, see the output file specification.
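Conceptually, the intersection assigns each DNA probe to the annotated segments it falls in. The sketch below illustrates this with plain tuples. It assumes a probe must lie entirely within a segment to count as a hit, which may not match the pipeline's actual overlap criterion, and the function name `intersect_probes` and the tuple layouts are hypothetical. A dedicated interval tool would typically replace the nested loop at genome scale.

```python
def intersect_probes(probes, annotations):
    """Pair probes with the annotated segments that fully contain them.

    probes: iterable of (chrom, start, stop, sequence)
    annotations: iterable of (chrom, start, stop, target_id)
    Returns a list of (target_id, probe) pairs.
    """
    # Index annotations by chromosome so each probe is only compared
    # against segments on its own chromosome.
    by_chrom = {}
    for chrom, start, stop, target in annotations:
        by_chrom.setdefault(chrom, []).append((start, stop, target))

    hits = []
    for probe in probes:
        chrom, start, stop = probe[0], probe[1], probe[2]
        for a_start, a_stop, target in by_chrom.get(chrom, []):
            # Assumption: full containment is required for a hit.
            if a_start <= start and stop <= a_stop:
                hits.append((target, probe))
    return hits
```

Running the same function once with the isoform-resolved segments and once with the isoform-flattened segments would give the two RNA probe sets described above.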
Each of the three completed probe sets is also compressed into a zip archive for convenience. These archives are the files made available as downloads in the PaintSHOP Resources repo, and together they contain the complete set of generated DNA and RNA FISH probes.