CAST-Seq (chromosomal aberrations analysis by single targeted LM-PCR sequencing) is a novel method that detects and quantifies chromosomal aberrations arising from the on- and off-target activity of CRISPR-Cas nucleases or TALENs. See Turchiano et al. for detailed information about the CAST-Seq background and its potential clinical applications.
Original CAST-Seq pipeline: Turchiano et al., Cell Stem Cell, 2021
T-CAST pipeline: Rhiel et al., Front. Genome Ed., 2023
D-CAST pipeline: Klermund et al., Molecular Therapy, 2024
General: The code in this repository is the official bioinformatics pipeline for processing FASTQ files generated by CAST-Seq, T-CAST and D-CAST.
Structure of results: The results directory is divided into three sub-directories: fastq_aln, guide_aln and random.
fastq_aln contains all pre-processing and alignment files from fastq.gz to bam files.
guide_aln contains the post-processing files from bed files to final xlsx report.
random contains the information related to the random regions that are used for normalisation.
Required software and databases
- R (3.4.2 or later)
- BBmap and BBmerge (38.22)
- Bowtie2
- samtools (1.9)
- bedtools (2.27.1)
- fastQC
- seqtk
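Before running the pipeline, it can help to confirm that the listed tools are reachable from the shell. The following is a minimal sketch of such a check; the executable names (for example `bbmap.sh`, `bbmerge.sh` and `fastqc`) are assumptions about a typical installation and may differ on your system, and the helper `find_missing_tools` is hypothetical, not part of the pipeline.

```shell
# Print the name of every given command that cannot be found on PATH.
# Hypothetical helper, not part of the CAST-Seq scripts.
find_missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Executable names are assumptions; adjust them to your installation.
missing=$(find_missing_tools Rscript bbmap.sh bbmerge.sh bowtie2 samtools bedtools fastqc seqtk)

if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing"
else
  echo "All listed tools were found on PATH."
fi
```

This only checks presence on PATH, not versions; the versions noted in the list still need to be compared by hand.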
Annotation for bowtie2
- bowtie2Index
- genome fasta
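If no prebuilt index is at hand, the index and FASTA index files could be generated from the genome FASTA with `bowtie2-build` and `samtools faidx`. The sketch below only prints the commands by default (set `RUN=1` to execute them); the path and the `genome` prefix are taken from the directory layout in this README, while the helpers `run` and `build_index` are hypothetical. Whether the pipeline expects an index built this way rather than a downloaded one is an assumption.

```shell
# Print a command (dry run, the default) or execute it when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Build the Bowtie2 index and the FASTA index next to the genome FASTA.
build_index() {
  fasta=$1                 # e.g. annotations/human/bowtie2Index/genome.fa
  prefix=${fasta%.fa}      # index prefix: same path without the .fa suffix
  run bowtie2-build "$fasta" "$prefix"
  run samtools faidx "$fasta"
}

build_index annotations/human/bowtie2Index/genome.fa
```

Building an index for a full human genome takes considerable time and memory, so the dry run is a convenient way to review the commands first.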
R packages
Additional Files (provided in annotation folder)
- hg38_TSS_TES.txt (Bed file containing TSS and TES as start and end locations respectively)
- chrom.sizes (chromosome size file)
- TruSeq4-PE (adapter sequences)
- CancerGenesList_ENTREZ.txt (OncoKB cancer related genes)
Where available, the versions used are noted.
Histone mark BED files (must be stored in annotations/histones/). An example file is provided for H3K4me3 in primary hematopoietic stem cells (E035 from Roadmap Epigenomics).
├── annotations
│   └── human
│       ├── bowtie2Index
│       │   ├── genome.1.bt2
│       │   ├── genome.2.bt2
│       │   ├── genome.3.bt2
│       │   ├── genome.4.bt2
│       │   ├── genome.fa.fai
│       │   ├── genome.rev.1.bt2
│       │   ├── genome.rev.2.bt2
│       │   └── genome.fa
│       ├── histones
│       │   ├── H3K4me3.bed
│       │   └── ...
│       ├── CancerGenesList_ENTREZ.txt
│       ├── hg38_TSS_TES.txt
│       ├── chrom.sizes
│       └── TruSeq4-PE.fa
├── samples
│   ├── XXX
│   │   ├── data
│   │   │   ├── fastq
│   │   │   │   ├── XXX_treated_R1_001.fastq.gz
│   │   │   │   ├── XXX_treated_R2_001.fastq.gz
│   │   │   │   ├── XXX_UNtreated_R1_001.fastq.gz
│   │   │   │   └── XXX_UNtreated_R2_001.fastq.gz
│   │   │   ├── gRNA.fa
│   │   │   ├── headTOhead.fa
│   │   │   ├── linker_RC.fa
│   │   │   ├── linker.fa
│   │   │   ├── mispriming.fa
│   │   │   ├── neg.fa
│   │   │   ├── ots.bed
│   │   │   └── pos.fa
│   │   └── results
│   │       ├── fastq_aln
│   │       ├── guide_aln
│   │       └── random
│   ├── YYY
│   │   ├── data
│   │   │   ├── fastq
│   │   │   └── ...
│   │   └── ...
│   └── ...
└── script
    ├── run
    │   ├── XXX.R
    │   ├── YYY.R
    │   └── ...
    ├── annotateGenes.R
    ├── bed2sequence.R
    ├── bedTools_fct.R
    └── ...
Remark: FASTQ files can also be stored in separate directories; in that case, use --tfastqD and --ufastqD to set the paths.
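The input part of this layout can be prepared for a new sample with a few `mkdir` calls. The sketch below is one possible way to do it; the helper `make_sample_skeleton` is hypothetical, and it assumes that the results sub-directories are written by the pipeline itself and therefore do not need to be created by hand.

```shell
# Create the input directories for one sample below a home directory.
# Hypothetical helper; only the directory names come from the layout above.
make_sample_skeleton() {
  home_dir=$1   # directory that contains annotations/, samples/ and script/
  sample=$2     # sample directory name, e.g. XXX
  mkdir -p "$home_dir/samples/$sample/data/fastq"
}

# Demonstration in a throwaway directory, so nothing in the project is touched.
demo_home=$(mktemp -d)
make_sample_skeleton "$demo_home" XXX
find "$demo_home" -type d | sort
```

The FASTQ files then go into data/fastq, and the FASTA and BED inputs (gRNA.fa, ots.bed, linker.fa and so on) into data.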
Once all tools and databases are installed and working properly, the whole CAST-Seq pipeline can be run with a single command (see the examples in script/run/):
Rscript ./CAST-Seq.R --pipeline "crispr" \
    --pname "G3_TOY_test" \
    --sampleDname "G3_TOY" \
    --tsamp "G3_treated" \
    --usamp "G3_UNtreated" \
    --homeD "../../"
--pipeline name of the pipeline you want to use. So far, "crispr", "talen" and "crispr2" are available for CAST-Seq, T-CAST and D-CAST respectively.
--pname name of current project sample
--sampleDname name of sample directory
--tsamp XXX name of test (treated) file. XXX_R1_001.fastq.gz AND XXX_R2_001.fastq.gz should exist
--usamp XXX name of control (untreated) file. XXX_R1_001.fastq.gz AND XXX_R2_001.fastq.gz should exist
--homeD name of home directory
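Because --tsamp and --usamp are resolved to paired files named XXX_R1_001.fastq.gz and XXX_R2_001.fastq.gz, a quick pre-flight check can catch naming mistakes before a long run. The following is a minimal sketch; the helper `missing_fastq` is hypothetical and the example path assumes the default layout (FASTQ files under samples/&lt;sampleDname&gt;/data/fastq).

```shell
# Print every FASTQ file that is expected for the given sample names but is missing.
# Hypothetical helper, not part of the CAST-Seq scripts.
missing_fastq() {
  fastq_dir=$1
  shift
  for samp in "$@"; do
    for mate in R1 R2; do
      f="$fastq_dir/${samp}_${mate}_001.fastq.gz"
      [ -f "$f" ] || printf '%s\n' "$f"
    done
  done
}

# Example with the names used in the command above; prints nothing if all four files exist.
missing_fastq samples/G3_TOY/data/fastq G3_treated G3_UNtreated
```

If the treated and untreated files live in different directories (--tfastqD and --ufastqD), the function can simply be called once per directory.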
Additional parameters can be changed in the command above. Here is a description of these parameters:
--tfastqD name of directory containing the fastq (tsamp) files
--ufastqD name of directory containing the fastq (usamp) files
--grna name of gRNA fasta (default "gRNA.fa")
--onTarget name of ON-target bed file (default "ots.bed")
--otsDistance distance (bp) from the ON-target. Reads within +/- this distance of the ON-target are removed (default 50)
--surrounding_size distance (bp) from the ON-target. Used for the scoring system (default 20000)
--flank1 name of first flanking sequence (default "flank1.fa")
--flank2 name of second flanking sequence (default "flank2.fa")
--flankingSize distance to consider for homology-mediated translocations (HMT) (default 2500)
--random number of random sequences to generate (default 10000)
--width distance to extend the putative sites (default 250)
--distCutoff distance to merge hits together (default 1500)
--pvCutoff pvalue threshold (default 0.05)
--scoreCutoff gRNA alignment score threshold (default NULL)
--hitsCutoff minimum number of hits per site (default 1)
--distCov distance from the maximum covered bin from where the gRNA will be aligned
--saveReads whether the FASTQ sequences of the reads should be saved (default "no")
--species name of sample species (default "hg"). So far, only hg and mm (mouse) can be used
--ovl number of samples to be considered in the overlap analysis (default 1)
--signif number of significant samples to be considered in the overlap analysis (default 1)
--cpu number of CPUs (default 2; at least 4 is advised)
--pythonPath python path (default "/usr/bin/python")
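As an illustration of how an optional parameter is appended to the basic command, the sketch below prints a command line that raises --cpu to the advised minimum of 4. The helper `castseq_command` is hypothetical, the project name is derived from the sample directory purely for the example, and passing the CPU count unquoted is an assumption about how the script parses numeric options.

```shell
# Print a CAST-Seq command line for one sample with --cpu overridden.
# Hypothetical helper; only the flags come from the parameter lists in this README.
castseq_command() {
  sample_dir=$1   # value for --sampleDname
  treated=$2      # value for --tsamp
  untreated=$3    # value for --usamp
  cpus=${4:-4}    # value for --cpu, defaulting to the advised minimum of 4
  printf 'Rscript ./CAST-Seq.R --pipeline "crispr" --pname "%s" --sampleDname "%s" --tsamp "%s" --usamp "%s" --homeD "../../" --cpu %s\n' \
    "${sample_dir}_test" "$sample_dir" "$treated" "$untreated" "$cpus"
}

castseq_command G3_TOY G3_treated G3_UNtreated
```

Printing the command first makes it easy to review it, or to paste it into a run script under script/run/.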
These parameters are only used when --pipeline "talen" is set.
--grnaR name of gRNA (RIGHT) fasta file
--grnaL name of gRNA (LEFT) fasta file
These parameters are only used when --pipeline "crispr_overlap" or "talen_overlap" is set.
--ovlDname name of overlap directory within sample directory
--ovlName name of overlap sample within overlap directory
--replicates name of sample to be used in the overlap analysis
--repNames labels of the replicates to be used in the overlap analysis
--repDname name of a representative replicate (used to find the appropriate replicate files)
--ovl number of significant samples to be considered in the overlap analysis
- Geoffroy Andrieux
- Giandomenico Turchiano
This software is released under the AGPL3 license.
We thank all members of our laboratories for constructive discussions and suggestions.
