Skip to content

Commit

Permalink
readme updated
Browse files Browse the repository at this point in the history
  • Loading branch information
Felix Beaudry committed Apr 12, 2022
1 parent 4119dc2 commit f06bc98
Showing 1 changed file with 76 additions and 51 deletions.
127 changes: 76 additions & 51 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,8 +10,7 @@ Homologous Recombination Deficiency (HRD) Prediction Workflow using sig.tools
* [tabix 1.9](http://www.htslib.org/doc/tabix.html)
* [bcftools 1.9](https://samtools.github.io/bcftools/bcftools.html)
* [sigtools 0.0.0.9000](https://github.com/Nik-Zainal-Group/signature.tools.lib)
* [grch38-alldifficultregions 3.0](https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/genome-stratifications/v3.0/GRCh38/union/)
* [hg38 p12](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.38/)
* [bis-rlibs 0.1](https://ggplot2.tidyverse.org/)


## Usage
Expand All @@ -30,11 +29,12 @@ Parameter|Value|Description
`smallsVcfFile`|File|Input VCF file of SNV and indels (small mutations) (eg. from mutect2)
`smallsVcfIndex`|File|Index of input VCF file of SNV and indels
`segFile`|File|File for segmentations, used to estimate number of segments in Loss of heterozygosity (LOH) (eg. from sequenza)
`indelVAF`|Int|Variant Allele Frequency for filtering of indel mutations
`snvVAF`|Int|Variant Allele Frequency for filtering of SNVs
`indelVAF`|String|Variant Allele Frequency for filtering of indel mutations
`snvVAF`|String|Variant Allele Frequency for filtering of SNVs
`tissue`|String|Cancerous-tissue of origin
`rScript`|String|Temporary variable to call the .R script containing sigtools, will be modulated. default: ~/sigtools_workflow/sigTools_runthrough.R
`sampleName`|String|Name of sample matching the tumor sample in .vcf
`plotIt`|Boolean|Create plots of sigtools results


#### Optional workflow parameters:
Expand All @@ -47,21 +47,20 @@ Parameter|Value|Default|Description
---|---|---|---
`filterStructural.basename`|String|basename("~{structuralVcfFile}",".vcf.gz")|Base name
`filterStructural.modules`|String|"bcftools/1.9"|Required environment modules
`filterStructural.structuralVAF`|Int|0|VAF for structural variants
`filterStructural.jobMemory`|Int|5|Memory allocated for this job (GB)
`filterStructural.threads`|Int|1|Requested CPU threads
`filterStructural.timeout`|Int|1|Hours before task timeout
`filterINDELs.basename`|String|basename("~{smallsVcfFile}",".vcf.gz")|Base name
`filterINDELs.modules`|String|"gatk/4.2.0.0 tabix/1.9 bcftools/1.9 grch38-alldifficultregions/3.0 hg38/p12"|Required environment modules
`filterINDELs.modules`|String|"gatk/4.2.0.0 tabix/1.9 bcftools/1.9 hg38/p12 grch38-alldifficultregions/3.0"|Required environment modules
`filterINDELs.genome`|String|"$HG38_ROOT/hg38_random.fa"|Path to loaded genome
`filterINDELs.difficultRegions`|String|"$GRCH38_ALLDIFFICULTREGIONS_ROOT/GRCh38_alldifficultregions.bed"|Path to loaded difficult regions to align to
`filterINDELs.difficultRegions`|String?|None|Path to .bed of difficult regions to align to, string must include the --exclude-intervals flag, eg: --exclude-intervals $GRCH38_ALLDIFFICULTREGIONS_ROOT/GRCh38_alldifficultregions.bed
`filterINDELs.jobMemory`|Int|10|Memory allocated for this job (GB)
`filterINDELs.threads`|Int|1|Requested CPU threads
`filterINDELs.timeout`|Int|2|Hours before task timeout
`filterSNVs.basename`|String|basename("~{smallsVcfFile}",".vcf.gz")|Base name
`filterSNVs.modules`|String|"gatk/4.2.0.0 tabix/1.9 bcftools/1.9 grch38-alldifficultregions/3.0 hg38/p12"|Required environment modules
`filterSNVs.genome`|String|"$HG38_ROOT/hg38_random.fa"|Path to loaded genome
`filterSNVs.difficultRegions`|String|"$GRCH38_ALLDIFFICULTREGIONS_ROOT/GRCh38_alldifficultregions.bed"|Path to loaded difficult regions to align to
`filterSNVs.difficultRegions`|String?|None|Path to .bed of difficult regions to align to, string must include the --exclude-intervals flag, eg: --exclude-intervals $GRCH38_ALLDIFFICULTREGIONS_ROOT/GRCh38_alldifficultregions.bed
`filterSNVs.jobMemory`|Int|10|Memory allocated for this job (GB)
`filterSNVs.threads`|Int|1|Requested CPU threads
`filterSNVs.timeout`|Int|2|Hours before task timeout
Expand All @@ -70,76 +69,102 @@ Parameter|Value|Default|Description
`convertSegFile.threads`|Int|1|Requested CPU threads
`convertSegFile.timeout`|Int|1|Hours before task timeout
`hrdResults.modules`|String|"sigtools/0.0.0.9000"|Required environment modules
`hrdResults.sigtoolsBootstrap`|Int|2500|Number of bootstraps for sigtools
`hrdResults.jobMemory`|Int|20|Memory allocated for this job (GB)
`hrdResults.threads`|Int|1|Requested CPU threads
`hrdResults.timeout`|Int|2|Hours before task timeout
`plotResults.modules`|String|"bis-rlibs/0.1"|Required environment modules
`plotResults.jobMemory`|Int|20|Memory allocated for this job (GB)
`plotResults.threads`|Int|1|Requested CPU threads
`plotResults.timeout`|Int|2|Hours before task timeout


### Outputs

Output | Type | Description
---|---|---
`sigToolsOutput`|File|point estimate and bootstraped confidence intervals for HRD from sigtools
`indelFilteringReport`|File|counts of INDELs pre and post filtering
`snvFilteringReport`|File|counts of SNVs pre and post filtering
`structuralFilteringReport`|File|counts of structural variants pre and post filtering
`sigTools_hrd_Output`|File|point estimate and bootstraped confidence intervals for HRD from sigtools
`sigTools_model_Output`|File|parameters raw values and weights for estimation of HRD from sigtools
`sigTools_sigs_Output`|File|signature breakdown from sigtools
`sigTools_sigs_plot_Output`|File?|plot of signature breakdown from sigtools
`sigTools_hrd_plot_Output`|File?|plot of point estimate and bootstraped confidence intervals for HRD from sigtools


## Commands
This section lists commands run by sigtools_workflow. It launches four tasks to make input files for sigtools, and then runs the sigtools packages in an .R script.

### Convert structural variant .vcf to .bedpe

This section lists commands run by sigTooler workflow. It launches four tasks to make input files for sigtools, and then runs the sigtools packages in an .R script. The first four tasks convert file formats:

echo -e "chrom1\tstart1\tend1\tchrom2\tstart2\tend2\tsample\tsvclass" >~{basename}.bedpe
1. take a .vcf of structural variants and turns it into a
.bedpe
2. take in a .vcf of small variants and split it into
one vcf for indels and one vcf for SNVs (this is two tasks)
4. take a seg file and convert it into a CNA seg file.

$BCFTOOLS_ROOT/bin/bcftools query -f "%CHROM\t%POS\t%INFO/END\t%FILTER\t%ALT\t%INFO/CIPOS\t%INFO/CIEND\t[%DR\t]\t[%DV\t]\t[%RR\t]\t[%RV\t]\n" ~{structuralVcfFile} | \
awk '$5 !~ ":" {print}' | \
awk '$4 ~ "PASS" {print}' | \
awk -v VAF=0.~{structuralVAF} '($10+$14)/($8+$10+$12+$14) > VAF {print}' | \
awk -v sampleName=~{sampleName} 'split($6,a,",") split($7,b,",") {print $1"\t"$2+a[1]-1"\t"$2+a[2]"\t"$1"\t"$3+b[1]-1"\t"$2+b[2]"\t"sampleName"\t"$5} ' | \
sed 's/<//g; s/>//g' >>~{basename}.bedpe
There is an optional sixth step to plot results.

### Pull INDELs from small variants .vcf

gatk SelectVariants \
-V ~{smallsVcfFile} \
-R ~{genome} \
--exclude-intervals ~{difficultRegions} \
--select-type-to-include INDEL \
-O ~{basename}.INDEL.vcf
### Convert structural variant .vcf to .bedpe

$BCFTOOLS_ROOT/bin/bcftools filter -i "(FORMAT/AD[0:1])/(FORMAT/AD[0:0]+FORMAT/AD[0:1]) >= 0.~{indelVAF}" ~{basename}.INDEL.vcf >~{basename}.INDEL.VAF.vcf
echo -e "chrom1\tstart1\tend1\tchrom2\tstart2\tend2\tsample\tsvclass" >~{basename}.bedpe

bgzip ~{basename}.INDEL.VAF.vcf
$BCFTOOLS_ROOT/bin/bcftools view -f 'PASS' ~{structuralVcfFile} |\
$BCFTOOLS_ROOT/bin/bcftools filter -e 'INFO/SVTYPE = "BND"' |\
$BCFTOOLS_ROOT/bin/bcftools query -f "%CHROM\t%POS\t%INFO/END\t%FILTER\t%INFO/SVTYPE\t%INFO/CIPOS\t%INFO/CIEND\n" |\
awk -v sampleName=~{sampleName} 'split($6,a,",") split($7,b,",") {print $1"\t"$2+a[1]-1"\t"$2+a[2]"\t"$1"\t"$3+b[1]-1"\t"$2+b[2]"\t"sampleName"\t"$5} ' >>~{basename}.bedpe

tabix -p vcf ~{basename}.INDEL.VAF.vcf.gz

### Pull SNPs from small variants .vcf
awk '$1 !~ "#" {print}' ~{structuralVcfFile} | wc -l >~{sampleName}.structural.filteringReport.txt
awk '$1 !~ "#" {print}' ~{basename}.bedpe | wc -l >>~{sampleName}.structural.filteringReport.txt

gatk SelectVariants \
-V ~{smallsVcfFile} \
-R ~{genome} \
--exclude-intervals ~{difficultRegions} \
--select-type-to-include SNP \
-O ~{basename}.SNP.vcf
### Pull INDELs from small variants .vcf

$BCFTOOLS_ROOT/bin/bcftools filter -i "(FORMAT/AD[0:1])/(FORMAT/AD[0:0]+FORMAT/AD[0:1]) >= 0.~{snvVAF}" ~{basename}.SNP.vcf >~{basename}.SNP.VAF.vcf
gatk SelectVariants \
-V ~{smallsVcfFile} \
-R ~{genome} ~{difficultRegions} \
--select-type-to-include INDEL \
-O ~{basename}.INDEL.vcf

bgzip ~{basename}.SNP.VAF.vcf
$BCFTOOLS_ROOT/bin/bcftools view -f 'PASS,clustered_events' ~{basename}.INDEL.vcf | $BCFTOOLS_ROOT/bin/bcftools filter -i "(FORMAT/AD[0:1])/(FORMAT/AD[0:0]+FORMAT/AD[0:1]) >= 0.~{indelVAF}" >~{basename}.INDEL.VAF.vcf

tabix -p vcf ~{basename}.SNP.VAF.vcf.gz
bgzip ~{basename}.INDEL.VAF.vcf
tabix -p vcf ~{basename}.INDEL.VAF.vcf.gz


### Convert .seg file to ASCAT format
zcat ~{smallsVcfFile} | awk '$1 !~ "#" {print}' | wc -l >~{sampleName}.INDEL.filteringReport.txt
awk '$1 !~ "#" {print}' ~{basename}.INDEL.vcf | wc -l >>~{sampleName}.INDEL.filteringReport.txt
zcat ~{basename}.INDEL.VAF.vcf.gz | awk '$1 !~ "#" {print}' | wc -l >>~{sampleName}.INDEL.filteringReport.txt

set -euo pipefail
### Pull SNPs from small variants .vcf
gatk SelectVariants \
-V ~{smallsVcfFile} \
-R ~{genome} ~{difficultRegions} \
--select-type-to-include SNP \
-O ~{basename}.SNP.vcf

echo -e "seg_no\tChromosome\tchromStart\tchromEnd\ttotal.copy.number.inNormal\tminor.copy.number.inNormal\ttotal.copy.number.inTumour\tminor.copy.number.inTumour" >~{basename}_segments.cna.txt
$BCFTOOLS_ROOT/bin/bcftools view -f 'PASS,clustered_events' ~{basename}.SNP.vcf | $BCFTOOLS_ROOT/bin/bcftools filter -i "(FORMAT/AD[0:1])/(FORMAT/AD[0:0]+FORMAT/AD[0:1]) >= 0.~{snvVAF}" >~{basename}.SNP.VAF.vcf

bgzip ~{basename}.SNP.VAF.vcf

tabix -p vcf ~{basename}.SNP.VAF.vcf.gz

zcat ~{smallsVcfFile} | awk '$1 !~ "#" {print}' | wc -l >~{sampleName}.SNP.filteringReport.txt
awk '$1 !~ "#" {print}' ~{basename}.SNP.vcf | wc -l >>~{sampleName}.SNP.filteringReport.txt
zcat ~{basename}.SNP.VAF.vcf.gz | awk '$1 !~ "#" {print}' | wc -l >>~{sampleName}.SNP.filteringReport.txt

### Convert .seg file to ASCAT format
echo -e "seg_no\tChromosome\tchromStart\tchromEnd\ttotal.copy.number.inNormal\tminor.copy.number.inNormal\ttotal.copy.number.inTumour\tminor.copy.number.inTumour" >~{basename}_segments.cna.txt

tail -n +2 ~{segFile} | \
awk 'split($1,a,"\"") split(a[2],b,"chr") {print NR"\t"b[2]"\t"$2"\t"$3"\t"2"\t"1"\t"$10"\t"$12}' >>~{basename}_segments.cna.txt

### Run the sigtools package in R

Rscript --vanilla ~{rScript}_runthrough.R ~{sampleName} ~{tissue} ~{snvVcfFiltered} ~{indelVcfFiltered} ~{structuralBedpeFiltered} ~{lohSegFile} ~{sigtoolsBootstrap}

### Run plotting in R (optional)

Rscript --vanilla ~{rScript}_plotter.R ~{sampleName}

tail -n +2 ~{segFile} | \
awk 'split($1,a,"\"") split(a[2],b,"chr") {print NR"\t"b[2]"\t"$2"\t"$3"\t"2"\t"1"\t"$10"\t"$12}' >>~{basename}_segments.cna.txt

### Run the sigtools package in R
Rscript --vanilla ~{rScript} ~{sampleName} ~{tissue} ~{snvVcfFiltered} ~{indelVcfFiltered} ~{structuralBedpeFiltered} ~{lohSegFile}

## Support

For support, please file an issue on the [Github project](https://github.com/oicr-gsi) or send an email to [email protected] .
Expand Down

0 comments on commit f06bc98

Please sign in to comment.