Releases: HKU-BAL/ClairS-TO
v0.3.0
This version is a major update. The new features and benchmarks are explained in a technical note titled “Improving the performance of ClairS and ClairS-TO with new real cancer cell-line datasets and PoN”. A summary of changes:

1. Starting from this version, ClairS-TO provides two model types. `ssrs` is a model trained initially with synthetic samples and then augmented with real samples (e.g., `ont_r10_dorado_sup_5khz_ssrs`); `ss` is a model trained from synthetic samples (e.g., `ont_r10_dorado_sup_5khz_ss`). The `ssrs` model provides better performance and fits most usage scenarios. The `ss` model can be used when a cancer type missing from model training is a concern. In v0.3.0, four real cancer cell-line datasets (HCC1937, HCC1954, H1437, and H2009) covering two cancer types (breast cancer and lung cancer), published by Park et al., were used for `ssrs` model training.
2. Added the use of CoLoRSdb (Consortium of Long Read Sequencing Database) as a PoN for tagging non-somatic variants. The idea was inspired by Park et al., 2024. Using CoLoRSdb improved the F1-score by ~10-20% for both SNVs and Indels.
3. Added tagging of Indels in low-entropy sequence as `LowSeqEntropy`.
4. Added the `--indel_min_af` option and adjusted the default minimum allelic fraction requirement to 0.1 for Indels on the ONT platform.
5. Removed the restriction of Indel calling to only confident and necessary regions (whole genome - GIAB stratification v3.3 all difficult regions + CMRG v1.0 regions). The practice started in v0.1.0; it is now deemed unnecessary and is removed in v0.3.0. Users can use the `--calling_indels_only_in_these_regions` option to specify Indel calling regions.
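The notes do not say how sequence entropy is measured or which cutoff triggers `LowSeqEntropy`. As a minimal sketch, assuming the measure is Shannon entropy over the base composition of the sequence around an Indel, the idea might look like the following. The function names and the cutoff are hypothetical and are not taken from ClairS-TO.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per base) of the base composition of seq."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical cutoff, chosen only for illustration; not ClairS-TO's value.
ENTROPY_CUTOFF = 1.0


def is_low_entropy(flanking_seq: str, cutoff: float = ENTROPY_CUTOFF) -> bool:
    """Return True if the sequence around an Indel looks low-complexity."""
    return shannon_entropy(flanking_seq) < cutoff
```

A homopolymer-like window such as `AAAAAAAAAT` scores well below 1 bit per base, while `ACGTACGT` scores 2 bits. Base-composition entropy does not detect dinucleotide repeats such as `ATATATAT`, so a real implementation may well use a different measure.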
v0.2.0
- Added a module called `verdict` to statistically classify a called variant as a germline, somatic, or subclonal somatic variant based on the copy number alteration (CNA) profile and tumor purity estimation. To disable it, use the `--disable_verdict` option. The verdict module is based on ASCAT algorithms and is appropriate to use with tumor purity estimates lower than 0.8.
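To illustrate why copy number and purity help separate these classes, here is a rough sketch that is not the verdict module's actual algorithm. It assumes a diploid normal contaminant, a heterozygous germline variant, and a simple binomial read-count model. All function names, the `mult` (variant copies per tumor cell) parameter, and the `subclonal_fraction` value are hypothetical.

```python
import math


def expected_af(purity: float, tumor_cn: int, mult: int, germline: bool) -> float:
    """Expected alt-allele fraction in a tumor-only sample.

    Assumes normal cells are diploid; a heterozygous germline variant
    contributes one alt copy from every normal cell as well.
    """
    total = purity * tumor_cn + 2.0 * (1.0 - purity)
    alt = purity * mult + ((1.0 - purity) if germline else 0.0)
    return alt / total


def binom_loglik(alt_reads: int, depth: int, af: float) -> float:
    """Binomial log-likelihood of alt_reads out of depth at rate af."""
    af = min(max(af, 1e-6), 1.0 - 1e-6)  # avoid log(0)
    return (math.log(math.comb(depth, alt_reads))
            + alt_reads * math.log(af)
            + (depth - alt_reads) * math.log(1.0 - af))


def classify(alt_reads: int, depth: int, purity: float,
             tumor_cn: int = 2, mult: int = 1,
             subclonal_fraction: float = 0.5) -> str:
    """Pick the label whose expected AF best explains the read counts."""
    clonal = expected_af(purity, tumor_cn, mult, germline=False)
    candidates = {
        "germline": expected_af(purity, tumor_cn, mult, germline=True),
        "somatic": clonal,
        # Illustrative only: a subclone present in a fixed share of tumor cells.
        "subclonal": subclonal_fraction * clonal,
    }
    return max(candidates,
               key=lambda k: binom_loglik(alt_reads, depth, candidates[k]))
```

Under these assumptions, at purity 0.6 and tumor copy number 2 the expected fractions are 0.5 (germline), 0.3 (clonal somatic), and 0.15 (subclonal). As purity approaches 1 the germline and somatic expectations converge, which is consistent with the purity caveat in the note.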
v0.1.0
- Added support for somatic Indel calling. To disable it, use `--disable_indel_calling`. Indels are called only in the BED regions specified by the `--calling_indels_only_in_these_regions` option. The default regions are (whole genome - GIAB stratification v3.3 all difficult regions + CMRG v1.0 regions).
- Added the `--panel_of_normals_require_allele_matching` option, which takes comma-separated booleans indicating whether to require allele matching for each of the PoNs given in `--panel_of_normals`. By default, allele matching is enabled when using germline variant sources (e.g., gnomAD, dbSNP) for non-somatic tagging, and is disabled when using panels (e.g., 1000G PoN).
- Added multiple filters to remove as many spurious calls as possible, including the use of: i. phasing information: how consistently the alternative alleles come from a single haplotype after phasing (Simpson, 2024); ii. ancestral haplotype support: whether an ancestral haplotype can be found for reads that contain the alternative allele (Zheng et al., 2023); iii. BQ and MQ of the alternative allele reads; iv. variant position in read: whether the supporting alleles are gathered at the start or end of reads; v. strand bias; vi. realignment effect: for short reads, whether both the count of supporting alt alleles and the AF decreased after realignment.
- Added `--qual_cutoff_phaseable_region` and `--qual_cutoff_unphaseable_region` to allow different qual cutoffs for tagging (as LowQual) the variants in the phaseable and unphaseable regions. Variants in unphaseable regions are suitable for a higher quality cutoff than those in the phaseable regions.
- Added tags: i. `H` to indicate a variant is found in a phaseable region; ii. `SB` showing the p-value of Fisher’s exact test on strand bias.
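For readers unfamiliar with the test behind the `SB` tag, the sketch below computes a two-sided Fisher's exact test p-value from a 2x2 table of reference/alternative reads by forward/reverse strand. The notes do not specify which table ClairS-TO builds or whether the test is one- or two-sided, so the layout and the function name here are assumptions.

```python
from math import comb


def fisher_exact_two_sided(ref_fwd: int, ref_rev: int,
                           alt_fwd: int, alt_rev: int) -> float:
    """Two-sided Fisher's exact test p-value for a 2x2 strand table."""
    row_ref = ref_fwd + ref_rev
    row_alt = alt_fwd + alt_rev
    col_fwd = ref_fwd + alt_fwd
    total = row_ref + row_alt
    if total == 0:
        return 1.0

    def weight(x: int) -> int:
        # Unnormalised hypergeometric weight of the table that has
        # x reference reads on the forward strand (margins held fixed).
        return comb(row_ref, x) * comb(row_alt, col_fwd - x)

    lo = max(0, col_fwd - row_alt)
    hi = min(row_ref, col_fwd)
    observed = weight(ref_fwd)
    # Sum the weights of all tables no more probable than the observed one.
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / comb(total, col_fwd)
```

Integer weights are compared before the final division, which avoids floating-point ties between equally probable tables. A balanced table such as 10/10 reference and 10/10 alternative reads gives a p-value of 1.0, while alternative reads that all sit on one strand give a small p-value.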
v0.0.2
- Added ONT Guppy 5kHz HAC (`-p ont_r10_guppy_hac_5khz`) and Dorado 4kHz HAC (`-p ont_r10_dorado_hac_4khz`) models; check here for more details.
- Added `FAU`, `FCU`, `FGU`, `FTU`, `RAU`, `RCU`, `RGU`, and `RTU` tags for the count of forward/reverse strand reads supporting A/C/G/T.
- Revamped how panels of normals (PoNs) are input. Population databases are also considered PoNs, and users can disable the default population databases and add multiple other PoNs.
- Added `file` and `md5` information of the PoNs to the VCF output header.
- Enabled somatic variant calling in sex chromosomes.
- Fixed an issue where PoN tagging was missed for low-quality variants.
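The strand-specific count tags added in this release can be pictured with a small sketch. It assumes each read covering a site is reduced to a `(base, is_reverse)` pair; that input shape and the function name are hypothetical, and ClairS-TO's own counting (for example, any base or mapping quality filtering) may differ.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def strand_base_counts(observations: Iterable[Tuple[str, bool]]) -> Dict[str, int]:
    """Count reads per base and strand at one site.

    observations: (base, is_reverse) pairs, one per read covering the site.
    Returns a dict keyed by the tag names from the notes (FAU ... RTU).
    """
    counts: Counter = Counter()
    for base, is_reverse in observations:
        base = base.upper()
        # Ignore anything that is not a single A/C/G/T (e.g. N or gaps).
        if len(base) == 1 and base in "ACGT":
            counts[("R" if is_reverse else "F") + base + "U"] += 1
    return {s + b + "U": counts[s + b + "U"] for s in "FR" for b in "ACGT"}
```

For example, two forward reads with `A`, one reverse read with `A`, and one reverse read with `T` yield `FAU=2`, `RAU=1`, `RTU=1`, and zero for the remaining tags.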
v0.0.1
Initial release for early access.