This version is a major update. The new features and benchmarks are explained in a technical note titled “Improving the performance of ClairS and ClairS-TO with new real cancer cell-line datasets and PoN”. A summary of changes:

1. Starting from this version, ClairS-TO provides two model types. `ssrs` is a model trained initially with synthetic samples and then augmented with real samples (e.g., `ont_r10_dorado_sup_5khz_ssrs`); `ss` is a model trained from synthetic samples (e.g., `ont_r10_dorado_sup_5khz_ss`). The `ssrs` model provides better performance and fits most usage scenarios. The `ss` model can be used when a cancer type missing from model training is a concern. In v0.3.0, four real cancer cell-line datasets (HCC1937, HCC1954, H1437, and H2009) covering two cancer types (breast cancer and lung cancer), published by Park et al., were used for `ssrs` model training.
2. Added support for using CoLoRSdb (Consortium of Long Read Sequencing Database) as a panel of normals (PoN) for tagging non-somatic variants. The idea was inspired by Park et al., 2024. Using CoLoRSdb improved the F1-score by ~10-20% for both SNVs and Indels.
3. Added tagging of Indels located in low-entropy sequence as `LowSeqEntropy`.
4. Added the `--indel_min_af` option and adjusted the default minimum allelic fraction requirement to 0.1 for Indels on the ONT platform.
5. Removed the restriction of Indel calling to only confident and necessary regions (whole genome - GIAB stratification v3.3 all difficult regions + CMRG v1.0 regions). The practice started in v0.1.0, and is deemed unnecessary and removed in v0.4.0. Users can use the `--calling_indels_only_in_these_regions` option to specify Indel calling regions.
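
The three Indel-related changes above (PoN tagging, low-entropy tagging, and the minimum allelic fraction) can be illustrated with a minimal Python sketch. It is not ClairS-TO's implementation. The function names, the variant dictionary layout, the entropy cutoff of 0.9 bits, and the labels `PoN` and `LowAF` are all assumptions made for illustration. Only the `LowSeqEntropy` tag name and the 0.1 default for ONT Indels come from these notes.

```python
import math
from collections import Counter

INDEL_MIN_AF = 0.1       # default for ONT Indels stated in the release notes
ENTROPY_CUTOFF = 0.9     # hypothetical cutoff in bits per base, for illustration only


def shannon_entropy(seq: str) -> float:
    """Return the Shannon entropy (bits per base) of a DNA sequence."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def tag_indel(variant: dict, flank_seq: str, pon_keys: set) -> list:
    """Collect illustrative labels for one Indel candidate.

    variant   : dict with keys chrom, pos, ref, alt, af (hypothetical layout)
    flank_seq : reference sequence around the Indel
    pon_keys  : set of (chrom, pos, ref, alt) tuples loaded from a PoN
    """
    tags = []
    key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
    if key in pon_keys:
        tags.append("PoN")            # hypothetical label: seen in the panel of normals
    if shannon_entropy(flank_seq) < ENTROPY_CUTOFF:
        tags.append("LowSeqEntropy")  # tag name taken from the release notes
    if variant["af"] < INDEL_MIN_AF:
        tags.append("LowAF")          # hypothetical label: below the minimum allelic fraction
    return tags or ["PASS"]


if __name__ == "__main__":
    pon = {("chr1", 1000, "A", "AT")}
    homopolymer_indel = {"chrom": "chr1", "pos": 1000, "ref": "A", "alt": "AT", "af": 0.05}
    print(tag_indel(homopolymer_indel, "AAAAAAAAAA", pon))
```

A homopolymer context such as `AAAAAAAAAA` has an entropy of 0 bits per base, while a sequence that uses all four bases equally often reaches the maximum of 2 bits, so any cutoff between those values separates repetitive from diverse contexts in this sketch.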