Scalability issues #24
As discussed in the video call, we have been experimenting with different formats here at deCODE. Unfortunately, I cannot share the data files, but one very visible observation we made regarding Savvy is that between 75k samples and 100k samples there is a huge jump in file size: it more than doubles. After that, the Savvy file size always seems to be very close to BCF.

I suspect that the mechanism for determining when to use sparse vectors has a bug and does not cope with the larger sample sizes, resulting in an uncompressed fallback.

Comments
Thanks for reporting. Can you provide the command line options you used for generating the SAV files and the format fields present in the files you are evaluating (e.g., GT:DP:AD:GQ:PL)? I can try to reproduce this trend using data I have access to. If I remember correctly, you were experiencing a crash when enabling PBWT. Did you use v2.0.1 from the releases section, or did you use a commit from the master branch? If the former, updating to the latest from master may resolve this issue, since a lot has been fixed since v2.0.1. I'm actually a bit surprised that SAV does so much better than BCF for the smaller datasets. Sparse vectors are only used for fields that have mostly zero values (GT, DS, HDS, etc.). PBWT is necessary for DP, AD, GQ, and PL to compress well in the current version of SAV.
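[Editor's note: to make the sparse-vector point concrete, here is a toy illustration of why mostly-zero fields shrink when stored as (offset, value) pairs. This is a sketch of the idea, not savvy's actual data structure.]

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative only: stores a mostly-zero FORMAT field (e.g., DS across
// all samples at one site) as (offset, value) pairs instead of a dense
// array. With, say, 1% non-zero entries, the pairs occupy a few percent
// of the dense size even before zstd sees the bytes.
template <typename T>
struct sparse_field
{
  std::size_t dense_size = 0; // number of samples (or haplotypes)
  std::vector<std::pair<std::uint32_t, T>> nonzero; // (offset, value) pairs

  static sparse_field from_dense(const std::vector<T>& dense)
  {
    sparse_field ret;
    ret.dense_size = dense.size();
    for (std::uint32_t i = 0; i < dense.size(); ++i)
      if (dense[i] != T(0))
        ret.nonzero.emplace_back(i, dense[i]);
    return ret;
  }
};
```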
I will give the master branch a try if I can make it build (no internet access on our research network, so no cget and no conda).
Thanks. Can you describe the MD field? The only dependencies needed to build the sav CLI are libzstd, libz, and shrinkwrap (https://github.com/jonathonl/shrinkwrap/archive/395361020c84f664d50c1ec51055e107d9178ad3.tar.gz). Shrinkwrap is a header-only library; the other two you likely already have available in your environment.
It's a single integer field described as "Read depths of multiple alleles." Maybe @hannespetur can shed some more light? The master-branch binary is now working, and it doesn't crash with PBWT enabled. In fact, PBWT gives very good compression in a quick test I just ran. P.S.: Shrinkwrap looks very useful! I will definitely try replacing some of my code with it!
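[Editor's note: PBWT here refers to the positional Burrows-Wheeler transform, which reorders haplotypes at each site so that those with similar recent history become adjacent, turning per-sample columns into long runs that compress well. A minimal sketch of the core prefix-array update (after Durbin's algorithm, biallelic case; not savvy's actual code):]

```cpp
#include <cstdint>
#include <vector>

// One PBWT prefix-array update. `ppa` is the current haplotype ordering;
// `alleles[h]` is the allele carried by haplotype h at this site.
// Haplotypes carrying 0 are stably placed before those carrying 1, which
// keeps haplotypes with matching recent history adjacent at the next site.
std::vector<std::uint32_t> pbwt_step(const std::vector<std::uint32_t>& ppa,
                                     const std::vector<std::uint8_t>& alleles)
{
  std::vector<std::uint32_t> zeros, ones;
  for (std::uint32_t h : ppa)
    (alleles[h] == 0 ? zeros : ones).push_back(h);
  zeros.insert(zeros.end(), ones.begin(), ones.end());
  return zeros;
}
```

[Presumably the ordering derived from GT is also applied when serializing the PBWT fields (DP, AD, GQ, PL), so values from similar samples land next to each other; savvy's exact scheme may differ.]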
Great. I think I understand, but just to confirm: does MD simply count reads that support more than one variant allele?
Yes, it just indicates how many reads support more than one variant allele equally well. It's going to be 0 most of the time.
Ok. If MD is mostly zero, then it should compress well as a sparse field.
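[Editor's note: in principle, the decision of when a sparse layout pays off is just a density cutoff. A toy version of such a rule; the threshold below is an assumption for illustration, not savvy's documented cutoff:]

```cpp
#include <cstddef>

// Toy decision rule: encode a field sparsely only when the fraction of
// non-zero entries is small enough that (offset, value) pairs beat the
// dense layout. A pair costs roughly (4 + sizeof(value)) bytes versus
// sizeof(value) per dense entry, so requiring fewer than ~1/3 non-zero
// entries is a conservative, illustrative cutoff.
bool use_sparse(std::size_t nonzero_count, std::size_t dense_size)
{
  return nonzero_count * 3 < dense_size;
}
```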
In most of the benchmarks, running the savvy master build without the --pbwt-fields option results in a file ~2x smaller than BCF, and enabling --pbwt-fields improves the ratio further (~3.2x at 150k samples).

These scalability issues are still occurring, though they now appear when going from 150k to 200k samples (instead of 75k to 100k). At 200k samples, the savvy output without PBWT is almost exactly the same size as its BCF input. savvy with PBWT does still compress at 200k, but its compression ratio relative to BCF drops from ~3.2x (150k) to ~2.4x (200k). There must be some fallback to no compression on the sparse fields being triggered here, right?
Hmm... there's no application logic that I can think of that would explain this. I have a 200k dataset that I can use to try to reproduce. Are your smaller datasets merely sample subsets of the 200k dataset? If so, are monomorphic variant records (AC==0) removed when subsetting, or does each sample set have the same number of variant records? Using a higher level of compression might also scale better.
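[Editor's note: for context on compression levels, savvy's compression dependency is libzstd (per the build notes above), where higher levels trade encoding speed for ratio. A standalone libzstd comparison; this says nothing about how the sav CLI itself exposes the level:]

```cpp
#include <cstdio>
#include <string>
#include <vector>
#include <zstd.h>

// Compress the same buffer at two zstd levels and compare output sizes.
int main()
{
  std::string input(1 << 20, 'A'); // highly compressible demo payload
  for (int level : {3, 19})
  {
    std::vector<char> dst(ZSTD_compressBound(input.size()));
    std::size_t n = ZSTD_compress(dst.data(), dst.size(),
                                  input.data(), input.size(), level);
    if (ZSTD_isError(n))
      return 1;
    std::printf("level %d: %zu -> %zu bytes\n", level, input.size(), n);
  }
  return 0;
}
```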
Yes. The original VCF contains 487k samples, which I subset into smaller and smaller sample sizes while also removing AC==0 records from the file.
Ok, thanks, I will give it a try.