NCBI virus databases now available! #3507

Open
1 task
ctb opened this issue Jan 25, 2025 · 0 comments
Labels
fyi Information that is interesting or useful

ctb commented Jan 25, 2025

Viral genome databases are now available on farm as well as for download! 🎉

TODO:

  • add them to the databases page

files

location: /group/ctbrowngrp5/sourmash-db/ncbi-viruses-2025.01

The .sig.zip databases and the lineages CSV are here:

-rw-rw-r-- 1 ctbrown datalabgrp 2.7G Jan 25 08:24 'ncbi-viruses.skip_m2n3.k=24.scaled=50.sig.zip'
-rw-rw-r-- 1 ctbrown datalabgrp 1.4G Jan 25 08:50 'ncbi-viruses.dna.k=21.scaled=50.sig.zip'
-rw-rw-r-- 1 ctbrown datalabgrp 1.4G Jan 25 08:50 'ncbi-viruses.dna.k=31.scaled=50.sig.zip'
-rw-rw-r-- 1 ctbrown datalabgrp  53M Jan 25 08:43  ncbi-viruses.lineages.csv

Note that these databases all use scaled=50, and that they include a skipmer-m2n3 database built with the parameters that seemed to perform well in https://github.com/sourmash-bio/2024-ictv-challenge-sourmash.
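For readers who haven't met skipmers before, here is a minimal sketch of the general idea, assuming the usual definition in which the first m of every n positions are kept until k bases have been collected. The function name `skipmers` is made up for illustration, and this is not sourmash's implementation (which also hashes and canonicalizes the result); it only shows which positions an m=2, n=3, k=24 skipmer would draw from.

```python
def skipmers(seq, k=24, m=2, n=3):
    """Yield skipmers of k kept bases, keeping the first m of every n positions.

    Illustrative only: a plain-Python sketch of the skipmer idea,
    not the code sourmash uses.
    """
    # Offsets (relative to the window start) of the bases that are kept.
    offsets = []
    pos = 0
    while len(offsets) < k:
        if pos % n < m:
            offsets.append(pos)
        pos += 1

    # Number of sequence positions covered by one skipmer.
    span = offsets[-1] + 1

    for start in range(len(seq) - span + 1):
        yield "".join(seq[start + off] for off in offsets)


if __name__ == "__main__":
    # Tiny example with k=4 so the output is easy to read by eye.
    for s in skipmers("ACGTACG", k=4, m=2, n=3):
        print(s)
```

Under this definition, a k=24 skipmer with m=2, n=3 draws its 24 bases from a 35-base window, so it tolerates mismatches at every third position.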

They are available for download here:

ncbi-viruses-2025.01/ncbi-viruses.dna.k=21.scaled=50.sig.zip
ncbi-viruses-2025.01/ncbi-viruses.dna.k=31.scaled=50.sig.zip
ncbi-viruses-2025.01/ncbi-viruses.lineages.csv
ncbi-viruses-2025.01/ncbi-viruses.skip_m2n3.k=24.scaled=50.sig.zip
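If you want to pick the database that matches a query's sketch parameters from a script, the file names above carry everything needed. A small sketch, assuming the `<prefix>.<moltype>.k=<ksize>.scaled=<scaled>.sig.zip` naming pattern seen in this listing holds; `parse_db_name` is a hypothetical helper, not part of sourmash.

```python
import re

# Pattern inferred from the file names in this issue; an assumption, not a spec.
DB_NAME = re.compile(
    r"^(?P<prefix>[^.]+)\.(?P<moltype>[^.]+)"
    r"\.k=(?P<ksize>\d+)\.scaled=(?P<scaled>\d+)\.sig\.zip$"
)


def parse_db_name(path):
    """Return the sketch parameters encoded in a database file name, or None."""
    name = path.rsplit("/", 1)[-1]  # drop any leading directory
    match = DB_NAME.match(name)
    if match is None:
        return None  # e.g. the lineages CSV
    return {
        "prefix": match.group("prefix"),
        "moltype": match.group("moltype"),
        "ksize": int(match.group("ksize")),
        "scaled": int(match.group("scaled")),
    }


if __name__ == "__main__":
    files = [
        "ncbi-viruses-2025.01/ncbi-viruses.dna.k=21.scaled=50.sig.zip",
        "ncbi-viruses-2025.01/ncbi-viruses.dna.k=31.scaled=50.sig.zip",
        "ncbi-viruses-2025.01/ncbi-viruses.lineages.csv",
        "ncbi-viruses-2025.01/ncbi-viruses.skip_m2n3.k=24.scaled=50.sig.zip",
    ]
    for f in files:
        print(f, "->", parse_db_name(f))
```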

build repos and scripts

For sketching, I used the code in https://github.com/sourmash-bio/2025-sourmash-ncbi-viral-databases to get a list of all viral genomes and sketch them with directsketch.

content summary

% sourmash sig summarize collections/ncbi-viruses.mf.csv

== This is sourmash version 4.8.14. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

** loading from 'collections/ncbi-viruses.mf.csv'
path filetype: StandaloneManifestIndex
location: collections/ncbi-viruses.mf.csv
is database? yes
has manifest? yes
num signatures: 692919
** examining manifest...
total hashes: 606952063
summary of sketches:
   230973 sketches with skipm2n3, k=24, scaled=50     303273376 total hashes
   230973 sketches with DNA, k=21, scaled=50          151805251 total hashes
   230973 sketches with DNA, k=31, scaled=50          151873436 total hashes
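As a quick sanity check on the summary above, the per-type counts should add up to the reported totals. The sketch below copies the numbers from the `sig summarize` output and also prints the mean number of hashes per sketch; the variable names are just for illustration.

```python
# (moltype, ksize, number of sketches, total hashes), copied from the summary.
summary = [
    ("skipm2n3", 24, 230973, 303273376),
    ("DNA", 21, 230973, 151805251),
    ("DNA", 31, 230973, 151873436),
]

reported_signatures = 692919
reported_hashes = 606952063

num_signatures = sum(n for _, _, n, _ in summary)
total_hashes = sum(h for _, _, _, h in summary)

# Mean hashes per sketch for each sketch type.
mean_hashes = {(moltype, k): h / n for moltype, k, n, h in summary}

if __name__ == "__main__":
    print("signatures match:", num_signatures == reported_signatures)
    print("hashes match:", total_hashes == reported_hashes)
    for (moltype, k), mean in mean_hashes.items():
        print(f"{moltype} k={k}: {mean:.1f} hashes per sketch on average")
```

Since scaled=50 retains roughly 1 in 50 distinct hashes, multiplying a mean by 50 gives a rough estimate of the distinct k-mers per genome.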

Execution time/memory:

        Command being timed: "sourmash scripts gbsketch collections/ncbi-viruses.links.csv -n 9 -r 10 -p skipm2n3,k=24,scaled=50 -p dna,k=21,k=31,scaled=50 --failed gbsketch-fail.ncbi-viruses.txt --checksum-fail gbsketch-check-fail.ncbi-viruses.txt -o databases/ncbi-viruses.skip_m2n3.k24.zip -c 1 --batch 1000"
        User time (seconds): 3942.62
        System time (seconds): 644.22
        Percent of CPU this job got: 77%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:38:06
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4387888
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 7
        Minor (reclaiming a frame) page faults: 3025881
        Voluntary context switches: 20260228
        Involuntary context switches: 23931
        Swaps: 0
        File system inputs: 0
        File system outputs: 11742808
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
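To put the timing output in more familiar units, here is a small sketch that converts the elapsed time and peak memory and recomputes the CPU percentage from user plus system time. The numbers are copied from the output above; `parse_elapsed` is a hypothetical helper written for this example.

```python
def parse_elapsed(text):
    """Convert an 'h:mm:ss' or 'm:ss' elapsed string to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


user_s = 3942.62          # User time (seconds)
system_s = 644.22         # System time (seconds)
elapsed_s = parse_elapsed("1:38:06")
max_rss_kb = 4387888      # Maximum resident set size (kbytes)

# CPU share = (user + system) / wall clock time.
cpu_percent = 100 * (user_s + system_s) / elapsed_s

# Peak memory in GiB (1 GiB = 1024 * 1024 kB).
max_rss_gib = max_rss_kb / 1024 / 1024

if __name__ == "__main__":
    print(f"elapsed: {elapsed_s:.0f} s")
    print(f"CPU: {cpu_percent:.1f}%")
    print(f"peak memory: {max_rss_gib:.2f} GiB")
```

So the full run took a bit over an hour and a half of wall clock time and peaked at a little over 4 GiB of memory.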

@ctb ctb added the fyi Information that is interesting or useful label Jan 25, 2025