Benchmark "sourmash scripts multisearch" vs "sourmash scripts manysearch" + polars-based TF-IDF/probability of overlap calculation #2

Open
olgabot opened this issue Feb 11, 2025 · 0 comments
Labels
Python Only involves writing Python code

Comments

@olgabot
Contributor

olgabot commented Feb 11, 2025

A probability-of-overlap calculation between the query and target, plus a TF-IDF calculation, was added to sourmash scripts multisearch in sourmash-bio/sourmash_plugin_branchwater#458.

However, multisearch runs entirely in memory and may not scale to larger databases. I therefore want to benchmark whether we can compute our own probability of overlap and TF-IDF with polars, using the kmer parquet file created by the kmerseek.sketch.sketch method.

@olgabot added the Python and help wanted labels and removed the help wanted label on Feb 14, 2025