Minhash scaling code does not return hashes for many sequences #3516
Comments
oof, that's weird! What happens with
If you get nothing, then for some reason the file is not being read properly! If you get something, then maybe further diagnosis is needed 😅
scaled=1 works. It stops working beyond certain values of scaled (but only for some sequences). There is definitely a bug in the code.
(I'm not saying there's no bug, but I would be surprised if there were one this obvious, since we use sourmash quite heavily for lots of things ;).) I would then also expect
to produce empty output. Hmm. So if it's not the sequence format, maybe it's the sequence content. Would you be able to attach a subsampled file here, or send some of the sequence to me at [email protected]? You could also try intermediate values of scaled to see where it's losing hashes.
I linked two example sequences above, would that be enough?
So I ran this code to do a quick test -
For SEQUENCE1, there are hashes up to scaled=31, and beyond that there are 0. scaled=1 gives 71 hashes.
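To make the numbers above easier to reason about, here is a minimal sketch of how a scaled sketch subsamples hashes. It assumes that SEQUENCE1 is the first example sequence from the issue and that a scaled sketch keeps only hash values below 2**64 / scaled, so the expected number of retained hashes is roughly (number of k-mers) / scaled. The helper names (`kmers`, `hash64`, `scaled_hashes`) are hypothetical, and the hash is a stand-in built on `hashlib`, not the hash function sourmash uses. The sketch also skips the canonical (reverse-complement) k-mer handling that sourmash applies to DNA, so the exact cutoff will not match the sourmash result.

```python
import hashlib

HASH_SPACE = 2**64  # size of a 64-bit hash space


def kmers(seq, ksize):
    """Return all overlapping k-mers of length ksize in seq."""
    return [seq[i:i + ksize] for i in range(len(seq) - ksize + 1)]


def hash64(kmer):
    """Stand-in 64-bit hash (an assumption; NOT the hash sourmash uses)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_hashes(seq, ksize, scaled):
    """Keep only hashes that fall below HASH_SPACE / scaled."""
    threshold = HASH_SPACE // scaled
    return sorted({h for h in map(hash64, kmers(seq, ksize)) if h < threshold})


# First example sequence from the issue (assumed to be SEQUENCE1).
SEQUENCE1 = (
    "ATGGTTGGGATCGACGGACCGTAAATATCGGCATCGAGAACCCCGACCTTTGCGCCTTCAGCCGC"
    "TAACGCCAGCGCCAGGTTTACCGCCGTGGACGATTTCCCCACCCC"
)

if __name__ == "__main__":
    n_kmers = len(kmers(SEQUENCE1, 40))
    for scaled in (1, 10, 31, 100, 1000):
        kept = scaled_hashes(SEQUENCE1, 40, scaled)
        print(f"scaled={scaled}: kept {len(kept)} of {n_kmers} k-mer hashes")
```

Under this model, the retained count depends on how many k-mers the sequence has and where their hashes happen to fall, not on whether scaled is smaller than the sequence length. A short sequence with only a few dozen k-mers can therefore end up with no hashes at moderate scaled values.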
I have a very large list of sequences for which scaling with MinHash does not return any hashes, even when scaled < len(sequence).
Example sequences -
ATGGTTGGGATCGACGGACCGTAAATATCGGCATCGAGAACCCCGACCTTTGCGCCTTCAGCCGCTAACGCCAGCGCCAGGTTTACCGCCGTGGACGATTTCCCCACCCC
ATTAGTAAAAAACATGAGCATGGCCTGGCAAAATGTACTGTATATCGTGGCCGCGATATTAGTAATCATGCTGTGCGTCTTTACGCTGATCATTCGCGGTAAAGCCAAAAGCGA
Minimal code to reproduce -
This prints -
Whereas
mh = MinHash(n=100, ksize=40)
works just fine.
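One possible reason `MinHash(n=100, ksize=40)` behaves differently: a sketch with a fixed `n` keeps the `n` smallest hash values regardless of their magnitude, so any sequence with at least one k-mer yields hashes. The sketch below illustrates that bottom-n selection under stated assumptions. `hash64` and `bottom_n_hashes` are hypothetical helpers, and the hash is a `hashlib` stand-in rather than the hash function sourmash uses.

```python
import hashlib
import heapq


def hash64(kmer):
    """Stand-in 64-bit hash (an assumption; NOT the hash sourmash uses)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_n_hashes(seq, ksize, n):
    """Return the n smallest distinct k-mer hashes, in ascending order."""
    hashes = {hash64(seq[i:i + ksize]) for i in range(len(seq) - ksize + 1)}
    return heapq.nsmallest(n, hashes)


# Second example sequence from the issue.
SEQUENCE2 = (
    "ATTAGTAAAAAACATGAGCATGGCCTGGCAAAATGTACTGTATATCGTGGCCGCGATATTAGTAA"
    "TCATGCTGTGCGTCTTTACGCTGATCATTCGCGGTAAAGCCAAAAGCGA"
)

if __name__ == "__main__":
    kept = bottom_n_hashes(SEQUENCE2, 40, 100)
    print(f"n=100 keeps {len(kept)} hashes")
```

Because the selection is by rank rather than by an absolute threshold, the result is never empty for a sequence at least ksize long, which is consistent with the fixed-`n` case working while large scaled values return nothing.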