Nightly benchmarks hanging during HNSW connectComponents #13844
Comments
I eyeballed the code and am having trouble understanding how this would not terminate. My suggestion would be to add more infostream logging to HnswGraphBuilder.connectComponents(), enable infostream logging for the HNSW component in the nightly bench, and see if we can reproduce. I'll propose some more logging that should help narrow down the cause.
Looking at the code I found one glitch where we search graph level 0 for connections but should be searching the current level. However I think it's benign since we filter the results using a bitset ( |
I restarted the benchy with
Next I'll apply #13849 and repro the hang again and post back.
@mikemccand What exact luceneutil command / dataset are you using to reproduce this? I wanna repro and debug locally.
The connect components stuff has been in main for a while now. I would have expected this to fail way more and way sooner if that was the direct cause. I wonder if the major vector refactoring that was merged just a couple of days ago is the cause, @msokolov?
Hey @mikemccand @msokolov I found the bug.
^ Since This obviously is no bueno. This needs to guard against that and ensure that the scorer is a byte focused thing. So,
Good sleuthing @benwtrent. Let's get this fixed a.s.a.p. And I'll look into why this was not caught by tests.
Yes, good catch! Separately I'd also like to understand why this led to the hanging behavior; maybe there is another bug?
@msokolov it led to the hanging behavior because the scores meant nothing any more. I am guessing trying to build the graph took FOREVER because the scores were nonsense. Everything was It wasn't a true death loop, but I have noticed that graph building takes exponentially longer if the scores are meaningless.
Follow up to @benwtrent's awesome fix: a small PR that verifies and would have caught the underlying root cause of this issue, #13851
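For readers following along, the kind of guard discussed above (refusing to score byte-encoded vectors through a path meant for floats) could be sketched roughly as follows. This is only an illustration: Encoding, requireEncoding and scoreQuantized are hypothetical names, not Lucene's API, and the actual fix may look quite different.

```java
// Hypothetical sketch, not Lucene code: fail fast when the stored vector
// encoding does not match what the scorer expects.
final class ScorerGuardSketch {
  // Assumed for illustration: how the vectors were written to disk.
  enum Encoding { BYTE, FLOAT32 }

  // Throws instead of silently producing meaningless scores.
  static void requireEncoding(Encoding actual, Encoding expected) {
    if (actual != expected) {
      throw new IllegalArgumentException(
          "expected " + expected + " vectors but got " + actual);
    }
  }

  // Plain integer dot product over two byte vectors of equal length.
  static int dotProduct(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Byte-focused scorer entry point: checks the encoding before scoring.
  static int scoreQuantized(Encoding storedAs, byte[] a, byte[] b) {
    requireEncoding(storedAs, Encoding.BYTE);
    return dotProduct(a, b);
  }

  public static void main(String[] args) {
    byte[] a = {1, 2, 3};
    byte[] b = {4, 5, 6};
    System.out.println(scoreQuantized(Encoding.BYTE, a, b));
  }
}
```

The point of throwing rather than falling through is that a mismatch surfaces immediately in tests, instead of showing up much later as a merge that appears to hang.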
It's in
and:
But then you'd also need the |
OK in my latest repro, it got to the "hang" again, ~4.7 MM docs, and the new
We feel confident that the float/byte confusion might cause this? Anyways, I'll apply that PR and try to repro! Thanks for the quick attention here @ChrisHegarty, @benwtrent and @msokolov!
The float/byte confusion would cause this to run exponentially worse. The upshot is that graph exploration is much worse due to nonsensical scores, which in turn causes greater disconnectedness. This is a double whammy. I expect the latest bug fix to work.
Awesome -- I pulled just that fix into the nightly benchy Lucene clone and restarted! I'll post back once it gets beyond this hang. Thanks!! Another interesting thing: the nightly benchy is producing quite large segments, ~15 GB, because of how it creates its segments by doc counts / deterministic merging. This is beyond what Lucene would normally do by default (target ~5 GB max segment size) ... so the benchy is testing larger HNSW graphs than "normal".
I put the nightly benchy binary line file docs source file here: https://githubsearch.mikemccandless.com/enwiki-20110115-lines-1k-fixed-with-random-labels.bin ... it's ~25 GB.
OK indeed @benwtrent fix resolves the issue! Benchy happily got past that one large merge ( |
Description
I'm not sure if this is a real bug, or something wrong w/ the nightly benchy env ... but the benchmark is hung right now building the deterministic search index, for ~21 hours not making any progress (it prints number of docs indexed periodically ...). That indexing is single threaded, and the one thread is doing this:
It's Java 22:
JVM : Arch Linux OpenJDK 64-Bit Server VM runtime 22.0.2+9 22.0.2+9
It's on Lucene main branch, commit 0a8604d908c7657187ed6d6926484ef7406d9794.
ls -ltrh of the index it's writing:
It's merging to a large segment (_32) and has written large vector files (separately: why do we have a .tmp and a non-temp .vec file, both 15 GB?), but hasn't yet written the graph (.vex).
Talking to @msokolov, this is the phase of HNSW graph building that tries to connect disconnected components.
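To make "disconnected components" concrete, here is a minimal sketch that counts the connected components of one graph level. It is not Lucene's connectComponents() implementation; it assumes the level is given as plain adjacency lists that can be treated as undirected (each link listed on both ends), and the class and method names are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;

// Illustrative sketch only: counts connected components with a visited bitset.
final class ComponentCountSketch {

  // neighbors[i] holds the node ids linked to node i (assumed symmetric).
  static int countComponents(int[][] neighbors) {
    BitSet visited = new BitSet(neighbors.length);
    Deque<Integer> stack = new ArrayDeque<>();
    int components = 0;
    // Each unvisited node we find starts a new component.
    for (int start = visited.nextClearBit(0);
        start < neighbors.length;
        start = visited.nextClearBit(start + 1)) {
      components++;
      visited.set(start);
      stack.push(start);
      // Depth-first traversal marks everything reachable from start.
      while (!stack.isEmpty()) {
        int node = stack.pop();
        for (int nbr : neighbors[node]) {
          if (!visited.get(nbr)) {
            visited.set(nbr);
            stack.push(nbr);
          }
        }
      }
    }
    return components;
  }

  public static void main(String[] args) {
    // Two linked pairs plus one isolated node.
    int[][] graph = {{1}, {0}, {3}, {2}, {}};
    System.out.println(countComponents(graph)); // prints 3
  }
}
```

A well-built level should come out as one component; the more components there are, the more work a connecting phase has to do to link them.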
Version and environment details
No response