
Nightly benchmarks hanging during HNSW connectComponents #13844

Closed
mikemccand opened this issue Oct 1, 2024 · 16 comments
Labels
blocker A severe issue that should be resolved before the release specified in its Milestone. type:bug
Milestone

Comments

@mikemccand
Member

Description

I'm not sure if this is a real bug or something wrong with the nightly benchmark environment ... but the benchmark is hung right now building the deterministic search index: it has made no progress for ~21 hours (it prints the number of docs indexed periodically). That indexing is single-threaded, and the one thread is doing this:

"Index #0" #122 [2684589] prio=5 os_prio=0 cpu=73410815.43ms elapsed=74640.64s tid=0x00007ffff05de700 nid=2684589 runnable  [0x00007ff730187000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.util.LongHeap.downHeap(LongHeap.java:175)
        at org.apache.lucene.util.LongHeap.pop(LongHeap.java:108)
        at org.apache.lucene.util.hnsw.NeighborQueue.pop(NeighborQueue.java:131)
        at org.apache.lucene.util.hnsw.HnswGraphSearcher.searchLevel(HnswGraphSearcher.java:225)
        at org.apache.lucene.util.hnsw.HnswGraphBuilder.connectComponents(HnswGraphBuilder.java:465)
        at org.apache.lucene.util.hnsw.HnswGraphBuilder.connectComponents(HnswGraphBuilder.java:423)
        at org.apache.lucene.util.hnsw.HnswGraphBuilder.finish(HnswGraphBuilder.java:416)
        at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder.finish(HnswConcurrentMergeBuilder.java:123)
        at org.apache.lucene.util.hnsw.HnswConcurrentMergeBuilder.build(HnswConcurrentMergeBuilder.java:94)
        at org.apache.lucene.util.hnsw.IncrementalHnswGraphMerger.merge(IncrementalHnswGraphMerger.java:143)
        at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter.mergeOneField(Lucene99HnswVectorsWriter.java:373)
        at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.mergeOneField(PerFieldKnnVectorsFormat.java:121)
        at org.apache.lucene.codecs.KnnVectorsWriter.merge(KnnVectorsWriter.java:105)
        at org.apache.lucene.index.SegmentMerger.mergeVectorValues(SegmentMerger.java:271)
        at org.apache.lucene.index.SegmentMerger$$Lambda/0x00007ff749232be0.merge(Unknown Source)
        at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:314)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:158)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5283)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4749)
        at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6567)
        at org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:38)
        - locked <0x00007ff7d8ed7b80> (a org.apache.lucene.index.SerialMergeScheduler)
        at org.apache.lucene.index.IndexWriter.executeMerge(IndexWriter.java:2318)
        at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2313)
        at org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:6024)
        at org.apache.lucene.index.IndexWriter.maybeProcessEvents(IndexWriter.java:6014)
        at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1553)
        at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1511)
        at perf.IndexThreads$IndexThread.run(IndexThreads.java:275)

It's Java 22: Arch Linux OpenJDK 64-Bit Server VM, runtime 22.0.2+9. It's on Lucene main branch commit 0a8604d908c7657187ed6d6926484ef7406d9794.

ls -ltrh of the index it's writing:

-rw-r--r-- 1 mike mike  13K Sep 30 15:44 _32.fdx
-rw-r--r-- 1 mike mike  47M Sep 30 15:44 _32.fdt
-rw-r--r-- 1 mike mike  327 Sep 30 15:44 _32.fdm
-rw-r--r-- 1 mike mike  139 Sep 30 15:44 _32.nvm
-rw-r--r-- 1 mike mike 9.5M Sep 30 15:44 _32.nvd
-rw-r--r-- 1 mike mike 1.3K Sep 30 15:45 _32_Lucene912_0.tmd
-rw-r--r-- 1 mike mike 3.7M Sep 30 15:45 _32_Lucene912_0.tip
-rw-r--r-- 1 mike mike 160M Sep 30 15:45 _32_Lucene912_0.tim
-rw-r--r-- 1 mike mike  112 Sep 30 15:45 _32_Lucene912_0.psm
-rw-r--r-- 1 mike mike 1.3G Sep 30 15:45 _32_Lucene912_0.pos
-rw-r--r-- 1 mike mike 890M Sep 30 15:45 _32_Lucene912_0.doc
-rw-r--r-- 1 mike mike 6.2K Sep 30 15:45 _32_Lucene90_0.dvm
-rw-r--r-- 1 mike mike 330M Sep 30 15:45 _32_Lucene90_0.dvd
-rw-r--r-- 1 mike mike  326 Sep 30 15:45 _32.kdm
-rw-r--r-- 1 mike mike 174K Sep 30 15:45 _32.kdi
-rw-r--r-- 1 mike mike  47M Sep 30 15:45 _32.kdd
-rw-r--r-- 1 mike mike    0 Sep 30 15:45 _32_Lucene99HnswVectorsFormat_0.vex
-rw-r--r-- 1 mike mike    0 Sep 30 15:45 _32_Lucene99HnswVectorsFormat_0.vemf
-rw-r--r-- 1 mike mike    0 Sep 30 15:45 _32_Lucene99HnswVectorsFormat_0.vem
-rw-r--r-- 1 mike mike  15G Sep 30 15:45 _32_Lucene99HnswVectorsFormat_0.vec_temp_6g.tmp
-rw-r--r-- 1 mike mike  15G Sep 30 15:46 _32_Lucene99HnswVectorsFormat_0.vec

It's merging to a large segment (_32) and has written large vector files (separately: why do we have a .tmp and non-temp .vec file, both 15 GB?), but hasn't yet written the graph (.vex).

Talking to @msokolov, this is the phase of HNSW graph building that tries to connect disconnected components.
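
For readers unfamiliar with that phase, here is a conceptual sketch of the idea (this is not Lucene's actual code; HnswGraphBuilder works per graph level and links small components into the main one via a nearest-neighbor search filtered by a notFullyConnected bitset, whereas the toy version below just adds an arbitrary bridging edge):

import java.util.*;

final class ConnectComponentsSketch {
  // graph: adjacency sets for one (undirected) graph level, indexed by node id.
  static void connectComponents(List<Set<Integer>> graph) {
    // 1. Find the connected components with a simple DFS.
    boolean[] seen = new boolean[graph.size()];
    List<List<Integer>> components = new ArrayList<>();
    for (int start = 0; start < graph.size(); start++) {
      if (seen[start]) continue;
      List<Integer> component = new ArrayList<>();
      Deque<Integer> stack = new ArrayDeque<>();
      stack.push(start);
      seen[start] = true;
      while (!stack.isEmpty()) {
        int node = stack.pop();
        component.add(node);
        for (int neighbor : graph.get(node)) {
          if (!seen[neighbor]) {
            seen[neighbor] = true;
            stack.push(neighbor);
          }
        }
      }
      components.add(component);
    }
    if (components.size() <= 1) {
      return; // already fully connected
    }
    // 2. Keep the largest component as the "main" graph and bridge every other
    // component into it. Lucene picks the bridge target by searching for the
    // nearest neighbor (by vector similarity) among nodes that can still accept
    // connections; this sketch just links to the main component's first node.
    components.sort((a, b) -> Integer.compare(b.size(), a.size()));
    List<Integer> main = components.get(0);
    for (int i = 1; i < components.size(); i++) {
      int from = components.get(i).get(0);
      int to = main.get(0);
      graph.get(from).add(to);
      graph.get(to).add(from);
    }
  }
}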

Version and environment details

No response

@msokolov
Contributor

msokolov commented Oct 1, 2024

I eyeballed the code and am having trouble understanding how this would not terminate. My suggestion would be to add more InfoStream logging to HnswGraphBuilder.connectComponents(), enable InfoStream logging for the HNSW component in the nightly bench, and see if we can reproduce. I'll propose some more logging that should help narrow down the cause.
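
For reference, a minimal sketch of how that logging can be switched on (assuming the HNSW messages are emitted under the "HNSW" component name, as the later log output suggests; the helper class and method names here are made up, and this is not necessarily how luceneutil wires it up):

import java.io.IOException;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.InfoStream;
import org.apache.lucene.util.PrintStreamInfoStream;

public class HnswInfoStreamConfig {
  // Build an IndexWriterConfig whose InfoStream only emits the "HNSW" component,
  // so graph-building / connectComponents messages show up without flooding the log.
  public static IndexWriterConfig withHnswLogging() {
    InfoStream hnswOnly =
        new InfoStream() {
          private final PrintStreamInfoStream delegate = new PrintStreamInfoStream(System.out);

          @Override
          public void message(String component, String message) {
            delegate.message(component, message);
          }

          @Override
          public boolean isEnabled(String component) {
            return "HNSW".equals(component);
          }

          @Override
          public void close() throws IOException {
            delegate.close();
          }
        };
    return new IndexWriterConfig().setInfoStream(hnswOnly);
  }
}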

@msokolov
Contributor

msokolov commented Oct 1, 2024

Looking at the code, I found one glitch where we search graph level 0 for connections but should be searching the current level. However, I think it's benign, since we filter the results using a bitset (notFullyConnected) that was computed on the correct level, and since level 0 is a superset, the returned nodes should be OK (and we would have raised an assertion later if the node we tried to connect to was not on the current level). Still, we should fix it.

@mikemccand
Member Author

I restarted the benchy with InfoStream enabled during indexing and it's hanging in the same spot. This is the end of the log:

HNSW 0 [2024-10-02T08:19:43.363286Z; hnsw-merge-1-thread-67]: addVectors [4958208 4960256)
HNSW 0 [2024-10-02T08:19:44.496797157Z; hnsw-merge-1-thread-54]: addVectors [4960256 4962304)
HNSW 0 [2024-10-02T08:19:44.890408111Z; hnsw-merge-1-thread-74]: addVectors [4962304 4964352)
HNSW 0 [2024-10-02T08:19:45.512236220Z; hnsw-merge-1-thread-59]: addVectors [4964352 4966400)
HNSW 0 [2024-10-02T08:19:46.273832822Z; hnsw-merge-1-thread-50]: addVectors [4966400 4968448)
HNSW 0 [2024-10-02T08:19:47.312206424Z; hnsw-merge-1-thread-70]: addVectors [4968448 4970496)
HNSW 0 [2024-10-02T08:19:47.373776388Z; hnsw-merge-1-thread-61]: addVectors [4970496 4972544)
HNSW 0 [2024-10-02T08:19:48.061919420Z; hnsw-merge-1-thread-60]: addVectors [4972544 4974592)
HNSW 0 [2024-10-02T08:19:50.205716356Z; hnsw-merge-1-thread-62]: addVectors [4974592 4976640)
HNSW 0 [2024-10-02T08:19:50.961509404Z; Index #0]: addVectors [4976640 4978031)
HNSW 0 [2024-10-02T08:19:52.023305772Z; hnsw-merge-1-thread-58]: built 4950000 in 11988/11988 ms
HNSW 0 [2024-10-02T08:19:53.889552006Z; hnsw-merge-1-thread-67]: built 4960000 in 10526/10526 ms
HNSW 0 [2024-10-02T08:19:55.874703177Z; hnsw-merge-1-thread-70]: built 4970000 in 8562/8562 ms

Next I'll apply #13849 and repro the hang again and post back.

@ChrisHegarty
Contributor

@mikemccand What exact luceneutil command / dataset are you using to reproduce this? I wanna repro and debug locally.

@benwtrent
Member

The connect components stuff has been in main for a while now. I would have expected this to fail way more and way sooner if that was the direct cause.

I wonder if the major vector refactoring that was merged just a couple of days ago is the cause, @msokolov?

@benwtrent
Member

Hey @mikemccand @msokolov I found the bug.

public RandomVectorScorerSupplier getRandomVectorScorerSupplier(
      VectorSimilarityFunction similarityType, KnnVectorValues vectorValues) throws IOException {
    // a quantized values here is a wrapping or delegation issue
    assert !(vectorValues instanceof QuantizedByteVectorValues);
    // currently only supports binary vectors
    if (vectorValues instanceof HasIndexSlice byteVectorValues
        && byteVectorValues.getSlice() != null) {
      var scorer =
          Lucene99MemorySegmentByteVectorScorerSupplier.create(
              similarityType, byteVectorValues.getSlice(), vectorValues);
      if (scorer.isPresent()) {
        return scorer.get();
      }
    }
    return delegate.getRandomVectorScorerSupplier(similarityType, vectorValues);
  }

^ Since OffHeapFloatVectorValues provides the HasIndexSlice interface, we would attempt to use Lucene99MemorySegmentByteVectorScorerSupplier over floating point vectors.

This obviously is no bueno. The check needs to guard against that and ensure the scorer is only used for byte vectors. So: hasSlice and instanceof ByteVectorValues.
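
A minimal sketch of the guard described above (illustrative only, mirroring the snippet from the previous comment rather than the exact change that ended up in the fix):

    // Only take the memory-segment byte-scorer path when the values really are
    // byte vectors AND expose an index slice; float vectors (and anything else)
    // fall through to the default delegate scorer.
    if (vectorValues instanceof ByteVectorValues
        && vectorValues instanceof HasIndexSlice sliced
        && sliced.getSlice() != null) {
      var scorer =
          Lucene99MemorySegmentByteVectorScorerSupplier.create(
              similarityType, sliced.getSlice(), vectorValues);
      if (scorer.isPresent()) {
        return scorer.get();
      }
    }
    return delegate.getRandomVectorScorerSupplier(similarityType, vectorValues);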

@ChrisHegarty ChrisHegarty added this to the 10.0.0 milestone Oct 2, 2024
@ChrisHegarty ChrisHegarty added the blocker A severe issue that should be resolved before the release specified in its Milestone. label Oct 2, 2024
@ChrisHegarty
Contributor

Good sleuthing @benwtrent. Let's get this fixed a.s.a.p. And I'll look into why this was not caught by tests.

@msokolov
Contributor

msokolov commented Oct 2, 2024

Yes, good catch! Separately I'd also like to understand why this led to the hanging behavior; maybe there is another bug?

@benwtrent
Member

@msokolov it led to the hanging behavior because the scores meant nothing anymore. I am guessing that building the graph took FOREVER because the scores were nonsense. Everything was 0, or something like that.

It wasn't a true death loop, but I have noticed that graph building takes exponentially longer if the scores are meaningless.

@ChrisHegarty
Contributor

Follow-up to @benwtrent's awesome fix: a small PR that verifies the behavior and would have caught the underlying root cause of this issue, #13851.

@mikemccand
Member Author

@mikemccand What exact luceneutil command / dataset are you using to reproduce this? I wanna repro and debug locally.

It's in nightlyBench.py in luceneutil, specifically this single-threaded (deterministic) search index:

    # Must use only 1 thread so we get same index structure, always:
    index = comp.newIndex(NIGHTLY_DIR, mediumSource,
                          # nocommit
                          verbose = True,
                          analyzer='StandardAnalyzerNoStopWords',
                          postingsFormat='Lucene912',
                          numThreads=1,
                          useCMS=False,
                          directory=DIR_IMPL,
                          idFieldPostingsFormat='Lucene912',
                          mergePolicy='LogDocMergePolicy',
                          facets=(('taxonomy:Date', 'Date'),
                                  ('taxonomy:Month', 'Month'),
                                  ('taxonomy:DayOfYear', 'DayOfYear'),
                                  ('sortedset:Month', 'Month'),
                                  ('sortedset:DayOfYear', 'DayOfYear'),
                                  ('sortedset:Date', 'Date'),
                                  ('taxonomy:RandomLabel', 'RandomLabel'),
                                  ('sortedset:RandomLabel', 'RandomLabel')),
                          addDVFields=True,
                          vectorFile=constants.VECTORS_DOCS_FILE,
                          vectorDimension=constants.VECTORS_DIMENSIONS,
                          vectorEncoding=constants.VECTORS_TYPE,
                          hnswThreadsPerMerge = constants.HNSW_THREADS_PER_MERGE,
                          hnswThreadPoolCount = constants.HNSW_THREAD_POOL_COUNT
                          )

and:

    indexPathNow, ign, ign, atClose, profilerSearchIndex, profilerSearchJFR = buildIndex(r, runLogDir,
                                                                                         'search index (fixed segments)',
                                                                                         index, 'fixedIndex.log')

But then you'd also need the wikimediumall source file ... I'll try to copy this somewhere accessible. But it looks like we found/fixed the bug already!? That's great. I'll test the latest PR.

@mikemccand
Member Author

OK, in my latest repro it got to the "hang" again, at ~4.7 MM docs, and the new InfoStream verbosity is printing this, seemingly forever:

HNSW 0 [2024-10-02T14:38:57.213224399Z; Index #0]: connected ok 3259356 -> 1588283
HNSW 0 [2024-10-02T14:38:57.213228727Z; Index #0]: connect component Component[start=1588302, size=1] to Component[start=748, size=274202]
HNSW 0 [2024-10-02T14:38:58.071588512Z; Index #0]: connected ok 4958321 -> 1588302
HNSW 0 [2024-10-02T14:38:58.071649757Z; Index #0]: connected ok 4603916 -> 1588302
HNSW 0 [2024-10-02T14:38:58.071654205Z; Index #0]: connect component Component[start=1588314, size=1] to Component[start=748, size=274202]
HNSW 0 [2024-10-02T14:38:59.056790289Z; Index #0]: connected ok 476853 -> 1588314
HNSW 0 [2024-10-02T14:38:59.056855021Z; Index #0]: connected ok 3621451 -> 1588314
HNSW 0 [2024-10-02T14:38:59.056859740Z; Index #0]: connect component Component[start=1588875, size=1] to Component[start=748, size=274202]
HNSW 0 [2024-10-02T14:38:59.460672305Z; Index #0]: connected ok 4762949 -> 1588875
HNSW 0 [2024-10-02T14:38:59.460749310Z; Index #0]: connected ok 1325953 -> 1588875
HNSW 0 [2024-10-02T14:38:59.460753808Z; Index #0]: connect component Component[start=1588889, size=1] to Component[start=748, size=274202]
HNSW 0 [2024-10-02T14:39:00.176183421Z; Index #0]: connected ok 1558507 -> 1588889
HNSW 0 [2024-10-02T14:39:00.176248132Z; Index #0]: connected ok 2135121 -> 1588889
HNSW 0 [2024-10-02T14:39:00.176252130Z; Index #0]: connect component Component[start=1588908, size=1] to Component[start=748, size=274202]
HNSW 0 [2024-10-02T14:39:01.062557483Z; Index #0]: connected ok 4659744 -> 1588908
HNSW 0 [2024-10-02T14:39:01.062624378Z; Index #0]: connected ok 2226014 -> 1588908
HNSW 0 [2024-10-02T14:39:01.062628576Z; Index #0]: connect component Component[start=1589010, size=1] to Component[start=748, size=274202]
HNSW 0 [2024-10-02T14:39:01.959456863Z; Index #0]: connected ok 4645839 -> 1589010
HNSW 0 [2024-10-02T14:39:01.959520262Z; Index #0]: connected ok 3325246 -> 1589010
HNSW 0 [2024-10-02T14:39:01.959525001Z; Index #0]: connect component Component[start=1589050, size=1] to Component[start=748, size=274202]
HNSW 0 [2024-10-02T14:39:02.517998859Z; Index #0]: connected ok 1229619 -> 1589050
HNSW 0 [2024-10-02T14:39:02.518061436Z; Index #0]: connected ok 4746044 -> 1589050
HNSW 0 [2024-10-02T14:39:02.518065895Z; Index #0]: connect component Component[start=1589059, size=1] to Component[start=748, size=274202]
HNSW 0 [2024-10-02T14:39:03.723843596Z; Index #0]: connected ok 2391036 -> 1589059
HNSW 0 [2024-10-02T14:39:03.723913016Z; Index #0]: connected ok 478823 -> 1589059
HNSW 0 [2024-10-02T14:39:03.723918737Z; Index #0]: connect component Component[start=1589078, size=1] to Component[start=748, size=274202]
HNSW 0 [2024-10-02T14:39:04.106043320Z; Index #0]: connected ok 3785244 -> 1589078
HNSW 0 [2024-10-02T14:39:04.106104635Z; Index #0]: connected ok 1604849 -> 1589078
HNSW 0 [2024-10-02T14:39:04.106108953Z; Index #0]: connect component Component[start=1589175, size=1] to Component[start=748, size=274202]
HNSW 0 [2024-10-02T14:39:04.475971140Z; Index #0]: connected ok 4590765 -> 1589175
HNSW 0 [2024-10-02T14:39:04.476031093Z; Index #0]: connected ok 4799856 -> 1589175
HNSW 0 [2024-10-02T14:39:04.476035441Z; Index #0]: connect component Component[start=1589287, size=1] to Component[start=748, size=274202]
HNSW 0 [2024-10-02T14:39:04.833763512Z; Index #0]: connected ok 2053056 -> 1589287
HNSW 0 [2024-10-02T14:39:04.833823585Z; Index #0]: connected ok 2449630 -> 1589287
HNSW 0 [2024-10-02T14:39:04.833827853Z; Index #0]: connect component Component[start=1589492, size=1] to Component[start=748, size=274202]
HNSW 0 [2024-10-02T14:39:05.931177996Z; Index #0]: connected ok 2106405 -> 1589492
HNSW 0 [2024-10-02T14:39:05.931240674Z; Index #0]: connected ok 4064050 -> 1589492
HNSW 0 [2024-10-02T14:39:05.931244742Z; Index #0]: connect component Component[start=1589505, size=1] to Component[start=748, size=274202]
HNSW 0 [2024-10-02T14:39:07.417257519Z; Index #0]: connected ok 4960883 -> 1589505
HNSW 0 [2024-10-02T14:39:07.417322050Z; Index #0]: connected ok 900733 -> 1589505
HNSW 0 [2024-10-02T14:39:07.417326659Z; Index #0]: connect component Component[start=1589635, size=1] to Component[start=748, size=274202]

Do we feel confident that the float/byte confusion could cause this? Anyway, I'll apply that PR and try to repro! Thanks for the quick attention here @ChrisHegarty, @benwtrent and @msokolov!

@benwtrent
Member

The float/byte confusion would cause this to run exponentially worse. The upshot is that graph exploration is much worse due to nonsensical scores, which also causes greater disconnectedness. This is a double whammy.

I expect the latest bug fix to work.

@mikemccand
Member Author

The float/byte confusion would cause this to run exponentially worse. The upshot is that graph exploration is much worse due to nonsensical scores, which also causes greater disconnectedness. This is a double whammy.

I expect the latest bug fix to work.

Awesome -- I pulled just that fix into the nightly benchy's Lucene clone and restarted! I'll post back once it gets beyond this hang. Thanks!!

Another interesting thing: the nightly benchy is producing quite large segments, ~15 GB, because of how it creates its segments by doc counts / deterministic merging. This is beyond what Lucene would normally do by default (target ~5 GB max segment size) ... so the benchy is testing larger HNSW graphs than "normal".

@mikemccand
Member Author

But then you'd also need the wikimediumall source file ... I'll try to copy this somewhere accessible.

I put the nightly benchy binary line file docs source file here: https://githubsearch.mikemccandless.com/enwiki-20110115-lines-1k-fixed-with-random-labels.bin ... it's ~25 GB.

ChrisHegarty added a commit that referenced this issue Oct 2, 2024
This is a test only change that verifies the behaviour when float vector values are passed to our FlatVectorsScorer implementations. This would have caught the bug causing #13844, subsequently fixed by #13850.
@mikemccand
Member Author

OK, indeed @benwtrent's fix resolves the issue! The benchy happily got past that one large merge (the _32 segment, heh)!
