-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First-class random access API for KnnVectorValues #13779
Conversation
another concern I have is how this would impact ongoing work to enable multiple vectors per doc/field. There would almost certainly be conflicts with that PR on the surface, but I hope this could actually simplify things in that the new |
That is my intuition as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few thoughts/questions. In general, I see how such a random-access API change could help with e.g. your BP reordering work and be valuable in general. I was wondering if this API may be too tailored to HNSW and prevent us from supporting other interesting algorithms, but actually I don't think that this is the case?
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java
Outdated
Show resolved
Hide resolved
* Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access | ||
* different values at once, to avoid overwriting the underlying vector returned. | ||
*/ | ||
public abstract KnnVectorValues copy() throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could make the API a bit nicer by removing this copy() and instead have something like a FloatVectorDictionary { float[] vectorValue(int ord); }
and a method here that can return a new FloatVectorDictionary
(a bit like SortedDocValues
and TermsEnum
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way SortedDocValuesTermsEnum
is, calling its next
method will overwrite the internal buffer ofd the SortedDocValues
on which it is built, defeating the purpose of copy()
which is to provide two completely independent sources. Another thing we could do is to add vectorValue(int ord, float[] scratch)
allowing the caller to provide the memory to write to. If we had that, we wouldn't need copy()
. Maybe we could manage to squeeze that into 10.0 too, but I'd rather do it in a separate PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if you call SortedDocValues#termsEnum twice, this would give you two independent sources of terms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always found copy very strange, but I get why it is there. I'd be tempted to leave it as is in this PR, changing the access model and cache of 1 float[] will be a bit tricky.
if (iterator == null) { | ||
iterator = createIterator(); | ||
} | ||
return iterator; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make this return a new iterator every time to make the API a bit nicer? From a quick look, it seems that call sites could easily be adjusted to not rely on this method returning a shared instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me try - I was also a bit unhappy about this, but at one point along this journey I was relying on being able to recover the shared state - maybe I finally was able to get rid of that and just didn't notice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a new iterator would be cleaner, if the use sites allow for it.
* Creates an iterator from this instance's ordinal-to-docid mapping which must be monotonic | ||
* (docid increases when ordinal does). | ||
*/ | ||
protected DocIndexIterator fromOrdToDoc() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could we make it look a bit more like DocIdSetIterator#all
by moving it to DocIndexIterator#all
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, you mean rename this method to all
? sure, makes sense
@Override | ||
public int advance(int target) throws IOException { | ||
return slowAdvance(target); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it could be a performance trap, which is why DocIdSetIterator offers this helper method without making it the default impl. Should we leave it without a default impl here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I don't think anything relies on this, makes sense
|
||
@Override | ||
public long cost() { | ||
throw new UnsupportedOperationException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise here, I'd rather leave it unimplemented to force implementers to decide if having cost() throw an exception is fine. Presumably, most of the time it's not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I think cost() is rarely used in the vector reader/writers which instead are concerned with KnnVectorValues.size() -- they typically want to know how many vector values there are and to the extent they care about the number of docs it's only when they must iterate through all of them and have no use for an estimate. These iterators aren't really used during searching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we default cost() to returning size(), that would work for me. But I'm not comfortable with having implementations of DocIdSetIterator#cost that may throw, which means e.g. that they cannot be put in a Conjunction DISI(which will want to sort its clauses by cost).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Even in FloatVecotorValues cost() is returning size() only. https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/FloatVectorValues.java#L46-L48
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree here. Either it should default to size()
via some provided dependency or it shouldn't implement at all and force sub-classes.
Am guessing correctly that you're targeting 10.0 for this change? |
Thanks for the quick review! I will get started on addressing. As for timeline for this change, it would definitely be convenient to get in to 10.0 release. I think you had said 9/22 would be a feature freeze date; it seems we could possibly meet that timeline. I will be traveling starting tomorrow for a week, but I should be able to put in some quality time on the plane LOL |
public byte[] vectorValue() throws IOException { | ||
return current.values.vectorValue(); | ||
public byte[] vectorValue(int ord) throws IOException { | ||
return current.values.vectorValue(current.values.iterator().index()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part feels a bit hacky, could we instead merge the ord->vector mappings of the various vector values by concatenating them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can enhance DocIDMerger by adding random access to it
I was thinking of doing it next week, but we can backport this PR even though the branch has been cut if it looks ready/safe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this change. I see a lot of refactoring similar to what I half started at one point or the other, but never finished. There are some specific comments to be addressed, but otherwise the approach LGTM.
@Override | ||
RandomAccessQuantizedByteVectorValues copy() throws IOException; | ||
/** Returns an IndexInput from which to read this instance's values. */ | ||
IndexInput getSlice(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I very much like this, and had something similar in a past unmarked PR. 👍
* Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access | ||
* different values at once, to avoid overwriting the underlying vector returned. | ||
*/ | ||
public abstract KnnVectorValues copy() throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always found copy very strange, but I get why it is there. I'd be tempted to leave it as is in this PR, changing the access model and cache of 1 float[] will be a bit tricky.
if (iterator == null) { | ||
iterator = createIterator(); | ||
} | ||
return iterator; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a new iterator would be cleaner, if the use sites allow for it.
I pushed a new revision here addressing some of the major comments:
I didn't have a chance to look seriously at removing |
OK there seem to be some test failures ... I did a complete run, but randomized testing always seems to ferret out something interesting! Actually those really should have failed on any test run -- not sure how I missed them, oops |
Regarding the rename of |
FWIW I started playing with removing copy() by replacing it with a factory method for a dictionary: msokolov@ae7aca3. Not sure how far I'll go. :) |
I'll post one more iteration here addressing the concerns about dangerous default impls that adds back impls of copy() and cost(). I also added a test-and-throw ensuring that the vectorValues impls that require forward-iteration enforce it. We can fully implement random access later without breaking any APIs. I also think we should go ahead with Adrien's Dictionary idea, but do this in two steps because there is a lot going on here already. |
The dictionary idea is OK, but I still don't see how it removes I agree, any further refactoring should be done in another PR. |
I think the idea w/Dictionary is that callers, instead of calling |
Exactly. I tried to model it similarly to what doc values do, where |
OK I think we've addressed the blocking concerns that have been raised here and I plan to push later today if nothing else comes up. Regarding removing copy() in favor of dictionary() I'll open a separate issue. If Adrien finishes it up, great, but I may also see if I can find time to take that forward soon; it would be good to get it done for 10 since it would be a breaking change and ideally we don't want copy() to linger as deprecated. As for implementing better random access in merged values I think we can take that up at a more relaxed pace since it doesn't require any API change. |
hm interesting there was an EOFException in there - I'll dig |
OK, I found an off-by-one error plus a problem with lazy iterator creation that slipped in when we got rid of createIterator(). It makes me a little nervous these didn't show up in earlier testing. I'm now running with |
OK, I think this is ready after a few minor issues have been addressed. I opened #13831 to track replacing copy() with dictionary() |
Should there be a migrate entry added with this change? |
oh thanks, yes, and a CHANGES entry. I opened #13833 if you want to review |
Our lucene_snapshot branch requires updating after apache/lucene#13779
Our lucene_snapshot branch requires updating after apache/lucene#13779
…13850) introduced in the major refactor #13779 Off-heap scoring is only present for byte[] vectors, and it isn't enough to verify that the vector provider also satisfies the HasIndexSlice interface. The vectors need to be byte vectors otherwise, the slice iterations and scoring are completely nonsensical leading to HNSW graph building to run until the heat-death of the universe.
…13850) introduced in the major refactor #13779 Off-heap scoring is only present for byte[] vectors, and it isn't enough to verify that the vector provider also satisfies the HasIndexSlice interface. The vectors need to be byte vectors otherwise, the slice iterations and scoring are completely nonsensical leading to HNSW graph building to run until the heat-death of the universe.
…13850) introduced in the major refactor #13779 Off-heap scoring is only present for byte[] vectors, and it isn't enough to verify that the vector provider also satisfies the HasIndexSlice interface. The vectors need to be byte vectors otherwise, the slice iterations and scoring are completely nonsensical leading to HNSW graph building to run until the heat-death of the universe.
addresses #13778
Key things in this PR:
KnnVectorValues
from whichByteVectorValues
andFloatVectorValues
derive;RandomAccessVectorValues
intoKnnVectorValues
thus eliminating some casts.RandomAccessVectorValues.Floats
becomesFloatVectorValues
andRandomAccessVectorValues.Bytes
becomesByteVectorValues
.RandomAccessQuantizedByteVectorValues
folded intoQuantizedByteVectorValues
.IndexInput getSlice()
moved to a newHasIndexSlice
interface.VectorEncoding KnnVectorValues.getEncoding()
to enable type-specific branches in a few places where we are dealing with abstractKnnVectorValues
(tests only IIRC). Also used that to provide a defaultgetVectorByteLength()
.KnnVectorValues
no longer extendsDocIdSetIterator
; rather it provides a tightly-couplediterator()
. This enables refactoring common iteration patterns that were repeated many times in the code base. This new iterator,DocIndexIterator
provides an additional methodindex()
analogous toIndexedDISI
.Some of the methods on
KnnVectorValues
have default impls that throwUnsupportedOperationException
enabling subclasses to provide partial implementations and relying on testing to catch missing required methods. I'd like feedback on this. Should we provide implementations we never use, just to make these classes complete? That didn't make sense to me. But the previous alternative of attempting to provide strict adherence to declarative contracts was becoming in my view, overly restrictive and leading to hard-to-maintain code. Some of these readers would only ever be used iteratively. Random access is required for search, but not used when merging the values themselves, and when we merge we do search, but using a temporary file so that searching is always done over a file-based value. Random access also gets used during merging when the index is sorted, again this is provided by specialized readers, so not every reader needs to implement random access. But the API maintenance is greatly simplified if we allow partial implementation. Anyway that is the idea I am trying out here. Can we live with a little less API purity and gain some simplicity and less boilerplate?Notes for reviewers:
There is a lot of code change here, but much of it is repetitive. I recommend starting with
KnnVectorValues
and checking itsDocIndexIterator
inner class. The rest of the changes are basically consequences of introducing those abstrations in place of theRandom*Values
we removed.