Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Remote Index Client] Implement remote client build awaiting functionality, validate encoder support #2576

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

owenhalpert
Copy link
Contributor

@owenhalpert owenhalpert commented Mar 4, 2025

Description

This PR is a followup to #2560. This PR adds a RemoteIndexWaiter interface to await the completion of the remote vector build. For the HTTP client, this is implemented as a simple polling mechanism in RemoteIndexPoller. While there is a fair amount of files changed due to refactoring, the crux of this PR is in the RemoteIndexPoller, RemoteIndexHTTPClient, and RemoteIndexBuildStrategy classes.

This PR also adds a check on the encoder type, enforcing that only HNSWFlat is supported by the remote build service.

This PR does not include:

  • Metric collection

which will go in the next PR.

Related Issues

#2560 PR 1
#2518 LLD
#2391 Meta issue

Check List

  • New functionality includes testing.
    - [ ] New functionality has been documented.
    - [ ] API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
    - [ ] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@owenhalpert owenhalpert changed the title [Remote Index Client] Vector client polling mechanism [Remote Index Client] Vector Client Polling Mechanism Mar 4, 2025
@owenhalpert owenhalpert force-pushed the vector-client-poller branch from 9a9dfb6 to 53d63d4 Compare March 4, 2025 18:07
@owenhalpert owenhalpert marked this pull request as ready for review March 4, 2025 18:49
@jed326
Copy link
Contributor

jed326 commented Mar 4, 2025

Thanks @owenhalpert, the overall approach looks good to me, just needs some cleanup in a few places.

@kotwanikunal
Copy link
Member

On a high level - looking at the changes - I am not sure if we should add in all the client/HTTP constructs directly into the plugin. The job status constants, client configuration seems like a better fit for a client package.
This adds in a direct dependency between the build service and the kNN plugin - which I feel can be abstracted out.

BlobPath path = repository.basePath().add(indexSettings.getUUID() + VECTORS_PATH);
BlobContainer blobContainer = repository.blobStore().blobContainer(path);

assert blobContainer != null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assserts do not work in prod. Can we do null checks instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the logic of getting the blob container comes from OpenSearch core so the assertion is just a sanity check on that

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jed326 if assertion fails will we fallback to CPU based index builds? and will there be a useful exception which can tell what happened? If no, please add these information

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is there is not any way for blobContainer to be null, and none of the methods in the calling path to retrieve blobContainer throw any checked exceptions. The only thing that can throw an exception is getRepository, and creating a blobContainer on top of a repository reference is more or less just setting the file path to be written/read to. This assertion just helps make sure in tests if we do try to test/mock this method we are properly configuring things.

@navneet1v
Copy link
Collaborator

On a high level - looking at the changes - I am not sure if we should add in all the client/HTTP constructs directly into the plugin. The job status constants, client configuration seems like a better fit for a client package. This adds in a direct dependency between the build service and the kNN plugin - which I feel can be abstracted out.

@kotwanikunal , since remote index build service is not an hosted service like S3 which vends out a HTTP client. Also, we don't want to maintain another repo as the maintainence overhead will be high. Hence the client code is better suited to be in k-NN repo only. If you see right now, the code is abstracted in a different java package. Now if you think we should move the java package to a separate module then that is something which can be easily achieved in upcoming iterations. Let us know your thoughts.

@owenhalpert owenhalpert force-pushed the vector-client-poller branch from 8e6ce08 to ce7f8a0 Compare March 4, 2025 22:49
@jed326 jed326 added the v3.0.0 label Mar 5, 2025
Comment on lines 141 to 154
private String getMethodName(String parametersJson) throws IOException {
XContentParser parser = XContentType.JSON.xContent()
.createParser(NamedXContentRegistry.EMPTY, DeprecationHandler.THROW_UNSUPPORTED_OPERATION, parametersJson.getBytes());

while (parser.nextToken() != XContentParser.Token.END_OBJECT) {
if (parser.currentToken() == XContentParser.Token.FIELD_NAME) {
String fieldName = parser.currentName();
parser.nextToken();
if (NAME.equals(fieldName)) {
return parser.text();
}
}
}
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was mentioned before, but can we track in a GitHub issue somewhere to improve this parsing? I think that's a generic problem in k-NN and not specific to remote index build too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will get this out tomorrow

@owenhalpert owenhalpert force-pushed the vector-client-poller branch from ce7f8a0 to 1a9577b Compare March 5, 2025 01:41
@@ -64,7 +64,7 @@ public NativeIndexBuildStrategy getBuildStrategy(
if (KNNFeatureFlags.isKNNRemoteVectorBuildEnabled()
&& repositoriesServiceSupplier != null
&& indexSettings != null
&& knnEngine.supportsRemoteIndexBuild()
&& knnEngine.supportsRemoteIndexBuild(fieldInfo.attributes())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why cannot we just pass the whole fieldInfo Object, rather than passing attributes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my previous PR I used BuildIndexParams.parameters instead of the whole BuildIndexParams in a similar way, because of feedback from @Vikasht34 who suggested I pass only the minimum necessary data between these classes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I understand that, but fieldInfo.attributes() give a map which can be generic and open for exploit. Hence in this case I was thinking having a fieldInfo object would be better.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Me and Navneet has discussed , Please keep Map as it is , Rather than passing FieldInfo

BlobPath path = repository.basePath().add(indexSettings.getUUID() + VECTORS_PATH);
BlobContainer blobContainer = repository.blobStore().blobContainer(path);

assert blobContainer != null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jed326 if assertion fails will we fallback to CPU based index builds? and will there be a useful exception which can tell what happened? If no, please add these information

Comment on lines 141 to 154
private String getMethodName(String parametersJson) throws IOException {
XContentParser parser = XContentType.JSON.xContent()
.createParser(NamedXContentRegistry.EMPTY, DeprecationHandler.THROW_UNSUPPORTED_OPERATION, parametersJson.getBytes());

while (parser.nextToken() != XContentParser.Token.END_OBJECT) {
if (parser.currentToken() == XContentParser.Token.FIELD_NAME) {
String fieldName = parser.currentName();
parser.nextToken();
if (NAME.equals(fieldName)) {
return parser.text();
}
}
}
return null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines 60 to 62
if (COMPLETED_INDEX_BUILD.equals(builder.taskStatus) && StringUtils.isBlank(builder.fileName)) {
throw new IOException("Invalid response format, missing " + FILE_NAME + " for completed status");
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this validation be part of fromXContent() ? I feel we should move this validation out of this function and put it at the place where we are validating the status and other attributes.

Copy link
Contributor Author

@owenhalpert owenhalpert Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the actual content checks to the poller (while the status response object will just be checking the validity of the JSON).

Comment on lines 132 to 135
public RemoteBuildStatusResponse awaitVectorBuild(RemoteBuildResponse remoteBuildResponse) throws InterruptedException, IOException {
RemoteIndexPoller remoteIndexPoller = new RemoteIndexPoller(this);
return remoteIndexPoller.pollRemoteEndpoint(remoteBuildResponse);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I mentioned that Polling function should not be part of client I mean the Poller class should call the awaitVectorBuild api from outside. What is being done here is exactly reversed. This is the reason why I still think the api name should be getIndexBuildStatus rather than awaitVectorBuild. Because polling is one way to wait for the IndexBuild to complete.

I think RemoteIndexPoller -> getIndexBuildStatus() rather than awaitVectorBuild() -> RemoteIndexPoller()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks. The new flow will be that the build strategy gets the Waiter implementation from a factory method. In this case it's a Poller. The Poller uses client.getBuildStatus and does the polling all on its own.

The client is specifically responsible for getting and sending requests — the Waiter will use the client to do so when needed.

Pull polling logic out to generic RemoteIndexPoller class (that can use any client), add tests

Signed-off-by: owenhalpert <[email protected]>
Refactor, use more string constants where possible, fix unit tests after refactor

Signed-off-by: owenhalpert <[email protected]>
Signed-off-by: owenhalpert <[email protected]>

# Conflicts:
#	src/test/java/org/opensearch/knn/index/engine/KNNEngineTests.java
@owenhalpert owenhalpert force-pushed the vector-client-poller branch from acd6118 to 2a80d32 Compare March 6, 2025 00:32
Signed-off-by: owenhalpert <[email protected]>
@owenhalpert owenhalpert force-pushed the vector-client-poller branch from 8c0a04a to fbb2d1b Compare March 6, 2025 02:07
Copy link
Contributor

@jed326 jed326 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @owenhalpert

@@ -10,6 +10,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
* [Remote Vector Index Build] Implement vector data upload and vector data size threshold setting [#2550](https://github.com/opensearch-project/k-NN/pull/2550)
* [Remote Vector Index Build] Implement data download and IndexOutput write functionality [#2554](https://github.com/opensearch-project/k-NN/pull/2554)
* [Remote Vector Index Build] Introduce Client Skeleton + basic Build Request implementation [#2560](https://github.com/opensearch-project/k-NN/pull/2560)
* [Remote Vector Index Build] Client polling mechanism [#2576](https://github.com/opensearch-project/k-NN/pull/2576)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please improve your commit message and the changelog entry.

@@ -64,7 +64,7 @@ public NativeIndexBuildStrategy getBuildStrategy(
if (KNNFeatureFlags.isKNNRemoteVectorBuildEnabled()
&& repositoriesServiceSupplier != null
&& indexSettings != null
&& knnEngine.supportsRemoteIndexBuild()
&& knnEngine.supportsRemoteIndexBuild(fieldInfo.attributes())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I understand that, but fieldInfo.attributes() give a map which can be generic and open for exploit. Hence in this case I was thinking having a fieldInfo object would be better.

Comment on lines +131 to +152
public boolean supportsRemoteIndexBuild(Map<String, String> attributes) throws IOException {
String parametersJson = attributes.get(PARAMETERS);
if (parametersJson != null) {
String methodName = getMethodName(parametersJson);
if (METHOD_HNSW.equals(methodName)) {
return FaissHNSWMethod.supportsRemoteIndexBuild(attributes);
}
}
return false;
}

/**
* Get method name from a {@link FieldInfo} formatted attributes map
* Example:
* {
* "index_description": "HNSW12,Flat",
* "spaceType": "l2",
* "name": "hnsw",
* ...
* }
*/
private String getMethodName(String parametersJson) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets avoid IOException in method signature, it is just making the whole interfaces ugly. Lets catch the exception in the functions and then see if we want throw a proper runtime exception or we want to return some defaults.

Copy link
Contributor Author

@owenhalpert owenhalpert Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think we can catch and return null for the parsing methods on IOExceptions so the supportsRemoteIndexBuild just returns false.

* @return encoder name or null if not found
* @throws IOException if the json string is not valid
*/
private static String getEncoderName(String parametersJson) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmazanec15 and @naveentatikonda can you please help in validating this logic to ensure that it is full proof, with all the different combinations that can be possible for the encoders

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm thinking about this a bit more. Do we only need the encoder on write? If so, I think itd be safest to retrieve it from the KnnVectorFieldType via mapper service (https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldType.java). We should expose it as a method in https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/engine/KNNLibraryIndexingContext.java which will be available via the mapped field type.

If we need it on read, then we have to parse it out of this. But, from my understanding, we shouldnt need it on read.

public class RemoteBuildStatusRequest {
private final String jobId;

public RemoteBuildStatusRequest(RemoteBuildResponse remoteBuildResponse) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public RemoteBuildStatusRequest(RemoteBuildResponse remoteBuildResponse) {
public static RemoteBuildStatusRequest build(RemoteBuildResponse remoteBuildResponse) {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Status Build Request work based on RemoteBuildResponse rather than it has to work with minimum JobId which is enough for get status.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this keeps the design more extensible if the request/response were to contain something in addition to or instead of the job ID (e.g. tenantId).

String fileName;
String errorMessage;

static RemoteBuildStatusResponse fromXContent(XContentParser parser) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add a java doc with all the possible json that this function can parse.

Comment on lines +45 to +47
long timeout = KNNSettings.getRemoteBuildClientTimeout().getNanos();
// Thread.sleep expects millis
long pollInterval = KNNSettings.getRemoteBuildClientPollInterval().getMillis();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should these values be part of while loop so that if settings are changed the loop can exits

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this too -- I lean towards no though for 3 main reasons:

  1. I think it's safer / more stable to have a merge/flush task be immutable wrt settings, similar to how we are doing with the endpoint
  2. It would be good to not make so may getClusterSettings() calls in a loop (although we could instead cache the value with a settings update consumer)
  3. If we want to support cancel functionality we should properly scope that out instead of overloading it into the timeout settings

*/
class RemoteIndexPoller implements RemoteIndexWaiter {
// The poller waits KNN_REMOTE_BUILD_CLIENT_POLL_INTERVAL * INITIAL_DELAY_FACTOR before sending the first status request
private static final int INITIAL_DELAY_FACTOR = 3;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since your KNN_DEFAULT_REMOTE_BUILD_CLIENT_POLL_INTERVAL_SECONDS is 30sec, the factor of 3 is pretty high. decrease the value of 30sec to say may be 10sec or 5 sec, then a factor of 3 will be good.

I understand we want to tune these values, but I can tell you from my initial benchmarks this value is very high already


package org.opensearch.knn.index.remote;

public class RemoteIndexWaiterFactory {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add java docs on all new classes and public functions.

@navneet1v
Copy link
Collaborator

@owenhalpert please check why CIs are failing. Also, I think code is in pretty good shape and close to be getting shipped.

@owenhalpert owenhalpert changed the title [Remote Index Client] Vector Client Polling Mechanism [Remote Index Client] Implement remote client build awaiting functionality, validate encoder support Mar 6, 2025
@Value
@Builder
public class RemoteBuildStatusResponse {
private static final ParseField TASK_STATUS_FIELD = new ParseField(TASK_STATUS);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where we have defined Seervice API contract and how clinet can assume these parsing!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants