Remove usage of cluster level setting for circuit breaker #2567

Open

jmazanec15 wants to merge 3 commits into main from issue-1607
Conversation

jmazanec15
Member

@jmazanec15 jmazanec15 commented Feb 27, 2025

Description

This is a simplification of circuit breaker management. Before, we were periodically updating the circuit breaker cluster setting to provide a global circuit breaker. However, this is inconvenient and not actually useful.

This change removes that logic and only trips based on the node-level circuit breaker. For k-NN stats, in order to fetch the cluster-level circuit breaker value, a node stats transport call with just the cache_capacity_reached stat is made to check all of the nodes. This isn't super efficient, but it is only made for stats calls, so it's not on the critical path.

Overall, this greatly simplifies the memory management code. However, it does mean that the cluster setting "knn.circuit_breaker.triggered" will no longer have an effect. I think this is okay for 3.0. Users interested in this info can get it from the stats API.

Related Issues

Resolves #1607

Check List

  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

jmazanec15 added the Refactoring and v3.0.0 labels on Feb 27, 2025
jmazanec15 force-pushed the issue-1607 branch 2 times, most recently from 3eaf8b3 to 6365da6, on February 27, 2025 16:38
}
}
};
this.threadPool.scheduleWithFixedDelay(runnable, TimeValue.timeValueSeconds(CB_TIME_INTERVAL), ThreadPool.Names.GENERIC);
threadPool.scheduleWithFixedDelay(runnable, TimeValue.timeValueSeconds(CB_TIME_INTERVAL), ThreadPool.Names.GENERIC);
Member Author

I could go even a step further and move this to the native cache manager to simplify things more.

Member Author

I made this change. I think it's fairly straightforward and completely removes the potential circular dependency.

@Override
public Boolean get() {
try {
KNNCircuitBreakerTrippedResponse knnCircuitBreakerTrippedResponse = client.execute(
Member Author

TODO: This is somehow going to have to be made async, which will complicate things.

Member Author

I made this change. It's a bit messy; I will clean it up a bit.

@jmazanec15 jmazanec15 force-pushed the issue-1607 branch 3 times, most recently from df687f7 to ecd1318 Compare February 27, 2025 22:32
@kotwanikunal
Member

@jmazanec15 - I like the overall idea of getting rid of the cluster-level breaker and relying on the node-level breakers.
The concern I have here is that we still trip the entire cluster instead of just the node. What do you think of making it act per node?
It heads in the direction of rejecting per-node requests based on the current node load, and we can make the coordinator rely on retrying on another node instead.

It will get us closer to the backpressure constructs: https://opensearch.org/blog/shard-indexing-backpressure-in-opensearch/

@@ -500,4 +486,21 @@ private void startMaintenance(Cache<String, NativeMemoryAllocation> cacheInstanc

maintenanceTask = threadPool.scheduleWithFixedDelay(cleanUp, interval, ThreadPool.Names.MANAGEMENT);
}

private void circuitBreakerUpdater() {
Runnable runnable = () -> {
Member

I think the previous approach was directly in the codepath. Are we moving to runnable tasks because we want a mechanism to unset the CB?

I would rather have it as a side effect than a constant monitor.

Member Author

This code was actually in the KNNCircuitBreaker class: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/KNNCircuitBreaker.java#L60-L109. I moved it to the cache manager because it removes the circular dependency between the two. However, the logic remains consistent.
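For readers skimming the diff, a rough sketch of the shape of this updater inside the cache manager; the capacity helpers shown (isCacheCapacityReached, getCacheSizeAsPercentage, getCircuitBreakerUnsetPercentage, setCacheCapacityReached) are illustrative assumptions, not necessarily the PR's exact code:

// Sketch only: periodically re-check native memory usage on this node and
// clear the node-local "cache capacity reached" flag once usage drops below
// the configured unset percentage.
private void circuitBreakerUpdater() {
    Runnable runnable = () -> {
        if (isCacheCapacityReached() && getCacheSizeAsPercentage() <= getCircuitBreakerUnsetPercentage()) {
            // Usage has recovered on this node; allow graph loading/indexing again
            setCacheCapacityReached(false);
        }
    };
    threadPool.scheduleWithFixedDelay(runnable, TimeValue.timeValueSeconds(CB_TIME_INTERVAL), ThreadPool.Names.GENERIC);
}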

@Override
public ActionListener<Void> setupContext(KNNStatFetchContext knnStatFetchContext, ActionListener<Void> actionListener) {
// If there are any nodes in the cluster before 3.0, then we need to fall back to checking the CB
if (minVersionSupplier.get().compareTo(Version.V_3_0_0) < 0) {
Member

nit: can we use the comparison helpers from the Version class directly? e.g. minVersionSupplier.get().onOrBefore(Version.V_3_0_0)
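Side note: assuming the standard OpenSearch Version helpers, the direct equivalent of the existing compareTo(Version.V_3_0_0) < 0 check would be the strict before(), e.g.:

if (minVersionSupplier.get().before(Version.V_3_0_0)) {
    // at least one pre-3.0 node is present; fall back to checking the cluster setting
}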

@jmazanec15
Member Author

The concern I have here is that we still trip the entire cluster instead of just the node. What do you think of making it act per node?

@kotwanikunal Why do you say that? We only trip per node.

@kotwanikunal
Member

The concern I have here is that we still trip the entire cluster instead of just the node. What do you think of making it act per node?

@kotwanikunal Why do you say that? We only trip per node.

CircuitBreakerStat is still marked as a cluster-level stat. I was wondering if we could get away from maintaining it at the cluster level altogether.

@jmazanec15
Member Author

jmazanec15 commented Feb 28, 2025

@kotwanikunal Oh, I see. Yeah, I figured that would be a breaking change and so couldn't remove it without deprecation. Hence, the stats update. I might be mistaken.

@kotwanikunal
Member

@kotwanikunal Oh, I see. Yeah, I figured that would be a breaking change and so couldn't remove it without deprecation. Hence, the stats update. I might be mistaken.

I think it makes sense. We missed the deprecation train for this one. We have to wait until 4.0 now 😢
Looks good to me overall.

jmazanec15 force-pushed the issue-1607 branch 3 times, most recently from eca9e08 to 4cf10b8, on March 4, 2025 03:33
@jmazanec15
Member Author

@kotwanikunal @shatejas Can you take a look again when you get the chance? The stats are similar, except that instead of a new transport action, I just reuse the stats transport action and make a single node stats call before making the complete stats call.

@navneet1v
Collaborator

@jmazanec15 can you add an example of how the new stats output will look? Also, if a user needs to know whether a CB is triggered or not, how will the user identify it?

With this new logic, should I assume that if the CB is triggered for a node, all the indexing requests on that node will fail, correct? If that is true, then what will happen if a node has a replica; will only the replica ingestion fail?

@jmazanec15
Member Author

jmazanec15 commented Mar 4, 2025

@navneet1v

@jmazanec15 can you add an example of how the new stats output will look? Also, if a user needs to know whether a CB is triggered or not, how will the user identify it?

There won't be any change to the stats API interface. The only change is how we set the cluster-level circuit_breaker_triggered stat. Before, we were just reading from the setting. Now, we send out a request to all nodes to get the cache_capacity_reached stat, and if any are true, then we return true.
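In sketch form (illustrative names; not the plugin's exact classes):

// The cluster-level circuit_breaker_triggered stat is derived from node stats:
// true if any node reports cache_capacity_reached == true.
// KNNStatsNodeResponse/isCacheCapacityReached are placeholder names.
private boolean isCircuitBreakerTriggered(List<KNNStatsNodeResponse> nodeResponses) {
    return nodeResponses.stream().anyMatch(KNNStatsNodeResponse::isCacheCapacityReached);
}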

With this new logic, should I assume that if the CB is triggered for a node, all the indexing requests on that node will fail, correct? If that is true, then what will happen if a node has a replica; will only the replica ingestion fail?

Yes, if the node has the CB tripped (instead of the cluster), we will fail the request. This change is https://github.com/opensearch-project/k-NN/pull/2567/files#diff-9dad8c6bb50a281c98d7cee29de3491e73a561f530d5e0a029d7deb860cf1698. It gets called here: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java#L690. So we would be unable to parse the index request, and the replica would fail.

// Adding these stats makes sure that we don't recurse infinitely.
dependentStats.forEach(knnStatsRequest::addStat);
client.execute(KNNStatsAction.INSTANCE, knnStatsRequest, ActionListener.wrap(knnStatsResponse -> {
knnNodeStatAggregation = new KNNNodeStatAggregation(knnStatsResponse.getNodes());
Collaborator

Is this action a singleton, or is a new action object created per request? My concern is: will this aggregation object remain the same when two threads have requests in parallel?

Member Author

Good catch. Will associate it with the request.

Member Author

fixed
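For context, a rough sketch of the request-scoped approach described above; the KNNStatFetchContext setter is an illustrative assumption:

// Keep the aggregation on the per-request context rather than on a shared
// field, so concurrent stats requests cannot overwrite each other's results.
dependentStats.forEach(knnStatsRequest::addStat);
client.execute(KNNStatsAction.INSTANCE, knnStatsRequest, ActionListener.wrap(knnStatsResponse -> {
    knnStatFetchContext.setNodeStatAggregation(new KNNNodeStatAggregation(knnStatsResponse.getNodes()));
    actionListener.onResponse(null);
}, actionListener::onFailure));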

@navneet1v
Collaborator

Yes, if the node has the CB tripped (instead of the cluster), we will fail the request. This change is https://github.com/opensearch-project/k-NN/pull/2567/files#diff-9dad8c6bb50a281c98d7cee29de3491e73a561f530d5e0a029d7deb860cf1698. It gets called here: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java#L690. So we would be unable to parse the index request, and the replica would fail.

Thanks @jmazanec15 for providing the info. I am a bit concerned now, since if the replica fails, then indexing is done partially, and other cases can come up. How are we thinking about those cases?

@jmazanec15
Member Author

Thanks @jmazanec15 for providing the info. I am a bit concerned now, since if the replica fails, then indexing is done partially, and other cases can come up. How are we thinking about those cases?

This will be consistent with what happens if the JVM circuit breaker for a node is tripped and it cannot send the request to the replica node. I don't think it will fail indexing on the primary.

Simplification of circuit breaker management. Before, we were
periodically updating the circuit breaker cluster setting to have a
global circuit breaker. However, this is inconvenient and not actually
useful.

This change removes that logic and only trips based on the node-level
circuit breaker. For k-NN stats, in order to fetch the cluster-level
circuit breaker, a transport call is made to check all of the nodes.
This isn't super efficient, but it is only made for stats calls, so it
is not on the critical path.

Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
@kotwanikunal
Member

Thanks @jmazanec15 for providing the info. I am a bit concerned now, since if the replica fails, then indexing is done partially, and other cases can come up. How are we thinking about those cases?

This will be consistent with what happens if the JVM circuit breaker for a node is tripped and it cannot send the request to the replica node. I don't think it will fail indexing on the primary.

It should be consistent irrespective of where it is triggered. Jack's changes focus on changing the tracking mechanism and not the overall behavior, from what I can tell.

Today, if a circuit breaker is triggered before the replica shard indexes, we would fall into the same scenario.

@jmazanec15
Member Author

Yes, so if the CB is tripped during indexing on the replica action, it will retry the operation (https://github.com/opensearch-project/OpenSearch/blob/f6d6aa61e5039e4c6143cc25a71c3e448572dd33/server/src/main/java/org/opensearch/action/support/replication/ReplicationOperation.java#L329), but other than that, it will fail the action in the same way. So this would be consistent with how core handles the CB.

@jmazanec15
Member Author

Confirmed locally that the behavior is that it fails the replica shard.

@jmazanec15
Member Author

We could also change the exception from KnnCircuitBreakerException to CircuitBreakingException so that the expected retries happen. This is done here for snapshots: https://github.com/opensearch-project/OpenSearch/blob/f6d6aa61e5039e4c6143cc25a71c3e448572dd33/server/src/main/java/org/opensearch/action/admin/cluster/snapshots/status/TransportSnapshotsStatusAction.java#L710
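A minimal sketch of that swap; the message text is a placeholder, and the package path of CircuitBreakingException may differ across OpenSearch versions:

import org.opensearch.core.common.breaker.CircuitBreaker;
import org.opensearch.core.common.breaker.CircuitBreakingException;

// Throw core's CircuitBreakingException (transient, since the condition clears
// once native memory usage drops) so standard circuit breaker handling applies.
static void throwIfTripped(boolean nativeMemoryCircuitBreakerTripped) {
    if (nativeMemoryCircuitBreakerTripped) {
        throw new CircuitBreakingException("[knn] native memory circuit breaker is tripped", CircuitBreaker.Durability.TRANSIENT);
    }
}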

@navneet1v
Collaborator

@jmazanec15 thanks for the deep dive. Can we make the behavior as close as possible to the CB exceptions for text-based indices?

@navneet1v
Collaborator

navneet1v commented Mar 4, 2025

@jmazanec15 in the stats API response, will we now see the node-level CB value and also the top-level CB value? Can you paste a response from your testing?

@jmazanec15
Member Author

@jmazanec15 thanks for the deep dive. Can we make the behavior as close as possible to the CB exceptions for text-based indices?

Sure, the problem is that any exception thrown in the mapper gets wrapped: internalParseDocument(mapping, metadataFieldsMappers, context, parser);. So it's not going to be able to retry easily even if we throw the CBException. That being said, I'm wondering if we should (re)move this check on the circuit breaker. We could put it in onIndexModule.addIndexOperationListener##preIndex. This might be better because it actually would be used before the index operation occurs: https://github.com/opensearch-project/OpenSearch/blob/2e4cc8c6e12f0e5fdfe9274da0126e81f95f59b3/server/src/main/java/org/opensearch/index/shard/IndexShard.java#L1199.
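A rough sketch of that alternative, assuming the standard IndexingOperationListener hook; the NativeMemoryCacheManager check is a stand-in for however the node-level breaker state is actually exposed, and the usual imports (IndexModule, IndexingOperationListener, Engine, ShardId, CircuitBreakingException) are omitted for brevity:

@Override
public void onIndexModule(IndexModule indexModule) {
    // Reject index operations up front, before the document is parsed,
    // while this node's native memory circuit breaker is tripped.
    indexModule.addIndexOperationListener(new IndexingOperationListener() {
        @Override
        public Engine.Index preIndex(ShardId shardId, Engine.Index operation) {
            if (NativeMemoryCacheManager.getInstance().isCacheCapacityReached()) {
                throw new CircuitBreakingException("[knn] native memory circuit breaker is tripped", CircuitBreaker.Durability.TRANSIENT);
            }
            return operation;
        }
    });
}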

@jmazanec15 in the stats API response, will we now see the node-level CB value and also the top-level CB value? Can you paste a response from your testing?

Sure, there is no change in this.

curl-os GET /_cat/indices?v
health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   target_index2 G-Nig9ziTw2Qmgke3s0QjQ   1   1          1            0      4.3kb          4.3kb
7cf34dd6bba9:k-NN-1 jmazane$
7cf34dd6bba9:k-NN-1 jmazane$ curl-os GET /_plugins/_knn/stats?pretty
{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "integTest",
  "circuit_breaker_triggered" : false,
  "model_index_status" : null,
  "nodes" : {
    "J5C9PgClRcW7PV3GfVNCGA" : {
      "max_distance_query_with_filter_requests" : 0,
      "graph_memory_usage_percentage" : 0.0,
      "graph_query_requests" : 0,
      "graph_memory_usage" : 0,
      "cache_capacity_reached" : false,
      "load_success_count" : 0,
      "training_memory_usage" : 0,
      "indices_in_cache" : { },
      "script_query_errors" : 0,
      "hit_count" : 0,
      "knn_query_requests" : 0,
      "total_load_time" : 0,
      "miss_count" : 0,
      "min_score_query_requests" : 0,
      "knn_query_with_filter_requests" : 0,
      "training_memory_usage_percentage" : 0.0,
      "max_distance_query_requests" : 0,
      "lucene_initialized" : false,
      "graph_index_requests" : 0,
      "faiss_initialized" : false,
      "load_exception_count" : 0,
      "training_errors" : 0,
      "min_score_query_with_filter_requests" : 0,
      "eviction_count" : 0,
      "nmslib_initialized" : false,
      "script_compilations" : 0,
      "script_query_requests" : 0,
      "graph_stats" : {
        "merge" : {
          "current" : 0,
          "total" : 0,
          "total_time_in_millis" : 0,
          "current_docs" : 0,
          "total_docs" : 0,
          "total_size_in_bytes" : 0,
          "current_size_in_bytes" : 0
        },
        "refresh" : {
          "total" : 0,
          "total_time_in_millis" : 0
        }
      },
      "graph_query_errors" : 0,
      "indexing_from_model_degraded" : false,
      "graph_index_errors" : 0,
      "training_requests" : 0,
      "script_compilation_errors" : 0
    },
    "WYIjPvCcQHKvQGoMFEATow" : {
      "max_distance_query_with_filter_requests" : 0,
      "graph_memory_usage_percentage" : 0.0,
      "graph_query_requests" : 0,
      "graph_memory_usage" : 0,
      "cache_capacity_reached" : false,
      "load_success_count" : 0,
      "training_memory_usage" : 0,
      "indices_in_cache" : { },
      "script_query_errors" : 0,
      "hit_count" : 0,
      "knn_query_requests" : 0,
      "total_load_time" : 0,
      "miss_count" : 0,
      "min_score_query_requests" : 0,
      "knn_query_with_filter_requests" : 0,
      "training_memory_usage_percentage" : 0.0,
      "max_distance_query_requests" : 0,
      "lucene_initialized" : false,
      "graph_index_requests" : 0,
      "faiss_initialized" : false,
      "load_exception_count" : 0,
      "training_errors" : 0,
      "min_score_query_with_filter_requests" : 0,
      "eviction_count" : 0,
      "nmslib_initialized" : false,
      "script_compilations" : 0,
      "script_query_requests" : 0,
      "graph_stats" : {
        "refresh" : {
          "total_time_in_millis" : 0,
          "total" : 0
        },
        "merge" : {
          "current" : 0,
          "total" : 0,
          "total_time_in_millis" : 0,
          "current_docs" : 0,
          "total_docs" : 0,
          "total_size_in_bytes" : 0,
          "current_size_in_bytes" : 0
        }
      },
      "graph_query_errors" : 0,
      "indexing_from_model_degraded" : false,
      "graph_index_errors" : 0,
      "training_requests" : 0,
      "script_compilation_errors" : 0
    },
    "XZKc9zROR_OHbSU7KlWYKQ" : {
      "max_distance_query_with_filter_requests" : 0,
      "graph_memory_usage_percentage" : 0.0,
      "graph_query_requests" : 0,
      "graph_memory_usage" : 0,
      "cache_capacity_reached" : false,
      "load_success_count" : 0,
      "training_memory_usage" : 0,
      "indices_in_cache" : { },
      "script_query_errors" : 0,
      "hit_count" : 0,
      "knn_query_requests" : 0,
      "total_load_time" : 0,
      "miss_count" : 0,
      "min_score_query_requests" : 0,
      "knn_query_with_filter_requests" : 0,
      "training_memory_usage_percentage" : 0.0,
      "max_distance_query_requests" : 0,
      "lucene_initialized" : false,
      "graph_index_requests" : 0,
      "faiss_initialized" : false,
      "load_exception_count" : 0,
      "training_errors" : 0,
      "min_score_query_with_filter_requests" : 0,
      "eviction_count" : 0,
      "nmslib_initialized" : false,
      "script_compilations" : 0,
      "script_query_requests" : 0,
      "graph_stats" : {
        "refresh" : {
          "total_time_in_millis" : 0,
          "total" : 0
        },
        "merge" : {
          "current" : 0,
          "total" : 0,
          "total_time_in_millis" : 0,
          "current_docs" : 0,
          "total_docs" : 0,
          "total_size_in_bytes" : 0,
          "current_size_in_bytes" : 0
        }
      },
      "graph_query_errors" : 0,
      "indexing_from_model_degraded" : false,
      "graph_index_errors" : 0,
      "training_requests" : 0,
      "script_compilation_errors" : 0
    }
  }
}

@navneet1v
Collaborator

Sure, the problem is that any exception thrown in the mapper gets wrapped: internalParseDocument(mapping, metadataFieldsMappers, context, parser);. So it's not going to be able to retry easily even if we throw the CBException. That being said, I'm wondering if we should (re)move this check on the circuit breaker. We could put it in onIndexModule.addIndexOperationListener##preIndex. This might be better because it actually would be used before the index operation occurs: https://github.com/opensearch-project/OpenSearch/blob/2e4cc8c6e12f0e5fdfe9274da0126e81f95f59b3/server/src/main/java/org/opensearch/index/shard/IndexShard.java#L1199.

Yes I think we should do this.

@navneet1v
Collaborator

Sure, there is no change in this.

Would it be better to also have the node-level CB info, since the CB has now moved from the top level to the node level? Thoughts?
