Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-16505: Switch UpdateShardHandler.getRecoveryOnlyHttpClient to Jetty HTTP2 #2276

Conversation

iamsanjay
Copy link
Contributor

https://issues.apache.org/jira/browse/SOLR-16505

Switch UpdateShardHandler.getRecoveryOnlyHttpClient to Jetty HTTP2

In UpdateShardHandler.java removed the old reference to Apache HTTP client for recoveryUpdate, and initialized with Jetty HTTP client. The recovery client is only used by RecoveryStrategy.java to recreate the builder with the existing HTTPClient.
In RecoveryStrategy.java, sendPrepRecoveryCmd sending async request using HttpUriRequest, After replacing it with newer client there is no option to create HttpUriRequest. However, we can send asyncRequest using executor. Upon calling, withExecutor() to set the executor, it did not work. Further inspection revealed that Http2SolrClient only set executor when they creating new HttpClient, In case, if builder is initialized with existing client there is no way to set the executor. I added the new code to tackle that issue. But now we have two different executor: one for Http2SolrClient and another one for HttpClient.

Tests

Added stressRecoveryTest.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addition of the test is really nice.

@@ -133,16 +135,25 @@ public UpdateShardHandler(UpdateShardHandlerConfig cfg) {
DistributedUpdateProcessor.DISTRIB_FROM,
DistributingUpdateProcessorFactory.DISTRIB_UPDATE_PARAM);
Http2SolrClient.Builder updateOnlyClientBuilder = new Http2SolrClient.Builder();
Http2SolrClient.Builder recoveryOnlyClientBuilder = new Http2SolrClient.Builder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Come to think of it, updateOnly & recoveryOnly clients are configured the same except for withTheseParamNamesInTheUrl but it's benign to have that shared. Ultimately, the point I think is for both clients to be separated so that saturation in one (particularly updates) doesn't block the other (recovery). Hopefully I can get Mark Miller to weigh in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markrmiller do you recall why there are separate clients for "updateOnly" vs "recoveryOnly"?

@epugh
Copy link
Contributor

epugh commented Feb 20, 2024

Thank you @dsmiley for reviewing this PR... I'm excited for the work @iamsanjay has been doing, and I do NOT feel comfortable reviewing/merging code in this area ;-)

@iamsanjay
Copy link
Contributor Author

I was thinking about the executor in the Http2SolrClient class.

What is the issue?

  1. Http2SolrClent calls createHttpClient to create a Jetty HTTP client. By default, If executor is not provided then a executor will be created for it to initialize the Jetty HTTP.
   private HttpClient createHttpClient(Builder builder) {
    HttpClient httpClient;

    executor = builder.executor;
    if (executor == null) {
      BlockingArrayQueue<Runnable> queue = new BlockingArrayQueue<>(256, 256);
      this.executor =
          new ExecutorUtil.MDCAwareThreadPoolExecutor(
              32, 256, 60, TimeUnit.SECONDS, queue, new SolrNamedThreadFactory("h2sc"));
      shutdownExecutor = true;
    } else {
      shutdownExecutor = false;
    }
  1. However, If someone creates Http2SolrClient.Builder with existing Jetty HTTP client by calling withHttpClient there will be no executor available to them, even if withExecutor method is called. Therefore, all the asyncRequest will result in NullPointerException. Because only createHttpClient is the only method which handles the initialization of executor and If new builder is created is using existing Jetty client then no call is made to createHttpClient and hence no executor.

Solution
There are two ways that I am thinking as of now to handle it, in case If new builder is created is using the existing Jetty HTTP client.

  1. In Http2SolrClient constructor, below snipped of code is added to check whether executor is null or not. And then re-initlialize the executor at Http2SolrClient level. At this point, we have two executors: one for Http2SolrClient and another for Jetty Http client. And Http2SolrClient is only responsible for closing the ones they making from inside, and If you are providing any Executor from outside, then user will be responsible for closing it.
if (this.executor == null) {
        this.executor = http2SolrClient.executor;
      }
  1. Second, As we recreate builder using existing Jetty Client by calling withHttpClient, there we can also add code that would also set executor(one which we created calling createHttpClient) for Http2SolrClient as well among with other properties. But that means that Every builder recreated using the existing Jetty client will have access to executor without them knowing about it. Below is that executor. Something I am not much comfortable about, I believe If user creating Http2SolrClient to send asyncRequest then they should provide the implementation for executor.

Also In this option there will be code to check wh

 BlockingArrayQueue<Runnable> queue = new BlockingArrayQueue<>(256, 256);
      this.executor =
          new ExecutorUtil.MDCAwareThreadPoolExecutor(
              32, 256, 60, TimeUnit.SECONDS, queue, new SolrNamedThreadFactory("h2sc"));

@dsmiley wdyt? I have changed the code to implement the second option for now.

@stillalex I see that we transferred some properties from older Http2SolrClient as we call withHttpClient to create new Http2SolrClient but the exisitng executor was never used for the new builder. Was there any reason for that?

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! This really shows how setting out to accomplish one goal can shine a light on some issues to make improvements tangential to the affected code.

solr/CHANGES.txt Outdated Show resolved Hide resolved
@iamsanjay
Copy link
Contributor Author

iamsanjay commented Mar 3, 2024

@dsmiley I was working on IndexFetcher to change their client to Http2SolrClient. Suddenly all the request in IndexFetcher started failing due to authentication issue. Further inspection revealed that the listener required for Auth is not registered for recoveryOnlyClient. Interestingly, no calls in RecoveryStrategy requires Auth, and therefore no request failed there.

Below is the code, that called on updateOnlyClient to register the Auth listener.

public void setSecurityBuilder(HttpClientBuilderPlugin builder) {
builder.setup(updateOnlyClient);
}

However, only calling this method for recoveryOnlyClient won't work. As recreating Http2SolrClient using withHttpClient won't pass on listeners. Basically same issue that we have faced with Executors.

@dsmiley
Copy link
Contributor

dsmiley commented Mar 5, 2024

Thanks for digging into the authentication issue further; this is shining a light on an issue I was vaguely aware of but now I can see it directly. And I'm so glad there are tests; I thought it wasn't tested. Can you name one please?

One thing will help; stop excessive creation of new SolrClient instances when we can re-use the same one that is configured with this auth listener thing. Thus don't have fields on classes that hold an HttpClient (we'll deem this bad/questionable practice); instead hold an Http2SolrClient. If due to customized timeouts we need a new Http2SolrClient, we could ensure that the builder withHttpClient(http2SolrClient) copies listeners (and anything else like an executor; why not). That could be its own JIRA.

@iamsanjay
Copy link
Contributor Author

TestPullReplicaWithAuth.testPKIAuthWorksForPullReplication is the one.

On a side note, How about we rethink the design decorating the request. Right now, we recreate the Http2SolrClient to override socketTimeout and connTimeout. Instead of recreating Http2SolrClient, we introduced another method which would override Timeouts.

request(SolrRequest<?> solrRequest, String collection, int socketTimeout, int connTimeout)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a reference to this file from the test; am I missing something?

If this is actually used, then please consider instead modifying your test to manipulate the temp config files so that we don't have yet another test config file to maintain. For an example of this technique, see org.apache.solr.search.TestCpuAllowedLimit#createConfigSet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't push this point further because your ...WithAuth test uses ReplicationTestHelper.SolrInstance which has sone conveniences and I'm not sure how compatible it is with my suggestion.

@iamsanjay
Copy link
Contributor Author

Below test failed!

./gradlew :solr:core:test --tests "org.apache.solr.cloud.OverseerStatusTest.test" -Ptests.jvms=96 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=5C4EADB21CB7FBF1 -Ptests.timeoutSuite=600000! -Ptests.file.encoding=US-ASCII

Exception

 java.lang.NullPointerException
   >         at __randomizedtesting.SeedInfo.seed([5C4EADB21CB7FBF1:D41A9268B24B9609]:0)
   >         at org.apache.solr.cloud.OverseerStatusTest.test(OverseerStatusTest.java:78)
   >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   >         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   >         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
   >         at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:80)
   >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
   >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
   >         at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)

git bisect 61a67c0

CC @noblepaul

@@ -1878,7 +1870,6 @@ private int fetchPackets(FastInputStream fis) throws Exception {
log.debug("Fetched and wrote {} bytes of file: {}", bytesDownloaded, fileName);
// errorCount is always set to zero after a successful packet
errorCount = 0;
if (bytesDownloaded >= size) return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a git blame and found: SOLR-14299 IndexFetcher doesn't reset count to 0 after the last packet is received

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That issue merely moved this code slightly; it didn't introduce it. Anyway, I suppose this code potentially prematurely exited before reading the whole stream, and that's why you removed it?

Comment on lines -1846 to -1848
log.warn("No content received for file: {}", fileName);
return NO_CONTENT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least an assert (or comment) communicating this is unexpected

@@ -845,6 +852,13 @@ public static class Builder

protected Long keyStoreReloadIntervalSecs;

public Http2SolrClient.Builder withListenerFactory(List<HttpListenerFactory> listenerFactory) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@epugh I know you worked on SolrClient immutability efforts. Here, specific to Http2SolrClient, the "listenerFactory" list can now be set on the builder. The setter should probably also be marked Deprecated. Do you think this should be separated to another PR?

@iamsanjay
Copy link
Contributor Author

iamsanjay commented May 1, 2024

Build failed due to one of the flaky test cases as test passing when ran again.

gradlew :solr:core:test --tests "org.apache.solr.search.TestCoordinatorRole.testNRTRestart" -Ptests.jvms=96 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=383FA9310A6D5B37 -Ptests.timeoutSuite=600000! -Ptests.file.encoding=US-ASCII

The error message that we are getting

initial error time threshold exceeded

try (SolrClient solrClient = new Http2SolrClient.Builder(qaNode).build()) {
for (int i = 0; i < 100; i++) {
try {
QueryResponse queryResponse = solrClient.query(COLL, q);
docs = queryResponse.getResults();
assertNotNull("Docs should not be null. Query response was: " + queryResponse, docs);
if (docs.size() > 0) {
found = true;
break;
}
} catch (SolrException ex) {
// we know we're doing tricky things that might cause transient errors
// TODO: all these query requests go to the QA node -- should QA propagate internal
// request errors to the external client (and the external client retry?) or should QA
// attempt to failover transparently in the event of an error?
if (i < 5) {
log.info("swallowing transient error", ex);
} else {
log.error("only expect actual _errors_ within a small window (e.g. 500ms)", ex);
fail("initial error time threshold exceeded");
}
}

I will try to see If i can get more details on this. Should I open a Jira ticket for this?

One observation:
Above test case is setting up the cluster outside the try block. Therefore, If there is any issue occurred during initialization then it would never shut down the cluster.

public void testNRTRestart() throws Exception {
// we restart jetty and expect to find on disk data - need a local fs directory
useFactory(null);
String COLL = "coordinator_test_coll";
MiniSolrCloudCluster cluster =
configureCluster(3)
.withJettyConfig(jetty -> jetty.enableV2(true))
.addConfig("conf", configset("conf2"))
.configure();
System.setProperty(NodeRoles.NODE_ROLES_PROP, "coordinator:on");
JettySolrRunner qaJetty = cluster.startJettySolrRunner();
String qaJettyBase = qaJetty.getBaseUrl().toString();
System.clearProperty(NodeRoles.NODE_ROLES_PROP);
ExecutorService executor =

@dsmiley
Copy link
Contributor

dsmiley commented May 1, 2024

I looked at the test history -- seems flaky. I found a JIRA issue where you can relay these comments.

Looking forward to the lingering small matters being addressed so we can get this merged!

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just maybe one little change needed to move a getter and I think it's done

@iamsanjay
Copy link
Contributor Author

iamsanjay commented May 8, 2024

Thinking out loud! The new logic, however bit ugly it is because it gives freedom to manipulate the Http2SolrClient directly instead of creating it through Builder patter -- anti-pattern, to copy the listeners from older Http2SolrClient to the new Http2SolrClient is added only because we needed new Http2SolrClient due to the custom timeouts configuration for replication logic.

Is there any way to avoid recreating the Http2SolrClient?

The default socketTimeout and connectionTimeout for UpdateShardHandler http client is 60 secs for both, and for IndexFetcher Recovery client It is 120 and 60 seconds respectively. Can we deprecate or basically remove the older way of setting these timeouts at replication Handler level and use the UpdateShardHandler config timeouts i.e. 60 seconds for both socket and connection? @dsmiley @markrmiller

@dsmiley
Copy link
Contributor

dsmiley commented May 8, 2024

The most important part of Http2SolrClient to be re-used (instead of re-created) is Jetty's HttpClient. Creating another Http2SolrClient that uses an existing HttpClient isn't too bad but yeah it'd be nice if we didn't have to re-create one. This PR is an improvement! IndexFetcher.getLatestVersion, fetchFileList, getStream, and getDetails not only re-use the underlying HttpClient, as before, but now use the SolrClient wrapper, which is more succinct & simple in so doing.

Some thoughts on the commit message summarizing this long one:

  • UpdateShardHandler
    • switch "recoveryOnlyHttpClient" to Http2SolrClient
  • RecoveryStrategy:
    • Use Http2SolrClient
    • Simplify cancelation of prep recovery command
  • IndexFetcher:
    • Use Http2SolrClient
    • Ensure the entire stream is consumed to avoid a connection reset
  • Http2SolrClient:
    • make HttpListenerFactory configurable (used for internal purposes)

It's unclear if there was any change with respect to gzip / "useExternalCompression" -- this part of the diff is confusing.

@dsmiley
Copy link
Contributor

dsmiley commented May 8, 2024

Unless I hear to the contrary, I'll commit this tonight like midnight EST. Just "main" for a bit.

The CHANGES.txt is currently:

SOLR-16505: Switch UpdateShardHandler.getRecoveryOnlyHttpClient to Jetty HTTP2
But want to reword do
SOLR-16505: Use Jetty HTTP2 for index replication and other "recovery" operations.

That will be more meaningful to users. CHANGES.txt needs to speak to users.

gspgsp and others added 4 commits May 8, 2024 17:02
for the pwd encode format, there missed a ')', which is confusing.
…pache#2395)


The URPF, `NumFieldLimitingUpdateRequestProcessorFactory`, blocks all
update requests that go through `processAdd` if the core exceeds a
configurable threshold of fields.

The factory accepts two parameters: `maxFields` is a required integer
representing the maximum field threshold,  and `warnOnly` is an optional
boolean that (when enabled) has the URP chain log warnings instead of
blocking updates.

The factory is included in the default configset, with warnOnly=false and
maxFields=1000.  (More lenient settings will be used on branch_9x)

---------

Co-authored-by: David Smiley <[email protected]>
@github-actions github-actions bot added documentation Improvements or additions to documentation configs labels May 8, 2024
@dsmiley dsmiley merged commit e62cb1b into apache:main May 9, 2024
4 checks passed
dsmiley pushed a commit that referenced this pull request May 14, 2024
UpdateShardHandler
* switch "recoveryOnlyHttpClient" to Http2SolrClient
RecoveryStrategy:
* Use Http2SolrClient
* Simplify cancelation of prep recovery command
IndexFetcher:
* Use Http2SolrClient
* Ensure the entire stream is consumed to avoid a connection reset
Http2SolrClient:
* make HttpListenerFactory configurable (used for internal purposes)

---------

Co-authored-by: iamsanjay <[email protected]>
Co-authored-by: David Smiley <[email protected]>
(cherry picked from commit e62cb1b)

fix buildUrl
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants