Description
Before this PR, if a user did basic search with
The time-to-first-docs was pretty high because we ran filtering and reranking before sending any docs.
We also looked for other opportunities to reduce latency. In particular, the first LLM call, which decides which tool to use and with what arguments (if any), often takes up to a second. In that time, we now run
in parallel with tool choice.
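The overlap described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: `choose_tool` and `prepare_search` are hypothetical stand-ins (the first for the tool-choice LLM call, the second for work that does not depend on its result), the sleeps only simulate latency, and the sketch uses the standard-library `concurrent.futures` rather than the project's `TimeoutThread`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def choose_tool(query: str) -> str:
    """Hypothetical stand-in for the slow tool-choice LLM call."""
    time.sleep(0.5)  # simulate LLM latency
    return "search"


def prepare_search(query: str) -> dict:
    """Hypothetical stand-in for search prep that doesn't need the tool choice."""
    time.sleep(0.4)  # simulate independent work
    return {"query": query.lower()}


def run_tool_choice_with_prefetch(query: str, timeout_s: float = 5.0) -> tuple[str, dict]:
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the independent work before the LLM call...
        prep_future = pool.submit(prepare_search, query)
        # ...so it overlaps with tool choice running on the calling thread.
        tool = choose_tool(query)
        # Join immediately after tool choice; raises TimeoutError if prep overruns.
        prep = prep_future.result(timeout=timeout_s)
    return tool, prep
```

Run sequentially, the two simulated steps would take about 0.9 s; overlapped, the total is roughly the longer of the two (about 0.5 s). If the background work is slower than tool choice, the join is where the extra wait shows up, which is the same kind of worst-case cost on the tool choice node noted below.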
All told, the 1-2 seconds of latency we used to see after choosing a tool are now down to 0.1-0.2 s. The caveat is that, in the worst case, the newly introduced parallelism makes the tool choice node take up to 0.5 s longer.
NOTE: Technically, we could pass the TimeoutThread objects around until their values are actually needed, but so little happens between the point where I call join() and the place where those values are used that I opted for code cleanliness.
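For comparison, the deferred alternative mentioned in the note might look like the sketch below. `LazyResult` is a hypothetical name and is not the project's `TimeoutThread`; it is a plain `threading.Thread` wrapper that starts work immediately and only joins on the first `.get()`, so the handle can be passed around until the value is needed.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar, cast

T = TypeVar("T")


class LazyResult(Generic[T]):
    """Hypothetical handle: runs fn in a background thread, joins on first get()."""

    def __init__(self, fn: Callable[[], T]) -> None:
        self._value: Optional[T] = None
        self._error: Optional[Exception] = None
        self._thread = threading.Thread(target=self._run, args=(fn,), daemon=True)
        self._thread.start()  # work begins as soon as the handle is created

    def _run(self, fn: Callable[[], T]) -> None:
        # Capture the result or the exception so get() can surface it on the caller's thread.
        try:
            self._value = fn()
        except Exception as exc:
            self._error = exc

    def get(self, timeout: Optional[float] = None) -> T:
        # The join is deferred until here, i.e. until the value is actually needed.
        self._thread.join(timeout)
        if self._thread.is_alive():
            raise TimeoutError("background work did not finish in time")
        if self._error is not None:
            raise self._error
        return cast(T, self._value)
```

The trade-off matches the note: a deferred handle only pays off when meaningful work happens between starting the thread and needing its value; otherwise an early join keeps the control flow simpler.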
How Has This Been Tested?
Tested and benchmarked in the UI.