
Reduce latency in fetch phase for large shard counts #120419

Open

wants to merge 1 commit into base: main

Conversation

original-brownbear (Member)

We can reduce the latency of the fetch response by a non-trivial amount by moving the context-freeing for irrelevant shards to the end of the forked action.

For large shard counts (and/or with security in the mix) the old comment is not entirely correct: waking up the selector many times over and doing authorization work for each free request can consume measurable time.
Moving the freeing to the end of the task lets the fetches go out potentially much sooner, reduces contention on the counter, and reduces the impact of potential head-of-line blocking on connections where free-context requests could otherwise needlessly queue ahead of actual fetches.
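A minimal sketch of the reordering described above (hypothetical class and method names, not the actual Elasticsearch fetch-phase code): irrelevant shards no longer get their free-context request sent inline, it is remembered and issued only after every fetch has been dispatched.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "free contexts at the end of the forked
// action"; shard ids and operation strings are invented for the sketch.
class FetchDispatchSketch {
    // Returns the order in which network operations are issued.
    static List<String> dispatch(int shardCount, List<Integer> relevantShards) {
        List<String> wireOps = new ArrayList<>();
        List<Integer> contextsToFree = new ArrayList<>();
        for (int shard = 0; shard < shardCount; shard++) {
            if (relevantShards.contains(shard)) {
                // Relevant shard: send the fetch request right away.
                wireOps.add("fetch:" + shard);
            } else {
                // Irrelevant shard: do NOT send the free-context request
                // inline; defer it to the end of the task instead.
                contextsToFree.add(shard);
            }
        }
        // Every fetch is on the wire before any free-context request can
        // queue ahead of one on a shared connection.
        for (int shard : contextsToFree) {
            wireOps.add("free:" + shard);
        }
        return wireOps;
    }
}
```

With 5 shards of which only shards 1 and 3 are relevant, all fetches precede all frees in the issued order.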

@original-brownbear original-brownbear added >non-issue :Search Foundations/Search Catch all for Search Foundations labels Jan 17, 2025
@elasticsearchmachine elasticsearchmachine added Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.0.0 labels Jan 17, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@original-brownbear (Member, Author)

Just to illustrate how much of a concern this is:

[flamegraph: CPU profile of the fetch phase]

For a request across ~5k shards that only has matches in 10 of them after the final reduce step, sending out these free-context requests uses close to 100% of the CPU time in fetch (that includes merging results! ... albeit without aggregations in this flamegraph). This translates into O(100ms) of latency for 5k shards and is highly variable as well. In fact, after #118490 the sending of the free requests makes up the majority of the end-to-end latency for many queries. Counting down early and moving the freeing to the end of the action solves the problem just fine in practice, and there's no need to fork this long action again.
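The "counting down early" idea above can be sketched as follows (hypothetical names, not the real `SearchPhase` code): irrelevant shards decrement the phase-completion counter immediately, so completion only ever waits on real fetch responses, and the fire-and-forget frees run after the phase is already done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of counting down the per-shard completion counter
// early for irrelevant shards; all names are invented for illustration.
class CountDownEarlySketch {
    static List<String> run(int shardCount, List<Integer> relevantShards) {
        List<String> log = new ArrayList<>();
        AtomicInteger remaining = new AtomicInteger(shardCount);
        List<Integer> toFree = new ArrayList<>();
        for (int shard = 0; shard < shardCount; shard++) {
            if (relevantShards.contains(shard)) {
                log.add("fetch:" + shard);
                // In reality the fetch-response listener counts down;
                // simulated synchronously here.
                if (remaining.decrementAndGet() == 0) log.add("phase-done");
            } else {
                toFree.add(shard);
                // Count down immediately: phase completion never waits on
                // a free-context round trip.
                if (remaining.decrementAndGet() == 0) log.add("phase-done");
            }
        }
        // Frees are sent only after the phase has completed.
        for (int shard : toFree) log.add("free:" + shard);
        return log;
    }
}
```

In the 5-shard example the phase completes before a single free-context request goes out, which is exactly why the frees no longer contribute to end-to-end latency.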
