[BUG] Search request processors receive different inputs than the search API request body #17358

ohltyler · 2025-02-14T01:11:38Z

Describe the bug

Search request processors offer manipulation on the search request when users execute a search API call. However, the request the processors receive may be different than the request the users send. This is due to pre-processing/translation/validation that OpenSearch performs before the processors have access to the request. As a result, this can lead to runtime failures.

For example, users may write an ML inference search request processor to parse out a field in the search request, and generate an embedding for it. But, if the request JSON changes between the user-supplied request, and the request received at runtime, it can lead to parsing failures, and failures in the processor to find the original intended value in the search request.

Related component

No response

To Reproduce

See example configurations in opensearch-project/dashboards-flow-framework#574

Expected behavior

From my perspective, ideally the search request processors have access to the exact search request that the users supply in the search API. This provides continuity and consistency, and lets users configure transformations and/or data parsing of that request in a consistent manner. After all search request processors have been completed, the original translation/validation can happen, and the API logic can continue as normal.

Additional Details

mingshl · 2025-02-18T01:03:10Z

I think this is a known assumption that a search request processor rewrites a search request, not query dsl string.

The query is constructed in an expanded form with optional parameters in search request before reaching the search pipeline. The JSON parsing should target this expanded query form.

I am thinking, we should have a way to show query expansion form, then it can help users to write proper json path.

for example for a match query:

GET _search
{
  "query": {
    "match": {
      "title": "wind"
    }
  }
}

we can have _search/view_source API

then when sending

GET _search/_view_source
{
  "query": {
    "match": {
      "title": "wind"
    }
  }
}

returns

{
  "query": {
    "match": {
      "title": {
        "query": "jeans",
        "operator": "OR",
        "prefix_length": 0,
        "max_expansions": 50,
        "fuzzy_transpositions": true,
        "lenient": false,
        "zero_terms_query": "NONE",
        "auto_generate_synonyms_phrase_query": true,
        "boost": 1.0
      }
    }
  }
}

this new preview API can help users to understand the expansion query with optional parameters. In this way, when users want to get the query text "jeans", they can come up the json path query.match.title.query

ohltyler · 2025-02-18T01:26:41Z

I think this is a known assumption that a search request processor rewrites a search request, not query dsl string.

As a user myself, I wouldn't agree with this assumption. See documentation as well.

A search request processor intercepts a search request (the query and the metadata passed in the request), performs an operation with or on the search request, and submits the search request to the index.

Query DSL is constrained to be a JSON interface, and my intuition is always that the search request processors are acting on just that - receiving that JSON, performing some transformation, and passing the JSON on. This is consistent with how ingest processors or search response processors act on individual JSON documents.

While I'm not opposed to having another API to fetch the translated/expanded query, I think it still is adding an unnecessary layer of knowledge for users to understand; and, would be required to understand when building processors that are dependent on them. Currently, I believe ML inference search request processors are the only ones that are prone to this, but I have further concerns that over time as more and more search request processors get added, will require understanding this. This makes it difficult on the user:

need to understand the nuance and where in the search execution this translation happens
need to keep executing some search to fetch the translated/expanded version (either via dummy search pipeline or some new API)
configure a processor against that mapping
if they update the query, it will indeterminately change the mapping, and users would need to re-run the search, fetch the new expanded query, and update the processor accordingly

msfroh · 2025-02-27T21:47:25Z

Query DSL is constrained to be a JSON interface, and my intuition is always that the search request processors are acting on just that - receiving that JSON, performing some transformation, and passing the JSON on. This is consistent with how ingest processors or search response processors act on individual JSON documents.

This is an incorrect intuition. Query DSL is constrained to be Java objects. For example, if you use a transport or gRPC client, there is no JSON involved at all. It happens that both ingest processors and search response processors have Map objects within them (corresponding to the document source), which mirror the shape of the requested/returned JSON, but they're still not JSON.

I'm not sure how to properly convey the fact that OpenSearch does not in fact operate on JSON strings. The OpenSearch REST APIs may accept and return JSON strings, but within OpenSearch, it's nothing but Java objects.

This is not a bug. When the QueryBuilder objects are created, as their name suggests, they are implementing a Builder pattern. Builders are allowed to have default field values. Builders are allowed to provide convenience methods to accept a parameter and store it elsewhere within their internal state. They are complex Java classes with their own logic. They are not JSON strings.

ohltyler · 2025-02-27T22:49:50Z

This is an incorrect intuition. Query DSL is constrained to be Java objects. For example, if you use a transport or gRPC client, there is no JSON involved at all. It happens that both ingest processors and search response processors have Map objects within them (corresponding to the document source), which mirror the shape of the requested/returned JSON, but they're still not JSON.

Makes sense. I'm just thinking from the REST request perspective, if there would be some mechanism for the search request processors to have access to the REST body (as JSON), similar to the "document source" for ingest and search response processors. Maybe even as an optional configuration.

If there is no reasonable technical way for accomplishing this, then I think we need to be careful when creating new search request processors, particularly ones that are designed to be dependent on the REST request body like the ML inference processors. Creating processors that are performing some operation against an unknown source input isn't a good user experience, as users are currently required to run an initial verbose_pipeline call, parse out the source input for the particular processor, then build mappings around that.

andrross · 2025-03-03T17:20:41Z

Catch All Triage - 1 2 3

ohltyler added bug Something isn't working untriaged labels Feb 14, 2025

github-actions bot added the _No response_ label Feb 14, 2025

ohltyler mentioned this issue Feb 14, 2025

[BUG] Query parsing discrepancies lead to runtime failures opensearch-project/dashboards-flow-framework#574

Open

andrross added Search Search query, autocomplete ...etc and removed untriaged labels Mar 3, 2025

github-project-automation bot added this to Search Project Board Mar 3, 2025

github-project-automation bot moved this to 🆕 New in Search Project Board Mar 3, 2025

andrross removed the _No response_ label Mar 3, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Search request processors receive different inputs than the search API request body #17358

[BUG] Search request processors receive different inputs than the search API request body #17358

ohltyler commented Feb 14, 2025

mingshl commented Feb 18, 2025

ohltyler commented Feb 18, 2025

msfroh commented Feb 27, 2025 •

edited

Loading

ohltyler commented Feb 27, 2025

andrross commented Mar 3, 2025

[BUG] Search request processors receive different inputs than the search API request body #17358

[BUG] Search request processors receive different inputs than the search API request body #17358

Comments

ohltyler commented Feb 14, 2025

Describe the bug

Related component

To Reproduce

Expected behavior

Additional Details

mingshl commented Feb 18, 2025

ohltyler commented Feb 18, 2025

msfroh commented Feb 27, 2025 • edited Loading

ohltyler commented Feb 27, 2025

andrross commented Mar 3, 2025

msfroh commented Feb 27, 2025 •

edited

Loading