Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Search request processors receive different inputs than the search API request body #17358

Open
ohltyler opened this issue Feb 14, 2025 · 5 comments
Labels
bug Something isn't working Search Search query, autocomplete ...etc

Comments

@ohltyler
Copy link
Member

Describe the bug

Search request processors offer manipulation on the search request when users execute a search API call. However, the request the processors receive may be different than the request the users send. This is due to pre-processing/translation/validation that OpenSearch performs before the processors have access to the request. As a result, this can lead to runtime failures.

For example, users may write an ML inference search request processor to parse out a field in the search request, and generate an embedding for it. But, if the request JSON changes between the user-supplied request, and the request received at runtime, it can lead to parsing failures, and failures in the processor to find the original intended value in the search request.

Related component

No response

To Reproduce

See example configurations in opensearch-project/dashboards-flow-framework#574

Expected behavior

From my perspective, ideally the search request processors have access to the exact search request that the users supply in the search API. This provides continuity and consistency, and lets users configure transformations and/or data parsing of that request in a consistent manner. After all search request processors have been completed, the original translation/validation can happen, and the API logic can continue as normal.

Additional Details

@mingshl
Copy link
Contributor

mingshl commented Feb 18, 2025

I think this is a known assumption that a search request processor rewrites a search request, not query dsl string.

The query is constructed in an expanded form with optional parameters in search request before reaching the search pipeline. The JSON parsing should target this expanded query form.

I am thinking, we should have a way to show query expansion form, then it can help users to write proper json path.

for example for a match query:

GET _search
{
  "query": {
    "match": {
      "title": "wind"
    }
  }
}

we can have _search/view_source API

then when sending

GET _search/_view_source
{
  "query": {
    "match": {
      "title": "wind"
    }
  }
}

returns

{
  "query": {
    "match": {
      "title": {
        "query": "jeans",
        "operator": "OR",
        "prefix_length": 0,
        "max_expansions": 50,
        "fuzzy_transpositions": true,
        "lenient": false,
        "zero_terms_query": "NONE",
        "auto_generate_synonyms_phrase_query": true,
        "boost": 1.0
      }
    }
  }
}

this new preview API can help users to understand the expansion query with optional parameters. In this way, when users want to get the query text "jeans", they can come up the json path query.match.title.query

@ohltyler
Copy link
Member Author

I think this is a known assumption that a search request processor rewrites a search request, not query dsl string.

As a user myself, I wouldn't agree with this assumption. See documentation as well.

A search request processor intercepts a search request (the query and the metadata passed in the request), performs an operation with or on the search request, and submits the search request to the index.

Query DSL is constrained to be a JSON interface, and my intuition is always that the search request processors are acting on just that - receiving that JSON, performing some transformation, and passing the JSON on. This is consistent with how ingest processors or search response processors act on individual JSON documents.

While I'm not opposed to having another API to fetch the translated/expanded query, I think it still is adding an unnecessary layer of knowledge for users to understand; and, would be required to understand when building processors that are dependent on them. Currently, I believe ML inference search request processors are the only ones that are prone to this, but I have further concerns that over time as more and more search request processors get added, will require understanding this. This makes it difficult on the user:

  • need to understand the nuance and where in the search execution this translation happens
  • need to keep executing some search to fetch the translated/expanded version (either via dummy search pipeline or some new API)
  • configure a processor against that mapping
  • if they update the query, it will indeterminately change the mapping, and users would need to re-run the search, fetch the new expanded query, and update the processor accordingly

@msfroh
Copy link
Collaborator

msfroh commented Feb 27, 2025

Query DSL is constrained to be a JSON interface, and my intuition is always that the search request processors are acting on just that - receiving that JSON, performing some transformation, and passing the JSON on. This is consistent with how ingest processors or search response processors act on individual JSON documents.

This is an incorrect intuition. Query DSL is constrained to be Java objects. For example, if you use a transport or gRPC client, there is no JSON involved at all. It happens that both ingest processors and search response processors have Map objects within them (corresponding to the document source), which mirror the shape of the requested/returned JSON, but they're still not JSON.

I'm not sure how to properly convey the fact that OpenSearch does not in fact operate on JSON strings. The OpenSearch REST APIs may accept and return JSON strings, but within OpenSearch, it's nothing but Java objects.

This is not a bug. When the QueryBuilder objects are created, as their name suggests, they are implementing a Builder pattern. Builders are allowed to have default field values. Builders are allowed to provide convenience methods to accept a parameter and store it elsewhere within their internal state. They are complex Java classes with their own logic. They are not JSON strings.

@ohltyler
Copy link
Member Author

This is an incorrect intuition. Query DSL is constrained to be Java objects. For example, if you use a transport or gRPC client, there is no JSON involved at all. It happens that both ingest processors and search response processors have Map objects within them (corresponding to the document source), which mirror the shape of the requested/returned JSON, but they're still not JSON.

Makes sense. I'm just thinking from the REST request perspective, if there would be some mechanism for the search request processors to have access to the REST body (as JSON), similar to the "document source" for ingest and search response processors. Maybe even as an optional configuration.

If there is no reasonable technical way for accomplishing this, then I think we need to be careful when creating new search request processors, particularly ones that are designed to be dependent on the REST request body like the ML inference processors. Creating processors that are performing some operation against an unknown source input isn't a good user experience, as users are currently required to run an initial verbose_pipeline call, parse out the source input for the particular processor, then build mappings around that.

@andrross
Copy link
Member

andrross commented Mar 3, 2025

Catch All Triage - 1 2 3

@andrross andrross added Search Search query, autocomplete ...etc and removed untriaged labels Mar 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Search Search query, autocomplete ...etc
Projects
Status: 🆕 New
Development

No branches or pull requests

4 participants