
Docker integ test with async API #1003

Merged
merged 11 commits into main from docker-integ-with-async-api on Jan 16, 2025

Conversation

normanj-bitquill
Contributor

Description

Updates the integration test docker stack to support the OpenSearch Async API and to use Minio as an S3 storage engine. Also includes having everything configured on startup.

Related Issues

#992

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@normanj-bitquill
Contributor Author

This is not complete yet. Still need to:

  1. Create a replacement for EMRServerlessClient that will use the local docker environment. This will include starting a docker container of Spark to run the query. Given that the idea is to start the container with spark-submit, it should be possible to easily change to the Spark EMR image instead.
  2. Alter the Python integration tests to use the OpenSearch Async API instead
  3. Remove the spark and spark-worker containers since Spark would be run on demand

@YANG-DB
Member

YANG-DB commented Dec 31, 2024

@normanj-bitquill I've also started working on a similar PR with an Iceberg-based docker-compose -
let's see if we can share knowledge ...

@normanj-bitquill
Contributor Author

Current Status

In my testing, I have added a second OpenSearch container. This resolves issues with the cluster and indices going into yellow state.

Changes in OpenSearch node:

  • Bind the docker socket
  • Replace the aws-java-sdk-emrserverless-*.jar file with one that uses a different EMRClient. The new EMRClient writes a file to a directory (/tmp/docker) with the arguments for running docker.
  • A script is run on the container that reads files from /tmp/docker and runs docker with the arguments.
  • Disable system indices and system indices permissions
  • Connect with second OpenSearch container

What works:

  • Initiate a query with the async API on the OpenSearch container
  • Starts a new Spark docker container to call spark-submit using information from the start job request that the EMRClient received
  • New container connects to spark cluster to execute the job
  • Creates the request and result indices
  • Saves the result to the result index

What is missing:

  • Detecting that the job has finished (this is on the spark-submit container)
  • The query is for a table on an S3 data source; the query fails with table not found

@normanj-bitquill
Contributor Author

@YANG-DB I have a partially working async API in the latest commit. These are my testing steps:

  1. Start the cluster using docker compose up
  2. Run spark-shell on the master spark container.
    1. Create an external table with a location under s3a://integ-test/
    2. Put some data in the table
  3. Submit a query:
    curl -u '...' -X POST -H 'Content-Type: application/json' -d '{"datasource": "mys3", "lang": "sql", "query": "SELECT * FROM mys3.default.foo"}' 'http://localhost:9200/_plugins/_async_query'
    
  4. Retrieve the result (will need to wait until it is ready, maybe about 1 minute)
    curl -u '...' -X POST -H 'Content-Type: application/json' -d '{}' 'http://localhost:9200/query_execution_result_mys3/_search?pretty'
    

The OpenSearch container will need to bind the docker socket /var/run/docker.sock. By default this isn't available on Mac, but can be enabled.
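For reference, binding the docker socket in the compose file might look like the fragment below (the service name and image tag are placeholders, not necessarily what this PR's compose file uses):

```yaml
services:
  opensearch:
    image: opensearchproject/opensearch:2.18.0   # placeholder tag
    volumes:
      # Lets the container start sibling Spark containers via the host daemon.
      - /var/run/docker.sock:/var/run/docker.sock
```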

The OpenSearch container will start another container to process the async query. This is the place where we could slip in the EMR spark container (if there is any value from it).
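Step 2 of the testing steps above (creating the external table and loading data) might look roughly like the following; the table name, schema, container name, and S3 path are illustrative assumptions, not the exact commands from my testing:

```shell
# Hypothetical DDL for creating and populating the external table (step 2).
# Table name, columns, and location are assumptions for illustration.
SQL_DDL='
CREATE EXTERNAL TABLE default.foo (id INT, name STRING)
LOCATION "s3a://integ-test/foo";
INSERT INTO default.foo VALUES (1, "alpha"), (2, "beta");
'
# Run it on the master Spark container, e.g.:
# docker exec spark spark-sql -e "$SQL_DDL"
echo "$SQL_DDL"
```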

@normanj-bitquill
Contributor Author

I have not tested retrieving results using the Async API. This is likely broken, since we cannot check the EMR job status. I also haven't tested a streaming query (also likely broken).

docker/integ-test/metastore/Dockerfile (outdated, resolved)
docker/integ-test/opensearch/Dockerfile (outdated, resolved)
docker/integ-test/opensearch/docker-command-runner.sh (outdated, resolved)

su opensearch ./opensearch-docker-entrypoint.sh "$@"

kill -TERM `cat /var/run/docker-command-runner.pid`
Member

Why is this called?
Can you please explain the entire flow for the docker compose?

Contributor Author

When an async query is submitted, a start job request is built containing the arguments for running spark-submit. Normally the start job request is submitted to EMR, which handles starting a docker container and running spark-submit. In our case, I want to call docker to start a new container to run spark-submit.

Something is preventing the OpenSearch Java process from running external commands. I tried updating the Java security policy but was still unable to run external commands. As a workaround, the docker arguments are written to a file in /tmp/docker. A separate shell script, docker-command-runner.sh, reads the files in /tmp/docker and runs docker with those arguments.
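A minimal sketch of that workaround (the file handling and loop structure are assumptions; the actual docker-command-runner.sh may differ):

```shell
#!/bin/sh
# Hypothetical sketch of docker-command-runner.sh: the stub EMRClient writes
# docker arguments to files under /tmp/docker, and this loop picks them up
# and invokes docker. CMD_DIR and DOCKER_BIN names are assumptions.
CMD_DIR="${CMD_DIR:-/tmp/docker}"
DOCKER_BIN="${DOCKER_BIN:-docker}"

process_pending() {
  for f in "$CMD_DIR"/*; do
    [ -f "$f" ] || continue
    args=$(cat "$f")        # one file = one docker invocation's arguments
    "$DOCKER_BIN" $args     # intentional word splitting of the arguments
    rm -f "$f"
  done
}

# The real script would loop: while true; do process_pending; sleep 1; done
```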

Ideally the configuration could be fixed so that docker can be run directly from the Java code.

The motivation for calling docker is to better emulate what is happening when EMR is called.

Member

ok nice !!

@normanj-bitquill
Contributor Author

@YANG-DB Using the async API to get the results works once the results are available in the result index.

Some interesting behaviour though:

  • When you submit a query, the container starts up and runs the query. It then has a loop (looking for new queries) and continues until it finally times out after 3 minutes.
  • If you submit a query and then submit a second query (in a different session), the second query isn't actually run until the container for the first query has exited (after the 3 minute timeout)
  • When the container starts up, it downloads the Flint and PPL extensions. This adds latency and I am trying to have it use the locally built Flint and PPL jar files.

I am likely missing some code for the new EMRClient for job management. This may explain some of the behaviour above. If the Spark app is going to run for 3 minutes, we should skip calling docker when the same session is used and the container is still running.

@normanj-bitquill normanj-bitquill changed the title WIP: Docker integ test with async API Docker integ test with async API Jan 13, 2025
@normanj-bitquill
Contributor Author


I have fixed up this behaviour. The Spark cluster, as it is, only supports running one application at a time. I have changed the spark-submit container to run the query locally. This allows multiple sessions to run at the same time, but each session could have its own container running.

I have also disabled downloading the dependencies on startup, to speed up query execution.

Can submit an async query and the result is written to the result index.

Need to create the external table in Spark before submitting the query

Signed-off-by: Norman Jordan <[email protected]>
* Adding the S3 access key and the S3/Glue data source is now done in a dedicated short-lived container
* Added missing license headers

Signed-off-by: Norman Jordan <[email protected]>
* Hive container will now store data on a volume
* Spark containers now use a built image, no need for custom entrypoint script or to start with root user

Signed-off-by: Norman Jordan <[email protected]>
Signed-off-by: Norman Jordan <[email protected]>
@normanj-bitquill normanj-bitquill force-pushed the docker-integ-with-async-api branch from 4c35a6c to 9aae9bb on January 14, 2025 at 23:36
@normanj-bitquill
Contributor Author

@YANG-DB I have added some documentation in this PR. Can you take a look and help with getting a second reviewer?

Signed-off-by: Norman Jordan <[email protected]>
Collaborator

@dai-chen dai-chen left a comment

Is this IT for Flint Spark with SQL async query API? Just wondering if this should be in SQL repo?

@normanj-bitquill
Contributor Author

Is this IT for Flint Spark with SQL async query API? Just wondering if this should be in SQL repo?

@dai-chen This is for three use cases:

  • Submitting queries directly to Spark in order to test the PPL extension for Spark.
  • Submitting queries directly to Spark that use the OpenSearch datasource. Useful for testing the Flint extension for Spark.
  • Using the Async API to submit queries to the OpenSearch server. Useful for testing the EMR workflow and querying S3/Glue datasources.

Something very similar would make sense for the SQL repo. The SQL repo will not need the spark master and spark containers.

@YANG-DB
Member

YANG-DB commented Jan 15, 2025


Actually this makes sense as a wide test harness for such use cases where we want to test connectivity between different opensearch datasources ...

OPENSEARCH_DASHBOARDS_PORT=5601
S3_ACCESS_KEY=Vt7jnvi5BICr1rkfsheT
Collaborator

Is it for minio? Could it be any string?

Contributor Author

Yes, this is for the minio container and is only valid for the minio container.


This container also has a docker volume used to persist the S3 data.

### Configuration-Updated
Collaborator

typo. Configuration-updater?

Contributor Author

Updated.

docker compose down
```

## Creating Tables in S3
Collaborator

Could you add an example of how to submit a query using the async API?

Contributor Author

Added an example of submitting a query and getting the results.

1. Submitting queries directly to Spark in order to test the PPL extension for Spark.
2. Submitting queries directly to Spark that use the OpenSearch datasource. Useful for testing the Flint extension
for Spark.
3. Using the Async API to submit queries to the OpenSearch server. Useful for testing the EMR workflow and querying
Collaborator

What is the difference between 2 and 3?

Contributor Author

For 2, a query is sent directly to Spark and Spark needs to query the index on the OpenSearch server. An example of where this might be useful is testing PPL on Spark, querying OpenSearch indices.

For 3, a query is sent directly to the OpenSearch server for the S3/Glue datasource. A new Spark container is started to process the request. The Spark container needs to query the table using Hive and minio. The Spark container is also making use of the spark-sql-application in this repository to process the query.

@penghuo penghuo merged commit 5884fea into opensearch-project:main Jan 16, 2025
4 checks passed
@YANG-DB
Member

YANG-DB commented Jan 16, 2025

thanks @penghuo !

Labels: testing (test related feature)