
Docker integ test with async API #1003

Merged
merged 11 commits into main from docker-integ-with-async-api on Jan 16, 2025

Conversation

normanj-bitquill
Contributor

Description

Updates the integration test docker stack to support the OpenSearch Async API and to use Minio as an S3 storage engine. Also includes having everything configured on startup.

Related Issues

#992

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@normanj-bitquill
Contributor Author

This is not complete yet. Still need to:

  1. Create a replacement for EMRServerlessClient that will use the local docker environment. This will include starting a docker container of Spark to run the query. Given that the idea is to start the container with spark-submit, it should be possible to easily change to the Spark EMR image instead.
  2. Alter the Python integration tests to use the OpenSearch Async API instead
  3. Remove the spark and spark-worker containers since Spark would be run on demand

@YANG-DB
Member

YANG-DB commented Dec 31, 2024

@normanj-bitquill I've also started working on a similar PR with an Iceberg-based docker-compose -
let's see if we can share knowledge ...

@normanj-bitquill
Contributor Author

Current Status

In my testing, I have added a second OpenSearch container. This resolves issues with the cluster and indices going into yellow state.

Changes in OpenSearch node:

  • Bind the docker socket
  • Replace the aws-java-sdk-emrserverless-*.jar file with one that uses a different EMRClient. The new EMRClient writes a file to a directory (/tmp/docker) with the arguments for running docker.
  • A script is run on the container that reads files from /tmp/docker and runs docker with the arguments.
  • Disable system indices and system indices permissions
  • Connect with second OpenSearch container

What works:

  • Initiate a query with the async API on the OpenSearch container
  • Starts a new Spark docker container to call spark-submit using information from the start job request that the EMRClient received
  • New container connects to spark cluster to execute the job
  • Creates the request and result indices
  • Saves the result to the result index

What is missing:

  • Detecting that the job has finished (this is on the spark-submit container)
  • The query is for a table on an S3 data source; the query fails with table not found

@normanj-bitquill
Contributor Author

@YANG-DB I have a partially working async API in the latest commit. These are my testing steps:

  1. Start the cluster using docker compose up
  2. Run spark-shell on the master spark container.
    1. Create an external table with a location under s3a://integ-test/
    2. Put some data in the table
  3. Submit a query:
    curl -u '...' -X POST -H 'Content-Type: application/json' -d '{"datasource": "mys3", "lang": "sql", "query": "SELECT * FROM mys3.default.foo"}' 'http://localhost:9200/_plugins/_async_query'
    
  4. Retrieve the result (will need to wait until it is ready, maybe about 1 minute)
    curl -u '...' -X POST -H 'Content-Type: application/json' -d '{}' 'http://localhost:9200/query_execution_result_mys3/_search?pretty'
    

The OpenSearch container will need to bind the docker socket /var/run/docker.sock. By default this isn't available on Mac, but can be enabled.
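For reference, binding the docker socket in the compose file might look like the fragment below (the service name and image tag are placeholders, not necessarily what this PR's compose file uses):

```yaml
services:
  opensearch:
    image: opensearchproject/opensearch:2.18.0   # placeholder tag
    volumes:
      # Lets the container start sibling Spark containers via the host daemon.
      - /var/run/docker.sock:/var/run/docker.sock
```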

The OpenSearch container will start another container to process the async query. This is the place where we could slip in the EMR spark container (if there is any value from it).
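Step 2 of the testing steps above (creating the external table and loading data) might look roughly like the following; the table name, schema, container name, and S3 path are illustrative assumptions, not the exact commands from my testing:

```shell
# Hypothetical DDL for creating and populating the external table (step 2).
# Table name, columns, and location are assumptions for illustration.
SQL_DDL='
CREATE EXTERNAL TABLE default.foo (id INT, name STRING)
LOCATION "s3a://integ-test/foo";
INSERT INTO default.foo VALUES (1, "alpha"), (2, "beta");
'
# Run it on the master Spark container, e.g.:
# docker exec spark spark-sql -e "$SQL_DDL"
echo "$SQL_DDL"
```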

@normanj-bitquill
Contributor Author

I have not tested retrieving results using the Async API. This is likely broken, since we cannot check the EMR job status. I also haven't tested a streaming query (also likely broken).

docker/integ-test/metastore/Dockerfile (outdated, resolved)
docker/integ-test/opensearch/Dockerfile (outdated, resolved)
docker/integ-test/opensearch/docker-command-runner.sh (outdated, resolved)

su opensearch ./opensearch-docker-entrypoint.sh "$@"

kill -TERM `cat /var/run/docker-command-runner.pid`
Member

Why is this called?
Can you please explain the entire flow for the docker compose?

Contributor Author

When an async query is submitted, a start job request is built containing the arguments for running spark-submit. Normally the start job request is submitted to EMR, which handles starting a docker container and running spark-submit. In our case, I want to call docker to start a new container to run spark-submit.

Something is preventing the OpenSearch Java process from running external commands. I tried updating the Java security policy but was still unable to run external commands. As a workaround, the docker arguments are written to a file in /tmp/docker. A separate shell script, docker-command-runner.sh, reads the files in /tmp/docker and runs docker with those arguments.
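A minimal sketch of that workaround (the file handling and loop structure are assumptions; the actual docker-command-runner.sh may differ):

```shell
#!/bin/sh
# Hypothetical sketch of docker-command-runner.sh: the stub EMRClient writes
# docker arguments to files under /tmp/docker, and this loop picks them up
# and invokes docker. CMD_DIR and DOCKER_BIN names are assumptions.
CMD_DIR="${CMD_DIR:-/tmp/docker}"
DOCKER_BIN="${DOCKER_BIN:-docker}"

process_pending() {
  for f in "$CMD_DIR"/*; do
    [ -f "$f" ] || continue
    args=$(cat "$f")        # one file = one docker invocation's arguments
    "$DOCKER_BIN" $args     # intentional word splitting of the arguments
    rm -f "$f"
  done
}

# The real script would loop: while true; do process_pending; sleep 1; done
```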

Ideally the configuration could be fixed so that docker can be run directly from the Java code.

The motivation for calling docker is to better emulate what is happening when EMR is called.

Member

ok nice !!

@normanj-bitquill
Contributor Author

@YANG-DB Using the async API to get the results works once the results are available in the result index.

Some interesting behaviour though:

  • When you submit a query, the container starts up and runs the query. It then has a loop (looking for new queries) and continues until it finally times out after 3 minutes.
  • If you submit a query and then submit a second query (in a different session), the second query isn't actually run until the container for the first query has exited (after the 3 minute timeout)
  • When the container starts up, it downloads the Flint and PPL extensions. This adds latency and I am trying to have it use the locally built Flint and PPL jar files.

I am likely missing some code for the new EMRClient for job management. This may explain some of the behaviour above. If the Spark app is going to run for 3 minutes, we should skip calling docker when the same session is used and the container is still running.

@normanj-bitquill normanj-bitquill changed the title WIP: Docker integ test with async API Docker integ test with async API Jan 13, 2025
@normanj-bitquill
Contributor Author


I have fixed up this behaviour. The Spark cluster, as it is, only supports running one application at a time. I have changed the spark-submit container to run the query locally. This allows multiple sessions to run at the same time, but each session could have its own container running.

I have also disabled downloading the dependencies on startup, to speed up query execution.

Can submit an async query and the result is written to the result index.

Need to create the external table in Spark before submitting the query

Signed-off-by: Norman Jordan <[email protected]>
* Adding the S3 access key and the S3/Glue data source is now done in a dedicated short-lived container
* Added missing license headers

Signed-off-by: Norman Jordan <[email protected]>
* Hive container will now store data on a volume
* Spark containers now use a built image, no need for custom entrypoint script or to start with root user

Signed-off-by: Norman Jordan <[email protected]>
Signed-off-by: Norman Jordan <[email protected]>
@normanj-bitquill normanj-bitquill force-pushed the docker-integ-with-async-api branch from 4c35a6c to 9aae9bb on January 14, 2025 at 23:36
@normanj-bitquill
Contributor Author

@YANG-DB I have added some documentation in this PR. Can you take a look and help with getting a second reviewer?

Signed-off-by: Norman Jordan <[email protected]>
Collaborator

@dai-chen dai-chen left a comment

Is this IT for Flint Spark with SQL async query API? Just wondering if this should be in SQL repo?

@normanj-bitquill
Contributor Author

Is this IT for Flint Spark with SQL async query API? Just wondering if this should be in SQL repo?

@dai-chen This is for three use cases:

  • Submitting queries directly to Spark in order to test the PPL extension for Spark.
  • Submitting queries directly to Spark that use the OpenSearch datasource. Useful for testing the Flint extension for Spark.
  • Using the Async API to submit queries to the OpenSearch server. Useful for testing the EMR workflow and querying S3/Glue datasources.

Something very similar would make sense for the SQL repo. The SQL repo will not need the spark master and spark containers.

@YANG-DB
Member

YANG-DB commented Jan 15, 2025


Actually this makes sense as a wide test harness for such use cases where we want to test connectivity between different opensearch datasources ...

OPENSEARCH_DASHBOARDS_PORT=5601
S3_ACCESS_KEY=Vt7jnvi5BICr1rkfsheT
Collaborator

Is it for minio? Could it be any string?

Contributor Author

Yes, this is for the minio container and is only valid for the minio container.


This container also has a docker volume used to persist the S3 data.

### Configuration-Updated
Collaborator

typo. Configuration-updater?

Contributor Author

Updated.

docker compose down
```

## Creating Tables in S3
Collaborator

Could you add an example of how to submit a query using the async API?

Contributor Author

Added an example of submitting a query and getting the results.

1. Submitting queries directly to Spark in order to test the PPL extension for Spark.
2. Submitting queries directly to Spark that use the OpenSearch datasource. Useful for testing the Flint extension
for Spark.
3. Using the Async API to submit queries to the OpenSearch server. Useful for testing the EMR workflow and querying
Collaborator

What is the difference between 2 and 3?

Contributor Author

For 2, a query is sent directly to Spark and Spark needs to query the index on the OpenSearch server. An example of where this might be useful is testing PPL on Spark, querying OpenSearch indices.

For 3, a query is sent directly to the OpenSearch server for the S3/Glue datasource. A new Spark container is started to process the request. The Spark container needs to query the table using Hive and minio. The Spark container is also making use of the spark-sql-application in this repository to process the query.

@penghuo penghuo merged commit 5884fea into opensearch-project:main Jan 16, 2025
4 checks passed
@YANG-DB
Member

YANG-DB commented Jan 16, 2025

thanks @penghuo !

Labels: testing (test related feature)