Docker integ test with async API #1003
Conversation
This is not complete yet. Still need to:
@normanj-bitquill I've also started working on a similar PR with an Iceberg-based docker-compose -
Current Status

In my testing, I have added a second OpenSearch container. This resolves issues with the cluster and indices going into a yellow state. Changes in the OpenSearch node:
What works:
What is missing:
@YANG-DB I have partially working async API in the latest commit. These are my testing steps:
The OpenSearch container will need to bind the docker socket. The OpenSearch container will start another container to process the async query. This is the place where we could slip in the EMR Spark container (if there is any value in it).
I have not tested retrieving results using the Async API. This is likely broken, since we cannot check on the EMR job status. I also haven't tested a streaming query (also likely broken).
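For reference, binding the docker socket in docker-compose usually looks like the fragment below. This is a sketch only; the service name and image tag are placeholders, not the actual files in this PR:

```yaml
services:
  opensearch:
    image: opensearchproject/opensearch:latest   # placeholder image/tag
    volumes:
      # Lets the docker CLI inside the container talk to the host docker
      # daemon, so the node can start a spark-submit container per query.
      - /var/run/docker.sock:/var/run/docker.sock
```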
```
su opensearch ./opensearch-docker-entrypoint.sh "$@"
```
```
kill -TERM `cat /var/run/docker-command-runner.pid`
```
Why is this called? Can you please explain the entire flow for the docker compose?
When an async query is submitted, a start job request is built which contains the arguments for running `spark-submit`. Normally the start job request is submitted to EMR, which handles starting a docker container and running `spark-submit`. In our case, I want to call `docker` to start a new container that runs `spark-submit`.

Something is preventing the OpenSearch Java process from running external commands. I tried updating the Java security policy but was still unable to run external commands. As a workaround, the docker arguments are written to a file in `/tmp/docker`. There is a separate shell script, `docker-command-runner.sh`, that reads the files in `/tmp/docker` and runs docker. Ideally the configuration could be fixed so that `docker` can be run directly from the Java code.

The motivation for calling `docker` is to better emulate what happens when EMR is called.
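A minimal sketch of that runner loop might look like the following. This is a hypothetical reconstruction, not the actual `docker-command-runner.sh` from this PR; the spool-file format and the `RUNNER` override are assumptions added for illustration:

```shell
# Hypothetical sketch of the docker-command-runner mechanism described above.
# The OpenSearch plugin writes one file per job under /tmp/docker; each file
# is assumed to hold the space-separated arguments for one `docker run`.
# RUNNER is overridable (e.g. RUNNER=echo) so the sketch can be dry-run
# without a docker daemon.
SPOOL_DIR="${SPOOL_DIR:-/tmp/docker}"
RUNNER="${RUNNER:-docker}"

process_spool() {
  for f in "$SPOOL_DIR"/*; do
    [ -f "$f" ] || continue
    # Word-splitting of the file contents into arguments is intentional here.
    $RUNNER run $(cat "$f")
    rm -f "$f"
  done
}

# In the real script a pass like this would repeat until `kill -TERM`
# (via the PID file) stops it:
#   echo $$ > /var/run/docker-command-runner.pid
#   trap 'exit 0' TERM
#   while :; do process_spool; sleep 1; done
```

The `kill -TERM` on the PID file quoted earlier would then terminate a loop like this cleanly.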
ok nice !!
@YANG-DB Using the async API to get the results works once the results are available in the result index. Some interesting behaviour though:
I am likely missing some code in the new EMRClient for job management. This may explain some of the behaviour above. If the Spark app is going to run for 3 minutes, we should skip calling docker when the same session is used and the container is still running.
I have fixed up this behaviour. The Spark cluster, as is, only supports running one application at a time. I have changed the spark-submit container to run the query locally. This allows multiple sessions to run at the same time, with each session having its own container running. I have also disabled downloading the dependencies on startup, to speed up query execution.
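The skip could be implemented with a check like the following. This is a sketch, not code from this PR; the container-name convention and image name are made up for illustration, and `DOCKER` is overridable (e.g. `DOCKER=echo`) for a dry run:

```shell
# Sketch: skip `docker run` when a container for this session is already
# running. Names and the image are hypothetical.
DOCKER="${DOCKER:-docker}"

start_session_container() {
  session_id="$1"
  name="spark-session-${session_id}"
  # `docker ps --filter name=...` lists running containers matching the name.
  if $DOCKER ps --filter "name=${name}" --format '{{.Names}}' | grep -qx "$name"; then
    echo "session ${session_id}: container already running, skipping docker run"
  else
    $DOCKER run -d --name "$name" spark-submit-image "$session_id"
  fi
}
```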
Can submit an async query and the result is written to the result index. Need to create the external table in Spark before submitting the query. Signed-off-by: Norman Jordan <[email protected]>
* Adding the S3 access key and the S3/Glue data source is now done in a dedicated short-lived container * Added missing license headers Signed-off-by: Norman Jordan <[email protected]>
* Hive container will now store data on a volume * Spark containers now use a built image, no need for custom entrypoint script or to start with root user Signed-off-by: Norman Jordan <[email protected]>
@YANG-DB I have added some documentation in this PR. Can you take a look and help with getting a second reviewer?
Is this an IT for Flint Spark with the SQL async query API? Just wondering if this should be in the SQL repo?
@dai-chen This is for three use cases:
Something very similar would make sense for the SQL repo. The SQL repo will not need the spark master and spark containers.
Actually this makes sense as a wide test harness for use cases where we want to test connectivity between different opensearch datasources ...
```
OPENSEARCH_DASHBOARDS_PORT=5601
S3_ACCESS_KEY=Vt7jnvi5BICr1rkfsheT
```
Is it for minio? Could it be any string?
Yes, this is for the minio container and is only valid for the minio container.
docs/docker/integ-test/README.md (outdated)

```
This container also has a docker volume used to persist the S3 data.
```

```
### Configuration-Updated
```
typo. Configuration-updater?
Updated.
```
docker compose down
```

## Creating Tables in S3
Could you add an example of how to submit a query using the async API?
Added an example of submitting a query and getting the results.
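Such an example might look like the following. The host, credentials, datasource name, and query are placeholders (not taken from the README in this PR); the endpoint shown is the SQL plugin's async query API:

```shell
# Submit an async query to the OpenSearch server (placeholder host/creds).
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -XPOST 'http://localhost:9200/_plugins/_async_query' \
  -d '{"datasource": "mys3", "lang": "sql", "query": "SELECT * FROM mys3.default.http_logs LIMIT 5"}'

# The response contains a queryId; poll it for the results once the
# spark-submit container has written them to the result index:
curl -s -u admin:admin \
  -XGET 'http://localhost:9200/_plugins/_async_query/<queryId>'
```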
1. Submitting queries directly to Spark in order to test the PPL extension for Spark.
2. Submitting queries directly to Spark that use the OpenSearch datasource. Useful for testing the Flint extension for Spark.
3. Using the Async API to submit queries to the OpenSearch server. Useful for testing the EMR workflow and querying
What is the difference between 2 and 3?
For 2, a query is sent directly to Spark and Spark needs to query the index on the OpenSearch server. An example of where this might be useful is testing PPL on Spark, querying OpenSearch indices.
For 3, a query is sent directly to the OpenSearch server for the S3/Glue datasource. A new Spark container is started to process the request. The Spark container needs to query the table using Hive and minio. The Spark container also makes use of the spark-sql-application in this repository to process the query.
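To make the contrast concrete, case 2 might be exercised with something like this (the compose service name `spark` and the query are placeholders, not taken from the compose file in this PR):

```shell
# Case 2: run a query directly on the Spark container; Spark itself then
# reaches out to the OpenSearch server (e.g. via the Flint/PPL extensions).
docker compose exec spark spark-sql -e "SELECT * FROM dev.default.http_logs LIMIT 5"
```

Case 3, by contrast, starts at the OpenSearch server: the async API call causes a fresh Spark container to be launched to run the query.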
thanks @penghuo !
Description
Updates the integration test docker stack to support the OpenSearch Async API and to use Minio as an S3 storage engine. Also includes having everything configured on startup.
Related Issues
#992
Check List
--signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.