EMR Serverless Python API Example

This example shows how to call the EMR Serverless API using the boto3 module.

In it, we create a new virtualenv, install boto3~=1.23.9, then create a new EMR Serverless application and run a sample Spark job on it.

Prerequisites

The steps below were tested on macOS.

# Create and activate new virtualenv
python3 -m venv python-sdk-test
source python-sdk-test/bin/activate

# Install boto3 in the virtualenv
python3 -m pip install boto3~=1.23.9

To verify your installation, you can run the following command, which lists any EMR Serverless applications in your account.

python3 -c 'import boto3; import pprint; pprint.pprint(boto3.client("emr-serverless").list_applications())'
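
If the command fails with a NoRegionError, boto3 could not find a default AWS region in your environment. One fix is to export a region before running the command; us-east-1 below is only an example:

# Tell boto3 which region to use (example region)
export AWS_DEFAULT_REGION=us-east-1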

Example Python API Usage

The script shown below will:

  • Create a new EMR Serverless Application
  • Start a new Spark job with a sample wordcount application
  • Stop and delete your Application when done

It is intended as a high-level demo of how to use boto3 with the EMR Serverless API.

import boto3

client = boto3.client("emr-serverless")

# Create Application
response = client.create_application(
    name="my-application", releaseLabel="emr-6.6.0", type="SPARK"
)

print(
    "Created application {name} with application id {applicationId}. Arn: {arn}".format_map(
        response
    )
)

# Start Application
# Note that the application must be in `CREATED` or `STOPPED` state.
# Use client.get_application(applicationId='<application_id>') to fetch state.
client.start_application(applicationId="<application_id>")

# Submit Job
# Note that the application must be in `STARTED` state.
response = client.start_job_run(
    applicationId="<application_id>",
    executionRoleArn="<execution_role_arn>",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
            "entryPointArguments": ["s3://DOC-EXAMPLE-BUCKET/output"],
            "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1",
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": "s3://DOC-EXAMPLE-BUCKET/logs"}
        }
    },
)

# Get the status of the job
client.get_job_run(applicationId="<application_id>", jobRunId="<job_run_id>")

# Shut down and delete your application
client.stop_application(applicationId="<application_id>")
client.delete_application(applicationId="<application_id>")
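
Note that these API calls are asynchronous: start_application and start_job_run return immediately while the application or job transitions through its states. Below is a minimal polling sketch built on the same get_application and get_job_run calls; the helper names and the 10-second interval are our own, not part of the API.

import time

def wait_for_application(client, application_id, desired_state="STARTED"):
    # Poll get_application until the application reaches the desired state.
    while True:
        state = client.get_application(applicationId=application_id)["application"]["state"]
        if state == desired_state:
            return state
        time.sleep(10)

def wait_for_job_run(client, application_id, job_run_id):
    # Poll get_job_run until the job run reaches a terminal state.
    while True:
        state = client.get_job_run(applicationId=application_id, jobRunId=job_run_id)["jobRun"]["state"]
        if state in ("SUCCESS", "FAILED", "CANCELLED"):
            return state
        time.sleep(10)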

Once the job is running, you can also view Spark logs.

# View Spark logs
aws s3 ls s3://DOC-EXAMPLE-BUCKET/logs/applications/<application_id>/jobs/<job_run_id>/
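
The logs are gzipped, so the driver's stdout, for example, can be streamed and decompressed in one step. This assumes the standard log layout under the logUri configured above:

# Download and decompress the Spark driver's stdout
aws s3 cp s3://DOC-EXAMPLE-BUCKET/logs/applications/<application_id>/jobs/<job_run_id>/SPARK_DRIVER/stdout.gz - | gunzip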

Full EMR Serverless Python Example

For a more complete example, please see the emr_serverless.py file.

It can be used to run a full end-to-end PySpark sample job on EMR Serverless.

All you need to provide is a job role ARN and an S3 bucket that the job role can write to.

python emr_serverless.py \
    --job-role-arn arn:aws:iam::123456789012:role/emr-serverless-job-role \
    --s3-bucket my-s3-bucket