This directory contains YAML configuration files for the creation of two compute environments:

`aws_fusion_nvme.yml`
: This compute environment is designed to run on Amazon Web Services (AWS) Batch and uses Fusion V2 with 6th generation Intel instance types with NVMe storage.

`aws_plain_s3.yml`
: This compute environment is designed to run on AWS Batch and uses plain AWS Batch with S3 storage.
These YAML files provide best practice configurations for utilizing these two storage types in AWS Batch compute environments. The Fusion V2 configuration is tailored for high-performance workloads leveraging NVMe storage, while the plain S3 configuration offers a standard setup for comparison and workflows that don't require the advanced features of Fusion V2.
- You have access to the Seqera Platform.
- You have set up AWS credentials in the Seqera Platform workspace.
- Your AWS credentials have the correct IAM permissions if using Batch Forge.
- You have an S3 bucket for the Nextflow work directory.
- You have reviewed and updated the environment variables in `env.sh` to match your specific AWS setup.
The YAML configurations utilize environment variables defined in the `env.sh` file. Here's a breakdown:
| Variable | Description | Usage in YAML |
|---|---|---|
| `$COMPUTE_ENV_PREFIX` | Prefix for compute environment name | `name` field |
| `$ORGANIZATION_NAME` | Seqera Platform organization | `workspace` field |
| `$WORKSPACE_NAME` | Seqera Platform workspace | `workspace` field |
| `$AWS_CREDENTIALS` | Name of AWS credentials | `credentials` field |
| `$AWS_REGION` | AWS region for compute | `region` field |
| `$AWS_WORK_DIR` | Path to Nextflow work directory | `work-dir` field |
| `$AWS_COMPUTE_ENV_ALLOWED_BUCKETS` | S3 buckets with read/write access | `allow-buckets` field |
Using these variables allows easy customization of the compute environment configuration without directly modifying the YAML file, promoting flexibility and reusability.
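As an illustration, an `env.sh` exporting these variables might look like the following. Every value here is a placeholder of our own invention; replace each with your own AWS and Seqera Platform details:

```shell
#!/usr/bin/env bash
# Placeholder values only -- substitute your own AWS/Seqera details.
export COMPUTE_ENV_PREFIX="benchmark"
export ORGANIZATION_NAME="my-organization"
export WORKSPACE_NAME="my-workspace"
export AWS_CREDENTIALS="my-aws-credentials"
export AWS_REGION="eu-west-1"
export AWS_WORK_DIR="s3://my-bucket/work"
export AWS_COMPUTE_ENV_ALLOWED_BUCKETS="s3://my-bucket,s3://my-data-bucket"
```

Source the file (for example with `source env.sh`) before running seqerakit so the variables are available for substitution into the YAML.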
If we inspect the contents of `aws_fusion_nvme.yml` as an example, we can see the overall structure is as follows:
```yaml
compute-envs:
  - type: aws-batch
    config-mode: forge
    name: "${COMPUTE_ENV_PREFIX}_fusion_nvme"
    workspace: "$ORGANIZATION_NAME/$WORKSPACE_NAME"
    credentials: "$AWS_CREDENTIALS"
    region: "$AWS_REGION"
    work-dir: "$AWS_WORK_DIR"
    wave: True
    fusion-v2: True
    fast-storage: True
    no-ebs-auto-scale: True
    provisioning-model: "SPOT"
    instance-types: "c6id,m6id,r6id"
    max-cpus: 1000
    allow-buckets: "$AWS_COMPUTE_ENV_ALLOWED_BUCKETS"
    labels: "storage=fusionv2,project=benchmarking"
    wait: "AVAILABLE"
    overwrite: False
```
The top-level block `compute-envs` mirrors the `tw compute-envs` command. The `type` and `config-mode` options are seqerakit-specific. The nested options in the YAML correspond to options available for the Seqera Platform CLI command. For example, running `tw compute-envs add aws-batch forge --help` shows options like `--name`, `--workspace`, `--credentials`, etc., which are provided to the `tw compute-envs` command via this YAML definition.
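To make that mapping concrete: each nested YAML option corresponds to the `tw` flag of the same name, so `max-cpus: 1000` becomes `--max-cpus 1000` on the generated command line. A toy shell sketch of that key-to-flag rule (our own illustration, not seqerakit code):

```shell
#!/usr/bin/env sh
# Toy illustration of the YAML-option -> CLI-flag naming rule.
to_flag() {
  key=$1
  value=$2
  # '--' stops printf from treating the leading dashes as an option
  printf -- '--%s %s\n' "$key" "$value"
}

to_flag "max-cpus" "1000"            # → --max-cpus 1000
to_flag "provisioning-model" "SPOT"  # → --provisioning-model SPOT
```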
We've pre-configured several options to optimize your Fusion V2 compute environment:
| Option | Value | Purpose |
|---|---|---|
| `wave` | `True` | Enables Wave, required for Fusion in containerized workloads |
| `fusion-v2` | `True` | Enables Fusion V2 |
| `fast-storage` | `True` | Enables fast instance storage with Fusion V2 for optimal performance |
| `no-ebs-auto-scale` | `True` | Disables EBS auto-expandable disks (incompatible with Fusion V2) |
| `provisioning-model` | `"SPOT"` | Selects the cost-effective spot pricing model |
| `instance-types` | `"c6id,m6id,r6id"` | Selects 6th generation Intel instance types with high-speed local storage |
| `max-cpus` | `1000` | Sets the maximum number of CPUs for this compute environment |
These options ensure your Fusion V2 compute environment is optimized for performance and cost-effectiveness.
Similarly, if we inspect the contents of `aws_plain_s3.yml` as an example, we can see the overall structure is as follows:
```yaml
compute-envs:
  - type: aws-batch
    config-mode: forge
    name: "${COMPUTE_ENV_PREFIX}_plain_s3"
    workspace: "$ORGANIZATION_NAME/$WORKSPACE_NAME"
    credentials: "$AWS_CREDENTIALS"
    region: "$AWS_REGION"
    work-dir: "$AWS_WORK_DIR"
    wave: False
    fusion-v2: False
    fast-storage: False
    no-ebs-auto-scale: False
    provisioning-model: "SPOT"
    instance-types: "c6i,m6i,r6i"
    max-cpus: 1000
    allow-buckets: "$AWS_COMPUTE_ENV_ALLOWED_BUCKETS"
    labels: "storage=plains3,project=benchmarking"
    wait: "AVAILABLE"
    overwrite: False
    ebs-blocksize: 150
```
We've pre-configured several options to optimize your plain S3 compute environment:
| Option | Value | Purpose |
|---|---|---|
| `wave` | `False` | Disables Wave, as it's not required for plain S3 storage |
| `fusion-v2` | `False` | Disables Fusion V2, as we're using standard S3 storage |
| `fast-storage` | `False` | Disables fast instance storage, as we're relying on an EBS volume |
| `no-ebs-auto-scale` | `False` | Allows EBS auto-scaling, which can be beneficial when not using Fusion V2 |
| `provisioning-model` | `"SPOT"` | Selects the cost-effective spot pricing model |
| `instance-types` | `"c6i,m6i,r6i"` | Selects 6th generation Intel instance types without local storage |
| `max-cpus` | `1000` | Sets the maximum number of CPUs for this compute environment |
| `ebs-blocksize` | `150` | Sets the initial EBS block size to 150 GB, providing additional storage for compute instances |
These options ensure your plain S3 compute environment is optimized for performance and cost-effectiveness, providing a baseline for comparison with Fusion V2 performance.
To fill in the details for each of the compute environments:

1. Navigate to the `/compute-envs` directory.
2. Open the desired YAML file (`aws_fusion_nvme.yml` or `aws_plain_s3.yml`) in a text editor.
3. Review the details for each file. If you need to add:
   - Labels: See the Labels section.
   - Networking: See the Networking section.
4. Save the changes to each file.
5. Use these YAML files to create the compute environments in the Seqera Platform through seqerakit with the following commands.

   To create the Fusion V2 compute environment:

   ```
   seqerakit aws_fusion_nvme.yml
   ```

   To create the plain S3 compute environment:

   ```
   seqerakit aws_plain_s3.yml
   ```

6. Confirm your compute environments have been successfully created in the workspace and show a status of 'AVAILABLE', indicating they are ready for use.
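Before running the seqerakit commands above, it can help to fail fast if any variable expected by the YAML files is unset. A minimal POSIX sh check, written as our own sketch around the variable names listed earlier:

```shell
#!/usr/bin/env sh
# check_env VAR...: print each unset or empty variable, return non-zero
# if any are missing. Intended to run after `source env.sh`.
check_env() {
  missing=0
  for var in "$@"; do
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing: $var"
      missing=1
    fi
  done
  return "$missing"
}

# Example: this export would normally come from sourcing env.sh.
export AWS_REGION="eu-west-1"
check_env AWS_REGION && echo "AWS_REGION is set"
```

In practice you would call `check_env` with the full list (`COMPUTE_ENV_PREFIX`, `ORGANIZATION_NAME`, `WORKSPACE_NAME`, `AWS_CREDENTIALS`, `AWS_REGION`, `AWS_WORK_DIR`, `AWS_COMPUTE_ENV_ALLOWED_BUCKETS`) before invoking seqerakit.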
## Labels

Labels are name=value pairs that can be used to organize and categorize your AWS resources. In the context of our compute environments, labels can be useful for cost tracking and resource management.
We will additionally use process-level labels for further granularity; this is described in the 03_setup_pipelines section.
To add labels to your compute environment:

- In the YAML file, locate the `labels` field.
- Add your desired labels as a comma-separated list of key-value pairs. We have pre-populated this with the `storage=fusionv2` (or `storage=plains3`) and `project=benchmarking` labels for better organization.
## Networking

If your compute environments require a custom networking setup using a custom VPC, subnets, and security groups, these can be added as additional YAML fields.
To add networking details to your compute environment:
- In the YAML files for both Fusion V2 and plain S3, add the following fields, replacing the values with your networking details:

  ```yaml
  subnets: "subnet-aaaabbbbccccdddd1,subnet-aaaabbbbccccdddd2,subnet-aaaabbbbccccdddd3"
  vpc-id: "vpc-aaaabbbbccccdddd"
  security-groups: "sg-aaaabbbbccccdddd"
  ```

  Note: The values for your subnets, VPC ID, and security groups must be a comma-separated string as shown above.
- Save your file and create your compute environments.
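Since these fields must be comma-separated strings without whitespace, a quick sanity check can catch formatting mistakes before seqerakit runs. This is our own sketch; the `subnets` value is a placeholder:

```shell
#!/usr/bin/env sh
# is_comma_list STRING: succeed only if STRING contains no spaces,
# i.e. it is a plain comma-separated list like "a,b,c".
is_comma_list() {
  case "$1" in
    *" "*) return 1 ;;
    *) return 0 ;;
  esac
}

subnets="subnet-aaaabbbbccccdddd1,subnet-aaaabbbbccccdddd2"
is_comma_list "$subnets" && echo "subnets format ok"
```

The same check can be applied to the `security-groups` and `allow-buckets` values.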