AWS HPC Batch Worker Manager

The AWS HPC worker manager offloads task execution to AWS Batch, running each Scaler task as a containerized job on managed EC2 compute. Use this worker manager when you need to burst workloads to the cloud, access specific hardware (GPUs, high memory), or run long-running jobs at scale.

The worker manager is designed as an extensible HPC framework — AWS Batch is the currently supported backend.

Prerequisites

  • An AWS account

  • AWS CLI installed and configured (aws configure)

  • Docker installed (for building the worker container image)

  • Python packages: pip install opengris-scaler boto3

Quick Start

Provision the required AWS resources (S3 bucket, IAM roles, compute environment, job queue, and job definition):

python -m scaler.worker_manager_adapter.aws_hpc.utility.provisioner provision \
    --region us-east-1 \
    --prefix scaler-batch \
    --vcpus 1 \
    --memory 2048 \
    --max-vcpus 256
source tests/worker_manager_adapter/aws_hpc/.scaler_aws_hpc.env

The provisioner creates resources named scaler-batch-*. The TOML below uses those names directly — just update s3_bucket with the value printed by the provisioner (it includes your AWS account ID):

config.toml
[[worker_manager]]
type = "aws_hpc"
scheduler_address = "tcp://127.0.0.1:8516"
object_storage_address = "tcp://127.0.0.1:8517"
worker_manager_id = "wm-batch"
job_queue = "scaler-batch-queue"
job_definition = "scaler-batch-job"
s3_bucket = "scaler-batch-123456789012-us-east-1"  # replace 123456789012 with your account ID
aws_region = "us-east-1"
max_concurrent_jobs = 100
job_timeout_minutes = 60
# Terminal 1 — Scheduler
scaler_scheduler tcp://0.0.0.0:8516

# Terminal 2 — AWS HPC Worker Manager (Batch backend)
$ scaler config.toml
my_client.py (Terminal 3)
from scaler import Client

def heavy_computation(x):
    return x ** 2

with Client(address="tcp://127.0.0.1:8516") as client:
    futures = client.map(heavy_computation, range(50))
    print([f.result() for f in futures])

If you don’t have AWS Batch resources yet, follow the detailed setup below.

Detailed Setup

Step 1: Configure AWS Credentials

aws configure
# Enter your AWS Access Key ID, Secret Access Key, region (e.g. us-east-1), and output format (json)

Your IAM user needs the following permissions:

  • S3: s3:CreateBucket, s3:PutObject, s3:GetObject, s3:DeleteObject, s3:PutLifecycleConfiguration

  • IAM: iam:CreateRole, iam:AttachRolePolicy, iam:PutRolePolicy, iam:CreateInstanceProfile, iam:AddRoleToInstanceProfile, iam:GetRole, iam:PassRole

  • Batch: batch:CreateComputeEnvironment, batch:CreateJobQueue, batch:RegisterJobDefinition, batch:SubmitJob, batch:DescribeJobs, batch:DescribeComputeEnvironments, batch:DescribeJobQueues, batch:TerminateJob, batch:DeregisterJobDefinition

  • ECR: ecr:CreateRepository, ecr:GetAuthorizationToken, ecr:PutLifecyclePolicy, ecr:BatchDeleteImage

  • EC2: ec2:DescribeSubnets, ec2:DescribeSecurityGroups

  • CloudWatch Logs: logs:CreateLogGroup, logs:PutRetentionPolicy, logs:GetLogEvents

Or attach the following AWS managed policies for quick setup:

AmazonS3FullAccess
AWSBatchFullAccess
AmazonEC2ContainerRegistryFullAccess
IAMFullAccess
CloudWatchLogsFullAccess

Step 2: Provision AWS Resources

Warning

The provisioner creates resources for quick testing and development only. For production deployments, use your organization’s infrastructure-as-code tools (CloudFormation, CDK, Terraform) with proper security configurations.

Scaler includes a provisioner script that creates all required AWS infrastructure (S3 bucket, IAM roles, EC2 compute environment, job queue, job definition, and ECR repository):

python -m scaler.worker_manager_adapter.aws_hpc.utility.provisioner provision \
    --region us-east-1 \
    --prefix scaler-batch \
    --vcpus 1 \
    --memory 2048 \
    --max-vcpus 256

This will:

  1. Build and push a Docker worker image to ECR

  2. Create an S3 bucket for task payloads and results (with 1-day lifecycle policy)

  3. Create IAM roles with the minimum required permissions

  4. Create an EC2 compute environment and job queue

  5. Register a Batch job definition

The provisioner saves its configuration to:

  • tests/worker_manager_adapter/aws_hpc/.scaler_aws_hpc.env — shell environment file

  • tests/worker_manager_adapter/aws_hpc/.scaler_aws_batch_config.json — full resource details (used for cleanup)

Source the env file to set variables for subsequent commands:

source tests/worker_manager_adapter/aws_hpc/.scaler_aws_hpc.env

Memory configuration: Memory is rounded to the nearest multiple of 2048 MB and 90% is allocated to the container. For example, --memory 4000 → 4096 MB total → 3686 MB effective.

Using Existing Infrastructure

If you already have AWS Batch resources (created via CloudFormation, CDK, Terraform, etc.), skip the provisioner and create the env file manually:

cat > .scaler_aws_hpc.env << 'EOF'
export SCALER_AWS_REGION="us-east-1"
export SCALER_S3_BUCKET="your-existing-bucket"
export SCALER_JOB_QUEUE="your-existing-queue"
export SCALER_JOB_DEFINITION="your-existing-job-def"
EOF
source .scaler_aws_hpc.env

Then continue from Step 3.

Step 3: Start the Scheduler

scaler_scheduler tcp://0.0.0.0:8516

Note

The scheduler address must be reachable from the machine running the AWS HPC worker manager. Use 0.0.0.0 to bind to all interfaces, or your machine’s public/private IP.

Step 4: Start the AWS HPC Worker Manager

scaler_worker_manager aws_hpc tcp://<SCHEDULER_IP>:8516 \
    --job-queue "$SCALER_JOB_QUEUE" \
    --job-definition "$SCALER_JOB_DEFINITION" \
    --s3-bucket "$SCALER_S3_BUCKET" \
    --aws-region "$SCALER_AWS_REGION"

Or use a TOML configuration file:

$ scaler config.toml
config.toml
[[worker_manager]]
type = "aws_hpc"
scheduler_address = "tcp://<SCHEDULER_IP>:8516"
object_storage_address = "tcp://<SCHEDULER_IP>:8517"
worker_manager_id = "wm-batch"
job_queue = "scaler-batch-queue"
job_definition = "scaler-batch-job"
s3_bucket = "scaler-batch-123456789012-us-east-1"
aws_region = "us-east-1"
max_concurrent_jobs = 100
job_timeout_minutes = 60

Step 5: Submit Tasks

from scaler import Client

def heavy_computation(x):
    return x ** 2

with Client(address="tcp://<SCHEDULER_IP>:8516") as client:
    futures = client.map(heavy_computation, range(50))
    results = [f.result() for f in futures]
    print(results)

Cleanup

To tear down all provisioned AWS resources:

python -m scaler.worker_manager_adapter.aws_hpc.utility.provisioner cleanup \
    --region us-east-1 \
    --prefix scaler-batch

How It Works

  1. The worker manager connects to the Scaler scheduler as a worker and receives tasks.

  2. Each task is serialized with cloudpickle and either passed inline (≤ 28 KB) or uploaded to S3.

  3. The worker manager submits an AWS Batch job for each task.

  4. Inside the Batch container, a runner script (batch_job_runner.py) deserializes the task, executes the function, and writes the result to S3.

  5. The worker manager polls for job completion, fetches the result from S3, and returns it to the scheduler.

A semaphore limits concurrent Batch jobs (--max-concurrent-jobs) to prevent exceeding AWS service quotas.

Configuration Reference

AWS HPC Parameters

  • scheduler_address (positional, required): Address of the Scaler scheduler.

  • --job-queue (-q, required): AWS Batch job queue name.

  • --job-definition (-d, required): AWS Batch job definition name.

  • --s3-bucket (required): S3 bucket for task payloads and results.

  • --aws-region: AWS region (default: us-east-1).

  • --s3-prefix: S3 key prefix (default: scaler-tasks).

  • --max-concurrent-jobs (-mcj): Max concurrent Batch jobs (default: 100).

  • --job-timeout-minutes: Max job runtime in minutes (default: 60).

  • --backend (-b): HPC backend (default: batch).

  • --name (-n): Custom name for the worker manager instance.

Common Parameters

For networking, worker behavior, logging, and event loop options, see Common Worker Manager Parameters.

Provisioner Reference

The provisioner supports these commands:

provision      Create all AWS resources and push Docker image
cleanup        Tear down all AWS resources
show           Display saved configuration
build-image    Build and push Docker image only
Provisioner Flags

Flag

Default

Description

--region

us-east-1

AWS region

--prefix

scaler-batch

Resource name prefix

--image

(auto-build)

Container image URI (if provided, skips Docker build and push to ECR)

--vcpus

1

vCPUs per job

--memory

2048

Memory per job in MB (uses 90% of nearest 2048 MB multiple)

--max-vcpus

256

Max vCPUs for compute environment

--instance-types

default_x86_64

EC2 instance types (comma-separated)

--job-timeout

60

Job timeout in minutes

Architecture

┌─────────┐     ┌───────────┐     ┌─────────────────┐     ┌───────────────────┐     ┌───────────┐
│  Client  │────>│ Scheduler │────>│  AWSBatchWorker │────>│ AWSHPCTaskManager │────>│ AWS Batch │
└─────────┘     └───────────┘     └─────────────────┘     └───────────────────┘     └───────────┘
                                           │                        │                      │
                                           v                        v                      v
                                  ┌─────────────────┐         ┌───────────┐         ┌───────────┐
                                  │HeartbeatManager │         │ S3 Bucket │<────────│ Batch Job │
                                  └─────────────────┘         └───────────┘         └───────────┘
Components

Component

Description

Client

Submits tasks to the scheduler using the Scaler API

Scheduler

Distributes tasks to available workers via ZMQ streaming

AWSBatchWorker

Process that connects to the scheduler and routes messages to the TaskManager

AWSHPCTaskManager

Handles task queuing, priority, concurrency control, and AWS Batch job submission

HeartbeatManager

Sends periodic heartbeats to the scheduler with worker status

S3 Bucket

Stores task payloads (for large tasks) and job results

AWS Batch

Executes tasks as containerized jobs on an EC2 compute environment

Payload Handling

Task payloads are serialized with cloudpickle and delivered to AWS Batch jobs. Payloads larger than 4 KB are gzip-compressed before transfer. The resulting payload (compressed or not) is passed inline via job parameters if it fits within 28 KiB; otherwise it is uploaded to S3.

Condition

Method

Raw payload ≤ 4 KB

Inline (uncompressed)

Raw payload > 4 KB, compressed ≤ 28 KB

Inline (gzip compressed)

Raw payload > 4 KB, compressed > 28 KB

S3 upload

Troubleshooting

Jobs stuck in RUNNABLE: Check that your compute environment has sufficient capacity (--max-vcpus) and that subnets have internet access for pulling container images.

Permission errors: Ensure the IAM role attached to the job definition has S3 read/write access to the task bucket. The provisioner creates this automatically.

Credential expiration: The worker manager auto-refreshes expired AWS credentials. If using temporary credentials, ensure your session token is valid.

Container image issues: Your job definition image must have the same Python version as the client (required for cloudpickle compatibility), plus cloudpickle and boto3 installed.

Timeout waiting for result: Check the AWS Batch console for job status. Increase --max-concurrent-jobs if jobs are queued, or check compute environment capacity.