AWS HPC Batch Worker Manager¶
The AWS HPC worker manager offloads task execution to AWS Batch, running each Scaler task as a containerized job on managed EC2 compute. Use this worker manager when you need to burst workloads to the cloud, access specific hardware (GPUs, high memory), or run long-running jobs at scale.
The worker manager is designed as an extensible HPC framework — AWS Batch is the currently supported backend.
Prerequisites¶
Quick Start¶
Provision the required AWS resources (S3 bucket, IAM roles, compute environment, job queue, and job definition):
python -m scaler.worker_manager_adapter.aws_hpc.utility.provisioner provision \
--region us-east-1 \
--prefix scaler-batch \
--vcpus 1 \
--memory 2048 \
--max-vcpus 256
source tests/worker_manager_adapter/aws_hpc/.scaler_aws_hpc.env
The provisioner creates resources named scaler-batch-*. The TOML below uses those names
directly — just update s3_bucket with the value printed by the provisioner (it includes
your AWS account ID):
[[worker_manager]]
type = "aws_hpc"
scheduler_address = "tcp://127.0.0.1:8516"
object_storage_address = "tcp://127.0.0.1:8517"
worker_manager_id = "wm-batch"
job_queue = "scaler-batch-queue"
job_definition = "scaler-batch-job"
s3_bucket = "scaler-batch-123456789012-us-east-1" # replace 123456789012 with your account ID
aws_region = "us-east-1"
max_concurrent_jobs = 100
job_timeout_minutes = 60
# Terminal 1 — Scheduler
scaler_scheduler tcp://0.0.0.0:8516
# Terminal 2 — AWS HPC Worker Manager (Batch backend)
scaler config.toml
from scaler import Client

def heavy_computation(x):
    return x ** 2

with Client(address="tcp://127.0.0.1:8516") as client:
    futures = client.map(heavy_computation, range(50))
    print([f.result() for f in futures])
If you don’t have AWS Batch resources yet, follow the detailed setup below.
Detailed Setup¶
Step 1: Configure AWS Credentials¶
aws configure
# Enter your AWS Access Key ID, Secret Access Key, region (e.g. us-east-1), and output format (json)
Your IAM user needs the following permissions:
S3:
s3:CreateBucket, s3:PutObject, s3:GetObject, s3:DeleteObject, s3:PutLifecycleConfiguration
IAM:
iam:CreateRole, iam:AttachRolePolicy, iam:PutRolePolicy, iam:CreateInstanceProfile, iam:AddRoleToInstanceProfile, iam:GetRole, iam:PassRole
Batch:
batch:CreateComputeEnvironment, batch:CreateJobQueue, batch:RegisterJobDefinition, batch:SubmitJob, batch:DescribeJobs, batch:DescribeComputeEnvironments, batch:DescribeJobQueues, batch:TerminateJob, batch:DeregisterJobDefinition
ECR:
ecr:CreateRepository, ecr:GetAuthorizationToken, ecr:PutLifecyclePolicy, ecr:BatchDeleteImage
EC2:
ec2:DescribeSubnets, ec2:DescribeSecurityGroups
CloudWatch Logs:
logs:CreateLogGroup, logs:PutRetentionPolicy, logs:GetLogEvents
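As an illustration, the S3 permissions above correspond to an IAM policy statement like the following (a minimal sketch; the `Sid` and `Resource` ARNs are examples, and you should scope the resources to your actual bucket rather than a wildcard):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ScalerBatchS3Access",
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutLifecycleConfiguration"
      ],
      "Resource": [
        "arn:aws:s3:::scaler-batch-*",
        "arn:aws:s3:::scaler-batch-*/*"
      ]
    }
  ]
}
```

The other services follow the same pattern, with their actions substituted into additional statements.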
Or attach the following AWS managed policies for quick setup:
AmazonS3FullAccess
AWSBatchFullAccess
AmazonEC2ContainerRegistryFullAccess
IAMFullAccess
CloudWatchLogsFullAccess
Step 2: Provision AWS Resources¶
Warning
The provisioner creates resources for quick testing and development only. For production deployments, use your organization’s infrastructure-as-code tools (CloudFormation, CDK, Terraform) with proper security configurations.
Scaler includes a provisioner script that creates all required AWS infrastructure (S3 bucket, IAM roles, EC2 compute environment, job queue, job definition, and ECR repository):
python -m scaler.worker_manager_adapter.aws_hpc.utility.provisioner provision \
--region us-east-1 \
--prefix scaler-batch \
--vcpus 1 \
--memory 2048 \
--max-vcpus 256
This will:
Build and push a Docker worker image to ECR
Create an S3 bucket for task payloads and results (with 1-day lifecycle policy)
Create IAM roles with the minimum required permissions
Create an EC2 compute environment and job queue
Register a Batch job definition
The provisioner saves its configuration to:
tests/worker_manager_adapter/aws_hpc/.scaler_aws_hpc.env — shell environment file
tests/worker_manager_adapter/aws_hpc/.scaler_aws_batch_config.json — full resource details (used for cleanup)
Source the env file to set variables for subsequent commands:
source tests/worker_manager_adapter/aws_hpc/.scaler_aws_hpc.env
Memory configuration: Memory is rounded to the nearest multiple of 2048 MB and 90% is allocated to the container. For example, --memory 4000 → 4096 MB total → 3686 MB effective.
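The memory arithmetic can be sketched as follows (a hypothetical helper for illustration, not part of Scaler's API):

```python
def effective_memory_mb(requested_mb: int) -> tuple[int, int]:
    """Round the request to the nearest 2048 MB multiple, then give 90% to the container.

    Returns (total_mb, effective_mb). Illustrative only; the provisioner's
    exact rounding logic may differ.
    """
    total = max(2048, round(requested_mb / 2048) * 2048)
    return total, int(total * 0.9)

print(effective_memory_mb(4000))  # matches the example above: (4096, 3686)
```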
Using Existing Infrastructure¶
If you already have AWS Batch resources (created via CloudFormation, CDK, Terraform, etc.), skip the provisioner and create the env file manually:
cat > .scaler_aws_hpc.env << 'EOF'
export SCALER_AWS_REGION="us-east-1"
export SCALER_S3_BUCKET="your-existing-bucket"
export SCALER_JOB_QUEUE="your-existing-queue"
export SCALER_JOB_DEFINITION="your-existing-job-def"
EOF
source .scaler_aws_hpc.env
Then continue from Step 3.
Step 3: Start the Scheduler¶
scaler_scheduler tcp://0.0.0.0:8516
Note
The scheduler address must be reachable from the machine running the AWS HPC worker manager. Use 0.0.0.0 to bind to all interfaces, or your machine’s public/private IP.
Step 4: Start the AWS HPC Worker Manager¶
scaler_worker_manager aws_hpc tcp://<SCHEDULER_IP>:8516 \
--job-queue "$SCALER_JOB_QUEUE" \
--job-definition "$SCALER_JOB_DEFINITION" \
--s3-bucket "$SCALER_S3_BUCKET" \
--aws-region "$SCALER_AWS_REGION"
Or use a TOML configuration file:
scaler config.toml
[[worker_manager]]
type = "aws_hpc"
scheduler_address = "tcp://<SCHEDULER_IP>:8516"
object_storage_address = "tcp://<SCHEDULER_IP>:8517"
worker_manager_id = "wm-batch"
job_queue = "scaler-batch-queue"
job_definition = "scaler-batch-job"
s3_bucket = "scaler-batch-123456789012-us-east-1"
aws_region = "us-east-1"
max_concurrent_jobs = 100
job_timeout_minutes = 60
Step 5: Submit Tasks¶
from scaler import Client

def heavy_computation(x):
    return x ** 2

with Client(address="tcp://<SCHEDULER_IP>:8516") as client:
    futures = client.map(heavy_computation, range(50))
    results = [f.result() for f in futures]
    print(results)
Cleanup¶
To tear down all provisioned AWS resources:
python -m scaler.worker_manager_adapter.aws_hpc.utility.provisioner cleanup \
--region us-east-1 \
--prefix scaler-batch
How It Works¶
The worker manager connects to the Scaler scheduler as a worker and receives tasks.
Each task is serialized with cloudpickle and either passed inline (≤ 28 KB) or uploaded to S3.
The worker manager submits an AWS Batch job for each task.
Inside the Batch container, a runner script (batch_job_runner.py) deserializes the task, executes the function, and writes the result to S3.
The worker manager polls for job completion, fetches the result from S3, and returns it to the scheduler.
A semaphore limits concurrent Batch jobs (--max-concurrent-jobs) to prevent exceeding AWS service quotas.
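The semaphore-based concurrency control can be sketched as follows (illustrative only; the stub `submit_batch_job` stands in for the real Batch submission and polling logic, and the names here are hypothetical):

```python
import asyncio

MAX_CONCURRENT_JOBS = 3  # mirrors --max-concurrent-jobs

async def submit_batch_job(task_id: int) -> str:
    # Stub standing in for submitting a Batch job and polling it to completion.
    await asyncio.sleep(0.01)
    return f"result-{task_id}"

async def run_tasks(task_ids):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_JOBS)

    async def guarded(task_id):
        async with semaphore:  # at most MAX_CONCURRENT_JOBS jobs in flight
            return await submit_batch_job(task_id)

    return await asyncio.gather(*(guarded(t) for t in task_ids))

print(asyncio.run(run_tasks(range(5))))
```

Tasks beyond the limit simply wait on the semaphore, so the manager never has more Batch jobs outstanding than the configured cap.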
Configuration Reference¶
AWS HPC Parameters¶
scheduler_address (positional, required): Address of the Scaler scheduler.
--job-queue (-q, required): AWS Batch job queue name.
--job-definition (-d, required): AWS Batch job definition name.
--s3-bucket (required): S3 bucket for task payloads and results.
--aws-region: AWS region (default: us-east-1).
--s3-prefix: S3 key prefix (default: scaler-tasks).
--max-concurrent-jobs (-mcj): Max concurrent Batch jobs (default: 100).
--job-timeout-minutes: Max job runtime in minutes (default: 60).
--backend (-b): HPC backend (default: batch).
--name (-n): Custom name for the worker manager instance.
Common Parameters¶
For networking, worker behavior, logging, and event loop options, see Common Worker Manager Parameters.
Provisioner Reference¶
The provisioner supports these commands:
provision Create all AWS resources and push Docker image
cleanup Tear down all AWS resources
show Display saved configuration
build-image Build and push Docker image only
| Flag | Default | Description |
|---|---|---|
| --region | | AWS region |
| --prefix | | Resource name prefix |
| | (auto-build) | Container image URI (if provided, skips Docker build and push to ECR) |
| --vcpus | | vCPUs per job |
| --memory | | Memory per job in MB (uses 90% of nearest 2048 MB multiple) |
| --max-vcpus | | Max vCPUs for compute environment |
| | | EC2 instance types (comma-separated) |
| | | Job timeout in minutes |
Architecture¶
┌─────────┐ ┌───────────┐ ┌─────────────────┐ ┌───────────────────┐ ┌───────────┐
│ Client │────>│ Scheduler │────>│ AWSBatchWorker │────>│ AWSHPCTaskManager │────>│ AWS Batch │
└─────────┘ └───────────┘ └─────────────────┘ └───────────────────┘ └───────────┘
│ │ │
v v v
┌─────────────────┐ ┌───────────┐ ┌───────────┐
│HeartbeatManager │ │ S3 Bucket │<────────│ Batch Job │
└─────────────────┘ └───────────┘ └───────────┘
| Component | Description |
|---|---|
| Client | Submits tasks to the scheduler using the Scaler API |
| Scheduler | Distributes tasks to available workers via ZMQ streaming |
| AWSBatchWorker | Process that connects to the scheduler and routes messages to the TaskManager |
| AWSHPCTaskManager | Handles task queuing, priority, concurrency control, and AWS Batch job submission |
| HeartbeatManager | Sends periodic heartbeats to the scheduler with worker status |
| S3 Bucket | Stores task payloads (for large tasks) and job results |
| AWS Batch | Executes tasks as containerized jobs on an EC2 compute environment |
Payload Handling¶
Task payloads are serialized with cloudpickle and delivered to AWS Batch jobs.
Payloads larger than 4 KB are gzip-compressed before transfer. The resulting payload
(compressed or not) is passed inline via job parameters if it fits within 28 KiB;
otherwise it is uploaded to S3.
| Condition | Method |
|---|---|
| Raw payload ≤ 4 KB | Inline (uncompressed) |
| Raw payload > 4 KB, compressed ≤ 28 KB | Inline (gzip compressed) |
| Raw payload > 4 KB, compressed > 28 KB | S3 upload |
Troubleshooting¶
Jobs stuck in RUNNABLE:
Check that your compute environment has sufficient capacity (--max-vcpus) and that subnets have internet access for pulling container images.
Permission errors: Ensure the IAM role attached to the job definition has S3 read/write access to the task bucket. The provisioner creates this automatically.
Credential expiration: The worker manager auto-refreshes expired AWS credentials. If using temporary credentials, ensure your session token is valid.
Container image issues:
Your job definition image must have the same Python version as the client (required for cloudpickle compatibility), plus cloudpickle and boto3 installed.
Timeout waiting for result:
Check the AWS Batch console for job status. Increase --max-concurrent-jobs if jobs are queued, or check compute environment capacity.