AWS HPC Batch Worker Manager¶
The AWS HPC worker manager offloads task execution to AWS Batch, running each Scaler task as a containerized job on managed EC2 compute. Use this worker manager when you need to burst workloads to the cloud, access specific hardware (GPUs, high memory), or run long-running jobs at scale.
The worker manager is designed as an extensible HPC framework — AWS Batch is the currently supported backend.
Prerequisites¶
- An AWS account
- AWS CLI installed and configured (`aws configure`)
- Docker installed (for building the worker container image)
- Python 3.10+ with `pip install opengris-scaler[aws]`
Note
Building from source requires C++ build tools (cmake, ninja, pkg-config, and a C++20 compiler).
On macOS: brew install ninja pkg-config.
Alternatively, use the project’s devcontainer which has all dependencies pre-installed:
docker build -t scaler-dev -f .devcontainer/Dockerfile .
docker run -it --rm -v $(pwd):/workspace -w /workspace scaler-dev bash
Quick Start¶
Install AWS CLI v2:
Warning
Do not use pip install awscli for this setup. That installs AWS CLI v1.
Use the official AWS CLI v2 installer instead.
# x86_64
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws --version

# ARM64 (aarch64)
curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
Create a virtual environment, install Scaler AWS extras, and authenticate:
python -m venv .venv
source .venv/bin/activate
pip install opengris-scaler[aws]
aws login
Open the sign-in link in your default browser, complete sign-in, then return to the terminal and follow the AWS CLI instructions.

On a remote or headless machine, use:
aws login --remote
Open the URL shown by the command in your local browser, complete sign-in, then copy the returned code/token and paste it back into the remote terminal to complete authentication.
Provision required AWS resources (S3 bucket, IAM roles, compute environment, job queue, and job definition):
python -m scaler.worker_manager_adapter.aws_hpc.utility.provisioner provision --region us-east-1 --prefix scaler-batch --vcpus 1 --memory 2048 --max-vcpus 256
source tests/worker_manager_adapter/aws_hpc/.scaler_aws_hpc.env
The provisioner also builds and pushes the worker image from Dockerfile.batch.
The image must use the same Python version as the client and include cloudpickle and boto3.
The provisioner also creates the required AWS resources and applies the required IAM permissions; see Required AWS Resources and Required AWS IAM Permissions.
Start the object storage server, scheduler, and worker manager from one TOML file (replace the account ID):
[object_storage_server]
bind_address = "tcp://127.0.0.1:2346"
[scheduler]
bind_address = "tcp://127.0.0.1:2345"
object_storage_address = "tcp://127.0.0.1:2346"
[[worker_manager]]
type = "aws_hpc"
scheduler_address = "tcp://127.0.0.1:2345"
object_storage_address = "tcp://127.0.0.1:2346"
worker_manager_id = "wm-batch"
job_queue = "scaler-batch-queue"
job_definition = "scaler-batch-job"
s3_bucket = "scaler-batch-123456789012-us-east-1" # replace 123456789012 with your account ID
aws_region = "us-east-1"
max_concurrent_jobs = 100
job_timeout_minutes = 60
scaler config.toml
from scaler import Client

def heavy_computation(x):
    return x ** 2

with Client(address="tcp://127.0.0.1:2345") as client:
    results = client.map(heavy_computation, [(i,) for i in range(50)])
    print(results)
Detailed Setup¶
Step 1: Provision AWS Resources¶
Warning
The provisioner creates resources for quick testing and development only. For production deployments, use your organization’s infrastructure-as-code tools (CloudFormation, CDK, Terraform) with proper security configurations.
Scaler includes a provisioner script that creates all required AWS infrastructure (S3 bucket, IAM roles, EC2 compute environment, job queue, job definition, and ECR repository):
python -m scaler.worker_manager_adapter.aws_hpc.utility.provisioner provision \
--region us-east-1 \
--prefix scaler-batch \
--vcpus 1 \
--memory 2048 \
--max-vcpus 256
This will:
Build and push a Docker worker image to ECR
Create an S3 bucket for task payloads and results (with 1-day lifecycle policy)
Create IAM roles with the minimum required permissions
Create an EC2 compute environment and job queue
Register a Batch job definition
The provisioner saves its configuration to:
- `tests/worker_manager_adapter/aws_hpc/.scaler_aws_hpc.env` — shell environment file
- `tests/worker_manager_adapter/aws_hpc/.scaler_aws_batch_config.json` — full resource details (used for cleanup)
Source the env file to set variables for subsequent commands:
source tests/worker_manager_adapter/aws_hpc/.scaler_aws_hpc.env
Memory configuration: Memory is rounded to the nearest multiple of 2048 MB and 90% is allocated to the container. For example, --memory 4000 → 4096 MB total → 3686 MB effective.
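The rounding rule above can be sketched as follows (`effective_container_memory` is a hypothetical helper name, not part of Scaler, and it only models the stated rule, not edge cases for very small requests):

```python
def effective_container_memory(requested_mb: int) -> tuple[int, int]:
    """Mirror the documented memory rule: round the request to the nearest
    multiple of 2048 MB, then allocate 90% of that to the container."""
    total_mb = round(requested_mb / 2048) * 2048
    effective_mb = int(total_mb * 0.9)
    return total_mb, effective_mb

# --memory 4000 -> 4096 MB total -> 3686 MB effective, as in the example above
```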
Using Existing Infrastructure¶
If you already have AWS Batch resources (created via CloudFormation, CDK, Terraform, etc.), skip the provisioner and create the env file manually:
cat > .scaler_aws_hpc.env << 'EOF'
export SCALER_AWS_REGION="us-east-1"
export SCALER_S3_BUCKET="your-existing-bucket"
export SCALER_JOB_QUEUE="your-existing-queue"
export SCALER_JOB_DEFINITION="your-existing-job-def"
EOF
source .scaler_aws_hpc.env
Then continue from Step 2.
Step 2: Set Up the Environment¶
Option A: Native install (macOS/Linux)
pip install opengris-scaler[aws]
On macOS, you may need build tools: brew install ninja pkg-config.
Option B: Devcontainer (recommended for development/testing)
The devcontainer has all C++ build tools and dependencies pre-installed.
Build the devcontainer image (on host):
docker build -t scaler-dev -f .devcontainer/Dockerfile .
Export AWS credentials and start the container:
# If using assumed roles (isengardcli, SSO, etc.)
eval $(aws configure export-credentials --format env)
docker run -it --rm \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
-e AWS_SESSION_TOKEN \
-e AWS_DEFAULT_REGION=us-east-1 \
-v $(pwd):/workspace -w /workspace scaler-dev bash
# If using static credentials (~/.aws/credentials)
# docker run -it --rm -v ~/.aws:/root/.aws:ro \
# -v $(pwd):/workspace -w /workspace scaler-dev bash
Install inside the container:
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pip install boto3
Note
Session credentials expire (typically 1 hour). If you get credential errors, exit the container, re-export credentials on the host, and restart the container.
Step 3: Start All Processes¶
Use a single TOML configuration file to start the object storage server, scheduler, and worker manager together.
Source the provisioned config and generate the TOML:
source tests/worker_manager_adapter/aws_hpc/.scaler_aws_hpc.env
sed -e "s|scaler-batch-queue|${SCALER_JOB_QUEUE}|" \
-e "s|scaler-batch-job|${SCALER_JOB_DEFINITION}|" \
-e "s|scaler-batch-ACCOUNT_ID-us-east-1|${SCALER_S3_BUCKET}|" \
-e "s|aws_region = \"us-east-1\"|aws_region = \"${SCALER_AWS_REGION}\"|" \
tests/worker_manager_adapter/aws_hpc/scaler_aws_hpc_batch.toml > /tmp/config.toml
Start all processes:
scaler /tmp/config.toml
See tests/worker_manager_adapter/aws_hpc/scaler_aws_hpc_batch.toml for the template.
Alternatively, start each process separately:
scaler_object_storage_server tcp://127.0.0.1:2346 &
scaler_scheduler tcp://0.0.0.0:2345 --object-storage-address tcp://127.0.0.1:2346 &
scaler_worker_manager aws_hpc tcp://127.0.0.1:2345 -wmi wm-batch \
--job-queue "$SCALER_JOB_QUEUE" --job-definition "$SCALER_JOB_DEFINITION" \
--s3-bucket "$SCALER_S3_BUCKET" --aws-region "$SCALER_AWS_REGION" &
Note
The scheduler address must be reachable from the machine running the AWS HPC worker manager. Use 0.0.0.0 to bind to all interfaces, or your machine’s public/private IP.
Step 4: Run Tests¶
With all processes running (from Step 3), submit tasks:
python tests/worker_manager_adapter/aws_hpc/aws_hpc_test_harness.py \
--scheduler tcp://127.0.0.1:2345 --test all
Or use the Scaler client directly:
from scaler import Client

def heavy_computation(x):
    return x ** 2

with Client(address="tcp://<SCHEDULER_IP>:2345") as client:
    futures = [client.submit(heavy_computation, i) for i in range(50)]
    results = [f.result() for f in futures]
    print(results)
Cleanup¶
To tear down all provisioned AWS resources:
python -m scaler.worker_manager_adapter.aws_hpc.utility.provisioner cleanup \
--region us-east-1 \
--prefix scaler-batch
How It Works¶
1. The worker manager connects to the Scaler scheduler as a worker and receives tasks.
2. Each task is serialized with `cloudpickle` and either passed inline (≤ 28 KB) or uploaded to S3.
3. When multiple tasks arrive within a short window (0.5 s), they are automatically batched into a single AWS Batch array job, reducing API calls from N to 1. Single tasks are submitted individually.
4. Inside the Batch container, a runner script (`batch_job_runner.py`) deserializes the task, executes the function, and writes the result to S3. For array jobs, each child container uses its `AWS_BATCH_JOB_ARRAY_INDEX` to pick the correct payload.
5. The worker manager polls for job completion, fetches the result from S3, and returns it to the scheduler.
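The 0.5 s batching window can be sketched with an asyncio queue. This is illustrative only; `collect_batch` is a hypothetical name, not Scaler's internal API:

```python
import asyncio

async def collect_batch(queue: asyncio.Queue, window: float = 0.5) -> list:
    """Wait for one task, then keep collecting until the window closes.
    A batch with more than one task would become a single array job."""
    batch = [await queue.get()]          # block until the first task arrives
    loop = asyncio.get_running_loop()
    deadline = loop.time() + window
    while (remaining := deadline - loop.time()) > 0:
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break                        # window closed with the queue empty
    return batch
```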
A semaphore limits concurrent Batch jobs (--max-concurrent-jobs) to prevent exceeding AWS service quotas. All AWS API calls run in a thread pool to avoid blocking the heartbeat loop.
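A minimal sketch of that concurrency pattern (the names are illustrative, not Scaler's internals): an `asyncio.Semaphore` caps in-flight jobs while blocking submit calls run in the default thread pool, keeping the event loop free.

```python
import asyncio

def blocking_submit(task_id: int) -> str:
    # Stand-in for a blocking AWS Batch submit call (e.g. via boto3).
    return f"job-{task_id}"

async def submit_all(task_ids, max_concurrent_jobs: int = 100) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent_jobs)   # caps in-flight Batch jobs
    loop = asyncio.get_running_loop()

    async def submit_one(task_id: int) -> str:
        async with sem:
            # Run the blocking call in the default thread pool so the event
            # loop (and anything like a heartbeat task) is never blocked.
            return await loop.run_in_executor(None, blocking_submit, task_id)

    # gather preserves input order even when submissions finish out of order
    return await asyncio.gather(*(submit_one(t) for t in task_ids))
```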
Configuration Reference¶
AWS HPC Parameters¶
- `scheduler_address` (positional, required): Address of the Scaler scheduler.
- `--worker-manager-id` (`-wmi`, required): Unique identifier for this worker manager instance.
- `--job-queue` (`-q`, required): AWS Batch job queue name.
- `--job-definition` (`-d`, required): AWS Batch job definition name.
- `--s3-bucket` (required): S3 bucket for task payloads and results.
- `--aws-region`: AWS region (default: `us-east-1`).
- `--s3-prefix`: S3 key prefix (default: `scaler-tasks`).
- `--max-concurrent-jobs` (`-mcj`): Max concurrent Batch jobs (default: `100`).
- `--job-timeout-minutes`: Max job runtime in minutes (default: `60`).
- `--backend` (`-b`): HPC backend (default: `batch`).
- `--name` (`-n`): Custom name for the worker manager instance.
Common Parameters¶
For networking, worker behavior, logging, and event loop options, see Common Worker Manager Parameters.
Provisioner Reference¶
The provisioner supports these commands:
- `provision`: create all AWS resources and push the Docker image
- `cleanup`: tear down all AWS resources
- `show`: display the saved configuration
- `build-image`: build and push the Docker image only
| Flag | Default | Description |
|---|---|---|
| `--region` |  | AWS region |
| `--prefix` |  | Resource name prefix |
|  | (auto-build) | Container image URI (if provided, skips Docker build and push to ECR) |
| `--vcpus` |  | vCPUs per job |
| `--memory` |  | Memory per job in MB (uses 90% of nearest 2048 MB multiple) |
| `--max-vcpus` |  | Max vCPUs for compute environment |
|  |  | EC2 instance types (comma-separated) |
|  |  | Job timeout in minutes |
Architecture¶
┌─────────┐ ┌───────────┐ ┌─────────────────┐ ┌───────────────────┐ ┌───────────┐
│ Client │────>│ Scheduler │────>│ AWSBatchWorker │────>│ AWSHPCTaskManager │────>│ AWS Batch │
└─────────┘ └───────────┘ └─────────────────┘ └───────────────────┘ └───────────┘
│ │ │
v v v
┌─────────────────┐ ┌───────────┐ ┌───────────┐
│HeartbeatManager │ │ S3 Bucket │<────────│ Batch Job │
└─────────────────┘ └───────────┘ └───────────┘
| Component | Description |
|---|---|
| Client | Submits tasks to the scheduler using the Scaler API |
| Scheduler | Distributes tasks to available workers via ZMQ streaming |
| AWSBatchWorker | Process that connects to the scheduler and routes messages to the TaskManager |
| AWSHPCTaskManager | Handles task queuing, priority, concurrency control, and AWS Batch job submission |
| HeartbeatManager | Sends periodic heartbeats to the scheduler with worker status |
| S3 Bucket | Stores task payloads (for large tasks) and job results |
| AWS Batch | Executes tasks as containerized jobs on an EC2 compute environment |
Payload Handling¶
Task payloads are serialized with `cloudpickle` and delivered to AWS Batch jobs. Payloads larger than 4 KB are gzip-compressed before transfer. The resulting payload (compressed or not) is passed inline via job parameters if it fits within 28 KB; otherwise it is uploaded to S3.
| Condition | Method |
|---|---|
| Raw payload ≤ 4 KB | Inline (uncompressed) |
| Raw payload > 4 KB, compressed ≤ 28 KB | Inline (gzip compressed) |
| Raw payload > 4 KB, compressed > 28 KB | S3 upload |
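The decision rule in the table can be written as a small helper. This is a hypothetical sketch of the documented rule, not Scaler's actual code:

```python
import gzip

COMPRESS_THRESHOLD = 4 * 1024   # 4 KB: compress anything larger
INLINE_LIMIT = 28 * 1024        # 28 KB: largest payload passed inline

def choose_transport(payload: bytes) -> tuple[str, bytes]:
    """Return (method, data) following the size rules in the table above."""
    if len(payload) <= COMPRESS_THRESHOLD:
        return "inline", payload
    compressed = gzip.compress(payload)
    if len(compressed) <= INLINE_LIMIT:
        return "inline-gzip", compressed
    return "s3", payload        # too big even compressed: upload raw payload to S3
```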
Troubleshooting¶
Jobs stuck in RUNNABLE:
Check that your compute environment has sufficient capacity (--max-vcpus) and that subnets have internet access for pulling container images.
Permission errors: Ensure the IAM role attached to the job definition has S3 read/write access to the task bucket. The provisioner creates this automatically.
Credential expiration: The worker manager auto-refreshes expired AWS credentials. If using temporary credentials, ensure your session token is valid.
Container image issues:
Your job definition image must have the same Python version as the client (required for cloudpickle compatibility), plus cloudpickle and boto3 installed.
Timeout waiting for result:
Check the AWS Batch console for job status. Increase --max-concurrent-jobs if jobs are queued, or check compute environment capacity.
Required AWS Resources¶
The provisioner creates these resources automatically:
| Resource | Name pattern | Purpose |
|---|---|---|
| S3 bucket | `{prefix}-{account_id}-{region}` | Task payloads and results (1-day lifecycle) |
| IAM job role | `{prefix}-job-role` | Assumed by Batch container tasks |
| IAM instance role | `{prefix}-instance-role` | Assumed by EC2 instances in compute environment |
| EC2 instance profile |  | Wraps instance role for EC2 |
| ECR repository |  | Stores worker Docker image |
| Compute environment |  | Managed EC2 fleet |
| Job queue | `{prefix}-queue` | Routes jobs to compute environment |
| Job definition | `{prefix}-job` | Container spec and entrypoint |
| CloudWatch log group |  | Job logs (30-day retention) |
Required AWS IAM Permissions¶
Your IAM user/role (to run the provisioner and worker manager):

- S3: `s3:CreateBucket`, `s3:PutObject`, `s3:GetObject`, `s3:DeleteObject`, `s3:PutLifecycleConfiguration`
- IAM: `iam:CreateRole`, `iam:AttachRolePolicy`, `iam:PutRolePolicy`, `iam:CreateInstanceProfile`, `iam:AddRoleToInstanceProfile`, `iam:GetRole`, `iam:PassRole`
- Batch: `batch:CreateComputeEnvironment`, `batch:CreateJobQueue`, `batch:RegisterJobDefinition`, `batch:SubmitJob`, `batch:DescribeJobs`, `batch:DescribeComputeEnvironments`, `batch:DescribeJobQueues`, `batch:TerminateJob`, `batch:DeregisterJobDefinition`
- ECR: `ecr:CreateRepository`, `ecr:GetAuthorizationToken`, `ecr:PutLifecyclePolicy`, `ecr:BatchDeleteImage`
- EC2: `ec2:DescribeSubnets`, `ec2:DescribeSecurityGroups`
- CloudWatch Logs: `logs:CreateLogGroup`, `logs:PutRetentionPolicy`, `logs:GetLogEvents`
Quick-setup managed policies: AmazonS3FullAccess, AWSBatchFullAccess,
AmazonEC2ContainerRegistryFullAccess, IAMFullAccess, CloudWatchLogsFullAccess.
Batch job role (`{prefix}-job-role`, assumed by `ecs-tasks.amazonaws.com`):

- `AmazonECSTaskExecutionRolePolicy` (managed; covers ECR pulls and CloudWatch Logs writes)
- Inline S3 policy: `s3:GetObject`, `s3:PutObject`, `s3:DeleteObject` on `arn:aws:s3:::{bucket}/{prefix}/*`

EC2 instance role (`{prefix}-instance-role`, assumed by `ec2.amazonaws.com`):

- `AmazonEC2ContainerServiceforEC2Role` (managed; allows EC2 instances to register with ECS/Batch, pull images, and report status)