OCI HPC Worker Manager¶
The OCI HPC worker manager offloads task execution to OCI Container Instances, running each Scaler task as a dedicated containerized job. Unlike the OCI Raw worker manager, which launches long-running worker processes, the OCI HPC worker manager creates a new Container Instance for every task. Task payloads and results are relayed through OCI Object Storage. Use this worker manager when you need per-task isolation, access to high-memory or specialized hardware shapes, or want to burst large workloads to OCI.
Prerequisites¶
Quick Start¶
Install OCI CLI and configure credentials:
bash -c "$(curl -L https://raw.githubusercontent.com/oracle/oci-cli/master/scripts/install/install.sh)"
oci setup config
Install Scaler with OCI extras:
python -m venv .venv
source .venv/bin/activate
pip install opengris-scaler[oci]
Provision required OCI resources (Object Storage bucket, Dynamic Group, IAM policies, OCIR repository):
python -m scaler.worker_manager_adapter.oci_hpc.utility.provisioner provision \
--compartment-id ocid1.compartment.oc1..example \
--region us-ashburn-1 \
--availability-domain AD-1 \
--subnet-id ocid1.subnet.oc1..example
This writes a .scaler_oci_config.json file in the current directory. Build and push the job runner image:
python -m scaler.worker_manager_adapter.oci_hpc.utility.provisioner \
build-image --config .scaler_oci_config.json
Validate all phases with the test harness:
python tests/worker_manager_adapter/oci_hpc/oci_hpc_test_harness.py \
--config .scaler_oci_config.json
Start all services:
[object_storage_server]
bind_address = "tcp://127.0.0.1:8517"
[scheduler]
bind_address = "tcp://127.0.0.1:8516"
object_storage_address = "tcp://127.0.0.1:8517"
[[worker_manager]]
type = "oci_hpc"
scheduler_address = "tcp://127.0.0.1:8516"
object_storage_address = "tcp://127.0.0.1:8517"
worker_manager_id = "wm-oci-hpc"
object_storage_namespace = "<tenancy-namespace>"
object_storage_bucket = "scaler-tasks"
oci_region = "us-ashburn-1"
compartment_id = "ocid1.compartment.oc1..example"
availability_domain = "AD-1"
subnet_id = "ocid1.subnet.oc1..example"
container_image = "us-ashburn-1.ocir.io/<namespace>/<repo>:latest"
base_concurrency = 100
job_timeout_seconds = 3600
Run command:
scaler config.toml
scaler_object_storage_server tcp://127.0.0.1:8517
scaler_scheduler tcp://0.0.0.0:8516 \
--object-storage-address tcp://127.0.0.1:8517
scaler_worker_manager oci_hpc tcp://127.0.0.1:8516 \
--worker-manager-id wm-oci-hpc \
--object-storage-address tcp://127.0.0.1:8517 \
--object-storage-namespace "<tenancy-namespace>" \
--object-storage-bucket scaler-tasks \
--oci-region us-ashburn-1 \
--compartment-id ocid1.compartment.oc1..example \
--availability-domain AD-1 \
--subnet-id ocid1.subnet.oc1..example \
--container-image us-ashburn-1.ocir.io/<namespace>/<repo>:latest \
--base-concurrency 100 \
--job-timeout-seconds 3600
from scaler import Client
def heavy_computation(x):
return x ** 2
with Client(address="tcp://127.0.0.1:8516") as client:
futures = client.map(heavy_computation, range(50))
print([f.result() for f in futures])
Using Existing Infrastructure¶
If you already have OCI resources, skip the provisioner and set environment variables directly:
export OCI_REGION="us-ashburn-1"
export OCI_COMPARTMENT_ID="ocid1.compartment.oc1..example"
export OCI_AVAILABILITY_DOMAIN="AD-1"
export OCI_SUBNET_ID="ocid1.subnet.oc1..example"
export OCI_CONTAINER_IMAGE="us-ashburn-1.ocir.io/<namespace>/<repo>:latest"
export OCI_OBJECT_STORAGE_NAMESPACE="<tenancy-namespace>"
export OCI_OBJECT_STORAGE_BUCKET="scaler-tasks"
How It Works¶
The worker manager connects to the Scaler scheduler as a worker and receives tasks.
Each task is serialized with
cloudpickleand uploaded to OCI Object Storage.A new Container Instance is created for each task. The job runner script inside the container fetches the payload from Object Storage, deserializes and executes the function, and writes the result back to Object Storage.
The worker manager polls for container completion, fetches the result from Object Storage, and returns it to the scheduler.
A semaphore limits concurrent Container Instances (
base_concurrency) to prevent exceeding OCI service limits.
Container instances authenticate to Object Storage via OCI Resource Principals — the container does not need credentials embedded in the image.
Configuration Reference¶
OCI HPC Parameters¶
scheduler_address(positional, required): Address of the Scaler scheduler.--worker-manager-id(-wmi, required): Unique identifier for this worker manager instance.--object-storage-address: Scaler Object Storage address (used by the worker manager itself).--object-storage-namespace(required): OCI Object Storage tenancy namespace.--object-storage-bucket(required): OCI Object Storage bucket name for task payloads and results.--object-storage-prefix: Key prefix for task objects (default:scaler-tasks).--base-concurrency(-bc): Maximum number of concurrently running container instances (default:100).--job-timeout-seconds: Maximum runtime in seconds for a single container instance (default:3600).
Container Instance Config¶
--oci-region: OCI region identifier (default:us-ashburn-1).--compartment-id(required): OCI Compartment OCID where container instances are launched.--availability-domain(required): OCI Availability Domain (e.g.AD-1orUocm:US-ASHBURN-AD-1).--subnet-id(required): Subnet OCID for container instance network interfaces.--container-image(required): OCIR image URI for the job runner container.--instance-shape: Container instance shape (default:CI.Standard.E4.Flex).--auth-type: OCI authentication mode —config_file(default) orinstance_principal.--oci-profile: OCI config file profile name (default:DEFAULT).
Sizing Parameters¶
--instance-ocpus: Number of OCPUs per container instance (default:1.0).--instance-memory-gb: Memory in GB per container instance (default:6.0).
Common Parameters¶
For worker behavior, logging, and event loop options, see Common Worker Manager Parameters.
Provisioner Reference¶
The provisioner supports these commands:
provision Create all OCI resources and generate config file
cleanup Tear down all OCI resources
build-image Build and push job runner Docker image
Flag |
Default |
Description |
|---|---|---|
|
(required) |
OCI Compartment OCID |
|
|
OCI region identifier |
|
(required) |
OCI Availability Domain |
|
(required) |
OCI Subnet OCID |
|
|
Path to saved provisioner config (used by |
Architecture¶
┌─────────┐ ┌───────────┐ ┌──────────────────────┐ ┌──────────────────────────┐
│ Client │────>│ Scheduler │────>│ OCI HPC WorkerProc │────>│ OCI Container Instances │
└─────────┘ └───────────┘ └──────────────────────┘ └────────────┬─────────────┘
│ │
v v
┌─────────────────┐ ┌──────────────────┐
│ HeartbeatManager│ │ Object Storage │
└─────────────────┘ └──────────────────┘
Component |
Description |
|---|---|
Client |
Submits tasks to the scheduler using the Scaler API |
Scheduler |
Distributes tasks to available workers via ZMQ streaming |
OCI HPC WorkerProc |
Connects to the scheduler, receives tasks, submits them as Container Instance jobs, and returns results |
Object Storage |
Relays task payloads (serialized with |
OCI Container Instances |
Runs the job runner script per task; authenticates to Object Storage via Resource Principals |
Payload Handling¶
Task payloads are serialized with cloudpickle and uploaded to OCI Object Storage. The container instance job runner downloads the payload, executes the task, and writes the result back to Object Storage. The worker manager fetches the result and returns it to the scheduler.
Note
The provisioner sets a 7-day lifecycle policy on the Object Storage bucket. Ensure this is at least as long as job_timeout_seconds (default: 1 hour). Objects that expire before the worker manager fetches the result will cause the task to fail silently.
Troubleshooting¶
Container instances created but result never returns:
Check container logs in OCI Console → Container Instances → your instance → Logs. Verify that the Object Storage bucket lifecycle is at least as long as job_timeout_seconds.
Permission errors when launching container instances:
Ensure your IAM user or Dynamic Group has manage container-instances and manage virtual-network-family policies in the compartment. The provisioner creates these automatically.
Resource Principal errors inside the container:
The job runner uses OCI Resource Principals to access Object Storage. Ensure the container instance’s compartment has a Dynamic Group policy that grants manage objects on the bucket. The provisioner sets this up automatically.
Jobs stuck in CREATING:
Check OCI service limits for Container Instances in your region and availability domain. If you have hit the limit, reduce base_concurrency or request a limit increase.
Container image pull errors: Ensure the OCIR repository is in the same region and that the container instance’s subnet can reach the OCIR endpoint. For private repositories, ensure the OCIR pull secret is accessible.