Basic parallel tasks using parfun

For more details about parfun features and APIs, see the parfun documentation.

Count words with parfun

This simple example counts the number of words in a single large file.

Using parfun, the computation is parallelized across multiple workers on a Scaler cluster using a map-reduce strategy.

The @pf.parallel decorator specifies how the input data is split across workers and how the partial results are combined.
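Conceptually, the decorator automates a manual map-reduce like the following sketch (plain sequential Python, no parfun; the chunk size is an arbitrary choice for illustration):

```python
from collections import Counter
from typing import List


def count_words_sequential(lines: List[str]) -> Counter:
    counter: Counter = Counter()
    for line in lines:
        for word in line.split():
            counter[word] += 1
    return counter


lines = ["the cat sat", "on the mat", "the end"]

# Split: partition the input into chunks (what by_chunk automates)
chunk_size = 2
chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

# Map: count each chunk independently (parfun runs these on separate workers)
partial_counts = [count_words_sequential(chunk) for chunk in chunks]

# Combine: merge the partial counters, as in combine_with
total = sum(partial_counts, start=Counter())
assert total == count_words_sequential(lines)
```

Because word counting is independent per chunk and Counter addition is associative, the merged result is identical to counting the whole input at once, regardless of how the lines are chunked.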

(Figure: parfun count words flowchart)

import argparse
from collections import Counter
from typing import List

import parfun as pf


@pf.parallel(
    split=pf.per_argument(lines=pf.py_list.by_chunk),  # Parallelize by splitting the file content
    combine_with=lambda results: sum(results, start=Counter()),  # Sum the result counters
)
def count_words(lines: List[str]) -> Counter:
    counter: Counter = Counter()
    for line in lines:
        for word in line.split():
            counter[word] += 1
    return counter


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("filename", nargs="?", default=__file__)
    parser.add_argument("--scaler-address", dest="scaler_address", default=None)

    args = parser.parse_args()

    with open(args.filename, "r") as file:
        lines = file.readlines()

    if args.scaler_address:
        # Connects to a remote Scaler Cluster
        with pf.set_parallel_backend_context("scaler_remote", args.scaler_address):
            results = count_words(lines)
    else:
        # Creates a temporary local Scaler cluster
        with pf.set_parallel_backend_context("scaler_local"):
            results = count_words(lines)

    print(results)
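The combine step relies on collections.Counter supporting addition: sum(results, start=Counter()) adds the per-chunk counters element-wise into a single counter. A minimal illustration with two hypothetical partial counts:

```python
from collections import Counter

# Partial word counts from two hypothetical chunks of input
chunk_a = Counter({"the": 3, "cat": 1})
chunk_b = Counter({"the": 2, "mat": 1})

# The same merge that combine_with performs: sum the counters, starting empty
merged = sum([chunk_a, chunk_b], start=Counter())
print(merged["the"])  # → 5
```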

To count the words in a file, run the following command, connecting to a running Scaler cluster:

python examples/libraries/count_words.py README.md --scaler-address tcp://127.0.0.1:8516

Omitting --scaler-address runs the same computation on a temporary local Scaler cluster instead.