Implementation details
======================

.. note::

    This section provides additional design insights on how we implemented the Parfun library. **Most users should not be
    required to go through this section in detail**.


Design goals
------------

* Distribute the computation of parallelized functions on **multiple servers**.
* Users of the decorator **should not be required to know anything about the underlying infrastructure**.
* The system should **automatically determine the optimal partition size**.


Implementation
--------------

Parfun **uses on either a local or a distributed worker pool to manage and execute tasks** on multiple
machines (see the :py:class:`~parfun.backend.mixins.BackendEngine` interface).

**This parallel engine is responsible for queuing and executing the partitioned tasks**. The library relies on
**heuristics to determine the ideal partition size** based on feedback from the previously executed tasks.


Partition size estimation
-------------------------

Parfun determines the best partition size for parallelization by testing the function on a range of partition sizes.
This is called the **learning phase**.

There are multiple ways of determining the optimal partition size, depending on what constraints you wish to optimize.
**We designed our partition size estimator so that it finds a partition size that provides decent parallelism without
overusing too much computing resources**. In other words, our estimator will prefer a 10x speedup that requires 15 cores
to a 12x speedup that requires 50 cores.

Consider the following graph of how our ``relative_metrics()`` function performs on various partition sizes:


.. image:: images/partition_size_estimator_graph_1.png


**We observe that the relative processing speed (i.e., the time to process one dataset row) drops dramatically as partition size
increases, but stabilizes around a partition size of about 3000** (graph is log scale). This tells us that
smaller partition sizes cause too much overhead (code initialization, input preprocessing ...), while partition sizes
above a few thousands do not significantly reduce the overhead.

Mathematically, we could fit a linear function ``f(partition_size) = α + β / partition_size``, where ``α`` would be the
time it takes to process one row, and ``β`` the function scheduling and initialization duration:


.. image:: images/partition_size_estimator_graph_2.png


Knowing ``α`` and ``β``, the partition size estimator finds a partition size that satisfies these two constraints:

* **The function initialization time (** ``β`` **) should be less than 5% of the total execution time**. This ensures
  that parallelization overhead is balanced with the actual processing time.

* **The processing of a single partition should at least exceed the CPU time required to schedule the task**. There is
  some unmeasured overhead (IPC, OS task management) that will make parallelization of short tasks inefficient.

Based on these constraints, the estimator suggests a partition size of about 2000 for our example function (``relative_metrics()``).