Workers

Fitting (for both continuous and discrete distributions) is performed by invoking the .fit() method. This method accepts an optional parameter n_workers (int, default 1), which sets the number of parallel processes used during fitting. Example usage:

```python
phi.fit(n_workers=4)
```
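
For a self-contained version of the call above, here is a minimal sketch. It assumes the `phi` object is a `phitter.PHITTER` instance built from a one-dimensional numeric sample; the library name, constructor, and the generated data are assumptions, so adapt them to your setup.

```python
import numpy as np
import phitter  # assumed: the `phi` object on this page is a phitter.PHITTER instance

# Example sample; any one-dimensional numeric sequence should work here.
data = np.random.default_rng(seed=1).normal(loc=10, scale=2, size=10_000).tolist()

phi = phitter.PHITTER(data)  # assumed constructor
phi.fit(n_workers=4)         # fit candidate distributions across 4 worker processes
```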

Although parallelization can reduce overall processing time, the relationship between sample size, number of workers, and execution speed is non-trivial. Excessive parallelization may introduce significant overhead and resource contention, causing longer run times in certain scenarios.

Benchmarks: Standard Google Colab Instance

Continuous Fit Time (seconds): Sample Size vs. Workers

| Sample Size / Workers | 1 | 2 | 6 | 10 | 20 |
|---|---|---|---|---|---|
| 1K | 8.2981 | 7.1242 | 8.9667 | 9.9287 | 16.2246 |
| 10K | 20.8711 | 14.2647 | 10.5612 | 11.6004 | 17.8562 |
| 100K | 152.6296 | 97.2359 | 57.7310 | 51.6182 | 53.2313 |
| 500K | 914.9291 | 640.8153 | 370.0323 | 267.4597 | 257.7534 |
| 1M | 1580.8501 | 972.3985 | 573.5429 | 496.5569 | 425.7809 |

Analysis (Continuous):

  • For smaller sample sizes (e.g., 1K), the use of two workers yields a moderate improvement over a single worker, whereas a higher number of workers (6, 10, and 20) actually results in increased computation time. This behavior arises from synchronization overhead and resource competition.
  • With larger sample sizes (e.g., 500K and 1M), adding more workers generally decreases computation time, yet the benefit diminishes and can even reverse once overhead grows large, as in the 100K row between 10 and 20 workers.

Discrete Fit Time (seconds): Sample Size vs. Workers

| Sample Size / Workers | 1 | 2 | 4 |
|---|---|---|---|
| 1K | 0.1688 | 2.6402 | 2.8719 |
| 10K | 0.4462 | 2.4452 | 3.0471 |
| 100K | 4.5598 | 6.3246 | 7.5869 |
| 500K | 19.0172 | 21.8047 | 19.8420 |
| 1M | 39.8065 | 29.8360 | 30.2334 |

Analysis (Discrete):

  • In the discrete case, employing multiple workers on smaller samples is frequently much slower (at 1K, two workers take roughly 15 times longer than one), indicating that the overhead can far outweigh the benefits of parallelization.
  • When the sample size is large (e.g., 1M), increased parallelization may reduce time; however, certain configurations still display regressions in performance, likely due to factors such as inter-process communication overhead.

Conclusion:
The optimal choice of n_workers depends on the interplay among sample size, distribution type (continuous vs. discrete), and available computational resources. Although more workers help in many situations, particularly for large datasets, there are clear cases where excessive parallelization degrades performance. It is therefore advisable to benchmark several worker configurations to identify the most efficient setting for a given context.
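
As a starting point for such testing, the sketch below times `.fit()` over several worker counts; the `phitter.PHITTER` constructor is the same assumption as in the earlier example, and a fresh instance is created per run so earlier fits do not skew the timings.

```python
import time

import numpy as np
import phitter  # assumed library, as in the earlier example


def benchmark(data, worker_counts=(1, 2, 6, 10, 20)):
    # Time .fit() for each worker count on the same sample.
    for n in worker_counts:
        phi = phitter.PHITTER(data)  # fresh fitter per run (assumed constructor)
        start = time.perf_counter()
        phi.fit(n_workers=n)
        print(f"n_workers={n:>2}: {time.perf_counter() - start:8.2f} s")


if __name__ == "__main__":  # guard needed if workers are spawned as new processes
    sample = np.random.default_rng(seed=42).normal(loc=50, scale=5, size=100_000).tolist()
    benchmark(sample)
```

As the benchmarks above show, expect the best setting to shift with sample size, so rerun this on data representative of your workload.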