Speeding Up the Fitting Process for Large Datasets
When dealing with extensive datasets, Phitter provides two primary strategies to accelerate the fitting procedure:
Strategy 1: Specify a subsample size smaller than the original dataset with the `subsample_size` parameter. A subsample size under 100,000 is recommended. Phitter randomly selects observations up to the specified limit. Detailed usage information is located in the sections Create Continuous Fit and Create Discrete Fit.

Strategy 2: Specify a smaller subsample for parameter estimation with the `subsample_estimation_size` parameter. A subsample size under 10,000 is suggested. With this approach, Phitter draws a random subsample to estimate the distribution parameters. Additional instructions appear in the sections Create Continuous Fit and Create Discrete Fit.

Each strategy can also be applied on its own, as sketched below.
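The following is a minimal sketch of each strategy in isolation. It assumes the same `phitter.Phitter` constructor and `fit` method shown in the combined example further down; the variable names and the `[...]` placeholder are illustrative.

```python
import phitter

# Placeholder dataset; supply your own observations
data: list[int | float] = [...]

# Strategy 1 on its own: fit against a random subsample
# of at most 100,000 observations from the full dataset
phi_s1 = phitter.Phitter(data=data, subsample_size=100_000)
phi_s1.fit()

# Strategy 2 on its own: estimate distribution parameters
# from a random subsample of at most 10,000 observations
phi_s2 = phitter.Phitter(data=data, subsample_estimation_size=10_000)
phi_s2.fit()
```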
You can apply either of these strategies, or both in combination, to improve performance. For instance:
```python
import phitter

# Define the dataset ([...] is a placeholder for your observations)
data: list[int | float] = [...]

# Apply both subsample strategies
phi = phitter.Phitter(
    data=data,
    subsample_size=10000,
    subsample_estimation_size=10000,
)
phi.fit(n_workers=4)
```
Specifying these parameters can significantly reduce computation time, particularly when the original dataset is very large.
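To gauge the speedup on data of this scale, a rough timing harness such as the following can be used. The synthetic normal sample and the timing code are assumptions for illustration; the constructor and `fit` calls mirror the example above.

```python
import time

import numpy as np
import phitter

# Synthetic one-million-point sample (illustrative; any large dataset works)
rng = np.random.default_rng(seed=42)
data: list[float] = rng.normal(loc=10.0, scale=2.0, size=1_000_000).tolist()

# Fit with both subsample strategies and measure elapsed wall-clock time
start = time.perf_counter()
phi = phitter.Phitter(
    data=data,
    subsample_size=10_000,
    subsample_estimation_size=10_000,
)
phi.fit(n_workers=4)
print(f"Fitted {len(data):,} observations in {time.perf_counter() - start:.1f} s")
```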
Example
This tutorial shows how to fit one million data points in under 20 seconds on a standard Google Colab instance.
| Tutorial | Notebooks |
|---|---|
| Fit Accelerate [Sample>100K] | |