Parametric Distribution Fitting

Description

Parametric distribution fitting is the process used to select a statistical distribution that best fits a data set.

Why to use

Statistical Analysis

When to use

To decide the distribution best suited for data description.

When not to use

When the distribution of data is known.

Prerequisites


Input

A dataset containing continuous data.

Output

Best fit distributions sorted by the Goodness of Fit tests.

Statistical Methods used

Limitations

It can be used only on continuous data. It doesn’t work on categorical/textual data.

Parametric Distribution Fitting is located under Model Studio, in Statistical Analysis, in the left task pane. Drag and drop the algorithm onto the canvas to use it. Click the algorithm to view and select different properties for analysis.

Refer to Properties of Parametric Distribution Fitting.

Distribution fitting proceeds in three steps. First, the data is matched against the selected probability distributions using parametric distribution fitting; for each distribution, this establishes the set of parameters that best describes the data. Next, one of several Goodness-of-Fit tests is used to measure the closeness of each fit. Finally, the highest-ranked fit is selected to represent the data.
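The three steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the product's implementation: the helper names (`fit_normal`, `fit_exponential`, `fit_uniform`, `ks_statistic`, `rank_fits`) and the synthetic `radius` data are hypothetical, only three of the listed distributions are covered, and the fits are ranked by the Kolmogorov-Smirnov statistic rather than by p-value. For a fixed sample size, a smaller statistic corresponds to a larger p-value, so the ordering is the same.

```python
import math
import random
import statistics


def fit_normal(data):
    # Maximum-likelihood estimates: sample mean and population standard deviation.
    dist = statistics.NormalDist(statistics.fmean(data), statistics.pstdev(data))
    return {"mean": dist.mean, "std": dist.stdev}, dist.cdf


def fit_exponential(data):
    # Maximum-likelihood estimate for an exponential with location fixed at 0.
    rate = 1.0 / statistics.fmean(data)
    return {"rate": rate}, lambda x: 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def fit_uniform(data):
    # Maximum-likelihood estimates: the observed minimum and maximum.
    lo, hi = min(data), max(data)
    return {"low": lo, "high": hi}, lambda x: min(1.0, max(0.0, (x - lo) / (hi - lo)))


def ks_statistic(data, cdf):
    # Largest gap between the empirical CDF and the fitted CDF.
    xs = sorted(data)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))


def rank_fits(data, fitters):
    # Step 1: fit each candidate. Step 2: score it. Step 3: sort, best fit first.
    results = []
    for name, fitter in fitters.items():
        params, cdf = fitter(data)
        results.append((name, params, ks_statistic(data, cdf)))
    return sorted(results, key=lambda r: r[2])


# Hypothetical data standing in for a numerical column such as a disk radius.
random.seed(0)
radius = [random.gauss(10.0, 0.5) for _ in range(500)]

ranking = rank_fits(
    radius,
    {"Normal": fit_normal, "Exponential": fit_exponential, "Uniform": fit_uniform},
)
for name, params, d in ranking:
    print(f"{name:12s} KS statistic = {d:.4f}  params = {params}")
```

Because the sample is drawn from a normal distribution, the Normal fit should appear first in the printed ranking. Note that a p-value computed after estimating parameters from the same data is only approximate, which is one reason to treat any such ranking as a guide rather than a formal test.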

Properties of Parametric Distribution Fitting

The available properties of Parametric Distribution Fitting are shown in the figure below.

The table given below describes the different fields present on the Properties pane of Parametric Distribution Fitting.

Field

Description

Remark

Task Name

It is the name of the task selected on the workbook canvas.

You can click the text field to edit or modify the name of the task as required.

Data Column

It allows you to select the column on which you want to apply Parametric Distribution Fitting.

Only a numerical column can be selected.

Distributions to Fit

It allows you to select the various distributions to fit the data.

  • The available values are:
      o   Beta
      o   Cauchy
      o   Exponential
      o   Exponentially modified Normal
      o   Exponentiated Weibull
      o   F
      o   Gamma
      o   Log-normal
      o   Logistic
      o   Normal
      o   Pearson type III
      o   Uniform
      o   Weibull maximum
      o   Weibull minimum
  • You can select more than one value.
  • In the output, scores of the selected fits are displayed.

Rank Distributions by Goodness of Fit Tests

It allows you to rank the best fit distributions by the Goodness of Fit tests.


  • The available Goodness of Fit tests are:
      o   Kolmogorov-Smirnov
      o   Chi-square
      o   Anderson-Darling
  • You can select more than one test.
  • In the output, the distributions are sorted by the Goodness of Fit tests.

Advanced

Node Configuration

It allows you to select the instance of the AWS server, giving you control over the execution of a task in a workbook or workflow.

For more details, refer to Worker Node Configuration.

Example of Parametric Distribution Fitting

Consider a dataset of disk radius measurements. A snippet of the input data is shown in the figure given below.

The Properties selected for Parametric Distribution Fitting are shown in the table below.

Property

Value

Data Column

radius

Distributions to Fit

All

Rank Distributions by Goodness of Fit Tests

Kolmogorov-Smirnov

The various elements of the Result page are shown in the figures below.

The Result page displays Descriptive Statistics and the Shapiro-Wilk Test for Normality, as shown in the figure below.

It also displays Distributions sorted by the Goodness of Fit scores, as shown in the figure below.

A Histogram of the frequency of values for the selected column with Sample Distribution and Beta Distribution is also displayed.

The Result page also displays the PP Plot and the QQ Plot, as shown in the figure below.

PP Plot (Probability-Probability Plot) is a probability plot to assess how closely two datasets agree. It plots the two cumulative distribution functions against each other. PP Plots are used to evaluate the skewness of a distribution.

QQ Plot (Quantile-Quantile Plot) is a plot of the quantiles of two distributions against each other. A quantile is a value below which a given fraction of the values in the dataset fall. For example, a QQ Plot can show the quantiles of a standard normal distribution against the corresponding quantiles of the observed data.
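The coordinates behind both plots can be computed with the Python standard library. The sketch below is illustrative only: the function name `pp_qq_points`, the synthetic `sample`, and the choice of a fitted normal as the reference distribution are assumptions, and the plotting position (i + 0.5) / n is one common convention among several.

```python
import random
import statistics


def pp_qq_points(data, dist):
    """Return (PP points, QQ points) for data against a reference distribution."""
    xs = sorted(data)
    n = len(xs)
    # Empirical cumulative probability assigned to each sorted observation.
    probs = [(i + 0.5) / n for i in range(n)]
    # PP plot: theoretical cumulative probability vs. empirical cumulative probability.
    pp = [(dist.cdf(x), p) for x, p in zip(xs, probs)]
    # QQ plot: theoretical quantile vs. observed value at the same probability.
    qq = [(dist.inv_cdf(p), x) for x, p in zip(xs, probs)]
    return pp, qq


# Hypothetical sample and a normal distribution fitted to it.
random.seed(1)
sample = [random.gauss(10.0, 0.5) for _ in range(200)]
fitted = statistics.NormalDist.from_samples(sample)

pp, qq = pp_qq_points(sample, fitted)
max_pp_gap = max(abs(theoretical - empirical) for theoretical, empirical in pp)
print(f"Largest PP deviation from the diagonal: {max_pp_gap:.4f}")
```

When the reference distribution describes the data well, both sets of points lie close to the diagonal line y = x. PP points are most sensitive to differences near the centre of the distribution, while QQ points emphasise differences in the tails.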

Some important points about Parametric Distribution Fitting:

  • This feature helps you determine which distribution best fits your data.
  • The distributions are sorted in descending order of the p-value for the selected test.
  • Not all distributions have the same set of parameters.
  • The Anderson-Darling test can be used only with the Normal, Exponential, and Logistic distributions.
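To illustrate the Anderson-Darling test, the sketch below computes its statistic for a fitted normal distribution using only the Python standard library. The function name `anderson_darling_normal` and both samples are hypothetical. The sketch stops at the statistic, since turning it into a decision requires critical values that depend on the distribution being tested. That dependence is presumably why the test is offered for only a few distributions, though the source does not state the reason.

```python
import math
import random
import statistics


def anderson_darling_normal(data):
    """Anderson-Darling statistic A^2 for data against a fitted normal distribution."""
    xs = sorted(data)
    n = len(xs)
    dist = statistics.NormalDist.from_samples(xs)
    # Clamp the CDF values away from 0 and 1 so the logarithms stay finite.
    eps = 1e-12
    cdf = [min(1.0 - eps, max(eps, dist.cdf(x))) for x in xs]
    # A^2 = -n - (1/n) * sum_{i=1..n} (2i - 1) * [ln F(x_i) + ln(1 - F(x_{n+1-i}))]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    return -n - total / n


# Hypothetical samples: one normal, one strongly skewed.
random.seed(2)
normal_sample = [random.gauss(10.0, 0.5) for _ in range(300)]
skewed_sample = [random.expovariate(1.0) for _ in range(300)]

a2_normal = anderson_darling_normal(normal_sample)
a2_skewed = anderson_darling_normal(skewed_sample)
print(f"A^2 for the normal sample: {a2_normal:.3f}")
print(f"A^2 for the skewed sample: {a2_skewed:.3f}")
```

A larger statistic indicates a worse fit, so the skewed sample should score noticeably higher than the normal one. The statistic weights the tails of the distribution more heavily than the Kolmogorov-Smirnov statistic does.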
