Missing Value Imputation

Description

Missing value imputation is the substitution of estimated values for the missing values in a real-world dataset.

Why to use

Numerical Analysis – Data Preparation 

When to use

When there are missing values in the data. 

When not to use

  • On textual data.
  • When there are no missing values.

Prerequisites

It should be used on numerical data.

Input


Output


In this example, the missing data is imputed by the mean of the respective column values.
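Column-wise mean imputation can be sketched on a small hypothetical table. The column names and values below are assumptions for illustration only and are not the data from the example above; `impute_column_mean` is a hypothetical helper, not a Model Studio function.

```python
import math
from statistics import mean

# Hypothetical input table: each key is a column, NaN marks a missing cell.
table = {
    "age":    [20.0, math.nan, 30.0, 40.0],
    "income": [50.0, 60.0, math.nan, 70.0],
}

def impute_column_mean(values):
    """Replace NaN entries with the mean of the non-missing entries."""
    observed = [v for v in values if not math.isnan(v)]
    fill = mean(observed)
    return [fill if math.isnan(v) else v for v in values]

# Each column is imputed separately, using only its own values.
imputed = {name: impute_column_mean(col) for name, col in table.items()}
print(imputed)
```

Each column is filled from its own observed values only, so the result for one column does not depend on any other column.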

Statistical Methods used

  • Mean
  • Median
  • Min
  • Max
  • Remove
  • Constant

Limitations

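  • It is not very accurate, because every missing value in a column receives the same fill value regardless of the other columns.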
  • It does not account for the uncertainty in the imputations.
  • It can introduce bias in the data.
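One way to see the last two limitations is to compare the spread of a column before and after mean imputation. The sketch below uses made-up values (an assumption for illustration): filling gaps with the mean keeps the mean unchanged but shrinks the variance, so the imputed column looks more certain than the data really is.

```python
import math
from statistics import mean, pvariance

# Hypothetical column: four observed values and two gaps.
observed = [10.0, 20.0, 30.0, 40.0]
column = observed + [math.nan, math.nan]

# Mean imputation: every gap receives the same value.
fill = mean(observed)
imputed = [fill if math.isnan(v) else v for v in column]

var_before = pvariance(observed)  # spread of the observed values only
var_after = pvariance(imputed)    # spread after the gaps are filled

print(mean(observed), mean(imputed))
print(var_before, var_after)
```

The filled-in values sit exactly on the mean, so they add no spread of their own; any later analysis that relies on the variance is biased toward smaller values.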

Missing Value Imputation is located under Model Studio in Data Preparation, in the task pane on the left. Drag and drop the algorithm onto the canvas to use it. Click the algorithm to view and select different properties for analysis.

Refer to Properties of Missing Value Imputation.






There are many ways data can end up with missing values. For example:

  • A 2-bedroom house would not include an answer for "How large is the third bedroom?"
  • Someone being surveyed may choose not to share their income.

Python libraries represent missing numbers as NaN, which is short for "not a number".
Most libraries (including scikit-learn) raise an error if you try to build a model using data with missing values. So, you need to choose one of the strategies to impute the missing values.
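The behavior of NaN can be sketched with the Python standard library alone. The values below are hypothetical; the sketch shows only how NaN propagates through arithmetic and how to detect it, and it does not reproduce the scikit-learn error itself.

```python
import math

# Hypothetical survey column: the second respondent did not share an income.
incomes = [52000.0, math.nan, 61000.0]

# NaN is not equal to anything, including itself, so use math.isnan to detect it.
nan_equals_itself = math.nan == math.nan
has_missing = any(math.isnan(v) for v in incomes)

# Arithmetic that touches a NaN yields NaN, so a naive average is unusable.
naive_mean = sum(incomes) / len(incomes)

print(nan_equals_itself, has_missing, naive_mean)
```

Because a single NaN contaminates sums and averages, missing values have to be detected and handled before any statistic or model is computed.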
Datasets often contain missing values. Such datasets are incompatible with scikit-learn estimators, because these estimators assume that all values are meaningful numerical values. If we eliminate the rows that contain missing values, we may lose important and relevant data. Hence, missing value imputation fills the gaps by inferring the values from the known part of the data.
Missing value imputation can be univariate or multivariate. In univariate imputation, the missing value is replaced by a constant value or a statistical value like the mean or the median of the corresponding column. In multivariate imputation, each feature with missing values is modeled as a function of the other features, and then this estimate is used for imputation.
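The difference between the two approaches can be sketched on a two-column example. The data is hypothetical, and a simple least-squares line stands in for the "function of other features"; this is one possible model chosen for illustration, not the method used by any particular library.

```python
import math
from statistics import mean

# Hypothetical data: price is missing in the last row.
area = [50.0, 60.0, 70.0, 80.0]
price = [100.0, 120.0, 140.0, math.nan]

# Rows where the target column is observed.
complete = [(x, y) for x, y in zip(area, price) if not math.isnan(y)]
xs = [x for x, _ in complete]
ys = [y for _, y in complete]

# Univariate: use only the price column itself (here, its mean).
univariate_fill = mean(ys)

# Multivariate: model price as a linear function of area (least squares),
# then predict the missing price from the area of that row.
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in complete) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar
multivariate_fill = intercept + slope * area[3]

print(univariate_fill, multivariate_fill)
```

The univariate fill ignores the fact that the last row has the largest area, while the multivariate fill uses that information to estimate a higher price.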

Properties of Missing Value Imputation

The available properties of Missing Value Imputation are as shown in the figure given below.

The table given below describes the fields available in the properties of Missing Value Imputation.

Field

Description

Remark

Task Name

It displays the name of the selected task.

You can click the text field to edit or modify the name of the task as required.

Continuous Variables

It allows you to select continuous variables to perform missing value imputation.

  • Multiple data fields can be selected.
  • Only the numerical data fields selected for the reader are visible.

Allow Single Select
(For Continuous Variables)

It allows you to impute individual missing values separately, for selected data fields.

  • Point to the data field and click the gear icon.
  • The available imputation methods are:
  • Mean
  • Median
  • Min
  • Max
  • Remove
  • Constant (If selected, enter the constant value)

Select Imputation Method
(For Continuous Variables)

It allows you to select the imputation method from the drop-down list to apply to the selected data fields.

The available imputation methods are:

  • Mean
  • Median
  • Min
  • Max
  • Remove
  • Constant (If selected, enter the constant value)

Categorical Variables

It allows you to select categorical variables to perform missing value imputation.

  • Multiple data fields can be selected.
  • Only the categorical data fields selected for the reader are visible.

Allow Single Select
(For Categorical Variables)

Select this check box to impute individual missing values separately for the selected data fields.

  • Point to the data field and click the gear icon.
  • The available imputation methods are:
  • Mode
  • Remove
  • Constant (If selected, enter the constant value)

Select Imputation Method
(For Categorical Variables)

It allows you to select the imputation method from the drop-down list to apply to the selected data fields.

The available imputation methods are:

  • Mode
  • Remove
  • Constant (If selected, enter the constant value)

Interpretation from Missing Value Imputation

The table given below describes the result of each imputation method.

Imputation Method

Result

Remark

Mean

It replaces the missing values with the mean of the non-missing values within each column separately and independently from the others.

  • It only works on the column level.
  • It can only be used with numerical data.

Median

It replaces the missing values with the median of the non-missing values within each column separately and independently from the others.

  • It only works on the column level.
  • It can only be used with numerical data.

Min

It replaces the missing values with the minimum value present in that column.

Max

It replaces the missing values with the maximum value present in that column.

Remove

It discards the rows that contain missing values.

  • It can be used for a small amount of missing data (20-30%).
  • Removing a large amount of data may cause large variations in the results.
  • If a column has a large amount of missing data, it is recommended to remove the complete column. To remove a column, do not select it for analysis.

Constant

It replaces the missing values with the constant value that you have entered.

  • It works well with categorical features.
  • It can introduce bias in the data.

Mode (only for categorical variables)

It replaces the missing values with the mode of the values present in that column.

Mode imputation can heavily skew the distribution of the data.
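The methods in the table can be sketched for a single column in plain Python. The function `impute`, its parameters, and the sample values are hypothetical and only illustrate the behavior described above; they are not the Model Studio implementation. Note that "remove" here drops the missing entries of one column, whereas on a full dataset it discards whole rows.

```python
import math
from statistics import mean, median, mode

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def impute(values, method, constant=None):
    """Apply one imputation method to a single column (a list of values)."""
    observed = [v for v in values if not is_missing(v)]
    if method == "remove":
        return observed  # drop the missing entries
    if method == "constant":
        fill = constant  # user-supplied value
    else:
        # The statistic is computed from the non-missing values only.
        statistic = {"mean": mean, "median": median,
                     "min": min, "max": max, "mode": mode}[method]
        fill = statistic(observed)
    return [fill if is_missing(v) else v for v in values]

# Hypothetical continuous and categorical columns.
ages = [20.0, math.nan, 30.0, 70.0]
colors = ["red", None, "blue", "red"]

for method in ("mean", "median", "min", "max", "remove"):
    print(method, impute(ages, method))
print("constant", impute(ages, "constant", constant=0.0))
print("mode", impute(colors, "mode"))
```

The mean and median differ here (40.0 versus 30.0) because the value 70.0 pulls the mean upward, which is one reason the choice of method matters for skewed columns.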

