Missing Value Imputation | |
---|---|
Description | Missing value imputation is the attribution of values in place of missing values in a real-world dataset. |
Why to use | Numerical Analysis – Data Preparation |
When to use | When there are missing values in the data. |
When not to use | |
Prerequisites | It should be used on numerical data. |
Input | |
Output | |
Statistical Methods used | |
Limitations | |
Missing Value Imputation is located under Model Studio, in Data Preparation, in the task pane on the left. Drag and drop the algorithm onto the canvas to use it. Click the algorithm to view and select the properties for analysis.
Refer to Properties of Missing Value Imputation.
Data can end up with missing values in many ways. For example:
- A 2-bedroom house would not include an answer for "How large is the third bedroom?"
- Someone being surveyed may choose not to share their income.
Python libraries represent missing numbers as NaN, which is short for "not a number".
Most libraries (including scikit-learn) raise an error if you try to build a model using data with missing values, so you need to choose a strategy for imputing them.
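As a minimal sketch of the behaviour described above, the following snippet counts the NaN entries in a small table and then shows a scikit-learn estimator rejecting the incomplete data. The column names and values are hypothetical and only illustrate the two examples given earlier; `LinearRegression` is used here simply as one estimator that does not accept NaN.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical survey data; np.nan marks answers that were not given.
df = pd.DataFrame({
    "bedrooms": [2, 3, 2, 4],
    "third_bedroom_sqft": [np.nan, 120.0, np.nan, 150.0],
    "income": [52000.0, np.nan, 61000.0, 58000.0],
})

# Count the missing entries in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Select the rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(len(rows_with_missing))

# Fitting an estimator on data that still contains NaN raises a ValueError.
target = [100.0, 150.0, 110.0, 200.0]  # hypothetical target values
try:
    LinearRegression().fit(df[["bedrooms", "income"]], target)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("Model fitting failed:", err)
```

Checking the per-column counts first helps decide which columns need imputation and which strategy is reasonable for each.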
Datasets often contain missing values. Such datasets are incompatible with scikit-learn estimators because these estimators assume that all values are meaningful numerical values. If we eliminate the rows that contain missing values, we may lose important and relevant data. Missing value imputation instead fills the gaps by inferring each value from the known part of the data.
Missing value imputation can be univariate or multivariate. In univariate imputation, the missing value is replaced by a constant value or by a statistic, such as the mean or the median, of the corresponding column. In multivariate imputation, each feature with missing values is modeled as a function of the other features, and this estimate is then used for imputation.
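The difference between the two approaches can be sketched with scikit-learn's `SimpleImputer` (univariate) and `IterativeImputer` (multivariate). This is an illustration under assumptions, not a description of how the Missing Value Imputation task is implemented internally; the small array is hypothetical, and `IterativeImputer` is still marked experimental in scikit-learn, which is why it needs the extra enabling import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

# Hypothetical data: the second column is twice the first, with one value missing.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])

# Univariate: the gap is filled with the mean of the observed values
# in the same column, (2 + 4 + 6 + 10) / 4 = 5.5.
univariate = SimpleImputer(strategy="mean")
X_uni = univariate.fit_transform(X)
print(X_uni[3, 1])

# Multivariate: the column with the gap is modeled as a function of the
# other column, and the model's prediction is used as the imputed value.
multivariate = IterativeImputer(random_state=0)
X_multi = multivariate.fit_transform(X)
print(X_multi[3, 1])
```

Because the multivariate imputer uses the relationship between the columns, its estimate follows the trend in the data, whereas the column mean ignores the other feature.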
Properties of Missing Value Imputation
The available properties of Missing Value Imputation are shown in the figure given below.
The table given below describes the fields present in the properties of Missing Value Imputation.
Field | Description | Remark |
---|---|---|
Task Name | It displays the name of the selected task. | You can click the text field to edit or modify the name of the task as required. |
Continuous Variables | It allows you to select continuous variables to perform missing value imputation. | |
Allow Single Select | It allows you to select the check box if you want to impute individual missing values separately for the selected data fields. | |
Select Imputation Method | It allows you to select the imputation method from the drop-down list to apply to the selected data fields. | The available imputation methods are described under Interpretation from Missing Value Imputation. |
Categorical Variables | It allows you to select categorical variables to perform missing value imputation. | |
Allow Single Select | It allows you to select the check box if you want to impute individual missing values separately for the selected data fields. | |
Select Imputation Method | It allows you to select the imputation method from the drop-down list to apply to the selected data fields. | The available imputation methods are described under Interpretation from Missing Value Imputation. |
Interpretation from Missing Value Imputation
The table given below describes the result of each imputation method.
Imputation Method | Result | Remark |
---|---|---|
Mean | It replaces the missing values in each column with the mean of the non-missing values in that column, independently of the other columns. | |
Median | It replaces the missing values in each column with the median of the non-missing values in that column, independently of the other columns. | |
Min | It replaces the missing values with the minimum value present in that column. | — |
Max | It replaces the missing values with the maximum value present in that column. | — |
Remove | It discards the rows that contain missing values. | |
Constant | It replaces the missing values with the constant value that you have entered. | |
Mode (only for categorical variables) | It replaces the missing values with the mode of the values present in that column. | The distribution of data can become highly biased by mode imputation. |
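The methods in the table above can be approximated in pandas as follows. This is a sketch of what each method does to a column, assuming a hypothetical `age` (continuous) and `city` (categorical) column; it is not the product's own implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 70.0],
    "city": ["Pune", "Pune", None, "Mumbai"],
})

age = df["age"]

# Each statistic is computed from the non-missing values of the column only.
imputed = {
    "mean": age.fillna(age.mean()),      # (20 + 30 + 70) / 3 = 40.0
    "median": age.fillna(age.median()),  # middle of 20, 30, 70 = 30.0
    "min": age.fillna(age.min()),        # 20.0
    "max": age.fillna(age.max()),        # 70.0
    "constant": age.fillna(0.0),         # user-supplied constant, here 0.0
}

# Remove: discard the rows where the selected column is missing.
removed = df.dropna(subset=["age"])

# Mode (categorical only): fill with the most frequent value in the column.
city_mode = df["city"].fillna(df["city"].mode().iloc[0])

for name, series in imputed.items():
    print(name, series.iloc[1])
print(len(removed), city_mode.iloc[2])
```

Note that Remove shortens the dataset, while every other method keeps all rows and changes only the missing entries.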