Aggregation

Description

Aggregation gathers information for statistical analysis and expresses it in a summarized form, with one summary for each category in the data.

Why to use

Numerical Analysis – Data Preparation

When to use

When you want to collect specific information about particular groups based on specific variables.

When not to use

On textual data.


Prerequisites

The data should be numerical or categorical.

Input

Any dataset that contains categorical as well as numerical data.

Output

Aggregated numerical or categorical data.

Statistical Methods used

  • Sum
  • Mean
  • Mode
  • Minimum
  • Maximum
  • Count
  • Standard Deviation
  • Variance

Limitations

Aggregation alone is sometimes not enough because it provides only a single-level analysis. You may need to combine it with other methods to get accurate results.


Aggregation is located under Model Studio, in Data Preparation, in the task pane on the left. Drag and drop the algorithm onto the canvas to use it. Click the algorithm to view and select the properties for analysis. Refer to Properties of Aggregation.

Aggregation is a group-by algorithm in which the given data is grouped by a categorical variable such as name, date, color, or educational level. The data that is summarized within each group is usually numerical and is selected in the Aggregate Function property. You cannot use this algorithm unless you have made a selection in the GroupBy property.
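The group-by idea described above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library, not the product's implementation. The sample rows and the `group_by` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: (color, price). Invented for illustration only.
rows = [
    ("red", 10.0),
    ("blue", 4.0),
    ("red", 6.0),
    ("blue", 8.0),
    ("green", 5.0),
]


def group_by(pairs):
    """Collect the numerical values that share the same categorical key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Step 1: group the numerical values by the categorical variable.
groups = group_by(rows)

# Step 2: apply one aggregate function (here, the sum) to every group.
totals = {key: sum(values) for key, values in groups.items()}
print(totals)
```

Without the grouping step there is nothing to summarize per category, which is why a GroupBy selection is required before any aggregate can be computed.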

Properties of Aggregation

The available properties of Aggregation are as shown in the figure given below.

The table given below describes different fields present on the properties of Aggregation.

| Field | Description | Remark |
|---|---|---|
| Task Name | It displays the name of the selected task. | You can click the text field to edit the name of the task as required. |
| GroupBy | It allows you to select the variables by which you want to group the data. | Multiple variables can be selected. You can group by numerical as well as categorical variables. |
| Aggregate Function | It allows you to select the data that is to be aggregated. | Multiple functions can be selected. The same data can be aggregated by different statistical measures at the same time. You can aggregate numerical as well as categorical data. Numerical data is aggregated by Sum, Mean, Mode, Minimum, Maximum, Count, Standard Deviation, and Variance. Categorical data is aggregated by Minimum, Maximum, and Count. |
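As a rough sketch of how several aggregate functions can be applied to the same groups at once, the following Python example summarizes one numerical and one categorical column per group. The rows, column names, and output keys are hypothetical. Treating Minimum and Maximum on categorical data as alphabetical ordering is an assumption, because the source does not state how the product orders categories.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (department, salary, grade). Invented for illustration.
rows = [
    ("sales", 300, "B"),
    ("sales", 500, "A"),
    ("ops", 400, "C"),
    ("ops", 200, "A"),
    ("ops", 600, "B"),
]

salaries = defaultdict(list)  # numerical column, grouped by department
grades = defaultdict(list)    # categorical column, grouped by department
for dept, salary, grade in rows:
    salaries[dept].append(salary)
    grades[dept].append(grade)

summary = {}
for dept in salaries:
    summary[dept] = {
        # Numerical data: several measures on the same column at once.
        "salary_sum": sum(salaries[dept]),
        "salary_mean": mean(salaries[dept]),
        "salary_max": max(salaries[dept]),
        # Categorical data: only Minimum, Maximum, and Count apply.
        # Assumption: min/max follow alphabetical order.
        "grade_min": min(grades[dept]),
        "grade_max": max(grades[dept]),
        "grade_count": len(grades[dept]),
    }

for dept, stats in summary.items():
    print(dept, stats)
```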

Interpretation from Aggregation

The aggregation algorithm can be used on either the numerical data or the categorical data. The table given below describes the results for Aggregation.

| Aggregation Method | Result | Remark |
|---|---|---|
| Sum | It aggregates the given data by the sum of the values in that column. | It can be performed only on numerical data. |
| Mean | It aggregates the given data by the mean of the values in that column. | It can be performed only on numerical data. |
| Mode | It aggregates the given data by the mode of the values in that column. | It can be performed only on numerical data. |
| Minimum | It aggregates the given data by the minimum value in that column. | It can be performed on numerical as well as categorical data. |
| Maximum | It aggregates the given data by the maximum value in that column. | It can be performed on numerical as well as categorical data. |
| Count | It aggregates the given data by the count of the values in that column. | It can be performed on numerical as well as categorical data. |
| Standard Deviation | It aggregates the given data by the standard deviation of the values in that column. | It can be performed only on numerical data. |
| Variance | It aggregates the given data by the variance of the values in that column. | It can be performed only on numerical data. |
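To make the eight methods concrete, the sketch below computes each of them for the values of a single group using Python's `statistics` module. The values are hypothetical. The source does not say whether the product reports the sample or the population form of Standard Deviation and Variance, so both are shown here as an assumption-free comparison rather than a statement about the product.

```python
from statistics import mean, mode, pstdev, pvariance, stdev, variance

# Hypothetical numerical values that belong to one group.
values = [2, 4, 4, 4, 5, 5, 7, 9]

result = {
    "sum": sum(values),
    "mean": mean(values),
    "mode": mode(values),            # most frequent value
    "minimum": min(values),
    "maximum": max(values),
    "count": len(values),
    # Sample forms divide by (n - 1); population forms divide by n.
    "std_sample": stdev(values),
    "std_population": pstdev(values),
    "variance_sample": variance(values),
    "variance_population": pvariance(values),
}

for name, value in result.items():
    print(f"{name}: {value}")
```

The sample and population results differ noticeably on small groups, so it is worth checking which form a tool uses before comparing its output with figures from another source.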

Example of Aggregation

The figure given below displays the output of aggregation performed on sample data. The number of deaths (numerical data) in each US county is aggregated by sum, mean, standard deviation, and maximum. The data is grouped by the name of the county and the date (both categorical data).

| Field | Result |
|---|---|
| county | It displays the name of the US county whose number of deaths is aggregated. |
| deaths_Aggr_0 | It displays the sum of the deaths in that county on a particular date. |
| date | It displays the date for which the data is aggregated. |
| deaths_Aggr_1 | It displays the mean number of deaths in that county on a particular date. |
| deaths_Aggr_2 | It displays the standard deviation of the deaths in that county on a particular date. |
| deaths_Aggr_3 | It displays the maximum number of deaths in that county on a particular date. |
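The shape of this output can be imitated with a short Python sketch that groups by the pair (county, date) and emits one column per aggregate. The records below are invented placeholders, not the sample data from the figure. The `deaths_Aggr_0` to `deaths_Aggr_3` keys simply mirror the field names above, and the use of the sample standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical records: (county, date, deaths). Values are invented.
records = [
    ("County A", "2020-04-01", 2),
    ("County A", "2020-04-01", 4),
    ("County A", "2020-04-02", 3),
    ("County A", "2020-04-02", 5),
    ("County B", "2020-04-01", 1),
    ("County B", "2020-04-01", 1),
]

# Group the deaths by the two GroupBy variables together.
groups = defaultdict(list)
for county, date, deaths in records:
    groups[(county, date)].append(deaths)

output = []
for (county, date), values in sorted(groups.items()):
    output.append({
        "county": county,
        "date": date,
        "deaths_Aggr_0": sum(values),    # sum
        "deaths_Aggr_1": mean(values),   # mean
        # Sample standard deviation needs at least two values per group.
        "deaths_Aggr_2": stdev(values) if len(values) > 1 else None,
        "deaths_Aggr_3": max(values),    # maximum
    })

for row in output:
    print(row)
```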
