Density Based Clustering

Description

It groups the given data into clusters on the idea that a cluster is a continuous region of high point density in the data space, separated from other clusters by continuous regions of low density.

Why to use

It separates regions with a high density of observations from regions where observations are sparse. DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a widely used density-based algorithm, can also find clusters of arbitrary shape.

When to use

  • When the number of clusters is not known.
  • When the data contains noise or outliers.
  • To separate data points of high density from data points of low density.

When not to use

When the number of clusters is known.

Prerequisites

Input data must be of text type and must not contain special characters or numbers.
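
The prerequisite above can be met by cleaning the text before it is passed to the algorithm. The following is a minimal sketch, not the product's own preprocessing: the function name clean_text is hypothetical, and it assumes that "special characters and numbers" means anything other than ASCII letters and whitespace.

```python
import re

def clean_text(text: str) -> str:
    """Keep ASCII letters and whitespace only, then collapse repeated spaces."""
    # Assumption: every character that is not A-Z, a-z, or whitespace
    # counts as a "special character or number" and is replaced by a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse the whitespace runs left behind by the removed characters.
    return re.sub(r"\s+", " ", letters_only).strip()

# Hypothetical review texts, used only for illustration.
reviews = ["Great guitar!!! 10/10", "Strings broke after 2 weeks :("]
cleaned = [clean_text(r) for r in reviews]
print(cleaned)
```

Text in other scripts or with accented letters would need a wider character class than the one assumed here.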

Input

Textual Data

Output

Data divided into clusters.

Statistical Methods used

  • Ball tree
  • Kd tree
  • Brute

Limitations

It does not work well in the case of high-dimensional data or with clusters of varying densities.

Density Based Clustering is located under Textual Analysis in Clustering, in the left task pane. Drag and drop the algorithm onto the canvas to use it. Click the algorithm to view and select different properties for analysis.

Refer to Properties of Density Based Clustering.

Density-based clustering is an unsupervised learning method. It identifies clusters as regions of high point density, separated from other clusters by regions of low point density. Points that fall in these low-density regions are treated as noise or outliers.
In density-based clustering, core samples of high point density are identified, and clusters are grown from them. This method is suitable for data whose clusters are of comparable density. Also, clusters found in density-based clustering can be of any shape, as opposed to the k-means method, where clusters are assumed to be convex.
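
The idea of growing clusters from core samples can be illustrated with a small, self-contained sketch of DBSCAN. It is an illustration under stated assumptions, not the product's implementation: it works on 2D numeric points (text would first have to be converted to numeric vectors, which is not shown), it counts a point as its own neighbor, and the names region_query and dbscan are chosen for this example only.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)      # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # i is a core sample: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core sample
            if labels[j] is not None:
                continue               # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels

# Two dense groups and one isolated point (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)
```

The number of clusters is never passed in; it follows from the density of the data, and the isolated point receives the noise label instead of being forced into a cluster.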

Properties of Density Based Clustering

The available properties of Density Based Clustering are shown in the figure given below.

The table given below describes the fields present in the properties of Density Based Clustering.

Field: Task Name
Description: It is the name of the task selected on the workbook canvas.
Remark: You can click the text field to edit or modify the name of the task as required.

Field: Text
Description: It allows you to select independent variables.
Remark:
  • You can select more than one variable.
  • You can select any type of variable.

Advanced

Field: Epsilon
Description: It allows you to enter the maximum distance between two data points for them to be considered in the neighborhood of each other.
Remark: The default value is 0.5.

Field: Minimum Number of Samples
Description: It allows you to enter the minimum number of samples that must lie in the neighborhood of a point for it to be treated as a core sample while assigning clusters.
Remark: The default value is 10.

Field: Algorithm
Description: It allows you to select the algorithm to be used for searching the nearest neighbors while assigning clusters.
Remark: The available options are –
  • Auto – It determines the algorithm among Ball_tree, Kd_tree, and Brute that is best suited for the input dataset.
  • Ball_tree – It recursively divides the data points into two groups, each enclosed by a ball (a circle in two dimensions, a sphere in three).
  • Kd_tree – It recursively divides the data points into two sets at each node by splitting along one coordinate axis.
  • Brute – It uses the brute-force method to compute distances between all pairs of data points.
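
To make the interplay of Epsilon and Minimum Number of Samples concrete, the sketch below flags core samples using a brute-force search over all pairs of points. It is a hedged illustration only: the function name core_sample_mask is hypothetical, the data is 2D and made up, and counting a point as its own neighbor is an assumption about the counting rule.

```python
from math import dist

def core_sample_mask(points, eps, min_samples):
    """Brute-force search: flag points with >= min_samples neighbors within eps."""
    mask = []
    for p in points:
        # Count every point within eps of p; p itself is counted (an assumption).
        count = sum(1 for q in points if dist(p, q) <= eps)
        mask.append(count >= min_samples)
    return mask

# Three points close together and one far away (illustrative data).
points = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (2.0, 2.0)]

default_eps = core_sample_mask(points, eps=0.5, min_samples=3)
smaller_eps = core_sample_mask(points, eps=0.35, min_samples=3)
stricter_min = core_sample_mask(points, eps=0.5, min_samples=4)
print(default_eps, smaller_eps, stricter_min)
```

Lowering Epsilon or raising Minimum Number of Samples leaves fewer core samples, so more points end up as noise. The tree-based options are expected to find the same neighbors as the brute-force search; they differ in how the search is organized, not in the result.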

Example of Density Based Clustering

Consider a dataset of musical instrument reviews. A snippet of input data is shown in the figure given below.

After applying Density Based Clustering, the following results are displayed.

As seen in the above figure, the size of each cluster is displayed along with the Silhouette Score.
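
The Silhouette Score measures how well each point fits its own cluster compared with the nearest other cluster. For a point, let a be the mean distance to the other members of its cluster and b the mean distance to the members of the nearest other cluster; its silhouette is (b - a) / max(a, b), and the score is the mean over all points. The sketch below computes this on 2D points. How the product treats noise points when computing the score is not stated in this document, so the example contains none; the function name and data are illustrative.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)         # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of p's own cluster
        # (the sum includes dist(p, p) == 0, hence the division by len - 1)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])   # compact, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])    # each cluster spans both groups
print(good, bad)
```

Scores range from -1 to 1. Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping or poorly assigned clusters.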
