DBSCAN

Description

  • DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
  • It is an unsupervised machine learning algorithm that groups data points in high-density regions into clusters and marks points lying in low-density regions as noise.

Why to use

To create data point clusters based on density.

When to use

When you want to group data points into clusters based on their density.

When not to use

For textual data

Prerequisites

  • The number of clusters need not be specified.
  • The selected variables need to be scaled before clustering.
  • The number of data points in a cluster should be greater than or equal to the dimension.
  • There should be at least two variables/features that can be selected as independent variables.

Input

Any numerical dataset containing unlabeled data

Output

  • Clustered Data with anomalies (noise points)
  • Cluster Plot
  • Silhouette Coefficient

Statistical Methods used

Limitations

  • Does not work well with high dimensional data
  • Choosing an epsilon value can be difficult
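
One common heuristic for choosing epsilon (a general DBSCAN practice, not a feature of this product) is a k-distance plot: sort every point's distance to its k-th nearest neighbor and look for the elbow where the curve bends sharply. A minimal scikit-learn sketch, assuming the scaled independent variables are in a NumPy array X:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k=5):
    """Plot each point's distance to its k-th nearest neighbor, sorted ascending.

    The 'elbow' of the resulting curve is a common starting point for epsilon;
    k is usually set to the intended minimum-sample value.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors(X)            # shape: (n_samples, k)
    kth_distances = np.sort(distances[:, -1])  # distance to the k-th neighbor
    plt.plot(kth_distances)
    plt.xlabel("Points sorted by k-distance")
    plt.ylabel(f"Distance to neighbor {k}")
    plt.show()
```

An epsilon near the elbow of this curve usually separates dense neighborhoods from sparse ones.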

DBSCAN is located under Machine Learning in Anomaly Detection, in the left task pane. Use the drag-and-drop method to use the algorithm on the canvas. Click the algorithm to view and select different properties for analysis.

Refer to Properties of DBSCAN.


Properties of DBSCAN

The available properties of DBSCAN are shown in the figure below.

The table below describes the different fields present on the properties of DBSCAN. For each field, a description is given along with remarks on its usage.

Task Name

It is the name of the task selected on the workbook canvas.

You can click the text field to edit or modify the task's name as required.

Independent Variable

It allows you to select the independent variables.

  • You can select multiple variables.
  • Make sure to apply scaling to each variable before running the algorithm (see the sketch below).
  • The available scaling options are Scaling, Normalization, and StandardScaler.
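
Scaling, Normalization, and StandardScaler are the product's own option names; assuming they correspond to the usual rescaling transforms, a minimal scikit-learn sketch of preparing the selected variables before clustering might look like this:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A few rows of selected independent variables (illustrative values)
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [6.3, 3.3, 6.0, 2.5]])

# StandardScaler: rescale each column to zero mean and unit variance
X_standard = StandardScaler().fit_transform(X)

# MinMaxScaler: rescale each column to the [0, 1] range (a common "normalization")
X_normalized = MinMaxScaler().fit_transform(X)
```

Scaling matters because epsilon is a single distance threshold applied to all variables at once.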

Advanced

It contains the advanced configuration properties of DBSCAN, described below.

Epsilon

It allows you to select the distance within which two data points are considered to be in each other's neighborhood.

  • It is the maximum distance (radius) from a core point within which a data point still belongs to the same cluster.
  • It is used to locate the neighbors of a data point and thus determines the density of data points in its vicinity.
  • The data points belonging to a cluster lie within a distance of epsilon from a core point of that cluster.
  • If epsilon is too small, a majority of data points are not clustered and are treated as noise. If it is too large, separate clusters merge, combining most points into the same cluster.
  • By default, the value of epsilon is 0.5.
  • You can select any positive float value. However, it is advisable to select a value between 0 and 1.

Minimum Sample

It allows you to select a minimum number of data points required to form a cluster.

  • It is the minimum number of samples in the neighborhood of a data point (including the data point itself) required for that point to be considered a core point.
  • The value for the minimum sample depends on the dataset.
  • By default, the value for the minimum sample is five (5). It means that a minimum of five data points is required to form a cluster.
  • You can select any positive integer as the minimum sample. The sketch below shows how it interacts with epsilon.
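
To see how Epsilon and Minimum Sample interact, the sketch below runs scikit-learn's DBSCAN (assumed here to behave like this task) on a small synthetic dataset with two settings; a smaller epsilon typically leaves more points labeled as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two well-separated blobs of points
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.6, random_state=0)

for eps, min_samples in [(0.2, 5), (0.5, 5)]:
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}, min_samples={min_samples}: "
          f"{n_clusters} clusters, {n_noise} noise points")
```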

Metric

It allows you to select the method used to measure the distance between two data points in the space of the selected independent variables.

  • You can choose any one of the following metrics: cityblock, cosine, euclidean, l1, l2, and manhattan. (A small comparison is sketched below.)
  • By default, the metric selected is euclidean.
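
The chosen metric changes how neighborhood distances are measured, so the same epsilon can produce different neighborhoods. A small sketch comparing the listed metrics on one pair of points, using scikit-learn's pairwise_distances (assumed to match the product's options):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[2.0, 4.0, 6.0]])

for metric in ["cityblock", "cosine", "euclidean", "l1", "l2", "manhattan"]:
    d = pairwise_distances(a, b, metric=metric)[0, 0]
    print(f"{metric:>10}: {d:.4f}")

# cityblock, l1, and manhattan all give 6.0 here; euclidean and l2 give ~3.7417;
# cosine measures angle rather than magnitude, so these parallel points give 0.0.
```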

Algorithm

It is used to select the algorithm used by the Nearest Neighbors module to compute inter-point distances and find the nearest neighbors.

  • You can choose any one of the following algorithms: auto, ball_tree, kd_tree, and brute.
  • By default, the auto option is selected. It means that DBSCAN decides which of the three algorithms (ball_tree, kd_tree, and brute) is the most appropriate for the data.

Power to Calculate Distance

It allows you to select the power (p) of the Minkowski metric used to calculate the distance between points.

  • A power of 1 corresponds to the manhattan (l1) distance and a power of 2 to the euclidean (l2) distance (see the small example below).
  • You can select any positive value as the power.
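
As a quick check of what the power does, the sketch below computes the Minkowski distance between two points for a few powers; p = 1 reproduces the manhattan distance and p = 2 the euclidean distance (assuming this property maps directly to the Minkowski power p):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

for p in (1, 2, 3):
    # Minkowski distance: (sum of |a_i - b_i| ** p) ** (1 / p)
    d = np.sum(np.abs(a - b) ** p) ** (1.0 / p)
    print(f"p={p}: {d:.4f}")

# p=1 -> 6.0000 (manhattan), p=2 -> 3.7417 (euclidean), p=3 -> 3.3019
```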

Leaf Size

It allows you to select the number of data points associated with a single leaf in the tree.

  • It applies to the tree-based algorithms (ball_tree and kd_tree) mentioned above.
    The trees are built so that each leaf node holds approximately the leaf-size number of data points; the value affects the speed and memory usage of building and querying the tree.

Number of Parallel Jobs

It allows you to select the number of concurrently running processes.

  • You can select any positive or negative integer as the number of parallel jobs.

Node Configuration

It allows you to select the instance of the Amazon Web Services (AWS) server to provide control on the execution of a task in a workbook or workflow.

For more details, refer to Worker Node Configuration.

Example of DBSCAN

Consider the Iris dataset, which records several flower species along with their sepal and petal dimensions (length and width).
A snippet of input data is shown in the figure below.

We select the following properties and apply DBSCAN.

  • Independent Variables: Sepal Length, Sepal Width, Petal Length, Petal Width
  • Epsilon: 0.55
  • Minimum Sample: 5
  • Metric: euclidean
  • Algorithm: auto
  • Power to Calculate Distance: 2.0
  • Leaf Size: 30
  • Number of Parallel Jobs: 10
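
Assuming the task wraps scikit-learn's DBSCAN (an assumption, not a statement about the product's internals), the property values above translate into the sketch below. The exact figures may differ from the ones reported on the Result page depending on how the variables were scaled beforehand.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# The four independent variables: sepal length/width and petal length/width
X = load_iris().data

model = DBSCAN(
    eps=0.55,            # Epsilon
    min_samples=5,       # Minimum Sample
    metric="euclidean",  # Metric
    algorithm="auto",    # Algorithm
    p=2.0,               # Power to Calculate Distance
    leaf_size=30,        # Leaf Size
    n_jobs=10,           # Number of Parallel Jobs
)
labels = model.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("Estimated Number of Clusters:", n_clusters)
print("Estimated Number of Noise Points:", n_noise)
print("Silhouette Coefficient:", round(silhouette_score(X, labels), 4))
```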


The following parameters calculated by the algorithm are displayed on the Result page.

  • Silhouette Coefficient: (0.5802)

Also called the Silhouette score, it indicates the goodness of fit of the DBSCAN clustering. Its value ranges from -1 to 1. The higher the value, the more successful DBSCAN is in assigning data points to the correct clusters, and the better defined the clusters are.

  • A value of -1 indicates that data points are wrongly assigned to a cluster.
  • A value of zero indicates an insignificant or no distance between clusters (overlapping clusters).
  • A value of 1 indicates that clusters are significantly separated and can be distinguished.
  • Estimated Number of Clusters: (2)

It is the number of clusters created by the DBSCAN technique.

  • Estimated Number of Noise Points: (6)

It is the number of outliers (non-clustered data points), which could not be assigned to any of the clusters.


On the same result page, you also see the cluster plot between different sets of variables. By default, the first two variables in the dataset are selected for the Cluster Plot. For example, in the image below, you see a cluster plot of sepal width against sepal length.

  • You can change the variables from the X-axis and Y-axis drop-downs to plot different cluster plots. (You should select different variables for the two axes.)
  • The data points belonging to different clusters and the noise points are identified using different colors.
  • You can hover over any data point to determine its coordinates.
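
A comparable cluster plot can be drawn outside the product with matplotlib; this is only an illustrative sketch, again assuming scikit-learn's DBSCAN, with noise points drawn in a separate color:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X = load_iris().data
labels = DBSCAN(eps=0.55, min_samples=5).fit_predict(X)

sepal_length, sepal_width = X[:, 0], X[:, 1]   # default X-axis and Y-axis variables
noise = labels == -1

plt.scatter(sepal_length[~noise], sepal_width[~noise],
            c=labels[~noise], cmap="viridis", label="clustered points")
plt.scatter(sepal_length[noise], sepal_width[noise],
            c="red", marker="x", label="noise points")
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.legend()
plt.show()
```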

The figure below shows the resultant dataset in the Data tab.

  • Along with the columns present in the original dataset, you can see the ID and Label columns added.
  • In the Label column,
  • A value of zero or a positive integer indicates the cluster to which the data point is assigned.
  • A value of -1 indicates that the point is an outlier or a noise point.
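
The resultant dataset can be approximated with pandas as shown below; the ID and Label column names follow this documentation, while the construction itself is only an illustrative assumption about how the output table is assembled:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)            # features as a pandas DataFrame
result = iris.data.copy()                  # the original input columns
result["ID"] = range(1, len(result) + 1)   # simple row identifier
result["Label"] = DBSCAN(eps=0.55, min_samples=5).fit_predict(iris.data)

# Label holds the cluster index (0, 1, ...) or -1 for noise points
print(result.head())
```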
