DBSCAN Clustering in Machine Learning

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is an unsupervised clustering algorithm that finds core samples in high-density regions and expands clusters from them. In this article, I will introduce you to DBSCAN clustering in Machine Learning using Python.

What is Clustering?

Clustering is an unsupervised machine learning task that groups similar instances together. What counts as similar depends on the task at hand: in some cases two nearby samples are considered similar, while in others samples that end up in the same cluster can still be quite different from each other.


In Machine Learning, some of the most popular clustering algorithms are:

  1. K-Means
  2. DBSCAN
  3. Agglomerative Clustering
  4. BIRCH
  5. Mean-Shift
  6. Affinity Propagation
  7. Spectral Clustering

In the section below, I will introduce you to the concepts of the DBSCAN clustering algorithm first, and then we will see how to implement it using Python.

DBSCAN Clustering in Machine Learning

The DBSCAN Clustering algorithm is based on the concept of core samples, non-core samples, and outliers:

  1. Core samples: Samples in a high-density area that have at least the minimum number of samples (min_samples) within the eps radius.
  2. Non-core samples: Samples that lie within the eps radius of a core sample but do not themselves have the minimum number of samples within their own eps radius.
  3. Outliers: Samples that are neither core samples nor non-core samples, that is, they lie farther than eps from every core sample.
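To make the three categories concrete, here is a minimal sketch using scikit-learn's DBSCAN on six hand-picked points. The points and the values eps=0.3 and min_samples=4 are assumptions chosen for illustration, not values from the credit card example. Note that scikit-learn counts the sample itself when it checks min_samples.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hand-picked toy points (an assumption for illustration only)
X = np.array([
    [0.0, 0.0], [0.0, 0.2], [0.2, 0.0], [0.2, 0.2],  # a small dense group
    [0.45, 0.1],  # close to the group, but with too few neighbours of its own
    [3.0, 3.0],   # far away from everything
])

model = DBSCAN(eps=0.3, min_samples=4).fit(X)

# Core samples are listed in core_sample_indices_; outliers get the label -1
is_core = np.zeros(len(X), dtype=bool)
is_core[model.core_sample_indices_] = True
is_outlier = model.labels_ == -1

# Everything that is neither core nor outlier is a non-core sample
kinds = np.where(is_core, "core", np.where(is_outlier, "outlier", "non-core"))

for point, kind, label in zip(X, kinds, model.labels_):
    print(point, kind, "cluster label:", label)
```

The non-core sample still receives the cluster label of the group it touches, while the outlier is labelled -1.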

The DBSCAN clustering algorithm works well if all the clusters are dense enough and are well separated by low-density regions.
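As a quick illustration of dense clusters with a sparse region between them, the sketch below runs DBSCAN on scikit-learn's two-moons toy dataset. The dataset, the sample size, and the values eps=0.2 and min_samples=5 are assumptions picked for this demo and have nothing to do with the credit card data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half circles with a little noise (toy data, not the article's dataset)
X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Count the clusters, ignoring the outlier label -1
n_clusters = len(set(labels) - {-1})
print("clusters found:", n_clusters)
```

Because the two moons are dense and the gap between them is wider than eps, each moon ends up as its own cluster even though neither is round.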

DBSCAN Clustering using Python

Now in this section, I will walk you through how to implement the DBSCAN algorithm using Python. The dataset I’m using here is a credit card dataset. Now let’s import the necessary Python libraries and the dataset:
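As a minimal sketch, the loading step could look like the code below. The helper name load_credit_card_data is hypothetical, and the two inline rows are made-up placeholder values so that the snippet runs on its own; in practice you would pass the path to your own copy of the credit card CSV file.

```python
import io

import pandas as pd


def load_credit_card_data(source):
    """Read the credit card CSV from a file path or a file-like object."""
    return pd.read_csv(source)


# Placeholder rows (made up for illustration); replace with the path to the real file
sample_csv = io.StringIO(
    "CUST_ID,BALANCE,CREDIT_LIMIT,MINIMUM_PAYMENTS\n"
    "C00001,1.0,10.0,2.0\n"
    "C00002,3.0,20.0,\n"
)

data = load_credit_card_data(sample_csv)
print(data.head())
```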

[Figure: first rows of the credit card dataset. The dataset contains more columns than shown.]

Before moving forward let’s have a look at the null values in the dataset:

data.isnull().sum()
CUST_ID                               0
BALANCE                               0
BALANCE_FREQUENCY                     0
PURCHASES                             0
ONEOFF_PURCHASES                      0
INSTALLMENTS_PURCHASES                0
CASH_ADVANCE                          0
PURCHASES_FREQUENCY                   0
ONEOFF_PURCHASES_FREQUENCY            0
PURCHASES_INSTALLMENTS_FREQUENCY      0
CASH_ADVANCE_FREQUENCY                0
CASH_ADVANCE_TRX                      0
PURCHASES_TRX                         0
CREDIT_LIMIT                          1
PAYMENTS                              0
MINIMUM_PAYMENTS                    313
PRC_FULL_PAYMENT                      0
TENURE                                0
dtype: int64

So we have null values in the MINIMUM_PAYMENTS column (313) and one in the CREDIT_LIMIT column. I will fill these with the column means, and I will also remove the customer ID column, as it is only an identifier and carries no information for clustering:

data = data.drop('CUST_ID', axis=1)
data.fillna(data.mean(), inplace=True)

Now let’s scale and normalize the dataset:
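A minimal sketch of this step, assuming scikit-learn's StandardScaler for scaling and its normalize function for normalization. The random demo frame and the names x_scaled and x_normalized are assumptions so that the snippet runs on its own; with the real data you would pass the cleaned DataFrame instead.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, normalize

# Random stand-in for the cleaned dataset (an assumption for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(loc=100.0, scale=25.0, size=(200, 5)),
    columns=[f"feature_{i}" for i in range(5)],
)

# Scale each column to zero mean and unit variance
x_scaled = StandardScaler().fit_transform(demo)

# Normalize each row to unit length (L2 norm by default)
x_normalized = pd.DataFrame(normalize(x_scaled))
print(x_normalized.head())
```

Scaling puts all columns on a comparable footing, which matters for DBSCAN because eps is a single distance threshold applied across every feature.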

Now I will use Principal Component Analysis (PCA) to reduce the data to two components for visualization:

      V1        V2
0 -0.489825 -0.679678
1 -0.518791  0.545012
2  0.330885  0.268978
3 -0.482374 -0.092110
4 -0.563289 -0.481915
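A two-column frame like the one above could be produced along the following lines. This is a sketch under assumptions: it uses random stand-in data rather than the credit card dataset, so the printed numbers will differ, and the column names V1 and V2 are taken from the output shown above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize

# Random stand-in for the scaled and normalized dataset (an assumption)
rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 5))
x_normalized = normalize(StandardScaler().fit_transform(demo))

# Keep only the first two principal components
pca = PCA(n_components=2)
x_principal = pd.DataFrame(pca.fit_transform(x_normalized), columns=["V1", "V2"])
print(x_principal.head())
```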

Now let’s implement the DBSCAN algorithm and have a look at the data and the clusters after implementing it:

from sklearn.cluster import DBSCAN

# Fit DBSCAN on the two PCA components; a label of -1 marks an outlier
dbscan = DBSCAN(eps=0.036, min_samples=4).fit(x_principal)
labels = dbscan.labels_
data['cluster'] = labels
print(data.tail())
         BALANCE  BALANCE_FREQUENCY  ...  TENURE  cluster
8945   28.493517           1.000000  ...       6        0
8946   19.183215           1.000000  ...       6        0
8947   23.398673           0.833333  ...       6        0
8948   13.457564           0.833333  ...       6        0
8949  372.708075           0.666667  ...       6        0

[5 rows x 18 columns]

[Figure: clusters found by the DBSCAN clustering algorithm]
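Once the labels are available, it helps to count how many clusters were found and how many samples were marked as outliers. Here is a minimal sketch on synthetic points; the two blobs, the stray point, and the eps value are assumptions for illustration. The same counting works on the labels array from the credit card data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight synthetic blobs plus one stray point (illustrative data only)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.05, size=(100, 2))
blob_b = rng.normal(loc=(2.0, 2.0), scale=0.05, size=(100, 2))
stray = np.array([[10.0, 10.0]])
points = np.vstack([blob_a, blob_b, stray])

labels = DBSCAN(eps=0.3, min_samples=4).fit(points).labels_

# The label -1 is reserved for outliers, so it is excluded from the cluster count
n_outliers = int(np.sum(labels == -1))
cluster_sizes = {int(k): int(np.sum(labels == k)) for k in set(labels) if k != -1}
n_clusters = len(cluster_sizes)

print("clusters:", n_clusters)
print("outliers:", n_outliers)
print("cluster sizes:", cluster_sizes)
```

If nearly every sample lands in one cluster or most samples are labelled -1, that is usually a sign that eps or min_samples needs adjusting.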

Summary

DBSCAN is a simple yet powerful clustering algorithm in machine learning. It can identify clusters of arbitrary shape, is robust to outliers, and has only two hyperparameters (eps and min_samples). It may, however, struggle to capture the clusters properly when the density varies significantly across clusters.

I hope you liked this article on the DBSCAN clustering algorithm in Machine Learning and its implementation using Python. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal

2 Comments

  1. Thanks for providing the wonderful project on clustering, but I have few queries.

    1. How you decided to fill the blank values with mean? I know there are multiple ways to fill the missing values, but not sure on choosing a way, need more practice I guess on this. If I can get answers to few scenarios then I can start relating those

    2. How you decided to choose DBScan, why not other clustering techniques?

    3. How to decide on Hyperparameters, EPS and number of samples (I guess this is key for dbscan clustering)

    4. If I can find the description for the above data set, that will be really great.
