DBSCAN stands for Density-Based Spatial Clustering for Applications with Noise. This is an unsupervised clustering algorithm which is used to find high-density base samples to extend the clusters. In this article, I will introduce you to DBSCAN clustering in Machine Learning using Python.
What is Clustering?
In machine learning, clustering is the task of unsupervised machine learning. Clustering means bringing together similar instances. Similarity parameters depend on the task at hand, for example, in some cases, two close samples are considered similar while in some cases they are completely different after being in the same cluster.
In Machine Learning, some of the most popular clustering algorithms are:
- Agglomerative Clustering
- Affinity Propagation
- Spectral Clustering
In the section below, I will introduce you to the concepts of the DBSCAN clustering algorithm first, and then we will see how to implement it using Python.
DBSCAN Clustering in Machine Learning
The DBSCAN Clustering algorithm is based on the concept of core samples, non-core samples, and outliers:
- Core Samples: The samples present in the high-density area have minimum sample points with the eps radius.
- Non-core samples: The samples close to core samples but are not core samples but are very near to the core samples. The no-core samples lie within the eps radius of the core samples but they don’t have minimum samples points.
- Outliers: The samples that are not part of the core samples and the non-core samples and are far away from all the samples.
The DBSCAN clustering algorithm works well if all the clusters are dense enough and are well represented by the low-density regions.
DBSCAN Clustering using Python
Now in this section, I will walk you through how to implement the DBSCAN algorithm using Python. The dataset I’m using here is a credit card dataset. Now let’s import the necessary Python libraries and the dataset:
Before moving forward let’s have a look at the null values in the dataset:
CUST_ID 0 BALANCE 0 BALANCE_FREQUENCY 0 PURCHASES 0 ONEOFF_PURCHASES 0 INSTALLMENTS_PURCHASES 0 CASH_ADVANCE 0 PURCHASES_FREQUENCY 0 ONEOFF_PURCHASES_FREQUENCY 0 PURCHASES_INSTALLMENTS_FREQUENCY 0 CASH_ADVANCE_FREQUENCY 0 CASH_ADVANCE_TRX 0 PURCHASES_TRX 0 CREDIT_LIMIT 1 PAYMENTS 0 MINIMUM_PAYMENTS 313 PRC_FULL_PAYMENT 0 TENURE 0 dtype: int64
So we have some null values in the Maximum Payments column. I will fill these values with the average values and here I will also remove the customer id column as it is useless:
data = data.drop('CUST_ID', axis=1) data.fillna(data.mean(), inplace=True)
Now let’s scale and normalize the dataset:
Now I will implement the Principal Component Analysis (PCA) algorithm in machine learning to reduce the dimensionality of the data for visualization:
V1 V2 0 -0.489825 -0.679678 1 -0.518791 0.545012 2 0.330885 0.268978 3 -0.482374 -0.092110 4 -0.563289 -0.481915
Now let’s implement the DBSCAN algorithm and have a look at the data and the clusters after implementing it:
dbscan = DBSCAN(eps=0.036, min_samples=4).fit(x_principal) labels = dbscan.labels_ data['cluster'] = dbscan.labels_ print(data.tail())
BALANCE BALANCE_FREQUENCY ... TENURE cluster 8945 28.493517 1.000000 ... 6 0 8946 19.183215 1.000000 ... 6 0 8947 23.398673 0.833333 ... 6 0 8948 13.457564 0.833333 ... 6 0 8949 372.708075 0.666667 ... 6 0 [5 rows x 18 columns]
DBSCAN clustering algorithm is a very simple and powerful clustering algorithm in machine learning. It can identify any cluster of any shape. It is robust to outliers and has only two hyperparameters. It may be difficult for it to capture the clusters properly if the cluster density increases significantly.
I hope you liked this article on DBSCAN Clustering algorithm in Machine Learning and its implementation using Python. Feel free to ask your valuable questions in the comments section below.