The geometric intuition behind clustering in machine learning is simple: you want to group data points that are “close” in a certain sense. So, for any algorithm to work, you need to have a concrete way to measure “proximity”; such a measure is called a metric.
The metric and clustering algorithm you should use depend on the shape of your data; for example, your data may consist of real-valued vectors, lists of elements, or sequences of bits. Let’s have a look at the most popular clustering algorithms.
The most basic clustering method is so simple that it is usually not considered a clustering method at all: choose one or more dimensions, and define each cluster as the group of elements that share the same value in those dimensions.
In SQL, this is what the GROUP BY clause does, so we call this technique “grouping”. For example, if you group by IP address, each cluster is defined by an IP address, and its elements are the entities sharing that IP address.
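As a minimal sketch of grouping, here is the GROUP BY idea in plain Python, applied to some hypothetical log records keyed by IP address:

```python
from collections import defaultdict

# Hypothetical log records: (ip_address, event) pairs.
events = [
    ("10.0.0.1", "login"),
    ("10.0.0.2", "upload"),
    ("10.0.0.1", "logout"),
    ("10.0.0.3", "login"),
    ("10.0.0.2", "download"),
]

# "Grouping": each cluster is the set of records sharing an IP address,
# the direct analogue of SQL's GROUP BY.
clusters = defaultdict(list)
for ip, event in events:
    clusters[ip].append(event)

for ip, grouped in sorted(clusters.items()):
    print(ip, grouped)
```

Each key of `clusters` names one cluster; no distance metric is involved, which is why grouping sits at the edge of what counts as clustering.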
k-means is usually the first algorithm that comes to mind when you think of clustering. k-means applies to real-valued vectors when you know how many clusters to expect; the number of clusters is denoted by k.
The goal of the algorithm is to assign each data point to a cluster such that the sum of the distances from each point to its cluster’s centre of gravity is minimized. k-means is a simple and efficient clustering algorithm that scales well to very large datasets.
Since k is a fixed parameter of the algorithm, you must choose it appropriately. If you know how many clusters you are looking for (for example, if you are trying to group different malware families), you can simply set k to that number.
Otherwise, you will have to experiment with different values of k. It is also common to choose values of k between one and three times the number of classes (labels) in your data, in case some categories are discontinuous.
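The ideas above can be sketched with scikit-learn’s `KMeans` on synthetic data; the three blobs are an assumption made here so that the “right” k is known in advance:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic 2-D blobs, so the appropriate k is known to be 3.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(50, 2)),
])

# k is a fixed parameter of the algorithm: we must supply it up front.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

print(km.cluster_centers_)   # one centre of gravity per cluster
print(len(set(km.labels_)))  # number of distinct cluster labels
```

When the true number of clusters is unknown, you would rerun this with several values of `n_clusters` and compare the results (for example, via the inertia reported by `km.inertia_`).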
Unlike k-means, hierarchical clustering methods are not parameterized by an operator-chosen k (the number of clusters to create); choosing an appropriate k is a non-trivial task that can significantly affect the resulting clusters. Agglomerative hierarchical clustering (bottom-up) starts with each data point in its own cluster and repeatedly merges the two closest clusters according to the distance metric.
Divisive hierarchical clustering (top-down) is another form of hierarchical clustering that works in the opposite direction. Instead of starting with as many clusters as there are data points, we start with a single cluster made up of all the data points and recursively split clusters according to the distance metric, stopping when each data point is in its own cluster.
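Most libraries implement the agglomerative (bottom-up) variant rather than the divisive one, so here is a small sketch using SciPy, on hypothetical 2-D data; the distance threshold `t=2.0` is an assumption chosen for this toy example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two small, well-separated 2-D groups (hypothetical data).
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],
])

# Agglomerative clustering: start with one cluster per point and
# repeatedly merge the two closest clusters.
merges = linkage(points, method="average", metric="euclidean")

# Cut the resulting tree at a distance threshold to recover flat
# clusters; note that no k has to be chosen in advance.
labels = fcluster(merges, t=2.0, criterion="distance")
print(labels)
```

The `merges` matrix encodes the full merge tree (a dendrogram), so you can cut it at different thresholds afterwards to obtain coarser or finer clusterings without rerunning the algorithm.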
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is one of the most popular and widely used clustering algorithms due to its generally good performance in different scenarios.
Unlike k-means, the number of clusters is not defined by the operator but rather deduced from the data. Unlike hierarchical, distance-based clustering, DBSCAN is a density-based algorithm that divides data sets into subgroups of high-density regions.
In naive implementations, this neighbour-finding step is done by iterating through each point in the dataset, computing its distance to every other point, and then associating each point with its neighbours, which costs O(n²) distance calculations.
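A minimal sketch of DBSCAN with scikit-learn, on synthetic data; the two dense blobs, the scattered outliers, and the `eps`/`min_samples` values are all assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense blobs plus a few scattered outliers (hypothetical data).
dense = np.vstack([
    rng.normal(loc=(0, 0), scale=0.2, size=(40, 2)),
    rng.normal(loc=(4, 4), scale=0.2, size=(40, 2)),
])
noise = np.array([[2.0, 2.0], [-3.0, 5.0], [7.0, -1.0]])
points = np.vstack([dense, noise])

# eps is the neighbourhood radius; min_samples is the density
# threshold. The number of clusters is deduced from the data,
# not supplied by the operator.
db = DBSCAN(eps=0.5, min_samples=5).fit(points)

# Label -1 marks points DBSCAN considers noise.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters)
print(int(np.sum(db.labels_ == -1)))
```

Notice that nothing resembling k appears in the call: the two high-density regions are discovered from the data, and the isolated points are labelled as noise rather than being forced into a cluster.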
I hope you enjoyed this overview of clustering and some of the most common clustering algorithms in machine learning. Feel free to ask your questions in the comments section below.