K-Means is a clustering algorithm that can partition an unlabeled dataset quickly and efficiently, usually in just a few iterations. In this article, I will take you through K-Means clustering in machine learning using Python.
K-Means Clustering in Machine Learning
Clustering means identifying similar instances and assigning them to clusters or groups of similar instances. It is used in a wide variety of applications such as:
- Customer Segmentation
- Data Analysis
- Dimensionality Reduction
- Anomaly Detection
- Semi-supervised learning
- Searching Images
- Image Segmentation
K-Means is a clustering algorithm in machine learning that can group an unlabeled dataset very quickly and efficiently, often in just a few iterations. It works by assigning each instance to the cluster whose centroid is closest. A centroid is the point around which the instances of a cluster are centred, that is, the mean of those instances.
If you were given the instance labels, you could easily locate all the centroids by averaging the instances in each cluster. But here we are given neither labels nor centroids, so we start by placing the centroids randomly, for example by selecting k random instances and using their locations as the centroids.
Then we label the instances, update the centroids, re-label the instances, update the centroids again, and so on until the labels stop changing. The K-Means clustering algorithm is guaranteed to converge in a finite number of steps (usually quite few), so it will not iterate forever. However, depending on the initial centroids, it may converge to a local optimum rather than the best possible solution.
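To make the label-and-update loop described above concrete, here is a minimal sketch in plain Python. It is not the implementation used later in this article; the function name `k_means`, its parameters, and the six toy points are all made up for illustration.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Illustrative K-Means loop; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start by using k distinct random instances as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Label step: assign each instance to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop once the labels no longer change between iterations.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its instances.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels


# Toy data (invented for this sketch): two loose groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Because the starting centroids are random, different seeds can give different label numbering and, on harder data, different final clusters.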
K-Means Clustering using Python
The computational complexity of the K-Means clustering algorithm is generally linear with respect to:
- the number of instances m,
- the number of clusters k,
- and the number of dimensions n.
This is only true when the dataset has a clustering structure. If it does not, the worst-case time complexity of the algorithm can grow exponentially with the number of instances. In practice, this rarely happens, and K-Means is generally considered one of the fastest clustering algorithms.
Now let’s see how to implement K-Means clustering using Python. I will use the California housing dataset to create economic segments for different areas of California. Let’s start by importing the necessary Python libraries and loading the dataset:
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
   median_income  latitude  longitude
0         8.3252     37.88    -122.23
1         8.3014     37.86    -122.22
2         7.2574     37.85    -122.24
3         5.6431     37.85    -122.25
4         3.8462     37.85    -122.25
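The output above is what a loading step along the following lines might print. This is only a sketch using pandas: the helper name `load_features` is hypothetical, and a tiny in-memory CSV holding the five rows shown above stands in for the real file, whose name and location are not given here.

```python
import io

import pandas as pd


def load_features(source):
    """Read the housing CSV and keep the three columns used for clustering."""
    data = pd.read_csv(source)
    print(data.columns)
    return data[["median_income", "latitude", "longitude"]]


# Stand-in for the real dataset: only the five rows printed above, and only
# the three columns we need (the full file has ten columns).
sample_csv = io.StringIO(
    "longitude,latitude,median_income\n"
    "-122.23,37.88,8.3252\n"
    "-122.22,37.86,8.3014\n"
    "-122.24,37.85,7.2574\n"
    "-122.25,37.85,5.6431\n"
    "-122.25,37.85,3.8462\n"
)
features = load_features(sample_csv)
print(features.head())
```

With the real dataset, `source` would simply be the path to the CSV file instead of the in-memory sample.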
Now let’s apply the K-Means clustering algorithm to this data. Since K-Means is sensitive to the scale of the features, it is a good idea to rescale or normalize them first, especially those with extreme values:
   median_income  latitude  longitude  Cluster
0         8.3252     37.88    -122.23        2
1         8.3014     37.86    -122.22        2
2         7.2574     37.85    -122.24        2
3         5.6431     37.85    -122.25        2
4         3.8462     37.85    -122.25        0
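A table like the one above could be produced with scikit-learn roughly as follows. This is a sketch under assumptions: `StandardScaler` is one possible way to rescale, three clusters is a guess based on the labels visible above, and `add_clusters` is a hypothetical helper name. It runs on just the five rows shown earlier, so the labels it prints will not match the article's.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# The five rows shown earlier; in practice the full dataset would be used.
features = pd.DataFrame({
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "latitude": [37.88, 37.86, 37.85, 37.85, 37.85],
    "longitude": [-122.23, -122.22, -122.24, -122.25, -122.25],
})


def add_clusters(features, k=3, seed=0):
    """Standardize the features, run K-Means, return a copy with a Cluster column."""
    # Rescale each column to zero mean and unit variance (assumed scaling choice).
    scaled = StandardScaler().fit_transform(features)
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    result = features.copy()
    result["Cluster"] = model.fit_predict(scaled)
    return result


clustered = add_clusters(features)
print(clustered)
```

Fixing `random_state` makes the run repeatable; cluster numbers themselves are arbitrary and carry no ordering.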
Now let’s have a look at the clusters identified by the algorithm by using a scatterplot:
The scatter plot above shows the geographic distribution of the clusters. It appears that the algorithm created separate segments for the high-income area.
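For readers who want to draw a similar plot themselves, here is one possible sketch using matplotlib, which is an assumption and not necessarily the library behind the plot described above. It uses only the five labelled rows printed earlier, so it illustrates the plotting calls rather than reproducing the full figure.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# The five labelled rows printed earlier in the article.
clustered = pd.DataFrame({
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "latitude": [37.88, 37.86, 37.85, 37.85, 37.85],
    "longitude": [-122.23, -122.22, -122.24, -122.25, -122.25],
    "Cluster": [2, 2, 2, 2, 0],
})

fig, ax = plt.subplots(figsize=(6, 6))
# Plot each area by location and colour it by its cluster label.
scatter = ax.scatter(
    clustered["longitude"],
    clustered["latitude"],
    c=clustered["Cluster"],
    cmap="viridis",
    s=20,
)
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("K-Means clusters by location")
fig.colorbar(scatter, ax=ax, label="Cluster")
```

In an interactive session, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` at the end would display the figure.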
This is how we can implement the K-Means clustering algorithm using Python. It is important to scale the input features before running K-Means; otherwise, the clusters can be very stretched and the algorithm will perform poorly. Scaling the features does not guarantee that the clusters will be nice and spherical, but it usually improves things a lot.
I hope you liked this article on the K-means algorithm in machine learning and its implementation using Python. Feel free to ask your valuable questions in the comments section below.