Principal Component Analysis in Machine Learning

Principal component analysis (PCA) is a dimensionality reduction algorithm. It is one of the easiest and most intuitive ways to reduce the dimensions of a dataset. In this article, I will walk you through Principal Component Analysis in Machine Learning and its implementation using Python.

Principal Component Analysis

In machine learning, principal component analysis (PCA) is the most widely used algorithm for dimensionality reduction. It works by identifying the hyperplane that lies closest to the dataset and then projecting the data onto it. PCA selects the axis that preserves the maximum amount of variance, because that is also the axis that minimizes the mean squared distance between the original data and its projection onto the axis.

PCA identifies the axis that accounts for the largest amount of variance in the training data. It then finds a second axis, orthogonal to the first, that accounts for the largest amount of the remaining variance.

For each principal component, PCA finds a zero-centred unit vector pointing in the direction of that component. Since two opposite vectors lie on the same axis, the directions of the unit vectors returned by PCA are not stable: a pair of unit vectors may rotate or flip sign between runs, but the plane they define will generally remain the same.

Types of PCA

There are four methods to implement PCA (a short scikit-learn sketch of each follows the list):

  1. Regular PCA: Regular PCA is the default version, but it only works if the data fits in memory.
  2. Incremental PCA: Incremental PCA is useful for large datasets that will not fit in memory as regular PCA requires, but it is slower than regular PCA, so if the data fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks, where you need to reduce the dimensions of the dataset on the fly as each new sample arrives.
  3. Randomized PCA: Randomized PCA is very useful when you want to drastically reduce dimensionality and the dataset fits in memory. In such cases, it works faster than regular PCA.
  4. Kernel PCA: Kernel PCA is preferred when the dataset is nonlinear.
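As a rough illustration, here is how these four variants map onto scikit-learn. The class names come from sklearn.decomposition; the hyperparameter values below are placeholders for illustration, not recommendations:

from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA

# Regular PCA: the default, requires the whole dataset in memory
pca = PCA(n_components=2)

# Incremental PCA: processes the data in mini-batches
inc_pca = IncrementalPCA(n_components=2, batch_size=100)

# Randomized PCA: a stochastic solver, faster when the target
# dimensionality is much smaller than the number of features
rnd_pca = PCA(n_components=2, svd_solver="randomized")

# Kernel PCA: applies the kernel trick for nonlinear datasets
kernel_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)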

Principal Component Analysis using Python

To implement principal component analysis using Python, we can use the PCA class provided by the Scikit-Learn library. Here's how we can implement principal component analysis using Python to reduce the dimensionality of data:
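Below is a minimal sketch of this approach. It uses scikit-learn's built-in iris data as a stand-in dataset; the choice of dataset here is just for illustration:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load a small example dataset: 150 samples with 4 features each
X = load_iris().data

# Project the data down onto the first two principal components
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

print(X2D[:5])                         # first five projected samples
print(pca.explained_variance_ratio_)  # fraction of variance per component

The explained_variance_ratio_ attribute is a handy check: it tells you how much of the dataset's variance lies along each principal component, so you can see how much information the projection keeps.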

When implementing PCA, you should be aware that the algorithm assumes the dataset is centred around the origin. The PCA class provided by Scikit-Learn takes care of centring the data for you. But if you implement PCA yourself, don't forget to centre the data first. Here's how you can implement principal component analysis using Python without the scikit-learn library:
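A minimal sketch using NumPy's singular value decomposition; it loads the same example dataset as above (again, an assumption for illustration, with scikit-learn used only to fetch the data):

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data

# Centre the data around the origin (Scikit-Learn's PCA class does this
# automatically, but here we must do it ourselves)
X_centered = X - X.mean(axis=0)

# Singular value decomposition: the rows of Vt are the principal axes
U, s, Vt = np.linalg.svd(X_centered)

# Build the projection matrix from the first two principal axes
# and project the centred data onto them
W2 = Vt[:2].T
X2D = X_centered @ W2

print(X2D[:5])  # should match the Scikit-Learn projection up to sign flips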

Summary

PCA can significantly reduce the dimensionality of most datasets, even highly nonlinear ones, because it can at least get rid of unnecessary dimensions. However, if there are no unnecessary dimensions in the data, reducing dimensionality will lose too much information, so you should avoid PCA in that case.

I hope you liked this article on Principal Component Analysis in machine learning and its implementation using Python. Feel free to ask your valuable questions in the comments section below.
