Feature selection is one of the most important concepts in machine learning: the features you feed into your model largely determine its final performance.
You have probably faced the problem of identifying the relevant features in a dataset and removing the less important ones, which contribute little to the target and hold back the accuracy of your trained model.
In this article, you will learn feature selection techniques for machine learning that you can use to train your models more effectively.
What is Feature Selection?
Feature selection is the process of selecting, automatically or manually, the features in your dataset that contribute most to training your machine learning model, so that it produces the most accurate predictions.
A model trained on irrelevant or noisy features will give less accurate predictions than one trained on a well-chosen subset of features.
Feature Selection Techniques in Machine Learning with Python
In this article, I will share the three major techniques of Feature Selection in Machine Learning with Python.
- Univariate Selection
- Feature Importance
- Correlation Matrix
Now let’s go through each technique with the help of a dataset that you can download from below.
1. Univariate Selection
Statistical tests can be used to select the features that have the strongest relationship with the output variable. In the example below, I will use the chi-squared statistical test, which requires non-negative feature values, to select the 10 best features from the dataset.
```python
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

data = pd.read_csv("train.csv")
X = data.iloc[:, 0:20]  # independent columns
y = data.iloc[:, -1]    # target column, i.e. price range

# apply the SelectKBest class to extract the top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)

# concatenate the two dataframes for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']  # name the dataframe columns
print(featureScores.nlargest(10, 'Score'))
```
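Beyond printing the scores, the fitted selector can also reduce the dataset directly. Here is a minimal, self-contained sketch on synthetic non-negative data (the data, column count, and `k` are illustrative assumptions, not taken from the article's dataset):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# synthetic non-negative data: chi2 requires feature values >= 0
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 5))   # 5 integer-valued features
y = (X[:, 0] + X[:, 1] > 9).astype(int)  # target depends on the first two

selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)     # keep only the 2 best features

print(X_new.shape)                        # reduced feature matrix
print(selector.get_support(indices=True))  # indices of the kept columns
```

Calling `fit_transform` (instead of just `fit`) returns the reduced feature matrix, so the selected columns can be fed straight into model training.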
2. Feature Importance
With this technique, you can get an importance score for every feature in your dataset using the built-in feature importance property of a tree-based model.
Feature importance assigns a relevance score to each feature: the higher the score, the more relevant that feature is for training your model.
In the example below, I will use the feature importance technique to select the top 10 features from the dataset that are most relevant for training the model.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

data = pd.read_csv("train.csv")
X = data.iloc[:, 0:20]  # independent columns
y = data.iloc[:, -1]    # target column, i.e. price range

model = ExtraTreesClassifier()
model.fit(X, y)
# use the inbuilt feature_importances_ attribute of tree-based classifiers
print(model.feature_importances_)

# plot a graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
```
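Beyond plotting, the same ranking can be used to keep only the top-scoring columns. A self-contained sketch on synthetic data (the feature names, sample size, and the choice of the top 2 features are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# synthetic data: only f0 and f1 actually influence the target
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 6)),
                 columns=[f"f{i}" for i in range(6)])
y = (X["f0"] + 2 * X["f1"] > 0).astype(int)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# importance scores sum to 1; rank them and keep the top 2 columns
importances = pd.Series(model.feature_importances_, index=X.columns)
top2 = importances.nlargest(2).index.tolist()
X_reduced = X[top2]  # dataset containing only the top features
print(top2)
```

The `nlargest(...).index` pattern turns the importance ranking into a concrete column subset, which is usually the end goal of this technique.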
3. Correlation Matrix
With this technique, we can see how the features are correlated with each other and with the target. A correlation coefficient is positive when an increase in one variable tends to increase the other, negative when an increase in one tends to decrease the other, and close to zero when there is little linear relationship between them.
A heatmap makes it easy to see at a glance how strongly the features are correlated with each other and with the target. In the example below, I will create a heatmap of the correlated features to illustrate the correlation matrix technique.
```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("train.csv")
X = data.iloc[:, 0:20]  # independent columns
y = data.iloc[:, -1]    # target column, i.e. price range

# get the correlation of each feature in the dataset (including the target)
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20, 20))

# plot the heat map
g = sns.heatmap(data[top_corr_features].corr(), annot=True, cmap="RdYlGn")
plt.show()
```
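A common way to act on the correlation matrix is to drop features that are nearly duplicates of one another. A self-contained sketch on synthetic data (the column names and the 0.95 threshold are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# synthetic data: "b" is nearly a copy of "a", while "c" is independent
rng = np.random.default_rng(1)
a = rng.normal(size=500)
df = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=500),  # almost perfectly correlated with "a"
    "c": rng.normal(size=500),
})

# absolute correlations; keep only the upper triangle so each pair is
# inspected exactly once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# drop any column highly correlated (|r| > 0.95) with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)
```

Dropping one feature from each highly correlated pair removes redundant information without losing much predictive signal.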
In this way, you can select the most relevant features from your dataset using these feature selection techniques in machine learning with Python.