# Data Science Project – Student Performance Analysis with Machine Learning


It takes a lot of manual effort to complete the evaluation process as even one college may contain thousands of students.

In this data science project, we will evaluate student performance using machine learning techniques and Python.

You can download the data set you need for this project from here:

```# for some basic operations
import numpy as np
import pandas as pd

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import dabl```

To read the data set:

```data = pd.read_csv('StudentsPerformance.csv')

# getting the shape of the data
print(data.shape)```

#Output- (1000, 8)

To look at the first 5 records in the data set

`data.head()`

### Descriptive Statistics

`data.describe()`

Let's check the number of unique values in each categorical column:

`data.select_dtypes('object').nunique()`
```#Output
gender                         2
race/ethnicity                 5
parental level of education    6
lunch                          2
test preparation course        2
dtype: int64```

Let's check the percentage of missing data in each column:

```# data.shape is a (rows, columns) tuple, so take the row count explicitly
no_of_rows = data.shape[0]
percentage_of_missing_data = data.isnull().sum() / no_of_rows * 100
print(percentage_of_missing_data)```
```#Output
gender                         0.0
race/ethnicity                 0.0
parental level of education    0.0
lunch                          0.0
test preparation course        0.0
math score                     0.0
reading score                  0.0
writing score                  0.0
dtype: float64```
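The same check can be written more compactly, because the mean of a boolean column is the fraction of `True` values. A minimal sketch on a hypothetical toy frame (the `toy` data is made up for illustration and is not the project data):

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    'math score': [72, np.nan, 90, 47],
    'lunch': ['standard', 'standard', None, 'free/reduced'],
})

# isnull() gives a boolean frame; the column mean of booleans
# is the fraction of missing entries, so * 100 gives a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

Each column in the toy frame has one missing entry out of four rows.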

To compare all other attributes with respect to math marks:

```plt.rcParams['figure.figsize'] = (18, 6)
plt.style.use('fivethirtyeight')
dabl.plot(data, target_col = 'math score')```

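Comparison of all other attributes with respect to reading marks:

```plt.rcParams['figure.figsize'] = (18, 6)
plt.style.use('fivethirtyeight')
# the plotting call is missing in the source; this mirrors the math score plot above
dabl.plot(data, target_col = 'reading score')```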

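Let's check the effect of lunch on students' performance

```# the source snippet is cut off after the column list; grouping by lunch and gender
# and taking the median is an assumed completion
data[['lunch', 'gender', 'math score', 'writing score',
      'reading score']].groupby(['lunch', 'gender']).agg('median')```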

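Let's check the effect of the test preparation course on scores

```# the source snippet is cut off after the column list; grouping by course and gender
# and taking the median is an assumed completion
data[['test preparation course',
      'gender',
      'math score',
      'writing score',
      'reading score']].groupby(['test preparation course', 'gender']).agg('median')```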

## Data Visualizations

Visualizing the number of males and females in the data set

```plt.rcParams['figure.figsize'] = (15, 5)
sns.countplot(x = 'gender', data = data, palette = 'bone')
plt.title('Comparison of Males and Females', fontweight = 30)
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()```

Visualizing the different groups in the data set

```plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('ggplot')

sns.countplot(x = 'race/ethnicity', data = data, palette = 'pink')
plt.title('Comparison of various groups', fontweight = 30, fontsize = 20)
plt.xlabel('Groups')
plt.ylabel('count')
plt.show()```

Visualizing the different parental education levels

```plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('fivethirtyeight')

sns.countplot(x = 'parental level of education', data = data, palette = 'Blues')
plt.title('Comparison of Parental Education', fontweight = 30, fontsize = 20)
plt.xlabel('Degree')
plt.ylabel('count')
plt.show()```

Visualizing Maths score

```plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('tableau-colorblind10')

sns.countplot(x = 'math score', data = data, palette = 'BuPu')
plt.title('Comparison of math scores', fontweight = 30, fontsize = 20)
plt.xlabel('score')
plt.ylabel('count')
plt.xticks(rotation = 90)
plt.show()```

Computing the total score for each student

```import warnings
warnings.filterwarnings('ignore')

data['total_score'] = data['math score'] + data['reading score'] + data['writing score']

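# distplot is deprecated in recent seaborn releases; histplot with kde = True is the replacement
sns.histplot(data['total_score'], kde = True, color = 'magenta')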

plt.title('comparison of total score of all the students', fontweight = 30, fontsize = 20)
plt.xlabel('total score scored by the students')
plt.ylabel('count')
plt.show()```

Computing percentage for each of the students

```# average of the three scores, rounded up to the next whole number
data['percentage'] = np.ceil(data['total_score'] / 3)

plt.rcParams['figure.figsize'] = (15, 9)
# distplot is deprecated in recent seaborn releases; histplot with kde = True is the replacement
sns.histplot(data['percentage'], kde = True, color = 'orange')

plt.title('Comparison of percentage scored by all the students', fontweight = 30, fontsize = 20)
plt.xlabel('Percentage scored')
plt.ylabel('Count')
plt.show()```
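The grading function and the label-encoding step further down refer to `status`, `pass_math` and `pass_writing` columns that are not defined anywhere in this walkthrough. The following is only a sketch of how such columns could be built, assuming a pass mark of 40 per subject and an overall status that requires passing every subject. The pass mark, the `pass_reading` column and the toy rows are all assumptions made for illustration:

```python
import pandas as pd

# hypothetical toy rows standing in for the project data
toy = pd.DataFrame({
    'math score': [72, 35, 90],
    'reading score': [72, 50, 95],
    'writing score': [74, 38, 93],
})

passmarks = 40  # assumed pass mark; the source does not state one

# one Pass/Fail column per subject
for subject in ['math', 'reading', 'writing']:
    toy['pass_' + subject] = toy[subject + ' score'].apply(
        lambda score: 'Pass' if score >= passmarks else 'Fail')

# a student passes overall only if every subject is passed
subject_cols = ['pass_math', 'pass_reading', 'pass_writing']
toy['status'] = (toy[subject_cols] == 'Pass').all(axis = 1).map(
    {True: 'Pass', False: 'Fail'})

print(toy[subject_cols + ['status']])
```

If a `pass_reading` column is created this way, it would also need to be label encoded before the features are scaled.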

Assigning grades to the students according to the following criteria (a student whose status is 'Fail' gets grade E regardless of percentage):

```0  - 39 marks  : grade E
40 - 59 marks  : grade D
60 - 69 marks  : grade C
70 - 79 marks  : grade B
80 - 89 marks  : grade A
90 - 100 marks : grade O```
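
The percentage thresholds used by the `getgrade` function can also be expressed as bins. This is a sketch of that alternative with `pd.cut`, not the approach the article uses, and it ignores the pass/fail status. The sample percentages are made up:

```python
import pandas as pd

# hypothetical percentages covering every grade boundary
percentage = pd.Series([35, 40, 59, 60, 79, 80, 90, 100])

# left-closed bins: [0, 40), [40, 60), ..., [90, 101) so that 100 falls in the top bin
bins = [0, 40, 60, 70, 80, 90, 101]
labels = ['E', 'D', 'C', 'B', 'A', 'O']

grades = pd.cut(percentage, bins = bins, labels = labels, right = False)
print(grades.tolist())
```

With `right = False`, a score of exactly 40 falls into grade D and exactly 90 into grade O, which matches the `>=` comparisons in the function.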
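```def getgrade(percentage, status):
    # a student who failed overall gets the lowest grade regardless of percentage
    if status == 'Fail':
        return 'E'
    if percentage >= 90:
        return 'O'
    if percentage >= 80:
        return 'A'
    if percentage >= 70:
        return 'B'
    if percentage >= 60:
        return 'C'
    if percentage >= 40:
        return 'D'
    return 'E'

# the step that applies the function is missing in the source;
# the column name 'grades' is an assumption
data['grades'] = data.apply(lambda row: getgrade(row['percentage'], row['status']), axis = 1)
data['grades'].value_counts()```

```#Output
B    260
C    252
D    223
A    156
O     58
E     51```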

## Label Encoding

```from sklearn.preprocessing import LabelEncoder

# creating an encoder
le = LabelEncoder()

# label encoding for test preparation course
data['test preparation course'] = le.fit_transform(data['test preparation course'])

# label encoding for lunch
data['lunch'] = le.fit_transform(data['lunch'])

# label encoding for race/ethnicity
# we have to map values to each of the categories
data['race/ethnicity'] = data['race/ethnicity'].replace('group A', 1)
data['race/ethnicity'] = data['race/ethnicity'].replace('group B', 2)
data['race/ethnicity'] = data['race/ethnicity'].replace('group C', 3)
data['race/ethnicity'] = data['race/ethnicity'].replace('group D', 4)
data['race/ethnicity'] = data['race/ethnicity'].replace('group E', 5)

# label encoding for parental level of education
data['parental level of education'] = le.fit_transform(data['parental level of education'])

#label encoding for gender
data['gender'] = le.fit_transform(data['gender'])

# label encoding for pass_math
data['pass_math'] = le.fit_transform(data['pass_math'])

# label encoding for pass_writing
data['pass_writing'] = le.fit_transform(data['pass_writing'])

# label encoding for status
data['status'] = le.fit_transform(data['status'])```
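
`LabelEncoder` assigns integer codes following the sorted order of the labels, and every `fit_transform` call on the shared `le` object replaces the mapping learned by the previous call. A small sketch on a hypothetical toy series shows how to inspect the mapping right after fitting:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical toy column, not the project data
lunch = pd.Series(['standard', 'free/reduced', 'standard', 'standard'])

le = LabelEncoder()
codes = le.fit_transform(lunch)

# classes_ holds the labels in sorted order; the position of a label is its code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = le.inverse_transform(codes)
```

Saving `mapping` for each column before reusing the encoder keeps the encoded values interpretable later on.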

## Data Preparation

Splitting the dependent and independent variables

```x = data.iloc[:,:14]
y = data.iloc[:,14]

print(x.shape)
print(y.shape)```

#Output-
(1000, 14)
(1000,)

Splitting the data set into training and test sets

```from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 45)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)```

#Output
(750, 14)
(750,)
(250, 14)
(250,)
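
The grade counts shown earlier are uneven (for example, 51 students with grade E against 260 with grade B), so a plain random split can leave the rare grades under-represented in the test set. One option, not used in the article, is the `stratify` argument of `train_test_split`. A sketch on made-up data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical imbalanced target: 8 students with grade B, 4 with grade E
grades = pd.Series(['B'] * 8 + ['E'] * 4)
features = pd.DataFrame({'percentage': range(12)})

# stratify keeps the B:E ratio the same in the training and test sets
x_tr, x_te, y_tr, y_te = train_test_split(
    features, grades, test_size = 0.25, random_state = 45, stratify = grades)

print(y_te.value_counts())
```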

```# importing the MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# creating a scaler
mm = MinMaxScaler()

# feeding the independent variable into the scaler
x_train = mm.fit_transform(x_train)
x_test = mm.transform(x_test)```
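
The scaler is fitted on the training set only and then reused on the test set, so the test data never influences the minimum and maximum. A small sketch with made-up numbers shows the formula `(x - min) / (max - min)` at work, and that test values outside the training range land outside `[0, 1]`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical single-feature data
train = np.array([[40.0], [60.0], [100.0]])
test = np.array([[30.0], [70.0]])

mm = MinMaxScaler()
train_scaled = mm.fit_transform(train)  # learns min = 40 and max = 100 from train
test_scaled = mm.transform(test)        # reuses the training min and max

# the same transformation written out by hand
manual = (test - train.min(axis = 0)) / (train.max(axis = 0) - train.min(axis = 0))
print(test_scaled.ravel(), manual.ravel())
```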

Applying principal components analysis

```from sklearn.decomposition import PCA

# creating a principal component analysis model that keeps every component
pca = PCA(n_components = None)

# fitting only, so that x_train and x_test are not overwritten before the final projection
pca.fit(x_train)

# printing the share of variance explained by each principal component
explained_variance = pca.explained_variance_ratio_
print(explained_variance)

# creating a principal component analysis model
pca = PCA(n_components = 2)

# feeding the independent variables to the PCA model
x_train = pca.fit_transform(x_train)
x_test = pca.transform(x_test)```
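
The article fixes the number of components at 2. One common way to justify such a choice is to look at the cumulative explained variance and keep the smallest number of components that reaches a chosen threshold. The sketch below uses synthetic data and an assumed threshold of 0.95, neither of which comes from the article:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# hypothetical data: 200 rows, 5 features, with most variance in two directions
base = rng.normal(size = (200, 2)) * [5.0, 3.0]
noise = rng.normal(scale = 0.1, size = (200, 3))
X = np.hstack([base, noise])

pca = PCA(n_components = None).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.95  # assumed cut-off
# index of the first cumulative value that reaches the threshold, plus one
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative, n_components)
```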

## Modelling

Logistic Regression

```from sklearn.linear_model import  LogisticRegression

# creating a model
model = LogisticRegression()

# feeding the training data to the model
model.fit(x_train, y_train)

# predicting the test set results
y_pred = model.predict(x_test)

# calculating the classification accuracies
print("Training Accuracy :", model.score(x_train, y_train))
print("Testing Accuracy :", model.score(x_test, y_test))```

Output-
Training Accuracy : 0.3933333333333333
Testing Accuracy : 0.424

Printing the confusion matrix

```from sklearn.metrics import confusion_matrix

# creating a confusion matrix
cm = confusion_matrix(y_test, y_pred)

# printing the confusion matrix
plt.rcParams['figure.figsize'] = (8, 8)
sns.heatmap(cm, annot = True, cmap = 'Greens')
plt.title('Confusion Matrix for Logistic Regression', fontweight = 30, fontsize = 20)
plt.show()```
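
In the matrix returned by `confusion_matrix`, rows are the true classes and columns are the predicted classes, so the diagonal holds the correct predictions. A sketch with made-up labels shows how per-class recall can be read off it:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical true and predicted grades
y_true = ['A', 'B', 'B', 'C', 'C', 'C']
y_hat = ['A', 'B', 'C', 'C', 'C', 'B']
labels = ['A', 'B', 'C']

cm = confusion_matrix(y_true, y_hat, labels = labels)

# recall per class: correct predictions divided by the row total
recall = cm.diagonal() / cm.sum(axis = 1)
print(cm)
print(dict(zip(labels, recall)))
```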

Random Forest

```from sklearn.ensemble import RandomForestClassifier

# creating a model
model = RandomForestClassifier()

# feeding the training data to the model
model.fit(x_train, y_train)

# predicting the test set results
y_pred = model.predict(x_test)

# calculating the accuracies
print("Training Accuracy :", model.score(x_train, y_train))
print("Testing Accuracy :", model.score(x_test, y_test))```

Output
Training Accuracy : 0.9986666666666667
Testing Accuracy : 0.784
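
The gap between the training accuracy (about 0.999) and the testing accuracy (0.784) suggests the forest is overfitting. A single train/test split can also be noisy, so k-fold cross-validation is one way to get a steadier estimate. The sketch below runs on synthetic data from `make_classification`, and the model settings are assumptions rather than the article's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# hypothetical three-class problem standing in for the grade prediction task
X, y = make_classification(n_samples = 300, n_features = 6, n_informative = 4,
                           n_classes = 3, random_state = 0)

model = RandomForestClassifier(n_estimators = 50, random_state = 0)

# accuracy on each of 5 held-out folds
scores = cross_val_score(model, X, y, cv = 5)
print("Mean accuracy :", scores.mean())
print("Std deviation :", scores.std())
```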

```from sklearn.metrics import confusion_matrix

# creating a confusion matrix
cm = confusion_matrix(y_test, y_pred)

# printing the confusion matrix
plt.rcParams['figure.figsize'] = (8, 8)
sns.heatmap(cm, annot = True, cmap = 'Reds')
plt.title('Confusion Matrix for Random Forest', fontweight = 30, fontsize = 20)
plt.show()```
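```from pandas.plotting import radviz
fig, ax = plt.subplots(figsize=(12, 12))
new_df = x.copy()
# note: this overwrites the encoded 'status' feature with the target y
new_df["status"] = y
# the plotting call is missing in the source; radviz is imported above and draws the plot
radviz(new_df, "status", ax = ax)
plt.title('Radial Visualization for Target', fontsize = 20)
plt.show()```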

The plot suggests that students with very low grades are strongly associated with lunch type and parental education.

##### Aman Kharwal