Machine Learning Project Walkthrough with Python


In this article, I will take you through a complete Machine Learning project walkthrough with the Python programming language. This walkthrough includes the implementation of algorithms provided by Scikit-Learn, one of the best Python libraries for Machine Learning.

Below are the steps that are covered in this Machine Learning project walkthrough:

  1. Importing the Data
  2. Data Visualization
  3. Data Cleaning and Transformation
  4. Encoding the Data
  5. Splitting the data into Training and Test sets
  6. Fine Tuning Algorithms
  7. Cross Validate with KFold
  8. Prediction on the test set


Machine Learning Project Walkthrough with Python

Now, in this section, I will take you through the complete Machine Learning project walkthrough with Python. I will start by importing the necessary Python libraries and the dataset:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')

Now let’s see how to visualize this data. Data visualization is crucial for recognizing the underlying patterns in the data before training the machine learning model:

sns.barplot(x="Embarked", y="Survived", hue="Sex", data=data_train)
plt.show()
Figure: survival rate by port of embarkation and sex (Titanic dataset)

Data Cleaning and Transformation:

Now the next step is to clean and transform the data according to the output that we need. Here are the steps that I will take (a code sketch follows this list):

  1. To avoid overfitting, I’m going to group people into logical human age groups.
  2. Each Cabin value begins with a letter. This letter likely carries more information than the number that follows, so I will keep only the letter.
  3. The Fare is another continuous value that should be simplified, for example by binning it into a few ranges.
  4. Extract the information from the “Name” feature. Rather than using the full name, I will extract the last name and the name prefix (Mr, Mrs, etc.) and add them as new features.
  5. Finally, we need to remove unnecessary features.
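
The code for these transformations is not shown here, so below is a minimal sketch of how they could look with pandas, assuming the standard Titanic column names (Age, Cabin, Fare, Name, Ticket, Embarked); the bin edges and helper names are illustrative, not the exact values used for the results later in this article:

def simplify_ages(df):
    # Group continuous ages into labelled bins; missing ages get their own bucket
    df['Age'] = df['Age'].fillna(-0.5)
    bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
    labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
    df['Age'] = pd.cut(df['Age'], bins, labels=labels)
    return df

def simplify_cabins(df):
    # Keep only the leading cabin letter, 'N' for missing values
    df['Cabin'] = df['Cabin'].fillna('N').apply(lambda cabin: cabin[0])
    return df

def simplify_fares(df):
    # Bin the continuous fare into a few coarse ranges
    df['Fare'] = df['Fare'].fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    labels = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    df['Fare'] = pd.cut(df['Fare'], bins, labels=labels)
    return df

def format_name(df):
    # Extract the last name and the name prefix (Mr, Mrs, etc.) as new features
    df['Lname'] = df['Name'].apply(lambda name: name.split(' ')[0])
    df['NamePrefix'] = df['Name'].apply(lambda name: name.split(' ')[1])
    return df

def drop_features(df):
    # Remove columns that will not be used as features
    return df.drop(['Ticket', 'Name', 'Embarked'], axis=1)

def transform_features(df):
    df = simplify_ages(df)
    df = simplify_cabins(df)
    df = simplify_fares(df)
    df = format_name(df)
    df = drop_features(df)
    return df

data_train = transform_features(data_train)
data_test = transform_features(data_test)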

Encoding Features:

The next step is to encode the labels. The LabelEncoder converts each unique string into a number, making the data more flexible and usable by various algorithms. The result is a scary array of numbers for humans, but beautiful for machines:
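
The encoding code is also not shown here, so this is a minimal sketch using scikit-learn's LabelEncoder on the categorical columns produced by the cleaning step above; the feature list is illustrative:

from sklearn import preprocessing

def encode_features(df_train, df_test):
    # Fit each encoder on the combined data so train and test share the same mapping
    features = ['Fare', 'Cabin', 'Age', 'Sex', 'Lname', 'NamePrefix']
    df_combined = pd.concat([df_train[features], df_test[features]])
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

data_train, data_test = encode_features(data_train, data_test)
data_train.head()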

Now the next step is to divide the data into training and testing sets. I will use one variable to store all the features minus the value we want to predict, and another variable to store only the value we want to predict.

For this task, I will randomly shuffle and split the data into four variables: I train on 80% of the data and test on the remaining 20%:

from sklearn.model_selection import train_test_split

X_all = data_train.drop(['Survived', 'PassengerId'], axis=1)
y_all = data_train['Survived']

num_test = 0.20
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=23)

Fitting and Tuning the Machine Learning Algorithm:

Now is the time to determine which algorithm will provide the best model. In this task, I am going with the RandomForestClassifier, but you can also use any other classifier here, such as Support Vector Machines or Naive Bayes:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Instantiate the classifier with the chosen hyperparameters
clf = RandomForestClassifier(criterion='entropy', max_depth=5,
                             max_features='log2', n_estimators=9)

# Fit on the training split and evaluate on the held-out 20%
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
0.804469273743
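
Hyperparameters like the criterion, maximum depth, maximum features, and number of estimators used above are usually found by a search rather than set by hand. The following is a minimal sketch of how such a search could be done with scikit-learn's GridSearchCV; the parameter grid is illustrative, not the exact grid used for this model:

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score

# Illustrative parameter grid (assumed values, not taken from the original run)
parameters = {'n_estimators': [4, 6, 9],
              'max_features': ['log2', 'sqrt'],
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10]}

# Run the grid search with accuracy as the scoring metric
grid_obj = GridSearchCV(RandomForestClassifier(), parameters,
                        scoring=make_scorer(accuracy_score))
grid_obj = grid_obj.fit(X_train, y_train)

# Keep the best estimator found by the search and refit it on the training set
clf = grid_obj.best_estimator_
clf.fit(X_train, y_train)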

Now we need to use KFold cross-validation to validate our machine learning model. KFold cross-validation helps us judge how good the model really is: it divides the data into 10 folds and runs the algorithm 10 times, using a different fold as the test set on each iteration:
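
The cross-validation code is not shown here, so below is a minimal sketch of how it could look with scikit-learn's KFold; the helper name run_kfold is illustrative:

from sklearn.model_selection import KFold

def run_kfold(clf):
    # 10-fold cross-validation over the full feature matrix
    kf = KFold(n_splits=10)
    outcomes = []
    for fold, (train_index, test_index) in enumerate(kf.split(X_all), start=1):
        X_tr, X_te = X_all.iloc[train_index], X_all.iloc[test_index]
        y_tr, y_te = y_all.iloc[train_index], y_all.iloc[test_index]
        clf.fit(X_tr, y_tr)
        predictions = clf.predict(X_te)
        accuracy = accuracy_score(y_te, predictions)
        outcomes.append(accuracy)
        print("Fold {0} accuracy: {1}".format(fold, accuracy))
    print("Mean Accuracy: {0}".format(np.mean(outcomes)))

run_kfold(clf)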

Fold 1 accuracy: 0.8111111111111111
Fold 2 accuracy: 0.8764044943820225
Fold 3 accuracy: 0.8089887640449438
Fold 4 accuracy: 0.8764044943820225
Fold 5 accuracy: 0.8314606741573034
Fold 6 accuracy: 0.8089887640449438
Fold 7 accuracy: 0.7865168539325843
Fold 8 accuracy: 0.7528089887640449
Fold 9 accuracy: 0.8764044943820225
Fold 10 accuracy: 0.8089887640449438
Mean Accuracy: 0.8238077403245943

Testing the Model:

Now we need to predict on the actual test data, which has already been passed through the same cleaning and encoding steps as the training data:

ids = data_test['PassengerId']
predictions = clf.predict(data_test.drop('PassengerId', axis=1))
output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })
output.head()
   PassengerId  Survived
0          892         0
1          893         1
2          894         0
3          895         0
4          896         1

I hope you liked this article on a complete machine learning project walkthrough for beginners. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal

I am a programmer from India, and I am here to guide you with Data Science, Machine Learning, Python, and C++ for free. I hope you will learn a lot in your journey towards Coding, Machine Learning and Artificial Intelligence with me.
