Machine Learning Project Walkthrough with Python

In this article, I will take you through a complete machine learning project walkthrough with the Python programming language. The walkthrough uses algorithms provided by Scikit-Learn, one of the most widely used Python libraries for machine learning.

Below are the steps that are covered in this Machine Learning project walkthrough:

  1. Importing the Data
  2. Data Visualization
  3. Data Cleaning and Transformation
  4. Encoding the Data
  5. Splitting the data into Training and Test sets
  6. Fine Tuning Algorithms
  7. Cross Validate with KFold
  8. Prediction on the test set


Machine Learning Project Walkthrough with Python

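In this section, I will take you through a complete machine learning project with the Python programming language. The data is the Titanic dataset (train.csv and test.csv), and the goal is to predict the Survived column for each passenger. I will start by importing the necessary Python libraries and the dataset: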

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')

Now let's visualize the data. Visualization helps reveal the underlying patterns before we train a machine learning model. The bar plot below shows the average survival rate by port of embarkation, split by sex:

sns.barplot(x="Embarked", y="Survived", hue="Sex", data=data_train)
[Figure: bar plot of survival by port of embarkation and sex in the Titanic dataset]

Data Cleaning and Transformation:

The next step is to clean and transform the data into the form we need. Here is what I will do in this step:

  1. To avoid overfitting, I'm going to group passengers into logical human age groups.
  2. Each cabin begins with a letter. I bet this letter matters much more than the number that follows, so let's cut the number off.
  3. The fare is another continuous value that should be simplified.
  4. Extract information from the "Name" feature. Rather than using the full name, I extract the last name and the name prefix (Mr, Mrs, etc.) and add them as features.
  5. Finally, remove the unnecessary features.
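The five steps above could be sketched roughly as follows. This is a minimal illustration, not necessarily the exact code behind this article: the bin edges, the group labels, the new column names (Lname, NamePrefix), the list of dropped columns and the three-row sample frame are all assumptions chosen for the example.

```python
import pandas as pd

def simplify_ages(df):
    # Step 1: bucket ages; missing ages become -0.5 and land in 'Unknown'
    df = df.copy()
    df["Age"] = df["Age"].fillna(-0.5)
    bins = [-1, 0, 5, 12, 18, 25, 35, 60, 120]  # assumed cut points
    labels = ["Unknown", "Baby", "Child", "Teenager",
              "Student", "Young Adult", "Adult", "Senior"]
    df["Age"] = pd.cut(df["Age"], bins=bins, labels=labels)
    return df

def simplify_cabins(df):
    # Step 2: keep only the leading cabin letter; 'N' marks a missing cabin
    df = df.copy()
    df["Cabin"] = df["Cabin"].fillna("N").str[0]
    return df

def simplify_fares(df):
    # Step 3: bucket fares into coarse ranges (assumed cut points)
    df = df.copy()
    df["Fare"] = df["Fare"].fillna(-0.5)
    bins = [-1, 0, 8, 15, 31, 1000]
    labels = ["Unknown", "1_quartile", "2_quartile", "3_quartile", "4_quartile"]
    df["Fare"] = pd.cut(df["Fare"], bins=bins, labels=labels)
    return df

def format_name(df):
    # Step 4: names look like "Braund, Mr. Owen Harris"
    df = df.copy()
    df["Lname"] = df["Name"].str.split(",").str[0].str.strip()
    df["NamePrefix"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
    return df

def drop_features(df):
    # Step 5: which columns count as unnecessary is an assumption here
    return df.drop(["Ticket", "Name", "Embarked"], axis=1)

def transform_features(df):
    for step in (simplify_ages, simplify_cabins, simplify_fares,
                 format_name, drop_features):
        df = step(df)
    return df

# Tiny hand-made sample so the sketch runs without train.csv
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley",
             "Palsson, Master. Gosta Leonard"],
    "Age": [22.0, 38.0, None],
    "Cabin": [None, "C85", "G6"],
    "Fare": [7.25, 71.2833, 21.075],
    "Ticket": ["A/5 21171", "PC 17599", "349909"],
    "Embarked": ["S", "C", "S"],
})
transformed = transform_features(sample)
print(transformed)
```

On the real data, the same function would be applied to both data_train and data_test so that the two frames keep identical columns.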

Encoding Features:

The next step is to normalize the labels. Scikit-Learn's LabelEncoder converts each unique string into a number, which makes the data usable by a wider range of algorithms. The result is a table of numbers that looks scary to humans but is beautiful for machines.
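A minimal sketch of this encoding step might look like the code below. The helper name encode_features and the two toy frames are hypothetical. Fitting each encoder on the training and test values combined is one common way to make sure that a category appearing only in the test set still gets a code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_features(df_train, df_test, features):
    # Work on copies so the caller's frames are left untouched
    df_train, df_test = df_train.copy(), df_test.copy()
    df_combined = pd.concat([df_train[features], df_test[features]])
    for feature in features:
        # Fit on all values seen in either frame, then encode each frame
        le = LabelEncoder().fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

# Toy frames standing in for the cleaned Titanic data (assumed values)
toy_train = pd.DataFrame({"Sex": ["male", "female", "female"],
                          "Cabin": ["N", "C", "N"]})
toy_test = pd.DataFrame({"Sex": ["female", "male"],
                         "Cabin": ["G", "N"]})

enc_train, enc_test = encode_features(toy_train, toy_test, ["Sex", "Cabin"])
print(enc_train)
print(enc_test)
```

LabelEncoder assigns codes in sorted order of the unique values, so the integers carry no meaning beyond identity. Tree-based models such as random forests usually cope with that, while linear models may read an ordering into the codes that is not really there.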

The next step is to split the data into training and test sets. Here I use one variable (X_all) to store all the features except the value we want to predict, and another variable (y_all) to store only the value we want to predict.

Scikit-Learn's train_test_split then shuffles the data and splits it into four variables. In this case, I train on 80% of the data and test on the remaining 20%:

from sklearn.model_selection import train_test_split

X_all = data_train.drop(['Survived', 'PassengerId'], axis=1)
y_all = data_train['Survived']

num_test = 0.20
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=23)

Fitting and Tuning Machine Learning Algorithm:

Now is the time to determine which algorithm will provide the best model. In this task, I am going with the RandomForestClassifier, but you can also use any other classifier here, such as Support Vector Machines or Naive Bayes:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Non-default settings taken from the fitted-model printout
clf = RandomForestClassifier(criterion='entropy', max_depth=5, max_features='log2',
                             n_estimators=9)
clf.fit(X_train, y_train)

# Evaluate on the held-out 20%
predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
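The tuning part of this step is not shown in the article. One common way to do it is a grid search, sketched below. Everything specific here is an assumption made for illustration: the parameter grid, the 5-fold setting, and the synthetic data from make_classification that stands in for the encoded Titanic features so the example runs on its own. The variable names are deliberately different from the ones used above to avoid overwriting them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded features and the Survived labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=23)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=23)

# Hypothetical grid: every combination is scored with 5-fold cross-validation
param_grid = {
    "n_estimators": [4, 9],
    "criterion": ["entropy", "gini"],
    "max_depth": [3, 5],
    "max_features": ["log2", "sqrt"],
}
grid = GridSearchCV(RandomForestClassifier(random_state=23), param_grid,
                    scoring="accuracy", cv=5)
grid.fit(Xd_train, yd_train)

# best_estimator_ is already refitted on the whole training split
best_clf = grid.best_estimator_
demo_predictions = best_clf.predict(Xd_test)
demo_accuracy = accuracy_score(yd_test, demo_predictions)
print(grid.best_params_)
print(demo_accuracy)
```

With the real data, X_train and y_train would take the place of the synthetic split, and best_estimator_ would play the role of clf.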

Now we use KFold cross-validation to validate our machine learning model. A single train/test split can be lucky or unlucky, so KFold gives a more reliable answer to the question of how good the model really is. It divides our data into 10 folds, then runs the algorithm 10 times, using a different fold as the test set for each iteration:

Fold 1 accuracy: 0.8111111111111111
Fold 2 accuracy: 0.8764044943820225
Fold 3 accuracy: 0.8089887640449438
Fold 4 accuracy: 0.8764044943820225
Fold 5 accuracy: 0.8314606741573034
Fold 6 accuracy: 0.8089887640449438
Fold 7 accuracy: 0.7865168539325843
Fold 8 accuracy: 0.7528089887640449
Fold 9 accuracy: 0.8764044943820225
Fold 10 accuracy: 0.8089887640449438
Mean Accuracy: 0.8238077403245943
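Output in this format could come from a loop like the one below. This is a sketch under assumptions: the helper name run_kfold is hypothetical, and synthetic data is used so the example runs without the Titanic files, so the numbers it prints will differ from the ones above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def run_kfold(model, X, y, n_splits=10):
    # Convert to arrays so positional indexing works for DataFrames too
    X, y = np.asarray(X), np.asarray(y)
    kf = KFold(n_splits=n_splits)
    outcomes = []
    for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
        # Refit from scratch on nine folds, score on the one held out
        model.fit(X[train_idx], y[train_idx])
        acc = accuracy_score(y[test_idx], model.predict(X[test_idx]))
        outcomes.append(acc)
        print(f"Fold {fold} accuracy: {acc}")
    mean_acc = float(np.mean(outcomes))
    print(f"Mean Accuracy: {mean_acc}")
    return outcomes, mean_acc

# Synthetic stand-in for X_all and y_all
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=23)
cv_model = RandomForestClassifier(n_estimators=9, max_depth=5, random_state=23)
fold_scores, mean_score = run_kfold(cv_model, X_cv, y_cv)
```

On the real data the call would presumably be run_kfold(clf, X_all, y_all), so that every passenger in train.csv is used for testing exactly once.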

Testing the Model:

Now we predict on the actual test data, which must first go through the same cleaning and encoding steps as the training data:

ids = data_test['PassengerId']
predictions = clf.predict(data_test.drop('PassengerId', axis=1))
output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })

I hope you liked this article on a complete machine learning project walkthrough for beginners. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal


  1. Really nice of you to share a concise workflow. However, as a complete beginner, I miss the statement of the problem being attacked, and what we want to determine. I know you have compiled 180 different projects, which is an impressively herculean task, but from what I’ve read on data science so far, without an understanding of the problem at hand and without a clear definition of what is it we want to know, there is not much meaning simply from the workflow.

    Congratulations on such an impressive compilation, and for making it available publicly.

  2. I agree with everything that @mverissimoalves commented on. The HUGE list of project/tutorials that you have put together is overwhelmingly impressive. As for this specific tutorial, made for beginners in ML, I’m lost. I have no idea what the point of the script is and what’s really messing with my head is the way that you have laid out the code… does it all go in 1 file, or, does it have to be split into multiple files? If so, where’s that line? You would go from one chunk of code that started at line 1 and ended at line 39 and, then the next block of code started at line 1 and went to line 13. Some of the new chunks of code added library calls which would all normally be put at the top. I hope it’s not too late and that you get to see my comment as I could really use some clarification on the point of the project and maybe, after doing separate code blocks spread throughout the tutorial, at the very bottom maybe put the 1 full script all in one page and if it’s supposed to be split into separate pages, then put each full page at the bottom and label them with names to help avoid the confusion. You could also include a brief description of each script, or, just comment it all into the script/s. Thank you so much, as I have completed and learned a lot from your list of tutorials and plan to continue. Please… KEEP UP THE AMAZING WORK!!!
