Titanic Survival with Machine Learning

In this article, I will take you through a very famous case study for machine learning practitioners: predicting Titanic survival with machine learning. I will first introduce the case study and then show you how to build a predictive model for survival.

Machine Learning Case Study: Titanic Survival Analysis

The sinking of the Titanic is one of the most infamous wrecks in history. On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after hitting an iceberg.

Unfortunately, there were not enough lifeboats for everyone on board, resulting in the deaths of 1,502 out of 2,224 passengers and crew. While there was an element of luck in survival, it appears that certain groups of people were more likely to survive than others.

Here, your challenge is to build a predictive model that answers the question, “What types of people were more likely to survive?” using passenger data (i.e. name, age, sex, socio-economic class, etc.).

Predict Titanic Survival with Machine Learning

Now, as a solution to the above case study, I’m using a now-classic dataset on passenger survival on the Titanic. I’ll start by loading the training and test datasets using pandas:

import pandas as pd
train = pd.read_csv('train.csv') 
test = pd.read_csv('test.csv')
train[:4]

Scikit-learn’s algorithms generally cannot handle missing data, so I’ll check each column to see whether it contains missing values:

train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
test.isnull().sum()
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
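
The raw counts above are easier to compare as percentages of the table size. The following is a minimal sketch on a made-up toy table, not the real Titanic data; the helper name missing_summary is hypothetical and only used for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training table (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

def missing_summary(df):
    """Return missing-value counts and percentages for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(df),
    })
    # Keep only columns with at least one missing value, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

On the real data, the same helper could be called as missing_summary(train) and missing_summary(test).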

Data Preparation

In statistics and machine learning examples like this, a typical task is to predict whether a passenger survived based on features in the data. A model is fitted to a training dataset and then evaluated on an out-of-sample test dataset.

I would like to use Age as a predictor, but it has missing values. There are several ways to impute missing data; I’ll keep it simple and use the median of the training dataset to fill in the null values in both tables:

impute_value = train['Age'].median()
train['Age'] = train['Age'].fillna(impute_value)
test['Age'] = test['Age'].fillna(impute_value)
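
The same idea can be expressed with scikit-learn’s SimpleImputer, which learns the median from the training data and reuses it for the test data. This is a sketch on made-up values, assuming that swapping in SimpleImputer is acceptable; the article itself uses fillna:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for the train and test tables (values are made up).
toy_train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 54.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 47.0, np.nan]})

# Learn the median from the training data only, then apply it to both tables.
imputer = SimpleImputer(strategy="median")
toy_train["Age"] = imputer.fit_transform(toy_train[["Age"]]).ravel()
toy_test["Age"] = imputer.transform(toy_test[["Age"]]).ravel()

print(imputer.statistics_)       # median learned from toy_train
print(toy_test["Age"].tolist())  # test gaps filled with the training median
```

Fitting on the training table only keeps information from the test set out of the preprocessing step.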

We now need to specify our models. I’ll add an IsFemale column as the encoded version of the ‘Sex’ column:

train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)
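
One thing to keep in mind with the comparison-based encoding is that any value other than exactly 'female' silently becomes 0. The sketch below, on made-up values with a deliberately inconsistent label, contrasts it with an explicit Series.map, where unknown labels show up as NaN. The IsFemaleMapped column name is hypothetical:

```python
import pandas as pd

# Made-up values; the last label is deliberately written inconsistently.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male", "Female"]})

# Comparison-based encoding: anything not exactly 'female' becomes 0.
toy["IsFemale"] = (toy["Sex"] == "female").astype(int)

# Explicit mapping: labels missing from the dictionary become NaN instead of 0.
toy["IsFemaleMapped"] = toy["Sex"].map({"male": 0, "female": 1})

print(toy)
```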

Next, we decide on some model variables and create NumPy arrays:

predictors = ['Pclass', 'IsFemale', 'Age']
X_train = train[predictors].values
X_test = test[predictors].values
y_train = train['Survived'].values
X_train[:5]
array([[ 3.,  0., 22.],
       [ 1.,  1., 38.],
       [ 3.,  1., 26.],
       [ 1.,  1., 35.],
       [ 3.,  0., 35.]])

Machine Learning Model to Predict Titanic Survival

Now I’m going to use the LogisticRegression model from scikit-learn and create a model instance:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

Now we can fit this model to the training data using the model’s fit method:

model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,

Now, we can make predictions on the test dataset using model.predict:

y_predict = model.predict(X_test)
y_predict[:10]
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])
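
Besides hard 0/1 labels, a fitted LogisticRegression can also report class probabilities through predict_proba. Here is a minimal sketch on made-up rows in the same [Pclass, IsFemale, Age] layout; the data and the toy_model name are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up rows in the [Pclass, IsFemale, Age] layout used above.
X_toy = np.array([
    [3, 0, 22], [1, 1, 38], [3, 1, 26], [1, 1, 35],
    [3, 0, 35], [2, 0, 54], [1, 0, 40], [2, 1, 27],
], dtype=float)
y_toy = np.array([0, 1, 1, 1, 0, 0, 0, 1])

toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(X_toy, y_toy)

# predict_proba returns one column per class, ordered as in classes_.
proba = toy_model.predict_proba(X_toy)
labels = toy_model.predict(X_toy)

print(toy_model.classes_)
print(proba[:3])
```

For a binary problem, the predicted label is the class with the larger probability in each row.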

In practice, there are often many additional layers of complexity in model training. Many models have parameters that can be adjusted, and techniques such as cross-validation can be used for parameter tuning to avoid overfitting to the training data. This can often improve predictive performance or robustness on new data.

Implementing Cross-Validation

Cross-validation works by splitting the training data to simulate out-of-sample prediction. Based on a model score such as accuracy (or mean squared error for regression), one can perform a grid search on the model parameters. Some models, like logistic regression, have estimator classes with built-in cross-validation.

For example, the LogisticRegressionCV class can be used with a parameter indicating how fine-grained a grid search to perform on the model regularization parameter C:

from sklearn.linear_model import LogisticRegressionCV
model_cv = LogisticRegressionCV(Cs=10)
model_cv.fit(X_train, y_train)
LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)
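
After fitting, a LogisticRegressionCV instance exposes the grid it searched (Cs_) and the value of C it selected (C_). The sketch below uses synthetic data from make_classification as a stand-in for X_train and y_train, so the selected value says nothing about the Titanic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary-classification data standing in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Cs=10 asks for a grid of 10 candidate values of C; cv=4 uses four folds.
demo_cv = LogisticRegressionCV(Cs=10, cv=4, max_iter=1000)
demo_cv.fit(X_demo, y_demo)

# np.ravel handles C_ whether it is stored as an array or a scalar.
best_C = float(np.ravel(demo_cv.C_)[0])

print(demo_cv.Cs_)  # the candidate values that were searched
print(best_C)       # the value chosen by cross-validation
```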

To perform cross-validation by hand, you can use the cross_val_score helper function, which handles the data-splitting process. For example, to cross-validate our model with four non-overlapping splits of the training data, we can do:

from sklearn.model_selection import cross_val_score
model = LogisticRegression(C=10)
scores = cross_val_score(model, X_train, y_train, cv=4)
scores
array([0.77578475, 0.79820628, 0.77578475, 0.78828829])
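
To make the splitting concrete, here is a rough sketch of what cross_val_score does for a classifier when cv is an integer: it uses stratified folds without shuffling, fits a fresh copy of the model on each training part, and scores it on the held-out part. The data is synthetic and stands in for X_train and y_train:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary-classification data standing in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
demo_model = LogisticRegression(C=10, max_iter=1000)

# Manual loop: fit on three folds, score (accuracy) on the remaining fold.
manual_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=4).split(X_demo, y_demo):
    fold_model = clone(demo_model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[valid_idx], y_demo[valid_idx]))
manual_scores = np.array(manual_scores)

# The helper function should produce the same four scores.
helper_scores = cross_val_score(demo_model, X_demo, y_demo, cv=4)

print(manual_scores)
print(helper_scores)
```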

The default scoring metric depends on the model, but it is possible to choose an explicit scoring function. Cross-validated models take longer to train, but can often yield better model performance.
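
As a short illustration of choosing the metric explicitly, cross_val_score accepts a scoring argument. The sketch below compares the default for classifiers (accuracy) with "roc_auc" and "f1" on synthetic stand-in data; which metric suits the Titanic task is left open here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
demo_model = LogisticRegression(C=10, max_iter=1000)

# Without a scoring argument, classifiers are scored by accuracy.
accuracy_scores = cross_val_score(demo_model, X_demo, y_demo, cv=4)

# Explicit metrics are selected by name.
auc_scores = cross_val_score(demo_model, X_demo, y_demo, cv=4, scoring="roc_auc")
f1_scores = cross_val_score(demo_model, X_demo, y_demo, cv=4, scoring="f1")

print(accuracy_scores.mean(), auc_scores.mean(), f1_scores.mean())
```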

I hope you liked this article on the case study of predicting Titanic survival with machine learning. Feel free to ask your questions in the comments section below. You can also follow me on Medium for more machine learning topics.

Aman Kharwal