In this article, I will take you through a famous case study for machine learning practitioners: predicting Titanic survival. I will first introduce the case study and then show how to build a model that predicts survival with machine learning.
Machine Learning Case Study: Titanic Survival Analysis
The sinking of the Titanic is one of the most infamous wrecks in history. On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after hitting an iceberg.
Unfortunately, there were not enough lifeboats for everyone on board, resulting in the deaths of 1,502 out of 2,224 passengers and crew. While there was an element of luck in survival, it appears that certain groups of people were more likely to survive than others.
Here, your challenge is to build a predictive model that answers the question, “What types of people were more likely to survive?” using passenger data (e.g. name, age, sex, and socio-economic class).
Predict Titanic Survival with Machine Learning
To solve this case study, I’m using a now-classic dataset of passenger survival on the Titanic, which sank in 1912. I’ll start by loading the training and test datasets with pandas:
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train[:4]
Scikit-learn’s algorithms generally cannot handle missing data, so I’ll check each column of both tables for missing values, for example with train.isnull().sum() and test.isnull().sum():
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
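Per-column counts like the ones above can be produced directly with pandas. The following is a minimal sketch on a small made-up table (the `toy` DataFrame and its values are assumptions for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a passenger table; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# isnull() flags each missing cell; sum() counts the flags per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# mean() of the same flags gives the share of missing values per column,
# which can help when deciding whether to impute a column or drop it.
missing_share = toy.isnull().mean()
print(missing_share)
```

On the real tables, the same calls would be applied to `train` and `test`.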
In statistics and machine learning examples like this, a typical task is to predict whether a passenger will survive based on the characteristics of the data. A model is fitted to a training data set and then evaluated on an out-of-sample test data set.
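The test table shown above has no Survived column, so predictions on it cannot be scored locally. One common way to approximate out-of-sample evaluation is to hold back part of the labeled training data. The sketch below illustrates the idea on synthetic data; the arrays `X` and `y` and the 25% hold-out size are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends mostly on the first feature, plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold back 25% of the labeled rows; stratify keeps the class balance similar.
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on one part, then score on rows the model has not seen.
clf = LogisticRegression().fit(X_fit, y_fit)
holdout_accuracy = clf.score(X_holdout, y_holdout)
print(holdout_accuracy)
```

The hold-out score is only an estimate, and it varies with the random split.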
I would like to use Age as a predictor, but it has missing values. There are several ways to impute missing data; I’ll take a simple approach and fill the null values in both tables with the median Age of the training dataset:
impute_value = train['Age'].median()
train['Age'] = train['Age'].fillna(impute_value)
test['Age'] = test['Age'].fillna(impute_value)
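As one alternative among the several imputation approaches, scikit-learn's `SimpleImputer` can learn the median on the training data and reuse it on the test data. This is a minimal sketch on made-up ages (the arrays `train_age` and `test_age` are assumptions), not the method used in the rest of this article:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column age data with missing entries.
train_age = np.array([[22.0], [38.0], [np.nan], [35.0], [54.0]])
test_age = np.array([[np.nan], [47.0]])

# fit_transform learns the median from the training column and fills it in.
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train_age)

# transform reuses the training median, so no test information leaks in.
test_filled = imputer.transform(test_age)

print(imputer.statistics_)
print(test_filled.ravel())
```

Learning the fill value from the training data only mirrors the choice made with `fillna` above.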
We now need to specify our model. I’ll add an IsFemale column as a 0/1 encoding of the ‘Sex’ column:
train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)
Next, we decide on some model variables and create NumPy arrays:
predictors = ['Pclass', 'IsFemale', 'Age']
X_train = train[predictors].values
X_test = test[predictors].values
y_train = train['Survived'].values
X_train[:5]
array([[ 3.,  0., 22.],
       [ 1.,  1., 38.],
       [ 3.,  1., 26.],
       [ 1.,  1., 35.],
       [ 3.,  0., 35.]])
Machine Learning Model to Predict Titanic Survival
Now I’m going to use the LogisticRegression model from scikit-learn and create a model instance:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
Now we can fit this model to the training data with its fit method, model.fit(X_train, y_train), which returns the fitted estimator:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
Now, we can make predictions on the test dataset using model.predict:
y_predict = model.predict(X_test)
y_predict[:10]
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])
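Besides hard 0/1 labels, a fitted logistic regression can also report class probabilities through `predict_proba`. The sketch below uses synthetic data (the arrays `X_toy` and `y_toy` are assumptions for illustration) to show how the two outputs relate:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem standing in for the passenger features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

# predict returns class labels; predict_proba returns one column per class.
labels = clf.predict(X_toy[:5])
proba = clf.predict_proba(X_toy[:5])

print(labels)
print(proba.round(3))
```

For a binary problem, the predicted label is 1 when the second probability column exceeds 0.5, so the probabilities show how confident each prediction is.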
In practice, there are often many additional layers of complexity in training models. Many models have parameters that can be adjusted, and techniques such as cross-validation can be used for parameter tuning to avoid overfitting to the training data. This can often improve predictive performance or robustness on new data.
Cross-validation works by splitting the training data to simulate out-of-sample prediction. Based on a model score, such as accuracy for classifiers or mean squared error for regressors, one can perform a grid search on the model parameters. Some models, like logistic regression, have estimator classes with built-in cross-validation.
For example, the LogisticRegressionCV class takes a Cs parameter that controls how fine-grained a grid search is performed over the regularization parameter C; an integer value gives the number of candidate values to try:
from sklearn.linear_model import LogisticRegressionCV

# Pass Cs by keyword; recent scikit-learn versions reject it as a positional argument.
model_cv = LogisticRegressionCV(Cs=10)
model_cv.fit(X_train, y_train)
LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                     random_state=None, refit=True, scoring=None, solver='lbfgs',
                     tol=0.0001, verbose=0)
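For estimators without a built-in cross-validating variant, the same kind of search can be written with scikit-learn's `GridSearchCV`. Here is a minimal sketch on synthetic data; the arrays `X` and `y` and the candidate values in `param_grid` are assumptions picked for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary problem; the real features would be X_train and y_train.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Candidate values for the regularization parameter C (illustrative choices).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is scored by 4-fold cross-validated accuracy.
search = GridSearchCV(LogisticRegression(), param_grid, cv=4, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refit on all the data with the best-scoring C.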
To perform cross-validation by hand, you can use the cross_val_score helper function, which handles the process of splitting data. For example, to validate our model with four non-overlapping divisions of training data, we can do:
from sklearn.model_selection import cross_val_score

model = LogisticRegression(C=10)
scores = cross_val_score(model, X_train, y_train, cv=4)
scores
array([0.77578475, 0.79820628, 0.77578475, 0.78828829])
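To see what "non-overlapping divisions" means, the splitting step can be inspected on its own. With an integer `cv` and a classifier, `cross_val_score` uses stratified K-fold splitting; the sketch below applies `StratifiedKFold` to a tiny made-up dataset (the arrays `X` and `y` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Twelve made-up samples with alternating class labels.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

# Four folds: each sample lands in exactly one validation fold.
splitter = StratifiedKFold(n_splits=4)
test_folds = [test_idx for _, test_idx in splitter.split(X, y)]

for fold_number, test_idx in enumerate(test_folds):
    print(fold_number, test_idx)

# Concatenating the validation folds recovers every row index exactly once.
all_test = np.concatenate(test_folds)
```

Each of the four scores in the array above comes from training on three folds and scoring on the remaining one.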
The default scoring metric depends on the model, but it is possible to choose an explicit scoring function. Cross-validated models take longer to train, but can often yield better model performance.
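An explicit metric can be selected through the `scoring` argument of `cross_val_score`. The following sketch uses synthetic data (the arrays `X` and `y` are assumptions for illustration) and compares the default score with two named metrics:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary problem; the real inputs would be X_train and y_train.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression(C=10)

# With no scoring argument, a classifier is scored by its own score method (accuracy).
scores_default = cross_val_score(clf, X, y, cv=4)

# Named metrics are passed as strings.
scores_acc = cross_val_score(clf, X, y, cv=4, scoring="accuracy")
scores_auc = cross_val_score(clf, X, y, cv=4, scoring="roc_auc")

print(scores_default)
print(scores_auc)
```

Accuracy counts correct labels, while ROC AUC measures how well the predicted probabilities rank survivors above non-survivors, so the two can disagree about which model is better.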
I hope you liked this article on the case study of predicting Titanic survival with machine learning. Feel free to ask questions in the comments section below. You can also follow me on Medium to learn more about machine learning.