Loan Approval Prediction is one of the problems that Machine Learning is used to solve at banks and other financial institutions. It means using the credit history data of loan applicants, together with Machine Learning algorithms, to build a system that can predict whether a loan will be approved. So, if you want to learn how to use Machine Learning for Loan Approval Prediction, this article is for you. In this article, I’ll take you through the task of Loan Approval Prediction with Machine Learning using Python.
Loan Approval Prediction: Overview and Dataset
Loan approval prediction involves the analysis of various factors, such as the applicant’s financial history, income, credit rating, employment status, and other relevant attributes. By leveraging historical loan data and applying machine learning algorithms, businesses can build models to determine loan approvals for new applicants.
I found an ideal dataset for the task of Loan Approval Prediction. You can download the dataset from here.
In the section below, I’ll take you through the task of Loan Approval Prediction with Machine Learning using Python.
Loan Approval Prediction using Python
Let’s start this task by importing the necessary Python libraries and the dataset:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('loan_prediction.csv')
print(df.head())
    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001002   Male      No          0      Graduate            No
1  LP001003   Male     Yes          1      Graduate            No
2  LP001005   Male     Yes          0      Graduate           Yes
3  LP001006   Male     Yes          0  Not Graduate            No
4  LP001008   Male      No          0      Graduate            No

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0
1             4583             1508.0       128.0             360.0
2             3000                0.0        66.0             360.0
3             2583             2358.0       120.0             360.0
4             6000                0.0       141.0             360.0

   Credit_History Property_Area Loan_Status
0             1.0         Urban           Y
1             1.0         Rural           N
2             1.0         Urban           Y
3             1.0         Urban           Y
4             1.0         Urban           Y
I’ll drop the loan id column and move further:
df = df.drop('Loan_ID', axis=1)
Now let’s have a look if the data has missing values or not:
df.isnull().sum()
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
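Raw counts are easier to judge as a share of all rows. Here is a minimal sketch of that idea on a small made-up DataFrame (the `toy` frame and its values are my own illustration, not the loan dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are made up for illustration)
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, np.nan],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction missing
missing_share = toy.isnull().mean().mul(100).round(1)
print(missing_share.sort_values(ascending=False))
```

On the real data, the same expression applied to `df` would show which columns lose the largest share of their values.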
The data has missing values in some of the categorical columns and some numerical columns. Let’s have a look at the descriptive statistics of the dataset before filling in the missing values:
print(df.describe())
       ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
count       614.000000         614.000000  592.000000         600.00000
mean       5403.459283        1621.245798  146.412162         342.00000
std        6109.041673        2926.248369   85.587325          65.12041
min         150.000000           0.000000    9.000000          12.00000
25%        2877.500000           0.000000  100.000000         360.00000
50%        3812.500000        1188.500000  128.000000         360.00000
75%        5795.000000        2297.250000  168.000000         360.00000
max       81000.000000       41667.000000  700.000000         480.00000

       Credit_History
count      564.000000
mean         0.842199
std          0.364878
min          0.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          1.000000
Now let’s fill in the missing values. In categorical columns, we can fill in missing values with the mode of each column. The mode represents the value that appears most often in the column and is an appropriate choice when dealing with categorical data:
# Fill missing values in categorical columns with mode
# Assigning the result back avoids chained in-place modification, which recent pandas versions warn about
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Married'] = df['Married'].fillna(df['Married'].mode()[0])
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode()[0])
df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])
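If you are wondering why the code indexes the mode with `[0]`: `mode()` returns a Series, because a column can have more than one most frequent value. A small sketch on made-up values (the `married` and `tied` Series are only for illustration):

```python
import pandas as pd

# A column with one clear most frequent value and one missing entry
married = pd.Series(["Yes", "No", "Yes", None], name="Married")
modes = married.mode()              # missing values are skipped by default
filled = married.fillna(modes[0])   # take the first (here: only) mode
print(filled.tolist())

# With a tie, mode() returns every tied value in sorted order,
# so [0] silently picks the first one
tied = pd.Series(["Yes", "No"])
tied_modes = tied.mode()
print(tied_modes.tolist())
```

So `mode()[0]` is always defined for a non-empty column, but with ties it picks whichever tied value sorts first.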
To fill in the missing values of numerical columns, we have to choose appropriate measures:
- We can fill in the missing values of the loan amount column with the median value. The median is an appropriate measure to fill in missing values when dealing with skewed distributions or when outliers are present in the data;
- We can fill in the missing values of the loan amount term column with the mode value of the column. Since the term of the loan amount is a discrete value, the mode is an appropriate metric to use;
- We can fill in the missing values of the credit history column with the mode value. Since credit history is a binary variable (0 or 1), the mode represents the most common value and is an appropriate choice for filling in missing values.
# Fill missing values in LoanAmount with the median
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())

# Fill missing values in Loan_Amount_Term with the mode
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0])

# Fill missing values in Credit_History with the mode
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])
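To see why the median is the safer fill value when outliers are present, here is a tiny worked example. The `amounts` values are made up, with one deliberately large loan amount:

```python
import numpy as np
import pandas as pd

# Four observed amounts (one of them an outlier) and one missing value
amounts = pd.Series([100.0, 120.0, 130.0, np.nan, 700.0])

mean_fill = amounts.fillna(amounts.mean())      # mean is pulled up by 700
median_fill = amounts.fillna(amounts.median())  # median ignores the size of 700

print("filled with mean:  ", mean_fill[3])
print("filled with median:", median_fill[3])
```

The mean of the observed values is 262.5, more than double any typical amount in the column, while the median is 125.0, which sits among the typical values.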
Exploratory Data Analysis
Now let’s have a look at the distribution of the loan status column:
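import plotly.express as px

loan_status_count = df['Loan_Status'].value_counts()
# Pass the counts as values so that each slice is sized by the number of loans
fig_loan_status = px.pie(loan_status_count,
                         names=loan_status_count.index,
                         values=loan_status_count.values,
                         title='Loan Approval Status')
fig_loan_status.show()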
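The same distribution can also be read as numbers, which helps when judging how balanced the two classes are. A minimal sketch, using a made-up `status` Series rather than the real column:

```python
import pandas as pd

# Made-up approval labels for illustration only
status = pd.Series(["Y", "Y", "N", "Y", "N", "Y", "Y", "Y"], name="Loan_Status")

# normalize=True returns proportions instead of raw counts
share = status.value_counts(normalize=True)
print(share)
```

On the real data, `df['Loan_Status'].value_counts(normalize=True)` would give the proportions behind the pie chart.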

Now let’s have a look at the distribution of the gender column:
gender_count = df['Gender'].value_counts()
fig_gender = px.bar(gender_count,
                    x=gender_count.index,
                    y=gender_count.values,
                    title='Gender Distribution')
fig_gender.show()

Now let’s have a look at the distribution of the marital status column:
married_count = df['Married'].value_counts()
fig_married = px.bar(married_count,
                     x=married_count.index,
                     y=married_count.values,
                     title='Marital Status Distribution')
fig_married.show()

Now let’s have a look at the distribution of the education column:
education_count = df['Education'].value_counts()
fig_education = px.bar(education_count,
                       x=education_count.index,
                       y=education_count.values,
                       title='Education Distribution')
fig_education.show()

Now let’s have a look at the distribution of the self-employment column:
self_employed_count = df['Self_Employed'].value_counts()
fig_self_employed = px.bar(self_employed_count,
                           x=self_employed_count.index,
                           y=self_employed_count.values,
                           title='Self-Employment Distribution')
fig_self_employed.show()

Now let’s have a look at the distribution of the Applicant Income column:
fig_applicant_income = px.histogram(df, x='ApplicantIncome',
                                    title='Applicant Income Distribution')
fig_applicant_income.show()

Now let’s have a look at the relationship between the income of the loan applicant and the loan status:
fig_income = px.box(df, x='Loan_Status',
                    y='ApplicantIncome',
                    color="Loan_Status",
                    title='Loan_Status vs ApplicantIncome')
fig_income.show()

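The “ApplicantIncome” column contains outliers, which I’ll remove before moving further. The code below uses the interquartile range (IQR) rule: it keeps only the rows whose income lies within 1.5 × IQR below the first quartile and 1.5 × IQR above the third quartile: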
# Calculate the IQR
Q1 = df['ApplicantIncome'].quantile(0.25)
Q3 = df['ApplicantIncome'].quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df = df[(df['ApplicantIncome'] >= lower_bound) & (df['ApplicantIncome'] <= upper_bound)]
Now let’s have a look at the relationship between the income of the loan co-applicant and the loan status:
fig_coapplicant_income = px.box(df,
                                x='Loan_Status',
                                y='CoapplicantIncome',
                                color="Loan_Status",
                                title='Loan_Status vs CoapplicantIncome')
fig_coapplicant_income.show()

The income of the loan co-applicant also contains outliers. Let’s remove the outliers from this column as well:
# Calculate the IQR
Q1 = df['CoapplicantIncome'].quantile(0.25)
Q3 = df['CoapplicantIncome'].quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df = df[(df['CoapplicantIncome'] >= lower_bound) & (df['CoapplicantIncome'] <= upper_bound)]
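Since the same IQR filter is applied to two columns, it could be wrapped in a small helper. This is only a sketch: the function names `iqr_bounds` and `drop_iqr_outliers` and the toy incomes are my own, not part of the original code:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_iqr_outliers(frame, column, k=1.5):
    """Keep only rows whose value in `column` lies within the IQR bounds."""
    lower, upper = iqr_bounds(frame[column], k)
    # between() is inclusive on both ends, matching the >= / <= filter above
    return frame[frame[column].between(lower, upper)]

# Made-up incomes with one extreme value
toy = pd.DataFrame({"ApplicantIncome": [2500, 3000, 3500, 4000, 4500, 81000]})
lower, upper = iqr_bounds(toy["ApplicantIncome"])
filtered = drop_iqr_outliers(toy, "ApplicantIncome")

print(lower, upper)
print(filtered["ApplicantIncome"].tolist())
```

For these toy values, Q1 is 3125 and Q3 is 4375, so the bounds are 1250 and 6250 and only the 81000 row is dropped.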
Now let’s have a look at the relationship between the loan amount and the loan status:
fig_loan_amount = px.box(df, x='Loan_Status',
                         y='LoanAmount',
                         color="Loan_Status",
                         title='Loan_Status vs LoanAmount')
fig_loan_amount.show()

Now let’s have a look at the relationship between credit history and loan status:
fig_credit_history = px.histogram(df, x='Credit_History', color='Loan_Status',
                                  barmode='group',
                                  title='Loan_Status vs Credit_History')
fig_credit_history.show()
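The relationship in this chart can also be expressed as approval rates per credit history group. A minimal sketch with `pd.crosstab`, on a made-up `toy` frame rather than the real data:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
})

# normalize="index" turns each row of counts into proportions that sum to 1
rates = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")
print(rates)
```

Each row then reads as "of the applicants with this credit history, what share was approved or rejected".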

Now let’s have a look at the relationship between the property area and the loan status:
fig_property_area = px.histogram(df, x='Property_Area', color='Loan_Status',
                                 barmode='group',
                                 title='Loan_Status vs Property_Area')
fig_property_area.show()

Data Preparation and Training Loan Approval Prediction Model
In this step, we will:
- convert categorical columns into numerical ones;
- split the data into training and test sets;
- scale the numerical features;
- train the loan approval prediction model.
# Convert categorical columns to numerical using one-hot encoding
cat_cols = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']
df = pd.get_dummies(df, columns=cat_cols)

# Split the dataset into features (X) and target (y)
X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the numerical columns using StandardScaler
scaler = StandardScaler()
numerical_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

from sklearn.svm import SVC
model = SVC(random_state=42)
model.fit(X_train, y_train)
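As an alternative to encoding and scaling by hand, the same steps could be bundled into a scikit-learn Pipeline. This is a sketch of that different approach, not the code used above; the `toy` data and the choice of columns are made up so the example runs on its own:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Made-up applicants for illustration only
toy = pd.DataFrame({
    "ApplicantIncome": [2500, 4000, 3200, 6000, 1800, 5200, 3000, 4500],
    "Credit_History": [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "Property_Area": ["Urban", "Rural", "Urban", "Semiurban",
                      "Rural", "Urban", "Semiurban", "Rural"],
    "Loan_Status": ["Y", "Y", "N", "Y", "N", "Y", "N", "Y"],
})
X = toy.drop("Loan_Status", axis=1)
y = toy["Loan_Status"]

# Scale the numerical columns and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["ApplicantIncome", "Credit_History"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Property_Area"]),
])

# Preprocessing and model are fitted together in a single call
pipeline = Pipeline([("preprocess", preprocess), ("svc", SVC(random_state=42))])
pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions)
```

The design benefit is that the scaler and encoder are fitted only on whatever data is passed to `fit`, so the same object can later transform and score new applicants without repeating the preprocessing manually.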
Now let’s make predictions on the test set:
y_pred = model.predict(X_test)
print(y_pred)
['Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'N' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'N' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y']
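Predictions alone don't tell us how good the model is, so a natural next step is to compare them with the true labels. The article does not report a score, so here is only a sketch of how that comparison works, using made-up `y_true` and `y_hat` lists instead of the real `y_test` and `y_pred`:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for illustration only
y_true = ["Y", "Y", "N", "Y", "N", "N", "Y", "Y"]
y_hat = ["Y", "Y", "Y", "Y", "N", "Y", "Y", "N"]

# Share of predictions that match the true label
acc = accuracy_score(y_true, y_hat)

# Rows are true labels, columns are predicted labels, in the order given by `labels`
cm = confusion_matrix(y_true, y_hat, labels=["N", "Y"])

print("Accuracy:", acc)
print(cm)
```

With the real data, `accuracy_score(y_test, y_pred)` and `confusion_matrix(y_test, y_pred)` would apply the same idea. The confusion matrix is worth checking here because a model that predicts mostly 'Y' can reach a decent accuracy while still misclassifying many rejected loans.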
Now let’s add the predicted loan approval values to the testing set as a new column in a DataFrame called X_test_df and show the predicted loan approval values alongside the original features:
# Convert X_test to a DataFrame
X_test_df = pd.DataFrame(X_test, columns=X_test.columns)

# Add the predicted values to X_test_df
X_test_df['Loan_Status_Predicted'] = y_pred
print(X_test_df.head())
     ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
277        -0.544528          -0.037922   -0.983772          0.305159
84         -0.067325          -0.931554   -1.571353         -1.430680
275        -0.734870           0.334654   -0.298262          0.305159
392        -0.824919           0.522317   -0.200332          0.305159
537        -0.267373          -0.931554   -0.454950          0.305159

     Credit_History  Gender_Female  Gender_Male  Married_No  Married_Yes  \
277        0.402248              0            1           0            1
84         0.402248              0            1           0            1
275        0.402248              0            1           0            1
392        0.402248              0            1           0            1
537        0.402248              0            1           1            0

     Dependents_0  ...  Dependents_2  Dependents_3+  Education_Graduate  \
277             1  ...             0              0                   1
84              0  ...             0              0                   1
275             0  ...             0              0                   1
392             1  ...             0              0                   1
537             0  ...             1              0                   1

     Education_Not Graduate  Self_Employed_No  Self_Employed_Yes  \
277                       0                 1                  0
84                        0                 1                  0
275                       0                 1                  0
392                       0                 1                  0
537                       0                 1                  0

     Property_Area_Rural  Property_Area_Semiurban  Property_Area_Urban  \
277                    0                        0                    1
84                     0                        0                    1
275                    0                        1                    0
392                    0                        0                    1
537                    0                        1                    0

    Loan_Status_Predicted
277                     Y
84                      Y
275                     Y
392                     Y
537                     Y

[5 rows x 21 columns]
So this is how you can train a Machine Learning model to predict loan approval using Python.
Summary
In this article, we cleaned a loan dataset by filling in missing values, explored it with Plotly, removed income outliers using the IQR rule, and trained a support vector classifier to predict loan approvals for new applicants. I hope you liked this article on Loan Approval Prediction with Machine Learning using Python. Feel free to ask valuable questions in the comments section below.