Credit Scoring and Segmentation using Python

Credit scoring and segmentation refer to the process of evaluating the creditworthiness of individuals or businesses and dividing them into distinct groups based on their credit profiles. It aims to assess the likelihood of borrowers repaying their debts and helps financial institutions make informed decisions regarding lending and managing credit risk. If you want to learn how to calculate credit scores and segment customers based on their credit scores, this article is for you. In this article, I will take you through the task of Credit Scoring and Segmentation using Python.

Credit Scoring and Segmentation: Overview

The process of calculating credit scores and segmenting customers based on their credit scores involves several steps. First, relevant data about borrowers, such as payment history, credit utilization, credit history, and credit mix, is collected and organized. Then a scoring model, which can be anything from a simple weighted formula to a statistical model, is applied to this data to generate a credit score for each borrower.

These credit scores are numerical representations of the borrower’s creditworthiness and indicate the likelihood of default or timely repayment. Once the credit scores are calculated, customers are segmented into different risk categories or credit tiers based on predefined thresholds.

This segmentation helps financial institutions assess the credit risk associated with each customer and make informed decisions regarding loan approvals, interest rates, and credit limits. By categorizing customers into segments, financial institutions can better manage their lending portfolios and effectively mitigate the risk of potential defaults.
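As a minimal sketch of threshold-based segmentation, the snippet below assigns a handful of scores to tiers with `pd.cut`. The scores, the tier boundaries, and the tier names are made up for illustration; they are not taken from the dataset or from any lender's policy.

```python
import pandas as pd

# Hypothetical scores and tier boundaries, chosen only for illustration.
scores = pd.Series([310, 455, 620, 705, 790], name="Credit Score")
bin_edges = [0, 400, 600, 750, 1000]
tier_names = ["Very Low", "Low", "Good", "Excellent"]

# pd.cut places each score in the right-closed interval it falls into:
# (0, 400], (400, 600], (600, 750], (750, 1000].
tiers = pd.cut(scores, bins=bin_edges, labels=tier_names)

print(pd.DataFrame({"Credit Score": scores, "Tier": tiers}))
```

Fixed thresholds like these are easy to explain to customers and regulators; later in this article the boundaries are instead learned from the data with clustering.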

So to get started with the task of credit scoring and segmentation, we first need to have appropriate data. I found an ideal dataset for this task. You can download the dataset from here.

Credit Scoring and Segmentation using Python

Now let’s get started with the task of Credit Scoring and Segmentation by importing the necessary Python libraries and the dataset:

import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"

data = pd.read_csv("credit_scoring.csv")
print(data.head())
   Age  Gender Marital Status Education Level Employment Status  \
0   60    Male        Married          Master          Employed   
1   25    Male        Married     High School        Unemployed   
2   30  Female         Single          Master          Employed   
3   58  Female        Married             PhD        Unemployed   
4   32    Male        Married        Bachelor     Self-Employed   

   Credit Utilization Ratio  Payment History  Number of Credit Accounts  \
0                      0.22           2685.0                          2   
1                      0.20           2371.0                          9   
2                      0.22           2771.0                          6   
3                      0.12           1371.0                          2   
4                      0.99            828.0                          2   

   Loan Amount  Interest Rate  Loan Term   Type of Loan  
0      4675000           2.65         48  Personal Loan  
1      3619000           5.19         60      Auto Loan  
2       957000           2.76         12      Auto Loan  
3      4731000           6.57         60      Auto Loan  
4      3289000           6.28         36  Personal Loan  

Below is the description of all the features in the data:

  1. Age: This feature represents the age of the individual.
  2. Gender: This feature captures the gender of the individual.
  3. Marital Status: This feature denotes the marital status of the individual.
  4. Education Level: This feature represents the highest level of education attained by the individual.
  5. Employment Status: This feature indicates the current employment status of the individual.
  6. Credit Utilization Ratio: This feature reflects the ratio of credit used by the individual compared to their total available credit limit.
  7. Payment History: It represents the monthly net payment behaviour of each customer, taking into account factors such as on-time payments, late payments, missed payments, and defaults.
  8. Number of Credit Accounts: It represents the count of active credit accounts the person holds.
  9. Loan Amount: It indicates the monetary value of the loan.
  10. Interest Rate: This feature represents the interest rate associated with the loan.
  11. Loan Term: This feature denotes the duration or term of the loan.
  12. Type of Loan: It includes categories like “Personal Loan,” “Auto Loan,” or potentially other types of loans.

Now let’s have a look at column insights before moving forward:

print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Age                        1000 non-null   int64  
 1   Gender                     1000 non-null   object 
 2   Marital Status             1000 non-null   object 
 3   Education Level            1000 non-null   object 
 4   Employment Status          1000 non-null   object 
 5   Credit Utilization Ratio   1000 non-null   float64
 6   Payment History            1000 non-null   float64
 7   Number of Credit Accounts  1000 non-null   int64  
 8   Loan Amount                1000 non-null   int64  
 9   Interest Rate              1000 non-null   float64
 10  Loan Term                  1000 non-null   int64  
 11  Type of Loan               1000 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB
None

Now let’s have a look at the descriptive statistics of the data:

print(data.describe())
               Age  Credit Utilization Ratio  Payment History  \
count  1000.000000               1000.000000      1000.000000   
mean     42.702000                  0.509950      1452.814000   
std      13.266771                  0.291057       827.934146   
min      20.000000                  0.000000         0.000000   
25%      31.000000                  0.250000       763.750000   
50%      42.000000                  0.530000      1428.000000   
75%      54.000000                  0.750000      2142.000000   
max      65.000000                  1.000000      2857.000000   

       Number of Credit Accounts   Loan Amount  Interest Rate    Loan Term  
count                1000.000000  1.000000e+03    1000.000000  1000.000000  
mean                    5.580000  2.471401e+06      10.686600    37.128000  
std                     2.933634  1.387047e+06       5.479058    17.436274  
min                     1.000000  1.080000e+05       1.010000    12.000000  
25%                     3.000000  1.298000e+06       6.022500    24.000000  
50%                     6.000000  2.437500e+06      10.705000    36.000000  
75%                     8.000000  3.653250e+06      15.440000    48.000000  
max                    10.000000  4.996000e+06      19.990000    60.000000  

Now let’s have a look at the distribution of the credit utilization ratio in the data:

credit_utilization_fig = px.box(data, y='Credit Utilization Ratio',
                                title='Credit Utilization Ratio Distribution')
credit_utilization_fig.show()
[Figure: Credit Utilization Ratio Distribution (box plot)]
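The numbers a box plot summarises can also be computed directly. The sketch below uses a small made-up sample in place of `data['Credit Utilization Ratio']` and applies the common 1.5 × IQR rule for flagging outliers. Plotly's quartile convention may differ slightly from the pandas default used here, so treat the values as approximate counterparts of what the chart shows.

```python
import pandas as pd

# Small made-up sample standing in for data['Credit Utilization Ratio'].
ratios = pd.Series([0.05, 0.20, 0.22, 0.50, 0.53, 0.75, 0.99])

# Quartiles with pandas' default linear interpolation.
q1, median, q3 = ratios.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are conventionally treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ratios[(ratios < lower_fence) | (ratios > upper_fence)]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, outliers={len(outliers)}")
```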

Now let’s have a look at the distribution of the loan amount in the data:

loan_amount_fig = px.histogram(data, x='Loan Amount', 
                               nbins=20, 
                               title='Loan Amount Distribution')
loan_amount_fig.show()
[Figure: Loan Amount Distribution (histogram)]

Now let’s have a look at the correlation in the data:

numeric_df = data[['Credit Utilization Ratio', 
                   'Payment History', 
                   'Number of Credit Accounts', 
                   'Loan Amount', 'Interest Rate', 
                   'Loan Term']]
correlation_fig = px.imshow(numeric_df.corr(), 
                            title='Correlation Heatmap')
correlation_fig.show()
[Figure: Correlation Heatmap]
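A heatmap is convenient for a quick look, but a ranked list of pairs can be easier to read. The helper below, `strongest_pairs`, is a hypothetical name introduced here for illustration; it is shown on a toy DataFrame with made-up numbers, and `numeric_df` could be passed in its place.

```python
import numpy as np
import pandas as pd

def strongest_pairs(frame: pd.DataFrame, top_n: int = 3) -> pd.Series:
    """Return the top_n column pairs ranked by absolute Pearson correlation."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal of 1.0 values is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top_n)

# Toy frame (made-up numbers) in place of numeric_df.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # exactly 2 * a, so perfectly correlated with "a"
    "c": [5, 3, 4, 1, 2],
})

print(strongest_pairs(toy))
```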

Calculating Credit Scores

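The dataset doesn’t have any feature representing the credit scores of individuals, so we need to calculate them ourselves. There are several widely used credit scoring models, each with its own calculation process. One example is the FICO score, a commonly used credit scoring model in the industry. The exact FICO algorithm is proprietary, but its published category weights are 35% for payment history, 30% for amounts owed, 15% for length of credit history, 10% for new credit, and 10% for credit mix.

Our dataset doesn’t contain all of these categories, so below is a simplified weighted formula that borrows the FICO weights and applies them to the features we do have. It is an illustration of the idea, not the actual FICO calculation: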

# Define the mapping for categorical features
education_level_mapping = {'High School': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4}
employment_status_mapping = {'Unemployed': 0, 'Employed': 1, 'Self-Employed': 2}

# Apply mapping to categorical features
data['Education Level'] = data['Education Level'].map(education_level_mapping)
data['Employment Status'] = data['Employment Status'].map(employment_status_mapping)

# Calculate credit scores using a simplified, FICO-inspired weighted sum
credit_scores = []

for index, row in data.iterrows():
    payment_history = row['Payment History']
    credit_utilization_ratio = row['Credit Utilization Ratio']
    number_of_credit_accounts = row['Number of Credit Accounts']
    education_level = row['Education Level']
    employment_status = row['Employment Status']

    # Apply the FICO-inspired weights to calculate the credit score
    credit_score = (
        (payment_history * 0.35)
        + (credit_utilization_ratio * 0.30)
        + (number_of_credit_accounts * 0.15)
        + (education_level * 0.10)
        + (employment_status * 0.10)
    )
    credit_scores.append(credit_score)

# Add the credit scores as a new column to the DataFrame
data['Credit Score'] = credit_scores

print(data.head())
   Age  Gender Marital Status  Education Level  Employment Status  \
0   60    Male        Married                3                  1   
1   25    Male        Married                1                  0   
2   30  Female         Single                3                  1   
3   58  Female        Married                4                  0   
4   32    Male        Married                2                  2   

   Credit Utilization Ratio  Payment History  Number of Credit Accounts  \
0                      0.22           2685.0                          2   
1                      0.20           2371.0                          9   
2                      0.22           2771.0                          6   
3                      0.12           1371.0                          2   
4                      0.99            828.0                          2   

   Loan Amount  Interest Rate  Loan Term   Type of Loan  Credit Score  
0      4675000           2.65         48  Personal Loan       940.516  
1      3619000           5.19         60      Auto Loan       831.360  
2       957000           2.76         12      Auto Loan       971.216  
3      4731000           6.57         60      Auto Loan       480.586  
4      3289000           6.28         36  Personal Loan       290.797  

Below is how the above code works:

  1. Firstly, it defines mappings for two categorical features: “Education Level” and “Employment Status”. The “Education Level” mapping assigns numerical values to different levels of education, such as “High School” being mapped to 1, “Bachelor” to 2, “Master” to 3, and “PhD” to 4. The “Employment Status” mapping assigns numerical values to different employment statuses, such as “Unemployed” being mapped to 0, “Employed” to 1, and “Self-Employed” to 2.
  2. Next, the code applies the defined mappings to the corresponding columns in the DataFrame. It transforms the values of the “Education Level” and “Employment Status” columns from their original categorical form to the mapped numerical representations.
  3. After that, the code initiates an iteration over each row of the DataFrame to calculate the credit scores for each individual. It retrieves the values of relevant features, such as “Payment History”, “Credit Utilization Ratio”, “Number of Credit Accounts”, “Education Level”, and “Employment Status”, from each row.
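One thing to watch with this kind of mapping: `Series.map` returns NaN for any label that is missing from the dictionary, and a NaN would carry through into the credit score. The sketch below shows a simple check on a made-up column; `'Diploma'` is a hypothetical label that does not appear in the dataset. On the real data, a check like this would need to run before the original column is overwritten.

```python
import pandas as pd

education_level_mapping = {'High School': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4}

# Made-up column containing one label ('Diploma') that the mapping does not cover.
education = pd.Series(['Master', 'Diploma', 'PhD'], name='Education Level')

mapped = education.map(education_level_mapping)

# Any original label whose mapped value is NaN was not in the dictionary.
unmapped_labels = sorted(education[mapped.isna()].unique())

print(mapped.tolist())
print("Labels without a mapping:", unmapped_labels)
```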

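Within the iteration, the weighted formula is applied to calculate the credit score for each individual. The weights mirror FICO’s 35/30/15/10/10 split, but they are applied to the features available in this dataset rather than to FICO’s own categories:

  1. 35% weight for “Payment History”
  2. 30% weight for “Credit Utilization Ratio”
  3. 15% weight for “Number of Credit Accounts”
  4. 10% weight for “Education Level”
  5. 10% weight for “Employment Status”

The calculated credit score is then stored in a list called “credit_scores”. Because the features are used on their raw scales, “Payment History” (which ranges up to 2857) accounts for almost all of each score, while the other four terms together add less than 3 points.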
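The same weighted sum can be written without an explicit loop. The sketch below is one possible vectorized equivalent, run on the first two encoded rows from the output above. It reproduces their scores (940.516 and 831.360) and also shows how much of each score comes from the payment history term. The variable names `weights`, `sample`, and `contributions` are introduced here for illustration.

```python
import pandas as pd

# Weights taken from the formula used in the loop.
weights = pd.Series({
    'Payment History': 0.35,
    'Credit Utilization Ratio': 0.30,
    'Number of Credit Accounts': 0.15,
    'Education Level': 0.10,
    'Employment Status': 0.10,
})

# First two rows of the encoded data shown earlier.
sample = pd.DataFrame({
    'Payment History': [2685.0, 2371.0],
    'Credit Utilization Ratio': [0.22, 0.20],
    'Number of Credit Accounts': [2, 9],
    'Education Level': [3, 1],
    'Employment Status': [1, 0],
})

# Multiplying a DataFrame by a Series aligns the Series index with the
# column names, so each column is scaled by its own weight.
contributions = sample[weights.index] * weights
sample['Credit Score'] = contributions.sum(axis=1)

# Share of each score that comes from the Payment History term alone.
payment_share = contributions['Payment History'] / sample['Credit Score']

print(sample['Credit Score'].round(3).tolist())
print(payment_share.round(4).tolist())
```

On the full dataset, replacing `sample` with `data` should give the same column as the loop. If a more balanced score is wanted, rescaling each feature to a common range before weighting is one option, though that would change the scores shown in this article.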

Segmentation Based on Credit Scores

Now, let’s use the KMeans clustering algorithm to segment customers based on their credit scores:

from sklearn.cluster import KMeans

X = data[['Credit Score']]
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(X)
data['Segment'] = kmeans.labels_
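The choice of four clusters above is a modelling decision. One common way to sanity-check it is to compare silhouette scores for several values of `n_clusters`. The sketch below does this on synthetic one-dimensional scores generated with four well-separated groups; the data is made up, so the result says nothing about the real dataset, where `X` would be `data[['Credit Score']]`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic one-dimensional "scores" with four well-separated groups (illustrative only).
rng = np.random.default_rng(0)
centers = [100, 400, 700, 1000]
X = np.concatenate([rng.normal(c, 10, size=50) for c in centers]).reshape(-1, 1)

# Fit KMeans for several candidate cluster counts and record the silhouette score.
silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    silhouettes[k] = silhouette_score(X, labels)

# A higher silhouette score means tighter, better-separated clusters.
best_k = max(silhouettes, key=silhouettes.get)

for k, score in silhouettes.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k:", best_k)
```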

Now let’s have a look at the segments:

# Convert the 'Segment' column to category data type
data['Segment'] = data['Segment'].astype('category')

# Visualize the segments using Plotly
fig = px.scatter(data, x=data.index, y='Credit Score', color='Segment',
                 color_discrete_sequence=['green', 'blue', 'yellow', 'red'])
fig.update_layout(
    xaxis_title='Customer Index',
    yaxis_title='Credit Score',
    title='Customer Segmentation based on Credit Scores'
)
fig.show()
[Figure: Credit Score Segmentation using KMeans Clustering]

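Now let’s name the segments based on the above clusters and have a look at the segments again. KMeans numbers its clusters arbitrarily, so the cluster-to-name mapping below was read off the plot above; if your cluster numbers come out differently, adjust the mapping to match: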

data['Segment'] = data['Segment'].map({2: 'Very Low', 
                                       0: 'Low',
                                       1: 'Good',
                                       3: "Excellent"})

# Convert the 'Segment' column to category data type
data['Segment'] = data['Segment'].astype('category')

# Visualize the segments using Plotly
fig = px.scatter(data, x=data.index, y='Credit Score', color='Segment',
                 color_discrete_sequence=['green', 'blue', 'yellow', 'red'])
fig.update_layout(
    xaxis_title='Customer Index',
    yaxis_title='Credit Score',
    title='Customer Segmentation based on Credit Scores'
)
fig.show()
[Figure: Customer Segmentation based on Credit Scores, with named segments]
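Since KMeans cluster numbers carry no inherent order, a mapping written by hand can break if the data or the random seed changes. One alternative, sketched below on made-up scores, is to rank the clusters by their centroid and assign the names in that order. The names `order` and `label_to_name` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up scores; in the article this would be data[['Credit Score']].
scores = pd.DataFrame({'Credit Score': [120, 150, 400, 430, 700, 720, 950, 980]})

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scores)

# Rank cluster ids from lowest to highest centroid, then pair them with names.
order = np.argsort(kmeans.cluster_centers_.ravel())
segment_names = ['Very Low', 'Low', 'Good', 'Excellent']
label_to_name = {int(cluster_id): name for cluster_id, name in zip(order, segment_names)}

# Translate each row's cluster id into its segment name.
scores['Segment'] = pd.Series(kmeans.labels_, index=scores.index).map(label_to_name)

print(scores)
```

With this approach the lowest-scoring cluster is always named “Very Low” and the highest “Excellent”, whatever numbers KMeans happens to assign.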

So this is how you can perform credit scoring and segmentation using Python.

Summary

In this article, we calculated a credit score for each customer as a weighted sum of payment history, credit utilization ratio, number of credit accounts, education level, and employment status, and then used KMeans clustering to segment customers into four groups, from Very Low to Excellent. Segments like these help financial institutions make informed decisions regarding lending and managing credit risk. I hope you liked this article on Credit Scoring and Segmentation using Python. Feel free to ask questions in the comments section below.

Aman Kharwal