Customer Segmentation

If you want to find out who are your best customers, using an old technique RFM matrix principle is still the best in the business. RFM means – Recency, Frequency and Monetary. RFM is basically a customer segmentation technique that works by using the past behaviour of your customers based on their purchases to split the customers into groups.

Let’s See How RFM Works:

As RFM is the technique behind customer segmentation, let’s see how it manages to do that. This technique is based on three main categories, they are:

  • Recency(R) – Days since last purchase
  • Frequency(F) – Total number of purchases
  • Monetary Value(M) – How much total money the customer spent.

Customer Segmentation with Python

Now, let’s proceed with the target of this article, which is to create a customer segmentation system with python. Now as I will use the RFM technique here, so the first thing we need to proceed is data because this technique is all dependent on data of customers expenditure on our products. The dataset I will use in this article can be downloaded below.

Now, let’s import the dataset and get started:

import pandas as pd
import warnings
df = pd.read_excel("Online_Retail.xlsx")
df1 = dfCode language: Python (python)
Image for post

The data contains all the transactions of a UK-based online retailer. All the transactions are between 01 Dec, 2010 and 09 Dec, 2011.

Data Exploration

Now let’s explore the data by looking at –

  1. Missing values in important columns;
  2. Customers’ distribution in each country;
  3. Unit price and Quantity should > 0;
  4. Invoice date should < today.
df1.Country.nunique()Code language: Python (python)
array([‘United Kingdom’, ‘France’, ‘Australia’, ‘Netherlands’, ‘Germany’,
‘Norway’, ‘EIRE’, ‘Switzerland’, ‘Spain’, ‘Poland’, ‘Portugal’,
‘Italy’, ‘Belgium’, ‘Lithuania’, ‘Japan’, ‘Iceland’,
‘Channel Islands’, ‘Denmark’, ‘Cyprus’, ‘Sweden’, ‘Austria’,
‘Israel’, ‘Finland’, ‘Bahrain’, ‘Greece’, ‘Hong Kong’, ‘Singapore’,
‘Lebanon’, ‘United Arab Emirates’, ‘Saudi Arabia’, ‘Czech Republic’,
‘Canada’, ‘Unspecified’, ‘Brazil’, ‘USA’, ‘European Community’,
‘Malta’, ‘RSA’], dtype=object
customer_country.groupby(['Country'])['CustomerID'].aggregate('count').reset_index().sort_values('CustomerID', ascending=False)Code language: Python (python)
Image for post

The above data states that more than 90 percent of the customers are from United Kingdom only. Some researchers states that customer segmentation varies with geography. So I will limit the data to United Kingdom only.

df1 = df1.loc[df1['Country'] == 'United Kingdom']Code language: Python (python)

Now let’s check whether, there are null values in the dataset:

df1.isnull().sum(axix=0)Code language: Python (python)

The above output stated that there are 133,600 missing values in the customerID column, and as we are working on customer segmentation, so we do not need to focus on customer ID, so I will simply remove these missing values in this column.

df1 = df1[pd.notnull(df1['CustomerID'])]Code language: Python (python)

Now let’s remove the negative values in Quantity column:

df1 = df1[(df1['Quantity']&gt;0)]Code language: Python (python)

Now let’s add a new column for total price:

df1['TotalPrice'] = df1['Quantity'] * df1['UnitPrice']Code language: Python (python)

To find out the first and last order dates in the data:

df1['InvoiceDate'].min()Code language: Python (python)
Timestamp(‘2010–12–01 08:26:00’)
df1['InvoiceDate'].max()Code language: Python (python)
Timestamp(‘2011–12–09 12:49:00’)

As recency is calculated for a point in time, and the last invoice date is 2011–12–09, so I will use 2011–12–10 to calculate recency:

import datetime as dt
NOW = dt.datetime(2011,12,10)
df1['InvoiceDate'] = pd.to_datetime(df1['InvoiceDate'])Code language: Python (python)

RFM Customer Segmentation

Now our data is completely ready for customer segmentation, lets move further with this by creating a RFM table:

rfmTable = df1.groupby('CustomerID').agg({'InvoiceDate': lambda x: (NOW - x.max()).days, 'InvoiceNo': lambda x: len(x), 'TotalPrice': lambda x: x.sum()})
rfmTable['InvoiceDate'] = rfmTable['InvoiceDate'].astype(int)
rfmTable.rename(columns={'InvoiceDate': 'recency', 
                         'InvoiceNo': 'frequency', 
                         'TotalPrice': 'monetary_value'}, inplace=True)Code language: Python (python)

To calculate RFM metrics for each customer:

rfmTable.head()Code language: Python (python)
RFM customer segmentation


  • CustomerID 12346 has frequency: 1, monetary value: $77,183.60 and recency: 325 days.
  • CustomerID 12747 has frequency: 103, monetary value: $4,196.01 and recency: 2 days

Split the metrics

The easiest way to split metrics into segments is by using quartiles.

  1. This gives us a starting point for the detailed analysis.
  2. 4 segments are easy to understand and explain.
quantiles = rfmTable.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()Code language: Python (python)

Create a Customer Segment RFM table

segmented_rfm = rfmTableCode language: Python (python)

The lowest recency, highest frequency and best monetary amounts will convey are our best customers:

def RScore(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
        return 4
def FMScore(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]: 
        return 2
        return 1Code language: Python (python)

Add segment numbers

segmented_rfm['r_quartile'] = segmented_rfm['recency'].apply(RScore, args=('recency',quantiles,))
segmented_rfm['f_quartile'] = segmented_rfm['frequency'].apply(FMScore, args=('frequency',quantiles,))
segmented_rfm['m_quartile'] = segmented_rfm['monetary_value'].apply(FMScore, args=('monetary_value',quantiles,))
segmented_rfm.head()Code language: Python (python)
Image for post

RFM segments split the customer base into an imaginary 3D cube which is hard to visualize. However, we can sort it out:

segmented_rfm['RFMScore'] = 
segmented_rfm.head()Code language: Python (python)
customer segmentation

Let’s See The Top 10 of best customers

segmented_rfm[segmented_rfm['RFMScore']=='111'].sort_values('monetary_value', ascending=False).head(10)Code language: Python (python)
customer segmentation

Also, read – Master your Text Analytics Skills.

I hope you liked this article on Customer Segmentation with Python. Feel free to ask question on this topic or any topic you like, and don’t forget to subscribe for the newsletters below, to get email notification if you like my work.

Follow Us:

Articles: 75

Leave a Reply