Cohort Analysis with Python

A cohort is a group of subjects which share a defining feature. We can observe the behaviour of a cohort over time and compare it to other cohorts. In this article, I’m going to present a data science tutorial on Cohort Analysis with Python.

What is Cohort Analysis?

A cohort is a group within a population or an area of study whose members share something in common within a specified period. For example, the people born in India in 2000 form a cohort in a study of the number of births in a country. Likewise, in business problems, cohorts are groups of customers or users. For example:

  1. Users who purchased a subscription to the app in a given period.
  2. Users who cancelled a subscription during the same month.

Cohort analysis makes it easy to analyze user behaviour and trends without having to look at the behaviour of each user individually.

Why Cohort Analysis?

Cohort analysis is important for the growth of a business because of the specificity of the information it provides. Its most valuable feature is that it helps companies answer targeted questions by examining only the relevant data. Some of the advantages of cohort analysis in a business are:

  1. It helps to understand how the behaviour of users can affect the business in terms of acquisition and retention.
  2. It helps to analyze the customer churn rate.
  3. It also helps in calculating the lifetime value of a customer.
  4. It helps in finding the points where we need to increase engagement with the customer.

Types of Cohorts

There are three types of cohorts:

  1. Time Cohort
  2. Behaviour Cohort
  3. Size Cohort

Time cohorts are customers who have signed up for a product or service during a specified period. Analysis of these cohorts shows the behaviour of customers based on when they started using the company’s products or services. The time can be monthly or quarterly or even daily.

Behaviour cohorts are customers who have purchased a product or subscribed to a service in the past. They group customers according to the type of product or service to which they have subscribed. Customers who signed up for basic services may have different needs than those who signed up for advanced services. Understanding the needs of different cohorts can help a business design tailor-made services or products for particular segments.

Size cohorts refer to the different sizes of customers who purchase the company’s products or services. This categorization can be based on the amount spent in a certain period after acquisition, or on the type of product that accounted for most of the customer’s order value in a given period.
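
To make the three types concrete, here is a minimal sketch on a made-up customer table. The column names, the values and the spend thresholds are all assumptions chosen for illustration; they do not come from the dataset used later in this article.

```python
import pandas as pd

# Hypothetical toy data: every column name and value here is illustrative.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "signup_date": pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-02-11", "2021-02-28"]
    ),
    "plan": ["basic", "advanced", "basic", "advanced"],
    "first_quarter_spend": [40.0, 350.0, 90.0, 1200.0],
})

# Time cohort: the month in which the customer signed up.
customers["time_cohort"] = customers["signup_date"].dt.to_period("M").astype(str)

# Behaviour cohort: the type of service the customer subscribed to.
customers["behaviour_cohort"] = customers["plan"]

# Size cohort: a spend bucket. The bin edges are arbitrary assumptions.
customers["size_cohort"] = pd.cut(
    customers["first_quarter_spend"],
    bins=[0, 100, 500, float("inf")],
    labels=["small", "medium", "large"],
)

print(customers[["customer_id", "time_cohort", "behaviour_cohort", "size_cohort"]])
```

The same customer belongs to one cohort of each type, so the three groupings can be analyzed separately or combined.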

Cohort Analysis with Python

I hope you now know what cohort analysis is and why companies do it. In this section, I will take you through a data science tutorial on cohort analysis with Python. I will start this task by importing the necessary Python libraries and the dataset:

# Import libraries
import numpy as np  # linear algebra
import pandas as pd  # data processing and file I/O (e.g. pd.read_excel)
import datetime as dt

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For machine learning algorithms
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_excel('Online Retail.xlsx')
df.head()
   InvoiceNo  StockCode  Description                          Quantity  InvoiceDate          UnitPrice  CustomerID  Country
0  536365     85123A     WHITE HANGING HEART T-LIGHT HOLDER          6  2010-12-01 08:26:00       2.55     17850.0  United Kingdom
1  536365     71053      WHITE METAL LANTERN                         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom
2  536365     84406B     CREAM CUPID HEARTS COAT HANGER              8  2010-12-01 08:26:00       2.75     17850.0  United Kingdom
3  536365     84029G     KNITTED UNION FLAG HOT WATER BOTTLE         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom
4  536365     84029E     RED WOOLLY HOTTIE WHITE HEART.              6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom

df.info()
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null datetime64[ns]
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB

So there is some missing data in the Description and CustomerID columns. Let’s check that:

df.isnull().sum()
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

Rows without a CustomerID cannot be assigned to any cohort, so we drop them:

df = df.dropna(subset=['CustomerID'])

Now let’s check and clean the duplicate data:

df.duplicated().sum()
5225
df = df.drop_duplicates()
df.describe()
            Quantity      UnitPrice     CustomerID
count  401604.000000  401604.000000  401604.000000
mean       12.183273       3.474064   15281.160818
std       250.283037      69.764035    1714.006089
min    -80995.000000       0.000000   12346.000000
25%         2.000000       1.250000   13939.000000
50%         5.000000       1.950000   15145.000000
75%        12.000000       3.750000   16784.000000
max     80995.000000   38970.000000   18287.000000

Note that the minimum unit price is 0 and the minimum quantity is negative. Neither describes a valid purchase, so we keep only the rows with a positive quantity and a positive unit price:

df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]
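
The cleaning steps so far can be collected into one function. This is only a sketch: the function name clean_transactions and the toy rows are hypothetical, and the final reset_index call is my own addition for a tidy index. The column names match the dataset above.

```python
import pandas as pd

def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from this section (hypothetical helper)."""
    # 1. Drop rows that cannot be tied to a customer.
    cleaned = raw.dropna(subset=["CustomerID"])
    # 2. Drop exact duplicate rows.
    cleaned = cleaned.drop_duplicates()
    # 3. Keep only rows with a positive quantity and a positive unit price.
    keep = (cleaned["Quantity"] > 0) & (cleaned["UnitPrice"] > 0)
    return cleaned[keep].reset_index(drop=True)

# Toy rows, invented for illustration: one valid row, one duplicate,
# one negative quantity, one zero price, and one missing customer.
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3", "A4"],
    "Quantity": [6, 6, -2, 3, 1],
    "UnitPrice": [2.55, 2.55, 3.39, 0.0, 1.25],
    "CustomerID": [17850.0, 17850.0, 17850.0, 13047.0, None],
})

result = clean_transactions(toy)
print(result)
```

Only the first row survives, because each of the other four rows fails exactly one of the checks.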

Data Preparation for Cohort Analysis

With the data cleaning done, we can now prepare the data for a cohort analysis with Python. For the cohort analysis there are a few labels we need to create:

  1. Billing Period: A string representation of the year and month of a single transaction/invoice.
  2. Cohort Group: A string representation of the year and month of a customer’s first purchase. This label is common to all invoices for a particular customer.
  3. Cohort Period / Cohort Index: An integer representation of a customer’s stage in their “lifespan”. The number represents the number of months since the first purchase.

[Figure: cohort analysis retention table]
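
One possible way to build the three labels with pandas is sketched below. The toy invoices are invented for illustration, and the column names BillingPeriod, CohortGroup and CohortIndex are my own assumptions. The index here starts at 0 for the month of the first purchase; some tutorials start it at 1 instead.

```python
import pandas as pd

# Toy invoices, invented for illustration.
invoices = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2010-12-01", "2011-01-15", "2011-03-02",
        "2011-01-10", "2011-02-20",
        "2011-02-05",
    ]),
})

# Billing Period: year and month of each invoice, as a string.
invoices["BillingPeriod"] = invoices["InvoiceDate"].dt.strftime("%Y-%m")

# Date of each customer's first purchase, repeated on all of their invoices.
first_purchase = invoices.groupby("CustomerID")["InvoiceDate"].transform("min")

# Cohort Group: year and month of the first purchase, as a string.
invoices["CohortGroup"] = first_purchase.dt.strftime("%Y-%m")

# Cohort Index: whole calendar months between the first purchase and the invoice.
year_diff = invoices["InvoiceDate"].dt.year - first_purchase.dt.year
month_diff = invoices["InvoiceDate"].dt.month - first_purchase.dt.month
invoices["CohortIndex"] = year_diff * 12 + month_diff

print(invoices)
```

Customer 1 first bought in December 2010, so their March 2011 invoice gets a Cohort Index of 3.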

Customer retention is a very useful metric for understanding how many of your customers are still active. Retention gives you the percentage of active customers compared to the total number of customers.
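
As a rough sketch, a retention table can be computed by counting the unique active customers per cohort and month, then dividing by the size of each cohort. The labelled rows below are invented, and the column names CohortGroup and CohortIndex are assumptions. I also assume the cohort size is the number of customers active in month 0.

```python
import pandas as pd

# Toy labelled invoices, invented for illustration.
labelled = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3, 4, 4],
    "CohortGroup": ["2010-12", "2010-12", "2010-12", "2010-12",
                    "2010-12", "2011-01", "2011-01"],
    "CohortIndex": [0, 1, 0, 2, 0, 0, 1],
})

# Unique active customers for each cohort (rows) and month offset (columns).
active = (
    labelled.groupby(["CohortGroup", "CohortIndex"])["CustomerID"]
    .nunique()
    .unstack("CohortIndex")
)

# Cohort size: customers active in the month of their first purchase.
cohort_size = active[0]

# Retention: share of each cohort that is still active in each later month.
retention = active.divide(cohort_size, axis=0)

print(retention.round(2))
```

Cells for months that a cohort has not reached yet, or in which nobody was active, stay empty (NaN) rather than zero.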

[Figure: cohort analysis]
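
A retention table is often easier to read as a heatmap. The sketch below draws one with seaborn from a small hand-written table. The retention values, the colour map and the title are assumptions for illustration, not results from the Online Retail data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-written retention table, invented for illustration.
retention = pd.DataFrame(
    [[1.0, 0.33, 0.33], [1.0, 1.0, float("nan")]],
    index=pd.Index(["2010-12", "2011-01"], name="CohortGroup"),
    columns=pd.Index([0, 1, 2], name="CohortIndex"),
)

fig, ax = plt.subplots(figsize=(6, 3))

# Annotate each cell as a percentage; seaborn leaves NaN cells blank.
sns.heatmap(
    retention,
    annot=True,
    fmt=".0%",
    vmin=0.0,
    vmax=1.0,
    cmap="Blues",
    ax=ax,
)
ax.set_title("Retention rate by cohort")
fig.tight_layout()
# Call fig.savefig(...) or, with an interactive backend, plt.show() to view it.
```

Fixing vmin and vmax to 0 and 1 keeps the colour scale comparable between different retention tables.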

I hope you liked this data science tutorial on Cohort Analysis with Python. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal