A cohort is a group of subjects which share a defining feature. We can observe the behaviour of a cohort over time and compare it to other cohorts. In this article, I’m going to present a data science tutorial on Cohort Analysis with Python.
What is Cohort Analysis?
A cohort is a group within a population or an area of study whose members share something in common within a specified period. For example, the people born in India in 2000 form a cohort for a study of births in a country. Likewise, in business problems, cohorts are groups of customers or users. For example:
- Users who purchased a subscription to the app in a given period.
- Users who cancelled a subscription during the same month.
Cohort analysis makes it easy to analyze user behaviour and trends without having to look at the behaviour of each user individually.
Why Cohort Analysis?
Cohort analysis is important for the growth of a business because of the specificity of the information it provides. Its most valuable feature is that it helps companies answer targeted questions by examining only the relevant data. Some of the advantages of cohort analysis in a business are:
- It helps to understand how the behaviour of users can affect the business in terms of acquisition and retention.
- It helps to analyze the customer churn rate.
- It also helps in calculating the lifetime value of a customer.
- It helps in finding the points where we need to increase engagement with the customer.
Types of Cohorts
There are three types of cohorts:
- Time Cohort
- Behaviour Cohort
- Size Cohort
Time cohorts are customers who have signed up for a product or service during a specified period. Analysis of these cohorts shows the behaviour of customers based on when they started using the company’s products or services. The period can be monthly, quarterly, or even daily.
Behaviour cohorts are customers who have purchased a product or subscribed to a service in the past. This type groups customers according to the product or service to which they have subscribed. Customers who signed up for basic services may have different needs than those who signed up for advanced services. Understanding the needs of different cohorts can help a business design tailor-made services or products for particular segments.
Size cohorts refer to the different sizes of customers who purchase the company’s products or services. This categorization can be based on the amount spent in a certain period after acquisition, or on the type of product that accounts for most of the customer’s order value in a given period.
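The three cohort types above can be illustrated with a small pandas sketch. This is only an illustration under stated assumptions: the toy data, the column names (`signup_date`, `plan`, `spend_first_90d`), and the spend thresholds are all hypothetical and do not come from the dataset used later in this article.

```python
import pandas as pd

# Hypothetical toy data: column names and values are illustrative only.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "signup_date": pd.to_datetime(
        ["2011-01-05", "2011-01-20", "2011-02-03", "2011-02-14"]
    ),
    "plan": ["basic", "advanced", "basic", "advanced"],
    "spend_first_90d": [40.0, 900.0, 150.0, 2500.0],
})

# Time cohort: the month in which the customer signed up.
customers["time_cohort"] = customers["signup_date"].dt.to_period("M").astype(str)

# Behaviour cohort: the type of service the customer subscribed to.
customers["behaviour_cohort"] = customers["plan"]

# Size cohort: bucket customers by spend (thresholds are arbitrary assumptions).
customers["size_cohort"] = pd.cut(
    customers["spend_first_90d"],
    bins=[0, 100, 1000, float("inf")],
    labels=["small", "medium", "large"],
)

print(customers[["customer_id", "time_cohort", "behaviour_cohort", "size_cohort"]])
```

The same customer can belong to one cohort of each type at once, so the choice of cohort depends on the question being asked.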
Cohort Analysis with Python
I hope you now know what cohort analysis is and why companies do it. In this section, I will take you through a data science tutorial on cohort analysis with Python. I will start this task by importing the necessary Python libraries and the dataset:
```python
# import libraries
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For machine learning algorithms
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_excel('Online Retail.xlsx')
df.head()
```
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
```python
df.info()
```

```
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null datetime64[ns]
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
```
So there is some missing data in the Description and CustomerID columns. Let’s check that:
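```python
df.isnull().sum()
```

```
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
```

Cohorts are built per customer, so rows without a CustomerID cannot be used and we drop them:

```python
# Keep only the rows that have a customer identifier
df = df.dropna(subset=['CustomerID'])
```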
Now let’s check and clean the duplicate data:
```python
df.duplicated().sum()
```

```
5225
```

```python
df = df.drop_duplicates()
df.describe()
```
| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 401604.000000 | 401604.000000 | 401604.000000 |
| mean | 12.183273 | 3.474064 | 15281.160818 |
| std | 250.283037 | 69.764035 | 1714.006089 |
| min | -80995.000000 | 0.000000 | 12346.000000 |
| 25% | 2.000000 | 1.250000 | 13939.000000 |
| 50% | 5.000000 | 1.950000 | 15145.000000 |
| 75% | 12.000000 | 3.750000 | 16784.000000 |
| max | 80995.000000 | 38970.000000 | 18287.000000 |
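Note that the minimum unit price is 0 and the minimum quantity is negative. Negative quantities most likely correspond to returned or cancelled orders, so we keep only the rows where both values are positive:

```python
# Keep only rows with a positive quantity and a positive unit price
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]
```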
Data Preparation for Cohort Analysis
With the data cleaning done, we can now prepare the data for a cohort analysis with Python. For the cohort analysis, there are a few labels we need to create:
- Billing period: String representation of the year and month of a single transaction/invoice.
- Cohort Group: A string representation of the year and month of a customer’s first purchase. This label is common to all invoices for a particular customer.
- Cohort Period / Cohort Index: Integer representation of a customer’s stage in their “lifespan”. The number represents the number of months since the first purchase.
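The three labels above can be sketched as follows. This is a minimal sketch on a made-up toy frame, not the author's original code; the column names `InvoiceMonth`, `CohortMonth`, and `CohortIndex` are assumptions chosen for illustration, and the index here starts at 0 for the month of the first purchase (some tutorials start at 1 instead).

```python
import pandas as pd

# Toy stand-in for the cleaned invoice data (values are made up for illustration).
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2010-12-01", "2011-01-15", "2011-03-02",
        "2011-01-10", "2011-02-20",
        "2011-02-05",
    ]),
})

# Billing period: year and month of each invoice, as a string.
toy["InvoiceMonth"] = toy["InvoiceDate"].dt.to_period("M").astype(str)

# Cohort group: year and month of the customer's first purchase,
# repeated on every invoice of that customer.
first_purchase = toy.groupby("CustomerID")["InvoiceDate"].transform("min")
toy["CohortMonth"] = first_purchase.dt.to_period("M").astype(str)

# Cohort index: number of calendar months since the first purchase
# (0 means the invoice falls in the month of the first purchase).
year_diff = toy["InvoiceDate"].dt.year - first_purchase.dt.year
month_diff = toy["InvoiceDate"].dt.month - first_purchase.dt.month
toy["CohortIndex"] = year_diff * 12 + month_diff

print(toy)
```

On the real data, the same steps would be applied to `df` using its `CustomerID` and `InvoiceDate` columns.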

Customer retention is a very useful metric for understanding how many of your customers are still active. Retention gives you the percentage of active customers compared to the total number of customers in the cohort.
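One possible way to compute retention per cohort is sketched below. It assumes each invoice row already carries a cohort month and a cohort index; the toy values and the column names `CohortMonth` and `CohortIndex` are hypothetical and used only for illustration. The cohort size is taken as the number of unique customers at index 0.

```python
import pandas as pd

# Toy input: one row per invoice, already labelled (hypothetical values).
labelled = pd.DataFrame({
    "CustomerID":  [1, 1, 1, 2, 2, 3, 4, 4],
    "CohortMonth": ["2010-12"] * 5 + ["2011-01"] * 3,
    "CohortIndex": [0, 1, 2, 0, 2, 0, 0, 1],
})

# Count the unique active customers for each cohort and each month offset,
# then pivot so that rows are cohorts and columns are cohort indexes.
counts = (
    labelled.groupby(["CohortMonth", "CohortIndex"])["CustomerID"]
    .nunique()
    .unstack("CohortIndex")
)

# Cohort size: customers active in the month of their first purchase.
cohort_sizes = counts[0]

# Retention: share of the original cohort that is active at each offset.
retention = counts.divide(cohort_sizes, axis=0)

print((retention * 100).round(1))
```

Cells for periods a cohort has not reached yet stay empty (NaN), which is why the later cohorts have fewer filled columns.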

I hope you liked this article on a data science tutorial on Cohort Analysis with Python. Feel free to ask your valuable questions in the comments section below.