Know User’s Behaviors Better with Cohort Analysis in Python

Henry Feng
7 min read · Mar 17, 2019


Intro

As a product analyst or marketing analyst, it is always a good idea to know how your users behave in a product, or respond to a marketing event, over time. User acquisition is a hugely significant goal for a product. However, a product manager or marketing manager often cares more about whether a user is retained on the platform and keeps using the product frequently.

In this article, I will use Python to conduct a cohort analysis, with the full code provided. The code can be applied to any kind of online website, platform, or application to build a better understanding of its users.

Dataset

I used the “Online Retail Data Set” from the UCI Machine Learning Repository. It is a transactional data set containing all transactions between 12/01/2010 and 12/09/2011 for a UK-based online retailer, with around half a million instances. It is relatively clean, which makes it well suited for exploring customer behavior, especially with cohort analysis.

Tools

Tools: Python Jupyter Notebook

Skills: Data munging, cleaning, and pivoting with Pandas; visualization with Seaborn and Matplotlib

Methodology

  1. Data set exploration and cleaning
  2. Create a cohort and conduct cohort analysis
  3. Visualize the cohort analysis results

Data exploration & cleaning

First of all, I loaded in the needed packages for the analysis, including pandas, numpy, datetime, seaborn and matplotlib.

import numpy as np
import pandas as pd
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt

Next, it is time to load the data set. It is saved in Excel format, so it is simple to load with pandas, and then we can check the top five rows.

df = pd.read_excel('Online_Retail.xlsx')
df.head(5)

As shown, there are eight features in the data frame. I won’t use all of them; only quantity, invoice date, unit price, and customer ID are needed.

The file has about half a million rows.

df.shape
>> (541909, 8)

Finally, let’s check for missing values. Running the code below shows about 135K rows without a customer ID. I decided to drop them all, since there is no logical way to impute a customer ID from the existing data.

df.isna().sum()
>>>
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0

df1 = df.dropna(subset=['CustomerID'])
df1.shape
>>> (406829, 8)

Finally, I can inspect the clean data frame with the info function. About 400K rows remain, and I also notice that the invoice date column is already of datetime type, so no further transformation is needed.

df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 406829 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo 406829 non-null object
StockCode 406829 non-null object
Description 406829 non-null object
Quantity 406829 non-null int64
InvoiceDate 406829 non-null datetime64[ns]
UnitPrice 406829 non-null float64
CustomerID 406829 non-null float64
Country 406829 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)

Create Cohort & Conduct Cohort Analysis

Before creating cohorts, I need to munge two columns regarding months: one is invoice month, the other is cohort month.

With the helper function “get_month”, I am able to create the invoice month column first.

def get_month(x):
    return dt.datetime(x.year, x.month, 1)

df1['InvoiceMonth'] = df1['InvoiceDate'].apply(get_month)
df1['CohortMonth'] = df1.groupby('CustomerID')['InvoiceMonth'].transform('min')

As for the cohort month, it is the first month in which a specific user shopped at this online retailer. The groupby and transform functions are used to create it.
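On a toy frame (hypothetical data, not the retail data set), the groupby/transform step works like this:

```python
import pandas as pd

# Hypothetical mini data set: two customers, three invoices.
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceMonth': pd.to_datetime(['2011-01-01', '2011-03-01', '2011-02-01']),
})

# transform('min') broadcasts each customer's earliest invoice month
# back onto every one of that customer's rows.
toy['CohortMonth'] = toy.groupby('CustomerID')['InvoiceMonth'].transform('min')
print(toy['CohortMonth'].dt.strftime('%Y-%m').tolist())
# → ['2011-01', '2011-01', '2011-02']
```

Both of customer 1’s rows receive January (their first month), while customer 2’s row keeps February.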

Finally, I created a cohort index for each row: the month difference between the invoice month and the cohort month. This subtraction tells me how many months elapsed between that specific transaction and the first transaction the user made on the website.

def get_date(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day

invoice_year, invoice_month, _ = get_date(df1, 'InvoiceMonth')
cohort_year, cohort_month, _ = get_date(df1, 'CohortMonth')
year_diff = invoice_year - cohort_year
month_diff = invoice_month - cohort_month
df1['CohortIndex'] = year_diff * 12 + month_diff + 1
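As a quick sanity check of the arithmetic (with a hypothetical pair of dates): a user whose first purchase was in December 2010 and who buys again in December 2011 should land in cohort index 13, since the index is 1-based.

```python
import datetime as dt

# Hypothetical dates: first purchase Dec 2010, repeat purchase Dec 2011.
cohort = dt.datetime(2010, 12, 1)
invoice = dt.datetime(2011, 12, 1)

# Same formula as above: year gap in months, plus month gap, plus 1 (1-based).
index = (invoice.year - cohort.year) * 12 + (invoice.month - cohort.month) + 1
print(index)  # → 13
```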

Create 1st Cohort: User number & Retention Rate

The first cohort metric I use here is the number of active users, built with the groupby and pivot_table functions.

cohort_data = df1.groupby(['CohortMonth', 'CohortIndex'])['CustomerID'] \
                 .apply(pd.Series.nunique).reset_index()
cohort_count = cohort_data.pivot_table(index='CohortMonth',
                                       columns='CohortIndex',
                                       values='CustomerID')
cohort_count

From the result above, the first column in the chart is the number of active users in that specific month, and the following columns show how many of those users are still using the service in the second, third, and fourth months.

For example, the value 260 in the 13th column of the 2010–12–01 row means that 260 out of the 948 users in that cohort were still buying things in its 13th month.
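The count-and-pivot step can be illustrated on a toy frame (hypothetical customer IDs and cohorts):

```python
import pandas as pd

# Hypothetical rows: cohort month, cohort index, and customer ID per purchase.
toy = pd.DataFrame({
    'CohortMonth': ['2011-01', '2011-01', '2011-01', '2011-02'],
    'CohortIndex': [1, 1, 2, 1],
    'CustomerID': [1, 2, 1, 3],
})

# Count unique customers per (CohortMonth, CohortIndex) pair, then pivot wide.
counts = (toy.groupby(['CohortMonth', 'CohortIndex'])['CustomerID']
             .nunique().reset_index())
table = counts.pivot_table(index='CohortMonth',
                           columns='CohortIndex',
                           values='CustomerID')
print(table)
# The 2011-01 cohort starts with 2 customers; 1 of them returns in month 2.
```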

Based on this data frame, I can further transform the absolute numbers into percentages to show the retention rate, using the code below.

cohort_size = cohort_count.iloc[:,0]
retention = cohort_count.divide(cohort_size, axis = 0)
retention.round(3) * 100

The first column is, unsurprisingly, a 100% retention rate, and as time goes by, the retention rate drops.
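On hypothetical counts, the row-wise division behaves like this:

```python
import pandas as pd

# Hypothetical cohort counts: rows are cohort months, columns cohort indices.
cohort_count = pd.DataFrame({1: [100.0, 80.0], 2: [60.0, 40.0]},
                            index=['2011-01', '2011-02'])

# Divide every column by the first column (the cohort size), row by row.
cohort_size = cohort_count.iloc[:, 0]
retention = cohort_count.divide(cohort_size, axis=0)
print(retention.round(3) * 100)
# → month 1 is 100% for both cohorts; month 2 is 60% and 50%.
```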

Finally, I am able to visualize the retention data frame with Seaborn heatmap to make the result more understandable.

plt.figure(figsize=(11, 9))
plt.title('Cohort Analysis - Retention Rate')
sns.heatmap(data=retention,
            annot=True,
            fmt='.0%',
            vmin=0.0,
            vmax=0.5,
            cmap='YlGnBu')
plt.show()

From the graph, it can be inferred that some strategy implemented around 2011/11 may have increased returning users: about 50% of users from the 2010/12 cohort returned in 2011/11. Marketing or product analysts could further investigate whether particular campaigns or feature changes led to the higher retention.

Create the 2nd Cohort: Average Quantity Sold

Besides retention, the power of cohorts can be applied to many other metrics. In the second cohort analysis, I show the average purchase quantity on a time-series basis, to see how it changes across months.

cohort_data2 = df1.groupby(['CohortMonth', 'CohortIndex'])['Quantity'] \
                  .mean().reset_index()
average_quantity = cohort_data2.pivot_table(index='CohortMonth',
                                            columns='CohortIndex',
                                            values='Quantity').round(1)
average_quantity

The result can be visualized as well.

plt.figure(figsize=(11, 9))
plt.title('Cohort Analysis - Average Quantity')
sns.heatmap(data=average_quantity,
            annot=True,
            cmap='BuGn')
plt.show()

It can be observed that in the most recent cohorts and months, the average quantity purchased on the website declined. This is a warning signal that users are buying less merchandise, and the reason needs further investigation.

Create the 3rd Cohort: Average Sales

Before creating this cohort, I computed a total sales column by multiplying quantity and unit price. The following steps are quite similar to the second cohort analysis.

df1['TotalSale'] = df1['Quantity'] * df1['UnitPrice']
cohort_data3 = df1.groupby(['CohortMonth', 'CohortIndex'])['TotalSale'] \
                  .mean().reset_index()
average_sales = cohort_data3.pivot_table(index='CohortMonth',
                                         columns='CohortIndex',
                                         values='TotalSale').round(1)
average_sales

The result is visualized below.

plt.figure(figsize=(11, 9))
plt.title('Cohort Analysis - Average Sales')
sns.heatmap(data=average_sales,
            annot=True,
            cmap='Blues')
plt.show()

The situation for average sales parallels that of average quantity: the retailer faces a significant decline in recent months. How to sustain buying behavior will be an issue worth thinking through for the product and marketing teams.

Conclusion

In this article, I demonstrated simple code for cohort analysis. It is a powerful tool for tracking user behavior in a product, and it is highly flexible: the time period can be changed from month to week or even day, and multiple metrics can be used to measure diverse behaviors. I believe this tool will help you better understand your users and, ultimately, better optimize your product. Happy analyzing.
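For example, a weekly version of the same cohort logic could be sketched like this (toy data; bucketing dates with `to_period('W')` is one reasonable choice, not the only one):

```python
import pandas as pd

# Hypothetical purchases by two customers.
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceDate': pd.to_datetime(['2011-01-03', '2011-01-12', '2011-01-05']),
})

# Bucket each invoice into the Monday that starts its week.
toy['InvoiceWeek'] = toy['InvoiceDate'].dt.to_period('W').dt.start_time
toy['CohortWeek'] = toy.groupby('CustomerID')['InvoiceWeek'].transform('min')

# 1-based week index, analogous to the monthly CohortIndex above.
toy['CohortIndex'] = ((toy['InvoiceWeek'] - toy['CohortWeek']).dt.days // 7) + 1
print(toy['CohortIndex'].tolist())  # → [1, 2, 1]
```

From here, the same groupby, pivot_table, and heatmap steps apply unchanged.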
