RFM Analysis & Customer Churn Analysis for Hotel/Mall Enterprise in China
Remember to follow me on Medium to get instant notification of my new article.
At the end of semester, while I was trying to wrap up my semester, I got a phone call from my friend who works as a business analyst for one of the biggest hotel and mall enterprises in China. He tried to come up with a solution to segment the users/shoppers who stay in the hotel and also shops at the mall during their stay. He gave me about 1000 row of sample data of the members who stay at hotel and another 2500-ish row of sample transaction data which indicates shopper’s behavior in mall.
Therefore, I came up two set of analyses that might help him capture more precisely the user behavior in both mall and hotel. The first one is RFM analysis, and the other is customer churn analysis. In the article, I will only show the code and thinking process of my analysis. The result will not be shown due to confidentiality. However, I think these two analyses are quite powerful and intuitive across different industries, so feel free to check my code and fork it for further usage.
Skills and Tools
Tools: Python Spyder
Skills: Data processing and pivot table with pandas and numpy and clustering with sklearn
Methodology and Execution Plan
1. RFM Model
2. Customer Churn on Spending at Mall
RFM Model
First I import the required package: pandas, numpy, matplotlib and datetime.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import datetime as dt
Second, I try to extract the RFM metric from the data. RFM stands for recency, frequency and monetary. It is an intuitive and direct measurement of customer behaviors, which can be applied to online or off-line business. Here I try to get how recently a member come to hotel, how frequently they come to hotel and how much they spend during the stay (the room price they pick). The benefit of RFM is to capture user’s behavior in the logic of time. We all know that member can be categorized into different status, such as new member, active member, silent member, sleeping member, and churn member. RFM is able to pinpoint and classify the member status.
# You can set any time you want that matters to the company
NOW = dt.datetime(2018,12,20)
rfmtable = df.groupby('UserID').agg({'DepatureTime':lambda x:(NOW-x.max()).days, 'UserID': lambda x:len(x), 'TotalSpent':lambda x:x.sum()})rfmtable.rename(columns = {'DepatureTime':'recency', 'UserID':'frequency','TotalSpent':'monetary'}, inplace = True)
Third, this step is optional. I use simple quantile (0.25, 0.5, 0.75), just like boxplot, to define each player into 4 different quantile groups based on data itself and create 3 extra column showing the group a customer belonging to, recency, frequency and monetary. Analyst and manager can further filter out the customers for different purpose, either the customers who come the most recent, pay the most or come many times.
quantiles = rfmtable.quantile(q=[0.25, 0.5, 0.75])
quantiles = quantiles.to_dict()def RScore(x,p,d):
if x <= d[p][0.25]:
return 1
elif x <= d[p][0.50]:
return 2
elif x <= d[p][0.75]:
return 3
else:
return 4def FMScore(x,p,d):
if x <= d[p][0.25]:
return 4
elif x <= d[p][0.50]:
return 3
elif x <= d[p][0.75]:
return 2
else:
return 1seg_rfmtable = rfmtable
seg_rfmtable['r_quartile'] = seg_rfmtable['recency'].apply(RScore, args = ('recency', quantiles,))
seg_rfmtable['f_quartile'] = seg_rfmtable['frequency'].apply(FMScore, args = ('frequency', quantiles,))
seg_rfmtable['m_quartile'] = seg_rfmtable['monetary'].apply(FMScore, args = ('monetary', quantiles,))seg_rfmtable.reset_index(level = 0, inplace = True)
For me, classification with quantile measurement is not sufficient enough. The power of RFM is that with these three metrics, analyst can scan a customer in 360 degree. Therefore, I decide to add another measurement in the original table of RFM, the customer’s total spend in mall and conduct cluster analysis to create five cluster to segment the member behavior within the hotel and mall.
# First combine transaction data from mall data. Remember to group by user Id as well
catmap = {'LOW':1, 'MEDIUMLOW':2, 'MEDIEUMHIGH':3, 'HIGH':4 ,'VIP':5}
user = user.applymap(lambda s: catmap.get(s) if s in catmap else s)single_user = user.groupby(['UserID']).aggregate({'spent_all':lambda x:x.sum(),'user_category':lambda x:round(x.median(),0)})
single_user.reset_index(level=0, inplace = True)# Clustering based on these features
final_metrics = pd.merge(seg_rfmtable, single_user, how='left', on = ['UserID'])from scipy import stats
from sklearn.cluster import KMeanscluster_data = final_metrics.dropna(subset=['spent_all'])
cols = ['recency', 'frequency','monetary','spent_all']
kmean_data = stats.zscore(cluster_data[cols])km = KMeans(n_clusters = 5, random_state=0)km.fit(kmean_data)
After k-mean clustering, I label each customer with cluster result, and I further conduct the summary analysis for each cluster. The company can conclude the mean value of each cluster and based on the classification, design related strategy and marketing campaigns toward different segments.
# label the row / customer
labels = km.labels_
cluster = labels.tolist()# summarize each cluster
final_cluster = pd.DataFrame(dict(cluster_label = cluster, UserId = cluster_data['UserID']))verify_final = pd.merge(cluster_data, final_cluster, how = 'left', on = ['UserID'])cols2 = ['recency', 'frequency','monetary','spent_all', 'cluster_label', 'user_category']verify_final[cols2].groupby(['cluster_label']).agg({'recency':lambda x:x.mean(), \
'frequency':lambda x:x.mean(), \
'monetary':lambda x:x.mean(), \
'spent_all':lambda x:x.mean(), \
'user_category': lambda x:x.mode(), \
'cluster_label':'size'})
Customer Churn Analysis on Spending in Mall
Another question my friend want to dig deeper is that for the guest in hotel, how the spending patterns at mall are during the stay. The mall the enterprise owns is one of the biggest malls in China. There are a little zoo, many restaurants and bars, a concert hall, a movie theater, amusement park, and other facilities within the mall.
Some loyal customers will stay in the hotel for many days and enjoy themselves in the mall. And the assumption my friend made is that the longer they stay, the amount they spend will indeed accumulate but the spending for each day will decrease.
Therefore, I decide to conduct another research about customer’s spending during their stay.
First, I use the hotel user data to filter out those guests who stay more than 3 days and less than 10 days to set the time frame of selected users.
stay3 = df[df['Stay_days']>=3]
stay3 = stay3[stay3['Stay_days']<=10]
Next I filter out the customer in the mall from user data with the unique ID list from the first step. I create a column called ‘n_days’ with userid, year, and month and give each row a rank to showcase the nth day the user spend in the mall.
userid_stay3 = list(stay3.UserID.unique())user_stay3 = user[user['UserID'].isin(userid_stay3)]user_stay_adt = user_stay3.sort_values(['UserID','Date'], ascending=[True, False])user_stay_adt['Year'] = user_stay_adt['Date'].dt.year
user_stay_adt['Month'] = user_stay_adt['Date'].dt.monthuser_stay_adt['n_day']=user_stay_adt.groupby(['UserID','Year','Month'])['Date'].rank(ascending=True,method='dense').astype(int)
Finally, I can use simple pivot table to retrieve for each user during each of their stay, how much they spend on nth day.
user_stayday_adt = user_stay_adt.pivot_table(index=['UserID','Year','Month'], columns = 'n_day', values='spent_all')
I munge another 9 columns, each of which are the change rate from previous date. Take column [‘2–1’] for example, it is the change rate of the spending on 2nd day compared to spending on 1st day, and so on so forth.
longstay['2-1'] = (longstay[2]-longstay[1])/longstay[1]
longstay['3-2'] = (longstay[3]-longstay[2])/longstay[2]
longstay['4-3'] = (longstay[4]-longstay[3])/longstay[3]
longstay['5-4'] = (longstay[5]-longstay[4])/longstay[4]
longstay['6-5'] = (longstay[6]-longstay[5])/longstay[5]
longstay['7-6'] = (longstay[7]-longstay[6])/longstay[6]
longstay['8-7'] = (longstay[8]-longstay[7])/longstay[7]
longstay['9-8'] = (longstay[9]-longstay[8])/longstay[8]
longstay['10-9'] = (longstay[10]-longstay[9])/longstay[9]
I design two functions to depict churn.
The first one is called day_decrease_portion. The user can pass the data frame into this function, and it will return the percentage of customers who decrease their spending among all the customer for that specific nth day (from day 2 to day 10). If d3 is 0.63, which means for those who stay in hotel on their 3rd day during their visit, 63% of users have decreased their spending in the mall.
def day_decrease_portion(longstay):
d2 = round(sum(longstay['2-1']<0)/len(longstay['2-1']),3)
d3 = round(sum(longstay['3-2']<0)/len(longstay.dropna(subset=['3-2'])),3)
d4 = round(sum(longstay['4-3']<0)/len(longstay.dropna(subset=['4-3'])),3)
d5 = round(sum(longstay['5-4']<0)/len(longstay.dropna(subset=['5-4'])),3)
d6 = round(sum(longstay['6-5']<0)/len(longstay.dropna(subset=['6-5'])),3)
d7 = round(sum(longstay['7-6']<0)/len(longstay.dropna(subset=['7-6'])),3)
d8 = round(sum(longstay['8-7']<0)/len(longstay.dropna(subset=['8-7'])),3)
d9 = round(sum(longstay['9-8']<0)/len(longstay.dropna(subset=['9-8'])),3)
d10 = round(sum(longstay['10-9']<0)/len(longstay.dropna(subset=['10-9'])),3)
return (d2, d3, d4,d5,d6,d7,d8,d9,d10)
The second function is called day_decrease_rate. Same as first function, the user can pass the data frame into the function. And it will return the spending decrease rate for those who decrease their spending on specific nth day. And it will return 9 values from day 2 to day 10.
def day_decrease_rate(longstay):
d2 = round(longstay[longstay['2-1']<0]['2-1'].mean(),3)
d3 = round(longstay[longstay['3-2']<0]['3-2'].mean(),3)
d4 = round(longstay[longstay['4-3']<0]['4-3'].mean(),3)
d5 = round(longstay[longstay['5-4']<0]['5-4'].mean(),3)
d6 = round(longstay[longstay['6-5']<0]['6-5'].mean(),3)
d7 = round(longstay[longstay['7-6']<0]['7-6'].mean(),3)
d8 = round(longstay[longstay['8-7']<0]['8-7'].mean(),3)
d9 = round(longstay[longstay['9-8']<0]['9-8'].mean(),3)
d10 = round(longstay[longstay['10-9']<0]['10-9'].mean(),3)
return (d2, d3, d4,d5,d6,d7,d8,d9,d10)
With these functions, my friend is able to see on the specific nth day how big is the portion of customers decrease their spending and how much the spending drops. Marketing team is able to conclude related method to prolong the drop.
The bonus last step is I create user category data frame based on the company standard for customer category based on the first day of spending, which includes low, median-low, median-high, high and VIP. Analysts can pass these data frames into my two functions and see the change of number in portion and decrease spending rate.
def usercat(s):
if s[1] <= 200:
return 1
elif s[1] <= 500:
return 2
elif s[1] <= 800:
return 3
elif s[1] < = 1000:
return 4
else:
return 5
longstay['user_category'] = longstay.apply(usercat, axis = 1)longstay_low = longstay[longstay['user_category']==1]
longstay_medlow = longstay[longstay['user_category']==2]
longstay_medhigh = longstay[longstay['user_category']==3]
longstay_high= longstay[longstay['user_category']==4]
longstay_vip = longstay[longstay['user_category']==5]
Conclusion
I utilize what I have learned in this semester to tackle the real world problem posed by my friend. It is not deep enough for sure, but I got quite good feedback from him for initiating some directions and some hand-on analyses.
RFM is definitely a good and easy model to use in different scenario and it is very powerful with cluster analysis. Trying to put it into your industry might lead to insightful findings. Why not give it a shot?
If you like the article, feel free to give me 5+ claps
If you want to read more articles like this, give me 10+ claps
If you want to read articles with different topics, give me 15+ claps and leave the comment hereThank for the reading