Customer Funnel Analysis for Online Retailer — Using Pivot Table and Clustering in Python

Henry Feng
10 min readDec 26, 2018


As a product person, you got to know the customer funnel. It can be the funnel for an app, a website, or a service, from acquisition, activation, retention, revenue to referral (AARRR), or it can be the user flow from getting exposed to an advertisement to purchasing the items as the last step. And as a product analyst, web analytics, aka using google analytics, got to be the intro 101 of analyzing user behavior of a website or an application.

As a previous web analyst in Taiwan, I am always looking forward to going further in funnel analysis on websites in the US market. And I finally got the chance to polish my analytics skills with the data set of user behaviors across diverse ad channel of an online retailer. In this article, I will share some of my analytical process and insight in dealing with the data.

Tools & Skills

Main tools: Python, Jupyter Notebook

Skills Used: Pivot table with pandas, visualization with matplotlib, clustering with sklearn

Data Set

It is an online transaction data set from online e-commerce website with nearly 10000 entries and 14 columns. The main variables I focus on the most are sales amount, new customer, position, group name, brand, position name and time to convert.

Position name column contains 4 categories: originator, roster, assist, and converter. Each of the categories represent the step a user went through. A user might go through 1 originator and later go to converter and finish the transaction. Another user might have originator, many assist and finally reach conversion. And for each categories, there will be a group aligned with it, such as CPM, TV ads, or social media. Different ad groups might play diverse roles in the user journey, which is worth analyzing. And the variable “position” indicates which step the user is at that moment.

Problem Set

1. Which ad group played the most significant role for originator? And which ad group lead to most conversion?

2. Do new users and old users behave differently?

3. For both first click and last click model, which channel might bring in most revenue?

4. Are there any diverse groups of users who have distinct behavior on the website? And what can we conclude some action item for that?

Problem Re-definition

From the problem listed above, I simply conclude them into two categories.

1. Prioritization: Product manager and product analyst are always doing prioritization. Among all the ad channel, which is most effective and least effective? With appropriate metrics, i.e. first clicks or last clicks model, I might suggest the level of importance for each channel.

2. Segmentation: Users may behave differently. An product analyst has to provide and label different behavioral pattern for different group of users. Then a product manager and marketing team can further target them with a more precise strategy

Analytics Structure

A. Overview of the channel group and position

B. Originator and converter analysis

C. New and old user analysis

D. Cluster analysis

Findings and Recommendation

A. Overview of the channel group and position

To begin with, I’d like to take a quick glimpse of how data distributed through different position role, so I utilize pivot table in pandas to consolidate the data in a simpler manner, and I also conduct some filtering to further make an easy-to-interpret graph.

web['Countnumber'] = 1
stage = web.pivot_table('Countnumber', index = 'Groupname', columns = 'Positionname', aggfunc= 'sum')
stage2 = stage[stage['ASSIST'] >= 8]
stage2 = stage2.fillna(0)
stage2 = stage2.drop(['Uncategorized'])

The result table is below.

And I plot the graph further.

From the graph, different ad types have diverse proportion in each position.

Take search Google Brand and search MSN brand for example. They play significant role in originator, which means these type of ads attract user to start their user journey.

And for converter, CJ (one kind of affiliate campaign) have more effects on converter, which boosts the last kick for user to make purchase.

B. Originator and Converter Analysis

After the overview, I decided to focus more on originator and converter, which is our first and last step of the whole user journey. The analyses may give the product manager and marketing team a clear image which channel brings better performance.

I expand the stage table by transforming the number into percentage, and make a graph based on it.

stage['orig_perc'] = round(stage.ORIGINATOR / stage.ORIGINATOR.sum(), 2)
stage['convert_perc'] = round(stage.CONVERTER / stage.CONVERTER.sum(),2)
stage = stage.fillna(0)

From the graph, the findings are:

  1. CPM, search google brand and buzz affiliate are the three major source for originator.
  2. CPM, buzz affiliate and CJ are the three major source for converter.


As a PM and marketing team for ad product, I recommend these channels can be first optimized and invested with more budgets.

Percentage might showcase parts of the information. I therefore utilize simple group_by function to present the absolute number of originator and converter.

# Originator graph data source
web_origin = web[web['Positionname']=='ORIGINATOR']
df3 = web_origin.groupby('Groupname')['Saleamount'].aggregate({'No_of_Sale':'count','Total_Sum':'sum'})
df3['Avg_Sales'] = round(df3['Total_Sum'] /df3['No_of_Sale'],2)
# Converter graph data source
web_convert = web[web['Positionname']=='CONVERTER']
df4 = web_convert.groupby('Groupname')['Saleamount'].aggregate({'No_of_Sale':'count','Total_Sum':'sum'})
df4['Avg_Sales'] = round(df4['Total_Sum'] /df4['No_of_Sale'],2)


  1. For originator, although CPM leads to the biggest amount of sales across all channels, its average sales is not very high.
  2. For originator, social has relatively high average sales amount.
  3. For converter, CPM, same as the condition in originator graph , leads to fair amount of sales, but the average sales amount is low.
  4. Buzz affiliate and CJ have relatively high average sales amount.


  1. CPM plays important role in attracting customers and converting purchase, but its average sale number isn’t that healthy. It probably needs further investigation into the users using this channel.
  2. Social might be a potential channel for future development.
  3. Buzz affiliate and CJ are worth investing, since they convert a lot of transaction with high average sales.

C. New User and Old User Analyses

Recapped from the position percentage graph for each channel. I further think about another possibility. Why not I divide the data into new and old user. There is a basic assumption here that I think the ways new and old user get attracted and enter the consumer funnel might be different.

Therefore, I subset the data into new and old user, and run the same graph.


  1. For the originator, CPM and search Google brand have more influential power over new user than old user, which makes sense that typing in search engine and being attracted by some networks ad are the common method to draw new users.
  2. For converter, there is some huge contrast between new and old users. Most of the old user’s conversion is triggered by CPM; while for new users, CPM, buzz affiliate and CJ are more evenly distributed based on their portion.

A thought one-step-forward:

From the average sales amount graph of converter, the relatively low average sales amount is observed in CPM. And with the separation of new and old user, graph shows that CPM is the major trigger for old user. Is it possible that the amount paid by old user is less than new user, which makes the average sales from CPM low? Is it possible that the willingness to pay between new and old user different?

I therefore did a pivot table again. This time, I pick new and old user as columns from subset converter data and use position as index.

convert_newold = web_convert.pivot_table('Countnumber', index = 'Position', columns = 'Newcustomer', aggfunc= 'sum')

The result is surprising. From the graph, we can see that new users tend to convert at the first few position phase. They get attracted by ads and they decide to buy quickly.

It is a total different story for old users. They tend to think more, and the overall conversion is way less than new users.

I have to bring up the question here. Why older user are not willing to make the final purchase and what factors make them think more before buying?


  1. Product team and marketing team should dig deeper to the root cause of the retention of old users. The ad channels can lead the user to get into the website, but if the website or the merchandise the retailer sold could’t trigger the return purchase, it is quite risky.

D. Cluster Analysis

The final step of my analyses is cluster analysis. I also divided my analyses into two parts. One is based on the behavior pattern of converters; the other is based on the originator.

Converter Clustering

First of all, for the converters cluster, I select sales amount, new customer, position, group name and time to convert as attribute columns. And further cast some categorical attributes into dummy.

clus_col = ['Saleamount','Newcustomer','Position', 'Groupname','Time to Convert (Days)']
clus_df = web_convert[clus_col]
clus_df2 = pd.get_dummies(clus_df, prefix=['new_customer', 'group_name'])

Second, I use elbow method to decide the appropriate k.

wdata = clus_df2
K = range(1,20)
KM = (sk_cluster.KMeans(n_clusters=k).fit(wdata) for k in K)
centroids = (k.cluster_centers_ for k in KM)
D_k = (sci_distance.cdist(wdata, cent, 'euclidean') for cent in centroids)
dist = (np.min(D, axis=1) for D in D_k)
avgWithinSS = [sum(d) / wdata.shape[0] for d in dist]
plt.plot(K, avgWithinSS, 'b*-')
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering', fontsize = 18)

I decide when k = 4, it might be suitable, and further run the clustering analysis and print out the mean for each group to get more understanding of each group characteristics.

n_clusters = 4means_cluster = sk_cluster.KMeans(n_clusters=n_clusters, random_state = 1111)
columns = clus_df2.columns.values
est =[columns])
clusters = est.labels_
clus_df2['cluster'] = clusters
for c in range(n_clusters):
cluster_members = clus_df2[clus_df2['cluster'] == c][:]
print('Cluster{}(n={}):'.format(c, len(cluster_members)))

The result is like this:

After collating and naming based on the mean of attribute. I conclude four clusters of users group on the website.


  1. The new and old users take each half of the website population.
  2. Buzz affiliate and CJ are the main source for the new user, while most old users converted by CPM, which aligned with my analyses above.
  3. New user tend to convert quickly and pay more, while old user convert slowly and pay less, which is a serious signal that pinpoint the retailer couldn’t retain the old user.


  1. I recommend the retailer developing more cooperation with media and KOLs to create more buzz to attract new users to convert.
  2. I recommend to look deeper into cluster 2, who is our potential profit source for future. I recommend identify the product this group of user search and purchase, and further conclude action items on boosting the sales.
  3. Research on old users must be conducted. I recommend doing some user research on old user’s buying behavior, browsing pattern and the reason why they decide not to come back.

Originator Cluster

As the second part of cluster, I focus on the clustering analysis on originator. The process is similar. I cast the dummies for categorical variables and drew the elbow graph.

From the graph, I decide to pick n=4 as cluster number. And I then print out the cluster result.

After collating and interpreting each cluster, I conclude the four user group for originator behavior.


  1. I recommend looking into the decay of old users. From the perspective of originator cluster, it is more obvious that old user need more time to convert and they spend less on website.
  2. Google search brand can draw some high quality new users. Strategies could be made to expand the channel and user base from this.


Through the analyses process, I endeavored to make recommendation for prioritization and segmentation.

Several channels for originator and converter are worth investing, and others need more investigation to decide whether to discard them.

And from the new/old user analyses and clustering, I conclude it is risky for the website that it faces the challenge of user retention. And it should bring in the thoughts aligned with channel performance. Product manager and marketing team may further conduct some health check: at which phase or part of user journey, the users have bad experience, and further optimize them.

Web analytics and funnel analyses are always fascinating to me. This time, I skim through the beginning and the end of user funnel. In the future, I will definitely go further and explore more amazing story for user journey.



Henry Feng

Sr. Data Scientist | UMN MSBA | Medium List: | 諮詢服務: | Podcast: 商業分析眨眨眼

More from Henry Feng