[Product Insight] Redesigning Medium's "Related Articles" Feature with a Topic Clustering Model
Don't forget to follow me on Medium to get notified whenever I publish a new article!
As a big fan and loyal user of Medium, an online publishing platform favored by millions of readers and professional writers, I have long been eager to conduct a product research and analytics project on this beloved website.
Product analytics has been my passion since I first started my career in Taiwan, and that passion has grown as my studies in the UMN MSBA program have proceeded. The dilemma for an aspiring product analyst is that, because of confidentiality, I can't easily get access to real product data to polish my analytical skills. I can only imagine the metrics companies might use, propose my own, and walk through the thinking process.
Luckily, I came across a simple data set of Medium topics on Kaggle, which contains 192 titles related to the category of data science. Using this data set, I imagined I were a product analyst at Medium, trying to come up with an alternative recommendation system for the 'Related Articles' feature below every article.
Goal
Provide an alternative model for recommending articles, complementing the existing systems, to better optimize user experience and extend users' engagement time on Medium.
Skills and Tools
Tools: Python Jupyter Notebook
Skills: Text processing with NLTK, text clustering with scikit-learn, visualization with Matplotlib and WordCloud, data processing with pandas and NumPy
Status Quo Analysis
Before introducing my simple model, it is important to give an overview of the current layout and recommendation systems of the target feature.
On desktop, there are three blocks of recommended articles below each article, and from simple observation the recommendations can be put into three categories.
- Related reads: The algorithm behind this is unknown to me.
- Also tags: This system utilizes the tags of each article and showcases articles with the same tags.
- More from the same website/blog articles: I think this only appears for blogs and publications built on Medium, not for individual writers, and it might help increase time spent on those specific sites.
I also observed that the "related reads" and "also tags" recommendations may appear on the same page.
On mobile devices, three recommended articles are shown just like on desktop, but with a slightly different layout. I observed three recommendation systems here, most of which are similar to the desktop ones.
- Pick for you: The algorithm behind this is unknown to me.
- Also tags: Same as the desktop version.
- More from the same website/blog articles: Same as the desktop version.
What I aim to achieve in this research is to provide a fourth method for the recommendation system.
My Methodology with Title Clustering for Medium Recommendation System
The structure of analyses is shown below:
- Load in, process and vectorize topics text data
- Conduct clustering analysis with K-Means and hierarchical clustering
- Analyze and name the clusters for further suggestion
- Recommendation and Next Step
Load in, Process and Vectorize Text Data
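Before walking through the steps, here is a minimal sketch of the imports the rest of the notebook assumes (package versions may differ):
import pandas as pd
import matplotlib.pyplot as plt
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize  # nltk.download('punkt') may be needed once
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import ward, dendrogram, fcluster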
The first step is to load the data. Since the file has no header and comes with an id column, I made the id the index and named the text column 'Text'.
data = pd.read_csv('medium_titles.csv', index_col=0, header = None, names = ['Text'])
Next, I define a tokenize-and-stem function for later use in vectorization. (Stemming returns a word to its stem form, e.g., turning 'cats' into 'cat' and 'effective' into 'effect'.)
def tokenize_and_stem(text_file):
    # declare the stemmer and stop word list
    stemmer = SnowballStemmer("english")
    stop_words = set(STOPWORDS)
    # tokenize, drop stop words, then stem each remaining token
    words = word_tokenize(text_file)
    filtered = [w for w in words if w not in stop_words]
    stems = [stemmer.stem(t) for t in filtered]
    return stems
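A quick sanity check on a made-up title shows what the function returns: a list of lowercased stems with the stop words removed.
print(tokenize_and_stem("Building Effective Data Visualizations in Python"))
# stems of the non-stop-word tokens, approximately ['build', 'effect', 'data', 'visual', 'python']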
As the next step, I created a column that holds each title with its stop words removed and named it 'text_no_stop'.
stop_words = set(STOPWORDS)
data['text_no_stop'] = data['Text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
The final step is to create a vectorizer and transform the 'text_no_stop' column into a matrix. I use TfidfVectorizer to extract the features of the text.
tfidf_vectorizer = TfidfVectorizer(max_features=200000,
                                   use_idf=True,
                                   stop_words='english',
                                   tokenizer=tokenize_and_stem)
tfidf_matrix = tfidf_vectorizer.fit_transform(data['text_no_stop'])
terms = tfidf_vectorizer.get_feature_names()
The tfidf_matrix is the final matrix I use for clustering.
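A quick look at the shape confirms what we are clustering: one row per title and one column per stemmed term.
print(tfidf_matrix.shape)  # (number of titles, number of terms) -- here 192 rows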
Conduct Clustering Analysis
In this section, I cluster the matrix in two ways: K-Means and hierarchical clustering. Using two methods helps verify the robustness of the clustering and check whether the model can properly categorize different topics.
K-means
Since K-Means is an unsupervised learning method, I first had to decide how many clusters are appropriate for this approach. I plotted the silhouette coefficient as well as an elbow graph to determine a suitable number of clusters.
(1) Silhouette Coefficient
coef = {}
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300,
                    n_init=1, verbose=0, random_state=342).fit(tfidf_matrix)
    label = kmeans.labels_
    sil_coeff = silhouette_score(tfidf_matrix, label, metric='euclidean')
    # print("For n_clusters={}, the silhouette coefficient is {}".format(k, sil_coeff))
    coef[k] = sil_coeff

plt.figure()
plt.plot(list(coef.keys()), list(coef.values()))
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Coefficient")
plt.title('Silhouette Coefficient across N Clusters')
plt.show()
(2) Elbow Graph
sse = {}
for k in range(1, 20):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300,
                    n_init=1, verbose=0, random_state=342).fit(tfidf_matrix)
    data["clusters"] = kmeans.labels_
    sse[k] = kmeans.inertia_  # inertia: sum of squared distances of samples to their closest cluster center

plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.title('Elbow Graph for N Clusters')
plt.show()
From the two graphs above, there is no definitive n to pick. Based on the silhouette graph, I decided to use the first turning point, where n equals six, as my cluster number.
After deciding the cluster number, I fit the matrix with K-Means. Creating a label for each topic is the next step.
(3) Fitting the model and labeling titles
km = KMeans(n_clusters=6, init='k-means++', max_iter=300, n_init=1, verbose=0, random_state=342)
km.fit(tfidf_matrix)
labels = km.labels_
clusters = labels.tolist()
With this code, each title is assigned one of six cluster labels for later use.
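Since the terms list from the vectorizer is already available, an optional sketch like the following peeks at the heaviest terms near each K-Means centroid, which also helps with naming the clusters later:
# sort each centroid's term weights in descending order
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(6):
    top_terms = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ', '.join(top_terms)))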
(4) Visualize the clusters on a two-dimensional graph
The next step is optional. Because the TF-IDF matrix is a high-dimensional structure, I transform it with multidimensional scaling (MDS), fitting MDS on the cosine distance matrix.
distance = 1 - cosine_similarity(tfidf_matrix)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(distance)
xs, ys = pos[:, 0], pos[:, 1]
Create the final data frame with cluster labels and visualize it.
df_knn = pd.DataFrame(dict(label=clusters, data=data['Text'], x=xs, y=ys))

label_color_map = {0: 'tomato', 1: 'skyblue',
                   2: 'lightgrey', 3: 'pink',
                   4: 'plum', 5: 'yellow'}

fig, ax = plt.subplots(figsize=(17, 9))
# plot each title as a dot colored by its cluster
for index, row in df_knn.iterrows():
    label_color = label_color_map[row['label']]
    ax.plot(row['x'], row['y'], marker='o', ms=12, c=label_color)

# annotate each dot with its cluster label
for i in range(len(df_knn)):
    ax.text(df_knn.iloc[i]['x'], df_knn.iloc[i]['y'], df_knn.iloc[i]['label'], size=8)

plt.title('Medium Titles using K-Means Clustering')
plt.show()
From the graph above, we can observe that the title dots scatter across the two-dimensional plane in a roughly circular pattern. Cluster 0 is the smallest and cluster 2 is the biggest.
Hierarchical Clustering
I used the Ward method to create a linkage matrix and then built a hierarchical dendrogram from it.
linkage_matrix = ward(distance)

fig, ax = plt.subplots(figsize=(20, 20))  # set figure size
dendrogram(linkage_matrix, orientation="top", labels=data['Text'].values, show_leaf_counts=True)

plt.tight_layout()
plt.title('Medium Titles using Ward Hierarchical Method')
plt.show()
From the dendrogram above, the algorithm groups the titles into a handful of broad branches. For easier interpretation, I used the fcluster function to cut the tree into five clusters and create labels, just as I did with K-Means. Finally, I create a data frame with the labels from hierarchical clustering.
label = fcluster(linkage_matrix, 5, 'maxclust')
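The later steps use a data frame called df_hei; here is a minimal sketch of how it can be built, mirroring df_knn (the construction and column names are my assumption):
# pair each title with its hierarchical cluster label
df_hei = pd.DataFrame(dict(label=label, data=data['Text'].values))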
Analyze and Name the Clusters
(1) Check the size of each cluster
I used a simple groupby to check the size of each cluster.
df_knn.groupby('label')['data'].size().rename('kmeans')
df_hei.groupby('label')['data'].size().rename('hierarchical')
(2) Visualize categories with Word Clouds
Unlike clustering on numeric data, I can't simply take the mean of the text and interpret it. Therefore, I summarize the topics and categories of each cluster with simple word clouds.
Word clouds for the 6 K-Means clusters
for i in range(6):
    df_new = df_knn[df_knn['label'] == i]
    lines = df_new.data.str.cat(sep=' ')
    stopwords = set(STOPWORDS)
    wordcloud0 = WordCloud(width=400, height=400,
                           background_color='white',
                           stopwords=stopwords,
                           min_font_size=10).generate(lines)
    plt.figure(figsize=(4, 4), facecolor=None)
    plt.imshow(wordcloud0)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
With the word cloud visualizations, I am able to summarize the six categories in this file.
Word clouds for the 5 hierarchical clusters
for i in range(1, 6):
    df_new = df_hei[df_hei['label'] == i]
    lines = df_new.data.str.cat(sep=' ')
    stopwords = set(STOPWORDS)
    wordcloud0 = WordCloud(width=400, height=400,
                           background_color='white',
                           stopwords=stopwords,
                           min_font_size=10).generate(lines)
    plt.figure(figsize=(4, 4), facecolor=None)
    plt.imshow(wordcloud0)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
Comparing the word clouds from the two clustering models, I notice that the results are quite consistent with each other.
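To go a step beyond eyeballing the word clouds, one quick check (a sketch, assuming df_knn and df_hei list the titles in the same order as built above) is to cross-tabulate the two sets of labels and see whether each K-Means cluster maps mostly onto one hierarchical cluster:
overlap = pd.crosstab(df_knn['label'].values, df_hei['label'].values,
                      rownames=['kmeans'], colnames=['hierarchical'])
print(overlap)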
Recommendation and Next Step
My recommendation, based on this research and analysis, is that the Medium product team can apply the topic clustering model to the recommendation system below each article.
When a reader is reading an article about, say, data visualization, the website can recommend articles whose titles fall within the same cluster, as sketched below.
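Here is a minimal sketch of that logic, reusing the objects built above (tfidf_matrix and df_knn); the function name and parameters are illustrative only, not an existing Medium API. Given the row index of the article being read, it returns the closest titles within the same cluster by cosine similarity.
from sklearn.metrics.pairwise import cosine_similarity

def recommend_same_cluster(article_idx, n_recs=3):
    # cluster of the article the reader is currently on
    current_cluster = df_knn['label'].iloc[article_idx]
    # candidate rows: other titles in the same cluster
    candidates = [i for i in range(len(df_knn))
                  if df_knn['label'].iloc[i] == current_cluster and i != article_idx]
    # rank candidates by cosine similarity of their TF-IDF vectors
    sims = cosine_similarity(tfidf_matrix[article_idx], tfidf_matrix[candidates]).ravel()
    ranked = [candidates[j] for j in sims.argsort()[::-1][:n_recs]]
    return df_knn['data'].iloc[ranked]

# example: recommendations for the 11th title in the data set
print(recommend_same_cluster(10))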
Apart from the existing mechanisms, I believe the title clustering model might be another way to optimize user experience and extend users' time on Medium.
As a next step, I suggest conducting an A/B test on this model. By placing an article selected by the topic clustering model in one of the three blocks, the product team can compare the click-through rate of each block to verify which method is the better recommendation solution.
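For the comparison itself, one common approach is a two-proportion z-test on the click-through rates of an existing block versus the clustering-based block. The counts below are made-up placeholders, and statsmodels is only one of several libraries offering this test:
from statsmodels.stats.proportion import proportions_ztest

# hypothetical clicks and impressions: [existing block, clustering-based block]
clicks = [480, 530]
impressions = [12000, 12000]

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print("z = {:.2f}, p = {:.4f}".format(z_stat, p_value))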
Conclusion
Through the process of redesigning the "Related Articles" feature on Medium, I tried to utilize what I have learned in clustering, text analytics and visualization to suggest an alternative solution for the Medium product team. By playing around with the Medium title data set, I also got to sharpen my thinking on product analytics, user journeys and product development, which was great fun. I will definitely keep polishing my skills to become a more professional product analyst.
Feel free to check out my code on my GitHub.
If you like the article, feel free to give me 5+ claps
If you want to read more articles like this, give me 10+ claps
If you want to read articles with different topics, give me 15+ claps and leave a comment here. Thanks for reading!