Decide Your Post Topics on Social Media with Simple Topic Modeling

Henry Feng
9 min read · Feb 9, 2019
Feel free to follow my Medium account :) 

In the spring semester, I am continuing my research assistant internship at UMASH. I feel refreshed coming back from winter break and am ready, in my last semester on the job, to deploy more advanced analytics skills and further improve the performance of UMASH's social media.

Problem Intro & Possible Solution

Posts on Twitter, Facebook, and YouTube are the first impression we make on our users and audiences. I wanted a more scientific method for identifying the topics that attract user engagement and post reach.

Until now, posting at UMASH has been based entirely on experience and the judgment of our social media editor. Therefore, I decided to use historical post data to make recommendations on topic selection.

The method I chose to tackle this problem is simple topic modeling. Topic modeling is an unsupervised learning technique that extracts latent topics from texts such as articles and sentences.

By identifying the topics of past popular posts, we can create more posts on related topics to serve the goals of our social media. Going further, we can trace back past posts with similar topics and re-promote them, which is cost-effective as well.

Tools

Tools: Python Jupyter Notebook, Twitter Analytics, Facebook Fanpage Insight, YouTube Analytics

Skills: Data munging with Pandas, text processing with NLTK, text transformation and topic modeling with scikit-learn (TF-IDF and TruncatedSVD)

Methodology

  1. Download and check the data from each social media platform's analytics tool
  2. Load in the data and check its cleanliness
  3. Build two topic modeling functions to generate topics from the posts

Download Data from Social Media

As an analyst, you have to get familiar with how each analytics platform works and where you can download the post data.

Facebook

The fan page design is pretty straightforward. First, I click the Insights tab and press the export data button, which opens a pop-up window. I can then download the post data for the desired timeframe.

Twitter

Twitter Analytics is in the drop-down list under the profile icon. After entering the analytics page, go to the “Tweets” tab, where the analyst can download the tweet data for the desired timeframe.

Get into Twitter Analytics
Download the data for the chosen timeframe

Youtube

There are more steps to retrieve YouTube video data compared to Facebook and Twitter. In the end, the analyst reaches the page where the data can be downloaded. I suggest setting the timeframe to the lifetime of all YouTube data; it is more convenient for further research.

Load in the Data

The first step is to load Pandas and read in the three Excel files from YouTube, Facebook, and Twitter.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df_twit = pd.read_excel('Tweet text.xlsx')
df_fb = pd.read_excel('Facebook Text.xlsx')
df_yt = pd.read_excel('Youtube Text.xlsx')

Because I am doing topic modeling, not many data points are needed; the most important column is the text column. I use the head and tail functions to check whether there are missing values in the data frames. I found that the YouTube data frame has an NA in the title column of its last row, representing the total performance of all videos, which needs to be handled.

Another issue is that the columns storing the text have different names across platforms. If I want to create a universal tool that works across platforms, I need a mechanism to deal with that.
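As a quick sanity check, head and tail make both issues visible. This is just a sketch; the column names 'video_title', 'Tweet text', and 'Post Message' are the ones in my exports and may differ in yours.

print(df_yt.tail(3))                    # the last YouTube row is a totals row with an NA title
print(df_twit[['Tweet text']].head(3))  # Twitter stores the text in 'Tweet text'
print(df_fb[['Post Message']].head(3))  # Facebook stores it in 'Post Message'
print(df_yt[['video_title']].head(3))   # YouTube stores it in 'video_title'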

Introduction of the Two Topic Modeling Functions

I decided to create two functions that return topic modeling results within a second and can be applied to any structured file downloaded from any analytics platform. As long as the file has a post/tweet/text column, and ideally some metrics related to each post, these two functions can be easily deployed.

I will briefly run through my code here and present the result based on my three files from YouTube, Twitter, and Facebook.

First function: topic_model(file, colname, topics_num)

The first function returns the topics for a specific column in a file. Users just need to pass in the file name, the column name, and how many topics they want in the result.

I will break down and explain the function below.

First, I load the needed libraries inside the function, including Pandas, NLTK, and scikit-learn.

def topic_model(file, colname, topics_num):

    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

Next, I load the file. The files exported from these analytics platforms are either CSV or Excel files, so I write a conditional statement to detect the format; either way, the data ends up stored in a pandas data frame.

    if 'csv' in file:
        df = pd.read_csv(file)
    elif 'xlsx' in file:
        df = pd.read_excel(file)

Next, I clean the text. Based on my observation of the YouTube data file, the last row is null text, so I drop rows with missing values in the text column. I then strip out every character other than letters (and #), remove words of three characters or fewer, and convert the strings to lowercase.

    # Clean the text
    df = df.dropna(subset=[colname])
    df['clean_title'] = df[colname].str.replace("[^a-zA-Z#]", " ", regex=True)
    df['clean_title'] = df['clean_title'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))
    df['clean_title'] = df['clean_title'].apply(lambda x: x.lower())

The next step is still text processing: I remove stop words, using NLTK's English stop word list. To apply it, the sentences first need to be tokenized into lists of words.

    # deal with the stop words
    stop_words = set(stopwords.words('english'))
    tokenized_doc = df['clean_title'].apply(lambda x: x.split())
    tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

After removing stop words, I merge the word lists back into sentences with the join function and a for loop.

    # merge the tokenized words back into sentences again
    detokenized_doc = []
    for i in range(len(df)):
        t = ' '.join(tokenized_doc.iloc[i])  # iloc guards against gaps left in the index by dropna
        detokenized_doc.append(t)
    df['clean_title'] = detokenized_doc

Now we have clean text data. It can be fit and transformed with the TF-IDF vectorizer; I store the result in X.

    # vectorize it
    vectorizer = TfidfVectorizer(max_features=500,  # keep the top 500 terms
                                 max_df=0.5,
                                 smooth_idf=True)
    X = vectorizer.fit_transform(df['clean_title'])

The final step is to use the dimensionality reduction tool truncated SVD, also called LSA (latent semantic analysis), to fit X. The topics_num argument of the function is used here to specify the number of components. The SVD gives each term a loading on each component, so for each topic I have the function print the six words with the highest loadings.

    # SVD represents documents and terms as vectors
    svd_model = TruncatedSVD(n_components=topics_num, algorithm='randomized', n_iter=100, random_state=122)
    svd_model.fit(X)

    terms = vectorizer.get_feature_names()
    for i, comp in enumerate(svd_model.components_):
        terms_comp = zip(terms, comp)
        sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:6]
        print("Topic " + str(i) + ": ")
        print('-------')
        for t in sorted_terms:
            print(t[0])
        print(' ')

Ta-da! That is the end of the function. We can use it to generate a list of topics from the full column of texts.

I will showcase the function with my YouTube video data.

topic_model('Youtube Text.xlsx','video_title',3)

You can see above that I key in the needed arguments, and the function generates three lists of topics containing different keywords. It gives the analyst an overview of the topics on a social media platform.
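If an analyst would rather reuse the topics programmatically instead of reading them off the console, the printing loop at the end of the function could be swapped for a small helper like the sketch below. This is my own optional variant, not part of the original function; it takes the fitted SVD model and vectorizer and returns the top terms per topic as a dictionary.

def top_terms_per_topic(svd_model, vectorizer, n_terms=6):
    # Collect the highest-loading terms of each SVD component into a dict,
    # so the topics can be stored or compared across platforms later.
    terms = vectorizer.get_feature_names()
    topics = {}
    for i, comp in enumerate(svd_model.components_):
        sorted_terms = sorted(zip(terms, comp), key=lambda x: x[1], reverse=True)[:n_terms]
        topics['Topic ' + str(i)] = [term for term, weight in sorted_terms]
    return topics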

Second function: topic_model_quantile(file, colname, metric_col, lower_quantile_no, upper_quantile_no ,topics_num)

Basically, this function is an extension of the previous one. The only difference is three extra arguments: the metric column, the lower quantile, and the upper quantile. The purpose is to return topics based on a performance metric; say, a marketing manager wants to know the topics of the top quarter of posts by engagement. The metric column here would be “engagement”, the lower quantile 75, and the upper quantile 100.
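To make the quantile arguments concrete, here is a tiny sketch with made-up engagement numbers (purely illustrative) showing how 75 and 100 translate into thresholds:

import pandas as pd

# Made-up engagement counts, only to illustrate how the quantile arguments work
engagements = pd.Series([5, 12, 18, 25, 40, 60, 95, 150])

# lower_quantile_no=75 and upper_quantile_no=100 become the 0.75 and 1.0 quantiles
lower, upper = engagements.quantile([75/100, 100/100])
print(lower, upper)  # posts whose engagement falls between these values are the "top quarter"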

The mechanism inside the function is basically the same; I just added a data subsetting step.

def topic_model_quantile(file, colname, metric_col, lower_quantile_no, upper_quantile_no, topics_num):

    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    if 'csv' in file:
        df = pd.read_csv(file)
    elif 'xlsx' in file:
        df = pd.read_excel(file)

    df = df.dropna(subset=[colname])

    ##### Subset the data with quantiles #####
    lower_quantile, upper_quantile = df[metric_col].quantile([lower_quantile_no/100, upper_quantile_no/100])
    # inclusive bounds so the very top (or bottom) posts are kept in the subset
    df = df.loc[(df[metric_col] >= lower_quantile) & (df[metric_col] <= upper_quantile)]

    df.reset_index(drop=True, inplace=True)
    #########################################

From there, the subset data frame is passed through the same cleaning, text processing, and transforming steps, and the function finally returns the topic results.

For a short demo, I will use the Twitter and Facebook files to look at the topics of the most engaging and least engaging posts.

topic_model_quantile('Tweet text.xlsx', 'Tweet text', 'engagements', 75, 100 , 3)
topic_model_quantile('Facebook Text.xlsx', 'Post Message', 'Lifetime Post Total Impressions', 0, 40, 4)

This function is more specific and valuable. Every organization measures posts with different metrics: some value engagement, some emphasize impressions, and some prefer share counts. With this function, the analyst can specify the metric column and draw recommendations for better post topics.

Limitation

This function is more of an exploratory analysis of topics and texts. I have to stress that it might only show correlation, not causality. I may conclude that these topics are related to higher or lower engagement, but I will not say that once the social media team writes posts on certain topics, the metrics will improve. It is a first step. Starting from here, social media teams might start thinking about how to verify a causal relationship, perhaps with experiments or other techniques.

Conclusion

I really enjoyed this journey of topic modeling. I had been longing to try this technique for a long time, and it is great to have the data from UMASH to apply it and bring value by proposing further recommendations on post content.

P.S. I put my Jupyter notebook on GitHub. Feel free to fork it and play around with it. I will provide the text dataset, which you can actually scrape from UMASH social media, but I will not include any metric columns due to data confidentiality. Have fun with the data.
