Detecting Possible Earning Manipulation for a Real-World Company

Henry Feng
9 min read · Nov 16, 2018


Don't forget to follow me on Medium to get a notification whenever I publish a new article!

Intro

The final episode of the earnings manipulation detection series applies everything I wrote and showcased in the previous articles to check whether a single company has possibly manipulated its earnings. I picked Amgen Inc as my target company. Amgen is a bio-pharmaceutical firm that develops and manufactures medicines. It was founded in 1980 and, through multiple acquisitions, now ranks 130th on the 2017 Fortune 500 list. There are two reasons why I picked this company as my research target.

1. The company has existed for an appropriate period, neither too long nor too short, which grants me a suitably sized data set for multiple analyses.

2. I am not very familiar with the bio-pharmaceutical industry, and this is a chance to explore more of it.

Skills and Tools

Tools: Python (Spyder), R (RStudio)

Skills: Data cleaning and munging with pandas, function creation, visualization using matplotlib and ggplot2, linear regression with R

Feel free to click my GitHub to see all the code.


Analysis Structure

The analysis structure will be very similar to the previous articles: I will walk through those methods and tools and apply them to the target company, Amgen Inc. Furthermore, I will check whether there is any hint that Amgen's earnings have been manipulated.

1. Data Cleaning

2. The overview of revenue, net income, total asset

3. Benford’s Law

4. Accruals Model

5. Operating Cash Flow Model

6. M-Score

1. Data Cleaning

As I mentioned in the previous article, Compustat is a gigantic data set, containing 1,000 variables and millions of rows. The data I need here is very straightforward: the data of Amgen and the data related to Amgen. Therefore I load the data into Python and do some cleaning and slicing to create and write out two csv files, which you can find in my GitHub.

The brief process is below:

(1) Pick only the needed variables, including gvkey (company identifier), datadate (reporting period), fyear (fiscal year), revt, rect, ppegt, epspi, ni, at, oancf, sic, and rdq.

(2) For the first data frame, select only the rows of Amgen Inc; for the second data frame, select only the companies in the same SIC-defined industry as Amgen's, which is 2836, biological products.

(3) Select fiscal years after 1980 and before 2018, since Amgen was founded in 1980.

(4) Select the rows where receivables are larger than 0.

(5) Drop NA based on several columns and remove duplicates.

(6) Finalize and write out two tables: one with Amgen's data only, the other with the data of the industry Amgen is in.

import pandas as pd
import numpy as np

# Load the merged Compustat file and keep only the needed variables
comp = pd.read_csv("compustat_1950_2018_annual_merged.csv")
comp1 = comp[["gvkey","datadate","fyear","revt","rect","ppegt","epspi","ni","at","oancf","sic","rdq"]]
# Amgen only (gvkey 1602); for the industry file use comp1[comp1['sic'] == 2836] instead
comp2 = comp1[comp1['gvkey'] == 1602]
# Fiscal years between 1980 and 2018
comp3 = comp2[(comp2['fyear'] > 1980) & (comp2['fyear'] < 2018)]
# Positive receivables only
comp4 = comp3[comp3['rect'] > 0]
comp5 = comp4.dropna(subset=['at','revt','ni','epspi','rect','oancf'])
comp6 = comp5.drop_duplicates(subset=comp5.columns.difference(['rdq']))
com_7 = comp6.fillna(0)
# For the industry file, write out 'amgen_compustate_2386.csv' instead
com_7.to_csv('amgen_compustate.csv', float_format='%.6f', index=False)

2. The overview of revenue, net income, total asset of Amgen

Before digging deeper into earnings manipulation, I'd like to get a basic overview of Amgen's recent financial situation and how it has performed over the years. Therefore I select three of the most important metrics to plot and analyze across the years.

(1) Year 2017 Financial Performance

# Python code 
df = pd.read_csv('amgen_compustate.csv')
y2017 = df[df['fyear']==2017]
y17_basic = pd.DataFrame(y2017[['revt','ni','at']])
print(y17_basic)
########## Result ############
revt ni at
22849.0 1979.0 79954.0

(2) Yearly trend

import matplotlib.pyplot as plt

df2 = df[['fyear','revt','ni','at']]
df2 = df2.set_index('fyear')
plt.figure(figsize=(10,8))
plt.plot(df2['revt'], linestyle = 'solid')
plt.plot(df2['ni'], linestyle = 'dashed')
plt.plot(df2['at'], linestyle = 'dashdot')
plt.legend(df2.columns.values.tolist())
plt.title('Trend for Revenue, Net Income & Total Asset for Amgen Inc 1988-2017')
plt.show()

It is observed that Amgen's total assets rose drastically after the year 2000, which is worth investigating. This might be the result of multiple and frequent acquisitions after that period. Also, net income has dropped after 2015; its recent profitability can be checked further.

3. Benford’s Law

Still remember Benford's Law? It is a theory about the distribution of leading digits in real-life numbers: each leading digit appears with a certain probability, with smaller digits appearing more often.

In this section, I will again focus on three important financial metrics related to earnings (revenue, operating cash flow and total assets) and see if Benford's Law can reveal some hints of earnings manipulation.
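Before looking at the results, here is a minimal Python sketch of the two pieces being compared: Benford's theoretical first-digit probability, log10(1 + 1/d), and the empirical first-digit shares of a series. The revenue figures below are made up for illustration, not Amgen's actual numbers.

```python
import math
from collections import Counter

def benford_expected(d):
    """Theoretical probability that a number's leading digit is d (1-9)."""
    return math.log10(1 + 1 / d)

def first_digit_distribution(values):
    """Empirical share of each leading digit 1-9 in a list of nonzero numbers."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    counts = Counter(digits)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Illustrative revenue series (synthetic, not Amgen's data)
revenues = [1021.0, 1306.0, 1648.0, 1940.0, 2240.0,
            2718.0, 3298.0, 4016.0, 5523.0, 6859.0]
empirical = first_digit_distribution(revenues)
theoretical = {d: benford_expected(d) for d in range(1, 10)}
```

Plotting `empirical` against `theoretical` gives the same kind of comparison as the `benford.analysis` charts below.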

(1) First Digit

# R-code
library('benford.analysis')
library(dplyr)
library(ggplot2)
## Use 1st digit from revenue columns as code example
revenue_bf <- data.frame(benford(abs(df$revt), number.of.digits = 1, sign = 'positive', discrete = TRUE, round = 3)$bfd)[,c("digits","data.dist", "benford.dist")]
colnames(revenue_bf) <- c('Digit', 'Sample_Revenue', 'Benford_Distribution')
ggplot(data = revenue_bf, aes(x = as.factor(Digit), y = Benford_Distribution)) +
geom_bar(stat='identity') +
geom_line(aes(Digit, Sample_Revenue, col = 'Sample_Revenue'), linetype = 1) +
geom_point(aes(Digit, Sample_Revenue), size = 4, col = 'red')+
ggtitle('Theoretical Distribution v.s. Amgen Revenue Distribution - 1st Digits ')

From the graph and chi-squared table above, the first-digit distribution for operating cash flow shows a more significant difference from Benford's Law, which indicates possible manipulation of the first digit.

Visually speaking, there is a high peak at digit 5 in operating cash flow. We might further refer to its performance goals to see if management tried to push the numbers to start with five.

Also for revenue, though it doesn't differ significantly from the theoretical digit distribution, I observe a higher appearance probability for digit 1 and zero appearances of digit 9. This might hint that the company tends to take numbers with 9 as the first digit and round them up to a leading 1 (e.g., turning 999 into 1,000).

(2) Second digit: after looking into the first-digit distribution, I decided to go further and check the distribution of the second digit.

The chi-squared test shows a significant difference between the theoretical digit distribution and the actual one. From the graphs, I observe a huge distinction at digit 9, which points to possible manipulation of the second digit beyond simple rounding.

(3) A limitation of Benford's Law here is that the sample for each financial metric is not very big, around 30 rows. Such a small sample might not represent the overall picture as well as the previous article did, where I used every company in the data set.
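The chi-squared comparison used above can be sketched by hand as well. The digit counts below are synthetic (sized like the ~30-row sample, not Amgen's actual digits); the statistic is compared against the chi-squared critical value for 8 degrees of freedom at the 5% level, roughly 15.51.

```python
import math

# Synthetic first-digit counts for a ~30-row sample (illustrative, not Amgen's)
observed = [11, 5, 4, 3, 2, 2, 1, 1, 1]  # counts for digits 1..9
n = sum(observed)
# Expected counts under Benford's Law: n * log10(1 + 1/d)
expected = [n * math.log10(1 + 1 / d) for d in range(1, 10)]

# Pearson chi-squared goodness-of-fit statistic, df = 9 - 1 = 8
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
flagged = chi2_stat > 15.51  # 5% critical value for df = 8
```

With so few observations the test has little power, which is exactly the limitation noted above: even a real deviation may not clear the critical value.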

4. Yearly Accruals Model

For the yearly accruals model, I used the second data file with Amgen and its peer companies in the same SIC-defined industry.

The model runs an accruals regression within the same industry for every year, and each company gets a residual from these regression models. If the residual in a specific year is huge, the company's accruals have deviated from the industry's collective standard, which further indicates possible manipulation of the drivers such as cash revenue growth or gross PPE.
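The per-year regress-and-capture-the-residual step can be sketched in Python with an ordinary least-squares fit (synthetic data; the column roles mirror the article's accruals, scaled cash-revenue growth, and scaled PPE):

```python
import numpy as np

def yearly_residual(accruals, cashrev_growth, ppe, target_row=0):
    """Regress accruals on cash-revenue growth and gross PPE within one
    industry-year, then return the unsigned residual of one firm
    (row 0 standing in for Amgen)."""
    acc = np.asarray(accruals, dtype=float)
    X = np.column_stack([np.ones(len(acc)), cashrev_growth, ppe])
    beta, *_ = np.linalg.lstsq(X, acc, rcond=None)
    return abs(acc - X @ beta)[target_row]

# Synthetic industry-year of 6 firms whose accruals are exactly linear
# in the drivers, so every residual should be near zero
growth = [0.1, -0.2, 0.3, 0.05, -0.1, 0.25]
ppe = [1.0, 0.8, 1.2, 0.9, 1.1, 0.7]
accruals = [0.5 * g - 0.2 * p for g, p in zip(growth, ppe)]
```

A firm whose accruals deviate from this industry fit would get a large unsigned residual, which is what the yearly plot below picks up.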

# R-code
df1 <- select(df, gvkey, fyear, accurals,scale_cashrev_growth, scale_ppe)
df1 = na.omit(df1)
year = list()
residual_a = list()
for (i in 1:29){
data1 <- df1[df1['fyear'] == i + 1988, ]
fit <- lm(accurals ~ scale_cashrev_growth + scale_ppe, data = data1)
year[i] = i + 1988
residual_a[i] = fit$residuals[1]
}
residuals_list = do.call(rbind,
lapply(1:length(year),
function(i)
data.frame(A=unlist(year[i]),
B=unlist(residual_a[i]))))
residuals_list$B = abs(residuals_list$B)
plot(residuals_list, type="l", xlab = 'Year', ylab = 'Residuals',
main = 'Yearly Unsigned Discretionary Accruals for Amgen')

In the code chunk, I select the munged columns (accruals, cash revenue growth, and PPE), run a regression for every year in a for loop, and capture the first residual, which belongs to Amgen. I then take the absolute value of Amgen's residual and plot it across the years.

The peaks in the residuals are where Amgen's accruals are off the industry benchmark (years 2002, 2014, and 2017). This might give some hint of possible earnings manipulation.

5. Operating Cash Flow Model

The logic behind the operating cash flow model is very similar to the accruals model: it tests the predictive power of several financial metrics. The higher the model's R-squared, the more robustly the independent metrics relate to the target variable, and the less likely it is that operating cash flows or accruals were manipulated.

I decided to use five years as a bin and to see how the model's predictive power changes for every five-year window.
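The key data step is pairing each firm-year with the following year's operating cash flow. Here is a small pandas sketch of that shift-and-merge (toy numbers, with `gvkey`/`fyear`/`ni`/`oancf` named as in the Compustat extract):

```python
import pandas as pd

# Toy single-firm panel (synthetic numbers)
df1 = pd.DataFrame({
    "gvkey": [1602, 1602, 1602],
    "fyear": [1990, 1991, 1992],
    "ni":    [10.0, 12.0, 15.0],
    "oancf": [8.0, 9.0, 11.0],
})
df1["accruals"] = df1["ni"] - df1["oancf"]

# Shift fyear back by one so year t lines up with the cash flow of t+1
nxt = df1[["gvkey", "fyear", "oancf"]].copy()
nxt["fyear"] -= 1
nxt = nxt.rename(columns={"oancf": "next_oancf"})
panel = df1.merge(nxt, on=["gvkey", "fyear"], how="left")
# The last year has no following observation, so next_oancf is NaN there;
# next_oancf can then be regressed on accruals + oancf within each window.
```

This is the same join the R chunk in this section performs with `mutate(fyear = fyear - 1)` and `left_join`.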

# R-code
df1$accurals = df1$ni - df1$oancf
df01 <- df1[,c('gvkey','fyear','oancf')]
df02 <- mutate(df01, fyear = fyear - 1) %>%
rename(., next_oancf = oancf)
df1 <- left_join(df1, df02, by = c('gvkey',"fyear"))
# Basic model
oc_reg <- lm(df1$next_oancf ~ df1$accurals + df1$oancf)
# run the model every five year
y1992 <- df1[df1['fyear'] <= 1992, ]
oc_reg_92 <- lm(y1992$next_oancf ~ y1992$accurals + y1992$oancf)
summary(oc_reg_92)

The result is sorted below:

From the chart, the green-colored R-squared values in 1993–1997 and 2013–2016 are relatively high, indicating better predictive power and a lower probability of manipulation.

The two red-colored R-squared values from 2003 to 2012 show weak model prediction and a higher possibility of manipulation of the independent metrics (operating cash flow and accruals).

6. M-Score

For the M-Score, I will use the M-Score Generator I created in my previous article to plot the graph of Amgen's M-score.

# Python
m_score_trend_graph('Amgen')

In the M-score graph, a data point above the red line is a red flag for earnings manipulation, and a data point between the red and green lines is a yellow flag, indicating possible slight manipulation.

Overall, based on the M-score, Amgen does not look like a potential earnings manipulator over the past 10 years.
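For readers who haven't seen the earlier article, the score behind the generator is the eight-variable Beneish model (coefficients from Beneish, 1999); the index inputs (DSRI, GMI, etc.) are ratios computed from consecutive financial statements. A minimal sketch:

```python
def beneish_m_score(dsri, gmi, aqi, sgi, depi, sgai, tata, lvgi):
    """Eight-variable Beneish M-score. Scores above roughly -1.78 are
    commonly read as a red flag for likely earnings manipulation, with
    about -2.22 sometimes used as a more conservative cutoff."""
    return (-4.84 + 0.920 * dsri + 0.528 * gmi + 0.404 * aqi + 0.892 * sgi
            + 0.115 * depi - 0.172 * sgai + 4.679 * tata - 0.327 * lvgi)

# A hypothetical firm with all year-over-year indices at 1 (no change)
# and zero total accruals to total assets
score = beneish_m_score(1, 1, 1, 1, 1, 1, 0, 1)
```

Such a "no change" firm scores about -2.48, safely below the red-flag threshold, which matches the intuition that the model is looking for unusual year-over-year shifts.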

Brief Conclusion

In this article, I briefly walked through different methods to detect possible earnings manipulation at a real-world company. Different methods provide different perspectives. Next time you want to look into a company's profile, maybe they can grant you a fresh, more data-driven view of the company.

If you like the article, feel free to give me 5+ claps
If you want to read more articles like this, give me 10+ claps
If you want to read articles on different topics, give me 15+ claps and leave a comment here
Thanks for reading!
