Author: Helmi Abd Hamid | Categories: Data Science

SHOWDOWN: Tun Dr. Mahathir vs Lim Kit Siang

Who’s More Popular in Johor? Tun Dr. Mahathir or Lim Kit Siang?

Discussed in this topic:

Bayesian Average
Opinion Mining/Sentiment Analysis
Data Filtering with SQL

Since politics was the most dominant subject in our previous analysis, let’s take a stab at one aspect of it: the popularity of the most prominent politicians.

One quite common technique is to use a Bayesian average built around information such as the number of subscriptions, ratings, votes cast, likes and comments. Since we do not have ample data on those parameters, we will use more basic techniques to determine who is more popular.
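For reference, a Bayesian average pulls an item's raw mean towards a prior mean, and the pull weakens as more votes come in. Here is a minimal sketch in base R. The function name, the prior weight C, the prior mean m and the toy ratings are all assumptions for illustration, not values from our data set:

```r
# Bayesian average: (C * m + sum of ratings) / (C + number of ratings)
# C = prior weight, i.e. how many "pseudo-votes" the prior is worth (assumed)
# m = prior mean, e.g. the average rating across all items (assumed)
bayesian_average <- function(ratings, C, m) {
  (C * m + sum(ratings)) / (C + length(ratings))
}

# Toy ratings, for illustration only
few_votes  <- c(5, 5)                    # two perfect scores
many_votes <- c(5, 4, 5, 4, 5, 4, 5, 4)  # lower raw mean, but more evidence

bayesian_average(few_votes,  C = 5, m = 3)
bayesian_average(many_votes, C = 5, m = 3)
```

With only two votes, the result stays close to the prior mean; with eight votes, the data carries more weight, so the second item ends up ranked higher even though its raw mean is lower.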

We will still use the data set which we had filtered to contain “Johor” as the most important keyword. Thus, the other data sets will be subsets of this data set.

First, let’s filter the rows (or observations, as they are normally called in R) that contain keywords pointing to Tun Dr. Mahathir. We will then create a vector to indicate the frequency of mentions in each row.

As can be seen from the Venn diagram, there are overlapping areas referring to articles which mention both leaders. We can zoom in and perform a more detailed examination of those articles, but discounting them is also an option - especially when the percentage of such articles is relatively small and perhaps insignificant. On the flip side, those articles may give us a more accurate picture or fortify our findings based on articles which mention only one of them.

Let’s load the sqldf package to filter the data with SQL statements, the dplyr package for data frame manipulation and processing, and the stringr package for string manipulation:

library(sqldf)
library(dplyr)
library(stringr)

Let’s count the number of occurrences of both of the leaders’ names:

# Create a new data frame keyed by article id
TDMvsKS <- data.frame(id = news$id)

# str_count() is vectorised, so no explicit loop over the rows is needed
TDMvsKS$Mahathir <- str_count(tolower(news$para), "mahathir")
TDMvsKS$KitSiang <- str_count(tolower(news$para), "kit siang")

# Total number of mentions for each leader
sum(TDMvsKS$Mahathir)
sum(TDMvsKS$KitSiang)

From the number of mentions, it looks like Tun Dr. Mahathir is mentioned far more frequently. But the mention of his name does not necessarily indicate that he is more popular, since it may also indicate notoriety or infamy. Thus, we will use sentiment analysis to examine the words used in connection with both leaders.

Now let’s leverage the power of SQL statements to extract only the data that we want:

Mahathir <- sqldf("select * from news where para like '%ahathir%'")
KitSiang <- sqldf("select * from news where para like '%Kit Siang%'")

Now let’s filter out contents that contain both leaders’ names:

MahathirLessKS <- sqldf("select * from Mahathir where para not like '%kit siang%'")

KitSiangLessMaha <- sqldf("select * from KitSiang where para not like '%ahathir%'")
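Before discounting the overlapping articles, it may help to check how large that overlap actually is. The following is a rough base R sketch on a toy vector of paragraphs; the vector para and its contents are made up for illustration and merely stand in for news$para:

```r
# Toy paragraphs standing in for news$para (illustration only)
para <- c("Mahathir visits Johor",
          "Kit Siang speaks in Gelang Patah",
          "Mahathir and Kit Siang share a stage",
          "Johor traffic update")

# Flag which paragraphs mention each leader, ignoring case
has_maha <- grepl("mahathir",  para, ignore.case = TRUE)
has_ks   <- grepl("kit siang", para, ignore.case = TRUE)

# Cross-tabulate the two flags: the TRUE/TRUE cell is the Venn overlap
table(Mahathir = has_maha, KitSiang = has_ks)

# Share of articles mentioning either leader that mention both
overlap_share <- sum(has_maha & has_ks) / sum(has_maha | has_ks)
overlap_share
```

If the share is small, dropping the overlapping articles as done above loses little; if it is large, they probably deserve a separate look.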

One of the most straightforward methods is to count how often each name is mentioned, with the victor being whoever is talked about the most. However, it is possible that the sentences in which these leaders were mentioned were negative in nature. We will investigate that next using sentiment analysis techniques.

There are several packages for performing sentiment analysis in R, but for now let’s use the SentimentAnalysis package.

library(SentimentAnalysis)

doc <- sapply(MahathirLessKS$para, function(row) iconv(row,"latin1","ASCII",sub=""))

sentiment <- analyzeSentiment(doc)

# View sentiment direction (i.e. positive, neutral and negative)
convertToDirection(sentiment$SentimentQDAP)
likesMaha <- as.data.frame(convertToDirection(sentiment$SentimentQDAP))

Now let’s count the number of positive, negative and neutral sentiments:

countlikesMaha <- paste(likesMaha$`convertToDirection(sentiment$SentimentQDAP)`,collapse="")

positiveMaha <- str_count(countlikesMaha,"positive")
negativeMaha <- str_count(countlikesMaha,"negative")
neutralMaha <- str_count(countlikesMaha,"neutral")

sentMaha <- data.frame(positiveMaha,negativeMaha,neutralMaha)

View(sentMaha)

From the data frame, we can comfortably conclude that positive connotations far outnumber negative and neutral ones.

We can draw donut (doughnut) or pie charts from the data frame.

Now let’s do the same to Uncle Lim Kit Siang, shall we?

# Doing the same for LKS, using the Kit Siang-only subset created earlier
doc <- sapply(KitSiangLessMaha$para, function(row) iconv(row,"latin1","ASCII",sub=""))

sentiment <- analyzeSentiment(doc)

# View sentiment direction (i.e. positive, neutral and negative)
convertToDirection(sentiment$SentimentQDAP)
likesKS <- as.data.frame(convertToDirection(sentiment$SentimentQDAP))

countlikesKS <- paste(likesKS$`convertToDirection(sentiment$SentimentQDAP)`,collapse="")

positiveKS <- str_count(countlikesKS,"positive")
negativeKS <- str_count(countlikesKS,"negative")
neutralKS <- str_count(countlikesKS,"neutral")

sentKS <- data.frame(positiveKS,negativeKS,neutralKS)
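The paste-and-count step used for both leaders can also be written with table(), which additionally gives the percentages directly via prop.table(). This is only an alternative sketch: the directions vector below is toy data standing in for the output of convertToDirection(), not our actual results.

```r
# Toy vector standing in for convertToDirection(sentiment$SentimentQDAP)
directions <- factor(c("positive", "neutral", "positive", "negative", "positive"),
                     levels = c("negative", "neutral", "positive"))

# Number of texts in each sentiment category
counts <- table(directions)

# Percentage of texts in each category, rounded to one decimal place
shares <- round(100 * prop.table(counts), 1)

counts
shares
```

Fixing the factor levels up front means a category with zero texts still shows up in the table with a count of 0.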

As you may have noticed, I used a different package, syuzhet, to perform sentiment analysis on JDT tweets. You can also use RSentiment as an alternative, with probably a slightly different syntax but similar output. In some packages you can specify which standards to use, such as syuzhet, Oxford, Cambridge University Press or Stanford CoreNLP.

The differences lie in how the words are defined and scored in their respective dictionaries.

In general, words and sentences are analysed and assigned sentiment values. In most packages, those values are numerical. We then need to use other functions from the same packages to convert them into simpler numerical values or textual values such as negative or positive. An overall score is computed for each sentence to decide which sentiment it belongs to. This addresses the case where both negative and positive words are present in the same sentence.
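To make the idea concrete, here is a toy dictionary-based scorer in base R. It is only a sketch of the general approach, not the internals of any particular package: the word lists and the functions score_text() and to_direction() are hypothetical and exist purely for illustration.

```r
# Hypothetical mini-dictionaries; real packages ship much larger word lists
positive_words <- c("good", "great", "popular")
negative_words <- c("bad", "corrupt", "unpopular")

# Score one text as (positive hits - negative hits) / total words
score_text <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[words != ""]
  if (length(words) == 0) return(0)
  (sum(words %in% positive_words) - sum(words %in% negative_words)) / length(words)
}

# Convert numeric scores into textual sentiment labels
to_direction <- function(score) {
  ifelse(score > 0, "positive", ifelse(score < 0, "negative", "neutral"))
}

texts <- c("A great and popular leader",
           "A bad decision by a popular leader",
           "A corrupt and unpopular move")
scores <- sapply(texts, score_text, USE.NAMES = FALSE)
to_direction(scores)
```

The second text contains one positive and one negative word, so the two cancel out and it is labelled neutral.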

Sentiment analysis and opinion mining are almost the same thing. There is, however, a minor difference between them: opinion mining extracts and analyses people's opinions about an entity, while sentiment analysis searches for sentiment words or expressions in a text and then analyses them.

Now let’s visualize the percentages of the sentiments for both leaders.

Pie Charts:

library(plotly)
# Sentiment counts and their labels, in matching order
pi_values <- c(sentMaha$positiveMaha,sentMaha$negativeMaha,sentMaha$neutralMaha)
pi_legends <- c("Positive","Negative","Neutral")
# domain must be given as x/y ranges between 0 and 1 (here: the full plot area)
ap <- plot_ly() %>%
  add_pie(labels = pi_legends, values = pi_values,
          name = "Mahathir", domain = list(x = c(0, 1), y = c(0, 1)))
ap

This will create an object, ap, which when called will draw a pie chart comprising the three sentiments.

With the code provided, you will have to hover your mouse over each slice to see the sentiment it represents.

Now let’s see Kit Siang’s pie chart:
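The same recipe should work for Kit Siang’s counts. Here is a minimal sketch, assuming sentKS has the three columns created earlier. The placeholder values below are made up so that the snippet runs on its own, and the names ks_values, ks_legends and ks_pie are introduced only for this example:

```r
library(plotly)

# Placeholder counts so the sketch runs on its own (NOT the real results);
# skip this line if sentKS already exists from the sentiment step
sentKS <- data.frame(positiveKS = 3, negativeKS = 1, neutralKS = 1)

# Sentiment counts and their labels, in matching order
ks_values  <- c(sentKS$positiveKS, sentKS$negativeKS, sentKS$neutralKS)
ks_legends <- c("Positive", "Negative", "Neutral")

# Build the pie chart object; printing it draws the chart
ks_pie <- plot_ly() %>%
  add_pie(labels = ks_legends, values = ks_values, name = "Kit Siang")
ks_pie
```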

Alternatively, you can plot a donut chart with the plotly package. There are at least two other R packages you can use which I may cover in my future posts. Most of these packages, like plotly, will provide interactive charts.

# Reuses pi_legends and pi_values from the pie chart above;
# hole = 0.6 turns the pie into a donut
donut <- plot_ly(labels = pi_legends, values = pi_values) %>%
  add_pie(hole = 0.6) %>%
  layout(title = "Mahathir: Percentage of Sentiments by Category", showlegend = FALSE,
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
donut

As you can see, there are far more positive posts about Tun Dr. Mahathir than about Kit Siang. Furthermore, only 15.4% of the sentiments were labelled negative in Tun Dr. Mahathir’s case, whereas 24.1% were labelled negative for Kit Siang.