Indonesian Social Media Sentiment Analysis with Sarcasm Detection

Information

Meta

Goals

We proposed two additional features to detect sarcasm after a common sentiment analysis is conducted

Author

Ayu Purwarianti

Problem

Sarcasm is considered one of the most difficult problems in sentiment analysis.

In our observation of Indonesian social media, for certain topics, people tend to criticize something using sarcasm.

Features

Negativity information

Number of interjection words

SentiWordNet

All the classifications were conducted with machine learning algorithms

In sentiment analysis, detecting sarcasm automatically is still considered a difficult problem because lexical features do not give enough information to detect sarcasm

Possible approach


Lexical factors such as the presence of adjectives and adverbs, the presence of interjections, and the use of punctuation play a fairly significant role in sarcasm

Methodology

Preprocessing

Feature extraction component

Classification component

Minimize the vocabulary of terms used in the text message

In Indonesian social media text messages, such as those on Twitter or Facebook, people tend to use slang words rather than formal ones, for example using numerals to replace letters, repeating vowel characters, and using common informal words in place of formal words [2]

We convert numeric characters into letters, such as “ga2l” into “gagal”

Remove vowel repetition, such as “cemunguuuudh” into “cemungudh”

Translate informal words into formal words using dictionary, such as “cemungudh” into “semangat” (high spirit).
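The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the rule that "2" repeats the letters before it, the vowel-collapsing regex, and the one-entry `INFORMAL_TO_FORMAL` dictionary are all assumptions made for the example.

```python
import re

# Hypothetical informal-to-formal dictionary; the dictionary used in the paper is not shown.
INFORMAL_TO_FORMAL = {"cemungudh": "semangat"}


def expand_numeric(word):
    # Assumed rule: a "2" repeats the letters directly before it ("ga2l" -> "gagal").
    return re.sub(r"([a-z]+)2", r"\1\1", word)


def collapse_vowels(word):
    # Collapse any run of the same vowel into a single vowel.
    return re.sub(r"([aeiou])\1+", r"\1", word)


def normalize(text):
    words = []
    for word in text.lower().split():
        word = expand_numeric(word)
        word = collapse_vowels(word)
        # Replace the word with its formal form when the dictionary knows it.
        words.append(INFORMAL_TO_FORMAL.get(word, word))
    return " ".join(words)


print(normalize("ga2l cemunguuuudh"))  # prints "gagal semangat"
```

Note that collapsing every repeated vowel would also shorten legitimate words that contain a double vowel, so a real system would probably need an exception list.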

Unigram

Negativity

Number of interjection words

Question word

Word context

Affix

Feature extraction involves reducing the amount of resources required to describe a large set of data

Unigram is more suitable for Indonesian social media text since the grammar used in Indonesian social media texts is varied and informal

Only terms that exist in our translated SentiWordNet are taken from the text as unigrams. We translated the English SentiWordNet into Indonesian using an available statistical machine translation system (Google)

In the translation, one Indonesian word with more than one English translation is given the average score of all the English translations
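The averaging rule can be illustrated with a small sketch. The English words, their scores, and the translation table below are hypothetical placeholders, and a single score per word is assumed for simplicity; the real SentiWordNet structure is richer.

```python
from statistics import mean

# Hypothetical English sentiment scores (placeholders, not real SentiWordNet values).
ENGLISH_SCORES = {"cheap": 0.5, "inexpensive": 0.25, "good": 0.75}

# Hypothetical English-to-Indonesian translations.
TRANSLATIONS = {"cheap": "murah", "inexpensive": "murah", "good": "bagus"}


def build_indonesian_lexicon(english_scores, translations):
    # Collect every English score that maps to the same Indonesian word.
    collected = {}
    for english_word, score in english_scores.items():
        indonesian_word = translations.get(english_word)
        if indonesian_word is None:
            continue
        collected.setdefault(indonesian_word, []).append(score)
    # An Indonesian word with several English translations gets their average score.
    return {word: mean(scores) for word, scores in collected.items()}


lexicon = build_indonesian_lexicon(ENGLISH_SCORES, TRANSLATIONS)
print(lexicon["murah"])  # prints 0.375
```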

The sentiment of a word may change depending on its context. For example, the word “mahasiswa” (student) is basically neutral, but when it is preceded by “harga” (price), forming “harga mahasiswa” (low price), it becomes a positive word.

Words with different affixes may have different sentiments. For example, “murah” (low) has a positive sentiment, while “murahan” (twopenny) has a negative sentiment

Represents the percentage of the negative sentiment in the topic of the text message.

This feature is intended to catch global information. It gives information about the real sentiment of a certain topic

To get this feature, the topic of the text message should be extracted first. In this research, we did the topic extraction manually

Examples of interjection words are “aha”, “bah”, “nah”, “wew”, “wow”, “yay”, “uh”, etc.

We employ this feature based on our observation that among 100 sarcastic texts, 20 contain interjection words
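Counting interjection words could look roughly like the sketch below. The word list contains only the examples named in these notes, and the simple letter-based tokenizer is an assumption for illustration.

```python
import re

# Interjection examples listed in the notes; the full list used in the paper is not shown.
INTERJECTIONS = {"aha", "bah", "nah", "wew", "wow", "yay", "uh"}


def count_interjections(text):
    # Assumed tokenization: lowercase the text and keep runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token in INTERJECTIONS)


print(count_interjections("Wow, bagus banget, yay!"))  # prints 2
```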

This feature is used to classify neutral text. Detecting a question word such as “who”, “what”, “when”, “how”, “where”, or “why” indicates that the text has no sentiment value.

There are two classification steps. The first classification assigns each text to one of three sentiment classes: positive, negative and neutral

The second classification detects sarcasm in the positive text

All the classifications are conducted using several machine learning algorithms such as Naïve Bayes, Maximum Entropy, and Support Vector Machine. These algorithms were chosen because they have shown good accuracy in many text classification tasks (Pang, B., Lee, L., and Vaithyanathan, S. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques)

We also employed translated SentiWordNet in the sentiment classification

Resource of word with sentiment score

Reference

Naïve Bayes, Maximum Entropy, and Support Vector Machine were used because these algorithms tend to outperform other algorithms in text classification

Sentiment Classification for Indonesian Messages in Social Media

Sistem Analisis Opini Microblogging Berbahasa Indonesia

Thumbs up? Sentiment Classification using Machine Learning Techniques

We only classify sarcasm in positive text because almost all sarcastic text looks like positive text, while its real value is negative. Although sarcastic text with a genuinely positive value may be encountered, its quantity is very low.

Naïve Bayes

Often used in automated text classification because it is simple and does not need a lot of data compared to other machine learning algorithms [8]
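To make the idea concrete, here is a minimal multinomial Naïve Bayes sketch over unigram tokens with add-one smoothing. It only illustrates the general technique; the class name, the smoothing choice, and the tiny training set are assumptions, not the configuration used in the paper.

```python
import math
from collections import Counter, defaultdict


class UnigramNaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_documents = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior of the class.
            score = math.log(count / total_documents)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Smoothed log likelihood of the token given the class.
                numerator = self.word_counts[label][token] + 1
                denominator = total_words + len(self.vocabulary)
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data, for illustration only.
train_documents = [["bagus", "murah"], ["bagus"], ["mahal", "jelek"]]
train_labels = ["positive", "positive", "negative"]
model = UnigramNaiveBayes().fit(train_documents, train_labels)
print(model.predict(["bagus"]))  # prints "positive"
```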

Maximum Entropy

Constructs a stochastic model that accurately represents the behaviour of the process. Such a model estimates the conditional probability of the process output given a context.

Support Vector Machine

Maps every feature set into a feature space and constructs a model based on a linear boundary (a hyperplane) that separates the classes of the mapped feature sets

Maximum Entropy finds the most uniform model for each context [9].

In the first classification, the feature is the unigram, while in the second classification, the features are unigram, negativity and the number of interjection words.
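The two-step flow can be sketched as a small cascade. The two classifier functions below are hand-written stand-ins for trained models, and their rules and thresholds are invented purely so the example runs; only the overall flow (sentiment first, then sarcasm detection on positive text with the extra features) follows the notes.

```python
def classify_text(unigrams, negativity, interjection_count, sentiment_clf, sarcasm_clf):
    # Step 1: three-class sentiment classification using unigram features only.
    sentiment = sentiment_clf(unigrams)
    if sentiment != "positive":
        return sentiment
    # Step 2: sarcasm detection on positive text, using unigram, negativity,
    # and number of interjection words as features.
    features = {
        "unigrams": unigrams,
        "negativity": negativity,
        "interjections": interjection_count,
    }
    return "sarcasm" if sarcasm_clf(features) else "positive"


# Placeholder classifiers standing in for trained models (hypothetical rules).
def toy_sentiment_clf(unigrams):
    if "bagus" in unigrams:
        return "positive"
    if "jelek" in unigrams:
        return "negative"
    return "neutral"


def toy_sarcasm_clf(features):
    return features["negativity"] > 0.5 and features["interjections"] > 0


print(classify_text(["wow", "bagus"], 0.8, 1, toy_sentiment_clf, toy_sarcasm_clf))  # prints "sarcasm"
print(classify_text(["bagus"], 0.1, 0, toy_sentiment_clf, toy_sarcasm_clf))  # prints "positive"
```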

Rather than directly classifying a text into three classes, the levelled method first classifies a text as opinion or neutral. After that, the opinion text is classified as positive or negative

Result

We have shown that using the sentiment score improves the accuracy of general sentiment analysis by about 4%. The results also show that the negativity feature is quite effective for detecting sarcasm, since it increased the accuracy by 6%

Based on our observation of the sarcasm data, we added the features of negativity and number of interjection words. The negativity feature tried to catch the global sentiment value, while the interjection feature represents the lexical phenomena of the text message.

Experiments on classification method

Experiments on Sentiment Score

Using sentiment score in the classification gave higher accuracy than only using the lexical words

We compared two settings in the experiment: using only the lexical form of the sentiment word, and using the score of the sentiment word.

We evaluated the usage of the translated SentiWordNet in the first classification type, which classifies each text into three classes: positive, negative and neutral.

The sentiment score can differentiate a word with a low score for a certain sentiment from a word with a high score. By using the sentiment score, a word with a low sentiment score may be ignored, giving a result of the neutral or opposite class.

For example, in the text “Berlibur di Hanoi Vietnam juga bisa menjadi liburan yang berbeda saat Tahun Baru Cina nanti Travelers” the word “bisa” (can) has a sentiment score of 0.125 which shows a low valued positive sentiment.

Using the sentiment score and term weight, the word “bisa” is ignored and it gives the neutral sentiment, which is a correct one. But, by using only the lexical value, the word “bisa” can’t be ignored and it gives the positive sentiment, which is an incorrect result.

Evaluated the direct and levelled method in sentiment classification.

Both classification methods use the sentiment score feature as one of the base features

Features

The results in Table 2 showed that direct classification gave higher accuracy than the levelled method. Both methods use the same features.

Unigram

Sentiment Score

Question words

Experiments on sarcasm detection

The results showed that the additional features are effective in sarcasm detection

Evaluated the additional features of negativity and number of interjection words in sarcasm detection

The additional features also showed that Indonesian people tend to write their criticism using sarcasm.

As for the low accuracy, we found that many sarcastic texts have no global topic, for example the text “Men, lu ganteng banget kalo pake dress.p” (Man, you are so handsome wearing a dress).

Here, the text topic is not widely known, and thus the negativity feature is useless. There are quite a lot of texts with personal messages that can only be analy

Edwin Lunando

Year

2013

Negation

In order to get this feature, the topic of the text message should be extracted first. In this research, we did the topic extraction manually to determine the topic for each text.

Negation words usually appear before the sentiment word and can be located two or three words away from it. We handled this by multiplying the score of the sentiment word closest to the negation word by -1.

For example, in “Televisi jaman sekarang tidak begitu bagus karena mahal” (nowadays televisions are not good because they are expensive), the sentiment word “bagus” (good) becomes negative because it is preceded by a negation word (“tidak”, or “not”)
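One way to sketch this negation handling is to flip the sign of the nearest sentiment word that follows a negation word within a small window. The window size of three, the negation word list, and the sentiment scores below are assumptions chosen for the example, not values from the paper.

```python
# Illustrative negation words and hypothetical sentiment scores.
NEGATION_WORDS = {"tidak", "bukan"}
SENTIMENT_SCORES = {"bagus": 0.75, "mahal": -0.5}
WINDOW = 3  # assumed: look at most three words past the negation word


def score_with_negation(text):
    tokens = text.lower().split()
    # Start from the plain lexicon score of every sentiment word, keyed by position.
    scores = {i: SENTIMENT_SCORES[t] for i, t in enumerate(tokens) if t in SENTIMENT_SCORES}
    for i, token in enumerate(tokens):
        if token not in NEGATION_WORDS:
            continue
        # Flip the sign of the closest sentiment word after the negation word.
        for j in range(i + 1, min(i + 1 + WINDOW, len(tokens))):
            if j in scores:
                scores[j] *= -1
                break
    return [(tokens[i], score) for i, score in sorted(scores.items())]


print(score_with_negation("Televisi jaman sekarang tidak begitu bagus karena mahal"))
# prints [('bagus', -0.75), ('mahal', -0.5)]
```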

Negation words (such as “no” or “not”) tend to change the sentiment score of a certain word.

For example, suppose that for a topic about an Indonesian singer, “Rhoma Irama wants to be Indonesia’s President”, 80% of the public texts have negative sentiment. Then the negativity value of the topic is 80%, or 0.8. After the negativity value of each topic is determined, the negativity value of an input text is obtained by extracting its topic and looking up that topic’s negativity value.
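The negativity computation can be sketched as a per-topic ratio followed by a lookup. The topic key and the ten labels below are hypothetical and chosen only to reproduce the 80% example; the fallback value of 0.0 for an unknown topic is also an assumption.

```python
def topic_negativity(labels):
    # Fraction of texts in the topic that carry negative sentiment.
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "negative") / len(labels)


# Hypothetical labelled texts per topic (8 negative, 2 positive).
TOPIC_LABELS = {
    "rhoma irama president": ["negative"] * 8 + ["positive"] * 2,
}

NEGATIVITY_BY_TOPIC = {topic: topic_negativity(labels) for topic, labels in TOPIC_LABELS.items()}


def text_negativity(topic):
    # The text inherits the negativity of its (manually extracted) topic.
    # Assumed fallback: 0.0 when the topic is unknown.
    return NEGATIVITY_BY_TOPIC.get(topic, 0.0)


print(text_negativity("rhoma irama president"))  # prints 0.8
```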

If the text uses interjection words, it is more likely to be classified as sarcasm.