Indonesian Social Media Sentiment Analysis with Sarcasm Detection
Information
Meta
Goals
We proposed two additional features to detect sarcasm after a standard sentiment analysis has been conducted
Author
Ayu Purwarianti
Problem
Sarcasm is considered one of the most difficult problems in sentiment analysis.
In our observation on Indonesian social media, for certain topics, people tend to criticize something using
Features
Negativity information
Number of interjection words
SentiWordNet
All the classifications were conducted with machine learning algorithms
In sentiment analysis, detecting sarcasm automatically is still considered a difficult problem because lexical features do not give enough information to detect sarcasm
Possible approach
Lexical factors such as the presence of adjectives and adverbs, the presence of interjections, and the use of punctuation play a significant role in sarcasm
Methodology
Preprocessing
Feature extraction component
Classification component
Minimize the vocabulary of terms used in the text message
In Indonesian social media text messages, such as those on Twitter or Facebook, people tend to use slang words rather than formal ones, for example by using numerals to replace letters, repeating vowel characters, and using common informal words in place of formal words [2]
We convert numeric characters into letters, such as “ga2l” into “gagal”
Remove vowel repetition, such as “cemunguuuudh” into “cemungudh”
Translate informal words into formal words using dictionary, such as “cemungudh” into “semangat” (high spirit).
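The three preprocessing steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the rule that "2" repeats the letters before it is an assumption inferred from the single “ga2l” example, and INFORMAL_TO_FORMAL is a hypothetical one-entry stand-in for the dictionary mentioned above.

```python
import re

# Hypothetical informal-to-formal dictionary; the real dictionary is not given in the source.
INFORMAL_TO_FORMAL = {"cemungudh": "semangat"}

def expand_numerals(word):
    # Assumed rule: "2" marks repetition of the letters before it ("ga2l" -> "gagal").
    return re.sub(r"([a-z]+)2", r"\1\1", word)

def collapse_vowels(word):
    # Collapse any run of the same vowel into a single vowel.
    return re.sub(r"([aeiou])\1+", r"\1", word)

def normalize(text):
    words = []
    for word in text.lower().split():
        word = expand_numerals(word)
        word = collapse_vowels(word)
        # Replace informal words by their formal form when the dictionary knows them.
        words.append(INFORMAL_TO_FORMAL.get(word, word))
    return " ".join(words)

print(normalize("ga2l cemunguuuudh"))
```

Note that collapsing every repeated vowel would also shorten legitimate words containing double vowels, so a real system would probably need an exception list.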
Unigram
Negativity
Number of interjection words
Question word
Word context
Affix
Feature extraction involves reducing the amount of resources required to describe a large set of data
Unigrams are more suitable for Indonesian social media text because the grammar used in these texts is varied and informal
Only terms that exist in our translated SentiWordNet are taken as unigrams. We translated the English SentiWordNet into Indonesian using an available statistical machine translation service (Google)
In the translation, one Indonesian word with more than one English translation is given the average score of all the English translations
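The averaging rule and the unigram filter can be sketched as below. The scores and word pairs are made up for illustration, and a single signed score per word is a simplifying assumption (SentiWordNet itself stores separate positive and negative scores).

```python
from collections import defaultdict

def build_indonesian_lexicon(translations, english_scores):
    """translations: list of (english_word, indonesian_word) pairs."""
    collected = defaultdict(list)
    for english, indonesian in translations:
        if english in english_scores:
            collected[indonesian].append(english_scores[english])
    # An Indonesian word with several English translations gets their average score.
    return {word: sum(scores) / len(scores) for word, scores in collected.items()}

def extract_unigrams(text, lexicon):
    # Keep only the terms that exist in the translated lexicon.
    return [token for token in text.lower().split() if token in lexicon]

# Illustrative values only, not taken from SentiWordNet.
english_scores = {"good": 0.75, "nice": 0.25, "expensive": -0.5}
translations = [("good", "bagus"), ("nice", "bagus"), ("expensive", "mahal")]

lexicon = build_indonesian_lexicon(translations, english_scores)
print(lexicon)
print(extract_unigrams("Televisi ini bagus tapi mahal", lexicon))
```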
A word's sentiment may change depending on its context. For example, the word “mahasiswa” (student) is basically a neutral word, but if it is preceded by “harga” (price), forming “harga mahasiswa” (low price), then the word becomes positive.
Words with different affixes may have different sentiments. For example, “murah” (low) has a positive sentiment, while “murahan” (twopenny) has a negative sentiment
Represents the percentage of the negative sentiment in the topic of the text message.
This feature is intended to catch global information. It gives information about the real sentiment of a certain topic
To get this feature, the topic of the text message should be extracted first. In this research, we did the topic extraction manually
Examples of interjection words are “aha”, “bah”, “nah”, “wew”, “wow”, “yay”, “uh”, etc
We employ this feature based on our observation that among 100 sarcastic texts, 20 contain interjection words
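Counting interjection words is straightforward; a minimal sketch follows. The word list contains only the examples quoted above, and the tokenisation by runs of letters is an assumption.

```python
import re

# Only the interjections listed in the text; a full list would be longer.
INTERJECTIONS = {"aha", "bah", "nah", "wew", "wow", "yay", "uh"}

def count_interjections(text):
    # Split into lowercase alphabetic tokens so punctuation does not hide a match.
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token in INTERJECTIONS)

print(count_interjections("Wow, bagus banget. Nah!"))
```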
This feature is used to classify neutral text. Detecting a question word such as “who”, “what”, “when”, “how”, “where”, or “why” indicates that the text has no sentiment value.
There are two classification steps. The first classification is to classify each text into three sentiment classes: positive, negative and neutral
The second classification is to classify the sarcasm of the positive text
All the classifications are conducted using several machine learning algorithms such as Naïve Bayes, Maximum Entropy, and Support Vector Machine. These algorithms were chosen because they have shown good accuracy in many text classification tasks (Pang, B., Lee, L., and Vaithyanathan, S. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques)
We also employed the translated SentiWordNet in the sentiment classification
Resource of words with sentiment scores
Reference
Naïve Bayes, Maximum Entropy, and Support Vector Machine, because these algorithms tend to outperform other algorithms in text classification
Sentiment Classification for Indonesian Messages in Social Media
Sistem Analisis Opini Microblogging Berbahasa Indonesia
Thumbs up? Sentiment Classification using Machine Learning Techniques
We only classify positive texts for sarcasm because almost all sarcastic texts look like positive texts, while their real sentiment is negative. Although sarcastic texts whose real value is positive may occur, their quantity is very low.
Naïve Bayes
Often used in automated text classification because it is simple and does not need a lot of data compared to other machine learning algorithms [8]
Maximum Entropy
Constructs a stochastic model that accurately represents the behaviour of the process. Such a model estimates the conditional probability of the process output given a context.
Support Vector Machine
Maps every feature set into a feature space and constructs a model based on a linear boundary that separates the classes of the mapped feature sets
Maximum Entropy finds the most uniform model for each context [9].
In the first classification, the feature is the unigram, while in the second classification, the features are unigram, negativity and the number of interjection words.
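The difference between the two feature sets can be sketched as feature dictionaries. The function names, the feature keys, and the dictionary representation are illustrative assumptions; the source does not specify how features are encoded.

```python
def first_stage_features(tokens, lexicon):
    # First classification: unigrams only (terms found in the sentiment lexicon).
    return {f"unigram={token}": 1 for token in tokens if token in lexicon}

def second_stage_features(tokens, lexicon, negativity, interjections):
    # Second classification: unigrams plus the two sarcasm-oriented features.
    features = first_stage_features(tokens, lexicon)
    features["negativity"] = negativity
    features["interjection_count"] = sum(1 for token in tokens if token in interjections)
    return features

# Illustrative inputs only.
tokens = ["wow", "bagus", "banget"]
lexicon = {"bagus": 0.5}
print(second_stage_features(tokens, lexicon, negativity=0.8, interjections={"wow"}))
```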
Rather than directly classifying a text into three classes, the levelled method first classifies a text as opinion or neutral. After that, the opinion text is classified into the positive or negative class
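The control flow of the direct and levelled methods can be contrasted in a short sketch. The toy_* functions are hypothetical rule-based placeholders standing in for trained classifiers (Naïve Bayes, Maximum Entropy, or SVM in the source); only the two wrapper functions reflect the structure described here.

```python
def classify_direct(features, three_way_model):
    # One model decides among positive, negative and neutral.
    return three_way_model(features)

def classify_levelled(features, opinion_model, polarity_model):
    # Level 1: opinion vs neutral. Level 2: positive vs negative for opinions only.
    if opinion_model(features) == "neutral":
        return "neutral"
    return polarity_model(features)

# Toy placeholders for trained models; thresholds are arbitrary assumptions.
def toy_opinion_model(features):
    strong = any(abs(score) >= 0.25 for score in features.values())
    return "opinion" if strong else "neutral"

def toy_polarity_model(features):
    return "positive" if sum(features.values()) > 0 else "negative"

def toy_three_way_model(features):
    total = sum(features.values())
    if total >= 0.25:
        return "positive"
    if total <= -0.25:
        return "negative"
    return "neutral"

print(classify_levelled({"mahal": -0.5}, toy_opinion_model, toy_polarity_model))
print(classify_direct({"bisa": 0.125}, toy_three_way_model))
```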
Result
We have shown that using the sentiment score improves the accuracy of general sentiment analysis by about 4%. The results also show that the negativity feature is quite effective for detecting sarcasm, since it increased the accuracy by 6%
Based on our observation of the sarcasm data, we added the features of negativity and number of interjection words. The negativity feature tries to capture the global sentiment value, while the interjection feature represents the lexical phenomena of the text message.
Experiments on classification method
Experiments on Sentiment Score
Using sentiment score in the classification gave higher accuracy than only using the lexical words
We compared two things in the experiment: using only the lexical form of the sentiment word, and using the score of the sentiment word.
We evaluated the usage of the translated SentiWordNet in the first classification type, which classifies each text into 3 classes: positive, negative and neutral.
The sentiment score can differentiate words with a low score for a certain sentiment from words with a high score. With the sentiment score, a word with a low sentiment score may be ignored, giving a neutral or opposite class result
For example, in the text “Berlibur di Hanoi Vietnam juga bisa menjadi liburan yang berbeda saat Tahun Baru Cina nanti Travelers” the word “bisa” (can) has a sentiment score of 0.125 which shows a low valued positive sentiment.
Using the sentiment score and term weight, the word “bisa” is ignored and it gives the neutral sentiment, which is a correct one. But, by using only the lexical value, the word “bisa” can’t be ignored and it gives the positive sentiment, which is an incorrect result.
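The contrast between the two settings can be sketched as two feature encodings of the same text. The score 0.125 for “bisa” comes from the example above; the other score and the dictionary encoding are assumptions, and how the classifier then weighs these values is not shown.

```python
def lexical_features(tokens, lexicon):
    # Lexical setting: a sentiment word is either present (1) or absent.
    return {token: 1 for token in tokens if token in lexicon}

def scored_features(tokens, lexicon):
    # Score setting: each sentiment word carries its sentiment score.
    return {token: lexicon[token] for token in tokens if token in lexicon}

# 0.125 is from the example in the text; 0.75 is an illustrative value.
lexicon = {"bisa": 0.125, "bagus": 0.75}
tokens = "berlibur di hanoi juga bisa menjadi liburan yang berbeda".split()

print(lexical_features(tokens, lexicon))
print(scored_features(tokens, lexicon))
```

With the lexical encoding, “bisa” looks exactly like a strongly positive word; with the scored encoding, its small value lets a classifier treat it as nearly neutral.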
We evaluated the direct and levelled methods in sentiment classification.
Both classifications use the sentiment score feature as one of the base features
Features
The results in Table 2 showed that direct classification gave higher accuracy than the levelled method. Both methods use the same features.
Unigram
Sentiment Score
Question words
Experiments on sarcasm detection
The results showed that the additional features are effective in sarcasm detection
We evaluated the additional features of negativity and number of interjection words in sarcasm detection
The additional features also showed that Indonesian people tend to write their criticism using sarcasm.
As for the low accuracy, we found that many sarcastic texts have no global topic. For example, the text “Men, lu ganteng banget kalo pake dress.p” (Man, you are so handsome wearing a dress).
Here, the text topic is not widely known; thus, the negativity feature is useless. There are quite a lot of texts with personal messages that can only be analy
Edwin Lunando
Year
2013
Negation
In order to get this feature, the topic of the text message should be extracted first. In this research, we did the topic extraction manually to determine the topic for each text.
Negation words usually appear before the sentiment word and can be located two or three words away from it. We handled this by multiplying the score of the sentiment word closest to the negation word by -1.
In “Televisi jaman sekarang tidak begitu bagus karena mahal” (nowadays televisions are not good because they are expensive), the sentiment word “bagus” (good) becomes negative because it is preceded by a negation word (“tidak” or “not”)
Negation words (such as “no” or “not”) tend to change the sentiment score of a certain word.
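Negation handling can be sketched as flipping the sign of the nearest sentiment word that follows a negation word, which matches the example where “bagus” becomes negative. The sign flip (multiplying by -1), the window of three words, the single negation word, and the lexicon scores are assumptions made for illustration.

```python
NEGATION_WORDS = {"tidak"}  # assumed list; only "tidak" appears in the text
WINDOW = 3                  # look at most three words after the negation word

def apply_negation(tokens, lexicon):
    # Start from the plain lexicon score of every sentiment word, keyed by position.
    scores = {i: lexicon[token] for i, token in enumerate(tokens) if token in lexicon}
    for i, token in enumerate(tokens):
        if token not in NEGATION_WORDS:
            continue
        # Flip only the closest sentiment word inside the window.
        for j in range(i + 1, min(i + WINDOW + 1, len(tokens))):
            if j in scores:
                scores[j] *= -1
                break
    return [(tokens[i], score) for i, score in sorted(scores.items())]

# Illustrative scores, not taken from SentiWordNet.
lexicon = {"bagus": 0.5, "mahal": -0.5}
tokens = "televisi jaman sekarang tidak begitu bagus karena mahal".split()
print(apply_negation(tokens, lexicon))
```

In this sentence “bagus” is two words after “tidak” and is flipped, while “mahal” lies outside the window and keeps its original score.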
For example, suppose that for a topic about an Indonesian singer, “Rhoma Irama wants to be Indonesia’s President”, 80% of the public texts have negative sentiment. Then the negativity value of the topic is 80%, or 0.8. After determining the negativity value of each topic, the negativity value of an input text is obtained by extracting the topic of the input text and looking up the negativity value of that topic.
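The negativity computation can be sketched as a per-topic ratio followed by a lookup. The topic key, the labelled sample, and the default of 0.0 for unknown topics are hypothetical; in the source the topic itself was extracted manually.

```python
def topic_negativity(labelled_texts):
    """labelled_texts: list of (topic, sentiment_label) pairs."""
    totals, negatives = {}, {}
    for topic, label in labelled_texts:
        totals[topic] = totals.get(topic, 0) + 1
        if label == "negative":
            negatives[topic] = negatives.get(topic, 0) + 1
    # Negativity = share of negative texts within each topic.
    return {topic: negatives.get(topic, 0) / totals[topic] for topic in totals}

def negativity_feature(topic, table, default=0.0):
    # The default for unseen topics is an assumption, not taken from the source.
    return table.get(topic, default)

# Made-up sample: 8 negative and 2 positive texts on one topic.
sample = [("rhoma irama president", "negative")] * 8 + [("rhoma irama president", "positive")] * 2
table = topic_negativity(sample)
print(negativity_feature("rhoma irama president", table))
```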
If a text uses interjection words, it is more likely to be classified as sarcastic.