SENTIMENT ANALYSIS OF INDONESIAN MOBILE OPERATOR WITH TWITTER DATA
Information
Method
Overview
Analysis and Design
Analyze and design the RapidMiner model, including the selection of operators (e.g. “Validation”) that may assist in prediction analysis
Data Gathering
Gather data from Twitter over a limited period of approximately one month, in specific categories, e.g. network performance, promotions, and product offering prices
Preprocessing the Data
Preprocess the data with case folding
Information analysis and sentiment analysis
Calculate the accuracy of the RapidMiner model from the true/false positive and negative results among the sentiments found.
Diagram and Statistic creation
METHODOLOGY
Corpus Creation
The data used is gathered from the social media platform Twitter.
The Twitter data is used for corpus creation following the research by Pak & Paroubek (2010), “Twitter as a Corpus for Sentiment Analysis and Opinion Mining”, modified to use Indonesian-language tweets only.
Web Crawling
Twitter provides an Application Programming Interface (API) that developers and researchers can use
The API is used to get data from Twitter, a process known as crawling
Manual Selection
Manual selection is the step of manually checking whether each tweet produced by the web crawling process has the correct sentiment
Each tweet is then also checked to confirm that it matches the selected category
Data Preprocessing
Reduce noise, normalize words, and reduce word volume; remove duplicated tweets, emoticons, retweets, URLs, usernames, and special characters
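The cleaning steps above, together with the case folding mentioned in the overview, can be sketched in a few lines of Python. This is a minimal illustration, not the RapidMiner process used in the research: the function name `preprocess_tweets` and all regular expressions are assumptions, and the emoticon pattern only covers a few common ASCII emoticons.

```python
import re

# All patterns below are illustrative assumptions, not taken from the source.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][\-o\*']?[\)\]\(\[dDpP/\\|]")
SPECIAL_RE = re.compile(r"[^a-z0-9\s]")
RETWEET_RE = re.compile(r"^\s*RT\b")


def preprocess_tweets(tweets):
    """Drop retweets and duplicates; strip URLs, usernames, emoticons, and special characters."""
    cleaned, seen = [], set()
    for tweet in tweets:
        if RETWEET_RE.match(tweet):        # remove retweets
            continue
        text = URL_RE.sub(" ", tweet)      # remove URLs
        text = USER_RE.sub(" ", text)      # remove usernames
        text = EMOTICON_RE.sub(" ", text)  # remove emoticons
        text = text.lower()                # case folding
        text = SPECIAL_RE.sub(" ", text)   # remove special characters
        text = " ".join(text.split())      # collapse whitespace
        if text and text not in seen:      # remove duplicated tweets
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Removing URLs and usernames before case folding and special-character stripping keeps fragments such as `http` or a bare username from leaking into the cleaned text.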
Data Learning
The results gathered in the corpus are then separated into a training set and a testing (validation) set
Feature Selection
Only the unigram feature is selected, because there is no WordNet for Indonesian to easily classify words, and the unigram feature treats every word in the document as a feature
Weighting
Based on the previous Feature Selection step, the documents are represented in a vector space. From the document vectors, the keywords are ranked by their importance in a document relative to all documents.
Weighting feature
Term Frequency
This method counts the frequency of a word in a document.
Term Presence
Records the presence or absence of a word in a document. The word's frequency in the document does not affect the value, because only presence or absence is recorded.
Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF statistically reflects the importance of a keyword to a document within a collection of documents.
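The three weighting schemes can be compared side by side with a short sketch. The source does not give a TF-IDF formula, so the code assumes one common variant, raw term count multiplied by log(N / document frequency). The function names are illustrative.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Weight of each word = number of times it occurs in the document."""
    return dict(Counter(tokens))


def term_presence(tokens):
    """Weight of each word = 1 if it occurs at all, regardless of how often."""
    return {token: 1 for token in set(tokens)}


def tf_idf(documents):
    """Assumed variant: tf(t, d) * log(N / df(t)), computed for every document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weights
```

Under this variant, a word that appears in every document receives a weight of zero, which is how TF-IDF discounts words that do not help distinguish one document from another.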
SVM Classification
One of the learning methods in this research is SVM. SVM separates the training set data into two classes, negative and positive.
Validation and Evaluation (Testing)
Decision Tree
Evaluation and Validation
X-Validation
Used to validate the training set, because not all data in the training set can be represented in the training process. X-Validation is a technique for assessing how the results of a statistical analysis will generalize.
The validation process uses k-fold cross-validation, which separates the data into k parts of equal size, so the number of validation rounds is determined by k.
X-Validation, or cross-validation, is a validation process that divides the data into several sections called folds.
After the data is divided into folds (four in this example), the first fold is held out as the testing set, and the training data is taken from folds 2, 3 and 4. The algorithm has never seen fold 1 before, so the fold acts as a simulation of data from the real world.
After the error rate is acquired from the first round of cross-validation, the folds are swapped: fold 1 is swapped with fold 2, or with any other fold whose error rate has not yet been measured, and the same algorithm is evaluated on that fold using the same process as in the first round.
Once every fold has been swapped in and its error rate measured, the error rates are averaged; this average is called the X-Validation error.
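The fold-swapping procedure described above can be sketched as follows. This is a plain-Python illustration, not RapidMiner's X-Validation operator; `majority_class_trainer` is a hypothetical stand-in learner used only to make the example runnable, and the sketch assumes k is no larger than the number of items.

```python
def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size


def cross_validation_error(data, labels, k, train_fn):
    """Average the per-fold error rates (the X-Validation error)."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(data), k):
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        wrong = sum(model(data[i]) != labels[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / len(errors)


def majority_class_trainer(train_data, train_labels):
    """Hypothetical learner: always predicts the most common training label."""
    most_common = max(set(train_labels), key=train_labels.count)
    return lambda item: most_common
```

Any function that takes training data and labels and returns a callable predictor can be passed as `train_fn`, so the same loop would work for an SVM or a decision tree.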
Evaluation
Performance is evaluated on the RapidMiner model using a confusion matrix built from the SVM output. The training classification uses 1200 positive and 1200 negative data items.
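As a small illustration of how accuracy follows from a confusion matrix, the sketch below counts true/false positives and negatives from two label lists. The label strings and function names are assumptions; in the research itself these counts come from RapidMiner.

```python
def confusion_matrix(actual, predicted, positive="positive"):
    """Count true/false positives and negatives for a binary sentiment task."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def accuracy(cm):
    """Accuracy = correctly classified items / all items."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())
```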
Next Level Analysis
Statistical Analysis
The statistical analysis is created with Microsoft Excel and surfaces facts broken down by time, mobile operator, and more. For example, it can show when Indonesian users tweet most frequently.
Tag Word Creation
Tag word analysis can give more specific insight into the data; for example, it can show which operator is related to the most keywords.
Consists of statistics creation, tag word creation, and other interesting facts that can be gathered from the corpus
Based on the next-level analysis, the business can implement data-driven decision making, which is becoming a trend in the market.
Sentiment Analysis
Has been developed since 2003
A part of text mining: computational research on the sentiment, emotion, opinion, comment, and every other expression conveyed by text.
According to (Dave et al., Opinion extraction and semantic classification of product reviews, 2003), sentiment analysis is a tool that processes a collection of search results relating to the attributes of a product (quality, features, etc.) and aggregates the opinions.
Based on the classification, sentiment analysis is divided into two main groups:
Classification of documents into opinion or fact classes, known as subjectivity classification
Classification of documents as positive or negative, known as sentiment analysis
Sentiment analysis model
The term object is used to denote the entity that has been commented on; an object has components and a set of attributes (Liu, Sentiment Analysis and Subjectivity, 2010).
For example, in “The speed of Ferrari is so fast”, the object is “Ferrari” and the attribute commented on is “speed”.
Sentiment sentences
Sentences that express a negative or positive emotion, explicitly or implicitly
Sentiment sentences can be either subjective or objective sentences (Liu, Sentiment Analysis and Subjectivity, 2010)
Sentiment Lexicon
In sentiment analysis, the document must be an opinion that carries a sentiment. To gather sentiment documents, every word in a document should be analyzed to determine whether the document contains sentiment words.
The collection of these words is called an opinion lexicon (Liu, Sentiment Analysis and Subjectivity, 2010).
(Liu, Sentiment analysis and subjectivity, 2010) states that words carrying sentiment can be gathered using a dictionary. The strategy is to list a set of known sentiment words and query the dictionary for the synonyms and antonyms of each word. The results of each query are used as the input for the next query.
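The dictionary-based strategy can be sketched as a simple queue-driven expansion. The sketch assumes that synonyms inherit the polarity of the queried word and antonyms receive the opposite polarity. The `thesaurus` structure and the function name `expand_lexicon` are hypothetical stand-ins for a real dictionary lookup.

```python
from collections import deque

FLIP = {"positive": "negative", "negative": "positive"}


def expand_lexicon(seed_words, thesaurus):
    """Grow an opinion lexicon from seed words.

    seed_words: dict mapping word -> "positive" or "negative".
    thesaurus:  dict mapping word -> {"synonyms": [...], "antonyms": [...]}
                (hypothetical structure standing in for a dictionary query).
    """
    lexicon = dict(seed_words)
    queue = deque(seed_words)
    while queue:
        word = queue.popleft()
        polarity = lexicon[word]
        entry = thesaurus.get(word, {})
        for synonym in entry.get("synonyms", []):
            if synonym not in lexicon:
                lexicon[synonym] = polarity        # assumed: same polarity
                queue.append(synonym)              # result feeds the next query
        for antonym in entry.get("antonyms", []):
            if antonym not in lexicon:
                lexicon[antonym] = FLIP[polarity]  # assumed: opposite polarity
                queue.append(antonym)
    return lexicon
```

Each newly found word is queued for its own lookup, which mirrors using the result of one query as the parameter for the next; the loop stops when no new words are discovered.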
Sentiment Orientation
A sentiment on a feature f of an object is a positive or negative view, emotion, or response toward f from an opinion holder (Liu, Sentiment analysis and subjectivity, 2010).
Opinion holder
An individual or organization that expresses the sentiment
The orientation of an opinion is related to its polarity (Yi et al., Extracting sentiments about a given topic using natural language processing techniques, 2003).
Emotion is defined as the subjective feelings and thoughts of an individual or organization (Liu, Sentiment analysis and subjectivity, 2010).
Sentiment Classification
Topic-based classification classifies documents by the topic they describe, for example sports, science, or economics
Divided
Coarse-grained sentiment analysis
This aims to determine whether a document has positive or negative sentiment; also known as document-level sentiment classification.
Fine-grained sentiment analysis
The point of this analysis is to classify the subjectivity of a sentiment, known as sentence-level subjectivity classification: a step that determines whether a sentence is subjective or objective, along with its associated opinion.
Sarcasm in the Sentiment
According to (Roberto et al., Identifying Sarcasm in Twitter: A closer look, 2011), sarcasm transforms the polarity of an apparently positive or negative utterance into its opposite. Sarcasm is considered a difficult problem in text mining (Nigam et al., towards a robust metric in polarity, 2006)
Based on (Roberto et al., Identifying Sarcasm in Twitter: A closer look, 2011), the proper way to minimize mistakes caused by sarcasm is to separate sarcastic documents from non-sarcastic documents that directly convey positive or negative sentiment.
Feature and Weighting
n-gram Feature
Features used in the analysis can be syntactic, semantic, POS-based, or link-based. Features are extracted from the whole document, and the document is analyzed first to obtain the relationships between its words (O'Keefe, T. & Koprinska, I., 2009. Feature Selection and Weighting Methods in Sentiment Analysis).
Unigram Feature
Symbols and unigrams are represented as a vector, and every word counts as one feature (O'Keefe, T. & Koprinska, I., 2009. Feature Selection and Weighting Methods in Sentiment Analysis).
A document can be expressed in the vector space model if it has a vector of extracted keywords; from that vector the document is weighted to determine the importance of each keyword in the document (O'Keefe, T. & Koprinska, I., 2009. Feature Selection and Weighting Methods in Sentiment Analysis)
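To make the difference between unigram and higher-order n-gram features concrete, the sketch below extracts n-grams from a token list and maps a document onto a fixed vocabulary. The function names and the count-based vector are illustrative assumptions.

```python
def ngrams(tokens, n=1):
    """Return the n-grams of a token list; n=1 yields unigrams (one per word)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def to_vector(tokens, vocabulary, n=1):
    """Represent a document as counts over a fixed, ordered vocabulary."""
    grams = ngrams(tokens, n)
    return [grams.count(term) for term in vocabulary]
```

With `n=1` each word becomes its own dimension, which matches the unigram feature described above; larger values of `n` capture short word sequences at the cost of a much larger vocabulary.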
Decision Tree Based Methods
The main idea of a decision tree classifier is a tree: the classifier models every problem as a decision tree.
Every branch in the decision tree represents a decision and every leaf a solution. The number of branches created depends on the features.
Benefit
A decision tree can be built directly from the data without any assumptions
Can classify both numerical and categorical data.
Weakness
The classifier output is a class
The classification process can take a long time, depending on the data
The algorithm is unstable
The output has only one attribute
Numerical data can produce a complicated tree
Support Vector Machine (SVM)
The idea of SVM was developed by Boser, Guyon and Vapnik in 1992 at the Annual Workshop on Computational Learning Theory
Natural Language Processing
Natural Language Processing is a part of Artificial Intelligence and computational linguistics that focuses on the interaction between computers and human natural language.
In Natural Language Processing the computer is taught to extract information and present it in human natural language.
Phases
Phonetics and Phonology
Phonetics and phonology relate to how sounds produce known words.
Morphology
Morphology distinguishes words by their form. In this phase each word is separated into the word itself and its elements.
Syntax
Relates to the position of a word in a sentence and its relationship to other words in forming a well-structured sentence.
Semantics
Semantics maps the syntax, word by word, to root words independently of the sentence. It studies the meaning of words and how words relate to the meaning of the sentence.
Pragmatics
Knowledge in the pragmatic phase relates to context, depending on the situation and the purpose for which the system was created.
Discourse Knowledge
Discourse knowledge detects whether sentences that have already been read and recognized can affect the sentences that follow.
Word Knowledge
Word knowledge relates to differentiating the meaning of each word in a sentence or in another context.
Naïve Bayes
Uses probability to determine a document's class
The classifier uses a statistical approach. Although it disregards grammatical rules, in the Naïve Bayes classifier the occurrence of a word does not affect the occurrence of other words, and the absence of a word does not affect the absence of another word; this does not decrease the accuracy of the Naïve Bayes method
(Susanto, Pengklasifikasian Artikel berita Berbahasa Indonesia Secara Otomatis Menggunakan Naïve Bayes Classifier, 2006)
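The independence assumption described above is what allows per-word probabilities to be multiplied together (or their logarithms summed). The sketch below assumes a multinomial Naïve Bayes model with add-one (Laplace) smoothing, a common choice that the source does not specify. The class and method names are illustrative.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Assumed variant: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {t for c in self.word_counts.values() for t in c}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(count / n_docs)  # class prior
            for token in tokens:
                if token in self.vocabulary:  # ignore unseen words
                    # Each word contributes independently of the others.
                    score += math.log(
                        (self.word_counts[label][token] + 1) / (total + vocab_size)
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Summing log-probabilities instead of multiplying raw probabilities avoids numeric underflow on longer documents.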
Document Classification
Classification is a method of assigning an object to a class, group, or category based on its procedure, characteristics, and definition (Tan, Steinbach & Kumar, Introduction to Data Mining, 2006)
Divided
Binary classification
Binary classification has only two outputs, for example positive or negative
Multiclass classification
The output of multiclass classification can have more than two classes, for example in mood analysis: happy, sad, angry, and bad.
Large scale multiclass classification
The output is like that of multiclass classification, but the number of classes can be in the thousands.
Text mining
Sentiment analysis is based on text mining
Gathers information by extracting patterns from source data.
Goals
Objective
Find the customer sentiment toward several mobile operators in Indonesia
Find the specific words related to each operator
Provide the mobile network operators with the latest feedback on their products
Meta
Year
2013
Author
Hansen Januar Fahrezasandy Wijaya
Research Problem
Data extraction for Indonesian language
Finding good rules for information extraction
Sentiment expressed on Twitter may be sarcastic
Tweets that contain more than one piece of information
The accuracy of classification for the Indonesian language
Result
Conclusion
Prediction accuracy for sentiment analysis of Indonesian mobile operators reaches more than 80% with the SVM algorithm.
Recommendation