Machine Learning and Lexicon based Methods for Sentiment Classification: A Survey
Sentiment classification is an important subject in text mining research. It concerns the application of automatic methods for predicting the orientation of the sentiment expressed in text documents.
Sentiment Classification Method
Machine Learning Method
Trains a text classifier on a human labeled training dataset
Feature Selection For Sentiment Analysis
They tested several features to find the optimal feature set: unigrams, bigrams, adjectives and word positions were used as features, and the best performance was achieved when unigrams were used with an SVM classifier.
In later work, Pang and Lee reported an improvement from adding a preprocessing filter that removes objective sentences, which lets the classifier focus only on subjective sentences and raised the accuracy from 82.9% to 86.4% on a movie review dataset.
When the set of training data is small, a naive Bayes classifier might be more appropriate since SVM must be exposed to a large set of data in order to build a high-quality classifier.
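The unigram and bigram features mentioned above can be sketched as follows. This is a minimal illustration, not the setup used in the cited experiments: the whitespace tokenizer, the choice of presence (set) features rather than counts, and the names `tokenize` and `ngram_features` are all assumptions made for the example.

```python
from typing import List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngram_features(tokens: List[str], n: int) -> Set[str]:
    """Return the set of n-grams present in the token list (presence, not counts)."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


# Hypothetical review sentence used only for illustration.
review = "The plot was not good"
tokens = tokenize(review)
unigrams = ngram_features(tokens, 1)  # single words
bigrams = ngram_features(tokens, 2)   # adjacent word pairs, e.g. "not good"
```

Unigrams treat "not" and "good" as separate features, while the bigram "not good" keeps the negation attached to the word it modifies.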
Supervised & Unsupervised
Requires a large corpus of training data, and its performance depends on a good match between the training and testing data with respect to domain, topic and time period.
You have input variables (x) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output.
It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process.
Supervised learning problems can be further grouped into regression and classification problems.
A classification problem is when the output variable is a category, such as “red” or “blue” or “disease” and “no disease”.
A regression problem is when the output variable is a real value, such as “dollars” or “weight”.
k-means for clustering problems
A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
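As a rough sketch of k-means on the customer example, the code below clusters one-dimensional values. The spending figures, the starting centroids and the function name `kmeans_1d` are hypothetical, and the fixed iteration count stands in for a proper convergence check.

```python
from typing import List


def kmeans_1d(points: List[float], centroids: List[float], iterations: int = 10) -> List[float]:
    """Run a fixed number of k-means iterations on 1-D data and return the centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters: List[List[float]] = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


# Hypothetical monthly spending of six customers: three low, three high.
spending = [10.0, 12.0, 11.0, 100.0, 110.0, 105.0]
centers = kmeans_1d(spending, centroids=[0.0, 150.0])
```

No labels are supplied; the two groups emerge only from the distances between the values.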
An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.
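A rule such as "people that buy X also tend to buy Y" is commonly scored by its support and confidence. The sketch below assumes those two standard measures; the transactions and the function names are made up for the example.

```python
from typing import List, Set


def support(transactions: List[Set[str]], itemset: Set[str]) -> float:
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions: List[Set[str]], antecedent: Set[str], consequent: Set[str]) -> float:
    """Estimated probability of the consequent given the antecedent.

    Assumes the antecedent occurs in at least one transaction.
    """
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
```

Here the rule "bread implies butter" holds in two of the three baskets that contain bread.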
Feature Selection For Sentiment Analysis
Feature selection is a critical task in sentiment analysis: effectively selecting representative features from subjective text can improve sentiment classification.
It was shown that using unigrams as features performed well with either naive Bayes or SVM classifiers.
Sentiment classification based on machine learning can be formulated as a supervised learning problem
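To make the supervised formulation concrete, here is a minimal naive Bayes sketch trained on human-labeled text. It assumes a multinomial model with add-one (Laplace) smoothing and whitespace tokenization; the four training reviews and all function names are hypothetical, and a real system would need far more data.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Model = Tuple[Counter, Dict[str, Counter], Set[str]]


def train_naive_bayes(docs: List[Tuple[str, str]]) -> Model:
    """Collect class counts and per-class word counts from (text, label) pairs."""
    class_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab: Set[str] = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(text: str, model: Model) -> str:
    """Return the label with the highest log posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = "", float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical human-labeled training set.
training_data = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring plot awful acting", "neg"),
]
model = train_naive_bayes(training_data)
```

Calling `predict("great acting", model)` then maps an unseen text (the input x) to a sentiment label (the output Y).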
Opinion lexicon methods adopt a lexicon to perform sentiment analysis, by counting and weighting sentiment-related words that have been evaluated and tagged.
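The counting and weighting step can be sketched as below. The tiny lexicon, its weights and the function names are hypothetical; real opinion lexicons are much larger, and this sketch ignores negation and intensifiers.

```python
from typing import Dict

# Hypothetical toy lexicon: word -> polarity weight (sign = polarity, size = strength).
LEXICON: Dict[str, float] = {
    "good": 1.0,
    "excellent": 2.0,
    "bad": -1.0,
    "terrible": -2.0,
}


def lexicon_score(text: str, lexicon: Dict[str, float]) -> float:
    """Sum the weights of all lexicon words found in the text."""
    words = (w.strip(".,!?") for w in text.lower().split())
    return sum(lexicon.get(w, 0.0) for w in words)


def classify(text: str, lexicon: Dict[str, float]) -> str:
    """Map the summed score to a polarity label."""
    score = lexicon_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


example = "The food was excellent, but the service was bad."
example_score = lexicon_score(example, LEXICON)
example_label = classify(example, LEXICON)
```

No training data is involved: the result depends only on which tagged words appear and on their weights.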
Collecting the opinion word list
The manual approach is time consuming, so it is not usually used alone; it is combined with automated approaches as the final check, because automated methods make mistakes.
Use a seed set of sentiment words with known polarity and exploit syntactic patterns or co-occurrence patterns to identify new sentiment words and their polarity in a large corpus
This method scanned through a review looking for phrases that match certain part-of-speech patterns (involving adjectives and adverbs), and then added up the sentiment orientations of those phrases to compute the orientation of the document.
For example, Turney determined whether words are positive or negative, and how strong the evaluation is, by computing the pointwise mutual information (PMI) between each word and reference words of known sentiment orientation, based on how often they co-occur.
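A PMI-based orientation score can be sketched as the difference between a word's PMI with a positive reference word and its PMI with a negative one. That formulation, the counts below and the function names are assumptions for illustration; the sketch also assumes every count is nonzero, so a real implementation would need smoothing.

```python
import math


def pmi(joint_count: int, count_a: int, count_b: int, total: int) -> float:
    """Pointwise mutual information log2(p(a, b) / (p(a) * p(b))) from raw counts."""
    return math.log2(joint_count * total / (count_a * count_b))


def sentiment_orientation(
    near_positive: int,
    near_negative: int,
    count_word: int,
    count_positive: int,
    count_negative: int,
    total: int,
) -> float:
    """PMI with the positive reference word minus PMI with the negative one."""
    return (
        pmi(near_positive, count_word, count_positive, total)
        - pmi(near_negative, count_word, count_negative, total)
    )


# Hypothetical corpus statistics: the word occurs 50 times in 1000 text windows,
# co-occurring 20 times with a positive reference word and 5 times with a negative one.
orientation = sentiment_orientation(
    near_positive=20,
    near_negative=5,
    count_word=50,
    count_positive=100,
    count_negative=100,
    total=1000,
)
```

A positive result means the word keeps company with the positive reference more than chance would predict, and the magnitude indicates how strong the evaluation is.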
Exploit available lexicographical resources like WordNet or HowNet.
The main strategy in these methods is to manually collect an initial seed set of sentiment words and their orientations, and then to search a dictionary for their synonyms and antonyms to expand this set
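The expansion strategy can be sketched with two small hand-written dictionaries standing in for a resource such as WordNet. The word lists and the name `expand_lexicon` are hypothetical; synonyms inherit the polarity of the word they came from and antonyms receive the opposite polarity.

```python
from typing import Dict, List

# Hypothetical stand-ins for a lexical resource.
SYNONYMS: Dict[str, List[str]] = {
    "good": ["fine", "great"],
    "great": ["superb"],
    "bad": ["poor"],
    "poor": ["awful"],
}
ANTONYMS: Dict[str, List[str]] = {
    "good": ["bad"],
    "great": ["awful"],
}


def expand_lexicon(
    seeds: Dict[str, int],
    synonyms: Dict[str, List[str]],
    antonyms: Dict[str, List[str]],
) -> Dict[str, int]:
    """Grow a polarity lexicon (+1 / -1) from seed words until no new words are found."""
    lexicon = dict(seeds)
    frontier = list(seeds)
    while frontier:
        word = frontier.pop()
        polarity = lexicon[word]
        candidates = [(s, polarity) for s in synonyms.get(word, [])]
        candidates += [(a, -polarity) for a in antonyms.get(word, [])]
        for candidate, candidate_polarity in candidates:
            if candidate not in lexicon:  # the first assignment wins
                lexicon[candidate] = candidate_polarity
                frontier.append(candidate)
    return lexicon


expanded = expand_lexicon({"good": 1}, SYNONYMS, ANTONYMS)
```

Starting from the single seed "good", the loop reaches seven words in this toy dictionary. Errors introduced by loose synonym links are one reason such lists are usually checked by hand afterwards.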
The manual approach is very time consuming and thus it is not usually used alone, but combined with automated approaches
Using the corpus-based approach alone to identify all opinion words is not as effective as the dictionary-based approach, because it is hard to prepare a corpus large enough to cover all words.
However, this approach has a major advantage: it can help to find domain-specific opinion words and their orientations if a corpus from only the specific domain is used in the discovery process.
Sentiment classification is a hotspot of natural language processing (NLP), data mining (DM) and information retrieval (IR), with a number of applications including recommender and advertising systems, product feedback analysis and customer decision making.
Provide a survey and comparative study of existing techniques for opinion mining including machine learning and lexicon-based approaches together with evaluation metrics
Experimental results show that supervised machine learning methods, such as SVM and naive Bayes, have higher precision
Lexicon-based methods are also very competitive because they require little effort in human labeling of documents and are not sensitive to the quantity and quality of the training dataset.
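For reference, precision can be computed from gold labels and predictions as sketched below, together with recall, which is usually reported alongside it. The labels are invented for the example and do not reproduce any result from the survey.

```python
from typing import List, Tuple


def precision_recall(gold: List[str], predicted: List[str], positive: str = "pos") -> Tuple[float, float]:
    """Return (precision, recall) for the class named by `positive`."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical gold labels and classifier output for five documents.
gold_labels = ["pos", "pos", "neg", "neg", "pos"]
predictions = ["pos", "neg", "neg", "pos", "pos"]
precision, recall = precision_recall(gold_labels, predictions)
```

Precision is the share of documents predicted positive that really are positive; recall is the share of truly positive documents that the classifier found.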