SENTIMENT ANALYSIS USING AUTOMATIC CLASSIFICATION ON ONLINE MEDIA ARTICLE

Information

Sentiment Analysis

Goal

Objective

Design and implement a system that can analyze sentiment of text in Bahasa Indonesia

Algorithm

Sentiment analysis is the process of classifying articles as positive or negative

Turney (2010) presents a simple unsupervised learning algorithm for classifying reviews as recommended or not recommended. The classification is predicted from the average semantic orientation of the phrases in the review. A review is classified as recommended if its phrases have good associations (e.g., “subtle nuances”) and as not recommended if they have bad associations (e.g., “very cavalier”).
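The averaging step described above can be sketched as follows. This is a minimal illustration, not Turney's implementation: the function name `classify_review`, the zero threshold, and the orientation scores are all assumptions made for the example.

```python
from statistics import mean

def classify_review(phrase_orientations):
    """Label a review from the semantic orientation of its phrases.

    phrase_orientations maps each extracted phrase to a score:
    positive for good associations, negative for bad ones.
    """
    avg = mean(phrase_orientations.values())
    # Assumed decision rule: a positive average means "recommended".
    label = "recommended" if avg > 0 else "not recommended"
    return label, avg

# Hypothetical orientation scores, invented for illustration only.
review = {"subtle nuances": 1.5, "very cavalier": -2.0, "truly memorable": 2.0}
label, avg = classify_review(review)
print(label, round(avg, 2))
```

How the per-phrase scores are obtained is outside this sketch; only the averaging and thresholding are shown.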

Pang (2008) applies machine learning techniques to sentiment classification, classifying documents not by topic but by overall sentiment

NLP

natural language processing

Computerized approach to analyzing text (Liddy, 2001)

In general, it can be defined as a set of computational techniques for analyzing and representing text at many levels of linguistic analysis to achieve human-like language processing

NLP systems would be able to (Liddy, 2001):

Paraphrase an input text

Translate the text into another language

Answer questions about the contents of the text

NLP originally drew on various disciplines (e.g., Linguistics, Computer Science, and Cognitive Psychology)

Meta

Author

Year

Feizal Badri Asmoro

2013

Research Question

How do we analyze the sentiment of an online media streaming article in Bahasa Indonesia?

Is the data suitable for the system to be analyzed?

What is the most appropriate method for this problem?

Hypothesis

The sentiment of an online media streaming article can be obtained by using pairwise matching, vector matching, and a similarity matrix.

Online media streaming articles may contain scripts, images, and a lot of noisy text. Therefore, pre-processing is necessary so that the data is compatible with the system and can be analyzed by it.

k-NN is an appropriate solution for classifying the sentiment of an online media streaming article.
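A minimal sketch of k-NN sentiment classification is given below. It assumes a bag-of-words representation with cosine similarity as a stand-in for whatever similarity measure the system actually uses; the names `cosine`, `knn_sentiment`, and `TRAIN`, the value of `k`, and the tiny training set are all invented for illustration.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_sentiment(text, labelled, k=3):
    """Return the majority label among the k labelled texts most similar to text."""
    query = Counter(text.lower().split())
    ranked = sorted(
        labelled,
        key=lambda item: cosine(query, Counter(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy labelled headlines, invented for illustration only.
TRAIN = [
    ("harga saham naik tajam", "positive"),
    ("ekonomi tumbuh pesat", "positive"),
    ("pasar saham naik lagi", "positive"),
    ("banjir merusak ratusan rumah", "negative"),
    ("korban banjir terus bertambah", "negative"),
]

print(knn_sentiment("harga saham naik", TRAIN))
print(knn_sentiment("banjir merusak rumah", TRAIN))
```

An odd `k` is a common choice for two-class problems because it avoids tied votes.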

WordNet

Lexical

Categories

Symbolic

Statistical

Connectionist

Hybrid

Perform deep analysis of linguistic phenomena and are based on explicit representation of facts about language through well-understood knowledge representation schemes and associated algorithms (Basili, 1996)

Uses many mathematical techniques and is often applied to large text collections. The primary source of evidence in this approach is observable data.

Hidden Markov Model (HMM)

Statistical approaches use large text corpora to develop generalized models without adding significant linguistic or world knowledge.

Uses statistical learning and theories of representation. The statistical learning used in this approach is the same as in statistical approaches.

This approach uses a connectionist model, a network of interconnected simple processing units with knowledge stored in the weights of the connections between units (Rumelhart, 1998).

Frequent applications of NLP

Information Retrieval (IR)

Information Extraction (IE)

Question-Answering (QA)

Summarization

Reduces a larger text into a shorter one that still contains the most important information in the text

Dialogue Systems

A lexical database (in any language) consisting of sets of synonyms (synsets), definitions, and semantic relations between the synsets

Purpose

Supports automatic text analysis and artificial intelligence (AI) applications

Combines a thesaurus and a dictionary to produce more intuitively usable information

Includes the following semantic relations

Synonymy

Antonymy

Hyponymy

Troponymy

Entailment

Relations between verbs are also coded in WordNet

Troponymy is for verbs what hyponymy is for nouns, although the resulting hierarchies are much shallower.

K-nearest Neighbors Algorithm for Data Classification

Sentiment analysis is divided into two types of tasks

Basic task

Advanced task

Classifying the opinion expressed at the document, sentence, or feature/aspect level (Haaff, 2010)

Advanced tasks look for specific emotional states such as “angry”, “sad”, and “happy”.

Method

Pang and Lee (2005) expanded the basic task of classifying a movie review. The proposed method determines whether a review of some topic is positive or negative. The result is used to predict star ratings (on a five-star scale).

Feature/aspect-based sentiment analysis (the most common model in sentiment analysis), such as the research by Hu and Liu (2004), determines the sentiment expressed toward attributes or components of entities (e.g., a digital camera)

NLP Levels

Phonology

This level deals with the interpretation of speech sounds within and across words

Morphology

Deals with words, which are composed of morphemes. The purpose is to derive and represent the meaning of a word from its morphemes. For example, in the word punched with the suffix -ed, the system will know that the action of the verb took place in the past.
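The punched example can be illustrated with a deliberately naive suffix rule. This is only a sketch under strong assumptions: the function name `analyze_past_tense` is hypothetical, the rule covers regular English -ed forms only, and it misfires on words such as need or loved, which a real morphological analyzer would handle with a lexicon and spelling rules.

```python
def analyze_past_tense(word):
    """Split a regular English verb ending in -ed into stem and suffix.

    Naive rule for illustration: any word longer than three letters
    that ends in "ed" is treated as stem + past-tense suffix.
    """
    if word.endswith("ed") and len(word) > 3:
        return {"stem": word[:-2], "suffix": "ed", "tense": "past"}
    return {"stem": word, "suffix": None, "tense": None}

print(analyze_past_tense("punched"))
print(analyze_past_tense("run"))
```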

Lexical

The NLP system interprets the meaning of individual words. There are several types of processing in word-level understanding. First, each word is assigned its part-of-speech tag. If a word has several possible parts of speech, it is assigned the most probable part-of-speech tag according to the context.

The lexicon may hold simple information about words and their parts of speech, or complex information (e.g., the semantic class, the arguments, semantic limitations, and definitions of the sense).

Syntactic

Focuses on analyzing the grammatical structure of sentences. The output of this level is information about the structural dependency relationships between words.

Semantic

Focuses on determining the possible meanings of a sentence

Achieved by analyzing the interactions of word-level meaning in the sentence.

Discourse

Discourse focuses on the properties of the text as a whole that convey meaning, rather than interpreting multi-sentence texts one sentence at a time

Types of discourse processing

anaphora resolution

discourse recognition

replacing words with the appropriate entity

adds meaningful representation of the text by determining the functions of the sentences in the text

What is

Lexical

Methodology

Text Crawler Framework

Selenium Framework

Result

Wu & Palmer

The measure calculates relatedness by considering the depths of the two synsets in the WordNet taxonomies, along with the depth of the Least Common Subsumer (LCS) (Wu & Palmer, 1994)
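The Wu & Palmer score is commonly written as 2 * depth(LCS) / (depth(s1) + depth(s2)). The sketch below computes it on a tiny hand-made taxonomy; the `PARENT` table and the helper names are invented for illustration and are not WordNet data, and depth is counted in nodes so that the root has depth 1.

```python
# Toy is-a taxonomy (child -> parent), invented for illustration; not WordNet data.
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "entity",
    "car": "vehicle",
}

def path_to_root(node):
    """Return the list of nodes from node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wu_palmer(a, b):
    """Compute 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(round(wu_palmer("dog", "cat"), 3))
print(round(wu_palmer("dog", "car"), 3))
```

Concepts that share a deep common ancestor (dog and cat under animal) score higher than concepts that only meet at the root (dog and car).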

Similarity Matrix

A similarity matrix is a matrix that contains the similarity scores between two sentences or articles. It can be used to calculate the overall similarity between two articles.
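One way to build such a matrix and reduce it to a single score is sketched below. The aggregation (averaging each row's best match) is an assumption, not necessarily the method used in this work, and `exact_match` is a placeholder word-level measure; a WordNet-based measure could be passed in instead.

```python
def similarity_matrix(words_a, words_b, sim):
    """Build a matrix whose cell [i][j] is sim(words_a[i], words_b[j])."""
    return [[sim(a, b) for b in words_b] for a in words_a]

def overall_similarity(matrix):
    """Average of each row's best match (one possible aggregation)."""
    if not matrix or not matrix[0]:
        return 0.0
    return sum(max(row) for row in matrix) / len(matrix)

def exact_match(a, b):
    """Placeholder word similarity: 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0

# Toy word lists, invented for illustration only.
m = similarity_matrix(["harga", "naik"], ["harga", "turun"], exact_match)
print(m)
print(overall_similarity(m))
```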

Evaluation

Subjectivity Measurement

Measuring the performance and accuracy of the sentiment analysis system

Each news article has been classified as negative or positive news by the sentiment analysis system. In this live experiment, participants judge the accuracy of the classification done by the system, that is, whether the positive or negative sentiment generated by the system is correct.
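The participant judgements described above can be summarized as a simple accuracy figure. The sketch below assumes each judgement is recorded as True (system label judged correct) or False; the function name and the sample responses are hypothetical.

```python
def classification_accuracy(judgements):
    """Fraction of system labels that participants marked as correct."""
    if not judgements:
        raise ValueError("no judgements to evaluate")
    return sum(1 for ok in judgements if ok) / len(judgements)

# Hypothetical participant responses: True means the system label was judged correct.
responses = [True, True, False, True]
print(classification_accuracy(responses))
```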