The Heart and Soul of the Web? Sentiment Strength Detection in the Social Web with SentiStrength
Information
Emotions and sentiments are critical to many human activities, including communication.
It is well documented that people can feel and express emotions through computer mediated communication (CMC) even if it is asynchronous and text-based (Walther and Parks 2002).
People not only engage in social communication because they enjoy it or because it helps to fulfil emotional needs but also use sentiment to help convey meaning and react to sentiments expressed towards them or others.
For example, emoticons arose as a partial solution to the lack of body language and intonation to express emotion in informal types of text-based communication.
Hence, to effectively analyse any area of the social web, emotion should be taken into account for all except the simplest models.
If using real data for such analyses, it is necessary to have an automatic method to extract sentiment from text and this is the sentiment analysis task.
Sentiment analysis
Reads text and uses an algorithm to produce an estimate of its sentiment content. This estimate can be in several different forms:
Binary: positive/negative
Objective/subjective
Trinary: positive/neutral/negative
Scale: -5 (strongly negative) to +5 (strongly positive)
Algorithm
machine learning
May start by converting each text into a list of words, consecutive word pairs and consecutive word triples (i.e., 1- to 3-grams)
Then, based upon a human coded set of texts, learn which of these features tend to associate with sentiment scores, using this information to classify new cases
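The feature extraction step described above can be sketched as follows. This is a minimal illustration that assumes simple whitespace tokenisation and lower-casing. The function name `ngram_features` is hypothetical and is not taken from any particular machine learning toolkit.

```python
def ngram_features(text, max_n=3):
    """Return every 1- to max_n-gram of a lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive words across the token list.
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features


features = ngram_features("I am not happy")
print(features)
```

A classifier trained on human-coded texts would then learn which of these features, such as the bigram "not happy", tend to co-occur with each sentiment score.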
lexical methods
May start with some language information, such as a list of sentiment words and their polarities, and use this together with grammatical structure knowledge, such as the role of negation, to estimate the sentiment of texts.
A machine learning approach may classify “I am not happy” as negative because the bigram “not happy” occurs almost always in texts in the training set coded as negative by humans
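For comparison, a lexical method could reach the same result for "I am not happy" through an explicit rule instead of learned bigrams. The sketch below assumes a deliberately simple rule: a negating word directly before a sentiment word flips its polarity. The word lists, weights and the rule itself are illustrative assumptions, not SentiStrength's actual negation handling.

```python
# Illustrative word lists and weights (assumptions for this sketch only).
NEGATORS = {"not", "never", "no"}
LEXICON = {"happy": 2, "sad": -2}


def lexical_score(text):
    """Sum word polarities, flipping a word's sign when a negator precedes it."""
    tokens = text.lower().split()
    total = 0
    for i, token in enumerate(tokens):
        if token in LEXICON:
            score = LEXICON[token]
            if i > 0 and tokens[i - 1] in NEGATORS:
                score = -score  # simple negation rule: reverse the polarity
            total += score
    return total


print(lexical_score("I am happy"))
print(lexical_score("I am not happy"))
```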
The two approaches seem to have similar levels of accuracy (however measured) depending upon the types of texts classified and the amount of human classified training data available. Nevertheless, lexical sentiment analysis seems to be superior from a pragmatic perspective for many social research applications because it is less likely to pick up indirect indicators of sentiment that will generate spurious sentiment patterns
For instance a machine learning approach might extract unpopular politicians’ names as negative features because they tend to occur in negative texts but this would result in even objective or neutral texts about them being classified as negative, undermining any derived analysis of sentiment in political communication.
SentiStrength
Free sentiment analysis program that uses a lexical approach to classify social web texts
It uses the dual positive and negative scales because psychological research reports that humans can experience positive and negative emotions simultaneously and to some extent independently (Norman et al. 2011)
It also uses the lexical approach for the pragmatic reasons given above and harnesses CMC conventions for expressing sentiment to capture non-standard expressive text.
As the results below show, it works well without any training data on a wide range of social web texts and approaches human-level accuracy in most tested cases.
The exceptions where it performs less well are sets of texts with widespread irony or sarcasm, such as informal political discussions, and narrowly focused topics with frequently used sentiment terms that are either rare in other topics or tend to have a specialist meaning within the narrow topic examined.
It is available in two versions, Java and Windows.
SentiStrength’s commercial users include Yahoo! (Kucuktunc et al. 2012; Weber et al. 2012) and a range of online information management companies around the world. It was also used to power a light display on the EDF Energy London Eye during the London 2012 Olympic Games by continually monitoring the average sentiment of Olympic-related tweets.
The SentiStrength resources, such as the sentiment lexicon and emoticon list, are stored as separate text files and SentiStrength must be pointed to the location when started
SentiStrength Algorithm
The heart of SentiStrength is a lexicon of 2310 sentiment words and word stems obtained from the Linguistic Inquiry and Word Count (LIWC) program (Pennebaker et al. 2003), the General Inquirer list of sentiment terms (Stone et al. 1966) and ad hoc additions made during testing, particularly for new CMC words.
The stemming used is simple and indicated in the lexicon with a wildcard at the end of a word. For instance amaz* matches all words starting with amaz, such as amazed and amazing.
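The wildcard matching described above could be implemented along these lines. This is a sketch under stated assumptions: the `lookup` function, the dictionary layout and the weight given to `amaz*` are hypothetical, and the real lexicon file format may differ.

```python
# Illustrative entries; the weight for "amaz*" is an assumed value.
LEXICON = {"amaz*": 3, "nasty": -3}


def lookup(word, lexicon):
    """Return the weight of a matching lexicon entry, or None if nothing matches."""
    word = word.lower()
    if word in lexicon:  # exact match first
        return lexicon[word]
    for term, weight in lexicon.items():
        # A trailing "*" means: match any word that starts with the stem.
        if term.endswith("*") and word.startswith(term[:-1]):
            return weight
    return None


print(lookup("amazed", LEXICON), lookup("amazing", LEXICON), lookup("maze", LEXICON))
```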
For each text, SentiStrength outputs a positive sentiment score from 1 to 5 and a negative score from -1 to -5. Matching this, each word or stem in the dictionary is given a positive or negative score within one of these two ranges.
These scores were initially human assigned based upon a development corpus of 2600 comments from the social network site MySpace, and subsequently updated through additional testing.
The reason for primarily relying upon human input for the sentiment weights is that many of the terms occur rarely in texts and so a machine learning approach would need a huge number of classified texts to give sufficient coverage to assign lexicon weights well
Weakness
A weakness of SentiStrength is that it does not attempt to use grammatical parsing (e.g., part of speech tagging) to disambiguate between different word senses.
This is because it is designed to process very informal text from the social web and so, unlike typical linguistic parsers, does not rely upon standard grammar for optimal performance
The lexicon is used in a simple way. When SentiStrength reads a text, it splits it into words and separates out emoticons and punctuation. Each word is then checked against the lexicon for matching any of the sentiment terms.
If a match is found then the associated sentiment score is retained. The overall score for a sentence is the highest positive score and the highest negative score among its constituent words. For a text with multiple sentences, the maximum scores of the individual sentences are taken.
For example, the text “Mike is horrible and nasty but I am lovely. I am fantastic.” would be classified as follows, “Mike is horrible[-4] and nasty[-3] but I am lovely[2].
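The maximum-based scoring rule can be sketched in a few lines. The weights for horrible, nasty and lovely are the ones given in the example. The weight for fantastic, the baseline of 1 and -1 for text with no sentiment words, and the regular-expression tokenisation are assumptions made for illustration. Emoticons, idioms and other SentiStrength rules are left out.

```python
import re

# horrible, nasty and lovely use the weights from the example;
# the weight for "fantastic" is an assumed value.
LEXICON = {"horrible": -4, "nasty": -3, "lovely": 2, "fantastic": 3}


def score_sentence(sentence, lexicon):
    """Return (strongest positive, strongest negative) word weight in a sentence."""
    pos, neg = 1, -1  # assumed baseline: no sentiment on either scale
    for word in re.findall(r"[a-z']+", sentence.lower()):
        weight = lexicon.get(word)
        if weight is None:
            continue
        if weight > 0:
            pos = max(pos, weight)
        else:
            neg = min(neg, weight)
    return pos, neg


def score_text(text, lexicon):
    """Take the maximum of each scale over all sentences in the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    scores = [score_sentence(s, lexicon) for s in sentences]
    pos = max((p for p, _ in scores), default=1)
    neg = min((n for _, n in scores), default=-1)
    return pos, neg


text = "Mike is horrible and nasty but I am lovely. I am fantastic."
print(score_text(text, LEXICON))  # (3, -4) with the assumed weights
```

With these weights the first sentence scores (2, -4) and the second (3, -1), so the text as a whole receives the strongest value on each scale.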
In addition to the lexicon, SentiStrength includes a list of emoticons together with human-assigned sentiment scores
SentiStrength also has a list of idioms with sentiment strength weights.
Supervised and Unsupervised Modes
Supervised
SentiStrength has the capability to optimise its lexicon term weights for a specific set of human-coded texts (i.e., a collection of texts with human-assigned sentiment scores for each one).
It does this by repeatedly increasing or decreasing the term weights by 1, one term at a time, and then assessing whether this change increases, decreases or does not affect the overall classification accuracy for the human-coded texts.
Changes that improve accuracy are kept and the process is repeated until no term strength change improves the overall classification accuracy (i.e., it is a hill climbing algorithm).
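The weight optimisation described in this section amounts to hill climbing over the lexicon, which can be sketched as below. This is not SentiStrength's own code: the toy `classify` function, the `hill_climb` and `accuracy` names and the two training texts are all hypothetical, and the sketch does not constrain weights to stay within their valid range.

```python
def classify(text, lexicon):
    """Toy classifier: strongest positive and negative word weight in the text."""
    weights = [lexicon[w] for w in text.lower().split() if w in lexicon]
    pos = max([w for w in weights if w > 0], default=1)
    neg = min([w for w in weights if w < 0], default=-1)
    return pos, neg


def accuracy(lexicon, coded_texts):
    """Fraction of human-coded texts whose predicted scores match exactly."""
    hits = sum(1 for text, label in coded_texts if classify(text, lexicon) == label)
    return hits / len(coded_texts)


def hill_climb(lexicon, coded_texts):
    """Adjust one term weight by +1 or -1 at a time, keeping only improvements."""
    lexicon = dict(lexicon)  # work on a copy
    best = accuracy(lexicon, coded_texts)
    improved = True
    while improved:  # stop when a full pass finds no improving change
        improved = False
        for term in list(lexicon):
            for delta in (1, -1):
                original = lexicon[term]
                lexicon[term] = original + delta
                score = accuracy(lexicon, coded_texts)
                if score > best:
                    best, improved = score, True
                    break  # keep the change and move to the next term
                lexicon[term] = original  # revert a change that did not help
    return lexicon, best


# Hypothetical human-coded texts: (text, (positive score, negative score)).
coded = [("what a lovely day", (3, -1)), ("lovely and nasty", (3, -3))]
tuned, acc = hill_climb({"lovely": 2, "nasty": -3}, coded)
print(tuned, acc)
```

Because a change is kept only when accuracy strictly increases, the loop is guaranteed to stop, although it may settle on a local optimum.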
Unsupervised
Without training
Supervised mode has similar overall accuracy to that of unsupervised mode, but it should logically outperform unsupervised mode if enough training data is used.
Evaluating SentiStrength
Language Variants
SentiStrength can be customised for new languages by translating its sentiment lexicon and other resources, adjusting its optional settings to cope with language specific features, such as negating words occurring after sentiment terms in Germanic languages, and refining through testing on a human-coded development corpus.
Converting SentiStrength to work with a new language requires no coding in most cases because all its language resources are stored externally in plain text files and because it has language customisation options built in, including a UTF-8 mode for non-ASCII characters.
Nevertheless a good conversion will take at least a month to translate the resource files, human-code a development corpus of 1000 texts and refine the lexicon and options based upon an examination of incorrect classifications in the development corpus.
For some languages, additional processing will be needed to get good results, however. For example, the morphology of Turkish word formulation (Turkish is an agglutinative language) means that Turkish text must be parsed to separate out negating suffixes from sentiment terms and then the negation can be re-introduced by inserting an artificial negating word (e.g., NOT) prior to the sentiment word before being submitted to SentiStrength (Vural et al. 2013).
The accuracy of the translated variant should also be assessed on a second human-coded corpus before use. It is likely to be lower than SentiStrength’s accuracy for English because the English version has had a longer development time.
Meta
Year
2017
Author
Mike Thelwall
Result
Goals