The Heart and Soul of the Web? Sentiment Strength Detection in the Social…

- - - - -5 (strongly negative) to +5 (strongly positive)
    - - machine learning
        
        May start by converting each text into a list of words, consecutive word pairs and consecutive word triples (i.e., 1- to 3-grams)
        
        Then, based upon a human coded set of texts, learn which of these features tend to associate with sentiment scores, using this information to classify new cases
      - lexical methods
        
        May start with some language information, such as a list of sentiment words and their polarities, and use this together with grammatical structure knowledge, such as the role of negation, to estimate the sentiment of texts.
      - A machine learning approach may classify “I am not happy” as negative because the bigram “not happy” occurs almost always in texts in the training set coded as negative by humans
      - The two approaches seem to have similar levels of accuracy (however measured) depending upon the types of texts classified and the amount of human classified training data available. Nevertheless, lexical sentiment analysis seems to be superior from a pragmatic perspective for many social research applications because it is less likely to pick up indirect indicators of sentiment that will generate spurious sentiment patterns
      - For instance a machine learning approach might extract unpopular politicians’ names as negative features because they tend to occur in negative texts but this would result in even objective or neutral texts about them being classified as negative, undermining any derived analysis of sentiment in political communication.
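The feature-extraction step described for the machine learning approach can be sketched as follows. This is a minimal illustration only: the function name `extract_ngrams` and the whitespace tokenisation are assumptions, not something the source specifies.

```python
def extract_ngrams(text, max_n=3):
    """Return all 1- to max_n-grams of a text as space-joined strings.

    Tokenisation is a plain lowercase whitespace split (an assumption;
    a real system would handle punctuation and emoticons separately).
    """
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive tokens across the text.
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features


features = extract_ngrams("I am not happy")
print(features)
```

A classifier trained on such features could learn that the bigram "not happy" goes with negative texts, but in the same way it could also learn a politician's name as a negative feature, which is the risk noted above.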
  - - - The heart of SentiStrength is a lexicon of 2310 sentiment words and word stems obtained from the Linguistic Inquiry and Word Count (LIWC) program (Pennebaker et al. 2003), the General Inquirer list of sentiment terms (Stone et al. 1966) and ad hoc additions made during testing, particularly for new CMC words
      - The stemming used is simple and indicated in the lexicon with a wildcard at the end of a word. For instance amaz* matches all words starting with amaz, such as amazed and amazing.
      - For each text, SentiStrength outputs a positive sentiment score from 1 to 5 and a negative score from -1 to -5. Matching this, each word or stem in the dictionary is given a positive or negative score within one of these two ranges.
      - These scores were initially human assigned based upon a development corpus of 2600 comments from the social network site MySpace, and subsequently updated through additional testing.
      - The reason for primarily relying upon human input for the sentiment weights is that many of the terms occur rarely in texts and so a machine learning approach would need a huge number of classified texts to give sufficient coverage to assign lexicon weights well
      - Weakness
        
        A weakness of SentiStrength is that it does not attempt to use grammatical parsing (e.g., part of speech tagging) to disambiguate between different word senses.
        
        This is because it is designed to process very informal text from the social web and so, unlike typical linguistic parsers, does not rely upon standard grammar for optimal performance
      - The lexicon is used in a simple way. When SentiStrength reads a text, it splits it into words and separates out emoticons and punctuation. Each word is then checked against the lexicon for matching any of the sentiment terms.
      - If a match is found then the associated sentiment score is retained. The overall scores for a sentence are the highest positive score and the strongest negative score of its constituent words; for a text with multiple sentences, the maximum scores of the individual sentences are taken.
      - For example, the text “Mike is horrible and nasty but I am lovely. I am fantastic.” would be classified as follows, “Mike is horrible[-4] and nasty[-3] but I am lovely[2].
      - In addition to the lexicon, SentiStrength includes a list of emoticons together with human-assigned sentiment scores
      - SentiStrength also has a list of idioms with sentiment strength weights.
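The lexicon lookup and maximum rule described in this branch can be sketched as below. This is a simplified illustration, not SentiStrength's actual code: the weights for horrible, nasty and lovely come from the example above, while the `amaz*` weight, the neutral defaults of 1 and -1, the regular expressions and all function names are assumptions. Emoticons and idioms are left out.

```python
import re

# Toy lexicon. A trailing * marks a stem that matches any word starting with it.
LEXICON = {
    "horrible": -4,
    "nasty": -3,
    "lovely": 2,
    "amaz*": 3,  # hypothetical weight, for illustrating wildcard matching
}


def term_score(word, lexicon):
    """Return the weight of the first matching lexicon entry, or 0 if none."""
    for term, score in lexicon.items():
        if term.endswith("*"):
            if word.startswith(term[:-1]):
                return score
        elif word == term:
            return score
    return 0


def score_sentence(sentence, lexicon):
    """Highest positive and strongest negative word score in one sentence."""
    pos, neg = 1, -1  # assumed neutral defaults
    for word in re.findall(r"[a-z']+", sentence.lower()):
        score = term_score(word, lexicon)
        if score > 0:
            pos = max(pos, score)
        elif score < 0:
            neg = min(neg, score)
    return pos, neg


def score_text(text, lexicon=LEXICON):
    """Take the maximum sentence-level scores across all sentences."""
    sentence_scores = [
        score_sentence(s, lexicon) for s in re.split(r"[.!?]+", text) if s.strip()
    ]
    if not sentence_scores:
        return 1, -1
    return (max(p for p, _ in sentence_scores),
            min(n for _, n in sentence_scores))


print(score_text("Mike is horrible and nasty but I am lovely."))  # (2, -4)
```

Because only the strongest term on each side counts, "nasty" does not change the result once "horrible" has matched.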
    - - Supervised
        
        SentiStrength has the capability to optimise its lexicon term weights for a specific set of human-coded texts (i.e., a collection of texts with human-assigned sentiment scores for each one).
        
        It does this by repeatedly increasing or decreasing the term weights by 1, one term at a time, and then assessing whether this change increases, decreases or does not affect the overall classification accuracy for the human coded texts.
        
        Changes that improve accuracy are kept and the process is repeated until no term strength change improves the overall classification accuracy (i.e., it is a hill climbing algorithm).
      - Unsupervised
        
        Without training
      - Supervised mode has similar overall accuracy to that of unsupervised mode, but it should logically outperform unsupervised mode if enough training data is used.
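The supervised weight optimisation described above is a hill climbing search, which might look roughly like this sketch. It is an illustration under assumptions rather than SentiStrength's implementation: the simplified scorer, the exact-match accuracy measure, the restriction of weights to magnitudes 1 to 5 and all names are hypothetical.

```python
import re


def score_text(text, lexicon):
    """Simplified scorer: strongest positive and negative word weights."""
    pos, neg = 1, -1  # assumed neutral defaults
    for word in re.findall(r"[a-z']+", text.lower()):
        weight = lexicon.get(word, 0)
        if weight > 0:
            pos = max(pos, weight)
        elif weight < 0:
            neg = min(neg, weight)
    return pos, neg


def accuracy(lexicon, coded_texts):
    """Share of texts whose (positive, negative) scores match the human codes."""
    correct = sum(1 for text, gold in coded_texts
                  if score_text(text, lexicon) == gold)
    return correct / len(coded_texts)


def hill_climb(lexicon, coded_texts):
    """Adjust term weights by 1, one term at a time, keeping improvements."""
    lexicon = dict(lexicon)  # work on a copy
    best = accuracy(lexicon, coded_texts)
    improved = True
    while improved:  # stop when a full pass yields no improvement
        improved = False
        for term in list(lexicon):
            for delta in (1, -1):
                old = lexicon[term]
                new = old + delta
                if not 1 <= abs(new) <= 5:
                    continue  # assumed constraint: keep polarity and range
                lexicon[term] = new
                acc = accuracy(lexicon, coded_texts)
                if acc > best:
                    best = acc
                    improved = True
                else:
                    lexicon[term] = old  # revert changes that do not help
    return lexicon, best


coded = [("good day", (3, -1)), ("bad day", (1, -3)), ("good and bad", (3, -3))]
tuned, acc = hill_climb({"good": 2, "bad": -2}, coded)
print(tuned, acc)  # {'good': 3, 'bad': -3} 1.0
```

Only strict improvements are kept, so accuracy rises with every accepted change and the loop is guaranteed to stop. As with any hill climbing search, it may stop at a local optimum.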
    - - Nevertheless a good conversion will take at least a month to translate the resource files, human-code a development corpus of 1000 texts and refine the lexicon and options based upon an examination of incorrect classifications in the development corpus.
      - Converting SentiStrength to work with a new language requires no coding in most cases because all its language resources are stored externally in plain text files and because it has language customisation options built in, including a UTF-8 mode for non-ASCII characters.
      - SentiStrength can be customised for new languages by translating its sentiment lexicon and other resources, adjusting its optional settings to cope with language specific features, such as negating words occurring after sentiment terms in Germanic languages, and refining through testing on a human-coded development corpus.
      - For some languages, however, additional processing will be needed to get good results. For example, the morphology of Turkish word formation (Turkish is an agglutinative language) means that Turkish text must be parsed to separate negating suffixes from sentiment terms. The negation can then be re-introduced by inserting an artificial negating word (e.g., NOT) before the sentiment word, prior to submitting the text to SentiStrength (Vural et al. 2013).
      - The accuracy of the translated variant should also be assessed on a second human-coded corpus before use. It is likely to be lower than SentiStrength’s accuracy for English because the English version has had a longer development time.