A hybrid approach for Sarcasm Detection of Social Media Data

Author Information

N.Vijayalaksmi

Dr. A.Senthilrajan

Goals

This project focuses mainly on sarcasm detection, a major part of sentiment analysis of Twitter data. Detecting sarcasm is helpful because the views expressed in tweets are miscellaneous and highly unstructured, and may be positive, negative, sarcastic, ironic or, in some cases, neutral.

This research work borrows ideas from different semi-supervised algorithms, namely lexical analysis with an N-grams approach, knowledge extraction, the contrast approach, the emoticon-based approach and the hyperbole approach, to propose a new rule-based hybrid approach for sarcasm detection.

Lexical analysis is the process of converting a sequence of characters into tokens.

A tweet contains up to 140 characters; when these are parsed by a lexical analyzer, numerous tokens are created. These tokens allow the machine to learn and predict, using various features, whether the tweet is sarcastic or not. The tweet characters are given as input to the lexical analyzer, which produces tokens as its output.

The probability of a tweet being sarcastic is computed from the given tokens, and this process is iterated over the various tokens. The created tokens are then matched against tokens that occur in a few dictionary phrases and passed to the algorithm.
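The tokenization and dictionary-matching steps described above can be sketched as follows. This is a minimal illustration, not the authors' lexer: the token pattern, the function names `tokenize` and `match_dictionary`, and the sample phrase set are all assumptions.

```python
import re

# Hypothetical token pattern (an assumption): hashtags, @mentions,
# words with an optional apostrophe part, runs of !/?, and runs of periods.
TOKEN_PATTERN = re.compile(r"#\w+|@\w+|\w+(?:'\w+)?|[!?]+|\.+")


def tokenize(tweet: str) -> list[str]:
    """Split a tweet into a flat list of tokens."""
    return TOKEN_PATTERN.findall(tweet)


def match_dictionary(tokens: list[str], phrases: set[str]) -> list[str]:
    """Return the tokens that also occur in a phrase dictionary (case-insensitive)."""
    return [token for token in tokens if token.lower() in phrases]


tokens = tokenize("I love Mondays!!! #sarcasm")
matched = match_dictionary(tokens, {"love", "great"})  # illustrative dictionary
```

Keeping hashtags and punctuation runs as separate tokens means later rules (hashtag, hyperbole) can inspect them without re-parsing the tweet.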

Rule based approach

7 rules or phases in a rule-based approach

Lexicon based approach (N-grams=Bigrams)

Hyperbole approach

Hierarchical approach

Hybrid approach

Contrast approach

The contrast approach uses the following logic: a tweet that combines a positive sentiment with a negative situation is considered sarcastic and is marked as such.
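A minimal sketch of the contrast rule, assuming small hand-written word lists. The sets `POSITIVE_WORDS` and `NEGATIVE_SITUATIONS` and the function name are hypothetical and stand in for whatever lexicon the rule engine actually uses.

```python
import re

# Illustrative lists only (assumptions, not the authors' lexicon).
POSITIVE_WORDS = {"love", "enjoy", "great", "adore"}
NEGATIVE_SITUATIONS = {"being ignored", "waiting in line", "working on weekends"}


def is_contrast_sarcastic(tweet: str) -> bool:
    """Flag a tweet that pairs a positive word with a negative situation phrase."""
    text = tweet.lower()
    words = set(re.findall(r"[a-z']+", text))
    has_positive = bool(words & POSITIVE_WORDS)
    has_negative_situation = any(phrase in text for phrase in NEGATIVE_SITUATIONS)
    return has_positive and has_negative_situation
```

Both conditions must hold: a positive word alone or a negative situation alone does not trigger the rule.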

Consists of a bag of words and a trained dataset to determine sarcasm in a tweet text.

Emoticon based approach

Since the Twitter platform allows users to express their sentiments with emoji, a rule can be built on this feature.

The emoticon-based approach uses the following logic: a positive-sentiment tweet with negative emoticons, or a negative-sentiment tweet with positive emoticons, can in many instances be assumed to be sarcastic.
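The emoticon rule can be illustrated with a short sketch. The word and emoticon sets below are assumptions chosen for the example, and the polarity score is a simple word count rather than the sentiment model used in the study.

```python
# Illustrative lexicons (assumptions, not taken from the study).
POSITIVE_WORDS = {"love", "great", "happy"}
NEGATIVE_WORDS = {"hate", "awful", "terrible", "sad"}
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}


def text_polarity(words: list[str]) -> int:
    """Return +1, -1 or 0 depending on which word list dominates."""
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    return (score > 0) - (score < 0)


def is_emoticon_sarcastic(tweet: str) -> bool:
    """Flag a tweet whose text polarity contradicts its emoticons."""
    tokens = tweet.split()
    words = [token.lower().strip(".,!?") for token in tokens]
    polarity = text_polarity(words)
    has_positive_emoticon = any(token in POSITIVE_EMOTICONS for token in tokens)
    has_negative_emoticon = any(token in NEGATIVE_EMOTICONS for token in tokens)
    return (polarity > 0 and has_negative_emoticon) or (
        polarity < 0 and has_positive_emoticon
    )
```

A tweet with no clear text polarity is never flagged by this sketch, which keeps the rule conservative.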

It is based on a logic that combines a unigram set from the trained dataset with pragmatic markers to determine the sarcasm in a text. The unigram set contains intensifiers ('wow', 'Awesome', 'nottttt') that express a user's emotions as exaggerations.

Pragmatic markers are ways of emphasizing words through their written style, such as [WRITING IN SQUARE BRACKETS], “WRITING IN CAPITAL LETTERS”, or repeated punctuation and delimiters (“Exclamatory!!!!!!!!!”, “FULLSTOPS………..”).

A solution built on the above logic was able to determine sarcasm in texts more often.

Determines text as sarcastic through a sequential logic which takes the other approaches into consideration.

A combinational logic of all the given approaches, as shown in fig 1, which gives a high precision rate for sarcasm in a text; it gave better results than all the other approaches.

Data

The case study exploits users' self-reporting of sarcasm and follows the same methodology: we identify all tweets mentioning #sarcasm or #sarcastic among hashtagged tweets from 2013–16, regardless of who the author is, and collect the most recent 2000 tweets of those authors.

This yields a sarcastic training set of 2500 tweets. For non-sarcastic data, we select an equal number of tweets from users over the same time period who have not mentioned #sarcasm or #sarcastic in their messages. The total dataset is evenly balanced at 3000 tweets. Since the hashtags #sarcasm, #sarcastic and #not are used to define the sarcastic examples, we remove those tags from all tweets for the prediction task.
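Removing the label hashtags before prediction could look like the sketch below. The regular expression and the function name `strip_label_tags` are assumptions; the three tags are the ones named in the text.

```python
import re

# Matches #sarcasm, #sarcastic and #not as whole hashtags, in any letter case.
LABEL_TAGS = re.compile(r"#(?:sarcasm|sarcastic|not)\b", re.IGNORECASE)


def strip_label_tags(tweet: str) -> str:
    """Remove the label hashtags and collapse the leftover whitespace."""
    without_tags = LABEL_TAGS.sub("", tweet)
    return " ".join(without_tags.split())
```

The word boundary keeps unrelated hashtags such as #nothing intact, so only the tags that define the labels are removed.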

Hash tag based approach

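    from tweetdict import tweets  # the tweet collection; assumed here to be a list of strings

    SARCASM_TAGS = ("#sarcasm", "#sarcastic")

    # Label every tweet by whether it carries one of the sarcasm hashtags.
    output_list = []
    for tweet in tweets:
        if any(tag in tweet.lower() for tag in SARCASM_TAGS):
            output_list.append("Sarcastic")
        else:
            output_list.append("Non-Sarcastic")

    # Join each tweet with its label.
    new_list = list(zip(tweets, output_list))
    # Output: new_list of (tweet, label) pairs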

We have applied a POS tagger and the respective valence values from the Warner et al. (2013) dataset. The features include the absolute count and ratio of each part-of-speech tag in the tweet, along with the “lexical density” of the tweet, which models the ratio of nouns, verbs, adjectives and adverbs to all words. A baseline capacitor (General Dictionary) is developed and implemented for productive analysis, and the classified data are shown in fig 3.
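The part-of-speech features can be sketched as below, assuming the tweet has already been tagged. The input format (a list of `(word, tag)` pairs), the coarse tag names `NOUN`, `VERB`, `ADJ`, `ADV` and the feature names are assumptions made for this illustration; the tagger itself is not shown.

```python
from collections import Counter

# Content-word tags, assuming a coarse tag set (an assumption about the tagger output).
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def lexical_density(tagged_tokens: list[tuple[str, str]]) -> float:
    """Ratio of nouns, verbs, adjectives and adverbs to all words."""
    if not tagged_tokens:
        return 0.0
    content = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    return content / len(tagged_tokens)


def pos_features(tagged_tokens: list[tuple[str, str]]) -> dict[str, float]:
    """Absolute count and ratio of each tag, plus the lexical density."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = len(tagged_tokens)
    features: dict[str, float] = {}
    for tag, count in counts.items():
        features[f"count_{tag}"] = count
        features[f"ratio_{tag}"] = count / total
    features["lexical_density"] = lexical_density(tagged_tokens)
    return features


tagged = [("I", "PRON"), ("love", "VERB"), ("rainy", "ADJ"), ("Mondays", "NOUN")]
features = pos_features(tagged)
```

In this example three of the four words are content words, so the lexical density is 0.75.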

Patterns can be created by concatenating adjacent tokens into n-grams, where n is the maximum number of tokens that can be combined into a single pattern.

The N-gram process checks every bigram and marks it as sarcastic or non-sarcastic.
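Building bigrams from adjacent tokens and checking them against a lexicon can be sketched as follows. The set `SARCASTIC_BIGRAMS` is a hypothetical stand-in for the trained bigram list, and the function names are illustrative.

```python
def ngrams(tokens: list[str], n: int = 2) -> list[tuple[str, ...]]:
    """Concatenate each run of n adjacent tokens into one pattern."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical lexicon of bigrams treated as sarcastic (an assumption).
SARCASTIC_BIGRAMS = {("yeah", "right"), ("oh", "great")}


def is_lexicon_sarcastic(tokens: list[str]) -> bool:
    """Flag a tweet if any of its bigrams appears in the lexicon."""
    lowered = [token.lower() for token in tokens]
    return any(bigram in SARCASTIC_BIGRAMS for bigram in ngrams(lowered, 2))
```

A tweet with fewer than two tokens yields no bigrams and is therefore never flagged by this rule.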

The theoretical work of Kreuz and Roberts (1995) has stressed the importance of hyperbole for sarcasm. We have implemented indicator logic for whether the tweet contains a word from a list of intensifiers (so, too, very, really), stretched words (rightttt), capitalized words ([HELLO]), punctuation and so on. This gives us a more stable classifier that covers most of the previously uncovered area of text classification.
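The hyperbole indicators can be sketched as a small feature function. The intensifier list comes from the text; the thresholds (three or more repeated characters, three or more punctuation marks) and all names are assumptions made for this example.

```python
import re

INTENSIFIERS = {"so", "too", "very", "really"}  # list given in the text
STRETCHED = re.compile(r"(\w)\1{2,}")           # same character three or more times (assumed threshold)
REPEATED_PUNCT = re.compile(r"[!?.]{3,}")       # three or more punctuation marks in a row (assumed threshold)


def hyperbole_features(tweet: str) -> dict[str, bool]:
    """Return one boolean indicator per hyperbole cue."""
    words = re.findall(r"[A-Za-z]+", tweet)
    return {
        "intensifier": any(word.lower() in INTENSIFIERS for word in words),
        "stretched": any(STRETCHED.search(word) is not None for word in words),
        "capitalized": any(len(word) > 1 and word.isupper() for word in words),
        "punctuation": REPEATED_PUNCT.search(tweet) is not None,
    }


def is_hyperbole_sarcastic(tweet: str) -> bool:
    """Flag a tweet if any hyperbole indicator fires."""
    return any(hyperbole_features(tweet).values())
```

Single capital letters such as "I" are ignored by the capitalization check so that ordinary sentences are not flagged.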

Hierarchical Based

It is a sequential logic of all the above approaches (refer to figure 6). A tweet is marked as sarcastic if any one of the rules holds for it, except for the hashtag rule (since the #tag rule tags all tweets as sarcastic). The implementation is simplified using a hierarchical directory format and is shown in the hierarchical plotting table.

Hierarchical Plotting table

If a tweet is marked as sarcastic by the sequential logic of contrast with the lexicon, hyperbole and emoticon approaches, then the tweet is automatically considered sarcastic and tagged under the hierarchical rule, as shown in fig 8.

Hybrid Based

Hybrid Plotting table

A tweet is marked as sarcastic by a combinational logic: if contrast with hyperbole, contrast with lexicon, lexicon with hyperbole, lexicon with emoticon, contrast with emoticon, or emoticon with hyperbole marks it as sarcastic, then the tweet is automatically considered sarcastic and tagged under the hybrid rule.
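One reading of the hybrid rule is that a tweet is tagged when any pair of the four rules (contrast, lexicon, hyperbole, emoticon) both mark it as sarcastic; the six pairs listed above are exactly the six possible pairs. The sketch below follows that reading, which is an interpretation rather than a confirmed detail, and the names `RULES` and `is_hybrid_sarcastic` are illustrative.

```python
from itertools import combinations

RULES = ("contrast", "lexicon", "hyperbole", "emoticon")


def is_hybrid_sarcastic(rule_results: dict[str, bool]) -> bool:
    """Return True when at least one pair of rules both mark the tweet as sarcastic."""
    return any(
        rule_results.get(first, False) and rule_results.get(second, False)
        for first, second in combinations(RULES, 2)
    )
```

The input is the per-rule output for one tweet, for example `{"contrast": True, "emoticon": True}`; rules missing from the dictionary are treated as not having fired.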

Result

The hybrid approach yields better results compared to the other approaches. The rule-based approach was able to detect sarcasm in text at a precision rate of over 80%, which was 3% more accurate than the existing algorithm [10] in a few test runs.

The output of the rule-based approach is a list of tweets marked either sarcastic or non-sarcastic for every rule in the rule engine. Each result was saved under its rule's name as a “.csv” file, and charts were produced automatically with Python and Plotly [3].

From all the results, after working over three different sets of sarcastic tweets, we conclude that the contrast, lexicon, hyperbole and hybrid approaches are the higher-level approaches that give accurate and precise results, hence yielding a better prediction rule compared to previous work.