Detecting Sarcasm in Text: An Obvious Solution to a Trivial Problem…

- - - - Aside from a value of 0.1 for the penalty parameter C, all other configuration options are left at their defaults.
      - Support vector machine (SVM) as implemented by the LinearSVC class from scikit-learn, a popular open-source machine learning library for Python
      - Features are extracted from the raw Twitter data to create training examples that are fed into the SVM to create a hypothesis model.
      - The tweets were collected over a span of several months in 2014. The sanitization process included removing all hashtags, non-ASCII characters, and HTTP links.
      - In addition, each tweet is tokenized, stemmed, and lowercased using the Python NLTK library.
    - - For each tweet, features hypothesized to be crucial to sarcasm detection are extracted. The features fall broadly into five categories: n-grams, sentiments, parts of speech, capitalizations, and topics.
      - N-grams
        
        Individual tokens (i.e. unigrams) and bigrams are placed into a binary feature dictionary.
        
        Bigrams are extracted using the same library (NLTK) and are defined as pairs of words that typically occur together. Examples include "artificial intelligence" and "peanut butter".
      - Sentiments
        
        Each tweet is split into two parts and, separately, into three parts.
        
        Sentiment scores are calculated using two libraries (SentiWordNet and TextBlob)
        
        Positive and negative sentiment scores are collected for the overall tweet as well as for each individual part. Furthermore, the contrast between the parts is inserted into the features.
        
        SentiWordNet
        
        TextBlob
      - Parts of Speech
        
        The parts of speech in each tweet are counted and inserted into the features.
      - Capitalizations
        
        A binary flag indicating whether the tweet contains at least 4 tokens that start with a capital letter is inserted into the features.
      - Topics
        
        The Python library gensim, which implements topic modeling using latent Dirichlet allocation (LDA), is used to learn the topics.
        
        The collection of topics for each tweet is then inserted into the features.
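
The feature categories above can be illustrated with a small, standard-library-only sketch. It is not the project's actual code, and several pieces are assumptions made for illustration: tokens come from a plain whitespace split rather than NLTK, bigrams are simply adjacent token pairs rather than learned word pairs, `lexicon` is a hypothetical word-to-polarity dictionary standing in for SentiWordNet and TextBlob, and the "contrast" between parts is taken to be the spread between the most positive and most negative part. Part-of-speech counts and topic features are omitted.

```python
from typing import Dict, List


def split_parts(tokens: List[str], n: int) -> List[List[str]]:
    """Split tokens into n contiguous parts of roughly equal length."""
    size, rem = divmod(len(tokens), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        parts.append(tokens[start:end])
        start = end
    return parts


def extract_features(tweet: str, lexicon: Dict[str, float]) -> Dict[str, float]:
    """Build a feature dictionary for one tweet (illustrative sketch only)."""
    raw_tokens = tweet.split()            # assumption: whitespace tokenizer
    tokens = [t.lower() for t in raw_tokens]
    features: Dict[str, float] = {}

    # N-grams: binary presence flags for unigrams and adjacent-pair bigrams.
    for tok in tokens:
        features["unigram=" + tok] = 1.0
    for first, second in zip(tokens, tokens[1:]):
        features["bigram=" + first + " " + second] = 1.0

    # Capitalizations: flag tweets with at least 4 capitalized tokens.
    n_caps = sum(1 for t in raw_tokens if t[:1].isupper())
    features["caps>=4"] = 1.0 if n_caps >= 4 else 0.0

    def score(part: List[str]) -> tuple:
        """Return (positive, negative) totals from the toy lexicon."""
        pos = sum(max(lexicon.get(t, 0.0), 0.0) for t in part)
        neg = sum(max(-lexicon.get(t, 0.0), 0.0) for t in part)
        return pos, neg

    # Sentiments: overall scores, per-part scores, and contrast between parts.
    features["sent_pos"], features["sent_neg"] = score(tokens)
    for n in (2, 3):
        polarities = []
        for i, part in enumerate(split_parts(tokens, n)):
            pos, neg = score(part)
            features[f"sent_pos_{n}_{i}"] = pos
            features[f"sent_neg_{n}_{i}"] = neg
            polarities.append(pos - neg)
        # Assumed definition of contrast: most positive minus most negative part.
        features[f"sent_contrast_{n}"] = max(polarities) - min(polarities)
    return features


toy_lexicon = {"love": 1.0, "stuck": -1.0, "traffic": -0.5}  # hypothetical values
example = extract_features("I Love Being Stuck In traffic", toy_lexicon)
```

A dictionary like `example` could then be turned into a sparse vector (for instance with scikit-learn's `DictVectorizer`) before being passed to the SVM.
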
  - - - Initial analysis of the baseline model quickly reveals that the testing error far exceeds the training error. The large gap between training and testing error suggests that the model is suffering from high variance, i.e. it is overfitting the training data.
    - - NAIVE BAYES
      - ONE CLASS SVM
      - GAUSSIAN KERNEL
    - - Also, non-sarcastic sentences can still have both positive and negative sentiments.
      - Also, we should test whether adding sentiments improves our classification significantly. While it is true that some sarcastic sentences contain some words with negative sentiment and others with positive sentiment, many other sentences do not have this property. Thus, adding this feature might not be useful.
      - In both cases, we think it is important to reduce the dimensionality of the feature space and use only relevant features. For instance, the benefit of adding some features such as bigrams, sentiments, and topics is not clear. Bigrams might have the same effect as unigrams.
      - The high testing error of the model at hand implies that we are fitting noise. This problem could be caused by the fact that we have a high-dimensional feature space. Another possibility is that there are features that are not relevant for detecting sarcasm.
      - We also think that finding the sentiments in each training example takes a lot of time.
      - For each training example, we have to look for the sentiment of each word in a dictionary, which takes a lot of time.
      - We also want to investigate the topics that are added. Topic modeling using LDA might be returning similar words as the unigrams of the training example, and we might end up getting redundant information.
      - However, we think that categorizing the training examples into a set of topics can be useful in a different way than it is currently used. Instead of adding topics as a separate feature, we might split our classifier into n classifiers, where n is the number of topics in the training set. In other words, we build a classifier for each topic.
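
The per-topic idea in the last note can be sketched as a thin routing wrapper. This is only one possible reading of the proposal, under stated assumptions: each example is assumed to carry a single topic label (for instance its most probable LDA topic), `PerTopicClassifier` and `make_classifier` are hypothetical names, and `MajorityClassifier` is a toy stand-in used so the sketch runs without extra dependencies. Any object with `fit(X, y)` returning itself and `predict(X)` could be supplied instead.

```python
from collections import Counter, defaultdict


class MajorityClassifier:
    """Toy stand-in: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class PerTopicClassifier:
    """Train one classifier per topic and route each example by its topic."""

    def __init__(self, make_classifier):
        # make_classifier: zero-argument factory returning a fresh classifier.
        self.make_classifier = make_classifier
        self.models = {}
        self.fallback = None

    def fit(self, X, y, topics):
        # Group the training examples by their (assumed single) topic label.
        groups = defaultdict(lambda: ([], []))
        for x, label, topic in zip(X, y, topics):
            groups[topic][0].append(x)
            groups[topic][1].append(label)
        self.models = {
            topic: self.make_classifier().fit(xs, ys)
            for topic, (xs, ys) in groups.items()
        }
        # A classifier trained on all data handles topics unseen in training.
        self.fallback = self.make_classifier().fit(list(X), list(y))
        return self

    def predict(self, X, topics):
        predictions = []
        for x, topic in zip(X, topics):
            model = self.models.get(topic, self.fallback)
            predictions.append(model.predict([x])[0])
        return predictions


X = [[0], [1], [2], [3], [4]]
y = [1, 1, 0, 0, 0]                      # 1 = sarcastic, 0 = not sarcastic
topics = ["sports", "sports", "politics", "politics", "politics"]
router = PerTopicClassifier(MajorityClassifier).fit(X, y, topics)
predicted = router.predict([[9], [9], [9]], ["sports", "politics", "weather"])
```

One design consideration: splitting the data by topic leaves each classifier with fewer training examples, which could worsen the variance problem noted above, and a real SVM cannot be fitted on a topic whose examples all share one label. The fallback model is one simple way to handle such cases.
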