Social media user personality classification using computational linguistic
Information
Psychology research suggests that certain personality traits correlate with linguistic behavior.
A simple application was developed, based on the best-performing of the compared statistical models, to classify an individual's personality with their Twitter username and gender as input.
Literature review
Algorithm
Term Frequency - Inverse Document Frequency (TF-IDF)
TF-IDF is known as the best weighting scheme for information retrieval (Manning et al., 2008) and is also popular in text classification (Gebre et al., 2013).
Term Frequency (TF)
The frequency with which a particular term appears in a document
The more frequently a word appears, the more important it is in that document
Relying on TF alone is not sufficient: a word that appears very frequently across the whole corpus is common but unimportant.
Inverse Document Frequency (IDF)
The importance of a word also depends on its frequency in other documents; Inverse Document Frequency (IDF) adjusts for this
IDF quantifies the intuition that a word appearing in many documents is likely unimportant and should be given less weight than a word that appears frequently in only a few documents
Term weighting schemes are the central part of an information retrieval system (Paik,2013)
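The scheme above can be sketched in a few lines. This is a minimal illustrative Python example (the thesis's own tooling is Java/WEKA); the toy documents are assumptions, not data from the thesis:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # TF (relative frequency in this document) times IDF
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["saya", "suka", "kopi"],
        ["saya", "suka", "teh"],
        ["kopi", "pahit"]]
w = tf_idf(docs)
# "pahit" occurs in only one document, so in that document it
# outweighs "kopi", which occurs in two documents
```

Note how the rare term gets the higher weight, which is exactly the intuition stated above.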
Naïve Bayes
The Naïve Bayes (NB) classifier is a probabilistic classifier based on a statistical principle.
As with TF-IDF, each term is processed and assigned a probability that it belongs to a certain class or category
The presence of a word in a document influences the outcome of the prediction.
The probabilities are calculated from the occurrences of the terms in documents whose classes are already known.
With these probabilities, a new document can be classified by simply summing, for each class, the (log-)probabilities of each term occurring in the document.
External Resources
FAROO algorithm
Since auto-correction is performed in the tweaking process of the lexicon-based and grammatical rule approaches, the algorithm used is FAROO's (Garbe, 2012)
Corpus
A corpus is needed for the grammatical rule approach, especially for POS tagging and context detection
In this thesis, the corpus used is the one from Wijaya et al. (2013), due to its flexibility with social media language.
Myers-Briggs Type Indicator (MBTI)
A personality indicator developed from Carl Jung's model (Jung, 2014)
Sixteen personality types (INTJ, INTP, ENTJ, ENTP, INFJ, INFP, ENFJ, ENFP, ISTJ, ISFJ, ESTJ, ESFJ, ISTP, ISFP, ESTP, and ESFP), each with its own behavior, each needing to be treated differently
Related Works
RESEARCH METHODOLOGY
Overview
The work started with data collection: an MBTI psychological test was translated and then shared
After respondents were collected, a Twitter crawler was developed to fetch the respondents' tweets. Those tweets then undergo preprocessing steps before being used in the statistical models
Once the tweets are clean, the data are used to develop the training sets and statistical models for the three different approaches
Once the core of each statistical model is done, each model is tested and tweaked so that it reaches its best performance
After the statistical model with the highest accuracy was chosen, a simple application was developed using that model
Data Collection
Respondents Collection
We translated an MBTI psychological test (Briggs, 1977) into Bahasa Indonesia; the double translation was verified by a psychologist
The psychological test was then written in Google Forms and shared through social media and Indonesian forums.
In addition to the 50 questions of the psychological test, respondents' full name (optional), Twitter username, age, gender, and occupation were also asked, because those factors are also used to determine an individual's personality (Matsumoto and Juang, 2012; Schwartz et al., 2013)
The respondent's Twitter account must exist.
Twitter Crawler Development
The tweet crawler was developed in Java. It uses Twitter4J, a Java library for the Twitter API
The crawled tweets are saved in separate text files, one per respondent username, to save processing time and resources
Data Preprocessing
Preprocessing is done on the crawled tweets, which are still raw data
Steps
Removing Retweets (RT)
Lowercase Transformation
Removing URLs
Removing User Mention
Removing Repeated Letters
Replacing Emoticon Symbols
Removing Non-Alphabetic Characters and Tokenizing
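The steps above can be sketched as one pipeline. A minimal Python illustration (the thesis implements this in Java); the emoticon-to-word mapping and the sample tweet are assumptions for demonstration:

```python
import re

# hypothetical emoticon mapping; the thesis's actual symbol table is not shown here
EMOTICONS = {":)": "senang", ":(": "sedih"}

def preprocess(tweet):
    """Apply the preprocessing steps in order; returns tokens, or None for retweets."""
    if tweet.startswith("RT "):                    # remove retweets
        return None
    tweet = tweet.lower()                          # lowercase transformation
    tweet = re.sub(r"https?://\S+", " ", tweet)    # remove URLs
    tweet = re.sub(r"@\w+", " ", tweet)            # remove user mentions
    tweet = re.sub(r"(\w)\1{2,}", r"\1", tweet)    # collapse repeated letters (3+)
    for symbol, word in EMOTICONS.items():         # replace emoticon symbols
        tweet = tweet.replace(symbol, " " + word + " ")
    tweet = re.sub(r"[^a-z\s]", " ", tweet)        # remove non-alphabetic characters
    return tweet.split()                           # tokenize

print(preprocess("Kereeen!! cek https://t.co/x @teman :)"))
# → ['keren', 'cek', 'senang']
```

Collapsing repeated letters before replacing emoticons follows the step order listed above; emoticons contain no word characters, so they survive that step intact.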
Training Set Development
Since there are three different approaches to building the statistical models, a different training set must be developed to suit each statistical model
Before the tweets are used for training set development, they undergo the data preprocessing step and are then separated according to the personality traits of the respective respondents
Training Set for Machine Learning Approach
Naïve Bayes classifier
WEKA is used for building the training set and for classification. However, as the data in the tweet collection are in a different format from what WEKA requires, the data need to be reformatted. The reformatting process labels each string (tweet) with a class according to the respondent's personality.
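WEKA expects its input in ARFF format. A hypothetical minimal example for the I/E trait, with invented sample tweets, might look like:

```
@relation personality_IE

@attribute tweet string
@attribute class {I, E}

@data
'hari ini baca buku sendirian', I
'seru banget pesta tadi malam', E
```

Each data row is one preprocessed tweet labeled with the respondent's trait, which is exactly the labeling step described above.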
Training Set for Lexicon Based Approach
TF-IDF
Since this approach is effectively a bag-of-words model that discards word order and grammar, all that remains for building this training set is vectorization.
Training Set for Grammatical Rule Approach
The differences between this approach and the lexicon-based approach are the presence of word order and grammar, and the use of POS tags.
Statistical Model Development
Machine Learning Statistical Model
The model can only classify a tweet into personality traits, not a user or respondent. Therefore, since the objective of this thesis is to classify respondents, an additional aggregation step is needed after each tweet from a respondent has been classified by WEKA.
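One simple way to turn per-tweet predictions into a per-user label is majority voting. The thesis only states that an additional step is needed, not which one, so this Python sketch is an assumption:

```python
from collections import Counter

def classify_user(tweet_labels):
    """Aggregate per-tweet trait predictions into one user-level label
    by majority vote (a sketch; the thesis's exact rule is not specified here)."""
    return Counter(tweet_labels).most_common(1)[0][0]

print(classify_user(["I", "E", "I", "I", "E"]))  # → I
```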
Lexicon Based Statistical Model
Grammatical Rule Statistical Model
Tweaking and Testing
Word Analysis
Simple Application Development
Goals
Predicting humans' personality from their posts has become possible. Most existing research has taken a similar approach to predicting personality from social media.
Problem
However, most of it focuses on closed-vocabulary investigation, uses English as the language, and is mostly based on the Big Five personality model
Objective
Explore Twitter as data source for open vocabulary personality prediction in Indonesia.
We analyze and compare three different statistical models and investigate which personality traits correlate with linguistic behavior.
Meta
Author
Louis Christy Lukito
Adi Wahyu Pribadi
Year
2009
Research problem
Results
The Naïve Bayes classifier outperforms the other statistical models with the highest accuracy (80% for I/E and 60% for the S/N, T/F, and J/P personality traits) and shows the best speed in classifying users.