Social media user personality classification using computational linguistic
Information
Psychology research suggests that certain personality traits correlate with linguistic behavior.
A simple application was developed, based on the best-performing of the compared statistical models, to classify an individual's personality with their Twitter username and gender as input.
Literature review
Algorithm
Term Frequency - Inverse Document Frequency (TF-IDF)
TF-IDF is known as the best weighting scheme for information retrieval (Manning et al., 2008) and is also popular in text classification (Gebre et al., 2013).
Term Frequency (TF)
The frequency with which a particular term appears in a document
The more frequently a word appears, the more important it is in that document
Relying on TF alone is not sufficient: a word that appears very frequently across the whole corpus is common but unimportant.
Inverse Document Frequency (IDF)
The importance of a word also depends on its frequency in other documents; Inverse Document Frequency (IDF) adjusts for this
IDF quantifies the intuition that a word appearing in many documents is likely unimportant and should be given less weight than a word that appears frequently in only a few documents
Term weighting schemes are the central part of an information retrieval system (Paik,2013)
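The scheme above can be sketched in a few lines. This is a minimal illustrative Python example (the thesis's own tooling is Java/WEKA); the toy documents are assumptions, not data from the thesis:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # TF (relative frequency in this document) times IDF
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["saya", "suka", "kopi"],
        ["saya", "suka", "teh"],
        ["kopi", "pahit"]]
w = tf_idf(docs)
# "pahit" occurs in only one document, so in that document it
# outweighs "kopi", which occurs in two documents
```

Note how the rare term gets the higher weight, which is exactly the intuition stated above.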
Naïve Bayes
The Naïve Bayes (NB) classifier is a probabilistic classifier based on a statistical principle.
As with TF-IDF, each term is processed and assigned a probability that it belongs to a certain class or category
The presence of a word in a document influences the outcome of the prediction.
The probabilities are calculated from the occurrences of the terms in documents whose classes are already known.
With these probabilities, a new document can be classified by simply summing, for each class, the (log-)probabilities of each term occurring in the document.
External Resources
FAROO algorithm
Since auto-correction is performed in the tweaking process of the lexicon-based and grammatical rule approaches, the algorithm used is FAROO's (Garbe, 2012)
Corpus
A corpus is needed for the grammatical rule approach, especially for POS tagging and context detection
In this thesis, the corpus used is the one from Wijaya et al. (2013), due to its flexibility with social media language.
Myers-Briggs Type Indicator (MBTI)
A personality indicator developed from Carl Jung's model (Jung, 2014)
Sixteen personality types (INTJ, INTP, ENTJ, ENTP, INFJ, INFP, ENFJ, ENFP, ISTJ, ISFJ, ESTJ, ESFJ, ISTP, ISFP, ESTP, and ESFP), each with its own behavior, each needing to be treated differently
Related Works
RESEARCH METHODOLOGY
Overview
The work started with data collection: an MBTI psychological test was translated and then shared
After respondents were collected, a Twitter crawler was developed to fetch the respondents' tweets. Those tweets then undergo preprocessing steps before being used in the statistical models
Once the tweets are clean, the data are used to develop the training sets and statistical models for the three different approaches
Once the core of each statistical model is done, each model is tested and tweaked so that it reaches its best performance
After the statistical model with the highest accuracy was chosen, a simple application was developed using that model
Data Collection
Respondents Collection
We translated an MBTI psychological test (Briggs, 1977) into Bahasa Indonesia; the double translation was verified by a psychologist
The psychological test was then written in Google Forms and shared through social media and Indonesian forums.
In addition to the 50 questions of the psychological test, respondents' full name (optional), Twitter username, age, gender, and occupation were also asked, because those factors are also used to determine an individual's personality (Matsumoto and Juang, 2012; Schwartz et al., 2013)
The respondent's Twitter account must exist.
Twitter Crawler Development
The tweet crawler was developed in Java. It uses Twitter4J, a Java library for the Twitter API
The crawled tweets are saved in separate text files, one per respondent username, to save processing time and resources
Data Preprocessing
Preprocessing is done on the crawled tweets, which are still raw data
Steps
Removing Retweets (RT)
Lowercase Transformation
Removing URLs
Removing User Mention
Removing Repeated Letters
Replacing Emoticon Symbols
Removing Non-Alphabetic Characters and Tokenizing
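The steps above can be sketched as one pipeline. A minimal Python illustration (the thesis implements this in Java); the emoticon-to-word mapping and the sample tweet are assumptions for demonstration:

```python
import re

# hypothetical emoticon mapping; the thesis's actual symbol table is not shown here
EMOTICONS = {":)": "senang", ":(": "sedih"}

def preprocess(tweet):
    """Apply the preprocessing steps in order; returns tokens, or None for retweets."""
    if tweet.startswith("RT "):                    # remove retweets
        return None
    tweet = tweet.lower()                          # lowercase transformation
    tweet = re.sub(r"https?://\S+", " ", tweet)    # remove URLs
    tweet = re.sub(r"@\w+", " ", tweet)            # remove user mentions
    tweet = re.sub(r"(\w)\1{2,}", r"\1", tweet)    # collapse repeated letters (3+)
    for symbol, word in EMOTICONS.items():         # replace emoticon symbols
        tweet = tweet.replace(symbol, " " + word + " ")
    tweet = re.sub(r"[^a-z\s]", " ", tweet)        # remove non-alphabetic characters
    return tweet.split()                           # tokenize

print(preprocess("Kereeen!! cek https://t.co/x @teman :)"))
# → ['keren', 'cek', 'senang']
```

Collapsing repeated letters before replacing emoticons follows the step order listed above; emoticons contain no word characters, so they survive that step intact.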
Training Set Development
Since there are three different approaches to building the statistical models, a different training set must be developed to suit each statistical model
Before the tweets are used for training set development, they undergo the data preprocessing step and are then separated according to the personality traits of the respective respondents
Training Set for Machine Learning Approach
Naïve Bayes classifier
WEKA is used for building the training set and for classification. However, as the data in the tweet collection are in a different format from what WEKA requires, the data need to be reformatted. The reformatting process labels each string (tweet) with a class according to the respondent's personality.
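WEKA expects its input in ARFF format. A hypothetical minimal example for the I/E trait, with invented sample tweets, might look like:

```
@relation personality_IE

@attribute tweet string
@attribute class {I, E}

@data
'hari ini baca buku sendirian', I
'seru banget pesta tadi malam', E
```

Each data row is one preprocessed tweet labeled with the respondent's trait, which is exactly the labeling step described above.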
Training Set for Lexicon Based Approach
TF-IDF
Since this approach is effectively a bag-of-words model that discards word order and grammar, all that remains for building this training set is vectorization.
Training Set for Grammatical Rule Approach
The differences between this approach and the lexicon-based approach are the presence of word order and grammar, and the use of POS tags.
Statistical Model Development
Machine Learning Statistical Model
The model can only classify a tweet into personality traits, not a user or respondent. Therefore, since the objective of this thesis is to classify respondents, an additional aggregation step is needed after each tweet from a respondent has been classified by WEKA.
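One simple way to turn per-tweet predictions into a per-user label is majority voting. The thesis only states that an additional step is needed, not which one, so this Python sketch is an assumption:

```python
from collections import Counter

def classify_user(tweet_labels):
    """Aggregate per-tweet trait predictions into one user-level label
    by majority vote (a sketch; the thesis's exact rule is not specified here)."""
    return Counter(tweet_labels).most_common(1)[0][0]

print(classify_user(["I", "E", "I", "I", "E"]))  # → I
```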
Lexicon Based Statistical Model
Grammatical Rule Statistical Model
Tweaking and Testing
Word Analysis
Simple Application Development
Goals
Predicting humans' personality from their posts has become possible. Most existing research has taken a similar approach to predicting personality from social media.
Problem
However, most of it focuses on closed-vocabulary investigation, uses English as the language, and is mostly based on the Big Five personality model
Objective
Explore Twitter as data source for open vocabulary personality prediction in Indonesia.
We analyze and compare three different statistical models and investigate which personality traits correlate with linguistic behavior.
Meta
Author
Louis Christy Lukito
Adi Wahyu Pribadi
Year
2009
Research problem
Results
The Naïve Bayes classifier outperforms the other statistical models with the highest accuracy (80% for I/E and 60% for the S/N, T/F, and J/P personality traits) and shows the best speed in classifying users.