Identifying a Large Number of Fake Followers on Instagram (Information…

- - - - Magi Metrics supplies enough information for public Instagram accounts.
      - However, information on private accounts has to be scraped separately.
      - We do this with the R package ‘rvest’. Roughly one third of all followers to be classified are private accounts.
    - - The lists of followers per influencer were gathered from Magi Metrics in the second half of August.
      - The private accounts were scraped between mid-August and mid-September.
    - - Since scraping private accounts’ key metrics from Instagram is a tedious process, and some followers deleted their profile or stopped following an influencer in the meantime, it was impossible to gather a complete list of all followers of a specific influencer at a given time.
      - Overall, we managed to gather at least 95% of the followers for 113 influencers; for two influencers, we gathered only 94% of their followers.
  - - - Number.of.followers
      - Number.of.following
      - Number.of.posts
      - has_profile_picture
      - Private.account
      - username_has_number
      - username_has_number_at_end
      - alpha_numeric_ratio
      - following_followers_ratio
      - following_posts_ratio
      - followers_posts_ratio
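The derived features in the list above could be computed roughly as follows. This is a minimal Python sketch, not the original R code: the helper names `safe_ratio` and `derive_features` are hypothetical, and the definition of `alpha_numeric_ratio` (letters divided by username length) is an assumption, since the source does not define it. Division by zero deliberately yields an infinite value, mirroring the infinite ratios the source mentions.

```python
import math


def safe_ratio(numerator, denominator):
    """Divide without raising: x/0 gives inf and 0/0 gives NaN."""
    if denominator == 0:
        return math.nan if numerator == 0 else math.inf
    return numerator / denominator


def derive_features(username, followers, following, posts):
    """Build the username-based and ratio-based features for one account."""
    letters = sum(ch.isalpha() for ch in username)
    return {
        "username_has_number": any(ch.isdigit() for ch in username),
        "username_has_number_at_end": username[-1:].isdigit(),
        # Assumed definition: share of letters among all username characters.
        "alpha_numeric_ratio": safe_ratio(letters, len(username)),
        "following_followers_ratio": safe_ratio(following, followers),
        "following_posts_ratio": safe_ratio(following, posts),
        "followers_posts_ratio": safe_ratio(followers, posts),
    }


# Invented example account with zero posts, so both *_posts_ratio values are infinite.
example = derive_features("jane_doe_1987", followers=120, following=3000, posts=0)
print(example)
```

An account that has never posted produces infinite values for both post-based ratios, which is why such values need separate handling before model training.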
    - - The number of trees is set to 30, since a previous analysis (not detailed here) showed that this setting yields the most precise results for this specific analysis.
      - The processed training data of 975 followers is split into two samples (sample segmentation: dev = 70% / val = 30%):
        - The development sample: used to train the algorithm.
        - The validation sample: used to estimate the performance of the model.
      - Variables with a large mean decrease in accuracy are more important for classification of the data.
      - The more the accuracy of the random forest decreases due to the exclusion or permutation of a single variable, the more important that variable is considered.
      - The mean decrease in Gini coefficient is a measure of how each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest.
      - Variables that result in nodes with higher purity have a higher decrease in the Gini coefficient.
      - The random forest algorithm identifies the following-posts-ratio, number of posts, and following-followers-ratio as the most discriminative features to identify and classify a fake Instagram account.
      - In turn, whether the account is private or not seems to play a minor role.
      - Adjusting the Balance Between False Positives and False Negatives
        
        The random forest assigns each classified follower a score in the range [0, 1].
        
        The default value of the cutoff is 0.5, i.e. followers with a score higher than this value are classified as “fake”.
        
        With this cutoff, we get a very high specificity (= low false-positive rate, see below). This is crucial for our analysis because we want a very low chance of falsely accusing a follower of being “fake”.
        
        Treating the predicted outcome class “TRUE” (classification label “fake”) as the positive class, the test specificity (true negative rate) reveals that 96.3% of the “real” accounts are correctly identified as “real”. This leaves us with a false-positive rate of 3.7%, which we consider sufficiently low.
        
        This means that less than 4% of the accounts which are in fact “real” are falsely classified as “fake”. Apart from that, the sensitivity (true positive rate) states that 77.4% of the accounts that are indeed “fake” are recognized as “fake”, while 22.6% of them are missed.
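The effect of the cutoff on the two error types can be illustrated with a small Python sketch. The scores and labels below are invented toy data, not results from the study, and the function names `classify` and `specificity_sensitivity` are hypothetical. “Fake” is treated as the positive class, so specificity is the share of real accounts kept as real and sensitivity is the share of fake accounts detected.

```python
def classify(scores, cutoff=0.5):
    """Label a follower as fake (True) when its score exceeds the cutoff."""
    return [score > cutoff for score in scores]


def specificity_sensitivity(is_fake, predicted_fake):
    """Return (specificity, sensitivity) with 'fake' as the positive class."""
    pairs = list(zip(is_fake, predicted_fake))
    tp = sum(1 for actual, pred in pairs if actual and pred)
    tn = sum(1 for actual, pred in pairs if not actual and not pred)
    fp = sum(1 for actual, pred in pairs if not actual and pred)
    fn = sum(1 for actual, pred in pairs if actual and not pred)
    specificity = tn / (tn + fp)  # real accounts correctly kept as real
    sensitivity = tp / (tp + fn)  # fake accounts correctly flagged as fake
    return specificity, sensitivity


# Toy data: four real accounts followed by four fake ones.
is_fake = [False, False, False, False, True, True, True, True]
scores = [0.05, 0.20, 0.45, 0.60, 0.55, 0.80, 0.95, 0.40]

default_result = specificity_sensitivity(is_fake, classify(scores, cutoff=0.5))
strict_result = specificity_sensitivity(is_fake, classify(scores, cutoff=0.7))
print(default_result)  # (0.75, 0.75)
print(strict_result)   # (1.0, 0.5)
```

Raising the cutoff in this toy example removes the one false accusation but misses an additional fake account, which is the trade-off the cutoff controls.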
    - - As a consequence of the ratio calculations, some features contain infinite values (for example, when the denominator is zero), which have to be imputed. With the R package Hmisc, infinite values, treated as not available (NA), can be substituted using a regression technique.
      - Values
        - following_followers_ratio
        - following_posts_ratio
        - followers_posts_ratio
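The idea of replacing infinite ratio values with regression-based estimates can be sketched as follows. This is a simplified Python stand-in using a one-predictor least-squares fit, not the Hmisc procedure the source used; the function name `impute_by_regression`, the choice of predictor, and the data are all illustrative assumptions.

```python
import math


def impute_by_regression(predictor, target):
    """Replace non-finite target values (inf or NaN) with predictions from a
    one-predictor least-squares line fitted on the rows with finite targets."""
    complete = [(x, y) for x, y in zip(predictor, target) if math.isfinite(y)]
    if not complete:
        raise ValueError("no finite target values to fit on")
    n = len(complete)
    mean_x = sum(x for x, _ in complete) / n
    mean_y = sum(y for _, y in complete) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in complete)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in complete)
    # With a constant predictor, fall back to the mean of the finite targets.
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return [
        y if math.isfinite(y) else intercept + slope * x
        for x, y in zip(predictor, target)
    ]


# Invented data: the third account has zero posts, so its ratio is infinite.
number_of_following = [100, 200, 300, 400]
following_posts_ratio = [1.0, 2.0, math.inf, 4.0]

imputed = impute_by_regression(number_of_following, following_posts_ratio)
print(imputed)
```

Finite values are left untouched; only the infinite entry is replaced by the value the fitted line predicts for that account.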