Identifying a Large Number of Fake Followers on Instagram
Background
Problem
There have been numerous reports about high numbers of fake followers in the emerging influencer marketing business
However, we know of no systematic study that has actually tried to quantify the phenomenon thoroughly
Objective
This journalistic investigation therefore sets out to quantify the number of fake followers in a representative sample of Swiss Instagram influencers
Information
Training and employing a statistical model that has sufficient precision to reliably distinguish a fake from a real follower.
R-Script & Processed Data
The analysis of the data was conducted in the R project for statistical computing
R version 3.4.4 is used
If the code does not work, it is very likely that an older R version is installed.
If an error occurs, it sometimes helps to execute the script several times.
Particularly in the case of package installation problems it could be helpful to restart the R session and execute the code over again.
Data sources
Scraping of Private Accounts
Magi Metrics supplies enough information for public Instagram accounts.
However, information on private accounts has to be scraped separately.
We do this with the R package ‘rvest’. Roughly one third of all followers to be classified are private accounts.
Time Range
The lists of followers per influencer were gathered from Magi Metrics in the second half of August.
The private accounts were scraped between mid-August and mid-September.
Completeness
Scraping private accounts’ key metrics from Instagram is a tedious process, and in the meantime some followers deleted their profile or stopped following an influencer. It was therefore impossible to gather a complete list of all followers of a specific influencer at a given time.
Overall, we managed to gather at least 95% of the followers for 113 influencers; for two influencers we surveyed only 94% of their followers
Through this analysis, the approximately 7 million unique followers of 115 Swiss Instagram influencers are inspected
A list of a hundred leading Swiss Instagram influencers was provided to us by Le Guide Noir, a company specialised in influencer analytics
Celebrities (= people who are publicly known / famous independently of their social media activity) were removed from the list
Training Set
To train a statistical learning model, followers of the aforementioned Instagram influencers have to be manually labelled as fake or real.
The identification of a fake follower account is based on the concurrence of the features listed below (a complete list can be found in section 4.1)
We judge a follower as fake, for example, if he or she follows an exorbitantly high number of other Instagram accounts, has almost no posts and no profile picture, if his or her account is private, and if the username contains many numbers
Company, advertisement or curation accounts are not considered fake accounts unless they fulfil the above-mentioned criteria. Furthermore, the decision to label an account as fake is made rather conservatively.
Building a Random Forest
List of features
Number.of.followers
Number.of.following
Number.of.posts
has_profile_picture
Private.account
username_has_number
username_has_number_at_end
alpha_numeric_ratio
following_followers_ratio
following_posts_ratio
followers_posts_ratio
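The derived features in this list can be sketched as follows. This is a minimal Python illustration, not the original R script: the helper names `safe_ratio` and `derive_features` are hypothetical, and the definition of alpha_numeric_ratio (letters per digit in the username) is an assumption, since the source does not spell it out. Like R, the sketch returns an infinite value when a non-zero number is divided by zero, which is why some ratios need imputation later on.

```python
import math
import re


def safe_ratio(numerator, denominator):
    """Divide like R does: x/0 gives infinity, 0/0 gives NaN."""
    if denominator == 0:
        return math.nan if numerator == 0 else math.inf
    return numerator / denominator


def derive_features(username, followers, following, posts):
    """Build the username-based and ratio-based features for one account."""
    letters = sum(ch.isalpha() for ch in username)
    digits = sum(ch.isdigit() for ch in username)
    return {
        "username_has_number": digits > 0,
        "username_has_number_at_end": bool(re.search(r"\d$", username)),
        # Assumed definition: number of letters per digit in the username.
        "alpha_numeric_ratio": safe_ratio(letters, digits),
        "following_followers_ratio": safe_ratio(following, followers),
        "following_posts_ratio": safe_ratio(following, posts),
        "followers_posts_ratio": safe_ratio(followers, posts),
    }


# Made-up account: follows many accounts, has no posts, digits at the end.
features = derive_features("anna_1987", followers=12, following=3000, posts=0)
```

For this made-up account, both ratios that divide by the number of posts are infinite because the account has no posts.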
Performance Evaluation
The number of trees is set to 30 because a previous analysis, not detailed here, showed that this setting yields the most precise results for this specific analysis.
The processed training data of 975 followers is split into two samples
The validation sample
Used to estimate the performance of the model (sample split: dev = 70% / val = 30%)
The development sample
Used to train the algorithm
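A 70/30 split of the 975 labelled followers could look roughly like this. The sketch assumes a plain random split with a fixed seed; the source does not say whether the original R script splits at random, stratifies by label, or uses a particular seed, so the function name `split_dev_val` and its parameters are illustrative only.

```python
import random


def split_dev_val(rows, dev_share=0.7, seed=42):
    """Shuffle the rows and cut them into a development and a validation sample."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_share)
    return shuffled[:n_dev], shuffled[n_dev:]


# Stand-in for the 975 manually labelled followers (only their indices here).
labelled_ids = list(range(975))
dev, val = split_dev_val(labelled_ids)
```

With 975 rows this yields 682 followers for development and 293 for validation, and no follower appears in both samples.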
Variables with a large mean decrease in accuracy are more important for classification of the data.
The more the accuracy of the random forest decreases due to the exclusion or permutation of a single variable, the more important that variable is considered.
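The permutation idea behind the mean decrease in accuracy can be illustrated with a small Python sketch. It is a simplification: a random forest computes this measure per tree on out-of-bag samples, whereas the sketch permutes one column of a tiny made-up dataset and uses a hypothetical one-rule classifier (`toy_predict`) as a stand-in for the trained model.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the prediction matches the label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(predict, rows, labels, feature, repeats=20, seed=1):
    """Average drop in accuracy after shuffling the values of one feature."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        values = [row[feature] for row in rows]
        rng.shuffle(values)
        permuted = [dict(row, **{feature: v}) for row, v in zip(rows, values)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / repeats


def toy_predict(row):
    """Hypothetical stand-in for the trained model: one fixed rule."""
    return row["following_posts_ratio"] > 100


# Made-up accounts; True means "fake".
rows = [
    {"following_posts_ratio": 500.0, "private": True},
    {"following_posts_ratio": 2.0, "private": False},
    {"following_posts_ratio": 900.0, "private": False},
    {"following_posts_ratio": 1.5, "private": True},
]
labels = [True, False, True, False]

ratio_importance = permutation_importance(toy_predict, rows, labels, "following_posts_ratio")
private_importance = permutation_importance(toy_predict, rows, labels, "private")
```

Because the toy rule ignores the private flag, shuffling that column never changes a prediction and its importance is zero, while shuffling the ratio column lowers the accuracy.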
The mean decrease in Gini coefficient is a measure of how each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest.
Variables that result in nodes with higher purity have a higher decrease in the Gini coefficient.
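The Gini measure can be made concrete for a single split. The following Python sketch computes the impurity of a node with two classes and the decrease achieved by splitting it; a random forest sums such decreases over all splits that use a given variable and averages them over the trees. The function names are illustrative and the labels are made up.

```python
def gini_impurity(labels):
    """Gini impurity of a node with binary labels (True = fake)."""
    n = len(labels)
    if n == 0:
        return 0.0
    share_fake = sum(labels) / n
    # For two classes, 1 - p^2 - (1 - p)^2 simplifies to 2 * p * (1 - p).
    return 2 * share_fake * (1 - share_fake)


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - children


parent = [True, True, False, False]
# A split that separates the classes perfectly removes all impurity.
pure_split = gini_decrease(parent, [True, True], [False, False])
# A split that leaves both children as mixed as the parent gains nothing.
useless_split = gini_decrease(parent, [True, False], [True, False])
```

The perfectly separating split yields the largest possible decrease for this node, and the uninformative split yields none.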
The random forest algorithm identifies the following-posts-ratio, number of posts, and following-followers-ratio as the most discriminative features to identify and classify a fake Instagram account.
In turn, whether the account is private or not seems to play a minor role.
Adjusting the Balance Between
False Positives and False Negatives
The random forest assigns each classified follower a score in the range [0, 1].
The default value of the cutoff is 0.5, i.e. followers having been attributed a score higher than this value are classified as “fake”.
With this cutoff, we get a very high specificity (= low false-positive rate, see below). This is crucial for our analysis because we want a very low chance of falsely accusing a follower of being “fake”.
Looking at the predicted outcome class “TRUE”, which corresponds to the classification label “fake”, the specificity (true negative rate) shows that 96.3% of the “real” accounts are correctly identified as “real”. This leaves us with a false-positive rate of 3.7%, which we consider to be sufficiently low
This means that less than 4% of the accounts which are in fact “real” are falsely classified as “fake”. The sensitivity (true positive rate), in turn, shows that 77.4% of the accounts that are indeed “fake” are recognized as “fake”, while 22.6% of them are missed.
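How the cutoff and the two error rates relate can be shown with a short Python sketch. Here "fake" is the positive class, so sensitivity is the share of fake accounts that are caught and specificity is the share of real accounts that are left alone. The scores and labels below are made up for illustration and are not results from the investigation; the function name `confusion_rates` is hypothetical.

```python
def confusion_rates(scores, is_fake, cutoff=0.5):
    """Classify scores above the cutoff as fake and compute the error rates."""
    predicted = [score > cutoff for score in scores]
    pairs = list(zip(predicted, is_fake))
    tp = sum(p and a for p, a in pairs)               # fake, flagged as fake
    tn = sum((not p) and (not a) for p, a in pairs)   # real, left as real
    fp = sum(p and (not a) for p, a in pairs)         # real, flagged as fake
    fn = sum((not p) and a for p, a in pairs)         # fake, missed
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Made-up forest scores and manual labels (True = fake).
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1, 0.3, 0.05]
is_fake = [True, True, True, False, False, False, False, False]

default_rates = confusion_rates(scores, is_fake, cutoff=0.5)
strict_rates = confusion_rates(scores, is_fake, cutoff=0.85)
```

In this toy data, raising the cutoff from 0.5 to 0.85 removes the one false accusation but also misses one more fake account, which is the trade-off the cutoff controls. The false-positive rate is always one minus the specificity, just as 3.7% is the complement of 96.3%.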
We choose a Random Forest algorithm, as it has been shown to be a good model for binary classification in many different domains.
Imputation
As a consequence of the ratio calculations, some features contain infinite values, for example when the denominator is zero. These are treated as not available (NA) and have to be imputed. With the R package Hmisc, such values can be substituted using a regression technique.
Values
following_followers_ratio
following_posts_ratio
followers_posts_ratio
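The general idea of regression-based imputation can be sketched in Python. This is a deliberately simplified stand-in and not the procedure used by Hmisc, which the source does not describe in detail: it fits a straight line on the accounts with a finite ratio and uses it to fill in the non-finite ones. The choice of the number of followed accounts as the single predictor is an assumption for illustration, and the values are made up.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the predictor is not constant
    return slope, mean_y - slope * mean_x


def impute_by_regression(target, predictor):
    """Replace infinite or NaN targets with predictions from the complete cases."""
    complete = [(x, y) for x, y in zip(predictor, target) if math.isfinite(y)]
    slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [
        y if math.isfinite(y) else intercept + slope * x
        for x, y in zip(predictor, target)
    ]


# Made-up values: the third account has no posts, so its ratio is infinite.
following = [100, 200, 300, 400]
following_posts_ratio = [10.0, 20.0, math.inf, 40.0]
imputed = impute_by_regression(following_posts_ratio, following)
```

After imputation, every value is finite and the observed ratios are left unchanged.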
Classifying 7 Million Followers
The evaluated random forest is then applied to the full dataset of approximately 7 million followers.
For practical & privacy reasons, this part of our analysis is not publicly reproducible.
Yet, a random sample of 10’000 followers is provided below so that readers can evaluate the classification performance themselves, using 10 randomly selected accounts classified as “fake” and 10 classified as “real”. Upon each recompilation of the script, another 20 random accounts will be selected.
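Drawing such a spot-check sample could look like the following Python sketch. The field name `predicted_fake`, the function name `spot_check_sample` and the synthetic rows are assumptions for illustration; the original script is written in R and works on the published sample. Leaving the seed unset means that every run draws a different set of 20 accounts.

```python
import random


def spot_check_sample(classified, per_class=10, seed=None):
    """Draw the same number of accounts from each predicted class."""
    rng = random.Random(seed)  # seed=None gives a new draw on every run
    fakes = [row for row in classified if row["predicted_fake"]]
    reals = [row for row in classified if not row["predicted_fake"]]
    return rng.sample(fakes, per_class), rng.sample(reals, per_class)


# Synthetic stand-in for the published sample of classified followers.
classified = [
    {"username": f"user_{i}", "predicted_fake": i % 3 == 0}
    for i in range(10000)
]
fake_sample, real_sample = spot_check_sample(classified)
```

Each of the 20 usernames can then be looked up by hand to judge whether the predicted class is plausible.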
Result
Fake followers are indeed a widespread phenomenon, as almost a third of approximately 7 million classified accounts appear to be fake – on average, the surveyed influencers have around 30% fake followers.
Also, influencers with high ratios of fake followers seem to form a distinct cluster which stands apart from influencers with a “normal” base rate of fake followers.
Meta
Year
2017
Author
SRF Data
Timo Grossenbacher
Jennifer Victoria Scurrell