Jupyter Notebook implementation to detect Instagram's fake accounts.
Machine Learning Project - Classification Algorithms
- Table of Contents
- About The Project
- Getting Started
- Datasets
- Classification Report
- License
- Contributors
Insta Fake Account Detection is a Machine Learning project developed for the Big Data and Business Intelligence course at the Università di Parma. Its objective is the automated recognition of fake Instagram accounts using several classification algorithms.
The project has two datasets. The first, with 11 features, is used to recognize private accounts, which expose only a limited amount of information because of their privacy settings. The second, with 14 features, is used for public accounts, which expose more information to work with, such as the dates of published posts, which give the algorithms information about the account's index of activity.
Every account's features were scraped using an Instagram Web Scraper.
The two datasets then went through a preprocessing phase, consisting of standardization and normalization. In this phase, a feature importance analysis with a forest of trees and a feature selection analysis were also carried out. Four feature selection algorithms were used:
- L1 Based
- Tree Based
- Removing Features with Low Variance
- Univariate Select Best (K = 4)
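The preprocessing and feature selection steps listed above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the notebooks' actual code: the data is synthetic, and the estimator choices (`ExtraTreesClassifier`, `LinearSVC`), the score function `f_classif`, and all hyperparameters except K = 4 are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import (
    SelectFromModel,
    SelectKBest,
    VarianceThreshold,
    f_classif,
)
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for a scraped dataset: 200 accounts, 11 features.
X, y = make_classification(
    n_samples=200, n_features=11, n_informative=5, random_state=0
)

# Standardization (zero mean, unit variance) and normalization to [0, 1].
X_std = StandardScaler().fit_transform(X)
X_norm = MinMaxScaler().fit_transform(X)

# Feature importance from a forest of trees (estimator choice is an assumption).
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_std, y)
importances = forest.feature_importances_

# 1. L1-based: keep features whose L1-penalized coefficients are non-zero.
l1_svc = LinearSVC(C=0.1, penalty="l1", dual=False, random_state=0)
X_l1 = SelectFromModel(l1_svc).fit_transform(X_std, y)

# 2. Tree-based: keep features whose importance is above the mean importance.
trees = ExtraTreesClassifier(n_estimators=100, random_state=0)
X_tree = SelectFromModel(trees).fit_transform(X_std, y)

# 3. Low variance: a constant column is appended here only to show it being
#    dropped; the default threshold removes zero-variance features.
X_with_const = np.hstack([X_norm, np.ones((X_norm.shape[0], 1))])
X_var = VarianceThreshold().fit_transform(X_with_const)

# 4. Univariate selection of the K = 4 best features.
X_kbest = SelectKBest(score_func=f_classif, k=4).fit_transform(X_std, y)

for name, selected in [("L1", X_l1), ("Tree", X_tree),
                       ("Variance", X_var), ("KBest", X_kbest)]:
    print(f"{name}: {selected.shape[1]} features kept")
```

Comparing how many and which features each selector keeps is one way to decide which columns to pass on to the classifiers.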
The preprocessed features are then passed to the following Machine Learning classification algorithms:
- AdaBoost
- Decision tree
- K-Nearest Neighbours (KNN)
- Logistic Regression
- Multi-Layer Perceptron
- Random Forest
- Stochastic Gradient Descent (SGD)
- Support Vector Machine (SVM)
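As a rough sketch, the eight classifiers above could be trained and compared in one loop with scikit-learn. The data below is synthetic and the hyperparameters are library defaults chosen for illustration, so the printed scores say nothing about the project's real results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed dataset (assumption).
X, y = make_classification(
    n_samples=400, n_features=11, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

classifiers = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multi-Layer Perceptron": MLPClassifier(max_iter=1000, random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SGD": SGDClassifier(random_state=0),
    "SVM": SVC(random_state=0),
}

# Train each model and record its accuracy on the held-out test split.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2f}")
```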
For every algorithm, in addition to the training and testing phases, the following were computed:
- Cross Validation
- Confusion Matrix
- Receiver Operating Characteristic / ROC Curve
- Classification Report
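The four evaluation steps above might look like this for a single classifier. It is a sketch under assumptions: synthetic data, AdaBoost as the example model, 5-fold cross validation, and label 1 meaning "fake" (the label encoding is not stated in this README).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import (
    auc,
    classification_report,
    confusion_matrix,
    roc_curve,
)
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the preprocessed dataset (assumption).
X, y = make_classification(
    n_samples=400, n_features=11, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = AdaBoostClassifier(random_state=0)

# Cross validation: accuracy on 5 folds of the training split.
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

# ROC curve from the predicted probability of the positive class.
y_score = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

# Classification report: precision, recall and F-score per class.
report = classification_report(y_test, y_pred, target_names=["real", "fake"])

print("CV accuracy per fold:", cv_scores.round(2))
print("Confusion matrix:\n", cm)
print(f"ROC AUC: {roc_auc:.2f}")
print(report)
```

The `fpr` and `tpr` arrays can be passed to any plotting library to draw the ROC curve.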
Finally, we created a Telegram bot that embeds the best algorithm from the two datasets. Once the bot script is running, you only have to send it a username: it verifies the authenticity of the account and replies with the result.
Check Insta Fake Detector Bot to see the related Telegram bot project. Try out the AdaBoost algorithm on fake account detection by asking the bot to check an account.
You can just clone this repository and install the requirements by running:
$ pip install igramscraper
$ pip install scikit-learn
Then open the notebook files in their respective folders to see the results.
Pull this repository for updates.
The datasets can be found in the resources folder.
Profile Pic | Nums / Length Username | Full Name Words | Bio Length | External URL | Is Private | Is Verified | Is Business | # Post | # Followers | # Following |
---|---|---|---|---|---|---|---|---|---|---|
Profile Pic | Nums / Length Username | Fake Account | Full Name Words | Bio Length | External URL | Is Verified | Is Business | # Post | # Followers | # Following | Last Post Recent | % Post Single Day | Index of Activity | Average of Likes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
- Profile Pic: boolean value. 0 if the user doesn't have a profile picture, 1 otherwise.
- Nums / Length Username: double value. The ratio of numeric or special characters to the full length of the username.
- Full Name Words: numeric value. The number of words in the full name.
- Bio Length: numeric value. The number of characters in the biography of the account.
- External URL: boolean value. 0 if the user doesn't have an external URL in the biography, 1 otherwise.
- Is Private: boolean value. 0 if the user doesn't have a private account, 1 otherwise.
- Is Verified: boolean value. 0 if the user doesn't have the verified badge, 1 otherwise.
- Is Business: boolean value. 0 if the user doesn't have a business account, 1 otherwise.
- # Post: numeric value. The number of posts published by the account.
- # Followers: numeric value. The number of accounts that follow the account.
- # Following: numeric value. The number of accounts the account follows.
- Last Post Recent: boolean value. 0 if the user hasn't published a post within the last 6 months, 1 otherwise.
- % Post Single Day: double value. The share of posts published on the same day, out of the total number of posts.
- Index of Activity: double value. How many posts the account publishes per month on average.
- Average of Likes: double value. The average number of likes per post of the account.
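To make the derived features more concrete, here is one possible way to compute a few of them in plain Python. The exact formulas used by the project are not given in this README, so every definition below is an assumption: non-letter characters count as "numeric or special", "% Post Single Day" is taken as the share of posts on the busiest day, six months is approximated as 183 days, and a month as 30 days. All function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def username_ratio(username: str) -> float:
    """Share of username characters that are not letters (assumed definition)."""
    if not username:
        return 0.0
    non_letters = sum(1 for ch in username if not ch.isalpha())
    return non_letters / len(username)


def full_name_words(full_name: str) -> int:
    """Number of whitespace-separated words in the full name."""
    return len(full_name.split())


def last_post_recent(post_dates: list, now: datetime) -> int:
    """1 if the newest post is at most ~6 months (183 days) old, else 0."""
    if not post_dates:
        return 0
    return int(now - max(post_dates) <= timedelta(days=183))


def pct_posts_single_day(post_dates: list) -> float:
    """Share of all posts that fall on the busiest calendar day (assumed)."""
    if not post_dates:
        return 0.0
    per_day = Counter(d.date() for d in post_dates)
    return max(per_day.values()) / len(post_dates)


def index_of_activity(post_dates: list, now: datetime) -> float:
    """Average posts per 30-day month since the first post (assumed)."""
    if not post_dates:
        return 0.0
    months = max((now - min(post_dates)).days / 30.0, 1.0)
    return len(post_dates) / months


# Hypothetical account used only to exercise the functions.
now = datetime(2024, 6, 30)
posts = [
    datetime(2024, 6, 1, 9),
    datetime(2024, 6, 1, 18),
    datetime(2024, 5, 2),
    datetime(2024, 3, 3),
]
print(username_ratio("john_doe_1990"))
print(full_name_words("John Doe"))
print(last_post_recent(posts, now))
print(pct_posts_single_day(posts))
print(index_of_activity(posts, now))
```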
Algorithm | Accuracy | Precision | Recall | F-Score |
---|---|---|---|---|
AdaBoost | 96% | 96% | 96% | 96% |
Decision Tree | 96% | 96% | 96% | 96% |
KNN Classifier | 95% | 96% | 95% | 95% |
Logistic Regression | 94% | 94% | 94% | 94% |
Multi-Layer Perceptron | 96% | 96% | 96% | 96% |
Random Forest | 94% | 94% | 94% | 94% |
SGD Classifier | 95% | 95% | 95% | 95% |
SVM Classifier | 94% | 94% | 94% | 94% |
Algorithm | Accuracy | Precision | Recall | F-Score |
---|---|---|---|---|
AdaBoost | 97% | 97% | 97% | 97% |
Decision Tree | 97% | 97% | 97% | 97% |
KNN Classifier | 95% | 95% | 95% | 95% |
Logistic Regression | 95% | 95% | 95% | 95% |
Multi-Layer Perceptron | 97% | 97% | 97% | 97% |
Random Forest | 98% | 99% | 98% | 98% |
SGD Classifier | 95% | 95% | 95% | 95% |
SVM Classifier | 95% | 95% | 95% | 95% |
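For readers unfamiliar with the four columns in the tables above, this short sketch shows how accuracy, precision, recall and F-score relate to the counts of a binary confusion matrix. The counts are hypothetical and are not taken from the project's results. The tables may also report values averaged over both classes, which this single-class example does not reproduce.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the four report metrics for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score (F1) is the harmonic mean of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical counts: 90 fakes caught, 10 real accounts flagged by mistake,
# 5 fakes missed, 95 real accounts correctly passed.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```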
Distributed under the GPL License. See LICENSE for more information.
Instagram Web Scraper made by Realsirjoe from https://github.com/realsirjoe/instagram-scraper
Icons made by Roundicons from www.flaticon.com
Riccardo Fava - 287516
Daniele Pellegrini - 285240