subreddit-analyzer

Website URL: https://vast-plateau-92435.herokuapp.com/

Instructions to use website

You can either enter the text of the post whose flair you want to know, or enter the URL of a valid reddit post. Make sure to select TEXT or URL from the dropdown.
After clicking the search button, wait about 15 seconds for the result.

Note: Unfortunately, I could not add the graphs from the data analysis to the website because of memory limits. (I used up all 500 MB of memory provided by Heroku and only just managed to fit the website.) You can find all the graphs that were meant for the website inside the graph folder.

Directory Structure

Note: All the scripts assume that the txt files they need are in the same directory as the script.
analysis_script: Contains scripts that analyse the scraped data to draw useful conclusions and present them as graphs
graph: Contains all the graphs generated by the scripts in the analysis_script folder.
db: Contains the MongoDB database.
other_scripts: Scripts to sanitize database and remove duplicate documents.
Website: Contains the website (Made using flask)
tfidf: Scripts to compute TF-IDF
tfs: Contains files with the TF-IDF values of all the words in each flair
wordcloud: Word clouds of all the flairs
useless: Contains some random scripts that were once used during development
storing_scripts: Scripts used for scraping the reddit posts and comments
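
To illustrate what the tfidf and tfs folders are about, here is a minimal sketch of a per-flair TF-IDF computation. It is not the code from the tfidf folder: it assumes that each flair is treated as one document made of all its texts, that texts are already tokenized, and that the plain tf * log(N / df) weighting is used. The function name tfidf_per_flair and the flair names in the sample are made up for illustration.

```python
import math
from collections import Counter


def tfidf_per_flair(texts_by_flair):
    """Return {flair: {word: tf-idf}}, treating each flair as one document.

    texts_by_flair maps a flair name to a list of already-tokenized texts.
    """
    # Term counts per flair: all tokens from all texts with that flair.
    counts = {
        flair: Counter(token for tokens in texts for token in tokens)
        for flair, texts in texts_by_flair.items()
    }
    n_flairs = len(counts)

    # Document frequency: in how many flairs does each word appear?
    doc_freq = Counter()
    for counter in counts.values():
        doc_freq.update(counter.keys())

    scores = {}
    for flair, counter in counts.items():
        total = sum(counter.values())
        scores[flair] = {
            word: (count / total) * math.log(n_flairs / doc_freq[word])
            for word, count in counter.items()
        }
    return scores


# Placeholder data: the flair names and tokens are invented for this example.
sample = {
    "Politics": [["election", "vote"], ["vote", "news"]],
    "Sports": [["cricket", "match"], ["match", "news"]],
}
scores = tfidf_per_flair(sample)
```

With this weighting, a word that occurs in every flair (here "news") gets a score of zero, while words specific to one flair score higher.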

Installation

Download this repository

$ git clone https://github.com/nahimilega/subreddit-analyzer.git

Create a python virtual environment and activate it:

$ python3 -m venv ven
$ source ven/bin/activate
$ cd ven/

Use the package manager pip to install the dependencies of this project

$ pip install -r requirements.txt

To run the website -

$ cd website
$ python deploy.py

(Make sure to download all the NLTK modules listed in nltk.txt in the website folder.)
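
One way to fetch the NLTK resources named in nltk.txt is sketched below. It assumes nltk.txt holds one resource name per line, which this README does not confirm, so adjust the parsing if the file looks different. The helper names read_resource_names and download_resources are hypothetical; nltk.download is the real NLTK call and returns True on success.

```python
def read_resource_names(path):
    """Return the non-empty, non-comment lines of a resource list file."""
    with open(path, encoding="utf-8") as handle:
        lines = (line.strip() for line in handle)
        return [line for line in lines if line and not line.startswith("#")]


def download_resources(names, downloader=None):
    """Download each named NLTK resource and return the names that failed."""
    if downloader is None:
        import nltk  # imported lazily so the helper can be tested without NLTK

        downloader = nltk.download
    return [name for name in names if not downloader(name)]
```

From inside the website folder, this could be called as download_resources(read_resource_names("nltk.txt")); an empty list as the result means every download succeeded.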

Database Model

(Note: I could not upload the comment db because of GitHub's 100 MB limit.)
https://drive.google.com/file/d/1zCjKkd5xEue2moViP3OJX8kKzbaOyqSv/view?usp=sharing
This is the link to the collection that is larger than 100 MB. Download it and place it in the db/subreddit folder.

Database name - Subreddit

Collections

posts2: Stores all the scraped posts

'post_id': ID of the post,
'author': Name of the author,
'title': Title of the post,
'flair': Flair of the post,
'time': Time of creation of the post (UTC),
'over_18': (bool) Whether the post is marked over 18,
'num_comment': Number of comments on the post,
'upvote': Upvotes on the post

comments: Stores all the comments of all the scraped posts
'body': Body of the comment,
'time': Time of comment creation (UTC),
'author': Author of the comment,
'upvotes': Upvotes on the comment,
'post_id': ID of the post to which the comment belongs
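
The two collections are linked through post_id. The sketch below shows what one document of each kind might look like and how comments can be grouped under their post. Every value is a placeholder, the storage type of the time field (a UTC epoch timestamp here) is an assumption, and the helper names are hypothetical. It uses plain dicts, so no MongoDB server is needed to try it.

```python
POST_FIELDS = {
    "post_id", "author", "title", "flair",
    "time", "over_18", "num_comment", "upvote",
}
COMMENT_FIELDS = {"body", "time", "author", "upvotes", "post_id"}


def missing_fields(document, required):
    """Return the required field names absent from a document, sorted."""
    return sorted(required - set(document))


def group_comments_by_post(comments):
    """Map each post_id to the list of comment documents that reference it."""
    grouped = {}
    for comment in comments:
        grouped.setdefault(comment["post_id"], []).append(comment)
    return grouped


# Placeholder documents: all values below are made up for illustration.
example_post = {
    "post_id": "abc123",
    "author": "example_user",
    "title": "Example title",
    "flair": "Example flair",
    "time": 1580000000.0,  # assumed to be a UTC epoch timestamp
    "over_18": False,
    "num_comment": 2,
    "upvote": 10,
}
example_comments = [
    {"body": "First", "time": 1580000100.0, "author": "user_a",
     "upvotes": 3, "post_id": "abc123"},
    {"body": "Second", "time": 1580000200.0, "author": "user_b",
     "upvotes": 1, "post_id": "abc123"},
]
comments_by_post = group_comments_by_post(example_comments)
```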

Algorithm

This

Data Collection

To collect the posts of the subreddit, I wrote the scrape_post.py script, which uses the reddit API to get the posts. Because the reddit API returned only a limited number of posts, I scraped multiple links and even the search results to get as many posts as possible. I managed to scrape 8341 posts.
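
The idea of combining several sources and keeping each post once can be sketched as follows. This is not the scrape_post.py script itself: it assumes PRAW is used for the listings, that "upvote" corresponds to PRAW's score attribute, and that the hot, new, top and search listings are the sources. The function names and the credential parameters are placeholders.

```python
def post_to_document(submission):
    """Convert a PRAW Submission into a document shaped like posts2."""
    return {
        "post_id": submission.id,
        # author is None when the account has been deleted
        "author": submission.author.name if submission.author else None,
        "title": submission.title,
        "flair": submission.link_flair_text,
        "time": submission.created_utc,
        "over_18": submission.over_18,
        "num_comment": submission.num_comments,
        "upvote": submission.score,  # assumption: score stands in for upvotes
    }


def merge_unique(*listings):
    """Merge iterables of submissions, keeping the first copy of each post id."""
    documents = {}
    for listing in listings:
        for submission in listing:
            if submission.id not in documents:
                documents[submission.id] = post_to_document(submission)
    return list(documents.values())


def scrape_posts(subreddit_name, query, client_id, client_secret, user_agent):
    """Collect posts from several listings plus one search query."""
    import praw  # imported here so the helpers above work without PRAW

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    subreddit = reddit.subreddit(subreddit_name)
    return merge_unique(
        subreddit.hot(limit=None),
        subreddit.new(limit=None),
        subreddit.top(limit=None),
        subreddit.search(query, limit=None),
    )
```

Each listing is capped by reddit, so merging several of them and dropping repeated ids is what raises the total number of distinct posts.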

To get all the comments corresponding to a post, I wrote a scrape_post.py script. It uses PRAW (The Python Reddit API Wrapper) to scrape the comments of all the previously scraped posts and stores them in the comments collection of the db. I managed to scrape 838514 comments.
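
A minimal sketch of collecting every comment of one post with PRAW is shown below. It is an illustration under assumptions, not the project's script: the function names are hypothetical and "upvotes" is assumed to map to PRAW's score attribute. The calls replace_more and list on submission.comments are part of PRAW's comment forest API.

```python
def comment_to_document(comment, post_id):
    """Convert a PRAW Comment into a document shaped like the comments collection."""
    return {
        "body": comment.body,
        "time": comment.created_utc,
        # author is None when the account has been deleted
        "author": comment.author.name if comment.author else None,
        "upvotes": comment.score,  # assumption: score stands in for upvotes
        "post_id": post_id,
    }


def comments_for_submission(submission):
    """Return one document per comment in the submission's comment tree."""
    # Replace every "load more comments" placeholder with the real comments.
    submission.comments.replace_more(limit=None)
    # list() flattens the tree, so replies at any depth are included.
    return [
        comment_to_document(comment, submission.id)
        for comment in submission.comments.list()
    ]
```

Calling replace_more with limit=None can take many requests on large threads, which is worth keeping in mind when scraping thousands of posts.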

Libraries Used

This project relies on Flask with Jinja for handling the web display and serving of pages. PRAW was used to scrape data from reddit. For text preprocessing, nltk is used.

References:

Reddit API:
https://www.reddit.com/wiki/search#wiki_search_api
https://praw.readthedocs.io/en/latest/

Text Classification Algorithm:
http://www.imedpub.com/articles/an-efficient-classification-model-for-unstructured-text-document.pdf
https://towardsdatascience.com/pre-processing-in-natural-language-machine-learning-898a84b8bd47
https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568
https://en.wikipedia.org/wiki/Document_classification
https://medium.com/mlrecipies/document-classification-using-machine-learning-f1dfb1171935
https://towardsdatascience.com/algorithms-for-text-classification-part-1-naive-bayes-3ff1d116fdd8
https://www.scitepress.org/Papers/2016/59077/59077.pdf
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
https://www.coursera.org/lecture/python-text-mining/learning-text-classifiers-in-python-GaNec
https://www.youtube.com/watch?v=xm-wmBwJLww

WebApp:
https://www.tutorialspoint.com/flask/flask_sending_form_data_to_template.htm
