reddit-scraper

A tool for scraping and visualizing search results from Reddit.

Setting Up

Installing Node.js

Json-Server is used as the back end for visualizing the scraped data, and it requires Node.js. The following bash commands install Node.js on Debian-based distributions. For other systems, please refer to the official Node.js installation guide.

curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -
sudo apt-get install -y nodejs

Setting up Json-Server

Inside the jsonserver directory, run the following bash command without modifying the existing files:

npm install --save json-server

Installing Beautiful Soup 4, Requests and LXML

Beautiful Soup 4 must be installed along with the lxml parser. The requests library is also required to fetch the HTML content of Reddit.

pip3 install beautifulsoup4
pip3 install requests
pip3 install lxml
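
As a quick sanity check after installation, the following sketch reports which of the three packages Python cannot find. It is not part of the project; the function name missing_packages is hypothetical, and the only assumption is that the packages are imported under their usual module names (bs4, requests, lxml).

```python
import importlib.util

# Import names can differ from pip package names: beautifulsoup4 is imported as bs4.
REQUIRED_MODULES = {"bs4": "beautifulsoup4", "requests": "requests", "lxml": "lxml"}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be located."""
    return [
        pip_name
        for module, pip_name in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All scraper dependencies are installed.")
```

If anything is reported missing, rerun the corresponding pip3 install command above.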

Usage

This tool has two parts: a web scraper and a dynamic web page for visualizing the results. The scraper stores its data in a file called product.json, which Json-Server serves to the front end for visualization.

Scraping

You can limit the scraper's search to a specific subreddit, or have it search all subreddits.

Searching for a Keyword in All Subreddits

For example, in order to search for the keyword uzay in all subreddits, run the command below inside the root project folder:

python3 scraper.py --keyword="uzay"

Searching for a Keyword in a Specific Subreddit

In order to search for the keyword ayn rand in the subreddit r/objectivism, run the command below inside the root project folder:

python3 scraper.py --keyword="ayn rand" --subreddit="objectivism"
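
To make the two invocations above more concrete, here is a minimal sketch of how the --keyword and --subreddit options could be parsed and turned into a Reddit search URL. This is an illustration, not the code in scraper.py: the helper names build_parser and search_url are hypothetical, and the URL layout (the /search path with the q and restrict_sr query parameters) is an assumption about which Reddit page the scraper requests.

```python
import argparse
from urllib.parse import quote, urlencode


def build_parser():
    """Command-line interface mirroring the two options used above."""
    parser = argparse.ArgumentParser(description="Search Reddit for a keyword.")
    parser.add_argument("--keyword", required=True, help="search term")
    parser.add_argument("--subreddit", default=None,
                        help="optional subreddit to restrict the search to")
    return parser


def search_url(keyword, subreddit=None):
    """Build a search URL (assumed layout; the real scraper may differ)."""
    if subreddit:
        # restrict_sr=1 asks Reddit to search only inside the given subreddit.
        query = urlencode({"q": keyword, "restrict_sr": "1"})
        return f"https://www.reddit.com/r/{quote(subreddit)}/search?{query}"
    return "https://www.reddit.com/search?" + urlencode({"q": keyword})


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(search_url(args.keyword, args.subreddit))
```

Leaving --subreddit unset falls through to the site-wide search, which matches the all-subreddits case.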

Incremental Search

If a product.json file already exists, the scraper appends the search results for a new keyword to the end of the file.

If the keyword already exists in product.json, the scraper starts searching from the date of the most recent post and appends the new content after the posts previously saved for that keyword.
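
The incremental behavior described above can be sketched roughly as follows. The layout of product.json is not documented here, so this example assumes a hypothetical schema: a JSON object mapping each keyword to a list of posts, where every post has a "date" field holding an ISO 8601 string. The function name merge_results is also hypothetical, and the sketch filters already-scraped posts by date rather than starting the search from that date as the real scraper does.

```python
import json
from pathlib import Path


def merge_results(path, keyword, scraped_posts):
    """Append posts newer than the latest saved post; return how many were added."""
    path = Path(path)
    data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

    # A new keyword gets an empty list appended after the existing keywords.
    existing = data.setdefault(keyword, [])

    # ISO 8601 strings in one consistent format sort chronologically as plain text.
    latest = max((post["date"] for post in existing), default="")
    new_posts = sorted(
        (post for post in scraped_posts if post["date"] > latest),
        key=lambda post: post["date"],
    )

    existing.extend(new_posts)
    path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    return len(new_posts)
```

Calling merge_results twice with overlapping results for the same keyword adds only the posts dated after the newest one already stored.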

Visualization

When the product.json file is ready for visualization, run the following command inside the jsonserver directory in order to start the Json-Server:

npm run json:server

Then, open the reddit.html file in a browser.
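
If the page stays empty, one thing worth ruling out is a malformed product.json. The sketch below is a hypothetical helper (list_resources is not part of the project) that checks the file parses and that its top level is a JSON object, which is the shape Json-Server expects for its database file. The default file location is an assumption; pass the actual path if it differs.

```python
import json


def list_resources(path="product.json"):
    """Return each top-level key with its entry count, or raise if the file is unusable."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)  # raises json.JSONDecodeError on malformed JSON

    if not isinstance(data, dict):
        raise ValueError("the top level of the file must be a JSON object")

    # Lists are counted by length; any other value counts as a single entry.
    return {
        name: len(value) if isinstance(value, list) else 1
        for name, value in data.items()
    }
```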