Web Crawler

A webcrawler for scraping Instagram, Facebook, and Twitter posts.
Dependencies required: Selenium and Chromedriver.

Dependency Setup

Follow the steps below for your operating system to set up Chromedriver.

BEFORE setting up Chromedriver, you must have Chrome installed and know which version of Chrome you are using. You can find this by opening the menu at the top-right of the browser -> Help -> About Google Chrome; the version number is shown there. Please download the Chromedriver that matches your version (use the link only if you are a Windows user).

Windows

  1. Extract chromedriver.exe and save it in a location suitable for applications, for example a folder on your local disk.
  2. Open Control Panel and navigate to System and Security -> System -> Advanced System Settings -> Environment Variables.
  3. Under the System variables section, highlight 'Path', click 'Edit', then click 'New'.
  4. Find the full directory path of chromedriver.exe and add that path as a new entry.
  5. You can now run Chromedriver; open Command Prompt and enter the command chromedriver. A startup message will appear if it was installed properly. (An alternative that avoids editing the PATH is sketched below.)
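
If you would rather not modify the PATH at all, Selenium can be pointed at chromedriver.exe directly. This is a minimal sketch, assuming Selenium 4 and a hypothetical install location:

```python
# Hypothetical alternative to the PATH setup above: pass the chromedriver.exe
# location to Selenium explicitly (the path below is an assumption).
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service(executable_path=r"C:\chromedriver\chromedriver.exe")
driver = webdriver.Chrome(service=service)
driver.quit()
```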

Mac

  1. Open Terminal and enter the command: brew install --cask chromedriver (older Homebrew versions use brew cask install chromedriver).
  2. You can now run Chromedriver; open Terminal and enter the command chromedriver. A startup message will appear if it was installed properly. (A Selenium-based check is sketched below.)
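
Whichever OS you are on, a quick way to confirm that both Chromedriver and Selenium work together is to launch and close a browser from Python. This is a minimal smoke test, assuming the selenium package is installed (pip install selenium):

```python
# Minimal smoke test: Selenium should locate chromedriver on the PATH and
# start the locally installed Chrome. Assumes `pip install selenium`.
from selenium import webdriver

driver = webdriver.Chrome()  # picks up chromedriver from the PATH
driver.get("https://www.instagram.com/")
print("Loaded page title:", driver.title)
driver.quit()
```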

How to run the webcrawler

  1. Update the .txt files with all of the sites to crawl.
  2. Assuming you are on macOS, navigate to the text files by going into the dist folder.
  3. Right-click on main to open the file options and click Show Package Contents.
  4. Navigate to Contents, then Resources.
  5. From here, update ALL of the *.txt files in this folder with URLs appropriate to each file's name, e.g. fbpost_urls.txt should contain URLs that are ONLY Facebook posts (the expected one-URL-per-line format is sketched after this list).
  6. Once the text files are updated, navigate out of the Resources folder and open the MacOS folder.
  7. Run the main Unix executable.
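
Each *.txt file is expected to contain one URL per line. The snippet below only illustrates that format and is not the crawler's own code; fbpost_urls.txt comes from the steps above, while the read_urls helper is hypothetical:

```python
# Illustration of the expected one-URL-per-line format of the *.txt files.
# Not the crawler's own code; read_urls is a hypothetical helper.
from pathlib import Path

def read_urls(path):
    """Return the non-empty, stripped lines of a URL list file."""
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]

for url in read_urls("fbpost_urls.txt"):
    print(url)  # this particular file should hold only Facebook post URLs
```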

Expected Data Tables Per Script Execution

This section shows the expected CSV output of each webcrawler for Instagram, Facebook, and Twitter.

Instagram

The CSV results from scraping Instagram posts are as shown:

| URL | Comments | Views | Likes | Date |
| --- | --- | --- | --- | --- |
| https://www.instagram.com/p/CB1XK-BH5ez/ | 2 | 233 | 37 | Jun 24, 2020 |
| https://www.instagram.com/p/B9y7pTAAW0n/ | 7 | 491 | 91 | Mar 16, 2020 |
| https://www.instagram.com/p/CDuO-OHjeej/ | 3 | 96 | 14 | Aug 10, 2020 |

The CSV results from scraping Instagram profiles are as shown:

| URL | Posts | Followers |
| --- | --- | --- |
| https://www.instagram.com/lovexstereo/?hl=en | 2961 | 2256 |
| https://www.instagram.com/candyambulanceband/?hl=en | 640 | 1966 |
| https://www.instagram.com/spenceryenson/?hl=en | 48 | 6085 |

Facebook

The CSV results from scraping Facebook posts are as shown:

| URL | Comments | Views | Likes | Date |
| --- | --- | --- | --- | --- |
| https://www.facebook.com/lovexstereo/videos/855000578361086/ | 6 | 243 | 35 | July 18 at 10:48 AM |
| https://www.facebook.com/lovexstereo/videos/217040609601399/ | 0 | 50 | 10 | July 31 at 4:03 PM |

Twitter

The CSV results from scraping Twitter posts are as shown:

| URL | Comments | Views | Likes | Date |
| --- | --- | --- | --- | --- |
| https://twitter.com/BLKBOXapp/status/1286051479946702851 | N/A | N/A | 2 | Jul 22, 2020 |
| https://twitter.com/BLKBOXapp/status/1273655517823541249 | N/A | N/A | 3 | Jun 18, 2020 |

The CSV results from scraping Twitter profiles are as shown:

| URL | Tweets | Followers |
| --- | --- | --- |
| https://twitter.com/lovexstereo?lang=en | 3,685 Tweets | 1882 |
| https://twitter.com/candyambulance?lang=en | 127 Tweets | 121 |
| https://twitter.com/littlemanmusic?lang=en | 1,217 Tweets | 554 |
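
The CSV files produced by the crawlers can be post-processed like any other CSV. A minimal sketch, assuming a hypothetical output file name (instagram_posts.csv) with the columns shown in the first table above:

```python
# Hypothetical post-processing example: the output file name is an assumption,
# but the columns (URL, Comments, Views, Likes, Date) match the tables above.
import csv

with open("instagram_posts.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["URL"], "likes:", row["Likes"])
```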