Webcrawler for Instagram/Facebook/Twitter post scraping
Dependencies required: Selenium, Chromedriver
Follow the steps below for your operating system to set up Chromedriver.
Before setting up Chromedriver, you must have Chrome installed and know which version of Chrome you are using.
To find the version, open the menu at the top-right of your browser, then go to Help -> About Google Chrome.
The version number is shown there. Download the Chromedriver release that matches your version of Chrome.
(Use the link only if you are a Windows user)
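The version check described above can be sketched in Python. This is a minimal illustration, not part of the crawler: the helper names `major_version` and `versions_match` are hypothetical, the version strings are example values, and the sketch assumes that matching the major version number is what matters, since Chromedriver releases are generally tied to a Chrome major version.

```python
import re


def major_version(version_text):
    """Return the major version number found in a version string."""
    match = re.search(r"(\d+)\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def versions_match(chrome_text, chromedriver_text):
    """True when Chrome and Chromedriver report the same major version."""
    return major_version(chrome_text) == major_version(chromedriver_text)


# Example strings only (assumed values); substitute the text shown under
# About Google Chrome and the version of the Chromedriver you downloaded.
print(versions_match("Version 84.0.4147.105 (Official Build) (64-bit)",
                     "ChromeDriver 84.0.4147.30"))
```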
On Windows:
- Extract chromedriver.exe and save it in a location appropriate for an application, for example on the local disk.
- Open Control Panel and navigate to (System and Security) -> (System) -> (Advanced System Settings) -> (Environment Variables).
- Under the System variables section, highlight 'Path', click 'Edit', then click 'New'.
- Find the full path of the directory that contains chromedriver.exe and add it as the new entry.
- You can now run Chromedriver. Open Command Prompt and enter the command `chromedriver`; a startup message will appear if it was installed properly.
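On macOS:
- Open Terminal and enter the command `brew cask install chromedriver` (on newer Homebrew releases, where the `brew cask` subcommand has been removed, use `brew install --cask chromedriver` instead).
- You can now run Chromedriver. In Terminal, enter the command `chromedriver`; a startup message will appear if it was installed properly.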
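As an alternative to launching `chromedriver` by hand, the installation check can be scripted. The sketch below is only an illustration with hypothetical helper names (`find_chromedriver`, `chromedriver_version`); it assumes Python 3.7 or later and relies on the standard library's `shutil.which` to search `PATH` and on Chromedriver's `--version` flag.

```python
import shutil
import subprocess


def find_chromedriver(name="chromedriver"):
    """Return the full path of the executable found on PATH, or None."""
    return shutil.which(name)


def chromedriver_version(path):
    """Run `<path> --version` and return the text it prints."""
    result = subprocess.run(
        [path, "--version"],
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    path = find_chromedriver()
    if path is None:
        print("chromedriver was not found on PATH; revisit the setup steps.")
    else:
        print(f"Found {path}: {chromedriver_version(path)}")
```

On Windows, `shutil.which` also tries the extensions listed in `PATHEXT`, so the name `chromedriver` should resolve to `chromedriver.exe`.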
To run the crawler:
- Update the txt files with all of the sites to crawl.
- Assuming you are on macOS, navigate to the text files by going into the `dist` folder.
- Right-click on main to open the file options, then click Show Package Contents.
- Navigate to `Contents`, then `Resources`.
- From here, update ALL `*.txt` files in this folder with URLs appropriate to each file's name, e.g. fbpost_urls.txt should contain ONLY URLs of Facebook posts.
- Once the text files are updated, navigate out of the `Resources` folder and open the `MacOS` folder.
- Run the `main` Unix executable.
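Before running the executable, it can help to confirm that each text file holds only URLs for its own site. The following is a rough sketch rather than part of the crawler: it assumes one URL per line, the helper names are hypothetical, and only fbpost_urls.txt is a file name taken from this README.

```python
from urllib.parse import urlparse

# Only fbpost_urls.txt is named in this README; add the other files'
# names and domains yourself (any further entries are assumptions).
EXPECTED_DOMAINS = {
    "fbpost_urls.txt": "facebook.com",
}


def belongs_to(url, domain):
    """True when the URL's host is the domain or one of its subdomains."""
    host = (urlparse(url.strip()).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def misplaced_urls(lines, domain):
    """Return the non-empty lines that are not URLs on the given domain."""
    return [line.strip() for line in lines
            if line.strip() and not belongs_to(line, domain)]


sample = [
    "https://www.facebook.com/lovexstereo/videos/855000578361086/",
    "https://www.instagram.com/p/CB1XK-BH5ez/",
]
# Lists the Instagram URL, which does not belong in a Facebook file.
print(misplaced_urls(sample, EXPECTED_DOMAINS["fbpost_urls.txt"]))
```

To check a real file, pass an open file object instead of `sample`, for example `misplaced_urls(open("fbpost_urls.txt"), "facebook.com")` from inside the `Resources` folder.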
This section shows the expected CSV output of each crawler for Instagram, Facebook, and Twitter.
The CSV results from scraping Instagram posts are as shown:
URL | Comments | Views | Likes | Date |
---|---|---|---|---|
https://www.instagram.com/p/CB1XK-BH5ez/ | 2 | 233 | 37 | Jun 24, 2020 |
https://www.instagram.com/p/B9y7pTAAW0n/ | 7 | 491 | 91 | Mar 16, 2020 |
https://www.instagram.com/p/CDuO-OHjeej/ | 3 | 96 | 14 | Aug 10, 2020 |
The CSV results from scraping Instagram profiles are as shown:
URL | Posts | Followers |
---|---|---|
https://www.instagram.com/lovexstereo/?hl=en | 2961 | 2256 |
https://www.instagram.com/candyambulanceband/?hl=en | 640 | 1966 |
https://www.instagram.com/spenceryenson/?hl=en | 48 | 6085 |
The CSV results from scraping Facebook posts are as shown:
URL | Comments | Views | Likes | Date |
---|---|---|---|---|
https://www.facebook.com/lovexstereo/videos/855000578361086/ | 6 | 243 | 35 | July 18 at 10:48 AM |
https://www.facebook.com/lovexstereo/videos/217040609601399/ | 0 | 50 | 10 | July 31 at 4:03 PM |
The CSV results from scraping Twitter posts are as shown:
URL | Comments | Views | Likes | Date |
---|---|---|---|---|
https://twitter.com/BLKBOXapp/status/1286051479946702851 | N/A | N/A | 2 | Jul 22, 2020 |
https://twitter.com/BLKBOXapp/status/1273655517823541249 | N/A | N/A | 3 | Jun 18, 2020 |
The CSV results from scraping Twitter profiles are as shown:
URL | Tweets | Followers |
---|---|---|
https://twitter.com/lovexstereo?lang=en | 3,685 Tweets | 1882 |
https://twitter.com/candyambulance?lang=en | 127 Tweets | 121 |
https://twitter.com/littlemanmusic?lang=en | 1,217 Tweets | 554 |
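The count columns above mix plain numbers, values such as `3,685 Tweets`, and `N/A`. If you need them as integers for further analysis, a small normalizer along these lines may help. This is a sketch under the assumption that the cells look like the samples shown, and `parse_count` is a hypothetical helper, not part of the crawler.

```python
import re


def parse_count(cell):
    """Return the first number in a cell as an int, or None if there is none."""
    match = re.search(r"\d[\d,]*", cell)
    if match is None:
        return None  # e.g. "N/A"
    return int(match.group(0).replace(",", ""))


for cell in ["3,685 Tweets", "1882", "N/A"]:
    print(repr(cell), "->", parse_count(cell))
```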