Facebook crawling with random proxy servers
Crawls the id, user info, content, date, comments and replies of posts on a Facebook page.
Overview:
I. Features:
- Getting information about posts.
- Filtering comments.
- No sign-in required.
- Checking for redirects.
- Running in an Incognito window.
- Simplifying the browser to minimize loading time.
- Hiding the IP address to avoid being banned, by:
  - Collecting proxies with HTTP Request Randomizer and filtering out the slowest ones.
  - Using Tor relays, as used in Tor Browser; the Tor network is comprised of thousands of volunteer-run servers.
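The proxy filtering idea can be pictured with a small sketch. It is not the code used by this project or by HTTP Request Randomizer: `keep_fastest`, `measure` and the sample latencies are hypothetical names and values chosen for illustration. Each proxy is timed once, and only the fastest ones within a latency limit are kept.

```python
from typing import Callable, Iterable, List, Optional


def keep_fastest(
    proxies: Iterable[str],
    measure: Callable[[str], Optional[float]],
    max_latency: float = 2.0,
    limit: int = 5,
) -> List[str]:
    """Return at most `limit` proxies, fastest first.

    `measure` is a hypothetical callback returning the response time in
    seconds, or None when the proxy did not answer at all.
    """
    timed = []
    for proxy in proxies:
        latency = measure(proxy)
        # Drop dead proxies and those slower than the allowed limit.
        if latency is not None and latency <= max_latency:
            timed.append((latency, proxy))
    timed.sort()
    return [proxy for _, proxy in timed[:limit]]


# Made-up latencies standing in for real measurements.
sample = {
    "10.0.0.1:8080": 0.4,
    "10.0.0.2:3128": 3.5,
    "10.0.0.3:80": None,
    "10.0.0.4:8080": 1.1,
}
fastest = keep_fastest(sample, sample.get)
print(fastest)
```

In a real run, `measure` would time a request sent through the proxy; here a dictionary lookup stands in for it.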
II. Weaknesses:
- Unable to handle some failed responses, for example the RATE LIMIT EXCEEDED response (Facebook stops loading more content).
- Often fails to collect replies.
- Quite slow when many "load more" actions are needed.
III. Result:
- To collect a large number of comments:
  - Load more posts to collect more comments, in case viewing more comments / replies fails.
  - Run the browser without headless mode so that failed responses can be detected.
- Latest run on Firefox with Incognito windows using HTTP Request Randomizer:
Data Field:
{
  "url": "",
  "id": "",
  "utime": "",
  "text": "",
  "total_shares": "",
  "total_cmts": "",
  "reactions": [],
  "crawled_cmts": [
    {
      "id": "",
      "utime": "",
      "user_url": "",
      "user_id": "",
      "user_name": "",
      "text": "",
      "replies": [
        {
          "id": "",
          "utime": "",
          "user_url": "",
          "user_id": "",
          "user_name": "",
          "text": ""
        }
      ]
    }
  ]
}
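To show how the nesting of `crawled_cmts` and `replies` can be consumed, here is a minimal sketch that flattens one record into rows. The helper `iter_comments` and the sample values are hypothetical; only the field names come from the layout above.

```python
from typing import Dict, Iterator, Optional, Tuple


def iter_comments(post: Dict) -> Iterator[Tuple[str, Optional[str], str]]:
    """Yield (comment_id, parent_id, text) for every comment and reply."""
    for cmt in post.get("crawled_cmts", []):
        # Top-level comments have no parent.
        yield cmt["id"], None, cmt["text"]
        for reply in cmt.get("replies", []):
            # A reply points back to the comment it answers.
            yield reply["id"], cmt["id"], reply["text"]


# Made-up record following the field layout above.
post = {
    "id": "p1",
    "text": "hello",
    "crawled_cmts": [
        {"id": "c1", "text": "first", "replies": [{"id": "r1", "text": "reply"}]},
        {"id": "c2", "text": "second", "replies": []},
    ],
}
rows = list(iter_comments(post))
print(rows)
```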
Usage:
I. Install libraries:
pip install -r requirements.txt
- Helium: a wrapper around Selenium with a higher-level API for web automation.
- HTTP Request Randomizer: used for collecting free proxies.
II. Customize variables in crawler.py:
Running browser:
- PAGE_URL: URL of the Facebook page.
- TOR_PATH: use a proxy with Tor for WINDOWS / MAC / LINUX / NONE.
- BROWSER_OPTIONS: run scripts using CHROME / FIREFOX.
- PRIVATE: run in private mode:
  - Prevent Selenium detection ➩ navigator.webdriver must be undefined (check in Dev Tools).
  - Start the browser with an Incognito / Private Window.
- USE_PROXY: run with a proxy or not. If True ➩ check:
  - IF TOR_PATH ≠ NONE ➩ Use Tor's SOCKS proxy server.
  - ELSE ➩ Randomize proxies with HTTP Request Randomizer.
- HEADLESS: run with a headless browser or not.
- SPEED_UP: simplify the browser to minimize loading time:
  - With Chrome:
    # Disable loading image, CSS, ...
    browser_options.add_experimental_option('prefs', {
        "profile.managed_default_content_settings.images": 2,
        "profile.managed_default_content_settings.stylesheets": 2,
        "profile.managed_default_content_settings.cookies": 2,
        "profile.managed_default_content_settings.geolocation": 2,
        "profile.managed_default_content_settings.media_stream": 2,
        "profile.managed_default_content_settings.plugins": 1,
        "profile.default_content_setting_values.notifications": 2,
    })
  - With Firefox:
    # Disable loading image, CSS, Flash
    browser_options.set_preference('permissions.default.image', 2)
    browser_options.set_preference('permissions.default.stylesheet', 2)
    browser_options.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
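The USE_PROXY / TOR_PATH rule can be sketched as two small functions. This is an illustration, not the project's code: `pick_proxy` and `firefox_proxy_prefs` are hypothetical names, TOR_PATH = NONE is modelled as Python `None`, and the Tor SOCKS address 127.0.0.1:9150 is an assumption (Tor Browser's default; a standalone tor daemon defaults to port 9050). The preference keys are the standard Firefox `network.proxy.*` settings.

```python
import random
from typing import Dict, List, Optional, Union

Pref = Union[str, int, bool]


def pick_proxy(use_proxy: bool, tor_path: Optional[str], free_proxies: List[str]) -> Optional[str]:
    """Mirror the USE_PROXY / TOR_PATH rule: Tor first, else a random free proxy."""
    if not use_proxy:
        return None
    if tor_path is not None:
        # Assumed Tor Browser SOCKS endpoint; adjust to the local setup.
        return "socks5://127.0.0.1:9150"
    return "http://" + random.choice(free_proxies)


def firefox_proxy_prefs(proxy: str) -> Dict[str, Pref]:
    """Translate a proxy URL into Firefox network.proxy.* preferences."""
    scheme, address = proxy.split("://", 1)
    host, port = address.rsplit(":", 1)
    prefs: Dict[str, Pref] = {"network.proxy.type": 1}  # 1 = manual configuration
    if scheme.startswith("socks"):
        prefs.update({
            "network.proxy.socks": host,
            "network.proxy.socks_port": int(port),
            "network.proxy.socks_remote_dns": True,  # resolve DNS through the proxy
        })
    else:
        prefs.update({
            "network.proxy.http": host,
            "network.proxy.http_port": int(port),
            "network.proxy.ssl": host,
            "network.proxy.ssl_port": int(port),
        })
    return prefs


# "/path/to/tor" and the proxy address are placeholders.
tor_prefs = firefox_proxy_prefs(pick_proxy(True, "/path/to/tor", []))
free_prefs = firefox_proxy_prefs(pick_proxy(True, None, ["10.0.0.1:8080"]))
print(tor_prefs)
print(free_prefs)
```

Each key/value pair could then be applied with `browser_options.set_preference(key, value)`, in the same way as the Firefox speed-up preferences.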
Loading page:
- SCROLL_DOWN: number of times to scroll down to view more posts.
- FILTER_CMTS_BY: filter comments by MOST_RELEVANT / NEWEST / ALL_COMMENTS.
- VIEW_MORE_CMTS: number of times to click "view more comments".
- VIEW_MORE_REPLIES: number of times to click "view more replies".
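The counters above act as upper bounds on repeated actions. A minimal sketch of such a bounded loop follows; `click_up_to` and `click_once` are hypothetical names, and the real scripts drive the browser through Helium / Selenium rather than through a callback like this.

```python
from typing import Callable


def click_up_to(max_clicks: int, click_once: Callable[[], bool]) -> int:
    """Call `click_once` at most `max_clicks` times.

    `click_once` is a hypothetical callback that returns False when no
    "view more" button is left on the page. Returns the number of clicks made.
    """
    clicks = 0
    # Short-circuit: once the limit is reached, no further click is attempted.
    while clicks < max_clicks and click_once():
        clicks += 1
    return clicks


# Stand-in for a page that offers only three "view more comments" buttons.
remaining = [True, True, True]
done = click_up_to(5, lambda: bool(remaining) and remaining.pop())
print(done)
```

With a limit of 5 and only three buttons available, the loop stops early, which is why a large VIEW_MORE_CMTS value does not by itself guarantee more comments.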
III. Start running:
python crawler.py
- Most of my successful tests were on Firefox with an HTTP Request Randomizer proxy server.
- If this is your first time using these scripts, you can run without a proxy server several times for higher speed, until Facebook requires you to sign in.
- Run while signed out, because some CSS selectors differ when signed in.
- With some proxies, it might be quite slow or require signing in.
- Each post is written on its own line once it is completed.
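Since posts are written line by line, the output can be read back one record at a time. The sketch below assumes every line is a standalone JSON object with the fields listed under Data Field; the file name `posts.jsonl` and the helper `read_posts` are hypothetical, since the actual output path is set in the scripts.

```python
import json
import os
import tempfile
from typing import Dict, Iterator


def read_posts(path: str) -> Iterator[Dict]:
    """Yield one post per non-empty line of a line-delimited JSON file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Write two made-up posts to a temporary file, then read them back.
path = os.path.join(tempfile.mkdtemp(), "posts.jsonl")  # hypothetical file name
with open(path, "w", encoding="utf-8") as handle:
    handle.write(json.dumps({"id": "p1", "crawled_cmts": [{"id": "c1"}]}) + "\n")
    handle.write(json.dumps({"id": "p2", "crawled_cmts": []}) + "\n")

counts = {post["id"]: len(post["crawled_cmts"]) for post in read_posts(path)}
print(counts)
```

Reading lazily like this keeps memory use flat even when a long crawl has produced many posts.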
Test proxy server:
- With HTTP Request Randomizer:
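import random  # random.choice is used below
from browser import *  # assumed to provide proxies, BROWSER_OPTIONS and setup_free_proxy
page_url = 'http://check.torproject.org'
# Pick one of the collected free proxies at random
proxy_server = random.choice(proxies).get_address()
browser_options = BROWSER_OPTIONS.FIREFOX
setup_free_proxy(page_url, proxy_server, browser_options)
# kill_browser()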
- With Tor Relays:
from browser import *
page_url = 'http://check.torproject.org'
tor_path = TOR_PATH.WINDOWS
browser_options = BROWSER_OPTIONS.FIREFOX
setup_tor_proxy(page_url, tor_path, browser_options)
# kill_browser()