Crawl the id, user info, content, date, comments and replies of posts on a Facebook page.
- Getting information about posts.
- Filtering comments.
- No sign-in required.
- Checking for redirects.
- Running in an Incognito window.
- Simplifying the browser to minimize loading time.
- Hiding the IP address to avoid being banned, by:
  - Collecting free proxies with HTTP Request Randomizer and filtering out the slowest ones.
  - Using Tor relays, as used by Tor Browser: a network of thousands of volunteer-run servers.
Known limitations:
- Some failed responses are not handled. Example: a RATE LIMIT EXCEEDED response (Facebook stops loading more content) can only be detected by running without HEADLESS.
- Quite slow when many "load more" actions are needed.
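The proxy-filtering step listed above (collect proxies, then drop the slowest ones) could look roughly like this. It is only an illustrative sketch, not the code in this repository: `filter_fastest`, `measure` and the sample addresses are hypothetical, and the timing function is passed in so the example runs without any network access.

```python
from typing import Callable, Dict, Iterable, List, Optional


def filter_fastest(
    proxies: Iterable[str],
    measure: Callable[[str], Optional[float]],
    max_seconds: float = 5.0,
    keep: int = 10,
) -> List[str]:
    """Return up to `keep` proxies, fastest first, dropping slow or dead ones.

    `measure` returns the response time in seconds, or None if the proxy failed.
    """
    timed = []
    for address in proxies:
        seconds = measure(address)
        if seconds is not None and seconds <= max_seconds:
            timed.append((seconds, address))
    timed.sort()  # ascending by response time
    return [address for _, address in timed[:keep]]


# Made-up measurements standing in for real timed requests.
SAMPLE_TIMES: Dict[str, Optional[float]] = {
    "10.0.0.1:8080": 0.8,
    "10.0.0.2:3128": 7.5,  # slower than max_seconds, dropped
    "10.0.0.3:80": None,   # no response, dropped
    "10.0.0.4:8080": 0.3,
}

fastest = filter_fastest(SAMPLE_TIMES, SAMPLE_TIMES.get, max_seconds=5.0, keep=2)
print(fastest)
```

In a real run, `measure` would time a request sent through each proxy; the threshold and the number of proxies kept are tuning choices.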
- Each post is written on its own line.
- Most of my successful tests were on Firefox with proxies from HTTP Request Randomizer.
- Latest run: Firefox in a private (Incognito) window, using HTTP Request Randomizer.
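Because each post sits on its own line, the output can be read back one record at a time. A minimal sketch, assuming every line holds one JSON object shaped like the example record in this README; `iter_posts` is a hypothetical helper, and the in-memory text (including the made-up second record) stands in for the real output file.

```python
import io
import json
from typing import Iterator, TextIO


def iter_posts(stream: TextIO) -> Iterator[dict]:
    """Yield one post per non-empty line (assumes one JSON object per line)."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two shortened records standing in for the crawler's output file.
sample = io.StringIO(
    '{"id": "352525915858361", "total_cmts": "169 Comments"}\n'
    '\n'
    '{"id": "123", "total_cmts": "2 Comments"}\n'
)
posts = list(iter_posts(sample))
print(len(posts), posts[0]["id"])
```

Reading line by line keeps memory use flat even when the crawl produced many posts.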
Example data fields for a post:
{
    "url": "https://www.facebook.com/KTXDHQGConfessions/videos/352525915858361/",
    "id": "352525915858361",
    "utime": "1603770573",
    "text": "Diễn tập PCCC tại KTX khu B tòa E1. ----------- #ktx_cfs Nguồn : Trường Vũ",
    "reactions": ["308 Like", "119 Haha", "28 Wow"],
    "total_shares": "26 Shares",
    "total_cmts": "169 Comments",
    "crawled_cmts": [
        {
            "id": "Y29tbWVudDozNDM0NDI0OTk5OTcxMDgyXzM0MzQ0MzIyMTY2MzcwMjc%3D",
            "utime": "1603770714",
            "user_url": "https://www.facebook.com/KTXDHQGConfessions/",
            "user_id": "KTXDHQGConfessions",
            "user_name": "KTX ĐHQG Confessions",
            "text": "Toà t á bây :) #Lép",
            "replies": [
                {
                    "id": "Y29tbWVudDozNDM0NDI0OTk5OTcxMDgyXzM0MzQ0OTc5MDk5NjM3OTE%3D",
                    "utime": "1603772990",
                    "user_url": "https://www.facebook.com/KTXDHQGConfessions/",
                    "user_id": "KTXDHQGConfessions",
                    "user_name": "KTX ĐHQG Confessions",
                    "text": "Nguyễn Hoàng Đạt thật đáng tự hào :) #Lép"
                }
            ]
        }
    ]
}
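All fields in the record above are strings, so some post-processing is usually needed before analysis. The sketch below shows one possible way to do it. It assumes that `utime` is a Unix timestamp in seconds and that comment ids are URL-encoded Base64 (both look consistent with the sample values, but neither is documented here); the helper names are hypothetical, and abbreviated counts such as "1.2K" are not handled.

```python
import base64
from datetime import datetime, timezone
from urllib.parse import unquote


def parse_count(label: str) -> int:
    """Turn a label such as "169 Comments" into 169 (plain integers only)."""
    return int(label.split()[0].replace(",", ""))


def parse_reactions(labels: list) -> dict:
    """Turn ["308 Like", "119 Haha"] into {"Like": 308, "Haha": 119}."""
    result = {}
    for label in labels:
        count, name = label.split(maxsplit=1)
        result[name] = int(count.replace(",", ""))
    return result


def decode_comment_id(comment_id: str) -> str:
    """URL-decode, then Base64-decode a comment id (assumed encoding)."""
    return base64.b64decode(unquote(comment_id)).decode("utf-8")


# Values copied from the example record above.
posted = datetime.fromtimestamp(int("1603770573"), tz=timezone.utc)
reactions = parse_reactions(["308 Like", "119 Haha", "28 Wow"])
shares = parse_count("26 Shares")
decoded = decode_comment_id(
    "Y29tbWVudDozNDM0NDI0OTk5OTcxMDgyXzM0MzQ0MzIyMTY2MzcwMjc%3D"
)
print(posted.isoformat(), reactions, shares, decoded)
```

Keeping the raw strings in the crawl output and converting them afterwards means a parsing mistake never costs a re-crawl.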
pip install -r requirements.txt
- Helium: a wrapper around Selenium with a higher-level API for web automation.
- HTTP Request Randomizer: used for collecting free proxies.
II. Customize parameters in crawler.py
- Running browser:
  - PAGE_URL: URL of the Facebook page.
  - TOR_PATH: use a proxy through Tor for WINDOWS / MAC / LINUX / NONE.
  - BROWSER_OPTIONS: run the scripts using CHROME / FIREFOX.
  - PRIVATE: run in private mode:
    - Prevent Selenium detection ➩ navigator.webdriver must be undefined (check in Dev Tools).
    - Start the browser in an Incognito / Private window.
  - USE_PROXY: whether to run through a proxy. If True ➩ check:
    - IF TOR_PATH ≠ NONE ➩ use Tor's SOCKS proxy server.
    - ELSE ➩ randomize proxies with HTTP Request Randomizer.
  - HEADLESS: whether to run the browser headless.
  - SPEED_UP: simplify the browser to minimize loading time:
    - With Chrome:
      # Disable loading of images, CSS, ... (2 = block, 1 = allow)
      browser_options.add_experimental_option('prefs', {
          "profile.managed_default_content_settings.images": 2,
          "profile.managed_default_content_settings.stylesheets": 2,
          "profile.managed_default_content_settings.cookies": 2,
          "profile.managed_default_content_settings.geolocation": 2,
          "profile.managed_default_content_settings.media_stream": 2,
          "profile.managed_default_content_settings.plugins": 1,
          "profile.default_content_setting_values.notifications": 2,
      })
    - With Firefox:
      # Disable loading of images, CSS, Flash
      browser_options.set_preference('permissions.default.image', 2)
      browser_options.set_preference('permissions.default.stylesheet', 2)
      # This preference is a boolean, so pass False rather than the string 'false'
      browser_options.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', False)
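The USE_PROXY / TOR_PATH decision described in the parameter list can be written as a small pure function. This is a sketch under assumptions, not the logic in crawler.py: `TorPath`, `choose_proxy` and `chrome_arguments` are hypothetical names, and the Tor address assumes the tor daemon's default SOCKS port 9050 (Tor Browser listens on 9150 instead). The Chrome switches used are the standard `--incognito`, `--headless` and `--proxy-server`.

```python
import random
from enum import Enum
from typing import List, Optional


class TorPath(Enum):  # hypothetical stand-in for TOR_PATH in crawler.py
    WINDOWS = "windows"
    MAC = "mac"
    LINUX = "linux"
    NONE = "none"


# Assumption: default SOCKS port of a locally running tor daemon.
TOR_SOCKS = "socks5://127.0.0.1:9050"


def choose_proxy(use_proxy: bool, tor_path: TorPath,
                 free_proxies: List[str]) -> Optional[str]:
    """Return the proxy URL to use, or None for a direct connection."""
    if not use_proxy:
        return None
    if tor_path is not TorPath.NONE:
        return TOR_SOCKS  # Tor takes priority over free proxies
    if not free_proxies:
        raise ValueError("USE_PROXY is set but no free proxies were collected")
    return "http://" + random.choice(free_proxies)


def chrome_arguments(private: bool, headless: bool,
                     proxy: Optional[str]) -> List[str]:
    """Build the Chrome command-line switches for the chosen options."""
    args = []
    if private:
        args.append("--incognito")
    if headless:
        args.append("--headless")
    if proxy:
        args.append("--proxy-server=" + proxy)
    return args


proxy = choose_proxy(True, TorPath.LINUX, [])
print(chrome_arguments(private=True, headless=False, proxy=proxy))
```

Keeping the decision separate from the browser start-up makes it easy to check each combination of flags without launching a browser.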
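The Selenium-detection check mentioned for PRIVATE can also be done from code instead of Dev Tools. Browsers expose the automation flag as `navigator.webdriver`, and Selenium drivers provide `execute_script` to read it. The sketch below is illustrative only: `webdriver_flag_hidden` is a hypothetical helper, and `FakeDriver` is a stand-in so the example runs without a browser.

```python
from typing import Any


def webdriver_flag_hidden(driver: Any) -> bool:
    """True when navigator.webdriver is undefined or false in the current page.

    `driver` is any object with Selenium's execute_script(script) method.
    An undefined JavaScript value comes back to Python as None.
    """
    value = driver.execute_script("return navigator.webdriver")
    return not value


class FakeDriver:
    """Test double standing in for a real Selenium driver."""

    def __init__(self, value: Any) -> None:
        self.value = value

    def execute_script(self, script: str, *args: Any) -> Any:
        return self.value


hidden = webdriver_flag_hidden(FakeDriver(None))
exposed = webdriver_flag_hidden(FakeDriver(True))
print(hidden, exposed)
```

With a real driver, the same function could be called right after the page loads to confirm that the private-mode setup took effect.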
- Loading page:
python crawler.py
- Run in a signed-out state, because some CSS selectors differ when signed in.
- With some proxies, it might be quite slow or require signing in.
- To achieve higher speed:
  - If this is your first time using these scripts, you can run without Tor & proxies until Facebook requires you to sign in.
  - Or use a popular VPN service (also running without Tor & proxies): Touch VPN (free), Hotspot Shield VPN (free, Premium available), ...
  - Learn more about 4 ways to hide your IP address & compare their speed.
- To collect a large number of comments:
  - Load more posts to collect more comments in case viewing more comments / replies fails.
  - Run the browser without headless mode to detect failed responses (comments / replies no longer loading).
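Failed "load more" responses can also be spotted programmatically by watching whether the number of visible comments still grows. The following is a rough sketch of that idea, not what crawler.py does: `load_all`, `click_more` and `count_items` are hypothetical names, and `FakePage` simulates a page that stops responding after 30 items so the example runs without a browser.

```python
import time
from typing import Callable


def load_all(click_more: Callable[[], bool],
             count_items: Callable[[], int],
             max_stalls: int = 3,
             pause: float = 0.0) -> int:
    """Click "load more" until the control disappears or loading stalls.

    click_more returns False when no "load more" control is left.
    count_items returns how many comments / replies are currently visible.
    A stall is a click that does not increase that count.
    Returns the final item count.
    """
    stalls = 0
    seen = count_items()
    while stalls < max_stalls and click_more():
        time.sleep(pause)  # give the page time to respond
        now = count_items()
        if now > seen:
            seen, stalls = now, 0
        else:
            stalls += 1  # possibly rate limited; give up after max_stalls
    return seen


class FakePage:
    """Simulated page: 10 new items per click, nothing more after 30."""

    def __init__(self) -> None:
        self.items = 0
        self.clicks = 0

    def click(self) -> bool:
        self.clicks += 1
        if self.items < 30:
            self.items += 10
        return True  # the button stays visible even when nothing loads

    def count(self) -> int:
        return self.items


page = FakePage()
total = load_all(page.click, page.count, max_stalls=3)
print(total, page.clicks)
```

Stopping after a few stalled clicks avoids spending the whole run on a post whose comments no longer load, so the crawler can move on to the next post.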
- With HTTP Request Randomizer:
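# Assumes browser.py exposes setup_free_proxy, BROWSER_OPTIONS and the collected `proxies` list
from browser import *
import random

page_url = 'http://check.torproject.org'
proxy_server = random.choice(proxies).get_address()  # pick one collected proxy at random
browser_options = BROWSER_OPTIONS.FIREFOX
setup_free_proxy(page_url, proxy_server, browser_options)
# kill_browser()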
- With Tor Relays:
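# Assumes browser.py exposes setup_tor_proxy, TOR_PATH and BROWSER_OPTIONS
from browser import *

page_url = 'http://check.torproject.org'  # reports whether the connection goes through Tor
tor_path = TOR_PATH.WINDOWS
browser_options = BROWSER_OPTIONS.FIREFOX
setup_tor_proxy(page_url, tor_path, browser_options)
# kill_browser()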