Facebook Page Crawler is a web spider for Facebook, written with the Scrapy framework. Currently it supports only Facebook Pages. Given a page id, it can extract all the posts, image URLs, reaction counts, comment counts, and so on.
This project is the new version of this repo.
This script is not authorized by Facebook. For commercial use, please contact Facebook.
The purpose of this script is educational: to demonstrate how a Scrapy spider can be written to extract pages with minimal help from a headless browser.
Use it at your own risk.
Your Facebook account might get suspended if your spider runs too fast. Please be careful.
Try increasing `DOWNLOAD_DELAY` in settings.py.
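For example, in settings.py (the values below are only a starting point, not a project recommendation):

```python
# settings.py
# Slow the spider down to reduce the chance of the account getting flagged.
DOWNLOAD_DELAY = 5            # seconds to wait between requests
AUTOTHROTTLE_ENABLED = True   # let Scrapy back off automatically when the site slows down
```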
It's recommended to install the project inside an isolated environment (Python 3.10). A requirements.txt file is provided that can be used by pip:

```
pip install -r requirements.txt
```
You also need Docker to run a headless browser service such as Splash.

Each scraped item has the following structure:
```
{
  page_id
  page_name
  page_url
  post_id
  post_url
  post_text
  image_urls
  comment_count
  reaction_count
  share_count
  comments: [
    comment_id
    comment_text
    comment_reaction_count
    author_url
    author_name
  ]
}
```
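For illustration only, this structure could be declared as a Scrapy Item roughly like the sketch below (the field names mirror the output above; the actual class in the project may differ):

```python
import scrapy


class FacebookPostItem(scrapy.Item):
    # Page-level metadata
    page_id = scrapy.Field()
    page_name = scrapy.Field()
    page_url = scrapy.Field()
    # Post-level data
    post_id = scrapy.Field()
    post_url = scrapy.Field()
    post_text = scrapy.Field()
    image_urls = scrapy.Field()
    comment_count = scrapy.Field()
    reaction_count = scrapy.Field()
    share_count = scrapy.Field()
    # List of dicts with comment_id, comment_text, comment_reaction_count,
    # author_url and author_name
    comments = scrapy.Field()
```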
Run Splash with Docker on port 8050:

```
docker run -p 8050:8050 scrapinghub/splash
```
In settings.py, please add `SPLASH_URL = 'http://localhost:8050'` (it should already be provided in the project).
You can access the Splash UI via http://localhost:8050.
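If you are wiring Splash up yourself, a typical scrapy-splash configuration looks roughly like this (the project should already ship with an equivalent):

```python
# settings.py
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```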
In order to run the spider, you need to provide some arguments.
Argument list:

- `-a email=your_email@gmail.com` (required)
- `-a password=strongpassword` (required)
- `-a page_id=ejeab` (required); multiple pages can be given separated by commas, e.g. `-a page_id=ejeab,victiousant`
- `-a limit=-1` (optional); the maximum number of items to scrape, `-1` for unlimited, default is 100
- `-o output.json`; write the output as a JSON file
Example: run the spider and store the data as fb.json:

```
scrapy crawl fb_page \
    -a email="myemail@hotmail.com" \
    -a password="strongpass" \
    -a page_id="ejeab" \
    -a limit=-1 \
    -o fb.json
```
Currently, Facebook encrypts the password on the client side before sending it to the server. Instead of sending a plaintext password such as `strongpass`, it sends an encrypted value like `#PWD_BROWSER:5:........`. It would be tedious to reverse engineer and get that right, so this project uses Splash for the login request to grab the cookies; the rest is just normal Scrapy parsing.
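Conceptually, the login step looks something like the sketch below. The Lua script, form selectors, and class names are illustrative assumptions, not the project's actual code (see `middlewares.py` for that):

```python
import scrapy
from scrapy_splash import SplashRequest

# Illustrative Lua script: fill in and submit the login form inside the real
# browser engine, then hand the resulting session cookies back to Scrapy.
LOGIN_LUA = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(3))
    splash:runjs(string.format([[
        document.querySelector('#email').value = '%s';
        document.querySelector('#pass').value = '%s';
        document.querySelector('[name=login]').click();
    ]], args.email, args.password))
    assert(splash:wait(5))
    return {cookies = splash:get_cookies()}
end
"""


class FbLoginSketchSpider(scrapy.Spider):
    name = 'fb_page_sketch'  # hypothetical name, not the project's fb_page spider

    def __init__(self, email=None, password=None, **kwargs):
        super().__init__(**kwargs)
        self.email = email
        self.password = password

    def start_requests(self):
        yield SplashRequest(
            'https://www.facebook.com/login',
            callback=self.after_login,
            endpoint='execute',
            args={'lua_source': LOGIN_LUA,
                  'email': self.email,
                  'password': self.password},
        )

    def after_login(self, response):
        # response.data['cookies'] holds the cookies returned by the Lua script;
        # normal Scrapy requests can reuse them from here on.
        self.logger.info('got %d cookies', len(response.data.get('cookies', [])))
```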
Splash is a lightweight headless browser which runs as a separate service, so it's ready to scale when you have to develop a large-scale web crawling system. You could use something like Selenium or Playwright instead, but those are memory hogs and can get expensive in the long run when running in the cloud.
The recommended way to store data in a database is to store it after the spider has finished the job, or when the spider receives the close signal. Ideally the steps should look like this (a minimal sketch follows below):

- The spider crawls the data and stores the result as an output.json file
- After the spider is done, download or open the file
- Load all the data into the database
Please see `middlewares.py`, line 22, as an example.
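As a rough illustration of that pattern (not the project's code), a post-run loader that reads the spider's JSON output into SQLite could look like this; the table layout and file names are assumptions:

```python
import json
import sqlite3


def load_output_into_db(json_path="output.json", db_path="fb.db"):
    """Run this after the spider has finished, not from inside the spider."""
    with open(json_path, encoding="utf-8") as f:
        items = json.load(f)  # Scrapy's -o output.json writes a JSON array

    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               post_id TEXT PRIMARY KEY,
               page_id TEXT,
               post_text TEXT,
               comment_count INTEGER,
               reaction_count INTEGER,
               share_count INTEGER
           )"""
    )
    con.executemany(
        "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?, ?, ?)",
        [
            (
                item.get("post_id"),
                item.get("page_id"),
                item.get("post_text"),
                item.get("comment_count"),
                item.get("reaction_count"),
                item.get("share_count"),
            )
            for item in items
        ],
    )
    con.commit()
    con.close()


if __name__ == "__main__":
    load_output_into_db()
```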
Scraping is already an I/O-bound task. If you inject another I/O-bound task, like writing to a database, it's not going to scale well in a large-scale crawling system. Also, most databases are not optimized for heavy writes. The exception is if you are using a CQRS pattern; then it's fine to inject the database call.
As this project is for educational purposes, it is not feature-complete. Remaining work:

- Support nested comment parsing
- Get the pagination call for comments
- Avoid the banning issue?