A scraper that collects posts from https://fbtag.net. I have tested it on Cygwin on a Windows machine and on Ubuntu.
Requirements:

- Python 3.7
- Optional: MySQL server (only if you want to use the MySQL pipeline)
- bs4==0.0.1
- html2text==2018.1.9
- PyMySQL==0.9.2
- Scrapy==1.5.1
- scrapy-mysql-pipeline==2017.10.10

Installation:

- Clone this repository (git clone .... )
- Enter the repository directory
- Install requirements (pip install -r requirements.txt)
- Copy the example settings file and edit it to match your setup (cp settings.example.py settings.py)

Usage:

`scrapy crawl fbtag <-a tag_filter="3,8"> <-a discussion_list_deep=5> <-a discussion_deep=2> <-a sort_order=oldest>`

Parameters explained:

`tag_filter`: Only collect posts from the given tag IDs, passed as a comma-separated list (for example `3,8`).

tag id | tag name |
---|---|
2 | Lập trình |
3 | Tin tức |
4 | Chat |
5 | Truyện |
6 | Super |
7 | Giải Trí |
8 | Cafe |
9 | Quân sự |
10 | Xây dựng |
11 | Nhiếp ảnh |
12 | Tài chính |
13 | Chứng khoán |
14 | Sức khỏe |
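
The comma-separated `tag_filter` value reaches the spider as a plain string. A minimal sketch of how such a string could be turned into a set of tag IDs is shown below. The helper name `parse_tag_filter` is hypothetical and is not taken from this repository; the actual spider may parse the value differently.

```python
def parse_tag_filter(raw):
    """Turn a comma-separated string such as "3,8" into a set of integer tag IDs.

    Hypothetical helper for illustration only; not part of the scraper itself.
    """
    if not raw:
        # No filter given: return an empty set, meaning "no restriction".
        return set()
    # Skip empty pieces (e.g. a trailing comma); int() tolerates surrounding spaces.
    return {int(part) for part in raw.split(",") if part.strip()}


if __name__ == "__main__":
    print(parse_tag_filter("3,8"))
```

With the table above, `"3,8"` would select the tags Tin tức and Cafe.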

`discussion_list_deep`: How many pages of the discussion list navigation to parse.

`discussion_deep`: How many pages to parse within each discussion.

`sort_order`: The order in which the discussion list is parsed. Possible values:
- latest (default)
- oldest
- newest
- top
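
If you launch the crawl from a script, the parameters above can be assembled into a command line programmatically. The following is a minimal sketch, assuming the four `-a` arguments are all optional as the usage line suggests. The function `build_crawl_command` is a hypothetical helper, not something this repository provides.

```python
# Sort orders listed in this README.
SORT_ORDERS = ("latest", "oldest", "newest", "top")


def build_crawl_command(tag_filter=None, discussion_list_deep=None,
                        discussion_deep=None, sort_order=None):
    """Build the argument list for `scrapy crawl fbtag` (hypothetical helper)."""
    if sort_order is not None and sort_order not in SORT_ORDERS:
        raise ValueError("sort_order must be one of: " + ", ".join(SORT_ORDERS))

    command = ["scrapy", "crawl", "fbtag"]
    options = {
        "tag_filter": tag_filter,
        "discussion_list_deep": discussion_list_deep,
        "discussion_deep": discussion_deep,
        "sort_order": sort_order,
    }
    # Only pass the arguments that were actually set; omitted ones are left
    # to whatever defaults the spider applies.
    for name, value in options.items():
        if value is not None:
            command += ["-a", "{}={}".format(name, value)]
    return command


if __name__ == "__main__":
    print(build_crawl_command(tag_filter="3,8", discussion_list_deep=5,
                              discussion_deep=2, sort_order="oldest"))
```

The returned list can be handed to `subprocess.run(command, check=True)` from inside the project directory.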