Tieba Crawler
20180928 - Started translating this README from Chinese to English with help from Google Translate. Posts are sometimes referred to as "floors": the 1st floor is the thread starter's post, the 2nd floor is the first reply, and so on.
Developed and tested with:
- Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-133-generic x86_64)
- Python 3.5.2
- mysql Ver 14.14 Distrib 5.7.23
- lxml==4.2.1, beautifulsoup4==4.6.0, Twisted==17.9.0, Scrapy==1.5.0, PyMySQL==0.8.0
See full list in requirements.txt
You will need Python 3 and a MySQL database.
Run the following command in a terminal to install the required packages. Make sure pip points to your Python 3 installation:
$ pip install -r requirements.txt
First open the config.json file and configure the host, username, and password of the database. Then run the command directly:
scrapy run <tieba forum name> <database name> [options]
The tieba forum name should not include the trailing word "吧", and the database name is the name of the database the results will be stored in. The database is created before crawling starts. E.g.:
scrapy run 仙五前修改 Pal5Q_Diy
If you have configured a forum name and its corresponding database name in config.json, you can omit the database name on the command line. If you also omit the forum name, the crawler falls back to the DEFAULT database configured in config.json.
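For reference, a minimal config.json sketch; the key names below are assumptions inferred from the description above, not the project's documented schema:

```json
{
    "host": "localhost",
    "user": "root",
    "password": "your_password",
    "DEFAULT": "tieba_default",
    "仙五前修改": "Pal5Q_Diy"
}
```

Check the config.json shipped with the repository for the actual keys it expects.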
Special Reminder: once the SSH connection is lost, the task cannot continue. When starting a task over SSH, make sure the connection stays open, or consider running it as a background task or inside a screen session.
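For example, with screen (the session name tieba is arbitrary):

```
$ screen -S tieba
$ scrapy run 仙五前修改 Pal5Q_Diy
```

Detach with Ctrl-A D and reattach later with screen -r tieba.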
Short form | Long form | Number of arguments | Function | Example |
---|---|---|---|---|
-p | --pages | 2 | Set the start and end pages of the crawl | scrapy run ... -p 2 5 |
-g | --good_only | 0 | Only crawl featured ("good") threads | scrapy run ... -g |
-s | --see_lz | 0 | "Only view the thread starter" mode, i.e. skip floors not written by the thread starter | scrapy run ... -s |
-f | --filter | 1 | Set the name of the post filter function (see filter.py) | scrapy run ... -f thread_filter |
Example:
scrapy run 仙剑五外传 -gs -p 5 12 -f thread_filter
This crawls the featured threads of the 仙剑五外传 forum, pages 5 through 12, in "only view the thread starter" mode; only threads that pass the thread_filter function in filter.py have their contents stored in the database.
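The interface filter.py expects is not documented here; as an assumption, a filter might receive a scraped thread and return whether to keep it, roughly like this:

```python
# filter.py -- hypothetical sketch: the function signature and field names
# are assumptions, not the project's documented interface.
def thread_filter(thread):
    """Keep threads whose title contains a keyword and that have replies."""
    return "攻略" in thread["title"] and thread["reply_num"] > 0
```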
The crawled data is not stored exactly as scraped; some processing is performed first:
- Advertisement posts are removed (posts with the word "advertising" in the lower right corner).
- Bold and red text is flattened to plain text via BeautifulSoup's get_text function (see the sketch after this list).
- Common emoticons are converted to text equivalents (see emotion.json; additions are welcome).
- Pictures and videos become their corresponding links (getting a video link requires following a 302 response).
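For example, BeautifulSoup's get_text flattens styled markup like this (the HTML fragment is made up for illustration):

```python
from bs4 import BeautifulSoup

# A made-up post fragment: bold and colored spans are stripped, text kept.
html = '<div><b>bold</b> and <span class="red">red</span> text</div>'
print(BeautifulSoup(html, "lxml").get_text())  # -> "bold and red text"
```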
- thread
Basic information about each thread.
Column | Types | Remarks |
---|---|---|
thread_id | BIGINT(12) | "http://tieba.baidu.com/p/4778655068" has the ID of 4778655068 |
title | VARCHAR(100) | |
author | VARCHAR(30) | |
reply_num | INT(4) | Number of replies |
good | BOOL | Whether the thread is a featured ("good") thread |
last_seen | DATETIME | For continuous scraping, the date when this row was last scraped |
times_seen | INT | For continuous scraping, a counter of how many times this row has been scraped |
- post
Some basic information for each post/floor, including the first floor.
Column | Types | Remarks |
---|---|---|
post_id | BIGINT(12) | Each floor has its own corresponding ID |
floor | INT(4) | Floor number |
author | VARCHAR(30) | |
content | TEXT | Floor content |
time | DATETIME | Published time |
comment_num | INT(4) | Number of comments (楼中楼) under this floor |
thread_id | BIGINT(12) | ID of the thread the floor belongs to; foreign key |
user_id | BIGINT(20) | |
last_seen | DATETIME | For continuous scraping, the date when this row was last scraped |
times_seen | INT | For continuous scraping, a counter of how many times this row has been scraped |
- comment
Information about comments within a floor (楼中楼, "floor within a floor").
Column | Types | Remarks |
---|---|---|
comment_id | BIGINT(12) | Each comment has an ID, shared with the floor ID space |
author | VARCHAR(30) | |
content | TEXT | Comment content |
time | DATETIME | Published time |
post_id | BIGINT(12) | ID of the floor the comment belongs to; foreign key |
user_id | BIGINT(20) | |
last_seen | DATETIME | For continuous scraping, the date when this row was last scraped |
times_seen | INT | For continuous scraping, a counter of how many times this row has been scraped |
Because of the crawl order, a comment may be scraped before its corresponding post, which would violate the foreign key constraint. Foreign key checks are therefore disabled in the database at the start of each task.
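In MySQL this amounts to SET FOREIGN_KEY_CHECKS = 0; presumably the crawler issues something like the following through PyMySQL (the connection parameters are placeholders):

```python
import pymysql

# Placeholder credentials; the real values come from config.json.
conn = pymysql.connect(host="localhost", user="root", password="...", db="Pal5Q_Diy")
with conn.cursor() as cur:
    # Allow comments to be inserted before their parent posts exist.
    cur.execute("SET FOREIGN_KEY_CHECKS = 0")
```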
- image
Links to images in posts.
Column | Types | Remarks |
---|---|---|
image_id | VARCHAR(30) | |
post_id | BIGINT(20) | |
url | TEXT | |
- user
Information about users.
Column | Types | Remarks |
---|---|---|
user_id | BIGINT(20) | |
username | VARCHAR(125) | |
sex | VARCHAR(6) | |
years_registered | FLOAT | Number of years registered on Tieba |
posts_num | INT(11) | Number of posts on Tieba |
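Once a crawl finishes, the tables can be queried like any other MySQL data. A small PyMySQL sketch, with placeholder credentials and the example thread ID from above:

```python
import pymysql

conn = pymysql.connect(host="localhost", user="root", password="...",
                       db="Pal5Q_Diy", charset="utf8mb4")
with conn.cursor() as cur:
    # All floors of one thread, oldest first.
    cur.execute(
        "SELECT floor, author, content FROM post "
        "WHERE thread_id = %s ORDER BY floor",
        (4778655068,),
    )
    for floor, author, content in cur.fetchall():
        print(floor, author, content[:40])
conn.close()
```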
Crawl time depends on server bandwidth and on when the crawl is run. Below are the times my Alibaba Cloud server took to crawl several forums, for reference only.
Forum name | Threads | Floors (replies) | Comments (楼中楼) | Time (seconds) |
---|---|---|---|---|
pandakill | 3638 | 41221 | 50206 | 222.2 |
lyingman | 11290 | 122662 | 126670 | 718.9 |
仙剑五外传 | 67356 | 1262705 | 807435 | 7188 |
The following forums were crawled concurrently:
Forum name | Threads | Floors (replies) | Comments (楼中楼) | Time (seconds) |
---|---|---|---|---|
仙五前修改 | 530 | 3518 | 7045 | 79.02 |
仙剑3高难度 | 2080 | 21293 | 16185 | 274.6 |
古剑高难度 | 1703 | 26086 | 32941 | 254.0 |
Special Reminder: keep an eye on the disk space taken up by crawled data; do not fill the disk.
This project also contains a crawler for the www.pantip.com forum, built in the same fashion as the Tieba spider and using the same database configuration. Run it with:
scrapy run_pantip <pantip keyword> <database>