A Python crawler implementation for PTT with multiprocessing.
PTT is the biggest BBS site in Taiwan.
It is also a good place to gather information: the collected posts can be used for analysis such as text mining, topic modeling, and more.
PTT-Crawler is built with Python 3 and uses BeautifulSoup4, requests, and html.parser to gather posts from PTT, then stores those posts in JSON files.
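As a rough illustration of the gathering step, the sketch below pulls article links out of an index page using only the standard library's html.parser. It is not the crawler's actual code: the `ArticleLinkParser` class, the sample HTML, and the assumption that article links sit inside `<div class="title">` elements are all illustrative. The real crawler uses BeautifulSoup4 on pages fetched with requests.

```python
from html.parser import HTMLParser


class ArticleLinkParser(HTMLParser):
    """Collect (href, title) pairs for links inside <div class="title">."""

    def __init__(self):
        super().__init__()
        self.links = []          # finished (href, title) pairs
        self._in_title = False   # currently inside a <div class="title">?
        self._href = None        # href of the <a> being read, if any
        self._text = []          # text chunks of the current <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "title":
            self._in_title = True
        elif tag == "a" and self._in_title:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that belongs to a link we are tracking.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
        elif tag == "div":
            self._in_title = False


# Hypothetical index-page fragment; the markup shape is an assumption.
SAMPLE_INDEX_HTML = """
<div class="r-ent">
  <div class="title"><a href="/bbs/Gossiping/M.1.A.001.html">First post</a></div>
</div>
<div class="r-ent">
  <div class="title">(deleted post)</div>
</div>
<div class="r-ent">
  <div class="title"><a href="/bbs/Gossiping/M.2.A.002.html">Second post</a></div>
</div>
"""

parser = ArticleLinkParser()
parser.feed(SAMPLE_INDEX_HTML)
article_links = parser.links
```

Entries without a link, such as the deleted post in the sample, are skipped because no `<a>` tag is seen for them.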
Make sure you already have BeautifulSoup4 and requests installed. If not, you can use pip to install them:
pip install requests
pip install BeautifulSoup4
You need to decide which board and how many index pages you want to gather.
Run the command in a terminal:
python PTT_Crawler.py $BOARD $INDEX_NUM
For example:
python PTT_Crawler.py Gossiping 2
python PTT_Crawler.py Gossiping 2 -p no
- -p, --push: Set this argument to no and the crawler will not collect pushes (the comments under a post). Default is yes.
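The command line above could be parsed along the following lines. This is only a sketch with argparse: the `build_parser` function and the attribute names `board` and `index_num` are assumptions for illustration, and PTT_Crawler.py may handle its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser for: PTT_Crawler.py BOARD INDEX_NUM [-p yes|no]."""
    parser = argparse.ArgumentParser(description="Crawl posts from a PTT board.")
    parser.add_argument("board", help="board name, e.g. Gossiping")
    parser.add_argument("index_num", type=int,
                        help="number of index pages to gather")
    parser.add_argument("-p", "--push", choices=["yes", "no"], default="yes",
                        help="set to no to skip collecting pushes (default: yes)")
    return parser


# Equivalent to: python PTT_Crawler.py Gossiping 2 -p no
args = build_parser().parse_args(["Gossiping", "2", "-p", "no"])
collect_pushes = args.push == "yes"
```

Restricting `--push` to `yes` and `no` with `choices` makes a mistyped value fail early instead of silently falling back to the default.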
In the .json file, an article looks like:
article = {
    'Board': board,
    'Article_Title': title,
    'Article_ID': article_id,
    'Author': author,
    'Time': publish_time,
    'Push_num': push_count,
    'Bad_num': bad_count,
    'Arrow_num': arrow_count,
    'Content': content
}
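A record with this shape can be serialized with the standard json module. The sketch below uses made-up sample values, including a hypothetical article ID and author, and is not taken from the crawler itself. Passing `ensure_ascii=False` is a suggestion rather than something the source states: it keeps Chinese text readable in the output instead of escaping it as `\uXXXX` sequences.

```python
import json

# Hypothetical sample values; only the keys come from the format above.
article = {
    "Board": "Gossiping",
    "Article_Title": "[問卦] 範例標題",
    "Article_ID": "M.1500000000.A.ABC",
    "Author": "example_user",
    "Time": "Fri Jul 14 10:40:00 2017",
    "Push_num": 3,
    "Bad_num": 1,
    "Arrow_num": 2,
    "Content": "範例內文",
}

# Serialize to a JSON string, keeping non-ASCII characters as-is.
text = json.dumps(article, ensure_ascii=False, indent=2)

# Reading the string back yields an equal dictionary.
restored = json.loads(text)
```

To write a file instead of a string, `json.dump(article, f, ensure_ascii=False)` can be used with a file opened with `encoding="utf-8"`.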
And a push looks like:
push = {
    'Tag': push_tag,
    'User': push_user,
    'Time': push_time,
    'Content': push_content,
    'ID': article_id + '_' + str(push_id)
}
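The sketch below shows one way the push records and the three counters in the article record might be derived. Several things here are assumptions, not confirmed by the source: the helper names `build_pushes` and `tally_pushes`, numbering pushes from 0, and the mapping of the tags 推, 噓, and → to `Push_num`, `Bad_num`, and `Arrow_num`.

```python
from collections import Counter

# Assumed mapping from push tag to the counter field in the article record.
TAG_FIELDS = {"推": "Push_num", "噓": "Bad_num", "→": "Arrow_num"}


def build_pushes(article_id, raw_pushes):
    """Turn (tag, user, time, content) tuples into push dictionaries."""
    pushes = []
    # Numbering from 0 is an assumption; the crawler may start elsewhere.
    for push_id, (tag, user, time, content) in enumerate(raw_pushes):
        pushes.append({
            "Tag": tag,
            "User": user,
            "Time": time,
            "Content": content,
            "ID": article_id + "_" + str(push_id),
        })
    return pushes


def tally_pushes(pushes):
    """Count pushes per tag and return the three counter fields."""
    counts = Counter(push["Tag"] for push in pushes)
    return {field: counts.get(tag, 0) for tag, field in TAG_FIELDS.items()}


# Hypothetical sample data.
raw = [
    ("推", "alice", "07/14 10:41", "agree"),
    ("噓", "bob", "07/14 10:42", "disagree"),
    ("→", "carol", "07/14 10:43", "neutral remark"),
    ("推", "dave", "07/14 10:44", "nice"),
]
pushes = build_pushes("M.1500000000.A.ABC", raw)
totals = tally_pushes(pushes)
```

Because each push ID embeds the article ID, pushes stay traceable to their post even if they are stored in a separate file.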