繁體中文 README.md (Traditional Chinese README.md)
Use requests, pyquery, pandas, and SQLite to build a crawler that crawls the PTT website, saves the crawled data to a SQLite database, and connects to LINE Notify for notifications.
## Table of Contents

- [About](#about)
- [Built With](#built-with)
- [Getting Started](#getting-started)
- [Usage](#usage)
- [LINE Notification](#line-notification)
- [License](#license)
- [Contact](#contact)
## About

PTT is one of the most widely used social media platforms in Taiwan. Because the volume of daily posts is too large to digest completely, a crawler lets us collect the data quickly. Storing the crawled data in a database also enables follow-up analysis, such as machine learning, deep learning, or public opinion analysis.
### Built With

- Python
- LINE Notify
- SQLite
- Pandas
- requests
- pyquery
## Getting Started

1. Clone the repo

   ```sh
   git clone https://github.com/DysonMa/PTT-Crawler.git
   ```
2. Edit `config.ini`

   - `boardlist`: the board names for PTT crawling
   - `deadline`: the deadline at which the crawler stops
   - `sqlite_path`: the path of the SQLite database that stores the crawled data
   - `token`: the LINE Notify service token

   First, create `config.ini` with the required parameters and save it in the same directory as `main.ipynb`. Below is a simple example:
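   The exact section and key layout is not shown in this README; the sketch below assumes hypothetical `[ptt]` and `[line]` sections, with values taken from the parameters printed in the Usage section:

   ```ini
   ; hypothetical layout; adjust the section/key names to the repo's actual parser
   [ptt]
   boardlist = Civil,Soft_Job,NBA
   deadline = 2020-12-19 00:00:00
   updatePageNum = 1
   sqlite_path = D:\ptt_test.db

   [line]
   token = YOUR_LINE_NOTIFY_TOKEN
   ```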
## Usage

- Import the ptt package

  ```python
  from ptt.crawler import *
  from ptt.schedule import *
  ```
- Check the parameters

  ```python
  print('config_path:', config_path)
  print('deadline:', deadline)
  print('boardlist:', boardlist)
  print('updatePageNum:', updatePageNum)
  print('sqlite_path:', sqlite_path)
  ```

  ```
  config_path: config.ini
  deadline: 2020-12-19 00:00:00
  boardlist: ['Civil', 'Soft_Job', 'NBA']
  updatePageNum: 1
  sqlite_path: D:\ptt_test.db
  ```
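  These module-level variables presumably come from `config.ini`; a minimal sketch of how they could be loaded with the standard-library `configparser` (the `[ptt]`/`[line]` sections follow the hypothetical layout above, not necessarily the repo's actual code):

  ```python
  import configparser

  cfg = configparser.ConfigParser()
  cfg.read('config.ini')

  # Hypothetical sections; adjust to the real config layout.
  boardlist = cfg['ptt']['boardlist'].split(',')
  deadline = cfg['ptt']['deadline']
  updatePageNum = cfg['ptt'].getint('updatePageNum')
  sqlite_path = cfg['ptt']['sqlite_path']
  token = cfg['line']['token']
  ```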
- Build the `website` object from a specific board name

  ```python
  website = get_index('civil')
  print(get_weburl(website))
  ```

  ```
  https://www.ptt.cc//bbs/civil/index.html
  ```
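  For reference, the Built With list names requests and pyquery; the sketch below shows how a board's index page could be fetched and parsed by hand. It is an illustration only, not the repo's actual implementation; the `over18` cookie passes PTT's age check on restricted boards:

  ```python
  import requests
  from pyquery import PyQuery as pq

  # Fetch the board's index page; PTT serves age-restricted boards
  # only when the over18 cookie is set.
  resp = requests.get('https://www.ptt.cc/bbs/civil/index.html',
                      cookies={'over18': '1'})
  doc = pq(resp.text)

  # Each article row is a div.r-ent; deleted posts have no <a> tag.
  for entry in doc('div.r-ent').items():
      link = entry('div.title a')
      if link:
          print(link.text(), 'https://www.ptt.cc' + link.attr('href'))
  ```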
- Crawl the PTT website by page

  ```python
  df = CrawlingByPage(website, page=2, save=True, update=True)
  ```
- Crawl the PTT website by date

  ```python
  df = CrawlingByDate(website, deadline, save=True, update=True)
  ```
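  Both calls return a pandas DataFrame; with `save=True` the rows are also written to the SQLite database at `sqlite_path`. A hedged sketch of reading the stored data back for later analysis (the per-board table name is an assumption, so list the actual tables first):

  ```python
  import sqlite3
  import pandas as pd

  conn = sqlite3.connect(r'D:\ptt_test.db')   # the sqlite_path from config.ini

  # Inspect which tables the crawler actually created.
  print(pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", conn))

  # Hypothetical per-board table name 'Civil'; adjust to the real schema.
  df = pd.read_sql('SELECT * FROM Civil', conn)
  print(df.head())
  conn.close()
  ```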
- Regularly crawl the PTT website on a schedule

  ```python
  schedule()
  ```
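  The internals of `schedule()` are not shown in this README; as a rough sketch, periodic crawling can be built from the functions above with a plain loop (the one-hour interval and helper name are assumptions):

  ```python
  import time

  def crawl_periodically(interval_seconds=3600):
      """Sketch: re-crawl every configured board once per interval."""
      while True:
          for board in boardlist:   # the boards listed in config.ini
              site = get_index(board)
              CrawlingByPage(site, page=updatePageNum, save=True, update=True)
          time.sleep(interval_seconds)
  ```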
## LINE Notification
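The `token` from `config.ini` authenticates against the LINE Notify HTTP API. A minimal sketch of pushing a message once crawling finishes (the helper name and message text are illustrative):

```python
import requests

def send_line_notify(token, message):
    """Push a text message through the LINE Notify API."""
    resp = requests.post(
        'https://notify-api.line.me/api/notify',
        headers={'Authorization': f'Bearer {token}'},
        data={'message': message},
    )
    return resp.status_code   # 200 on success

send_line_notify(token, 'PTT crawling finished!')
```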
## License

Distributed under the MIT License.
## Contact

Dyson Ma - Gmail

Project Link: https://github.com/DysonMa/PTT-Crawler