繁體中文 README.md (Traditional Chinese README.md)
Use requests, pyquery, pandas, and SQLite to build a crawler that crawls the PTT website, saves the crawled data to a SQLite database, and connects to LINE Notify for notifications.
## Table of Contents

- [About](#about)
- [Built With](#built-with)
- [Getting Started](#getting-started)
- [Usage](#usage)
- [LINE Notification](#line-notification)
- [License](#license)
- [Contact](#contact)
## About

PTT is one of the most widely used social media platforms in Taiwan. Because the volume of daily posts is too large to digest completely, a crawler lets us collect the data quickly. Storing the crawled data in a database also enables follow-up analysis, such as machine learning, deep learning, or public opinion analysis.
### Built With

- Python
- LINE Notify
- SQLite
- Pandas
- requests
- pyquery
## Getting Started

1. Clone the repo

   ```sh
   git clone https://github.com/DysonMa/PTT-Crawler.git
   ```
2. Edit `config.ini`

   - `boardlist`: the board names for PTT crawling
   - `deadline`: the deadline at which the crawler stops
   - `sqlite_path`: the path of the SQLite database that stores the crawled data
   - `token`: the LINE Notify service token

   First, create `config.ini` with the required parameters and save it in the same directory as `main.ipynb`. Below is a simple example:
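   The exact section and key layout is not shown in this README; the sketch below assumes hypothetical `[ptt]` and `[line]` sections, with values taken from the parameters printed in the Usage section:

   ```ini
   ; hypothetical layout; adjust the section/key names to the repo's actual parser
   [ptt]
   boardlist = Civil,Soft_Job,NBA
   deadline = 2020-12-19 00:00:00
   updatePageNum = 1
   sqlite_path = D:\ptt_test.db

   [line]
   token = YOUR_LINE_NOTIFY_TOKEN
   ```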
## Usage

- Import the ptt package

  ```python
  from ptt.crawler import *
  from ptt.schedule import *
  ```
- Check the parameters

  ```python
  print('config_path:', config_path)
  print('deadline:', deadline)
  print('boardlist:', boardlist)
  print('updatePageNum:', updatePageNum)
  print('sqlite_path:', sqlite_path)
  ```

  ```
  config_path: config.ini
  deadline: 2020-12-19 00:00:00
  boardlist: ['Civil', 'Soft_Job', 'NBA']
  updatePageNum: 1
  sqlite_path: D:\ptt_test.db
  ```
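  These module-level variables presumably come from `config.ini`; a minimal sketch of how they could be loaded with the standard-library `configparser` (the `[ptt]`/`[line]` sections follow the hypothetical layout above, not necessarily the repo's actual code):

  ```python
  import configparser

  cfg = configparser.ConfigParser()
  cfg.read('config.ini')

  # Hypothetical sections; adjust to the real config layout.
  boardlist = cfg['ptt']['boardlist'].split(',')
  deadline = cfg['ptt']['deadline']
  updatePageNum = cfg['ptt'].getint('updatePageNum')
  sqlite_path = cfg['ptt']['sqlite_path']
  token = cfg['line']['token']
  ```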
- Build the `website` object from a specific board name

  ```python
  website = get_index('civil')
  print(get_weburl(website))
  ```

  ```
  https://www.ptt.cc//bbs/civil/index.html
  ```
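  For reference, the Built With list names requests and pyquery; the sketch below shows how a board's index page could be fetched and parsed by hand. It is an illustration only, not the repo's actual implementation; the `over18` cookie passes PTT's age check on restricted boards:

  ```python
  import requests
  from pyquery import PyQuery as pq

  # Fetch the board's index page; PTT serves age-restricted boards
  # only when the over18 cookie is set.
  resp = requests.get('https://www.ptt.cc/bbs/civil/index.html',
                      cookies={'over18': '1'})
  doc = pq(resp.text)

  # Each article row is a div.r-ent; deleted posts have no <a> tag.
  for entry in doc('div.r-ent').items():
      link = entry('div.title a')
      if link:
          print(link.text(), 'https://www.ptt.cc' + link.attr('href'))
  ```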
- Crawl the PTT website by page

  ```python
  df = CrawlingByPage(website, page=2, save=True, update=True)
  ```
- Crawl the PTT website by date

  ```python
  df = CrawlingByDate(website, deadline, save=True, update=True)
  ```
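  Both calls return a pandas DataFrame; with `save=True` the rows are also written to the SQLite database at `sqlite_path`. A hedged sketch of reading the stored data back for later analysis (the per-board table name is an assumption, so list the actual tables first):

  ```python
  import sqlite3
  import pandas as pd

  conn = sqlite3.connect(r'D:\ptt_test.db')   # the sqlite_path from config.ini

  # Inspect which tables the crawler actually created.
  print(pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", conn))

  # Hypothetical per-board table name 'Civil'; adjust to the real schema.
  df = pd.read_sql('SELECT * FROM Civil', conn)
  print(df.head())
  conn.close()
  ```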
- Regularly crawl the PTT website on a schedule

  ```python
  schedule()
  ```
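  The internals of `schedule()` are not shown in this README; as a rough sketch, periodic crawling can be built from the functions above with a plain loop (the one-hour interval and helper name are assumptions):

  ```python
  import time

  def crawl_periodically(interval_seconds=3600):
      """Sketch: re-crawl every configured board once per interval."""
      while True:
          for board in boardlist:   # the boards listed in config.ini
              site = get_index(board)
              CrawlingByPage(site, page=updatePageNum, save=True, update=True)
          time.sleep(interval_seconds)
  ```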
## LINE Notification
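The `token` from `config.ini` authenticates against the LINE Notify HTTP API. A minimal sketch of pushing a message once crawling finishes (the helper name and message text are illustrative):

```python
import requests

def send_line_notify(token, message):
    """Push a text message through the LINE Notify API."""
    resp = requests.post(
        'https://notify-api.line.me/api/notify',
        headers={'Authorization': f'Bearer {token}'},
        data={'message': message},
    )
    return resp.status_code   # 200 on success

send_line_notify(token, 'PTT crawling finished!')
```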
## License

Distributed under the MIT License.
## Contact

Dyson Ma - Gmail

Project Link: https://github.com/DysonMa/PTT-Crawler