Python crawler for news

A Python crawler, built with Scrapy, that fetches real-time news from Taiwanese news websites.

TODO LIST

! Alexa has shut down; a replacement will be decided later.

自由時報
- [2022/12/30] Updated
東森新聞
- [2022/12/30] Updated
聯合新聞網
- [2022/12/30] Updated
今日新聞
- [2023/01/03] Updated
ettoday
- [2023/01/03] Updated
[NEW] 巴哈姆特電玩資訊站
- TODO
風傳媒
- TODO
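[Is the company still operating?] 蘋果新聞網
- [2022/12] Not yet checked
- Requires JavaScript
- Cannot use cookies or sessions
- Overall article format is nonstandard, e.g. the article timestamp
中時電子報
- [2023/01/03] Updated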
今周刊
- [2022/12] Not yet checked
- May require JavaScript
- Not real-time news
- Mostly business news
TVBS
- [2023/01/04] Updated
商業週刊
- [2022/12] Not yet checked
- Not real-time news
- Mostly business news
三立新聞網
- [2023/01/03] Updated
[NEW] 民視新聞
- [2022/12] Not yet checked
**通訊社
- [2023/01/04] Updated
關鍵評論網
- [2022/12] Not yet checked
- Not real-time news

1. Request the real-time news lists.
2. Request each news page from the step 1 list.
3. Parse the HTML and extract the target values (item.py):
   - url
   - article_from
   - article_type
   - title
   - publish_date
   - authors
   - tags
   - text
   - text_html
   - images
   - video
   - links
4. Save the items into the database (pipelines.py):
   - Uses Cassandra by default
   - [TODO][feature] Support MongoDB or MySQL
5. Done
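
The item and pipeline steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual item.py or pipelines.py: the field types and defaults are assumptions, `NewsItem` is written as a plain dataclass (which Scrapy 2.2+ accepts as an item type), and `SQLitePipeline` is a hypothetical stand-in that uses the standard-library `sqlite3` module in place of Cassandra so the example runs without extra services.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class NewsItem:
    """One scraped article; field names follow the list above."""
    url: str
    article_from: str                 # source site, e.g. "ettoday"
    article_type: str = ""
    title: str = ""
    publish_date: str = ""            # assumed to be stored as a string
    authors: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    text: str = ""
    text_html: str = ""
    images: List[str] = field(default_factory=list)
    video: List[str] = field(default_factory=list)
    links: List[str] = field(default_factory=list)


class SQLitePipeline:
    """Hypothetical pipeline: stores each item as JSON, keyed by URL."""

    def __init__(self, path=":memory:"):
        self.path = path
        self.conn = None

    def open_spider(self, spider):
        # Called by Scrapy once when the spider starts.
        self.conn = sqlite3.connect(self.path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS news (url TEXT PRIMARY KEY, data TEXT)"
        )

    def process_item(self, item, spider):
        # INSERT OR REPLACE means a re-crawled URL overwrites the old row.
        payload = json.dumps(asdict(item), ensure_ascii=False)
        self.conn.execute(
            "INSERT OR REPLACE INTO news (url, data) VALUES (?, ?)",
            (item.url, payload),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called by Scrapy once when the spider finishes.
        self.conn.close()
```

A pipeline like this would be enabled through Scrapy's `ITEM_PIPELINES` setting; keying rows by `url` is one simple way to avoid duplicates when the same real-time list is polled repeatedly.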

    pip install scrapy
    # or
    pip3 install scrapy

macOS

    brew install cassandra

python extension

    pip install cassandra-driver
    # or
    pip3 install cassandra-driver

start cassandra

    # start in the foreground
    cassandra -f

    # start in the background
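
Before running the crawler, it can help to confirm that Cassandra is actually accepting connections. The sketch below only checks that a TCP port is open, using the standard library; it assumes Cassandra is listening on localhost at its default CQL port 9042, and the helper name `is_port_open` is made up for this example.

```python
import socket

CASSANDRA_DEFAULT_PORT = 9042  # Cassandra's default native-transport (CQL) port


def is_port_open(host="127.0.0.1", port=CASSANDRA_DEFAULT_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling `is_port_open()` with no arguments checks the assumed default address. An open port does not guarantee the node has finished starting up, so treat a `True` result as a quick sanity check only.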

macOS

    brew install mysql

python extension

    pip install PyMySQL
    # or
    pip3 install PyMySQL

    ./run_spiders.sh
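
The contents of run_spiders.sh are not shown here. As a rough Python equivalent, assuming the script simply runs `scrapy crawl` for each spider in turn, something like the following could be used. The helper names are hypothetical, and `ettoday` is the only spider name confirmed elsewhere in this README.

```python
import subprocess


def build_crawl_command(spider_name):
    """Return the argument list for running one spider with the Scrapy CLI."""
    return ["scrapy", "crawl", spider_name]


def run_spiders(spider_names, runner=subprocess.run):
    """Run each spider sequentially and return {spider_name: exit_code}.

    `runner` defaults to subprocess.run; it is a parameter so the function
    can be exercised without Scrapy installed.
    """
    results = {}
    for name in spider_names:
        completed = runner(build_crawl_command(name))
        results[name] = completed.returncode
    return results
```

From the project root, `run_spiders(["ettoday"])` would launch that one spider and report its exit code; running spiders sequentially keeps the logs easy to follow, at the cost of a longer total run time.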

    docker build . -t crawler_news

To run the crawler without a database, modify docker/setting.py and rebuild the image.

    # run without a database (Linux shell command)
    docker run --rm -it -v `pwd`/tmp:/src/tmp -v `pwd`/log:/src/log crawler_news

To run a single crawler, modify the Dockerfile and rebuild the image.

    CMD ["/bin/bash"]
    # or specify a crawler
    CMD ["scrapy", "crawl", "ettoday"]

    # start
    docker-compose up -d

    # stop
    docker-compose down