/news-crawler

News crawlers for some sites.

Primary language: Python

News crawler for VnExpress and TuoiTre.vn

Description

Crawl news from vnexpress.net and tuoitre.vn and rank them by total likes in their comments.

Functionality

  • VnExpress crawler: Crawls articles from VnExpress.net, ranks them by the likes on their comments and stores the results in the database.
  • TuoiTre crawler: Crawls articles from TuoiTre.vn, ranks them by the likes on their comments and stores the results in the database.
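
The ranking described above can be sketched as follows. This is a minimal illustration rather than the repository's actual code: the `Comment` and `Article` classes and the `rank_by_comment_likes` function are hypothetical names, and the sketch assumes every crawled comment carries a like count.

```python
from dataclasses import dataclass, field


@dataclass
class Comment:
    likes: int  # assumed field: number of likes on a single comment


@dataclass
class Article:
    title: str
    comments: list[Comment] = field(default_factory=list)

    @property
    def total_likes(self) -> int:
        # Sum the likes over all comments of this article.
        return sum(comment.likes for comment in self.comments)


def rank_by_comment_likes(articles: list[Article]) -> list[Article]:
    """Return articles sorted by total comment likes, highest first (hypothetical helper)."""
    return sorted(articles, key=lambda article: article.total_likes, reverse=True)


# Made-up sample data, for illustration only.
sample = [
    Article("A", [Comment(3), Comment(4)]),
    Article("B", [Comment(10)]),
    Article("C", []),
]
ranked = rank_by_comment_likes(sample)
```

With this sample data, `ranked` holds the articles in the order B, A, C, because their comment likes total 10, 7 and 0.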

Quick Start

Prerequisites

  • A running Postgres instance.
  • Python 3.10 or later.
    • pipenv to keep packages and the Python version consistent, plus asdf or pyenv to switch between Python versions.
    • (Alternatively) A Docker image with an appropriate Python version can be used instead.

Installation

  1. Clone and go to this repository:

    git clone git@github.com:pmphan/news-crawler.git
    cd news-crawler
  2. Prepare the Python environment:

    • With pipenv and asdf/pyenv. Note that pipenv might prompt you to install the appropriate Python version if it is not present on the system:
      pipenv install --python 3.10 --deploy --ignore-pipfile
    • (Or) Build the Python image with Docker:
      docker build -t crawler:latest .
      
  3. Set up Postgres database.

    • The docker-compose.yml file configures postgres and pgadmin by default. Run docker-compose up -d for a quick setup.
    • Create a .env file and populate it with the connection settings of your Postgres instance:
      # These defaults apply even if the variables are not set.
      POSTGRES_DB=postgres
      POSTGRES_USER=postgres
      POSTGRES_PASSWORD=postgres
      POSTGRES_HOST=localhost
      POSTGRES_PORT=5432
      # Or override all of the above with a single URI
      POSTGRES_URI=postgresql+asyncpg://postgres:postgres@localhost:5432/postgres
  4. Run the Scrapy crawler (days_ago=DATE sets the article publication time from which the crawler starts crawling):

    • With pipenv:
      pipenv run scrapy crawl (vnexpress|tuoitre) [-a days_ago=DATE] [--logfile FILE] [--loglevel LEVEL]
    • With prebuilt Docker image:
      docker run -t --env-file .env [--name container_name] [--network network_name] crawler (vnexpress|tuoitre) [-a days_ago=DATE] [--loglevel LEVEL] [&> LOGFILE]
      Notes on the network argument:
      • If .env uses POSTGRES_HOST=localhost, the network has to be host, unless Postgres runs in the same container.
      • If the Postgres instance is not reached via the loopback interface (e.g. POSTGRES_HOST is an address in 192.168/16 or a remote IP), setting the network argument is not necessary.
      • When the Postgres instance is set up with the given docker-compose.yml, the network can be set to news-crawler_default and .env can use POSTGRES_HOST=postgres. Docker Compose names its default bridge network foldername_default, so change the name if your folder name is different. The list of networks can be inspected with docker network ls.
  5. Connect to the Postgres instance to read the results, or use the script (Postgres credentials must be supplied in .env, otherwise the defaults are used):

    • With pipenv:
      pipenv run python read_result.py [-h] [-o OUTPUT] SITENAME
    • With Docker image:
      docker run -t --env-file .env [--name container_name] [--network network_name] --entrypoint python crawler read_result.py [-h] (vnexpress|tuoitre) [&> OUTPUTFILE]
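
The way the .env settings from step 3 could resolve into a connection URI is sketched below. This is an assumption about the behavior based only on the comments in the .env example (defaults apply when a variable is unset, and POSTGRES_URI overrides everything), not the project's actual settings code. The function name `resolve_postgres_uri` is hypothetical.

```python
import os
from urllib.parse import quote


def resolve_postgres_uri(env=None):
    """Build a connection URI from POSTGRES_* settings (hypothetical helper)."""
    env = os.environ if env is None else env

    # Assumed precedence: a full POSTGRES_URI overrides the individual settings.
    uri = env.get("POSTGRES_URI")
    if uri:
        return uri

    # Defaults mirror the values listed in the .env example.
    user = env.get("POSTGRES_USER", "postgres")
    password = env.get("POSTGRES_PASSWORD", "postgres")
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    db = env.get("POSTGRES_DB", "postgres")

    # Percent-encode credentials so characters such as "@" or ":" stay valid in the URI.
    return (
        f"postgresql+asyncpg://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{db}"
    )


# Example: settings for a crawler container attached to the Docker Compose network.
compose_uri = resolve_postgres_uri({"POSTGRES_HOST": "postgres"})
```

Passing a plain dictionary instead of reading `os.environ` keeps the sketch easy to try out with different combinations of settings.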