Crawl news articles from vnexpress.net and tuoitre.vn and rank them by the total number of likes on their comments.
- VnExpress crawler: crawls articles from VnExpress.net, ranks them by comment likes, and stores the results in the database.
- TuoiTre crawler: crawls articles from TuoiTre.vn, ranks them by comment likes, and stores the results in the database.
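The ranking described above can be pictured with a minimal sketch. The `Article` record, its field names, and `rank_by_total_likes` are hypothetical and chosen only for illustration; they are not taken from the project's code or database schema.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    # Hypothetical record; field names are illustrative, not the project's schema.
    title: str
    comment_likes: list[int] = field(default_factory=list)

    @property
    def total_likes(self) -> int:
        # Sum of the likes over every comment on the article.
        return sum(self.comment_likes)


def rank_by_total_likes(articles: list[Article]) -> list[Article]:
    # Highest total first; sorted() is stable, so ties keep their input order.
    return sorted(articles, key=lambda a: a.total_likes, reverse=True)


articles = [
    Article("A", [3, 1]),
    Article("B", [10]),
    Article("C", []),
]
ranked = rank_by_total_likes(articles)
print([(a.title, a.total_likes) for a in ranked])
```

An article without comments simply gets a total of zero and sorts last.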
- A running Postgres instance.
- Python 3.10 or above.
- Clone and go to this repository:

      git clone git@github.com:pmphan/news-crawler.git
      cd news-crawler

- Prepare the Python environment:
  - With `pipenv` and `asdf`/`pyenv`. Note that `pipenv` might prompt to install the appropriate Python version if it is not present on the system:

        pipenv install --python 3.10 --deploy --ignore-pipfile

  - (Or) Build the Python image with Docker:

        docker build -t crawler:latest .

- Set up the Postgres database.
  - The `docker-compose.yml` file configures `postgres` and `pgadmin` by default. Run `docker-compose up -d` for a quick setup.
  - Create an `.env` file and populate it with the settings of an existing Postgres instance:

        # These are the default settings, used even if not set.
        POSTGRES_DB=postgres
        POSTGRES_USER=postgres
        POSTGRES_PASSWORD=postgres
        POSTGRES_HOST=localhost
        POSTGRES_PORT=5432
        # Or, overriding all of the above
        POSTGRES_URI=postgresql+asyncpg://postgres:postgres@localhost:5432/postgres

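As a rough illustration of how these settings relate, the sketch below assembles a connection URI from the individual variables and lets `POSTGRES_URI` take precedence when present. The function name `build_postgres_uri` is hypothetical, and the project's actual settings code may resolve these values differently; only the variable names, defaults, and URI format come from the `.env` example.

```python
import os
from urllib.parse import quote_plus


def build_postgres_uri(env=None) -> str:
    # Hypothetical helper: reads from os.environ unless a mapping is passed in.
    env = os.environ if env is None else env

    # A full URI, if supplied, overrides every individual setting.
    uri = env.get("POSTGRES_URI")
    if uri:
        return uri

    # Fall back to the documented defaults for anything that is not set.
    db = env.get("POSTGRES_DB", "postgres")
    user = env.get("POSTGRES_USER", "postgres")
    password = env.get("POSTGRES_PASSWORD", "postgres")
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")

    # Escape the password so characters such as '@' or '/' do not break the URI.
    return f"postgresql+asyncpg://{user}:{quote_plus(password)}@{host}:{port}/{db}"


print(build_postgres_uri({}))
print(build_postgres_uri({"POSTGRES_HOST": "postgres"}))
```

With an empty mapping this yields the same URI as the `POSTGRES_URI` line in the example above.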
- Run the Scrapy crawler (`days_ago=DATE` determines the article publication time from which the crawler will crawl):
  - With `pipenv`:

        pipenv run scrapy crawl [vnexpress|tuoitre] [-a days_ago=DATE] [--logfile FILE] [--loglevel LEVEL]

  - With the prebuilt Docker image:

        docker run -t --env-file .env [--name container_name] [--network network_name] crawler (vnexpress|tuoitre) [-a days_ago=DATE] [--loglevel LEVEL] [&> LOGFILE]

    Note on the `network` argument:
    - If `.env` uses `POSTGRES_HOST=localhost`, `network` has to be `host`, unless Postgres is configured in the same container.
    - If the Postgres instance is not reached via the loopback interface (e.g. `POSTGRES_HOST=192.168/16` or a remote IP), setting the `network` argument is not necessary.
    - When the Postgres instance is set up with the given `docker-compose.yml`, `network` can be set to `news-crawler_default` and `.env` can use `POSTGRES_HOST=postgres`. The default naming scheme of the bridge network created by Docker Compose is `foldername_default`, so change it if your folder name is different. The list of networks can be inspected with `docker network ls`.
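The three cases in the note on the `network` argument can be condensed into a small decision helper. This is only a sketch: `suggest_network` and `LOOPBACK_HOSTS` are hypothetical names, the set of loopback spellings is an assumption, and the case where Postgres runs in the same container as the crawler is not covered.

```python
from typing import Optional

# Assumed spellings of the loopback interface; extend as needed.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def suggest_network(postgres_host: str, folder_name: str = "news-crawler") -> Optional[str]:
    # Hypothetical helper returning a value for `docker run --network`, or None.
    if postgres_host in LOOPBACK_HOSTS:
        # The container must share the host's network stack to reach localhost.
        return "host"
    if postgres_host == "postgres":
        # Service name from docker-compose.yml: join the Compose bridge network,
        # which is named after the project folder by default.
        return f"{folder_name}_default"
    # Any other reachable address: no --network argument is needed.
    return None


for host in ("localhost", "postgres", "10.0.0.5"):
    print(host, "->", suggest_network(host))
```

If the repository was cloned into a differently named folder, pass that name as `folder_name`, or check the output of `docker network ls`.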
- Connect to the Postgres instance to read the results, or use the `read_result.py` script (Postgres credentials must be supplied in `.env` beforehand, otherwise the defaults are used):
  - With `pipenv`:

        pipenv run python read_result.py [-h] [-o OUTPUT] SITENAME

  - With the Docker image:

        docker run -t --env-file .env [--name container_name] [--network network_name] --entrypoint python crawler read_result.py [-h] (vnexpress|tuoitre) [&> OUTPUTFILE]