lemonde-crawler

Browse articles from Le Monde's website and store them in a SQLite database.

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

🕷️ Le Monde crawler

⚠️ THIS PROJECT ISN'T MAINTAINED ANYMORE, PLEASE VISIT News Crawler, THE SUCCESSOR OF THIS PROJECT.

Le Monde is one of the best-known newspapers in France. It publishes thousands of articles on its website.

This project browses the most recent articles on the website and stores the following fields for each one in a SQLite database:

  • URL
  • Title
  • Description (short summary)
  • Article content
  • Author
  • Illustration (blob)
  • Date
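The fields above map naturally onto a single SQLite table. The following is a minimal sketch using Python's built-in sqlite3 module; the table name, column names, the save_article helper, and the sample values are assumptions made for illustration and are not taken from the project's source.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    url          TEXT PRIMARY KEY,  -- one row per article URL
    title        TEXT,
    description  TEXT,              -- short summary
    content      TEXT,              -- article body
    author       TEXT,
    illustration BLOB,              -- raw image bytes
    date         TEXT               -- e.g. an ISO 8601 string
)
"""


def save_article(conn, article):
    """Insert one article dict, replacing any existing row with the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO articles "
        "(url, title, description, content, author, illustration, date) "
        "VALUES (:url, :title, :description, :content, :author, :illustration, :date)",
        article,
    )
    conn.commit()


# Usage with an in-memory database and placeholder values.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
save_article(conn, {
    "url": "https://www.lemonde.fr/example-article",
    "title": "Example title",
    "description": "Example summary",
    "content": "Example body",
    "author": "Example author",
    "illustration": b"\x89PNG",
    "date": "2021-01-01",
})
stored = conn.execute("SELECT title, illustration FROM articles").fetchall()
```

Using the URL as the primary key is one possible design choice: it keeps each article unique without a separate identifier.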

Features:

  • Login cookies are persisted between runs
  • Article caching: only new articles are crawled
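One way article caching of this kind could work is to check each candidate URL against the database before crawling it. The sketch below assumes a hypothetical articles table keyed by url and a hypothetical filter_new_urls helper; the URLs are placeholders, and the project's actual caching logic may differ.

```python
import sqlite3


def filter_new_urls(conn, urls):
    """Return the URLs that are not stored yet, preserving their order."""
    new_urls = []
    for url in urls:
        row = conn.execute(
            "SELECT 1 FROM articles WHERE url = ?", (url,)
        ).fetchone()
        if row is None:  # not in the database, so it still needs crawling
            new_urls.append(url)
    return new_urls


# Usage with an in-memory database holding one already-crawled article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (url TEXT PRIMARY KEY)")
conn.execute(
    "INSERT INTO articles (url) VALUES (?)",
    ("https://www.lemonde.fr/example-a",),
)
to_crawl = filter_new_urls(
    conn,
    ["https://www.lemonde.fr/example-a", "https://www.lemonde.fr/example-b"],
)
```

Here only the second URL remains in to_crawl, since the first is already stored.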

This project uses Playwright.

⚠️ DISCLAIMER: This project is for educational purposes only! Do NOT use it for any other purpose. It was developed as a fun side project to practice my scraping skills.

Parameters

Name                            Type  Description
LEMONDE_EMAIL                   str   Your Le Monde email address
LEMONDE_PASSWORD                str   Your Le Monde password
START_LINK                      str   After login, start scraping articles from this page
RETRIEVE_RELATED_ARTICLE_LINKS  bool  Also crawl links in the current article that point to related articles
RETRIEVE_EACH_ARTICLE_LINKS     bool  Also crawl every article link found in the current article
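As an illustration of how parameters like these can be read from the environment, here is a minimal sketch. The Settings class, the parse_bool and load_settings helpers, the accepted boolean spellings, and the default values are all assumptions; the project may parse its configuration differently.

```python
import os
from dataclasses import dataclass


def parse_bool(value, default=False):
    """Interpret common truthy strings; the accepted spellings are an assumption."""
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass
class Settings:
    email: str
    password: str
    start_link: str
    retrieve_related_article_links: bool
    retrieve_each_article_links: bool


def load_settings(env=os.environ):
    """Build Settings from a mapping of environment variables."""
    return Settings(
        email=env["LEMONDE_EMAIL"],        # required: KeyError if missing
        password=env["LEMONDE_PASSWORD"],  # required: KeyError if missing
        # The default start page and boolean defaults are hypothetical.
        start_link=env.get("START_LINK", "https://www.lemonde.fr/"),
        retrieve_related_article_links=parse_bool(
            env.get("RETRIEVE_RELATED_ARTICLE_LINKS")
        ),
        retrieve_each_article_links=parse_bool(
            env.get("RETRIEVE_EACH_ARTICLE_LINKS")
        ),
    )


# Usage with an explicit mapping instead of the real environment.
settings = load_settings({
    "LEMONDE_EMAIL": "user@example.com",
    "LEMONDE_PASSWORD": "secret",
    "RETRIEVE_RELATED_ARTICLE_LINKS": "true",
})
```

Passing the mapping as an argument keeps the loader easy to exercise without touching the process environment.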

Usage (Docker)

  1. Copy .env.example to .env and fill in your credentials:

    cp .env.example .env

    Set LEMONDE_EMAIL and LEMONDE_PASSWORD to your Le Monde credentials (a premium account is recommended to avoid article limits).

  2. Run the container:

    docker-compose up

Usage (CLI)

You must have Python >= 3.7 and pip installed.

  1. Install the dependencies:

    pip3 install -r requirements.txt

  2. Run the CLI:

    LEMONDE_EMAIL='...' LEMONDE_PASSWORD='...' python3 ./scripts/crawler.py

Ideas

  • Consider using Prefect to schedule this crawling task to run every day