🕷️ Le Monde crawler

⚠️ THIS PROJECT ISN'T MAINTAINED ANYMORE, PLEASE VISIT News Crawler, THE SUCCESSOR OF THIS PROJECT.

Le Monde is one of the best-known newspapers in France. It publishes thousands of articles on its website.

This project crawls the most recent articles from the website and stores the following fields in a SQLite database:

URL
Title
Description (short summary)
Article content
Author
Illustration (blob)
Date
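
To make the stored fields concrete, here is a minimal sketch of what such a SQLite table could look like, using only Python's standard `sqlite3` module. The table name `articles`, the column names, and the helper functions `open_db` and `save_article` are assumptions made for illustration; the project's real schema may differ.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions,
# chosen to mirror the fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    url          TEXT PRIMARY KEY,
    title        TEXT,
    description  TEXT,
    content      TEXT,
    author       TEXT,
    illustration BLOB,
    date         TEXT
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_article(conn, article):
    """Insert one article given as a dict keyed by column name."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO articles "
            "(url, title, description, content, author, illustration, date) "
            "VALUES (:url, :title, :description, :content, :author, "
            ":illustration, :date)",
            article,
        )
```

Using the URL as the primary key is one possible design choice: it makes re-saving the same article overwrite the previous row instead of duplicating it.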

Features:

Persisting login cookies
Article caching: only new articles are crawled
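
The article caching feature can be illustrated with a short sketch: before visiting a link, check whether its URL is already in the database and skip it if so. This is not the project's actual code; the `articles` table and the functions `init_cache`, `is_cached`, and `filter_new` are hypothetical names used only for this example.

```python
import sqlite3


def init_cache(conn):
    """Create a minimal table keyed by URL (hypothetical layout)."""
    conn.execute("CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY)")


def is_cached(conn, url):
    """Return True if the URL has already been stored."""
    row = conn.execute(
        "SELECT 1 FROM articles WHERE url = ?", (url,)
    ).fetchone()
    return row is not None


def filter_new(conn, urls):
    """Keep only URLs that are not stored yet, dropping duplicates."""
    seen = set()
    new_urls = []
    for url in urls:
        if url in seen or is_cached(conn, url):
            continue
        seen.add(url)
        new_urls.append(url)
    return new_urls
```

A crawler loop would then call `filter_new` on the links found on a page and only open the ones it returns.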

This project uses Playwright.

⚠️ DISCLAIMER: This project is for educational purposes only! Do NOT use it for any other purpose. It was developed as a fun side project to practice my scraping skills.

Parameters

Name	Type	Description
LEMONDE_EMAIL	str	Your Le Monde email address
LEMONDE_PASSWORD	str	Your Le Monde password
START_LINK	str	After login, start scraping articles from this page
RETRIEVE_RELATED_ARTICLE_LINKS	bool	Also crawl the links to related articles found in the article being scraped
RETRIEVE_EACH_ARTICLE_LINKS	bool	Also crawl every article link found in the article being scraped
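
As an illustration of how these parameters might be read from the environment, here is a minimal sketch. The `Settings` class, the `env_bool` and `load_settings` helpers, the set of accepted truthy strings, and the default values are all assumptions; the project itself may parse its parameters differently.

```python
import os
from dataclasses import dataclass

# Assumption: these strings (case-insensitive) count as "true".
TRUTHY = {"1", "true", "yes", "on"}


def env_bool(environ, name, default=False):
    """Read a boolean flag from a mapping of environment variables."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


@dataclass
class Settings:
    email: str
    password: str
    start_link: str
    retrieve_related_article_links: bool
    retrieve_each_article_links: bool


def load_settings(environ=None):
    """Build Settings from os.environ (or any dict passed in)."""
    if environ is None:
        environ = os.environ
    return Settings(
        email=environ["LEMONDE_EMAIL"],        # required
        password=environ["LEMONDE_PASSWORD"],  # required
        # Assumed default; the project's real default is not documented here.
        start_link=environ.get("START_LINK", "https://www.lemonde.fr/"),
        retrieve_related_article_links=env_bool(
            environ, "RETRIEVE_RELATED_ARTICLE_LINKS"
        ),
        retrieve_each_article_links=env_bool(
            environ, "RETRIEVE_EACH_ARTICLE_LINKS"
        ),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to try out without touching the real environment.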

Usage (Docker)

Copy .env.example to .env:
```
cp .env.example .env
```
Edit LEMONDE_EMAIL and LEMONDE_PASSWORD in .env to match your Le Monde credentials (we recommend a premium account to avoid any limits).
Run the container:
```
docker-compose up
```

Usage (CLI)

You must have Python 3.7 or later and pip installed.

Install dependencies
```
pip3 install -r requirements.txt
```

Run CLI

LEMONDE_EMAIL='...' LEMONDE_PASSWORD='...' python3 ./scripts/crawler.py

Ideas

You might be interested in Prefect to run this crawling task automatically every day.
